tvgeetha-pres-TIC201..

Work at TACOLA Lab
Team Members
T.V.Geetha, Ranjani Parthasarathi, Madhan Karky,
E.UmaMaheswari, J.Balaji, Subalalitha, Elanchezhiyan.K,
Karthika, Thenmalar, Radhakrishnan, Kandasamy,
Padmavathi, Aruna, Vijayavani
Tamil Language Processing
 Tamil Language Processing
 Morphological analyser
Normal Words, Compound
Words, Colloquial Words
 Parser
Simple, Complex and
Compound Sentences
 Semantic analysis based on
UNL
 Language Technology
 Blog Mining
 Ontology Based Information
Extraction
 Personalized Search
 Parallelization for NLP
Processing
 Emotion detection from text
 Carnatic Music Processing
 Raga Modelling
 Singer, Genre Identification
 Music Emotion Recognition
 Tamil Language Oriented Tools
 Dictionary
 Text Compaction
 UNL Based Work
 UNL for semantic
representation
 Nested UNL
 Concept based Search
 Bi-lingual Search
 Event Processing
 Discourse Analysis
 Summarization
 Question answering
 Thirukural Search
 Lyric Oriented Processing
 Lyric Mining
 Lyrics for Tunes
 Pleasantness
Dr.T.V.Geetha, Anna University
Papers for TIC 2011
Tamil Language Oriented Tools
 Agaraadhi: A Novel Online Dictionary Framework
 An Efficient Tamil Text Compaction System. (Surukkupai)
 Kuralagam, A Concept Relation Based Search Framework for
Thirukural.
 Popularity Based Scoring Model for Tamil Word Games
Tamil Language Processing
 Template based Multilingual Summary Generation.
 On Emotion detection from Tamil Text.
 Tamil Summary Generation for Cricket Match.
Lyric Oriented Processing
 Lyric Mining : Word, Rhyme & Concept Co-occurrence Analysis.
 Special Indices for LaaLaLaa Lyric Analysis & Generation Framework.
AGARAADHI
A NOVEL ONLINE DICTIONARY
FRAMEWORK
Elanchezhiyan.K
Karthikeyan.S
T.V.Geetha
Ranjani Parthasarathi
Madhan Karky
OBJECTIVES
Agaraadhi is a dictionary framework for
indexing and retrieving Tamil words along with their
meanings, analysis and related information.
The framework incorporates several unique
features designed to give users additional
information about the word they query.
INTRODUCTION

The Agaraadhi dictionary has more than 3 lakh (300,000) words in various
domains such as
• General,
• Literature,
• Medical,
• Engineering,
• Computer Science,
• Bird Names, and more…

Agaraadhi is a Tamil-English bilingual dictionary.
INTRODUCTION CONT…

Agaraadhi is a Tamil-English bilingual dictionary with 20
features, such as
• morphological analysis,
• morphological generation,
• word usage statistics,
• word pleasantness analysis,
• spell checking,
• similar word finder,
• word usage in literature,
• picture dictionary,
• number to text conversion,
• phonetic transliteration,
• live usage analysis from micro blogs and more…
AGARAADHI FRAMEWORK CONT…
AGARAADHI FEATURES
Morphological Analyser
 gives the morphological features of the query word, such as root
word, part of speech, gender, tense and count.
 For the query word padithaan, the Morphological Analyser returns padi as the
root and reports that the word is masculine, in the past tense, and so on.
Morphological Generator
Tamil morphological generator tackles different syntactic
categories such as nouns, verbs, post positions, adjectives,
adverbs.
 The generator is used to generate possible morphological variations
of the query word.
Spell Checker
 checks the spelling of Tamil words and provides alternative
suggestions for wrongly spelt words.
 If the root word is not in the dictionary, it generates all possible suggestions
that differ minimally from the given word.
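The slides do not say how "minimum variations" is measured. A minimal sketch, assuming plain Levenshtein edit distance over transliterated words and a tiny hypothetical lexicon (`LEXICON`, `suggest` and `max_distance` are illustrative names, not part of Agaraadhi):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word: str, lexicon: list[str], max_distance: int = 1) -> list[str]:
    """Return lexicon entries within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    return [entry for distance, entry in scored if distance <= max_distance]


# Hypothetical transliterated lexicon, for illustration only.
LEXICON = ["padi", "padam", "paadal", "mazhai"]
print(suggest("padu", LEXICON))
```

A real Tamil spell checker would more likely compare Tamil letters (consonant plus vowel sign) rather than Latin characters, and would consult the morphological analyser first, as the slide describes.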
AGARAADHI FEATURES
Word Suggestions
 gives the list of equivalent or related words for the given query
word.
Word Pleasantness
 The pleasantness score indicates how easy the word is to pronounce.
Word Popularity Score
 shows the word usage in the web based on frequency distribution
of the word across the popular blogs, news articles, social nets etc.
Word Usage Statistics
 shows the usage of the word in the social network over the past one
week.
Word Usage in Literature
 finds the usage of words in popular literature such as Thirukural,
Bharathiyar Padalgal, Avvai songs and also Lyrics of Tamil Movie
songs.
AGARAADHI FEATURES
Word of the Day
 A rare word is randomly chosen and is displayed in the opening
page to facilitate users to learn a new word every day.
Number to Text Converter
 converts a number to its Tamil word equivalent as well as to English
text. For example, in Tamil we say oru Arpputham (அற்புதம்)
for 100 million, Kumbam (கும்பம்) for 10 billion, and so on up to
Anniyan (அந்நியம்) for one zilli
Picture Dictionary
 Pictures, photos or line drawings to depict popular words have been
included in the dictionary to enable efficient learning for children
using this tool.
RESULTS
Query word: pookkal (பூக்கள்)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AA%E0%AF%82%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%B3%E0%AF%8D+&ln=ta&Submit.x=8&Submit.y=7
Query word: mazhai (மழை)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AE%E0%AE%B4%E0%AF%88+&ln=ta&Submit.x=21&Submit.y=4
Query word: fruit
http://www.agaraadhi.com/dict/OD.jsp?w=fruit&ln=en
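The long `%E0%AE…` runs in these URLs are simply the Tamil query word percent-encoded as UTF-8. A small sketch of how such a query URL can be built with the Python standard library; the parameter names `w` and `ln` are taken from the URLs above, while `build_query_url` is an illustrative helper and the endpoint may no longer behave as it did in 2011:

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.agaraadhi.com/dict/OD.jsp"


def build_query_url(word: str, lang: str) -> str:
    """Build a dictionary lookup URL; urlencode percent-encodes as UTF-8."""
    return BASE_URL + "?" + urlencode({"w": word, "ln": lang})


# pookkal (பூக்கள்), written with escapes so the code points are explicit.
POOKKAL = "\u0baa\u0bc2\u0b95\u0bcd\u0b95\u0bb3\u0bcd"

print(quote(POOKKAL))                  # the percent-encoded Tamil word
print(build_query_url(POOKKAL, "ta"))  # Tamil lookup
print(build_query_url("fruit", "en"))  # English lookup
```

The `Submit.x` and `Submit.y` parameters in the original URLs look like click coordinates of an image submit button and are left out of the sketch.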
FUTURE WORK
Providing APIs for programmers and
developing mobile apps for the Agaraadhi
framework will offer a good platform for the many
researchers and developers working in the area of
Tamil computing.
REFERENCE
1.Anandan, R. Parthasarathi, and Geetha,
Morphological Analyser for Tamil. ICON
2002, 2002.
2.Anandan, R. Parthasarathi, and Geetha,
Morphological Generator for Tamil. Tamil
Inayam, Malaysia, 2001.
3.J. Jai Hari Raju, P. IndhuReka, Dr. Madhan
Karky, Statistical Analysis and visualization of
Tamil Usage in Live Text Streams, Tamil
Internet Conference,
Coimbatore, 2010.
N.M.Revathi
G.P.Shanthi
Elanchezhiyan.K
T V Geetha
Ranjani Parthasarathi
Madhan Karky
OBJECTIVES
Why Compacting?
Message length is limited on blog sites, and mobile phones
have a tiny user interface.
Compaction saves online storage space and hence reduces
cost.
The paper proposes
a text compaction system for Tamil, the first of its
kind.
Idea of compaction
There is no specific rule for obtaining the shortest word; the
main aim is that the result remains understandable.
A compact word can be obtained by omitting letters and by replacing
prefixes and suffixes with suitable symbols and numbers.
FRAMEWORK ARCHITECTURE
FRAMEWORK CONT..
Input Processing
The morphological analyzer removes the suffix
(if present) added to the word and delivers the
root word (RW).
FRAMEWORK CONT..
Identification of the category & Extraction of compact
word
 Three categories of words: common Tamil words,
abbreviations/acronyms, numbers.
 Abbreviations/acronyms are identified by comparing the word with the keys of the
hashmap.
 With the help of the hash key and a mapping
algorithm, the compact word is retrieved.
 Otherwise the word is either a common Tamil word or a number.
 If it is a number, the Numerical analyser performs text to number
conversion.
Output Processing:
 The Tamil Morphological Generator adds the suitable suffix to
satisfy the rules of the language.
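The category dispatch described above can be sketched roughly as follows. Everything here is a hypothetical stand-in: the two lookup tables use English placeholders instead of Tamil root words, and the vowel-dropping rule only illustrates "omitting letters", since the slides do not give the actual mapping algorithm.

```python
# Hypothetical lookup tables; the real system works on Tamil root words.
ABBREVIATIONS = {"tamil nadu": "TN", "anna university": "AU"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "ten": 10}


def categorize(root_word: str) -> str:
    """Classify a root word into one of the three categories on the slide."""
    if root_word in ABBREVIATIONS:   # compare against the hashmap keys
        return "abbreviation"
    if root_word in NUMBER_WORDS:
        return "number"
    return "common"


def drop_inner_vowels(word: str) -> str:
    """Illustrative 'omit letters' rule: keep the first letter, drop later vowels."""
    return word[0] + "".join(ch for ch in word[1:] if ch not in "aeiou")


def compact(root_word: str) -> str:
    """Return a compact form of an already suffix-stripped root word."""
    category = categorize(root_word)
    if category == "abbreviation":
        return ABBREVIATIONS[root_word]
    if category == "number":
        return str(NUMBER_WORDS[root_word])  # text to number conversion
    return drop_inner_vowels(root_word)


for word in ["anna university", "ten", "message"]:
    print(word, "->", compact(word))
```

In the full pipeline the morphological analyser would run before `compact` and the morphological generator after it, so that the suffix is re-attached to the compact form.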
RESULT AND ANALYSIS
 Tested with over 10,000 words.
 The final result is reduced to 40% of the original text.
REFERENCES
 Anandan, R. Parthasarathi, and Geetha,
Morphological Analyser for Tamil. ICON 2002, 2002.
Fung, L. M. (2005). SMS short form identification and
codec. Unpublished master’s thesis, National
University of Singapore, Singapore .
 Acrophile (LSLarkey, P Ogilvie, MA Price, B
Tamilio, 2000) a system that automatically searches
acronym expansion pairs.
 Short Message Service (SMS) Texting Symbols: A
Functional Analysis of 10,000 Cellular Phone Text
Messages by Robert E. Beasley,Franklin College.
Kuralagam Concept Relation based Search Engine for
Thirukkural
Elanchezhiyan.K
T.V.Geetha
Ranjani Parthasarathi
Madhan Karky
Objectives
Kuralagam is a conceptual search framework for
Thirukkural – based on UNL Framework.
Searching with keywords – in kurals and
interpretations
Concept based search based on CoReX – conceptual
indexing based on UNL
Bilingual search – English and Tamil
Showing Relationships between the concepts.
Kuralagam Framework
Offline Processing
Web Crawler
A Thirukkural statistics crawler
crawls news and blog documents to find the usage of
each individual Thirukkural.
The recorded usage is used to measure the popularity score of
each Thirukkural.
Enconversion – based on UNL
Indexing – based on the CoReX framework
UNL & Enconversion

UNL is an intermediate language that
processes knowledge across language barriers.
captures semantics by converting natural
language terms present in the document to
concepts.
Concepts are connected to other concepts
through UNL relations - 46 UNL relations,
e.g. plf (Place From), plt (Place To), tmf (Time from), tmt (Time
to) etc.
The process of converting a natural language text to a
UNL graph is known as Enconversion;
the reverse process is known as Deconversion.
An Example speaks more...

Ex: John was playing in the garden
agt(play(icl>action), john(iof>person))
plc(play(icl>action), garden(icl>place))
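One simple way to hold such a UNL graph in memory is as a list of labelled edges. The sketch below encodes the example sentence this way; the `Relation` type and the helper names are illustrative assumptions, and the edge direction (relation from the action to its agent and place) follows the usual UNL reading of `agt` and `plc`.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    label: str   # UNL relation name, e.g. "agt" or "plc"
    source: str  # concept the relation starts from
    target: str  # concept the relation points to


# "John was playing in the garden" as two labelled edges.
GRAPH = [
    Relation("agt", "play(icl>action)", "john(iof>person)"),
    Relation("plc", "play(icl>action)", "garden(icl>place)"),
]


def concepts(graph: list[Relation]) -> list[str]:
    """List the distinct concepts in order of first appearance."""
    seen: list[str] = []
    for rel in graph:
        for node in (rel.source, rel.target):
            if node not in seen:
                seen.append(node)
    return seen


def related(graph: list[Relation], concept: str, label: str) -> list[str]:
    """Concepts reached from `concept` through relation `label`."""
    return [r.target for r in graph if r.source == concept and r.label == label]


print(concepts(GRAPH))
print(related(GRAPH, "play(icl>action)", "agt"))
```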
Indexer
The Kuralagam Indexer is designed based on
CoReX Techniques.
The Indexer stores and manages the UNL graphs in
two different indices.
Concept only index (C index), and
Concept-Relation-Concept index (CRC index)
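As a rough illustration of the two indices, the sketch below builds a concept-only index and a concept-relation-concept index from UNL triples. It is an assumption-laden toy, not the CoReX implementation: the document ids stand in for Thirukkural numbers, and the triples are invented.

```python
from collections import defaultdict


def build_indices(documents):
    """Build a C index (concept -> ids) and a CRC index (triple -> ids)."""
    c_index = defaultdict(set)
    crc_index = defaultdict(set)
    for doc_id, triples in documents.items():
        for source, relation, target in triples:
            c_index[source].add(doc_id)
            c_index[target].add(doc_id)
            crc_index[(source, relation, target)].add(doc_id)
    return c_index, crc_index


# Invented triples keyed by a hypothetical Thirukkural number.
DOCS = {
    1: [("play", "agt", "john"), ("play", "plc", "garden")],
    2: [("sing", "agt", "john")],
}

c_index, crc_index = build_indices(DOCS)
print(sorted(c_index["john"]))                   # ids mentioning the concept
print(sorted(crc_index[("play", "agt", "john")]))  # ids with the exact relation
```

The CRC index is the more selective of the two, which is consistent with the ranking slide giving CRC matches priority over C matches.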
Online Processing
 Query Translation and Expansion
 converts the user query to UNL graph.
 uses CRC (Concept Relation Concept) CoReX indices to fetch
similarity thesaurus and co-occurrence list to populate the Multi list
Data Structure.
 Search and Ranking
 fetches the Thirukkural number and its details.
 Thirukkurals for a given query are fetched using the two types of
concept relation indices namely CRC and C.
 The query concept is expanded using related CRC indices pointing
to the query concept.
helps in retrieving many Thirukkurals conceptually related to the
query – not possible with key word Thirukkural search engines.
 The ranking is based on
priority to the indices in the order CRC>C
usage score
frequency of occurrence of the query concept
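The slide names three ranking factors but not how they are combined. One plausible reading, shown here purely as an assumption, is a lexicographic sort: index type first (CRC before C), then usage score, then concept frequency. The candidate records and field names are hypothetical.

```python
INDEX_PRIORITY = {"CRC": 0, "C": 1}  # lower value ranks first


def rank(candidates):
    """Sort by index priority, then higher usage, then higher frequency."""
    return sorted(
        candidates,
        key=lambda c: (INDEX_PRIORITY[c["index"]], -c["usage"], -c["frequency"]),
    )


# Hypothetical search hits; "kural" is the Thirukkural number.
CANDIDATES = [
    {"kural": 10, "index": "C", "usage": 0.9, "frequency": 3},
    {"kural": 20, "index": "CRC", "usage": 0.2, "frequency": 1},
    {"kural": 30, "index": "CRC", "usage": 0.7, "frequency": 2},
]

print([c["kural"] for c in rank(CANDIDATES)])
```

A weighted sum of the three factors would be an equally reasonable design; the slides do not say which the system uses.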
Tab Layout
Performance Evaluation
The accuracy of the Thirukkural search engine
was measured using the average precision and
mean average precision.
The comparisons between concept based search
and keyword based search were measured using
Average Precision methodology
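For reference, the standard information-retrieval definitions of average precision and mean average precision can be computed as below. The slides do not reproduce their exact formula, so this assumes the usual definition (precision averaged over the ranks at which relevant results appear); function names are illustrative.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average of precision@k over the ranks k that hold a relevant result.

    ranked_relevance: booleans, one per returned result, in rank order.
    total_relevant: number of relevant items overall; defaults to the
    number of relevant items actually retrieved.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(runs):
    """Mean of average precision over several queries."""
    return sum(average_precision(run) for run in runs) / len(runs)


# Relevant results at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision([True, False, True]))
print(mean_average_precision([[True], [False, True]]))
```

Running the same queries through a keyword-based and a concept-based engine and comparing the two scores gives the kind of comparison the slide describes.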
Average Precision
Reference
 1. Subalalitha, T V Geetha, Ranjani Parthasarathy and Madhan Karky
Vairamuthu. CoReX: A Concept Based Semantic Indexing Technique.
In SWM-08. 2008. India.
 2. Foundation, U., the Universal Networking Language (UNL)
Specifications Version 3 3ed. December 2004: UNL Computer Society,
2004. 8(5).Center UNDL Foundation
 3. Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for
Tamil. ICON 2002, 2002.
 4. T.Dhanabalan, K.Saravanan, and T.V.Geetha. 2002. Tamil to UNL
Enconverter, ICUKL, Goa, India.
 5. Andrew, T. and S. Falk. User performance versus precision measures
for simple search tasks. In 29th Annual international ACM SIGIR
Conference on Research and Development in information Retrieval
2006. Seattle, Washington, USA.
Template Based MultiLingual
Summary Generation
Subalalitha C.N
E.Umamaheswari
T V Geetha
Ranjani Parthasarathi
Madhan Karky
Aim

To generate a multilingual summary based on the
Universal Networking Language (UNL) framework
The Architecture
Multi Lingual Summary Generation
using UNL
Template based Information Extraction
• Seven tourism specific templates have been
designed and used
• Templates filled using semantic information
inherent in UNL input graphs
• Template information is language independent and
can be used with any desired language.
Example Templates for Tourism Domain
Template              Semantics inherited from UNL
God                   iof>god, iof>goddess, icl>god
Food                  icl>food, icl>fruit
Flora and Fauna       icl>animal, icl>reptile, icl>mammal, icl>plant
Boarding facility     icl>facility
Transport facility    icl>transport
Place                 icl>place, iof>place, iof>city, iof>country
Distance              icl>unit, icl>number
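Template filling from these semantic constraints can be sketched as a lookup from a concept's UNL constraint to the template that lists it. The table contents come from the slide; the concept strings passed in and the helper names are hypothetical, and real UNL concepts can carry richer constraint lists than this toy parser handles.

```python
# Template name -> UNL semantic constraints, as listed on the slide.
TEMPLATES = {
    "God": {"iof>god", "iof>goddess", "icl>god"},
    "Food": {"icl>food", "icl>fruit"},
    "Flora and Fauna": {"icl>animal", "icl>reptile", "icl>mammal", "icl>plant"},
    "Boarding facility": {"icl>facility"},
    "Transport facility": {"icl>transport"},
    "Place": {"icl>place", "iof>place", "iof>city", "iof>country"},
    "Distance": {"icl>unit", "icl>number"},
}


def split_concept(concept: str) -> tuple[str, str]:
    """Split 'chennai(iof>city)' into ('chennai', 'iof>city')."""
    headword, _, rest = concept.partition("(")
    return headword, rest.rstrip(")").replace(" ", "")


def fill_templates(concepts: list[str]) -> dict[str, list[str]]:
    """Assign each concept's headword to every template matching its constraint."""
    filled: dict[str, list[str]] = {}
    for concept in concepts:
        headword, constraint = split_concept(concept)
        for name, constraints in TEMPLATES.items():
            if constraint in constraints:
                filled.setdefault(name, []).append(headword)
    return filled


# Hypothetical concepts extracted from a tourism document.
print(fill_templates(["chennai(iof>city)", "mango(icl>fruit)", "run(icl>action)"]))
```

Because the filled slots hold UNL headwords rather than surface words, the same result can be rendered into any target language, which is the language independence the earlier slide points out.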
Summary Generation
• The template information is converted to the target language
using the respective UNL-target language dictionaries.
• UNL-target language dictionaries contain root words.
• The natural language term is obtained from the root word using
target language information such as case suffixes and language
technology tools such as a morphological generator
(சென்னை+இல்=சென்னையில்).
• When this converted template information is fitted into
target language specific dynamic sentence patterns, a
summary is generated.
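The சென்னை+இல் example shows why a morphological generator is needed: the suffix cannot simply be concatenated. The sketch below implements only one simplified joining rule, as an assumption: when the stem ends in one of the vowel signs ி, ீ or ை and the suffix starts with an independent vowel, the glide ய is inserted and carries the suffix vowel as a vowel sign. A real generator covers many more rules; the names here are illustrative.

```python
# Independent Tamil vowel -> dependent vowel sign (அ has no sign).
VOWEL_TO_SIGN = {
    "\u0b85": "",        # அ
    "\u0b86": "\u0bbe",  # ஆ
    "\u0b87": "\u0bbf",  # இ
    "\u0b88": "\u0bc0",  # ஈ
    "\u0b89": "\u0bc1",  # உ
    "\u0b8a": "\u0bc2",  # ஊ
    "\u0b8e": "\u0bc6",  # எ
    "\u0b8f": "\u0bc7",  # ஏ
    "\u0b90": "\u0bc8",  # ஐ
    "\u0b92": "\u0bca",  # ஒ
    "\u0b93": "\u0bcb",  # ஓ
}
FRONT_VOWEL_SIGNS = {"\u0bbf", "\u0bc0", "\u0bc8"}  # ி, ீ, ை
GLIDE_YA = "\u0baf"  # ய


def attach_suffix(stem: str, suffix: str) -> str:
    """Join stem and case suffix, inserting the glide ய where the rule applies."""
    if stem and suffix and stem[-1] in FRONT_VOWEL_SIGNS and suffix[0] in VOWEL_TO_SIGN:
        return stem + GLIDE_YA + VOWEL_TO_SIGN[suffix[0]] + suffix[1:]
    return stem + suffix  # other joining rules are not handled in this sketch


CHENNAI = "\u0b9a\u0bc6\u0ba9\u0bcd\u0ba9\u0bc8"  # சென்னை
IL = "\u0b87\u0bb2\u0bcd"                          # இல் (locative suffix)

print(attach_suffix(CHENNAI, IL))
```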
Performance Evaluation



Tested with 33,000 Tamil and English text
documents enconverted to UNL graphs.
The performance of the proposed methodology has
been evaluated using human judgement.
The accuracy of the generated summary reached 90%.
Further Enhancements
•Query specific summary
•Comparing the performance with human generated
summaries.
References
[1] Elanchezhiyan K, T V Geetha, Ranjani Parthasarathi & Madhan Karky, CoRe –
Concept Based Query Expansion, Tamil Internet Conference, Coimbatore, 2010.
[2] Alkesh Patel , Tanveer Siddiqui , U. S. Tiwary , “A language independent approach to
multilingual text summarization”, Conference RIAO2007, Pittsburgh PA, U.S.A. May
30-June 1,2007
[3]David Kirk Evans, “Identifying Similarity in Text: Multi-Lingual Analysis for
Summarization ”, Doctor of Philosophy thesis, Graduate School of Arts and Sciences ,
Columbia University, 2005
[4] Radev, Allison, Blair-Goldensohn et al (2004), MEAD – a platform for
multidocument multilingual text summarization
[5] The Universal Networking Language (UNL) Specifications Version 3 Edition 3, UNL
Center UNDL Foundation December 2004.
[6] Jagadeesh J, Prasad Pingali, Vasudeva Varma, “ Sentence Extraction Based Single
Document Summarization” Workshop on Document Summarization, March, 2005, IIIT
Allahabad.
[7] Naresh Kumar Nagwani, Dr. Shrish Verma , “A Frequent Term and Semantic
Similarity based Single Document Text Summarization Algorithm ” International Journal
of Computer Applications (0975 – 8887) Volume 17– No.2, March 2011 .
[8]Prof. R. Nedunchelian, “Centroid Based Summarization of Multiple Documents
Implemented using Timestamps ” First International Conference on Emerging Trends in
Engineering and Technology, IEEE 2008