tvgeetha-pres-TIC201..

Work at TACOLA Lab

Team Members
T.V. Geetha, Ranjani Parthasarathi, Madhan Karky, E. Umamaheswari, J. Balaji, Subalalitha, Elanchezhiyan K., Karthika, Thenmalar, Radhakrishnan, Kandasamy, Padmavathi, Aruna, Vijayavani

Tamil Language Processing
• Tamil Language Processing
  • Morphological analyser: Normal Words, Compound Words, Colloquial Words
  • Parser: Simple, Complex and Compound Sentences
  • Semantic analysis based on UNL
• Language Technology
  • Blog Mining
  • Ontology Based Information Extraction
  • Personalized Search
  • Parallelization for NLP Processing
  • Emotion detection from text
• Carnatic Music Processing
  • Raga Modelling
  • Singer, Genre Identification
  • Music Emotion Recognition
• Tamil Language Oriented Tools
  • Dictionary
  • Text Compaction
• UNL Based Work
  • UNL for semantic representation
  • Nested UNL
  • Concept based Search
  • Bi-lingual Search
  • Event Processing
  • Discourse Analysis
  • Summarization
  • Question answering
  • Thirukural Search
• Lyric Oriented Processing
  • Lyric Mining
  • Lyrics for Tunes
  • Pleasantness

Dr. T.V. Geetha, Anna University

Papers for TIC 2011
Tamil Language Oriented Tools
• Agaraadhi: A Novel Online Dictionary Framework
• An Efficient Tamil Text Compaction System (Surukkupai)
• Kuralagam: A Concept Relation Based Search Framework for Thirukural
• Popularity Based Scoring Model for Tamil Word Games
Tamil Language Processing
• Template Based Multilingual Summary Generation
• On Emotion Detection from Tamil Text
• Tamil Summary Generation for Cricket Match
Lyric Oriented Processing
• Lyric Mining: Word, Rhyme & Concept Co-occurrence Analysis
• Special Indices for LaaLaLaa Lyric Analysis & Generation Framework

AGARAADHI: A NOVEL ONLINE DICTIONARY FRAMEWORK
Elanchezhiyan K., Karthikeyan S., T.V. Geetha, Ranjani Parthasarathi, Madhan Karky

OBJECTIVES
• Agaraadhi is a dictionary framework for indexing and retrieving Tamil words, their meanings, analysis and related information.
• The framework incorporates various unique features, designed to give users additional information about the word they query.

INTRODUCTION
• The Agaraadhi dictionary has more than 3 lakh (300,000) words in various domains such as:
  • General
  • Literature
  • Medical
  • Engineering
  • Computer Science
  • Bird Names, and more
• Agaraadhi is a Tamil-English bilingual dictionary.

INTRODUCTION CONT…
• Agaraadhi is a Tamil-English bilingual dictionary with 20 features, such as:
  • morphological analysis
  • morphological generation
  • word usage statistics
  • word pleasantness analysis
  • spell checking
  • similar word finder
  • word usage in literature
  • picture dictionary
  • number to text conversion
  • phonetic transliteration
  • live usage analysis from micro blogs, and more

AGARAADHI FRAMEWORK CONT…
[Figure: framework diagram]

AGARAADHI FEATURES
Morphological Analyser
• Gives the morphological features of the query word, such as root word, part of speech, gender, tense and count.
• If the query word is padithaan, the Morphological Analyser gives padi as the root and reports that the word is masculine, in the past tense, and so on.
Morphological Generator
• The Tamil morphological generator handles different syntactic categories such as nouns, verbs, postpositions, adjectives and adverbs.
• The generator is used to generate possible morphological variations of the query word.
Spell Checker
• Checks the spelling of Tamil words and provides alternative suggestions for wrongly spelt words.
• If the root word is not in the dictionary, it generates all possible suggestions with minimum variation from the given word.

AGARAADHI FEATURES
Word Suggestions
• Gives the list of equivalent or related words for the given query word.
Word Pleasantness
• The score generator indicates how easy the word is to pronounce.
Word Popularity Score
• Shows the word's usage on the web, based on the frequency distribution of the word across popular blogs, news articles, social networks, etc.
Word Usage Statistics
• Shows the usage of the word in social networks over the past week.
Word Usage in Literature
• Finds the usage of the word in popular literature such as Thirukural, Bharathiyar Padalgal and Avvai songs, and also in the lyrics of Tamil movie songs.

AGARAADHI FEATURES
Word of the Day
• A rare word is chosen at random and displayed on the opening page so that users can learn a new word every day.
Number to Text Converter
• Converts a number to its Tamil word equivalent as well as to English text. For example, in Tamil, oru Arpputham (அற்புதம்) denotes 100 million, Kumbam (கும்பம்) 10 billion, and so on up to Anniyan (அந்நியம்) for one zilli
Picture Dictionary
• Pictures, photos or line drawings depicting popular words are included in the dictionary to help children learn efficiently with this tool.

RESULTS
Query word: pookkal (பூக்கள்)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AA%E0%AF%82%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%B3%E0%AF%8D+&ln=ta&Submit.x=8&Submit.y=7
Query word: mazhai (மழை)
http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AE%E0%AE%B4%E0%AF%88+&ln=ta&Submit.x=21&Submit.y=4
Query word: fruit
http://www.agaraadhi.com/dict/OD.jsp?w=fruit&ln=en

FUTURE WORK
• Providing APIs for programmers and developing mobile apps for the Agaraadhi framework will open a good platform for researchers and developers working in Tamil computing.

AN EFFICIENT TAMIL TEXT COMPACTION SYSTEM (SURUKKUPAI)
N.M. Revathi, G.P. Shanthi, Elanchezhiyan K., T.V. Geetha, Ranjani Parthasarathi, Madhan Karky

OBJECTIVES
Why compacting?
• Message length is limited on blog sites, and mobile phones have tiny user interfaces.
• Compaction saves online storage space and hence reduces cost.
• The paper proposes a text compaction system for Tamil, the first of its kind.
Idea of compaction
• There is no specific rule for obtaining the shortest word; the main aim is that the result remains understandable.
• A compact form can be obtained by omitting letters and by replacing prefixes and suffixes with suitable symbols and numbers.

FRAMEWORK ARCHITECTURE
[Figure: framework architecture diagram]

FRAMEWORK CONT..
Input Processing
• The morphological analyzer removes the suffix (if present) added to the word and delivers the root word (RW).

FRAMEWORK CONT..
Identification of the category & extraction of the compact word
• Words fall into three categories: common Tamil words, abbreviations/acronyms, and numbers.
• Abbreviations/acronyms are identified by comparing the word with the keys of the hashmap.
• With the help of the hash key and a mapping algorithm, the compact word is retrieved.
• Otherwise the word is either a common Tamil word or a number.
• If it is a number, the numerical analyser performs text to number conversion.
Output Processing
• The Tamil Morphological Generator adds the suitable suffix so that the output follows the rules of the language.

RESULT AND ANALYSIS
• Tested with over 10,000 words.
• The final result is reduced to 40% of the original text.

Kuralagam: Concept Relation based Search Engine for Thirukkural
Elanchezhiyan K., T.V. Geetha, Ranjani Parthasarathi, Madhan Karky

Objectives
• Kuralagam is a conceptual search framework for Thirukkural, based on the UNL framework.
• Keyword search in kurals and interpretations
• Concept based search using CoReX, a conceptual indexing technique based on UNL
• Bilingual search: English and Tamil
• Showing relationships between the concepts

Kuralagam Framework
[Figure: framework diagram]

Offline Processing
Web Crawler
• A Thirukkural statistics crawler crawls news and blog documents to find the usage of each individual Thirukkural.
• The recorded usage is used to measure the popularity score of each Thirukkural.
Enconversion
• Based on UNL
Indexing
• Based on the CoReX framework

UNL & Enconversion
• UNL is an intermediate language that:
  • processes knowledge across language barriers
  • captures semantics by converting natural language terms present in the document to concepts
  • connects concepts to other concepts through UNL relations (46 UNL relations), e.g. plf (place from), plt (place to), tmf (time from), tmt (time to)
• The process of converting natural language text to a UNL graph is known as Enconversion; the reverse process is known as Deconversion.

An Example speaks more...
Ex: John was playing in the garden
• Concepts: john(iof>person), play(icl>action), garden(icl>place)
• Relations: agt links play and john; plc links play and garden

Indexer
• The Kuralagam Indexer is designed using CoReX techniques.
• The Indexer stores and manages the UNL graphs in two different indices: a Concept-only index (C index) and a Concept-Relation-Concept index (CRC index).

Online Processing
Query Translation and Expansion
• Converts the user query to a UNL graph.
• Uses the CRC (Concept Relation Concept) CoReX indices to fetch the similarity thesaurus and co-occurrence list that populate the Multi list Data Structure.
Search and Ranking
• Fetches the Thirukkural number and its details.
• Thirukkurals for a given query are fetched using the two types of concept relation indices, CRC and C.
• The query concept is expanded using related CRC indices pointing to the query concept. This helps retrieve many Thirukkurals conceptually related to the query, which is not possible with keyword-based Thirukkural search engines.
• Ranking is based on:
  • priority of the indices, in the order CRC > C
  • usage score
  • frequency of occurrence of the query concept

Tab Layout
[Figure: tab layout screenshot]

Performance Evaluation
• The accuracy of the Thirukkural search engine was measured using average precision and mean average precision.
• Concept based search and keyword based search were compared using the Average Precision methodology.

Average Precision
[Figure: average precision results]

Template Based MultiLingual Summary Generation
Subalalitha C.N., E. Umamaheswari, T.V. Geetha, Ranjani Parthasarathi, Madhan Karky

Aim
• To generate a multilingual summary based on the Universal Networking Language (UNL) framework

The Architecture
[Figure: architecture diagram]

Multi Lingual Summary Generation using UNL
Template based Information Extraction
• Seven tourism-specific templates have been designed and used.
• Templates are filled using the semantic information inherent in the UNL input graphs.
• Template information is language independent and can be used with any desired language.

Example Templates for Tourism Domain
Template              Semantics inherited from UNL
God                   iof>god, iof>goddess, icl>god
Food                  icl>food, icl>fruit
Flora and Fauna       icl>animal, icl>reptile, icl>mammal, icl>plant
Boarding facility     icl>facility
Transport facility    icl>transport
Place                 icl>place, iof>place, iof>city, iof>country
Distance              icl>unit, icl>number

Summary Generation
• The template information is converted to the target language using the respective UNL-target language dictionaries.
• UNL-target language dictionaries contain root words.
• The natural language term is obtained from the root word using target language information such as case suffixes, and language technology tools such as a morphological generator (சென்னை+இல்=சென்னையில்).
• When the converted template information is fitted into target-language-specific dynamic sentence patterns, a summary is generated.
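The template filling and pattern fitting steps described for summary generation can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the template constraints are taken from the tourism templates listed in the slides, while the function names, the toy UNL-to-English dictionary, the sentence patterns and the sample concepts are all assumptions made for the example. Morphological generation (adding case suffixes to root words) is not modelled here.

```python
# Semantic constraints per template, as listed in the tourism domain slide.
TEMPLATE_SEMANTICS = {
    "God": {"iof>god", "iof>goddess", "icl>god"},
    "Food": {"icl>food", "icl>fruit"},
    "Flora and Fauna": {"icl>animal", "icl>reptile", "icl>mammal", "icl>plant"},
    "Boarding facility": {"icl>facility"},
    "Transport facility": {"icl>transport"},
    "Place": {"icl>place", "iof>place", "iof>city", "iof>country"},
    "Distance": {"icl>unit", "icl>number"},
}

# Hypothetical UNL-to-target-language dictionary of root words (English here).
UNL_TO_ENGLISH = {"chennai": "Chennai", "idli": "idli", "bus": "bus"}

# Hypothetical target-language sentence patterns, one per template.
PATTERNS = {
    "Place": "The place described is {}.",
    "Food": "Food available: {}.",
    "Transport facility": "Transport available: {}.",
}


def fill_templates(concepts):
    """Assign each (headword, constraint) pair to every template whose
    semantic constraints include that constraint."""
    filled = {}
    for headword, constraint in concepts:
        for template, constraints in TEMPLATE_SEMANTICS.items():
            if constraint in constraints:
                filled.setdefault(template, []).append(headword)
    return filled


def generate_summary(filled, dictionary, patterns):
    """Translate template contents with the dictionary and fit them
    into the sentence patterns, in pattern order."""
    sentences = []
    for template, pattern in patterns.items():
        words = [dictionary[w] for w in filled.get(template, []) if w in dictionary]
        if words:
            sentences.append(pattern.format(", ".join(words)))
    return " ".join(sentences)


# Sample concepts as they might appear in a UNL graph (illustrative only).
concepts = [
    ("chennai", "iof>city"),
    ("idli", "icl>food"),
    ("bus", "icl>transport"),
    ("play", "icl>action"),  # matches no tourism template, so it is ignored
]
filled = fill_templates(concepts)
summary = generate_summary(filled, UNL_TO_ENGLISH, PATTERNS)
print(summary)
```

Because the filled templates hold only UNL headwords, swapping in a different dictionary and pattern set would produce the summary in another language without changing the extraction step, which is the language independence the slides describe.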
Performance Evaluation
• Tested with 33,000 Tamil and English text documents enconverted to UNL graphs.
• The performance of the proposed methodology has been evaluated using human judgement.
• The generated summaries achieved an accuracy of 90%.
Further Enhancements
• Query-specific summary
• Comparing the performance with human-generated summaries
