International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 6 - Dec 2013

A Survey of Recent Keywords and Topic Extraction Systems for Indian Languages

Vishal Gupta
Assistant Professor, UIET, Panjab University, Sector-25, Chandigarh, India

Abstract— Keywords are the thematic words of a document and represent its topic. Search engines and document databases commonly use keywords to locate information and to determine whether two pieces of text are related to each other. Key term extraction, also known as term mining, term retrieval, term recognition or glossary extraction, is a subtask of information extraction. Its overall aim is to extract relevant terms automatically from a corpus. Automatic term extraction techniques mainly apply linguistic processing (part-of-speech tagging and phrase chunking) to find suitable keywords. The extracted keywords are useful for modelling a knowledge domain or for supporting the construction of an ontology for that domain. Topic identification is the task of identifying previously unknown topics. Topic detection is the task of grouping stories that describe the same event. A related task associates stories with topics that are already known. Term extraction is also a useful starting point for semantic similarity, machine translation and knowledge management. This paper reviews recent keyword and topic extraction techniques for Indian languages.

Keywords— Keywords extraction, topic extraction, Indian languages, term extraction.

I. INTRODUCTION
Keywords are the thematic words of a document and represent its topic.
Keywords are commonly used by search engines and document databases to locate information and to determine whether two pieces of text are related to each other. Key term extraction, also known as term mining, term retrieval, term recognition or glossary extraction, is a subtask of information extraction. Its overall aim is to extract relevant terms automatically from a corpus. Automatic term extraction techniques mainly apply linguistic processing (part-of-speech tagging and phrase chunking) to find suitable keywords. The extracted keywords are useful for modelling a knowledge domain or for supporting the construction of an ontology for that domain. Topic identification [11] is the task of identifying previously unknown topics. Topic detection is the task of grouping stories that describe the same event. A related task associates stories with topics that are already known. Term extraction is also a useful starting point for semantic similarity, machine translation and knowledge management.

Concept recognition [15] deals with predicting the words of a text that can represent its concept or topic. In recent years, this task has been studied by computational linguists in connection with areas such as anaphora resolution, coreference and discourse. Predicting the relevant terms and concepts of a document is also critical in information retrieval, where it helps to retrieve important documents, although such terms do not necessarily correspond to the topic or theme. Predicting relevant words involves assigning a numerical weight to each word of the document. Words with higher scores are considered relevant and important.
These terms can be regarded as representing the whole document. This paper reviews recent keyword and topic extraction techniques for Indian languages.

II. KEYWORDS AND TOPIC EXTRACTION TECHNIQUES FOR INDIAN LANGUAGES
Preeti and Brahmaleen Kaur Sidhu (2013) [1] proposed a Punjabi keyword extraction system that takes Punjabi text in Unicode format as input. The text is scanned to filter out special tokens such as \\, ||, (, ), [, ], *, {, }, !, ^, + and -. Punctuation marks, brackets and numbers are replaced by blank space. A word segmentation phase then identifies the individual terms of the input document so that each term is represented as a separate token. The output of word segmentation is passed to a part-of-speech (POS) tagger, which labels each word with its word class, such as noun, pronoun or adjective. After POS tagging, the tags are stored in a database. The system then identifies phrases from the database using the subject-object-verb rule. The resulting list of candidate phrases is the input to the final key phrase extraction step, in which the frequency of every phrase is calculated and the most frequently occurring phrases are selected as Punjabi key phrases. On average, the system extracts 11, 9.6 and 6.6 key phrases from Punjabi stories, articles and news documents respectively.
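The final selection step described above, cleaning the text and keeping the most frequent candidate phrases, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system of [1]: the candidate phrases are assumed to have been produced already by the POS tagger and the subject-object-verb rule, the character set in `SPECIAL` and the names `clean` and `top_phrases` are illustrative, and the sample strings are placeholders rather than corpus data.

```python
import re
from collections import Counter

# Characters treated as special tokens. This set is an assumption modelled on
# the examples listed in the text (plus digits and the danda), not the
# authors' exact list.
SPECIAL = re.compile(r"[\\|()\[\]*{}!^+\-,.0-9।]")


def clean(text: str) -> str:
    """Replace special tokens, punctuation and digits with blank space."""
    spaced = SPECIAL.sub(" ", text)
    return re.sub(r"\s+", " ", spaced).strip()


def top_phrases(candidate_phrases, k=3):
    """Count cleaned candidate phrases and return the k most frequent ones."""
    counts = Counter(clean(p) for p in candidate_phrases)
    counts.pop("", None)  # drop candidates that were only punctuation
    return [phrase for phrase, _ in counts.most_common(k)]


# Placeholder candidates standing in for phrases found by the SOV rule.
candidates = ["a b", "a b!", "c d", "(a b)", "c d"]
best = top_phrases(candidates, k=2)
```

Cleaning each candidate before counting means that surface variants such as `a b!` and `(a b)` are counted as the same phrase, which is one plausible reading of the filtering step in the text.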
Gupta and Lehal (2011) [2] proposed automatic key term extraction for Punjabi. The system has the following steps: removing Punjabi stop words; identifying nouns in the Punjabi text and stemming them with a rule-based Punjabi noun stemmer; computing term frequency and inverse sentence frequency (TF-ISF); selecting the noun key terms with high TF-ISF values; and applying the title/news-headline sentence feature of the Punjabi document. Punjabi nouns with high TF-ISF values are treated as key terms. Finally, the key terms are obtained as the union of the key terms found in the title and the noun key terms with high TF-ISF values from the earlier step. The F-measure, recall and precision of this system are 85.2%, 90.6% and 80.4% respectively.

Kaur and Gupta (2011) [3] proposed another Punjabi keyword extraction system. It uses a hybrid approach that combines several features. The system applies gazetteer lists generated from a Punjabi dictionary with the help of a part-of-speech tagger. Key terms are extracted from cue terms, title terms and high-frequency nouns. The system gives good results for each feature independently, and the results improve when the features are combined. The F-measure, recall and precision are 93.03, 90.19 and 98.28 respectively.

Sarkar (2011) [4] proposed a key term extraction technique for Bengali. The system comprises several phases: extraction of n-grams, identification of candidate key terms, and scoring of the candidates.
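As a rough illustration of the three phases just listed, the Python sketch below extracts n-grams, keeps those that could plausibly be key terms, and scores them. The details are assumptions made for the example and are not taken from [4]: the candidate filter (an n-gram may not begin or end with a stop word), the raw-frequency score, and the names `ngrams`, `candidate_key_terms` and `score_candidates` are all hypothetical, and the English tokens are placeholders for Bengali text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Phase 1: all contiguous n-grams of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_key_terms(tokens, stopwords, max_n=3):
    """Phase 2: keep n-grams that do not start or end with a stop word.

    The filter is an assumed heuristic, not the rule used in the paper.
    """
    candidates = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram[0] not in stopwords and gram[-1] not in stopwords:
                candidates.append(" ".join(gram))
    return candidates


def score_candidates(candidates):
    """Phase 3: score each candidate by raw frequency (placeholder scoring)."""
    return Counter(candidates).most_common()


tokens = "the river bank of the river".split()
ranked = score_candidates(candidate_key_terms(tokens, {"the", "of"}))
```

In a real system for an inflectional language, the tokens would be stemmed before counting so that inflected forms of one term share a single score.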
Because Bengali is highly inflectional, a lightweight Bengali stemmer was built to stem the key terms. The system was tested on a set of Bengali documents taken from the online Bengali corpus that can be downloaded from the TDIL website.

Saraswathi et al. (2010) [5] proposed a keyword extraction system for Tamil and English as part of a bilingual information retrieval system. The aim of the approach is to return results for a query in the same language in which the query was typed. The authors built a domain-specific ontological tree in which every node holds entries in both languages. Part-of-speech tagging is used to find the key terms of the input query: the query typed by the user is passed to the tagger, which tags the sentence and returns its parts of speech. The nouns and verbs identified by the tagger are treated as the key terms for the search. Depending on the topic, these key terms are translated into the target language using the ontological tree, and the documents are then retrieved on the basis of the key terms.

Jayashree.R et al. (2011) [6] proposed keyword extraction for Kannada document summarization. The system extracts key terms from pre-categorized Kannada documents collected from various online resources. It combines GSS (Galavotti, Sebastiani, Simi) coefficients with inverse document frequency and term frequency, and then uses the extracted key terms for summarization.

Sarkar (2011) [7] proposed automatic key phrase extraction from Bengali documents. Key phrases are sequences of words that capture the main concepts of a document.
Key phrases help users to quickly grasp, manage, share and access the information in text documents. The paper proposes a preliminary approach for extracting key phrases from Bengali documents using two essential features: term frequency-inverse document frequency (TF-IDF) and the position of the first occurrence of the term in the input text. The preliminary model works as follows: extract n-grams from the input document, identify candidate key phrases, and finally score the candidates to select the required key phrases. It was tested on a large number of Bengali documents taken from an online Bengali corpus.

Balabantaray et al. (2012) [8] proposed an Odia text summarization system based on key term extraction. The system accepts input text files with the .txt extension. It first tokenizes the input into individual words, then filters out stop words, and then stems every word with an Odia stemmer. Each word is given a weight equal to the ratio of its frequency to the total number of words in the document. Sentences are then scored according to these weights: the final weight of a sentence is the sum of the weights of its words divided by the number of words in that sentence.

Das and Bandyopadhyay (2010) [9] developed a keyword-based Bengali opinion summarization system that extracts sentiment information from each document, aggregates it and presents it as a summary of the text. It uses a topic-sentiment model for sentiment detection and aggregation. The model is built as discourse-level concept detection.
Topic-sentiment aggregation is then obtained by clustering the concepts with the k-means approach and by a document-level relational graph representation. Finally, this document-level graph is used to select sentences for the summary with a PageRank algorithm of the kind applied in information retrieval. The technique achieved F-measure, recall and precision of 69.65%, 67.32% and 72.15% respectively.

Das and Bandyopadhyay (2010) [10] proposed an approach for identifying topic keywords from annotated Bengali blog sentences. They built an unsupervised syntactic system based on the argument structure of each sentence with respect to its verb. If the argument structure acquired for a Bengali blog sentence matches any syntactic frame extracted for the equivalent English verb in VerbNet, the holder and topic roles associated with the English VerbNet frame are mapped to the corresponding terms in the Bengali sentence. Simple rule-based techniques are used to remove errors and improve the performance of the syntactic system. This approach outperforms the baseline system, with F-measures of 66.03% and 61.98% compared with baseline F-measures of 53.85% and 50.02%, in the case of multiple and single holders of emotions and concepts respectively, on 500 reference test sentences.

Das and Ching (2005) [12] proposed a speaker-dependent Bengali keyword spotter for unconstrained English speech. They used two techniques, both of which model the keywords with whole-word HMMs. The Bengali keywords were trained as isolated words.
The first approach uses whole-word filler models. The second approach uses trained English phoneme models together with an all-phone grammar network to model the filler. The whole-word approach achieved a performance of 94.22%, and the second approach achieved a hit rate of 95.83%.

J. Allan et al. (2003) [13] described their effort to develop topic detection and tracking for Hindi news stories. The University of Massachusetts reported results for three topic detection and tracking tasks in the DARPA surprise language evaluation. The system is based on the vector space model of information retrieval. The paper also describes how the relevance judgements used to evaluate the system were generated. The results show that topic tracking effectiveness is comparable with that of topic detection and tracking methods for other languages. The results for clustering and new event detection indicate that the parameter settings for those tasks are sensitive to the language being processed.

Kaur and Gupta (2011) [14] proposed topic tracking for the Punjabi language. The system was tested with two approaches, one based on named entity recognition (NER) and one based on keyword extraction. It determines whether two Punjabi news articles cover the same topic. Features are extracted from the text using the two approaches. The NER and keyword features of the initial news document are compared with the corresponding features of the target news document, and the percentage of matching features is used to decide whether both documents track the same topic. The system was developed on the VB.NET platform, and the gazetteer lists were stored as tables in a database. It takes as input the news articles that are to be compared to check whether they track the same topic.
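The comparison step described above can be sketched as a set overlap between the features of the two articles. This Python sketch is illustrative only and rests on assumptions: the text does not give the matching formula, so the overlap is measured here relative to the initial document's features, the 50% decision threshold is arbitrary, and the names `match_percentage` and `same_topic` are hypothetical.

```python
def match_percentage(initial_features, target_features):
    """Percentage of the initial document's features found in the target."""
    initial = set(initial_features)
    target = set(target_features)
    if not initial:
        return 0.0  # nothing to match against
    return 100.0 * len(initial & target) / len(initial)


def same_topic(initial_features, target_features, threshold=50.0):
    """Decide whether two articles track the same topic.

    The threshold is an assumed value, not one reported in the paper.
    """
    return match_percentage(initial_features, target_features) >= threshold


# Placeholder NER and keyword features for two news articles.
initial = {"chandigarh", "election", "minister", "budget"}
target = {"chandigarh", "budget", "cricket"}
pct = match_percentage(initial, target)
```

Other similarity measures, such as the Jaccard coefficient (intersection size divided by union size), could be swapped in for `match_percentage` without changing the rest of the sketch.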
The input news articles are obtained from Punjabi websites such as likhari.org, jagbani.com, ajitweekly.com, punjabispectrum.com, europevichpunjabi.com, quamiekta.com, sahitkar.com, onlineindian.com, europesamachar.com and parvasi.com. Four experiments were carried out to evaluate topic tracking for Punjabi. The first experiment tested the NER module. The second tested the keyword extraction module. The third evaluated the topic tracking system using the NER technique alone and the keyword extraction technique alone, and then using both techniques combined. In the last experiment, a number of similarity measures were analysed to determine which one gives the best topic tracking results.

Dutta et al. (2005) [16] discussed a hybrid information extraction model, based on geographical keyword identification, for extracting geographic information from unrestricted Hindi text. The relationship between the extracted geographic entities and the adjacent text is depicted graphically in order to relate information about those entities. The technique is a hybrid of linguistic and statistical methods and recognizes both single-word and multi-word geographical names. It was applied to Hindi text, and the authors state that it can be easily adapted to other languages. The authors conducted experiments to measure the accuracy of the technique.

Kothwal and Varma (2013) [17] proposed cross-lingual text reuse detection based on keyphrase extraction. This approach addresses the problem posed in the FIRE CLITR 2011 task of detecting plagiarized Hindi documents that reuse text from English source documents.
The work proposed three approaches using classification and key-phrase extraction techniques, and the best approach attained an F-measure of 0.792.

III. CONCLUSIONS
This paper presents a survey of recent keyword and topic extraction techniques for Indian languages. We conclude from this survey that very few linguistic resources are available for Indian languages, and that developing these resources requires considerable research and development. Keyword extraction and topic extraction systems for Indian languages are at an early stage of research. Although a reasonable amount of linguistic resources is available for Hindi, such resources are still lacking for other Indian languages.

REFERENCES
[1] Preeti and B. K. Sidhu, "Keyphrase Extraction From Punjabi Corpus", International Journal of Engineering Research and Application, vol. 3, pp. 491-494, 2013.
[2] V. Gupta and G. S. Lehal, "Automatic Keywords Extraction for Punjabi Language", International Journal of Computer Science Issues, vol. 8, pp. 327-331, 2011.
[3] K. Kaur and V. Gupta, "Keyword Extraction for Punjabi Language", Indian Journal of Computer Science and Engineering, vol. 2, pp. 364-370, 2011.
[4] K. Sarkar, "An N-Gram Based Method for Bengali Keyphrase Extraction", In Proceedings of International Conference ICISIL-2011, Springer, Patiala, India, pp. 36-41, 2011.
[5] S. Saraswathi, M. A. Siddhiqaa, K. Kalaimagal and M. Kalaiyarasi, "Bi-Lingual Information Retrieval System for English and Tamil", Journal of Computing, vol. 2, pp. 85-89, 2010.
[6] Jayashree.R, Srikanta Murthy.K and Sunny.K, "Document Summarization in Kannada using Keyword Extraction", In Proceedings of AIAA 2011, CS & IT 03, pp. 121-127, 2011.
[7] K.
Sarkar, "Automatic Key phrase Extraction from Bengali Documents: A Preliminary Study", In Proceedings of IEEE Second International Conference on EAIT'11, pp. 125-128, 2011.
[8] R. C. Balabantaray, B. Sahoo, D. K. Sahoo and M. Swain, "Odia Text Summarization using Stemmer", International Journal of Applied Information System, vol. 1, pp. 21-24, 2012.
[9] A. Das and S. Bandyopadhyay, "Topic-Based Bengali Opinion Summarization", Coling 2010: Poster Volume, pp. 232-240, Beijing, 2010.
[10] D. Das and S. Bandyopadhyay, "Identifying Emotion Holder and Topic from Bengali Emotional Sentences", Proceedings of ICON-2010: 8th International Conference on Natural Language Processing, India, 2010.
[11] http://www.itl.nist.gov/iaui/894.01/tdt98/doc/tdtslides/sld001.htm, 1998.
[12] S. Das and P. C. Ching, "Speaker Dependent Bengali Keyword spotting in unconstrained English Speech", A Project report, Indian Institute of Technology Guwahati, India, 2005.
[13] J. Allan, V. Lavrenko and M. E. Connell, "A month to topic detection and tracking in Hindi", ACM Transactions on Asian Language Information Processing (TALIP), vol. 2, pp. 85-100, 2003.
[14] K. Kaur and V. Gupta, "Topic Tracking for Punjabi Language", Computer Science & Engineering: An International Journal (CSEIJ), vol. 1, pp. 37-49, 2011.
[15] T. Nomoto and Y. Matsumoto, "Exploring the text structure for Topic Identification", In Proceedings of the 4th Workshop on Very Large Corpora, pp. 101-112, 1996.
[16] R. Kothwal and V. Varma, "Cross Lingual Text Reuse Detection Based on Keyphrase Extraction and Similarity Measures", Springer's Multilingual Information Access in South Asian Languages, Lecture Notes in Computer Science, vol. 7536, pp. 71-78, 2013.

ISSN: 2231-5381 http://www.ijettjournal.org