International Journal of Engineering Trends and Technology (IJETT) – Volume 6 Number 6- Dec 2013
A Survey of Recent Keywords and Topic Extraction
Systems for Indian Languages
Vishal Gupta
Assistant Professor, UIET, Panjab University,
Sector-25, Chandigarh, India
Abstract— Keywords are the thematic words of a document and
represent its topic. Search engines and document databases
commonly use keywords to locate information and to determine
whether two pieces of text are related to each other. Keyword
extraction, also referred to as term mining, term retrieval,
term recognition or glossary extraction, is a subtask of
information retrieval. Its overall goal is to retrieve the
relevant terms from a corpus automatically. Automatic term
extraction techniques mainly apply linguistic processing
(automatic chunking of phrases and part-of-speech tagging) to
retrieve suitable keywords. The retrieved keywords are very
helpful for modelling a knowledge domain or for supporting the
construction of an ontology in that domain. Topic
identification is the task of identifying concepts or topics
that were previously unknown. Concept identification is the
task of grouping the news stories that express the same idea
about an event. A related task associates news stories with
concepts or topics that are already known. Term extraction is
also helpful for semantic similarity, machine translation and
automatic knowledge and data management. This paper reviews
recent keyword and topic extraction techniques for Indian
languages.
Keywords— Keywords extraction, topic extraction, Indian
languages, term extraction.
I. INTRODUCTION
Keywords are the thematic words of a document and represent
its topic. Search engines and document databases commonly use
keywords to locate information and to determine whether two
pieces of text are related to each other. Keyword extraction,
also referred to as term mining, term retrieval, term
recognition or glossary extraction, is a subtask of
information retrieval. Its overall goal is to retrieve the
relevant terms from a corpus automatically. Automatic term
extraction techniques mainly apply linguistic processing
(automatic chunking of phrases and part-of-speech tagging) to
retrieve suitable keywords. The retrieved keywords are very
helpful for modelling a knowledge domain or for supporting the
construction of an ontology in that domain. Topic
identification [11] is the task of identifying concepts or
topics that were previously unknown. Concept identification is
the task of grouping the news stories that express the same
idea about an event. A related task associates news stories
with concepts or topics that are already known. Term
extraction is also helpful for semantic similarity, machine
translation and automatic knowledge and data management.
Concept recognition [15] deals with predicting the words of a
text that can represent its concept or topic. In recent years,
this task has been studied by computational linguists working
in areas such as anaphora resolution, coreference and
discourse. Predicting the relevant terms and concepts of a
document is also critical in information extraction for
finding important documents, although such terms do not
necessarily correspond to the topic or theme. Predicting
relevant words involves assigning a numerical weight to each
word of the document. Words with higher scores are considered
relevant and important, and these terms can be taken to
represent the whole document.
This paper reviews recent keyword and topic extraction
techniques for Indian languages.
II. KEYWORDS AND TOPIC EXTRACTION TECHNIQUES FOR INDIAN LANGUAGES
Preeti and Brahmaleen Kaur Sidhu (2013) [1] proposed a
Punjabi keyphrase extraction system that takes Punjabi text in
Unicode format as input. The text is first scanned to filter
out special tokens such as \\, ||, (, ), [, ], *, {, }, !, ^,
+ and -; punctuation marks, brackets and numbers are replaced
by blank space. A word segmentation phase then splits the
input document into individual terms so that each term is
represented as a separate token. The output of word
segmentation is passed to a part-of-speech (POS) tagger, which
assigns each word a word class such as noun, pronoun or
adjective. After POS tagging, the tags are stored in a
database. The system then identifies candidate phrases from
the database using a subject-object-verb rule, and the
resulting list of candidate phrases is the input to the final
key phrase extraction step. The frequency of every phrase is
calculated, and the most frequently occurring phrases are
selected as the Punjabi key phrases. On average, the system
extracts 11, 9.6 and 6.6 key phrases from Punjabi stories,
articles and news documents respectively.
Gupta and Lehal (2011) [2] proposed an automatic keyword
extraction system for Punjabi. It works in several steps:
removing Punjabi stop words; identifying nouns in the Punjabi
text and stemming them with an automatic rule-based noun
stemmer; computing term frequency (TF) and inverse sentence
frequency (ISF); selecting the nouns with high TF-ISF values;
and applying a title/headline feature that marks the terms
occurring in the title or news headline of the document.
Punjabi nouns with high TF-ISF values are treated as key
terms. Finally, the keywords are obtained as the union of the
key terms found in the title and the noun key terms with high
TF-ISF from the earlier step. The F-measure, recall and
precision of this system are 85.2%, 90.6% and 80.4%
respectively.
Kaur and Gupta (2011) [3] proposed another Punjabi keyword
extraction system. It uses a hybrid approach that combines
several features in a single extraction system. The system
relies on gazetteer lists generated from a Punjabi dictionary
with the help of a part-of-speech tagger. Key terms are drawn
from cue terms, title terms and frequently occurring nouns.
Each feature gives good results on its own, and combining the
features improves the final output further. The reported
F-measure, recall and precision are 93.03, 90.19 and 98.28
respectively.
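The TF-ISF selection described above for the system of Gupta and Lehal [2] can be illustrated with a short Python sketch. It is not the authors' implementation: the formula TF * log(N / SF), the toy English tokens, the parameter top_k and the function names are assumptions made for illustration, and stop-word removal, noun detection and stemming are assumed to have been applied already. The last step takes the union of the title terms and the top-scoring terms.

```python
import math
from collections import Counter


def tf_isf_scores(sentences):
    """Score every term by TF * log(N / SF) over tokenized sentences."""
    n_sentences = len(sentences)
    # TF: occurrences of the term in the whole document.
    tf = Counter(tok for sent in sentences for tok in sent)
    # SF: number of sentences containing the term at least once.
    sf = Counter(tok for sent in sentences for tok in set(sent))
    return {t: tf[t] * math.log(n_sentences / sf[t]) for t in tf}


def extract_keywords(sentences, title, top_k=3):
    """Union of the title terms and the top_k terms ranked by TF-ISF."""
    scores = tf_isf_scores(sentences)
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(title) | set(ranked[:top_k])


# Toy input: tokens are assumed to be stop-word-free, stemmed nouns.
sentences = [
    ["rain", "crop", "farmer"],
    ["rain", "flood"],
    ["rain", "crop", "loan"],
    ["rain", "crop", "crop"],
]
print(extract_keywords(sentences, title=["rain"], top_k=2))
```

In this toy input the term "rain" occurs in every sentence, so its ISF factor is zero and it is kept only because it appears in the title.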
Sarkar (2011) [4] proposed a keyphrase extraction technique
for the Bengali language. The system comprises several phases:
extracting n-grams, identifying candidate key terms and
assigning scores to these candidates. Because Bengali is
highly inflectional, a lightweight Bengali stemmer was
developed to stem the candidate terms. The system was tested
on a set of Bengali documents taken from the online Bengali
corpus that can be downloaded from the TDIL website.
Saraswathi et al. (2010) [5] proposed a keyword extraction
approach for Tamil and English as part of a bilingual
information retrieval system. The aim is to return results in
the same language as the query typed by the user. The authors
built an ontological tree for a particular domain in which
every node holds entries in both languages. Part-of-speech
tagging is used to find the key terms in the input question:
the question typed by the user is passed to the tagger, which
labels each word with its part of speech. The nouns and verbs
in the tagger output are treated as key terms for the search.
Depending on the topic, these key terms are translated into
the target language using the ontological tree, and the
documents are then retrieved on the basis of the key terms.
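The query-side steps described above for Saraswathi et al. [5] can be sketched as follows. This is a minimal illustration, not the authors' system: the query is tagged by hand instead of by a real POS tagger, the tag names NOUN and VERB are assumed, and the dictionary ONTOLOGY is a hypothetical flat stand-in for the bilingual ontological tree, with placeholder strings instead of real Tamil words.

```python
# Hypothetical stand-in for the bilingual ontological tree: each topic
# maps English entries to target-language entries (placeholders here).
ONTOLOGY = {
    "travel": {
        "train": "<ta:train>",
        "ticket": "<ta:ticket>",
        "book": "<ta:book>",
    },
}

KEY_TAGS = {"NOUN", "VERB"}  # assumed tag names


def key_terms(tagged_query):
    """Keep only the nouns and verbs of a POS-tagged query."""
    return [word for word, tag in tagged_query if tag in KEY_TAGS]


def translate_terms(terms, topic, ontology=ONTOLOGY):
    """Map key terms to the target language through the topic's entries.

    Terms that the ontology does not cover are left unchanged.
    """
    entries = ontology.get(topic, {})
    return [entries.get(term, term) for term in terms]


# A hand-tagged query; a real system would obtain the tags from a tagger.
query = [("how", "ADV"), ("to", "PART"), ("book", "VERB"),
         ("a", "DET"), ("train", "NOUN"), ("ticket", "NOUN")]
terms = key_terms(query)
print(terms)
print(translate_terms(terms, "travel"))
```

The translated terms would then be handed to the search component; that step is omitted because the survey gives no details about it.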
Jayashree R. et al. (2011) [6] proposed keyword extraction
for Kannada document summarization. The system extracts key
terms from pre-categorized Kannada documents collected from
various online resources. It combines GSS (Galavotti,
Sebastiani, Simi) coefficients with inverse document frequency
and term frequency, and then uses the extracted key terms for
summarization.
Sarkar (2011) [7] proposed automatic keyphrase extraction
from Bengali documents. Keyphrases are a set of phrases that
highlight the main concepts of a document; they help users to
quickly grasp, manage, share and access the information in it.
The preliminary approach extracts keyphrases from Bengali
documents using two features: term frequency-inverse document
frequency (TF-IDF) and the position of the first occurrence of
the term in the input text. The prototype works in three
steps: it extracts n-grams from the input document, identifies
candidate keyphrases and finally scores the candidates to
select the keyphrases. It was tested on a large number of
Bengali documents taken from an online Bengali corpus.
Balabantaray et al. (2012) [8] proposed an Odia text
summarization system based on key term extraction. The system
accepts a plain-text (.txt) file as input. It first tokenizes
the input text into individual words, then filters out stop
words and stems every remaining term with an Odia stemmer.
Each word is then given a weight equal to the ratio of its
frequency to the total number of words in the document. Next,
sentences are scored according to the weights of their words:
the final weight of a sentence is the sum of the weights of
its words divided by the number of words in that sentence.
Das and Bandyopadhyay (2010) [9] developed a keyword-based
Bengali opinion summarization system that identifies the
sentiment information in each document, aggregates it and
presents it as a summary. It uses a topic-sentiment model for
sentiment identification and aggregation; the model is
designed as discourse-level concept detection. Topic-sentiment
aggregation is achieved by clustering the concepts with
k-means and by representing each document as a relational
graph. Finally, this document-level graph is used to select
the summary sentences with a PageRank algorithm of the kind
used in information retrieval. The system achieved an
F-measure, recall and precision of 69.65%, 67.32% and 72.15%
respectively.
Das and Bandyopadhyay (2010) [10] proposed an approach for
identifying topic keywords in annotated Bengali blog
sentences. They built an unsupervised syntactic system based
on the argument structure of each sentence with respect to its
verb. If the argument structure acquired from a Bengali blog
sentence for a verb matches any of the syntactic frames
extracted from VerbNet for the equivalent English verb, the
holder and topic roles associated with the English VerbNet
frame are mapped to the corresponding terms in the Bengali
sentence. Simple rule-based techniques are used to correct
errors and improve the performance of the syntactic system.
The approach outperforms the baseline system, with F-measures
of 66.03% and 61.98% against the baseline's 53.85% and 50.02%
in the case of multiple and single holders of emotions and
topics respectively, on 500 reference test sentences.
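The word and sentence weighting described above for the Odia summarizer of Balabantaray et al. [8] can be written out in a few lines of Python. The sketch assumes that tokenization, stop-word removal and stemming have already been applied. The function names, the toy English tokens and the final step of keeping the n best sentences are illustrative assumptions, not details taken from the original system.

```python
from collections import Counter


def word_weights(sentences):
    """Weight of a word = its frequency / total number of words."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    return {word: freq / total for word, freq in counts.items()}


def sentence_scores(sentences):
    """Score of a sentence = sum of its word weights / its length."""
    weights = word_weights(sentences)
    return [sum(weights[tok] for tok in sent) / len(sent) if sent else 0.0
            for sent in sentences]


def summarize(sentences, n=1):
    """Return the n highest-scoring sentences in their original order."""
    scores = sentence_scores(sentences)
    best = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n]
    return [sentences[i] for i in sorted(best)]


# Toy document: three pre-processed sentences.
doc = [
    ["river", "flood", "village"],
    ["flood", "relief", "camp"],
    ["flood", "river"],
]
print(sentence_scores(doc))
print(summarize(doc, n=1))
```

Dividing by the sentence length keeps long sentences from winning simply because they contain more words.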
Das and Ching (2005) [12] proposed a speaker-dependent
Bengali keyword spotter for unconstrained English speech. They
implemented two approaches, both of which use whole-word HMMs
for the keywords; the Bengali keywords were trained as
isolated words. The first approach uses whole-word filler
models. The second approach models the filler part with a
trained English phoneme model together with a grammar built as
a network of all phones. The whole-word approach achieved an
optimal performance of 94.22%, and the second approach
achieved a hit rate of 95.83%.
J. Allan et al. (2003) [13] described their effort to develop
a topic detection and tracking approach for Hindi news
stories. The University of Massachusetts reported results for
three topic detection and tracking tasks in DARPA's surprise
language evaluation. The approach is based on the vector space
model of information retrieval. The paper also describes the
process used to generate the relevance judgements with which
the system was evaluated. The results show that the
effectiveness of topic tracking is comparable to that achieved
for other languages. The results for clustering and new event
detection indicate that the parameter settings currently used
for those tasks are sensitive to language.
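Topic tracking in a vector space model, as in the approach of Allan et al. described above, amounts to comparing the vector of an incoming story with a vector built from stories already known to be on topic. The sketch below uses raw term counts, a summed topic vector and a cosine threshold of 0.3. These choices and all names are assumptions for illustration; the survey does not report the settings of the original system.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_vector(stories):
    """Sum the term counts of the on-topic training stories.

    Cosine similarity ignores vector length, so the sum behaves like
    the mean (centroid) of the training stories.
    """
    total = Counter()
    for tokens in stories:
        total.update(tokens)
    return total


def on_topic(story, topic, threshold=0.3):
    """Decide whether a tokenized story tracks the topic."""
    return cosine(Counter(story), topic) >= threshold


training = [["flood", "river", "rescue"], ["flood", "relief", "camp"]]
topic = topic_vector(training)
print(on_topic(["flood", "rescue", "boat"], topic))
print(on_topic(["cricket", "match", "score"], topic))
```

The threshold is exactly the kind of parameter that the results above describe as language sensitive, so in practice it would be tuned on held-out stories for each language.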
Kaur and Gupta (2011) [14] proposed topic tracking for the
Punjabi language. Two approaches were implemented and tested:
one based on named entity recognition (NER) and one based on
keyword extraction. The system determines whether two Punjabi
news articles discuss the same topic. Features are extracted
from the text using the two approaches, and the NER and
keyword features of the initial news document are compared
with the corresponding features of the target news document to
compute the percentage of match, that is, the degree to which
both documents track the same topic. The system was
implemented on the VB.NET platform, and the gazetteer lists
were stored as tables in a database. It takes as input the
news articles that are to be compared. The input documents
were collected from Punjabi websites such as likhari.org,
jagbani.com, ajitweekly.com, punjabispectrum.com,
europevichpunjabi.com, quamiekta.com, sahitkar.com,
onlineindian.com, europesamachar.com and parvasi.com. Four
experiments were carried out. The first tested the NER module
and the second tested the keyword extraction module. The third
evaluated the topic tracking system using the NER technique
alone and the keyword extraction technique alone. After that,
topic tracking was run with both techniques combined. The last
experiment analysed a number of similarity measures to find
which one gives the best results for topic tracking.
Dutta et al. (2005) [16] discussed a hybrid information
extraction model, based on geographical keyword
identification, for extracting geographic information from
unrestricted Hindi text. The relationship between the
extracted geographic entities and the surrounding text is
depicted graphically in order to link related information to
those entities. The technique combines linguistic and
statistical methods and recognizes both single-word and
multi-word geographical names. It was applied to Hindi text
but can be adapted easily to other languages. The authors
conducted experiments to measure the accuracy of the
technique.
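The comparison step of the Punjabi topic tracking system of Kaur and Gupta [14], described above, can be illustrated as an overlap between feature sets. The survey does not say which similarity measure performed best, so the Jaccard coefficient, the equal weighting of the two feature types and the names below are assumptions made for this sketch, not the authors' method.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections, as a percentage."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


def topic_match(doc1, doc2, ner_weight=0.5):
    """Combine named-entity and keyword overlap into one percentage.

    Each document is a dict with 'entities' and 'keywords' collections,
    assumed to come from separate NER and keyword extraction modules.
    """
    ner = jaccard(doc1["entities"], doc2["entities"])
    kw = jaccard(doc1["keywords"], doc2["keywords"])
    return ner_weight * ner + (1.0 - ner_weight) * kw


first = {"entities": ["Amritsar", "Punjab"],
         "keywords": ["flood", "relief", "river"]}
second = {"entities": ["Punjab", "Ludhiana"],
          "keywords": ["flood", "relief", "camp"]}
print(topic_match(first, second))
```

Setting ner_weight to 1.0 or 0.0 gives the NER-only and keyword-only configurations, which mirrors the way the third experiment evaluates each technique alone before the two are combined.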
Kothwal and Varma (2013) [17] proposed cross-lingual text
reuse detection based on keyphrase extraction. The work
addresses the FIRE CLITR 2011 task of detecting Hindi
documents that reuse text from English source documents. Three
approaches based on classification and keyphrase extraction
techniques were proposed, and the best approach attained an
F-measure of 0.792.
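The F-measure values quoted throughout this section combine precision and recall. Assuming the balanced form (the harmonic mean, often written F1), the short check below reproduces two of the figures reported above from their precision and recall: the Punjabi keyword extractor of Gupta and Lehal [2] and the Bengali opinion summarizer of Das and Bandyopadhyay [9]. The function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(80.4, 90.6), 1))    # 85.2, as reported for [2]
print(round(f_measure(72.15, 67.32), 2))  # 69.65, as reported for [9]
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a system cannot reach a high F-measure by maximizing precision or recall alone.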
III. CONCLUSIONS
This paper presents a survey of recent keyword and topic
extraction techniques for Indian languages. The survey shows
that very few linguistic resources are available for Indian
languages, and a lot of research and development is needed to
build them. Keyword extraction and topic extraction systems
for Indian languages are at an early stage of research.
Although a sufficient amount of linguistic resources is
available for Hindi, such resources are still lacking for
other Indian languages.
REFERENCES
[1] Preeti and B. K. Sidhu, "Keyphrase Extraction From Punjabi
Corpus", International Journal of Engineering Research and
Application, vol. 3, pp. 491-494, 2013.
[2] V. Gupta and G. S. Lehal, "Automatic Keywords Extraction for
Punjabi Language", International Journal of Computer Science
Issues, vol. 8, pp. 327-331, 2011.
[3] K. Kaur and V. Gupta, "Keyword Extraction for Punjabi
Language", Indian Journal of Computer Science and Engineering,
vol. 2, pp. 364-370, 2011.
[4] K. Sarkar, "An N-Gram Based Method for Bengali Keyphrase
Extraction", In Proceedings of International Conference
ICISIL-2011, Springer, Patiala, India, pp. 36-41, 2011.
[5] S. Saraswathi, M. A. Siddhiqaa, K. Kalaimagal and M.
Kalaiyarasi, "BiLingual Information Retrieval System for
English and Tamil", Journal of Computing, vol. 2, pp. 85-89,
2010.
[6] Jayashree R., Srikanta Murthy K. and Sunny K., "Document
Summarization in Kannada using Keyword Extraction", In
Proceedings of AIAA 2011, CS & IT 03, pp. 121-127, 2011.
[7] K. Sarkar, "Automatic Key phrase Extraction from Bengali
Documents: A Preliminary Study", In Proceedings of IEEE Second
International Conference on EAIT'11, pp. 125-128, 2011.
[8] R. C. Balabantaray, B. Sahoo, D. K. Sahoo and M. Swain,
"Odia Text Summarization using Stemmer", International Journal
of Applied Information System, vol. 1, pp. 21-24, 2012.
[9] A. Das and S. Bandyopadhyay, "Topic-Based Bengali Opinion
Summarization", Coling 2010: Poster Volume, pp. 232-240,
Beijing, 2010.
[10] D. Das and S. Bandyopadhyay, "Identifying Emotion Holder
and Topic from Bengali Emotional Sentences", Proceedings of
ICON-2010: 8th International Conference on Natural Language
Processing, India, 2010.
[11] http://www.itl.nist.gov/iaui/894.01/tdt98/doc/tdtslides/sld001.htm,
1998.
[12] S. Das and P. C. Ching, "Speaker Dependent Bengali Keyword
spotting in unconstrained English Speech", A Project report,
Indian Institute of Technology Guwahati, India, 2005.
[13] J. Allan, V. Lavrenko and M. E. Connell, "A month to topic
detection and tracking in Hindi", ACM Transactions on Asian
Language Information Processing (TALIP), vol. 2, pp. 85-100,
2003.
[14] K. Kaur and V. Gupta, "Topic Tracking for Punjabi
Language", Computer Science & Engineering: An International
Journal (CSEIJ), vol. 1, pp. 37-49, 2011.
[15] T. Nomoto and Y. Matsumoto, "Exploring the text structure
for Topic Identification", In Proceedings of the 4th Workshop
on Very Large Corpora, pp. 101-112, 1996.
[16] R. Kothwal and V. Varma, "Cross Lingual Text Reuse
Detection Based on Keyphrase Extraction and Similarity
Measures", Springer's Multilingual Information Access in South
Asian Languages, Lecture Notes in Computer Science, vol. 7536,
pp. 71-78, 2013.
ISSN: 2231-5381
http://www.ijettjournal.org