Text Retrieval Improvement Based on Automatically Extracted
Keyphrase
Hohyon Ryu
University of Wisconsin – Milwaukee. School of Information Studies, Bolton Hall 5th Floor 3210
N Maryland Ave Milwaukee, WI 53211, Email: hohyon@gmail.com
Introduction
As Web 2.0 technologies such as free tagging and
folksonomies have become popular, many researchers have paid
attention to adopting these newly emerged information
technologies to enhance text retrieval. Generally, tags reflect
the content of a text. Thus, it is assumed that using tags as a
feature for retrieval can improve retrieval performance.
However, as Hotho (2006) and Passant (2007) remarked,
free tagging has certain limits when used in text retrieval.
Since free tagging is done by users without any control,
tags can be ambiguous and/or irrelevant to the original text.
Instead of relying on tags that users assign without control,
the present study is concerned with using automatically
extracted keywords or keyphrases in text retrieval. In this
study, a keyword is defined as a single word that represents
the content of a given text, and a keyphrase is a phrase that
describes the content well. Since keywords and keyphrases
represent the main ideas and items of a given text, giving
more weight to keyphrase or keyword terms should
improve retrieval performance.
This study is based on the Korean academic environment,
where index terms are compound words or phrases in more
than 70% of the cases (Lee et al., 2003). In Korean,
extracted single words have only a limited capability of
representing the content. Keywords are extracted first by a
neural network algorithm, and keyphrases are then generated
by combining the extracted keywords within the context.
To fully utilize the generated keywords and keyphrases,
they are tested in text retrieval experiments. They are coded
into a document vector with a certain weight and combined
with the original document vector. In this way, the original
document vector can be enhanced to better represent the
context of the original document, which can improve overall
retrieval performance.
Literature Review
Previous studies related to keyword extraction based on
neural networks, noun phrase generation/extraction, and
improving text retrieval performance with contextual features
have been reviewed.
Neural networks or other machine learning methods have
been utilized in several studies to decide whether a given word
should be recognized as a keyword (Medelyan et al., 2008;
Jo T. C. et al., 2000). Noun phrase extraction has also been
approached in many different ways (Tomokiyo et al., 2003;
Yang, 2000; Lee S. S. et al., 2003; Lee C. Y. et al., 1993;
Lee H. A. et al., 1997). Since keyphrases are more
prevalent than single-noun keywords in Korean and
extracting them involves a more complicated process, many
studies have been done by Korean researchers. Finally,
Hotho (2006) and Cho et al. (2005) conducted studies to
improve text retrieval performance with keywords or
folksonomy.
Experiment Design
As shown in Figure 1, a neural network, implemented with
Feed-forward Neural Network for Python (Wojciechowski,
2007), judges whether each word is eligible to be a keyword
on the basis of its TF×IDF value and its location in the
document.
Figure 1: Outline of the experiment.
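The two inputs to the keyword classifier can be illustrated with a minimal sketch. It only computes the features; the network itself is omitted. The function name keyword_features, the plain logarithmic IDF, and the use of the relative position of a word's first occurrence as its "location" are assumptions made for illustration, not details taken from the experiment.

```python
import math
from collections import Counter


def keyword_features(doc_tokens, corpus):
    """Return {word: (tf_idf, relative_position)} for one document.

    doc_tokens: list of tokens of the document being analysed.
    corpus: list of token lists; it must include doc_tokens itself,
            so that every word has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in corpus for word in set(doc))
    tf = Counter(doc_tokens)

    features = {}
    for index, word in enumerate(doc_tokens):
        if word in features:
            continue  # only the first occurrence defines the location
        idf = math.log(n_docs / df[word])
        relative_position = index / len(doc_tokens)
        features[word] = (tf[word] * idf, relative_position)
    return features


corpus = [
    ["thesaurus", "standard", "thesaurus", "rule"],
    ["retrieval", "standard"],
    ["retrieval", "rule", "weight"],
]
feats = keyword_features(corpus[0], corpus)
```

Each (tf_idf, relative_position) pair could then serve as the input vector for a binary keyword / non-keyword classifier.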
Keyphrases are generated by a rule-based algorithm.
The keyphrase generation algorithm makes a window of
1 preceding and 3 following words around each extracted
keyword and merges adjacent or overlapping windows. It then
rules out the words that are inadequate for a noun phrase by
analyzing the part of speech of each word. The words that
appear in the automatically extracted keywords and generated
keyphrases are added to the original document vector
with a certain weight, giving more weight to essential
words. For each vector, Okapi TF×IDF normalization was
applied. As the example below shows, keyphrases
include certain keywords repeatedly. Thus, more important
keywords get more weight, while less important keywords
often get no additional weight at all. An example of the
extracted keywords and keyphrases is shown below
(translated):
Title: Study on Guidelines for the Construction of a
Korean Thesaurus
Keywords: 1986, Korean, basic, Hangeul, definition,
standard, 2788, relation, word, ISO, alphabet, thesaurus,
rule, term, most
Keyphrases: standard for Hangeul thesaurus construction,
Hangeul thesaurus, word thesaurus, ISO standard, aspect
of Hangeul thesaurus, Hangeul thesaurus test, Hangeul
thesaurus data, ISO, word thesaurus construction standard,
Hangeul thesaurus management system
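The window-and-merge step of the keyphrase generation can be sketched as follows. This is an illustration under stated assumptions, not the rule set used in the experiment: the function names, the tag label "NOUN", and the simple rule of keeping only noun-tagged words inside each merged window are all hypothetical, since the actual part-of-speech rules are not listed here.

```python
def keyphrase_spans(tokens, keywords, before=1, after=3):
    """Return merged (start, end) token spans, with end exclusive.

    A window of `before` preceding and `after` following tokens is
    opened around every keyword occurrence; windows that overlap or
    touch each other are merged into one span.
    """
    windows = []
    for i, token in enumerate(tokens):
        if token in keywords:
            start = max(0, i - before)
            end = min(len(tokens), i + after + 1)
            windows.append((start, end))

    merged = []
    for start, end in windows:  # windows are already ordered by start
        if merged and start <= merged[-1][1]:  # overlapping or adjacent
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def candidate_phrases(tagged_tokens, keywords, allowed_tags=frozenset({"NOUN"})):
    """Build candidate keyphrases from (word, pos_tag) pairs.

    Assumed rule: inside each merged span, words whose tag is not in
    `allowed_tags` are dropped and the remaining words are joined.
    """
    tokens = [word for word, _ in tagged_tokens]
    phrases = []
    for start, end in keyphrase_spans(tokens, keywords):
        kept = [w for w, tag in tagged_tokens[start:end] if tag in allowed_tags]
        if kept:
            phrases.append(" ".join(kept))
    return phrases


tagged = [("the", "DET"), ("Hangeul", "NOUN"), ("thesaurus", "NOUN"),
          ("is", "VERB"), ("useful", "ADJ")]
phrases = candidate_phrases(tagged, {"thesaurus"})
```

A real implementation for Korean would need a morphological analyzer to supply the part-of-speech tags, and stricter rules so that nouns separated by dropped words are not joined into one phrase.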
Text retrieval experiments were carried out with the
weighted vectors, using the original vectors as a baseline.
The test collection consists of 545 abstracts of academic
papers in ten different academic fields from the library of
Yonsei University. The results of the retrieval experiments
with the keyword-added vector, the keyphrase-added
vector, and a vector with both keyword and keyphrase
entry words are shown in Figure 2. The results were
evaluated by R-precision.
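For reference, R-precision is the precision measured at rank R, where R is the number of documents relevant to the query. A minimal sketch, with a hypothetical function name and document identifiers:

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision within the top R results, where R = number of relevant documents."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        return 0.0  # no relevant documents: the score is undefined, report 0
    top_r = ranked_ids[:r]
    hits = sum(1 for doc_id in top_r if doc_id in relevant)
    return hits / r


# Three relevant documents, two of them ranked within the top three.
score = r_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d3"})
```

The score for a whole test collection would be the mean of this value over all queries.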
Result
Figure 2: The change of R-precision according to the
assigned weight and features
As Figure 2 suggests, text retrieval performance
increased by about 15%, from an R-precision of 0.64 to 0.74,
when the words that appear in the keyphrases and in the
keywords are both added to the original vector with double
weight. For the keyword + keyphrase vector, the margin of
improvement becomes smaller as a higher weight is assigned
to the additional terms.
On the other hand, for the keyword-added vector (the dashed
line in Figure 2), the higher the assigned weight, the
better the performance. This is because the same
weight is assigned to every keyword item, while words in
the keyphrases get different weights according to how often
they appear in the keyphrase list.
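The difference between the two weighting behaviours can be made concrete with a small sketch. It assumes that each occurrence of a term in the keyword or keyphrase list adds a fixed weight to that term in the document vector; the function name augment_vector, the dictionary representation of the vector, and the additive rule are assumptions for illustration, and the Okapi normalization is left out.

```python
from collections import Counter


def augment_vector(doc_tf, keywords=(), keyphrases=(), weight=2.0):
    """Return a copy of doc_tf with extra weight for keyword/keyphrase terms.

    Each keyword counts once. Each keyphrase is split into words, so a
    word that occurs in many keyphrases is counted many times.
    """
    extra = Counter(keywords)
    for phrase in keyphrases:
        extra.update(phrase.split())

    augmented = dict(doc_tf)
    for term, count in extra.items():
        augmented[term] = augmented.get(term, 0.0) + weight * count
    return augmented


doc_tf = {"thesaurus": 1.0, "rule": 1.0}

# Keyword list: every listed term receives the same boost.
kw_vector = augment_vector(doc_tf, keywords=["thesaurus", "rule"])

# Keyphrase list: "thesaurus" occurs twice and is boosted twice.
kp_vector = augment_vector(
    doc_tf, keyphrases=["Hangeul thesaurus", "word thesaurus", "ISO standard"]
)
```

In the translated example from the experiment design, "thesaurus" occurs in eight of the ten keyphrases, so under this assumed scheme it would receive far more additional weight than a term such as "rule", which occurs only in the keyword list.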
Conclusions and Future Research
The present study shows that giving extra weight to
words that appear in keywords or keyphrases affects the
performance of text retrieval positively. Since current web
retrieval returns a significant number of irrelevant
documents for users’ requests, modifying search algorithms
to be more sensitive to the subject of documents will help
to improve retrieval performance. Additionally, the neural
network keyword extraction and the rule-based keyphrase
generation performed stably. The keyword and keyphrase
generation modules will be evaluated in the future so that
they can be used as independent software.
The retrieval test in the Korean environment showed
significant improvement. In the future, further experiments
will be carried out in English, with the expectation of a
similar improvement in retrieval performance.
REFERENCES
Cho M., Yun B., & Rim H. 1997. A Korean Document Retrieval
Model Considering Compound Nouns and Derived Nouns.
Proceedings of Korea Information Science Society Spring
Conference 24(1). 449-502.
Hotho, A., Jaschke, R., Schmitz, C., & Stumme, G. 2006.
Information Retrieval in Folksonomies: Search and Ranking.
Lecture Notes in Computer Science. Springer Berlin:
Heidelberg.
Jo, T. C., & Seo, J. 2000. Neural Based Approach to Keyword
Extraction from Documents. Proceedings of Korea Information
Science Society Autumn Conference 27(2). 317-319.
Lee, C. Y., Kang, H., Jang, H., & Park, S. 1993. A design of the
Automatic Keyword Maker. Proceedings of the 5th Conference
of Hangul and Korean Information Processing. 71-77.
Lee, H. A., Lee, J. H., & Lee, G. 1997. Noun Phrase Indexing
using Clausal Segmentation. Journal of Korea Information
Science Society(b) 25(3). 301-311.
Lee, S. S., & Lee, T. 2003. Concept-based Compound Keyword
Extraction. Journal of Korea Association of Computer
Education 6(2).
Medelyan, O., & Witten, I. H. 2008. Domain Independent
Automatic Keyphrase Indexing with Small Training Sets.
Journal of the American Society for Information Science and
Technology 59(7). 1026-1040.
Passant, A. 2007. Using Ontologies to Strengthen Folksonomies
and Enrich Information Retrieval in Weblogs. International
Conference on Web Services.
Tomokiyo, T., & Hurst, M. 2003. A Language Model Approach
to Keyphrase Extraction. Proceedings of the ACL Workshop on
Multiword Expressions.
Wojciechowski, M. 2007. Feed-forward neural network for
python. Technical University of Lodz (Poland), Department of
Civil Engineering, Architecture and Environmental
Engineering, http://ffnet.sourceforge.net/, ffnet-0.6, March
2007.
Yang J. 2000. Base Noun Phrase Recognition in Korean using
Rule-based Learning. Journal of Korea Information Science
Society: Software and Applications 27(10).