Text Retrieval Improvement Based on Automatically Extracted Keyphrase

Hohyon Ryu
University of Wisconsin – Milwaukee, School of Information Studies, Bolton Hall 5th Floor, 3210 N Maryland Ave, Milwaukee, WI 53211. Email: hohyon@gmail.com

Introduction

As Web 2.0 technologies such as free-tagging and folksonomy become popular, many researchers have paid attention to adopting these newly emerged information technologies to enhance text retrieval. Generally, tags reflect the content of a text. Thus, it is assumed that using tags as a feature for retrieval can improve retrieval performance. However, as Hotho et al. (2006) and Passant (2007) remarked, free-tagging has certain limits when used in text retrieval. Since free-tagging is done by users without any control, tags can be ambiguous and/or irrelevant to the original text.

Instead of using tags assigned arbitrarily by users, the present study is concerned with using automatically extracted keywords or keyphrases in text retrieval. In this study, a keyword is defined as a single word that represents the content of a given text, and a keyphrase is a phrase that describes the content well. Since keywords and keyphrases represent the main ideas and items of a given text, giving more weight to keyword or keyphrase terms is expected to improve retrieval performance.

This study is based on the Korean academic environment, where index terms are compound words or phrases in more than 70% of cases (Lee et al., 2003). In Korean, extracted single words have only a limited capability of representing the content. Keywords are first extracted by a neural network algorithm, and keyphrases are then generated by combining the extracted keywords within their context. To fully utilize the generated keywords and keyphrases, they are tested in text retrieval. They are coded into a document vector with a certain weight and combined with the original document vector. By doing so, the original document vector can be enhanced to represent the context of the original document better and can improve overall retrieval performance.

Literature Review

Previous studies related to neural-network-based keyword extraction, noun phrase generation/extraction, and improving text retrieval performance with contextual features have been reviewed. Neural networks and other machine learning methods have been used in several studies to decide whether a given word should be recognized as a keyword (Medelyan et al., 2008; Jo T. C. et al., 2000). Noun phrase extraction has also been approached in many different ways (Tomokiyo et al., 2003; Yang, 2000; Lee S. S. et al., 2003; Lee C. Y. et al., 1993; Lee H. A. et al., 1997). Since keyphrases are more prevalent than single-noun keywords and a more complicated process is involved in the Korean language, many studies have been done by Korean researchers. Finally, Hotho et al. (2006) and Cho et al. (2005) conducted studies to improve text retrieval performance with keywords or folksonomy.

Experiment Design

As shown in Figure 1, a neural network, implemented with Feed-forward Neural Network for Python (Wojciechowski, 2007), judges whether each word is eligible to be a keyword on the basis of its TF×IDF value and its location in the document.

Figure 1: Outline of the experiment.

Keyphrases are generated by a rule-based algorithm. The keyphrase generation algorithm makes a window of 1 preceding and 3 following words and merges adjacent or overlapping windows. The rule-based algorithm then rules out words that are inadequate for a noun phrase by analyzing the part of speech of each word. The words that appear in the automatically extracted keywords and generated keyphrases are added to the original document vector with a certain weight, so that essential words receive more weight. For each vector, Okapi TF×IDF normalization was applied. As shown in the example below, keyphrases include a certain keyword repeatedly.
Thus, more important keywords get more weight, while less important keywords often get no additional weight at all. An example of the extracted keywords and keyphrases is shown below (translated):

Title: Study on Guidelines for the Construction of a Korean Thesaurus
Keywords: 1986, Korean, basic, Hangeul, definition, standard, 2788, relation, word, ISO, alphabet, thesaurus, rule, term, most
Keyphrases: standard for Hangeul thesaurus construction, Hangeul thesaurus, word thesaurus, ISO standard, aspect of Hangeul thesaurus, Hangeul thesaurus test, Hangeul thesaurus data, ISO, word thesaurus construction standard, Hangeul thesaurus management system

With the weighted vectors, along with the original vectors as a baseline, text retrieval experiments were carried out. The test collection consists of 545 abstracts of academic papers from the library of Yonsei University in ten different academic fields. The results of the retrieval experiments with the keyword-added vector, the keyphrase-added vector, and a vector with both keyword and keyphrase entry words are shown in Figure 2. The results were evaluated in R-precision.

Result

Figure 2: The change of R-precision according to the assigned weight and features

As Figure 2 suggests, text retrieval performance increased by 15%, from an R-precision of 0.64 to 0.74, when the words that appear in the keyphrases and the keywords were both added to the original vector with double weight. For the keyword + keyphrase vector, the margin of improvement became smaller as higher weights were assigned to the additional terms. On the other hand, for the keyword-added vector (the dashed line in Figure 2), the higher the assigned weight, the better the performance. This is because the same weight is assigned to each keyword item, while words in the keyphrases get different weights according to how often they appear in the keyphrase list.
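The difference between the flat keyword boost and the frequency-dependent keyphrase boost can be illustrated with a minimal Python sketch. It is not the implementation used in the experiment: the function names `augment_vector` and `r_precision`, the whitespace tokenization of keyphrases, and the reading of "weight" as a value added once per keyword and once per occurrence of a word in the keyphrase list are all assumptions made for illustration. The Okapi TF×IDF normalization step is omitted.

```python
from collections import Counter


def augment_vector(doc_tf, keywords, keyphrases, weight=2.0):
    """Return a copy of a raw term-frequency vector with extra weight
    for keyword and keyphrase terms (hypothetical weighting scheme)."""
    boosted = Counter(doc_tf)
    # Every keyword receives the same flat boost, regardless of importance.
    for keyword in set(keywords):
        boosted[keyword] += weight
    # A keyphrase word is boosted once per occurrence in the keyphrase list,
    # so words shared by many keyphrases accumulate more weight.
    for phrase in keyphrases:
        for term in phrase.split():
            boosted[term] += weight
    return dict(boosted)


def r_precision(ranked_ids, relevant_ids):
    """Fraction of relevant documents among the top R results,
    where R is the number of relevant documents for the query."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant_ids)
    return hits / r


if __name__ == "__main__":
    # Toy data loosely based on the translated example; counts are invented.
    doc_tf = {"thesaurus": 5, "Hangeul": 3, "rule": 1}
    keywords = ["thesaurus", "Hangeul", "rule"]
    keyphrases = ["Hangeul thesaurus", "word thesaurus", "ISO standard"]
    print(augment_vector(doc_tf, keywords, keyphrases, weight=2.0))
    print(r_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

In this toy input, "thesaurus" occurs in two keyphrases and therefore gains more weight than "rule", which appears only as a keyword. This mirrors the explanation above of why the two vector types respond differently to higher weights.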
Conclusions and Future Research

The present study shows that giving extra weight to words that appear in keywords or keyphrases positively affects text retrieval performance. Since current web retrieval returns a significant number of irrelevant documents for users' requests, modifying search algorithms to be more sensitive to the subject of documents will help to improve retrieval performance. Additionally, the neural network keyword extraction and the rule-based keyphrase generation performed stably. The keyword and keyphrase generation modules will be evaluated in the future so that they can be used as independent software. The retrieval test in the Korean environment showed significant improvement. In the future, further experiments will be conducted in English, with the expectation of a similar improvement in retrieval performance.

REFERENCES

Cho, M., Yun, B., & Rim, H. 1997. A Korean Document Retrieval Model Considering Compound Nouns and Derived Nouns. Proceedings of Korea Information Science Society Spring Conference 24(1). 449-502.
Hotho, A., Jaschke, R., Schmitz, C., & Stumme, G. 2006. Information Retrieval in Folksonomies: Search and Ranking. Lecture Notes in Computer Science. Springer Berlin: Heidelberg.
Jo, T. C., & Seo, J. 2000. Neural Based Approach to Keyword Extraction from Documents. Proceedings of Korea Information Science Society Autumn Conference 27(2). 317-319.
Lee, C. Y., Kang, H., Jang, H., & Park, S. 1993. A design of the Automatic Keyword Maker. Proceedings of the 5th Conference of Hangul and Korean Information Processing. 71-77.
Lee, H. A., Lee, J. H., & Lee, G. 1997. Noun Phrase Indexing using Clausal Segmentation. Journal of Korea Information Science Society (b) 25(3). 301-311.
Lee, S. S., & Lee, T. 2003. Concept-based Compound Keyword Extraction. Journal of Korea Association of Computer Education 6(2).
Medelyan, O., & Witten, I. H. 2008. Domain Independent Automatic Keyphrase Indexing with Small Training Sets. JASIST 59(7). 1026-1040.
Passant, A. 2007. Using Ontologies to Strengthen Folksonomies and Enrich Information Retrieval in Weblogs. International Conference on Web Services.
Tomokiyo, T., & Hurst, M. 2003. A Language Model Approach to Keyphrase Extraction. Proceedings of the ACL Workshop on Multiword Expressions.
Wojciechowski, M. 2007. Feed-forward neural network for python. Technical University of Lodz (Poland), Department of Civil Engineering, Architecture and Environmental Engineering, http://ffnet.sourceforge.net/, ffnet-0.6, March 2007.
Yang, J. 2000. Base Noun Phrase Recognition in Korean using Rule-based Learning. Journal of Korea Information Science Society: Software and Applications 27(10).