Designing and Developing an Automatic Interactive Keyphrase Extraction System with Unified Modeling Language (UML)

Min Song, College of Information Science & Technology, Drexel University, Philadelphia, PA 19104, (215) 895-2474, 01, Email: min.song@drexel.edu
Il-Yeol Song, Xiaohua Hu, College of Information Science & Technology, Drexel University, Philadelphia, PA 19104, (215) 895-2474, 01, Email: song@drexel.edu, Email: tony.hu@cis.drexel.edu

Abstract
Designing and developing a system that helps users digest and understand the information available to them is a difficult challenge. In this paper, we discuss the design and development of an automatic interactive keyphrase extraction system, called KPSpotter, which is capable of processing various data formats such as XML, HTML, and plain text through the Internet. KPSpotter combines the Information Gain data mining measure with several Natural Language Processing (NLP) techniques, such as Part of Speech (POS) tagging and First Occurrence of Term. To improve extraction accuracy, WordNet is incorporated into KPSpotter. In designing and developing KPSpotter we utilized the Unified Modeling Language (UML). UML modeling helps formalize the preliminary analysis model and supports iterative system design and development. We also conducted experiments to test system performance by comparing keyphrases extracted by KPSpotter and by KEA, a well-known naïve Bayesian-based keyphrase extraction system. The experiments show that KPSpotter outperforms KEA in most test cases.

Introduction
Digesting the information available through the Internet has become a serious issue. There have been rigorous attempts to tackle this issue of information overload in fields such as topic detection, text summarization, and keyphrase extraction. In this paper, we present KPSpotter, an automatic keyphrase extraction system that employs a novel extraction technique.
Our technique combines the Information Gain data mining measure with several Natural Language Processing (NLP) techniques: a Part of Speech (POS) tagger, Term Frequency*Inverse Document Frequency (TF*IDF), and Distance from First Occurrence (DIS). Information Gain is a well-known data mining technique introduced in the ID3 algorithm (Quinlan, 1993). In applying POS techniques to KPSpotter, we combine several POS tagging techniques, namely 1) NLPParser (Charniak, 2000), 2) Link-Grammar (Lafferty et al., 1992), 3) PCKimmo (Antworth, 1993), and 4) Brill’s Tagger (Brill, 1993), to improve POS tagging accuracy. By drawing on the strengths of each POS technique, this combined approach enables us to assign the best POS tag to each lexical token that makes up a candidate phrase.

It is a challenging task to design and implement a keyphrase extraction system that requires such components as a POS library, WordNet, and text processing tools. The main objective of the paper is to discuss our design and implementation of KPSpotter, whose goals are to be 1) flexible in terms of processing various input data formats, 2) accessible through the Internet, and 3) robust in terms of extraction speed. In order to improve accuracy, we incorporate into KPSpotter WordNet’s capability of converting the verb form of a term to its noun form (WordNet 2.0). In our previous study, we found that the keyphrases an author assigns to a research paper often include noun phrases that do not actually appear in the text; instead, the verb form of the noun appears in the text (Song et al., 2003). Incorporating WordNet into KPSpotter improves the accuracy of extracting keyphrases. In addition to the proposed novel extraction technique, KPSpotter is differentiated from other extraction systems in various aspects.
First, from an object-oriented system architecture perspective, KPSpotter is developed to be a flexible keyphrase extraction system that handles various file formats such as HTML, XML, and ASCII, whereas other keyphrase extraction systems such as KEA (Frank et al., 1999) and GenEx (Turney, 2000) require the input data to be in certain formats. In particular, UML is used to design and develop KPSpotter to embrace a variety of data algorithms and NLP techniques. Second, KPSpotter is an interactive keyphrase system capable of extracting keyphrases through a web interface. These strengths make the system portable and flexible in situations where various data formats and system environments coexist, as they do in digital libraries.

The effectiveness of KPSpotter was evaluated by comparing the keyphrases extracted by KPSpotter with the ones that the authors assigned. We then compared KPSpotter with KEA, a well-known naïve Bayesian-based keyphrase extraction system. The preliminary experiments show that KPSpotter outperforms KEA in most instances. To demonstrate the flexibility of KPSpotter in handling various types of input format, we extract keyphrases from Web data such as HTML pages. From the set of candidate keyphrases extracted by KPSpotter, the user can weigh the candidate keyphrases and provide feedback to the system. With this user feedback, KPSpotter adjusts its weighting scheme for extracting keyphrases.

the experiments. Section 5 discusses lessons we learned during the system design and development. Finally, Section 6 concludes the paper.

System Design
In this section, we describe how KPSpotter is architected. In addition, we illustrate the web interface of KPSpotter and explain how to use it. Throughout the development cycle of the system, UML was used to embed object orientation in the system. The UML diagrams we developed include use case, class, and activity diagrams.
As illustrated in Figure 1, KPSpotter comprises the following two stages: 1) building the extraction model and 2) extracting keyphrases. The input of the “building extraction model” stage is training data, and the input of the “extracting keyphrases” stage is test data or production data. Both training and test data are processed by three components: 1) Data Cleaning, 2) Data Tokenizing, and 3) Data Discretizing. In Figure 1, the dotted line represents the processing logic for “building extraction model” whereas the solid line indicates the processing logic for “extracting keyphrases.” Detailed descriptions are provided in the following subsections. These two stages are fully automated. Depending on the configuration parameters, KPSpotter runs in either “building extraction model” mode or “extracting keyphrases” mode. The outcomes of both processes are stored in XML form.

KPSpotter can be applied to several document management areas. First, it can serve as an extraction engine for a full-text document clustering system. A goal of a document clustering system is to cluster the retrieved documents on the fly, and it is practically impossible to cluster full-text documents because of the size of the document-term matrix that the system needs to process. KPSpotter extracts keyphrases from the given full-text documents at indexing time, and the clustering system then takes the keyphrases instead of the full-text documents as input and clusters them. Another useful application area is information visualization. A critical issue addressed in information visualization is labeling (Song, 2000). Many visualization systems use single terms for labeling, and a single term often obscures the meaning of the visual object that the label is intended to represent. KPSpotter can serve as a better labeling engine for an information visualization system by supplying meaningful keyphrases.
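To make the model-building stage described earlier in this section more concrete, the following is a minimal Python sketch of the textbook information gain measure applied to one discretized feature and a binary keyphrase label. It is not KPSpotter's actual code: the function names, the bin labels ("low", "high"), and the toy training rows are hypothetical, and the paper does not state how many bins the Data Discretizing component produces.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy after splitting
    the rows by the value of one discretized feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical training rows: one discretized TF*IDF bin per candidate
# phrase, and whether the author listed that phrase as a keyphrase.
tfidf_bin = ["high", "high", "low", "low"]
is_keyphrase = [True, True, False, False]

gain = information_gain(tfidf_bin, is_keyphrase)
print(f"information gain of the TF*IDF bin: {gain:.3f}")
```

A feature whose bins separate keyphrases from non-keyphrases perfectly has a gain equal to the entropy of the labels, while a feature whose bins carry no information about the label has a gain of zero.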
The remainder of this paper is organized as follows: Section 2 describes the system architecture and development details, and Section 3 explains the details of the data processing and feature selection procedures. Section 4 reports the results of

Figure 1: System architecture of KPSpotter (diagram labels: Training data, Test data, Data Cleaning, Data Tokenizing, Data Discretizing, POS Tagging, TF*IDF, Distance from first occurrence, Stemming, Dropping special characters, Case-folding, DocInfo DB, Token DB, WordNet DB, Model XML DB, Keyphrase XML DB)

Use Case Analysis
An important UML model that provides useful knowledge about the usage of a system is the use case diagram. Use case diagrams document the functionality of a system and the users of the system.

The following five major components are shown in the class diagram: 1) ModelBuilder, 2) DBHandler, 3) POSHandler, 4) ModelManager, and 5) KeyphraseHandler. The ModelBuilder component consists of classes that process various input formats such as HTML and XML. The DBHandler component stores statistics on candidate phrases and input documents. The POSHandler component interfaces with the four POS tagging libraries implemented in KPSpotter. The ModelManager component applies discretization and WordNet’s capability of converting verb forms to noun forms. The KeyphraseHandler component takes care of extracting keyphrases based on the information gain data mining measure.

Figure 2: Use Case Diagram of KPSpotter

Activity Diagram
As illustrated in Figure 2, an actor is shown as an agent who interacts with the system agent. This use case diagram shows that KPSpotter consists largely of three components: 1) train model, 2) extract phrase, and 3) apply information gain measure for extraction. To help understand how the system works, an activity diagram is provided (Figure 4).
Activity diagrams represent the business and operational workflows of a system and show the activities and events that cause an object to be in a particular state (Hofmeister, 1999).

Class Diagram
In this section, we present the structure of KPSpotter at the class diagram level. Class diagrams provide a static representation of the structure of a system. Class diagrams appear at various levels of detail depending on the phase of the lifecycle (Fowler, 2003). Figure 3 depicts a high-level conceptual class diagram of KPSpotter.

Figure 4. Activity diagram of KPSpotter
Figure 3: Class diagram of KPSpotter

As illustrated in Figure 4, depending on the process mode of the system, KPSpotter handles test data or training data. Consequently, it either generates a trained model or extracts keyphrases. Which path KPSpotter takes is determined by configuration settings stored as XML.

Web Interface of KPSpotter
In this section, we describe the web interface of KPSpotter. KPSpotter provides a web interface that the user can access through the Internet (Figure 5). For the process mode, the user can select either “train model” or “extract keyphrases.” In the current state of the system, there are three options for providing input data. The first option is for the input data to be accessible over the HTTP protocol: KPSpotter fetches and processes the data at a URL that the user provides. The second option is for the user to enter the input data directly into the textbox. The last option allows the user to upload the input data.

We measured the performance of KPSpotter by comparing its keyphrases with human-generated keyphrases. Turney (2000) reports that, in his experiment data, an average of about 75% of the human-generated keyphrases appear in the body of the corresponding document. Based on this finding, he argued that an ideal keyphrase extraction algorithm could generate phrases that match up to 75% of the author’s keyphrases.
Taking this result into consideration, optimistically speaking, KPSpotter needs to extract three to four keyphrases that match the list of keyphrases the authors provided. The overall performance of KPSpotter is shown in Table 1 and illustrated further in Figure 7. For the given test documents, KPSpotter extracted more than two “correct keyphrases” on average. By integrating WordNet, we gain a significant accuracy improvement compared to our previous experiments (Song et al., 2003).

Figure 5: Web interface of KPSpotter

Table 1. Overall quality of KPSpotter by accuracy
Keyphrase range | No. of keyphrases extracted | Average matched keyphrases without WordNet | Average matched keyphrases with WordNet
5  | 96  | 0.90327 | 1.30327
10 | 115 | 1.334   | 1.716
15 | 124 | 1.524   | 2.179
20 | 130 | 1.728   | 2.3556

In Figure 7, the first line from the top is the average number of keyphrases that the authors assigned. The second line from the top shows the number of keyphrases that appear in the documents. The third one indicates the average number of correct identifications.

The output of executing KPSpotter, a list of keyphrases, is displayed in the browser in XML form (Figure 6). In Figure 6, for record id 10004, a total of fifteen keyphrases are extracted, and each keyphrase is weighted with the information gain data mining measure.

Figure 6: Sample Keyphrases extracted by KPSpotter

Evaluation
In this section, we report the preliminary experimental results on the performance of KPSpotter.

Figure 7. Overall Performances (x-axis: Number of keyphrases, 0 to 25; y-axis: Number of correct keyphrases, 0 to 6; series: assigned by author, appearing in abstract text, assigned by KPSpotter)

A similar result is reported for KEA (Witten et al., 1999). KEA generates about one to two “correct keyphrases.” As illustrated in Figure 8, KPSpotter outperforms KEA in the first four cases (for the last case, the result from KEA was not available).
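The "average number of matches" used in this comparison can be sketched as follows. This is a hedged Python illustration rather than the evaluation code used in the experiments: the function names and the sample documents are hypothetical, and treating two phrases as a match when they are equal after case-folding and whitespace normalization is an assumption (the matching rule is not spelled out here).

```python
def normalize(phrase):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(phrase.lower().split())


def count_matches(extracted, author_assigned):
    """Number of distinct extracted keyphrases that also appear
    in the author-assigned list."""
    author = {normalize(p) for p in author_assigned}
    system = {normalize(p) for p in extracted}
    return len(system & author)


def average_matches(documents):
    """Mean match count over (extracted, author_assigned) pairs."""
    if not documents:
        return 0.0
    return sum(count_matches(e, a) for e, a in documents) / len(documents)


# Hypothetical test documents: (system output, author keyphrases).
docs = [
    (["relevance feedback", "image retrieval", "Multimedia  Information Retrieval"],
     ["Multimedia Information Retrieval", "Relevance Feedback"]),
    (["text mining", "neural network"],
     ["document summarization", "text mining"]),
]

print(average_matches(docs))
```

Under this definition, each row of Table 1 would correspond to calling `average_matches` on the test set with the extracted list truncated to the given keyphrase range.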
The results from the experiments seem to indicate that KPSpotter produces acceptable performance in terms of the average number of matches.

Figure 8. Performance Comparison (x-axis: Keyphrases extracted, 5 to 25; y-axis: No. of keyphrases matched, 0 to 3; series: KEA, KPSpotter)

In order to demonstrate the extensibility and flexibility of KPSpotter, we extracted keyphrases from publication-related web data available in the IEEE digital library (IEEE digital library). We chose the IEEE digital library because its publication-related Web pages contain not only reasonably sized abstracts, but also keyphrases generated by the authors. We obtained 150 publication-related web pages for training data and 50 web pages for testing data. KPSpotter then parses the HTML pages and stores the author-generated keywords and the abstracts separately. Figure 9 shows the sample keyphrases extracted from the publication web page for the article titled “A Relevance Feedback Architecture for Content-based Multimedia Information Retrieval Systems.” The list of keyphrases given by the authors of the article includes: 1) Multimedia Information Retrieval, 2) Relevance Feedback, and 3) Content-based Image/Video Retrieval. It should be noted that although the size of the training data is small (20 Web sites), KPSpotter is able to match one or two keyphrases out of five keyphrases generated by the authors.

Figure 9. A sample of keyphrases extracted from medical data

Lessons Learned
In this section, we summarize the lessons we learned in developing KPSpotter with object-oriented technologies.

Third party software dependency: We used three POS libraries to identify the word sense. Some major memory leak and performance issues arose because of bugs in the third party library. Since an efficient communication channel with the third party company had not been established, it took a while to fix the problems.
We felt that it is critical to establish a solid communication channel with the third party software developers early in the development phase.

System design with UML: After the requirements gathering was finished, there was not sufficient time to develop a fairly mature set of analysis specifications because of the tight development schedule. However, the core UML diagrams, such as the use case, class, and sequence diagrams, improved the design team members' understanding of the project in a timely manner and also helped to develop quality software (Hofmeister, 1999). Since the requirements were continuously changing, we had to modify the diagrams through a UML tool to reflect the changes of requirement specifications in the design. However, because of the tight development schedule, we bypassed this update process and changed the code directly instead. This caused confusion among the developers when discussing the code changes, especially for developers who joined the project at a later stage. Throughout the development of the system, we realized that it is crucial to update the UML diagrams and reflect the changes of requirement specifications in the design before changing the code.

Handling special characters in XML entity: Since our XML-formatted web database contains data written in English as well as data in other languages, we had to cope with special characters such as ë or ä. The XML parser we used abruptly terminated its execution when it processed those special characters. To work around this problem, we took an ad hoc approach, replacing those characters in the raw data with the corresponding encoded characters. This is a well-known issue with XML parsers in handling some foreign characters in XML entities. For internationalization, the handling of special characters needs to be addressed in the XML parser enhancement.
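The kind of workaround described above can be illustrated with a short Python sketch that replaces every non-ASCII character in the raw data with an XML numeric character reference before parsing. This is an assumption about what "corresponding encoded characters" means and is not the code used in KPSpotter, which was written in C++; the function name `escape_non_ascii` and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def escape_non_ascii(raw_text):
    """Replace each non-ASCII character with an XML numeric character
    reference (for example, U+00EB becomes &#235;), leaving ASCII as is."""
    return raw_text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical record containing characters like those mentioned above.
raw = "<record><author>Schütze</author><title>Noël</title></record>"
safe = escape_non_ascii(raw)

# The escaped text is pure ASCII, and a conforming parser restores
# the original characters when it resolves the references.
root = ET.fromstring(safe)
print(safe)
print(root.find("author").text, root.find("title").text)
```

Because numeric character references are part of XML itself, the escaped data stays well-formed regardless of the encoding the parser assumes, which is why this approach can sidestep parser failures on non-ASCII input.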
Lack of communication among the development team: We realized that an effective communication channel between development team members must be established. Several developers simultaneously wrote different pieces of the C++ classes based on common class libraries (STL). As a result, we experienced some inconsistency and redundancy in the code. In addition, the project suffered from ineffective communication among the internal clients, project managers, and developers.

Conclusion
In this paper, a flexible automatic keyphrase extraction system, called KPSpotter, is proposed. KPSpotter employs a new technique combining the Information Gain data mining measure and several Natural Language Processing techniques such as stemming and case-folding. The three features identified by KPSpotter for candidate keyphrases are 1) TF*IDF, 2) distance from the first occurrence of the phrase, and 3) POS tagging. KPSpotter was designed and developed in the spirit of object-oriented design and analysis. In particular, in order to help understand the system architecture in an effective way, UML notations and diagrams were employed. We also reported the lessons learned from designing and developing KPSpotter with UML.

KPSpotter is characterized and differentiated from other keyphrase extraction systems by the following: 1) it introduces an extraction technique combining Information Gain and Natural Language Processing techniques; 2) it provides a web interface for the user to obtain a list of keyphrases for the supplied input data; 3) it processes various types of input data such as XML, HTML, and unstructured text data and generates XML output; 4) it stores statistical information on candidate phrases in BerkeleyDB, a persistent object storage device; 5) it also stores both the model and the list of keyphrases for the target document in an XML file; and 6) it incorporates WordNet to improve extraction accuracy.
These features of KPSpotter make it suitable for real-world applications where robustness, flexibility, and speed are important.

To evaluate the performance of the system, we conducted a series of experiments and reported the experimental results. KPSpotter outperforms KEA when extracting 5 to 15 keyphrases and demonstrates extraction quality equivalent to KEA when extracting 20 keyphrases, in terms of the number of matches between system-generated and human-generated keyphrases. On average, more than two of the 25 keyphrases extracted by KPSpotter are correct. In addition, the results of extracting keyphrases from publication-related web sites in the IEEE digital library indicated that KPSpotter is capable of extracting meaningful sets of keyphrases. These findings are encouraging because KPSpotter is able to extract one or two matched keyphrases even though the training data was small (50 web sites).

KPSpotter can serve as an extraction engine in several document management areas: 1) a full-text document clustering system, which benefits from KPSpotter by clustering documents with keyphrases, and 2) an information visualization system, which utilizes KPSpotter for generating meaningful labels for visual objects.

We are conducting experiments that compare the performance of different data mining techniques, such as Information Gain, Support Vector Machine (SVM), and K-Nearest Neighbor, in predicting keyphrases. These mining algorithms have been successfully applied to document classification tasks. Regarding the experimental methodology, we are also undertaking a study on robust and sophisticated accuracy measures for the usability of the system. The emphasis of the follow-up study is on measuring the usefulness of keyphrases to the users of digital libraries.

REFERENCES
Antworth, Evan L. (1993) Glossing text with the PC-KIMMO morphological parser. Computers and the Humanities. pp. 475-484.
Brill, Eric (1993) Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. In: Proceedings of ACL, pp. 259-265.
Caropreso, F.M., Matwin, S. and Sebastiani, F. (2001) A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In: Amita G. Chin (ed.), Text Databases and Document Management: Theory and Practice, Idea Group Publishing, Hershey, US, pp. 78-102.
Charniak, E. (2000) A Maximum-Entropy-Inspired Parser. In: Proceedings of NAACL-2000.
Dougherty, J., Kohavi, R. and Sahami, M. (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of ICML-95, 12th International Conference on Machine Learning, Lake Tahoe, US, pp. 194-202.
Fowler, M. (2003) UML Distilled: A Brief Guide to the Standard Object Modeling Language. Addison-Wesley.
Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C. and Nevill-Manning, C.G. (1999) Domain-specific keyphrase extraction. In: Proc. Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Francisco, CA, pp. 668-673.
Hofmeister, C., Nord, R.L. and Soni, D. (1999) Describing Software Architecture with UML. In: 1st Working IFIP Conference on Software Architecture (WICSA1), Feb 22-24, pp. 145-159.
Lafferty, J., Sleator, D. and Temperley, D. (1992) Grammatical Trigrams: A Probabilistic Model of Link Grammar. In: Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October.
Manning, C. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May.
Porter, M.F. (1980) An algorithm for suffix stripping. Program, 14(3), pp. 130-137.
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning. San Mateo: Morgan Kaufmann Publishers.
Song, M., Song, I.Y. and Hu, T. (2003) KPSpotter: A Flexible Information Gain-based Keyphrase Extraction System. Fifth International Workshop on Web Information and Data Management (WIDM'03), in conjunction with the 12th International Conference on Information and Knowledge Management (CIKM 2003), November 7-8, 2003.
Radev, D.R., Qi, H., Zheng, Z., Blair-Goldensohn, S., Zhang, Z., Fan, W. and Prager, J. (2001) Mining the web for answers to natural language questions. In: ACM CIKM 2001: Tenth International Conference on Information and Knowledge Management, Atlanta, GA.
Turney, P.D. (2000) Learning algorithms for keyphrase extraction. Information Retrieval, 2, pp. 303-336.
Song, M. (2000) Visualization in information retrieval: a three-level analysis. Journal of Information Science, 26(1), pp. 3-19.
Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G. (1999) KEA: Practical automatic keyphrase extraction. In: Proc. DL '99, pp. 254-256.
BerkeleyDB, http://www.sleepycat.com.
IEEE Digital Library, http://www.ieee.org/products/onlinepubs/iel/iel.html.