Designing and Developing an Automatic Interactive Keyphrase Extraction
System with Unified Modeling Language (UML)
Min Song,
College of Information Science & Technology, Drexel University, Philadelphia, PA 19104
(215) 895-2474, 01
Email: min.song@drexel.edu
Il-Yeol Song, Xiaohua Hu
College of Information Science & Technology, Drexel University, Philadelphia, PA 19104
(215) 895-2474, 01
Email: song@drexel.edu
Email: tony.hu@cis.drexel.edu
Abstract
Designing and developing a system that assists
users in digesting and understanding the
available information has been a difficult
challenge. In this paper, we discuss the design
and development of an automatic interactive
keyphrase extraction system, called KPSpotter,
which is capable of processing various formats of
data such as XML, HTML, and plain text through
the Internet. KPSpotter combines the Information
Gain data mining measure with several Natural
Language Processing (NLP) techniques, such as
Part of Speech (POS) tagging and First
Occurrence of Term. To improve extraction
accuracy, WordNet is incorporated into KPSpotter.
In designing and developing KPSpotter we
utilized Unified Modeling Language (UML). UML
modeling helps in the formalization of the
preliminary analysis model and accomplishes
iterative system design and development. We also
conducted experiments for system performance
testing by comparing keyphrases extracted by
KPSpotter and KEA, a well-known naive
Bayesian-based keyphrase extraction system. The
experiments show that KPSpotter outperforms
KEA in most test cases.
Introduction
Digesting information available through the Internet has
become a serious issue. There have been rigorous attempts
to tackle this issue of information overload in fields
such as topic detection, text summarization, and keyphrase
extraction.
In this paper, we present KPSpotter, an automatic
keyphrase extraction system that employs a novel
extraction technique. Our technique combines the
Information Gain data mining measure and several Natural
Language Processing (NLP) techniques such as a Part Of
Speech (POS) tagger, Term Frequency*Inverse Document
Frequency (TF*IDF), and Distance from First Occurrence
(DIS). Information Gain is a well-known data mining
technique introduced in the ID3 algorithm (Quinlan, 1993). In
applying POS techniques to KPSpotter, we combine
several POS tagging tools, namely 1) NLPParser
(Charniak, 2000), 2) Link-Grammar (Lafferty et al., 1992),
3) PCKimmo (Antworth, 1993), and 4) Brill’s Tagger (Brill,
1993), to improve POS tagging accuracy. This combined
approach to POS techniques enables us to assign the best
POS tags to the lexical tokens constituting candidate
phrases by exploiting the strengths of each POS
technique.
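As a minimal sketch of the Information Gain measure mentioned above, the following computes the reduction in entropy obtained by splitting labeled candidate phrases on one discretized feature. The function names, the feature bins, and the toy data are illustrative assumptions and are not taken from the KPSpotter implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one discretized feature bin per candidate phrase,
# and whether the phrase is an author-assigned keyphrase.
tfidf_bin = ["high", "high", "low", "low"]
is_keyphrase = [True, True, False, False]
gain = information_gain(tfidf_bin, is_keyphrase)  # this feature separates the two classes
```

A feature whose bins separate keyphrases from non-keyphrases yields a large gain, while a feature that is unrelated to the labels yields a gain close to zero.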
It is a challenging task to design and implement a
keyphrase extraction system that requires such components
as POS libraries, WordNet, and text processing tools. The
main objective of the paper is to discuss our design and
implementation of KPSpotter, whose goals are to be 1)
flexible in terms of processing various input data formats,
2) accessible through the Internet, and 3) robust in terms of
extraction speed.
In order to improve accuracy, we incorporate into KPSpotter
WordNet’s capability of converting the verb form of a term
to its noun form (WordNet 2.0). In our previous study,
we found that the set of keyphrases assigned to a research
paper by its author often includes noun phrases that do
not actually appear in the text; instead, the verb form of
the noun appears in the text (Song et al., 2003).
Incorporating WordNet into KPSpotter improves the
accuracy of extracting keyphrases.
In addition to the proposed novel extraction technique,
KPSpotter is differentiated from other extraction systems in
various aspects. First, from an object-oriented system
architecture perspective, KPSpotter is developed to be a
flexible keyphrase extraction system that handles various
types of file formats such as HTML, XML, and ASCII,
whereas other keyphrase extraction systems such as KEA
(Frank et al., 1999) and GenEx (Turney, 2000) require the
input data to be in certain formats. In particular, UML is used
to design and develop KPSpotter so that it can embrace a variety of
data algorithms and NLP techniques. Second, KPSpotter is
an interactive keyphrase system capable of extracting
keyphrases through a web interface. These strengths
make the system portable and flexible in
situations in which various data formats and system
environments exist in digital libraries.
The effectiveness of KPSpotter was evaluated by
comparing the keyphrases extracted by KPSpotter with the
ones that the authors assigned. We then compared KPSpotter
with KEA, a well-known naive Bayesian-based keyphrase
extraction system. The preliminary experiments show that
KPSpotter outperforms KEA in most instances.
To demonstrate the flexibility of KPSpotter in handling various
types of input format, we extracted keyphrases from Web
data such as HTML pages. From the set of candidate
keyphrases extracted by KPSpotter, the user can weigh the
candidate keyphrases and provide feedback to the system.
With this user feedback, KPSpotter adjusts its weighting
scheme to extract keyphrases.
KPSpotter can be applied to several document management
areas. First, it can serve as an extraction engine for a full-text
document clustering system. A goal of the document
clustering system is to cluster the retrieved documents on
the fly, and it is practically impossible to cluster
full-text documents due to the size of the document-term matrix
that the system needs to process. KPSpotter extracts
keyphrases for the given full-text documents at indexing
time, and the clustering system then takes the keyphrases
instead of the full-text documents as input and clusters them.
Another useful application area is information visualization.
A critical issue in information visualization is
labeling (Song, 2000). Many visualization systems use
single terms for labeling, and a single term often obscures
the meaning of the visual object that the label is intended to
represent. KPSpotter can serve as a better labeling engine for
an information visualization system by supplying
meaningful keyphrases.
The remainder of this paper is organized as follows: Section 2
describes the system architecture and development details,
and Section 3 explains the details of data processing and
feature selection procedures. Section 4 reports the results of
the experiments. Section 5 discusses lessons we learned
during the system design and development. Finally, Section
6 concludes the paper.
System Design
In this section, we describe the architecture of KPSpotter.
In addition, we illustrate the web interface of KPSpotter
and explain how to use it. Throughout the development
cycle of the system, UML was used to embed
object-orientation in the system. The UML diagrams we developed
include use case, class, and activity diagrams.
As illustrated in Figure 1, KPSpotter comprises the
following two stages: 1) building extraction model and 2)
extracting keyphrases. The input of the “building extraction
model” stage is training data, and the input of the “extracting
keyphrases” stage is test data or production data. Both
training and test data are processed by three
components: 1) Data Cleaning, 2) Data Tokenizing, and
3) Data Discretizing. In Figure 1, the dotted line represents
the processing logic for “building extraction model,”
whereas the solid line indicates the processing logic for
“extracting keyphrases.” Detailed descriptions are
provided in the following subsections. These two stages are
fully automated.
Depending on the configuration
parameters, KPSpotter runs in either “building extraction
model” mode or “extracting keyphrases” mode. The
outcomes of both processes are stored in
XML form.
[Figure 1 diagram labels: Training data; Test data; Data Cleaning; Data Tokenizing; Data Discretizing; Stemming; Dropping special characters; Case-folding; POS Tagging; Distance from first occurrence; TF*IDF; DIS; POS; POSITIVE; DocInfo DB; Token DB; WordNet DB; Model XML DB; Keyphrase XML DB; sample output: text mining, document summarization]
Figure 1: System architecture of KPSpotter
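The cleaning, tokenizing, and feature steps named in Figure 1 can be sketched as follows. This is an illustrative sketch only: the formulas for TF*IDF and DIS are not given here, so the code assumes the commonly used KEA-style definitions, stemming is omitted, and all function names are hypothetical.

```python
import math
import re


def clean(text):
    """Case-fold and drop special characters (stemming is omitted in this sketch)."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text into word tokens."""
    return clean(text).split()


def find_first(tokens, phrase_tokens):
    """Index of the first occurrence of the phrase in tokens, or -1 if absent."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i
    return -1


def phrase_count(tokens, phrase_tokens):
    """Number of occurrences of the phrase in tokens."""
    n = len(phrase_tokens)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens)


def tf_idf(phrase, doc, corpus):
    """Assumed definition: (phrase count / document length) * -log2(document frequency / corpus size)."""
    p, tokens = tokenize(phrase), tokenize(doc)
    if not tokens:
        return 0.0
    tf = phrase_count(tokens, p) / len(tokens)
    df = sum(1 for d in corpus if find_first(tokenize(d), p) >= 0)
    return tf * -math.log2(df / len(corpus)) if df else 0.0


def distance_from_first_occurrence(phrase, doc):
    """Assumed definition: fraction of the document preceding the phrase's first occurrence."""
    tokens = tokenize(doc)
    pos = find_first(tokens, tokenize(phrase))
    return pos / len(tokens) if pos >= 0 else None
```

In a full pipeline, the continuous values returned here would then be discretized into bins before the Information Gain measure is applied.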
Use Case Analysis
An important UML model that provides useful
knowledge about the usage of a system is the use case
diagram. Use case diagrams document the functionality of
a system and the users of the system.
As illustrated in Figure 2, an actor is shown as an agent who
interacts with the system. This use case diagram
shows KPSpotter consisting largely of three components:
1) train model, 2) extract phrase, and 3) apply information
gain measure for extraction.
Figure 2: Use Case Diagram of KPSpotter
Class Diagram
In this section, we present the structure of KPSpotter at
the class diagram level. Class diagrams provide a static
representation of the structure of a system. Class diagrams
appear in various levels of detail depending on the phase of
the lifecycle (Fowler, 2003). Figure 3 depicts a high-level
conceptual class diagram of KPSpotter.
Figure 3: Class diagram of KPSpotter
The following five major components are shown in the
class diagram: 1) ModelBuilder, 2) DBHandler, 3)
POSHandler, 4) ModelManager, and 5) KeyphraseHandler.
The ModelBuilder component consists of classes that process
various input formats such as HTML and XML.
The DBHandler component stores statistics on candidate
phrases and input documents. The POSHandler component
interfaces with the four POS tagging libraries implemented
in KPSpotter. The ModelManager component applies
discretization and WordNet’s conversion of verb
forms to noun forms. The KeyphraseHandler component
extracts keyphrases based on the information gain
data mining measure.
Activity Diagram
To help understand how the system works, an activity
diagram is provided (Figure 4). Activity diagrams represent
the business and operational workflows of a system and
show the activity and the event that cause an object to be
in a particular state (Hofmeister, 1999).
Figure 4. Activity diagram of KPSpotter
As illustrated in Figure 4, depending on the process mode
of the system, KPSpotter handles test data or training data.
Consequently, it either generates a trained model or extracts
keyphrases. Which path KPSpotter takes is determined by
the configuration settings, which are given in XML.
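The mode selection described above could be driven by a small XML configuration such as the one parsed below. The element names (`config`, `mode`, `input`), the mode values, and the handler functions are hypothetical; the actual configuration schema of KPSpotter is not specified here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the real KPSpotter schema is not given in the paper.
CONFIG_XML = """
<config>
  <mode>extract</mode>
  <input>docs/test.xml</input>
</config>
"""


def build_model(input_path):
    """Placeholder for the 'building extraction model' stage."""
    return f"model built from {input_path}"


def extract_keyphrases(input_path):
    """Placeholder for the 'extracting keyphrases' stage."""
    return f"keyphrases extracted from {input_path}"


HANDLERS = {"train": build_model, "extract": extract_keyphrases}


def run(config_xml):
    """Read the process mode from the configuration and dispatch to the matching stage."""
    root = ET.fromstring(config_xml.strip())
    mode = root.findtext("mode", default="").strip()
    input_path = root.findtext("input", default="").strip()
    if mode not in HANDLERS:
        raise ValueError(f"unknown process mode: {mode!r}")
    return HANDLERS[mode](input_path)


result = run(CONFIG_XML)
```

Keeping the mode in configuration rather than in code means the same executable can serve both branches of the activity diagram.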
Web Interface of KPSpotter
In this section, we describe a web interface of KPSpotter.
KPSpotter provides a web interface for the user to access
through the Internet (Figure 5). For the process mode, the
user can select either “train model” or “extract keyphrases.”
In the current state of the system, there are three options
available for providing input data. The first option is
for the input data to be accessible via HTTP: given a
URL that the user provides, KPSpotter fetches and
processes the data. The second option is for the user to
enter the input data directly into the text box. The last option
lets the user upload the input data.
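The three input options could be normalized into a single text payload before processing. The sketch below is an assumption about how such a front end might be organized: the function name and parameters are hypothetical, and the URL branch uses Python's standard `urllib` purely for illustration.

```python
from urllib.request import urlopen


def load_input(url=None, text=None, uploaded_bytes=None, encoding="utf-8"):
    """Return the input document as a string from exactly one of the three sources."""
    supplied = [s for s in (url, text, uploaded_bytes) if s is not None]
    if len(supplied) != 1:
        raise ValueError("provide exactly one of url, text, or uploaded_bytes")
    if url is not None:
        # Option 1: fetch the data over HTTP from a user-supplied URL.
        with urlopen(url) as response:
            return response.read().decode(encoding, errors="replace")
    if text is not None:
        # Option 2: data entered directly into the text box.
        return text
    # Option 3: contents of an uploaded file.
    return uploaded_bytes.decode(encoding, errors="replace")
```

Whichever option the user chooses, the rest of the pipeline then sees the same kind of input.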
Figure 5: Web interface of KPSpotter
The output of executing KPSpotter, a list of keyphrases, is
displayed in the browser in XML form (Figure 6). In
Figure 6, for record id 10004, a total of fifteen keyphrases
are extracted, and each keyphrase is weighted with the
information gain data mining measure.
Figure 6: Sample Keyphrases extracted by KPSpotter
Evaluation
In this section, we report the preliminary experimental
results of the performance of KPSpotter.
We measured the performance of KPSpotter by comparing
the extracted keyphrases with human-generated keyphrases. Turney
(2000) reports that an average of about 75% of the
human-generated keyphrases appear in the body of the
corresponding document in his experiment data. With these
findings, he argued that an ideal keyphrase extraction
algorithm could generate phrases that match up to 75% of
the author’s keyphrases.
Taking this result into consideration, optimistically
speaking, KPSpotter needs to extract three to four
keyphrases that match the list of keyphrases that the
authors provided. The overall performance of KPSpotter is
shown in Table 1 and also illustrated further in Figure 7.
For the given test documents, KPSpotter extracted more
than two “correct keyphrases” on average. By integrating
WordNet, we gain a significant accuracy improvement
compared to our previous experiments (Song et al., 2003).
Keyphrase   No. of keyphrases   Average matched keyphrases   Average matched keyphrases
range       extracted           without WordNet              with WordNet
5           96                  0.90327                      1.30327
10          115                 1.334                        1.716
15          124                 1.524                        2.179
20          130                 1.728                        2.3556
Table 1. Overall quality of KPSpotter by accuracy
In Figure 7, the first line from the top is the average
number of keyphrases that the authors assigned. The
second line from the top shows the number of keyphrases
that appear in the documents. The third one indicates the
average number of correct identifications.
[Figure 7 plot: x-axis, Number of keyphrases (0 to 25); y-axis, Number of correct keyphrases (0 to 6); series: assigned by author, appearing in abstract text, assigned by KPSpotter]
Figure 7. Overall Performances
A similar result is reported for KEA (Witten et al., 1999).
KEA generates about one to two “correct keyphrases.” As
illustrated in Figure 8, KPSpotter outperforms KEA in the
first four cases (for the last case, the result from KEA was
not available). The results from the experiments seem to
indicate that KPSpotter produces acceptable performance
in terms of the average number of matches.
[Figure 8 plot: x-axis, Keyphrases extracted (5 to 25); y-axis, No. of keyphrases matched (0 to 3); series: KEA, KPSpotter]
Figure 8. Performance Comparison
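The average number of matches used in this evaluation can be computed along the following lines. This is a sketch under assumptions: matching here is exact after case-folding, whereas the matching rule actually used is not stated, and the sample data are invented for illustration.

```python
def count_matches(extracted, author_assigned):
    """Number of distinct extracted keyphrases that also appear in the author-assigned list."""
    gold = {k.strip().lower() for k in author_assigned}
    return sum(1 for k in {e.strip().lower() for e in extracted} if k in gold)


def average_matches(documents, top_n):
    """Mean number of correct keyphrases per document when keeping the top_n extracted phrases."""
    if not documents:
        return 0.0
    total = sum(count_matches(doc["extracted"][:top_n], doc["author"]) for doc in documents)
    return total / len(documents)


# Invented example data: ranked system output and author keyphrases for two documents.
docs = [
    {"extracted": ["relevance feedback", "image retrieval", "neural network"],
     "author": ["Relevance Feedback", "Multimedia Information Retrieval"]},
    {"extracted": ["text mining", "keyphrase extraction"],
     "author": ["Keyphrase Extraction", "Text Mining"]},
]
avg_top2 = average_matches(docs, top_n=2)  # (1 + 2) / 2
```

Varying `top_n` over 5, 10, 15, and 20 would give one value per keyphrase range, in the manner of Table 1.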
In order to demonstrate the extensibility and flexibility of
KPSpotter, we extracted keyphrases from
publication-related web data available in the IEEE digital library (IEEE
Digital Library). We chose the IEEE digital library because
the publication-related web pages it provides
contain not only reasonably sized abstracts but also
keyphrases generated by the authors. We obtained 150
publication-related web pages for training data and 50 web
pages for testing data. KPSpotter then parses the HTML pages
and stores author-generated keywords and abstracts
separately. Figure 9 shows sample keyphrases extracted
from the publication web page for the article
titled “A Relevance Feedback Architecture for Content-based
Multimedia Information Retrieval Systems.” The list of
keyphrases given by the authors of the article includes: 1)
Multimedia Information Retrieval, 2) Relevance Feedback,
and 3) Content-based Image/Video Retrieval. It should be
noted that although the size of the training data is small (20
Web sites), KPSpotter is able to match one or two
keyphrases out of five keyphrases generated by the authors.
Figure 9. A sample of keyphrases extracted from medical
data
Lessons Learned
In this section, we summarize the lessons we learned in
developing KPSpotter with object-oriented technologies.
Third-party software dependency: We used three POS
libraries to identify the word sense. Some major issues with
memory leaks and performance arose due to bugs
in the third-party libraries. Since the communication
channel with the third-party company was not established in
an efficient manner, it took a while to fix the problems.
We felt that it is critical to establish a solid communication
channel with the third-party software developers early in
the development phase.
System design with UML: After the requirements gathering
was finished, there was not sufficient time to develop a
fairly mature set of analysis specifications due to the tight
development schedule. However, the core UML diagrams,
such as use case, class, and sequence diagrams,
improved the design team members' understanding of
the project in a timely manner and also helped to develop
quality software (Hofmeister, 1999). Since the
requirements were continuously changing, we had to
modify the diagrams through a UML tool to reflect the
changes of requirement specifications in the design. However,
due to the tight development schedule, we bypassed this
update process and directly changed the code instead. This
caused confusion among the developers when discussing the
code changes. It was especially confusing for the
developers who joined the project at a later stage.
Throughout the development of the system, we realized
that it is crucial to update the UML diagrams and reflect the
changes of requirement specifications in the design prior to
the code changes.
Handling special characters in XML entity: Since our
XML-formatted web database contains data written in
English as well as data in other languages, we had to cope
with special characters such as ë or ä. The XML parser we
used abruptly terminated its execution when it processed
those special characters. To work around this problem, we
took an ad hoc approach by replacing those characters in
raw data with corresponding encoded characters. This is a
well-known issue with XML parsers in handling some
foreign characters in XML entities. For internationalization,
handling of special characters needs to be addressed in the
XML parser enhancement.
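One way to apply the workaround described above is to replace non-ASCII characters with XML numeric character references before parsing. The sketch below uses Python's standard library for illustration only; the encoding scheme and parser actually used in KPSpotter are not stated, and the sample record is invented.

```python
import xml.etree.ElementTree as ET


def escape_non_ascii(raw_xml):
    """Replace every non-ASCII character with a numeric character reference such as &#235;."""
    return raw_xml.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Invented sample record containing the characters mentioned in the text.
raw = "<record><author>Zoë Müller-Jäger</author></record>"
safe = escape_non_ascii(raw)
author = ET.fromstring(safe).findtext("author")  # the parser decodes the references again
```

Because the references are decoded by the parser, the original characters are preserved in the parsed data while the raw input stays pure ASCII.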
Lack of communication among the development team: We
realized that an effective communication channel between
development team members must be established. Several
developers simultaneously wrote different pieces of the C++ classes based
on common class libraries (STL). As a
result, we experienced some inconsistency and redundancy
in the code. In addition, the project
suffered from ineffective communication among the
internal clients, project managers, and developers.
Conclusion
In this paper, a flexible automatic keyphrase extraction
system, called KPSpotter, is proposed. KPSpotter
employs a new technique combining the Information Gain
data mining measure and several Natural Language
Processing techniques such as stemming and case-folding.
The three features identified by KPSpotter for candidate
keyphrases are 1) TF*IDF, 2) distance from the first
occurrence of the phrase, and 3) POS tagging.
KPSpotter was designed and developed in the spirit of
object-oriented design and analysis. In particular, in order
to help understand the system architecture in an effective
way, UML notations and diagrams were employed. We also
reported the lessons learned from designing and developing
KPSpotter with UML.
KPSpotter is characterized and differentiated from other
keyphrase extraction systems by the following: 1) it
introduces an extraction technique combining Information
Gain and Natural Language Processing techniques; 2) it
provides a web interface for the user to obtain a list of
keyphrases for the supplied input data; 3) it processes
various types of input data such as XML, HTML, and
unstructured text data and generates XML output; 4) it
stores statistical information of candidate phrases in
BerkeleyDB, a persistent object storage device; 5) it also
stores both the model and the list of keyphrases for the target
document in an XML file; and 6) it incorporates WordNet
to improve extraction accuracy. These
features of KPSpotter make it suitable for the real world
application where robustness, flexibility, and speed are
important.
To evaluate the performance of the system, we conducted a
series of experiments and reported the experimental results.
KPSpotter outperforms KEA in the cases of extracting 5 to
15 keyphrases and also demonstrates extraction
quality equivalent to KEA in extracting 20 keyphrases in terms of the
number of matches between system-generated and
human-generated keyphrases. The number of correct keyphrases out of 25
keyphrases extracted by KPSpotter is more than two on
average. In addition, the results of extracting keyphrases
from publication-related web sites in the IEEE digital library
indicated that KPSpotter is capable of extracting
meaningful sets of keyphrases. These findings are
encouraging because KPSpotter is able to extract one or
two matched keyphrases even though the size of the training
data was small (50 web sites).
KPSpotter can serve as an extraction engine in
several different document management areas: 1) a full-text
document clustering system, which benefits from
KPSpotter by clustering documents with keyphrases, and 2)
an information visualization system, which utilizes
KPSpotter for generating meaningful labels for visual
objects.
We are conducting experiments that compare the performance
of different data mining techniques, such as Information Gain, Support
Vector Machine (SVM), and K-Nearest Neighbor, in predicting keyphrases. These
mining algorithms have been successfully applied to
document classification tasks.
Regarding the experimental methodology, we are also
undertaking a study on robust and sophisticated
accuracy measures for the usability of the system. The
emphasis of the follow-up study is on measuring the usefulness
of keyphrases to the users of digital libraries.
REFERENCES
Antworth, Evan L. (1993) Glossing text with the PC-KIMMO
morphological parser. Computers and the Humanities. pp. 475-484.
Brill, Eric (1993) Automatic Grammar Induction and Parsing Free
Text: A Transformation-Based Approach. In: Proceedings of ACL,
259-265.
Caropreso, F.M., Matwin, S. and Sebastiani, F. (2001) A
learner-independent evaluation of the usefulness of statistical phrases for
automated text categorization. In: Amita G. Chin (ed.), Text
Databases and Document Management: Theory and Practice,
Idea Group Publishing, Hershey, US, pp. 78-102.
Charniak E., (2000) A Maximum-Entropy-Inspired Parser. In:
Proceedings of NAACL-2000.
Dougherty, J., Kohavi, R. and Sahami, M. (1995) Supervised and
unsupervised discretization of continuous features. In: Proceeding
of ICML-95, 12th International Conference on Machine Learning,
Lake Tahoe, US, pp.194--202.
Fowler M. (2003) UML Distilled: A Brief Guide to the Standard
Object Modeling Language. Addison-Wesley.
Frank E., Paynter G.W., Witten I.H., Gutwin C. and
Nevill-Manning C.G. (1999) Domain-specific keyphrase extraction, In:
Proc. Sixteenth International Joint Conference on Artificial
Intelligence, Morgan Kaufmann Publishers, San Francisco, CA,
pp. 668-673.
Hofmeister C, Nord RL, Soni D. (1999) Describing Software
Architecture with UML, In: 1st Working IFIP Conference on
Software Architecture (WICSA1), Feb 22-24, pp. 145-159.
Lafferty J., Sleator D., and Temperley D. (1992) Grammatical
Trigrams: A Probabilistic Model of Link Grammar. In:
Proceedings of the AAAI Conference on Probabilistic Approaches
to Natural Language, October.
Manning C. and Schütze H. (1999) Foundations of
Statistical Natural Language Processing, MIT Press. Cambridge,
MA: May.
Porter, M.F. (1980) An algorithm for suffix stripping, Program,
14(3), pp. 130-137.
Quinlan, J. R. (1993) C4.5: Programs for Machine Learning, San
Mateo: Morgan Kaufmann Publishers.
Song, M, Song, I.Y., and Hu, T. (2003) KPSpotter: A Flexible
Information Gain-based Keyphrase Extraction System, Fifth
International Workshop on Web Information and Data
Management (WIDM'03), In Conjunction with the 12th
International Conference on Information and Knowledge
Management (CIKM 2003), November 7-8, 2003.
Radev D.R, Qi H., Zheng, Z., Blair-Goldensohn S., Zhang Z., Fan
W., and Prager J. (2001) Mining the web for answers to natural
language questions. In: ACM CIKM 2001: Tenth International
Conference on Information and Knowledge Management, Atlanta,
GA.
Turney, P.D. (2000) Learning algorithms for key phrase
extraction. Information Retrieval, 2, pp.
303-336.
Song, M. (2000) Visualization in information retrieval: a
three-level analysis, Journal of Information Science, 26 (1): 3-19.
Witten I.H., Paynter G.W., Frank E., Gutwin C. and
Nevill-Manning C.G. (1999) KEA: Practical automatic keyphrase
extraction. In: Proc. DL '99, pp. 254-256.
BerkeleyDB, http://www.sleepycat.com.
IEEE Digital Library,
http://www.ieee.org/products/onlinepubs/iel/iel.html.