David Field

Ranking Techniques for Keyphrase Extraction

by

David Field

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Masters of Engineering in Computer Science and Engineering

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Signature redacted
Department of Electrical Engineering and Computer Science
July 23, 2014

Certified by: Signature redacted
Regina Barzilay
Professor
Thesis Supervisor

Accepted by: Signature redacted
Albert R. Meyer
Chair, Masters of Engineering Thesis Committee
Ranking Techniques for Keyphrase Extraction
by
David Field
Submitted to the Department of Electrical Engineering and Computer Science
on July 23, 2014, in partial fulfillment of the
requirements for the degree of
Masters of Engineering in Computer Science and Engineering
Abstract
This thesis focuses on the task of extracting keyphrases from research papers. Keyphrases
are short phrases that summarize and characterize the contents of documents. They
help users explore sets of documents and quickly understand the contents of individual documents. Most academic papers do not have keyphrases assigned to them, and
manual keyphrase assignment is highly laborious. As such, there is a strong demand
for automatic keyphrase extraction systems.
The task of automatic keyphrase extraction presents a number of challenges. Human indexers are heavily informed by domain knowledge and comprehension of the
contents of the papers. Keyphrase extraction is an intrinsically noisy and ambiguous task, as different human indexers select different keyphrases for the same paper.
Training data is limited in both quality and quantity.
In this thesis, we present a number of advancements to the ranking methods and
features used to automatically extract keyphrases. We demonstrate that, through
the reweighting of training examples, the quality of the learned bagged decision trees
can be improved with negligible runtime cost. We use reranking to improve accuracy and explore several extensions thereof. We propose a number of new features,
including augmented domain keyphraseness and average word length. Augmented
domain keyphraseness incorporates information from a hierarchical document clustering to improve the handling of multi-domain corpora. We explore the technique
of per-document feature scaling and discuss the impact of feature removal. Over
three diverse corpora, these advancements substantially improve accuracy and runtime. Combined, they give keyphrase assignments that are competitive with those
produced by human indexers.
Thesis Supervisor: Regina Barzilay
Title: Professor
Acknowledgments
First and foremost, I would like to thank my advisor, Regina Barzilay. Her guidance
and vision were instrumental in my research. I could not have made the progress I
did without her knowledge and intuition. When my ideas were failing, she was always
there to provide advice and encouragement.
I'd also like to thank my group mates. Their advice and insights have been invaluable in the writing of this thesis. Our discussions always left me with a deeper
understanding of natural language processing.
My friends at MIT have been a source of both wisdom and joy.
I will always
remember my time in Cambridge fondly.
Finally, this thesis would not have been possible without the inexhaustible support
of my family. For their love and support, I am forever grateful.
Contents
1 Introduction
    1.1 Keyphrases
    1.2 Common Keyphrase Annotation Techniques
    1.3 Challenges
    1.4 Advancements

2 Task Description
    2.1 Prior Work
        2.1.1 Keyphrase Extraction
        2.1.2 Keyphrase Assignment
    2.2 Evaluation
        2.2.1 Performance Metrics
        2.2.2 Cross-Validation
        2.2.3 Comparison to Human Indexers
        2.2.4 Use of Evaluation Systems
    2.3 Datasets
        2.3.1 Collabgraph
        2.3.2 CiteULike180
        2.3.3 Semeval 2010 Task 5
        2.3.4 Author vs Reader Selected Keyphrases
        2.3.5 Testing Datasets
    2.4 Baseline
        2.4.1 Baseline System
        2.4.2 Baseline Performance

3 Ranking Methods
    3.1 Reweighting and Bagging
        3.1.1 Reweighting Positive Examples
        3.1.2 Comparison to Tuning of Bagging Parameters
        3.1.3 Summary of Reweighting
    3.2 Reranking
        3.2.1 Bagged Decision Trees
        3.2.2 Support Vector Machines
        3.2.3 Extensions
        3.2.4 Summary of Reranking

4 Features
    4.1 Augmented Domain Keyphraseness
    4.2 Average Word Length
    4.3 Feature Scaling
        4.3.1 Scaling Features
        4.3.2 Scaling and Reranking
    4.4 Feature Selection

5 Conclusion
    5.1 Combined Performance
    5.2 Comparison to Human Indexers

A Sample Keyphrases
List of Tables
1.1 Example Document
2.1 Statistics for Datasets
2.2 Collabgraph Baseline Performance
2.3 CiteULike180 Baseline Performance
2.4 Semeval Baseline Performance
2.5 Summary Table
3.1 Reweighting Tuning
3.2 Tuning Bagging Parameters
3.3 Reweighting Summary
3.4 Tuning Reranker Candidate Counts
3.5 Tuning Reranker Bagging Parameter
3.6 Reranking Summary
3.7 Adding Wikipedia Features at Different Stages
3.8 Effects of numSuper and numSub Features
4.1 Augmented Domain Keyphraseness Parameter Tuning
4.2 Keyphrase Statistics for Datasets
4.3 Average Word Length
4.4 Scaling Methods
4.5 Scaling and Reranking on Collabgraph
4.6 Non Wikipedia Performance with Features Removed
4.7 Wikipedia Performance with Features Removed
5.1 Maui vs Oahu
5.2 Automatic vs Human Consistency
A.1 Sample Keyphrases
Chapter 1
Introduction
1.1
Keyphrases
Keyphrases are short phrases which summarize and characterize documents.
They
can serve to identify the specific topics discussed in a document or place it within a
broader domain. Table 1.1 shows the title, abstract, and author-assigned keyphrases
of a paper from the Collabgraph dataset.
Some of the keyphrases of this paper,
such as "frustration" and "user emotion", describe the specific topics covered by the
paper. On the other hand, the keyphrase "human-centered design" is more general,
and describes a broader domain which the paper falls within. These two roles are not exclusive; most keyphrases act in both roles, providing information on the specifics of the paper and how it connects to other papers.
Keyphrases have a wide variety of applications in the exploration of large document collections, the understanding of individual documents, and as input to other
learning algorithms.
In large document collections, such as digital libraries, keyphrases can be used to
enable the searching, exploration, and analysis of the collections' contents. When a
collection of documents is annotated with keyphrases, users can search the collection
using keyphrases. They can utilize keyphrases to expand out from a single document
to find other documents about the same topics. Keyphrase statistics can be computed
for an entire corpus, giving a high-level view of the topics in the corpus and how they relate.
Keyphrases also help readers understand single documents by summarizing their
contents. In effect, the task of keyphrase assignment is a more highly compressed variant of text summarization. Keyphrases allow readers to quickly understand what is
discussed in a long paper, without reading the paper itself. In this regard, keyphrases
play a similar role to the abstract of a paper, but with even greater compression.
They can serve to augment the abstract of a paper by identifying which portions of
the abstract are most important. Usually, the list of keyphrases contains phrases not
found in the abstract and provides additional information about subjects covered in
the paper.
Keyphrases can be used as the inputs to a wide variety of learning tasks. They
have been seen to improve the performance in text clustering and categorization, text
summarization, and thesaurus construction [11][1][26]. Naturally, these improvements
are dependent on the availability of high quality keyphrase lists.
Keyphrases are typically chosen manually. For academic papers, the authors generally select the keyphrases themselves. In other contexts, such as libraries, professional indexers may assign them from a fixed list of keyphrases. Unfortunately, the
vast majority of documents have no keyphrases assigned or have an incomplete list of
keyphrases. Manually assigning keyphrases to documents is a time consuming process
which requires familiarity with the subject matter of the documents. For this reason,
there is a strong demand for accurate automatic keyphrase annotation.
1.2
Common Keyphrase Annotation Techniques
There are two primary methods for automated keyphrase annotation: keyphrase extraction and keyphrase assignment [30]. In keyphrase extraction, the keyphrases are
drawn from the text of the document itself, using various ranking and filtering techniques. In keyphrase assignment, the keyphrases are drawn from a controlled list of
keyphrases and assigned to documents by per-keyphrase classifiers. The documents,
availability of training data, and type of keyphrases desired all impact the relative
suitability of these two techniques.

Table 1.1: Example Document

Title: Computers that Recognise and Respond to User Emotion: Theoretical and Practical Implications

Abstract: Prototypes of interactive computer systems have been built that can begin to detect and label aspects of human emotional expression, and that respond to users experiencing frustration and other negative emotions with emotionally supportive interactions, demonstrating components of human skills such as active listening, empathy, and sympathy. These working systems support the prediction that a computer can begin to undo some of the negative feelings it causes by helping a user manage his or her emotional state. This paper clarifies the philosophy of this new approach to human-computer interaction: deliberately recognising and responding to an individual user's emotions in ways that help users meet their needs. We define user needs in a broader perspective than has been hitherto discussed in the HCI community, to include emotional and social needs, and examine technology's emerging capability to address and support such needs. We raise and discuss potential concerns and objections regarding this technology, and describe several opportunities for future work.

Keywords: User emotion, affective computing, social interface, frustration, human-centred design, empathetic interface, emotional needs
In keyphrase assignment, keyphrases are assigned to documents from a controlled
vocabulary of keyphrases. For each keyphrase in the controlled vocabulary, a classifier
is trained and then used to assign the keyphrase to new documents. This means that
only keyphrases that have been encountered in the training data will ever be assigned
to new documents.
In many cases, the controlled vocabulary is a canonical list of
keyphrases from a journal or library. Keyphrase assignment is also referred to as text
categorization or text classification in some other works [19].
In keyphrase extraction, keyphrases are chosen from the text of the documents,
instead of from a controlled vocabulary.
A single keyphrase extractor is trained,
which takes in documents and identifies phrases from the texts which are likely to
be keyphrases.
This extractor can utilize a wide variety of information, such as
the locations of the candidate phrases in the document and their term frequency.
Many keyphrases appear in the abstracts of papers, such as "frustration" and "user
emotion" in the example paper from Table 1.1.
Keyphrases often have high term
frequency. For example, "frustration" occurs 34 times within the paper from Table
1.1. These patterns can be used to identify which candidate phrases are most likely
to be keyphrases.
1.3
Challenges
Automatic keyphrase extraction presents a number of challenges.
Human indexers rely on a deep understanding of both the contents of the documents and their domains. This is especially true when authors are assigning keyphrases
to their own papers. Indexers are able to understand how the different concepts in
the documents relate to each other, and how the paper fits within the broader context
of other papers in the domain.
The keyphrase lists available for documents are incomplete and noisy.
When
authors assign keyphrases to their own papers, they usually choose a small number
of keyphrases. Hence, the lists don't include all phrases that are good keyphrases for
the document. The lists also express the individual biases of the authors that create
them. Similar issues are present for reader assigned keyphrases. Additionally, human
readers typically have a weaker understanding of the paper, and hence select worse
keyphrases.
Only limited datasets are available for the training of keyphrase extraction systems.
This presents a number of challenges.
Firstly, only a small fraction of all
possible keyphrases are encountered in the training set, so it is difficult to say if
a candidate phrase is suitable to be a keyphrase. Secondly, for many domains no
training data is available, forcing the use of training data from a different domain.
A final challenge is that often keyphrases don't occur in the text of the document.
In our example document, the keyphrase "human-centred design" does not appear
anywhere. To select this keyphrase, the human indexer utilized semantic and domain
knowledge. Replicating this form of knowledge in an automated system is difficult,
especially with limited training data.
1.4
Advancements
In this thesis, we present a number of advancements to the state of the art in keyphrase
extraction. We build upon Maui, an open-source keyphrase extraction system [19].
We have released our improved system as an open source project called Oahu. Our
contributions are as follows:
1. Reweighting: We present a technique for the reweighting of positive examples
during training which improves both accuracy and runtime.
2. Reranking: We introduce a reranking system which improves performance and
enables several additional improvements.
(a) Delayed Feature Computation: By delaying the computation of Wikipedia
features until the reranking stage, we are able to dramatically reduce runtime with minimal cost to accuracy.
(b) Number of Higher Ranked Superstrings and Substrings: We compute the number of high ranked substrings and superstrings from the original ranking step and use them as features in the reranking stage. These
new features are seen to improve performance by adding information on
the relationships between keyphrases.
3. Augmented Domain Keyphraseness: We incorporate information from hierarchical document clusterings into the keyphraseness feature to improve the
handling of corpora with documents from multiple domains.
4. Average Word Length Feature: We propose a new average word length
feature which is seen to substantially improve performance.
5. Feature Scaling: We consider the rescaling of features on a per-document
basis.
We explore a variety of schemes, some of which are seen to improve
performance.
6. Feature Selection: We examine the effects of feature removal. We find that
although several features offer minimal benefit, no features have a significant
adverse effect on performance.
7. Evaluation on Multiple Datasets: We evaluate the performance of all our
changes across three diverse datasets.
8. Comparison to Human Indexers: We propose a new system for comparison
to human indexers, and use it to compare performance between human indexers
and our system on the CiteULike and Semeval datasets.
9. Oahu: We release our improved system as an open-source project called Oahu.¹
In addition to the various improvements to the keyphrase extraction system
discussed in this thesis, Oahu also offers a variety of improvements over Maui
not directly related to the quality of the keyphrases extracted. These include
¹https://github.com/ringwith/oahu
a built in system for cross validation, multi-threading support, and improved
abstractions.
In Chapter 2, we will describe the task of keyphrase extraction and our procedures
for evaluation. In Chapter 3, we will present our advancements in the ranking methods
used for keyphrase extraction. In Chapter 4, we will describe our new features. In
Chapter 5, we report the combined effect of these improvements and compare our
system's performance to that of human indexers.
Chapter 2
Task Description
In this chapter, we will describe the task of keyphrase extraction in greater detail.
First, we will review prior work in keyphrase assignment.
Then we will focus on
the methods that have been used for the evaluation of keyphrase extraction systems,
and describe the evaluation methods we will be employing. We will also describe the
three corpora that we will evaluate performance on. Finally, we will describe Maui,
the baseline system, and report its performance on the three datasets.
2.1
Prior Work
Most automatic keyphrase annotation methods perform either keyphrase extraction
or keyphrase assignment. In keyphrase extraction the keyphrases are drawn from the
texts of the documents by a single extractor. In keyphrase assignment, the keyphrases
are drawn from a controlled list of keyphrases and then assigned to documents by
per-keyphrase classifiers. Although there is a potential for systems that are a hybrid
of these two techniques, or lie wholly outside of these categories, thus far there has been minimal exploration of such systems. Since keyphrase assignment is a relatively
straightforward task, recent work has focused primarily on keyphrase extraction.
2.1.1
Keyphrase Extraction
Keyphrase extraction techniques typically rely on a two step system, consisting of a
heuristic filtering stage to select candidate keyphrases from the text and a trained
ranking stage to select the top candidates.
The ranking stage utilizes a variety of
features which vary substantially from system to system.
KEA, the Keyphrase Extraction Algorithm, is a representative example of a
keyphrase extraction system [30] [6].
In KEA, a list of candidate keyphrases is filtered from each article by regularizing the text and then selecting phrases of at most 3 words that are not proper nouns and do not begin or end with a stop word. Additionally, the keyphrases are stemmed, and their stemmed forms are used when evaluating
features and comparing to the human assigned keyphrases. The filtered lists of candidate keyphrases are ranked using a Naive Bayesian classifier with two features, tf-idf
and the location of the first occurrence of the keyphrase.
Maui is an open source keyphrase extraction system developed by Medelyan et al. [19].
As will be discussed in Section 2.4, this system gives performance that is at or near
the state of the art for author assigned keyphrases. As with KEA, the filtering stage
outputs all phrases of n words or less that don't start or end with a stop word.
The ranking step is performed using bagged decision trees that are generated from
the training set. Maui utilizes a number of features, including term frequency, phrase
length, and position of first occurrence. Maui also allows for the use of domain specific
thesauruses or Wikipedia to compute additional features.
Integrating external data sources, such as domain specific thesauruses or Wikipedia,
has been seen to considerably improve the performance of keyphrase extraction techniques.
Thesauruses can be used as a fixed vocabulary, restricting the extracted
keyphrases to the set of phrases in the thesaurus. The use of the Agrovoc thesaurus
of food and agriculture terms dramatically improves KEA++'s performance on agricultural documents [22] [19]. High quality thesauruses are not available for all fields,
but other sources, such as Wikipedia can offer similar benefits. Several researchers
have investigated this approach, and achieved inter-indexer consistency levels compa-
rable to human indexers on the Wiki-20 and CiteULike180 corpora [23] [12]. However,
these gains depend substantially on the corpora used.
A number of keyphrase extraction systems identify the different sections of the
documents, and utilize this information to compute additional features. Typically,
the additional features are binary features that indicate which sections of the paper
the candidate appears in. These techniques yield the greatest performance improvements on reader selected keyphrases [14][31]. Some systems, such as SZTERGAK and
SEERLAB utilize simple rule based section identification on the text dumps [2][29].
In contrast, WINGUS and HUMB both perform boundary detection on the original
PDF files that were used to generate the text dumps [17][31]. To do so, WINGUS
and HUMB utilize SectLabel and GROBID, respectively.
SectLabel and GROBID
are trainable section identification systems [18][16]. WINGUS and HUMB are seen
to have the highest performance on the reader assigned keyphrases in Semeval 2010
Task 5, indicating that their sophisticated section parsing may provide additional
value beyond simpler rule based parsing [15]. At a minimum, trainable document
parsing systems such as GROBID and SectLabel are more easily adapted to new
corpora than rule based systems.
Some systems have a more sophisticated filtering step, selecting a more restricted
list of candidates than KEA or Maui. SZTERGAK restricts to candidates matching
predefined part of speech patterns [2].
KP-Miner filters candidates by frequency
and position of first occurrence [5]. The WINGUS system abridges the input
documents, ignoring all sentences after the first s in each body paragraph of the paper
[18].
This abridging technique was seen to improve performance on reader chosen
keyphrases, possibly reflecting a tendency of readers to choose keyphrases from the
first few sentences of paragraphs. However, these experiments were performed on a
single split of a single small dataset, so these results may be due to noise or may
generalize poorly to other corpora.
Some keyphrase extraction systems utilize heuristic postranking schemes to improve performance by modifying the candidate scores computed in the ranking step.
The HUMB system utilizes statistics from the HAL research archive to update the
scores of candidates after the scoring step [17].
A wide variety of methods have been proposed for the evaluation of keyphrase
extraction systems.
The standard method for evaluation is having the extraction
system select N keyphrases for each document and then computing the F-score of
the extracted keyphrases on the gold standard lists [14][6]. Other metrics have been
proposed to address issues of near misses and semantic similarity [13][21]. Inter-indexer consistency measures, such as Rolling's consistency metric, have been used to
compare the performance of automatic extraction systems to that of human indexers
[20]. We will discuss these issues in greater detail in Section 2.2.
Previous papers in keyphrase extraction have utilized a number of different datasets
for evaluation. The Semeval dataset is a set of 244 ACM conference and workshop
papers with both author and reader assigned keyphrases [14]. It was used in Semeval
2010 Task 5 to evaluate the performance of 19 different keyphrase extraction systems. CiteULike180 is a dataset of 180 documents with keyphrases assigned by the
users of the website citeulike.org [19]. Hulth (2003) released a corpus of 2000 abstracts
of journal articles from Inspec with keyphrases assigned by professional indexers [8].
Nguyen and Kan (2007) contributed a corpus of 120 computer science documents
with author and reader assigned keyphrases [24].
For the task of keyphrase extraction with a controlled thesaurus vocabulary, several dataset thesaurus pairs are available [22][19]. The FAO-780 dataset consists of
780 documents from the Food and Agriculture Organization of the United Nations
with keyphrases assigned from the Agrovoc agricultural thesaurus.
NLM-500 is a
dataset of 500 medical documents indexed with terms from MeSH, a thesaurus of
medical subject headings. CERN-290 is a set of 290 physics documents indexed with
terms from HEP, a high energy physics thesaurus. WIKI-20 is a small set of 20 documents with Wikipedia article titles assigned as keyphrases. Wikipedia acts as the
thesaurus for this dataset.
2.1.2
Keyphrase Assignment
Keyphrase assignment utilizes a separate classifier for each keyphrase.
Techniques
such as support vector machines, boosting, and multiplicative weight updating all
are effective given a sufficiently large training set [28]. Dumais et al. achieved high
accuracy assigning keyphrases to the Reuters-21578 dataset using support vector
machines [4]. They utilized the tf-idf term weights of the words in the documents. For
each keyphrase, they used only the 300 features that maximized mutual information.
The Reuters-21578 dataset contains 12,902 stories and 118 keyphrases (categories),
with all keyphrases assigned to at least one story. Unfortunately, most datasets have
substantially fewer training documents and more keyphrases.
In many datasets, a
given keyphrase is never encountered or is only encountered a few times, and so a
per-keyphrase classifier cannot be trained. This is particularly true when assigning
keyphrases from domain-specific thesauri, which are often very large. For example,
the MeSH controlled vocabulary of medical subject headings contains over 24,000
subject descriptors, so an enormous training set would be required to ensure that
each keyphrase was encountered multiple times in the training set [19].
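To make this concrete, the sketch below shows the general shape of such a per-keyphrase classification setup: one binary classifier per controlled-vocabulary keyphrase over tf-idf term weights, keeping only the most informative features for each keyphrase. It is a minimal scikit-learn sketch of the approach described above, not the system of Dumais et al.; the choice of LinearSVC, the pipeline structure, and all names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_assignment_classifiers(documents, labels, vocabulary, k_features=300):
    """Keyphrase assignment sketch: one binary classifier per keyphrase in the
    controlled vocabulary, trained on tf-idf term weights and restricted to the
    k_features most informative features (here by mutual information).
    documents: raw texts; labels: one set of gold keyphrases per document."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)
    classifiers = {}
    for keyphrase in vocabulary:
        y = [int(keyphrase in doc_labels) for doc_labels in labels]
        positives = sum(y)
        # A per-keyphrase classifier cannot be trained without both positive
        # and negative training documents for that keyphrase.
        if positives == 0 or positives == len(y):
            continue
        clf = make_pipeline(
            SelectKBest(mutual_info_classif, k=min(k_features, X.shape[1])),
            LinearSVC())
        classifiers[keyphrase] = clf.fit(X, y)
    return vectorizer, classifiers
```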
2.2
Evaluation
A number of metrics have been proposed for the evaluation of keyphrase extraction
algorithms.
Generally, performance is computed by comparing the keyphrases ex-
tracted by the algorithm to a list of gold keyphrases generated by the author or
readers of the papers. This comparison can be done by comparing the strings directly, or using more sophisticated techniques that address near misses and semantic
similarity. In Section 2.2.1, we review prior work in this area, and describe the main
performance metric that we will use in this thesis.
Given the small corpora available for training and testing, the use of cross validation is essential. In Section 2.2.2, we describe our use of cross validation and some
issues with the lack of cross validation in prior work on keyphrase extraction.
In evaluating keyphrase extraction algorithms, it is important to understand how
their performance compares with that of human indexers. The level of consistency
among human indexers can be compared with the level of consistency between automatic indexers and human indexers. In Section 2.2.3, we review prior work in this
area and present an improved measure of consistency that addresses some of the
shortcomings of previous metrics.
2.2.1
Performance Metrics
The standard method for evaluating the performance of automatic keyphrase extraction systems is having the extraction system select N keyphrases for each document
and then computing the F-score based on the number of matches with the gold standard keyphrase lists [14][6]. The comparisons with the gold standard lists are done
after normalizing and stemming the keyphrases.
Measuring performance by checking for exact matches of the stemmed and normalized keyphrases fails to handle keyphrases that are similar, but not identical. For
example, "effective grid computing algorithm" and "grid computing algorithm" are
two similar keyphrases, but would be treated as complete misses. N-gram based evaluations can be used to address the effects of near misses [13].
Another possibility
is measuring the semantic similarity of keyphrases and incorporating this into the
performance metric. Medelyan and Witten (2006) propose an alternative thesaurus-based consistency metric [21]. In their metric, semantic similarity is computed from
the number of links between terms in the thesaurus.
However, these more sophisticated evaluation metrics have not achieved widespread adoption. As such, we will evaluate performance using a standard evaluation metric for automatic keyphrase extraction, the macro-averaged F-score (β = 1).
Keyphrase comparison is done after stemming and normalizing the keyphrases by
lowercasing and alphabetizing.
Macro-averaged F-score is the harmonic mean of
macro-averaged precision and recall. Precision is the fraction of correctly extracted
keyphrases out of all keyphrases extracted, and recall is the fraction of correctly
extracted keyphrases out of all keyphrases in the gold list. For example, if the gold
standard keyphrases were {"ranking", "decision tree", "bagging"} and the keyphrases
extracted were {"ranking", "trees", "keyphrase extraction", "bagging"}, then the recall would be 2/3 and the precision would be 2/4. The macro-averaged precision and
recall are the averages of these statistics over all documents.
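A minimal sketch of this computation is shown below, assuming the keyphrase lists have already been stemmed, lowercased, and alphabetized; it illustrates the metric rather than reproducing Oahu's evaluation code.

```python
def macro_averaged_f_score(gold_lists, extracted_lists):
    """Macro-averaged F-score: average precision and recall over documents,
    then take the harmonic mean of those two averages.  Both arguments are
    lists (one entry per document) of normalized keyphrase sets."""
    precisions, recalls = [], []
    for gold, extracted in zip(gold_lists, extracted_lists):
        matches = len(gold & extracted)
        precisions.append(matches / len(extracted) if extracted else 0.0)
        recalls.append(matches / len(gold) if gold else 0.0)
    p = sum(precisions) / len(precisions)  # macro-averaged precision
    r = sum(recalls) / len(recalls)        # macro-averaged recall
    return 2 * p * r / (p + r) if p + r else 0.0

# The worked example from the text: precision 2/4, recall 2/3.
gold = [{"ranking", "decision tree", "bagging"}]
extracted = [{"ranking", "trees", "keyphrase extraction", "bagging"}]
print(macro_averaged_f_score(gold, extracted))  # ~0.571
```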
Previous work has varied in the use of micro- vs macro-averaged F-scores. Medelyan
et al. use macro-averaged F-scores in their evaluation of Maui, Oahu's predecessor
[19]. Semeval 2010 Task 5 claims to use micro-averaged F-score [14]. However, their
evaluation script does not correctly compute micro- or macro-averaged F-score. The
section of the script which claims to compute the micro-averaged F-score actually computes the average of the F-scores of the various documents. Micro-averaged F-score
is actually the harmonic mean of micro-averaged precision and recall. The section
which claims to compute the macro-averaged F-score actually computes the harmonic
mean of the micro-averaged precision and recall. Macro-averaged F-score is actually
the harmonic mean of macro-averaged precision and recall.
We evaluate on the top 10 keyphrases extracted for each document.
The re-
striction to extracting a fixed number of keyphrases for each document keeps the
focus on extracting high quality keyphrases, instead of predicting the number of gold
standard keyphrases there will be for each document. If the number of gold standard
keyphrases for each document were known ahead of time, then F-score could be increased by predicting more keyphrases on documents with longer gold standard lists.
On most keyphrase extraction corpora, the gold standard lists do not include all good
keyphrases, and as such their lengths do not reflect how many keyphrases should be
assigned to their documents. Instead, they reflect variations in the keyphrase assigning styles of different authors and journals. As such, predicting the lengths of the
gold standard lists is not productive. Additionally, in practice, F-score is maximized
when only a few keyphrases are predicted for each document. However, for most applications longer complete lists of keyphrases are preferable over shorter incomplete
keyphrase lists. The evaluation system should reflect that and not encourage the
generation of short lists. As such, allowing a variable number of keyphrases to be
extracted is undesirable when F-score is used for evaluation. The desire for longer
keyphrase lists is also why we chose to evaluate the top 10 keyphrases instead of the
top 5 keyphrases.
2.2.2
Cross-Validation
Due to the small size of the available datasets, it is necessary to perform cross-validation.
We use repeated random sub-sampling validation. This approach was
chosen because it allows the number of trials and the size of the training sets to be
chosen independently. In contrast, k-fold validation does not offer that freedom. The
F-score is computed for each sub-sampling and then averaged over the sub-samplings.
For training sets on the order of 100 to 200 documents, the standard deviation of the
F-score for a single sub-sample typically exceeds 1. As such the use of cross-validation
to eliminate noise is essential. Cross-validation also helps address and avoid issues of
overfitting.
We let σ denote the standard deviation of the F-scores from the sub-samplings. The standard deviation of the averaged F-score computed from these sub-samplings is estimated to be

σ_average = σ / √(number of subsamplings)

When reporting our results, we report error to be twice the estimated standard deviation of the averaged F-score. So a performance of 43 ± 12 corresponds to an averaged F-score of 43 with an estimated standard deviation, σ_average, of 6.
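The sketch below illustrates this procedure; the evaluate(train, test) hook, which trains an extractor on the training documents and returns its F-score on the test documents, is a hypothetical placeholder, and at least two trials are assumed so that a standard deviation can be estimated.

```python
import random
import statistics

def repeated_subsampling(documents, n_train, n_trials, evaluate, seed=0):
    """Repeated random sub-sampling validation.  Each trial draws a random
    training set of n_train documents, tests on the rest, and records the
    F-score.  Returns the averaged F-score and the reported error, taken as
    twice the estimated standard deviation of the average."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        docs = list(documents)
        rng.shuffle(docs)
        scores.append(evaluate(docs[:n_train], docs[n_train:]))
    mean = statistics.mean(scores)
    sigma = statistics.stdev(scores)              # std dev of individual trials
    error = 2 * sigma / (len(scores) ** 0.5)      # 2 * sigma_average
    return mean, error
```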
Previous papers in keyphrase extraction vary in their use of cross-validation.
Medelyan et al. (2009) utilize 10-fold cross-validation in their experiments [19]. In
contrast, Semeval 2010 Task 5 uses a single fixed split between training and test data
[14]. Due to the small dataset, this introduces substantial noise into the reported performance statistics. Further exacerbating this issue, each team was allowed to submit
the results of up to three runs of their system. As such, three different parameter
configurations could be used, causing performance to be overestimated. A similar
issue is present in the WINGUS system, where a single split of the training data was
used for feature selection [25]. The lack of cross validation in the feature selection
process may have led to poor choices of features, due to over-fitting to the single split.
2.2.3
Comparison to Human Indexers
Creating a keyphrase extraction system that is able to perfectly predict the gold
standard keyphrase lists is not a plausible goal. There are many good keyphrases for
any paper, and the authors and readers only select a subset thereof. Different human
indexers select different sets of keyphrases for the same papers.
Hence achieving
an F-score of 100 is not feasible. To understand how the performance of keyphrase
extraction systems compares to the theoretical maximum performance, we can compare
them with skilled human indexers.
Medelyan et al. compare the performance of Maui to that of human taggers by comparing inter-indexer consistency on the CiteULike180 dataset [20].
They use
the Rolling's metric to evaluate the level of consistency between indexers. Rolling's
consistency metric is:

RC(I₁, I₂) = 2C / (A + B)

where C is the number of tags that indexers I₁ and I₂ have in common, and A and B are the number of tags assigned by I₁ and I₂ respectively [27]. They observed that
the inter-indexer consistency among the best human indexers is comparable to the
inter-indexer consistency between Maui and the best human indexers.
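For reference, Rolling's metric translates directly into code; the sketch below assumes the tag lists have already been normalized.

```python
def rolling_consistency(tags_1, tags_2):
    """Rolling's inter-indexer consistency: 2C / (A + B), where C is the number
    of tags the two indexers share and A, B are the sizes of their tag sets."""
    tags_1, tags_2 = set(tags_1), set(tags_2)
    if not tags_1 and not tags_2:
        return 0.0
    return 2 * len(tags_1 & tags_2) / (len(tags_1) + len(tags_2))
```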
One shortcoming of this approach is that Medelyan et al. are forced to choose
an arbitrary cutoff point for who qualifies as the "best human indexers." They select
the top 50% of the human indexers as their set of "best human indexers".
Before
this step, they pre-filter the set of human indexers to indexers that are consistent
with other indexers on several tags. Due to these arbitrary cutoffs, it isn't possible to
use this procedure to make precise comparisons between the performance of human
indexers and the performance of automatic keyphrase extraction systems.
A second issue is that their automatic keyphrase extraction algorithm selects the
same number of keyphrases for all documents, while the human indexers select a
variable number of keyphrases for each document. This introduces some biases into the evaluation.
As discussed previously, the number of keyphrases extracted per
document affects performance, so restricting the keyphrase extraction algorithm likely
puts it at a disadvantage relative to the human indexers.
We propose an improved evaluation metric that addresses some of the aforementioned issues. As will be discussed in Section 2.3, the Semeval dataset has both reader
and author assigned keyphrases. As such, we can compare the quality of the reader
and algorithm assigned keyphrases by evaluating their consistency with the author
assigned keyphrases. For each document, d, we select the top |Reader_d| keyphrases generated by the keyphrase extraction algorithm, where |Reader_d| is the number of
reader assigned keyphrases for document d. This ensures that neither the reader nor
the algorithm are given an advantage due to the number of keyphrases they select.
We evaluate consistency on a single document using Rolling's metric. To compute
overall consistency, we average Rolling's metric over all documents. Formally,
Consistency(A, B) = ( Σ_{d ∈ D} RC(A_d, B_d) ) / |D|

where D is the set of documents, and A and B are corresponding sets of keywords.
We compare Consistency(Author, Reader) to Consistency(Author, Algorithm), where |Algorithm_d| = |Reader_d| for all d. As with the F-score, we perform cross validation,
and average the inter-indexer consistency scores across runs.
This approach merges all of the readers into one indexer, effectively reducing the
number of human indexers to 2. On the Semeval dataset, the mapping between readers and keyphrases is not reported, so this is sufficient. On other datasets, such as the
CiteULike dataset, the mapping between readers and the keyphrases they assigned
is available. The above technique can be generalized to utilize this additional information. We can evaluate the internal consistency of human indexers by computing
the average inter-indexer consistency between pairs of indexers.
Then this can be
averaged over all documents. Formally,
HumanConsistency = (1/|D|) Σ_{d ∈ D} [ ( Σ_{l₁ ∈ L_d} Σ_{l₂ ∈ L_d \ {l₁}} RC(l₁, l₂) ) / ( |L_d| (|L_d| − 1) ) ]

where L_d is the set of lists of keyphrases for document d. To evaluate the con-
sistency of the algorithm with the human indexers, we compute the average of the
pairwise consistencies per document as before, but when computing consistency for
each pair of indexers, we replace the keyphrase list of the first human indexer with the
keyphrase list of the algorithm. Formally,
AlgorithmConsistency = (1/|D|) Σ_{d ∈ D} [ ( Σ_{l₁ ∈ L_d} Σ_{l₂ ∈ L_d \ {l₁}} RC(top |l₁| results of the algorithm on d, l₂) ) / ( |L_d| (|L_d| − 1) ) ]
This generalization enables the comparison of inter-indexer consistency between
multiple human indexers and automatic keyphrase extraction systems.
It avoids
giving unfair advantages to either the human or computer indexers by restricting the
algorithm to submitting the same number of keyphrases as the human indexers.
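The sketch below illustrates this generalized consistency computation, reusing the rolling_consistency function from earlier. The extract_top(doc, k) hook, which returns the algorithm's k highest ranked keyphrases for a document, is a hypothetical placeholder, and documents with fewer than two human keyphrase lists are skipped.

```python
from itertools import permutations

def human_consistency(keyphrase_lists_by_doc):
    """Average pairwise Rolling's consistency among human indexers, averaged
    per document and then over documents.  keyphrase_lists_by_doc holds, for
    each document, the list of keyphrase lists (one per human indexer)."""
    doc_scores = []
    for lists in keyphrase_lists_by_doc:
        pairs = list(permutations(lists, 2))  # |L_d| * (|L_d| - 1) ordered pairs
        if pairs:
            doc_scores.append(sum(rolling_consistency(l1, l2)
                                  for l1, l2 in pairs) / len(pairs))
    return sum(doc_scores) / len(doc_scores)

def algorithm_consistency(documents, keyphrase_lists_by_doc, extract_top):
    """Same averaging, but the first indexer in each pair is replaced by the
    algorithm's top |l1| keyphrases for that document."""
    doc_scores = []
    for doc, lists in zip(documents, keyphrase_lists_by_doc):
        pairs = list(permutations(lists, 2))
        if pairs:
            doc_scores.append(sum(rolling_consistency(extract_top(doc, len(l1)), l2)
                                  for l1, l2 in pairs) / len(pairs))
    return sum(doc_scores) / len(doc_scores)
```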
2.2.4
Use of Evaluation Systems
In summary, we have presented two schemes for the evaluation of keyphrase extraction
systems. The first, macro-averaged F-score, is consistent with the evaluation methods used
in earlier works on keyphrase extraction. Our use of repeated random sub-sampling
validation in the computation of F-score ensures properly cross-validated results. We
will use this metric throughout this thesis to measure the effects of our improvements.
The second evaluation metric is a new inter-indexer consistency metric which allows
for the comparison of the relative performance of human indexers and automatic
extraction systems. We will use this metric to compare the performance of our system
to human indexers in Chapter 5.
2.3
Datasets
We will focus on three datasets of research documents, Collabgraph, CiteULike180,
and Semeval 2010 Task 5. As mentioned in Section 2.1, the CiteULike180 and Semeval
datasets are preexisting keyphrase extraction datasets which have been used in other
papers. The Collabgraph dataset is a new dataset which we compiled for this thesis.
These three datasets vary considerably in terms of the types of papers, the quality
of their textual data, and the processes used to select the gold standard keyphrases.

Table 2.1: Statistics for Datasets

                        Collabgraph    CiteULike180    Semeval
Doc Length                    49453           35807      48185
Keyphrases per Doc             5.00            5.22       3.88
Keyphrase Length               1.80            1.16       1.96
Oracle Accuracy                 82%             85%        83%
Appendix A contains example keyphrases from representative documents from each
dataset.
Table 2.1 shows summary statistics for the three datasets.
Average document
length in characters, the average number of keyphrases per document, the average
number of words per keyphrase, and oracle accuracy are reported. Oracle accuracy is
the fraction of keyphrases that appear in the text of their document. Oracle accuracy
is evaluated after normalization by stemming and lowercasing.
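A small sketch of how oracle accuracy can be computed is shown below; the NLTK Porter stemmer is used here only as a stand-in for whatever stemmer the extraction system actually applies.

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def normalize(phrase):
    """Lowercase a phrase and stem each of its words."""
    return " ".join(_stemmer.stem(word) for word in phrase.lower().split())

def oracle_accuracy(texts, gold_lists):
    """Fraction of gold keyphrases that appear, after normalization,
    somewhere in the text of their own document."""
    found = total = 0
    for text, gold in zip(texts, gold_lists):
        normalized_text = normalize(text)
        for keyphrase in gold:
            total += 1
            if normalize(keyphrase) in normalized_text:
                found += 1
    return found / total
```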
2.3.1
Collabgraph
The Collabgraph dataset consists of 942 research papers with author assigned keyphrases.
They are drawn from a wide variety of fields and journals. Each paper has at least
one author from MIT. The keyphrase lists were generated by running a simple script
to identify lists of keyphrases in the texts of a larger set of documents.
Not all of
these documents contained lists of keyphrases, and so they are omitted from the
dataset. After their identification, the keyphrase lists were removed from documents
so that they did not interfere with the training or evaluation process. The scripts for this
document processing are included with Oahu.
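The actual processing scripts ship with Oahu and are not reproduced here; the snippet below is only a hypothetical illustration of the general idea, with the regular expression and splitting rules chosen for illustration rather than taken from those scripts.

```python
import re

# Hypothetical illustration only: lines such as "Keywords: a, b, c" or
# "Index Terms - a; b; c" are detected, split, and then stripped from the text.
KEYWORD_LINE = re.compile(r"^\s*(?:keywords?|index terms)\s*[:\-]\s*(.+)$",
                          re.IGNORECASE | re.MULTILINE)

def find_keyphrase_list(text):
    """Return the author keyphrase list found in a paper's text, or None."""
    match = KEYWORD_LINE.search(text)
    if match is None:
        return None
    return [kp.strip() for kp in re.split(r"[;,]", match.group(1)) if kp.strip()]

def strip_keyphrase_list(text):
    """Remove the detected keyword line so it cannot leak into training or evaluation."""
    return KEYWORD_LINE.sub("", text)
```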
Since this dataset is not comprised of freely available papers, we are not able to
provide the textual data.
Instead, we have provided a list of the papers used, in
BibTeX format, so that the dataset can be downloaded by anyone with access to
the papers from the dataset. This list of papers is included in the Oahu project on
GitHub.
2.3.2
CiteULike180
The CiteULike180 corpus is 180 documents with tags assigned by users of the website
citeulike.org [19]. CiteULike is an online bookmarking service for scholarly papers,
where users can mark papers with tags. Medelyan et al. generated a corpus from the
papers on CiteULike. To address issues of noise, they restricted to papers which had
at least three tags that at least two users agreed upon. For taggers, they restricted
to taggers with at least two co-taggers, where two users are said to be "co-taggers"
if they have tagged at least one common document. They also restricted to papers
from High-Wire and Nature, since those journals had easily accessible full text PDF
files. As in Medelyan et al. (2009), we consider only tags which at least two users
agreed upon. This gives a total of 180 documents, with 946 keyphrases assigned.
The keyphrases seen in this dataset are notably shorter than the keyphrases from
the other datasets, typically only a single word in length. The papers are primarily
about biology, with a smaller number of math and computer science papers mixed in.
There are some issues with the cleanliness of this data. A number of the documents
contain sections from other papers at their beginning or end. This is because the
text of the documents was generated by running pdf2text on PDF files downloaded
from citeulike.org. Some of these PDFs are of pages from journals, such as Nature,
where the first or last page of an article often contains the end or beginning of another
article in the journal, respectively. This form of noise has an impact on the accuracy
of keyphrase extraction by introducing spurious candidates and padding the length
of the documents.
As discussed in Section 2.1, a number of keyphrase extraction
systems utilize the locations of the first and last occurrence, which are affected by
these extraneous passages.
2.3.3
Semeval 2010 Task 5
The corpus from Semeval 2010 Task 5 consists of 244 conference and workshop papers
from the ACM Digital Library [14][15]. The documents come from four 1998 ACM
classifications: C2.4 (Distributed Systems), H3.3 (Information Search and Retrieval),
I2.11 (Distributed Artificial Intelligence - Multiagent Systems) and J4 (Social and Behavioral
Science - Economics). There are an equal number of documents for each classification,
and the documents are marked with their classification. Each document has author
assigned keyphrases and reader assigned keyphrases. The reader assigned keyphrases
were assigned by 50 students from the Computer Science department of the National
University of Singapore. Each reader was given 10 to 15 minutes per paper.
Evaluation in Semeval 2010 Task 5
Semeval 2010 Task 5 specifies a standard evaluation procedure. The corpus is split
into 144 training documents and 100 testing documents.
As previously mentioned
in Section 2.2, there are some issues with the procedure they specify and the script
provided to perform this procedure. Instead of evaluating using their split between
training and testing documents, we merge the two sets, and evaluate using cross
validation.
2.3.4
Author vs Reader Selected Keyphrases
Intuitively, we would expect the quality of author assigned keyphrases to exceed that
of reader assigned keyphrases. Authors have a deep understanding of the paper they
write, and hence are in a good position to assign keyphrases. A direct examination
of the Semeval dataset reveals that the keyphrases chosen by readers have some
shortcomings. The readers assigning keyphrases to the Semeval corpus were given 10
to 15 minutes per document, a fairly limited amount of time [14]. As such, we would
expect that location within the document would strongly inform their keyphrase
selection. This hypothesis is confirmed by the analysis of Nguyen et al. of this corpus
[25]. In Semeval 2010 Task 5, systems that utilize document section analysis, such as
HUMB, WINGUS, and SEERLAB substantially outperform systems such as Maui
which lack such features [15]. On the author assigned keywords, document section
analysis plays a much less important role. Since author selected keyphrases are high
quality, due to being generated without time pressure by people who have fully read
the papers, we will focus on author selected keyphrases.
2.3.5
Testing Datasets
Throughout this paper we will focus on the Collabgraph dataset. As discussed above,
author selected keyphrases are preferable to reader selected keyphrases.
For this
reason, we did not focus on the CiteULike dataset. The Semeval and Collabgraph
datasets both have author assigned keyphrases, but the Collabgraph dataset is substantially larger. Since a larger corpus reduces the risks of overfitting, we focus on
the Collabgraph corpus.
When measuring performance, we use 200, 100, and 150 training documents for
the Collabgraph, CiteULike180, and Semeval datasets, respectively.
These training
document counts were chosen so that the number of testing documents was not small
enough to substantially increase the variance of the F-score. Empirically, at least
50 testing documents were needed to avoid this issue.
This form of noise can be
resolved by running additional trials, but runtimes were already very long, so this
was undesirable. The numbers of training documents were also chosen to be similar
so as to lessen the differences between the datasets. For this reason, the Collabgraph
document count was chosen to be 200 instead of 400. Additionally, using 400 training
documents would have substantially slowed the running of experiments.
2.4
Baseline
In this thesis, we build upon Maui, an open source keyphrase extraction system.
Maui achieves performance at or very near the state of the art on author assigned
keyphrases. In this section, we will discuss Maui in detail and report its performance
on all three datasets.
2.4.1
Baseline System
Maui performs keyphrase extraction in the standard fashion, first heuristically filtering to a list of candidate keyphrases and then selecting the top candidates with a
trained ranker [19]. Filtering is performed by selecting phrases of n words or less that
do not begin or end with a stopword. The ranking step is performed using bagged
decision trees generated from the training set. Maui utilizes a number of features:
1. TF, IDF, and TF-IDF of the candidate keyphrase.
2. Phrase Length, the number of words in the candidate.
3. Position of First Occurrence and Position of Last Occurrence, both
normalized by document length. Spread, the distance between the first and
last occurrences, also normalized by the document length.
4. Domain Keyphraseness, the number of times the candidate was chosen as a
keyphrase in the training dataset.
Maui also allows for the use of domain specific thesauruses or Wikipedia to compute additional features. In this thesis, we will only use the Wikipedia features:
1. Wikipedia Keyphraseness, the probability of an appearance of the candidate
in an article being an anchor. An anchor is the visible text in a hyperlink to
another article.
2. Inverse Wikipedia Keyphraseness, the probability of a candidate's article
being used in the text of other articles. Given a candidate, c, with corresponding
article A, the inverse Wikipedia keyphraseness is the number of incoming links,
inLinksTo(A) divided by the total number of Wikipedia articles. Then
-log 9 2
is applied, giving:
IW F(c)
-log 2
inLinksTo(A)
-
N
N
Since bagged decision trees are used, the normalization used here is immaterial.
This feature is equivalent to inLinksTo(A).
3. Total Wikipedia Keyphraseness, the sum of the Wikipedia keyphrasenesses
of the corresponding articles for each phrase in the document that was mapped
to the candidate.
4. Generality, the distance between the category corresponding to the candidate's
Wikipedia article and the root of the category tree.
By default, Maui filters to candidates that occur at least twice and uses 10 bags
of size 0.1 during ranking.
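To make the pipeline concrete, the sketch below shows a much simplified version of this style of candidate filtering and feature computation. It is not Maui's implementation: stemming, IDF, and the Wikipedia features are omitted, the stopword list is assumed to be supplied by the caller, and domain keyphraseness is approximated by a dictionary of training-set counts.

```python
from collections import Counter

def candidate_phrases(tokens, stopwords, max_len=3):
    """All n-grams of up to max_len tokens that do not begin or end with a
    stopword, returned with their starting positions."""
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] not in stopwords and gram[-1] not in stopwords:
                candidates.append((" ".join(gram), i))
    return candidates

def basic_features(tokens, stopwords, domain_counts, max_len=3):
    """Per-candidate features in the spirit of the list above: term frequency,
    phrase length, normalized first/last occurrence positions, spread, and
    domain keyphraseness (times the candidate was a gold keyphrase in training)."""
    cands = candidate_phrases(tokens, stopwords, max_len)
    tf = Counter(phrase for phrase, _ in cands)
    first, last = {}, {}
    for phrase, pos in cands:          # positions are visited in increasing order
        first.setdefault(phrase, pos)
        last[phrase] = pos
    n = max(len(tokens), 1)
    return {p: {"tf": tf[p],
                "length": len(p.split()),
                "first_occurrence": first[p] / n,
                "last_occurrence": last[p] / n,
                "spread": (last[p] - first[p]) / n,
                "domain_keyphraseness": domain_counts.get(p, 0)}
            for p in tf}
```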
The performance of Maui in Semeval 2010 Task 5 indicates it is at or very close
to the state of the art for extraction of author assigned keyphrases. The only system
which outperformed Maui on this task was HUMB. As previously discussed in Section
2.2, there are a number of issues with the Semeval 2010 Task 5 evaluation system
which make it difficult to determine if this outperformance was due to noise or a
genuine advantage. A trial of Maui on the particular split used by Semeval 2010 Task
5 indicates that the training-test split used by the task may have been particularly
bad for Maui. Furthermore, HUMB utilized an additional 156 training documents
compared to the 144 documents used by the other systems evaluated in this task
[17][14].
This more than doubled the number of training documents available to
HUMB. As can be seen below, the number of training documents has a substantial
effect on the performance of keyphrase extraction systems.
Although this was fair
within the rules of Semeval 2010 Task 5, it makes it difficult to determine if HUMB
has any real advantage over Maui, or if its outperformance was simply due to a larger
training corpus.
Unfortunately, the results of Semeval 2010 Task 5 are the only
published performance numbers for HUMB, and the system is not open-source. As
such, we will not compare our performance with HUMB.
2.4.2
Baseline Performance
Tables 2.2, 2.3, and 2.4 show the performance of Maui with and without Wikipedia
features, for a range of training document counts. Note that we run Maui with all parameters set at their default values. This includes only considering candidate keyphrases that appear at least twice.

Table 2.2: Collabgraph Baseline Performance

Training Docs    w/o Wikipedia    with Wikipedia
50               14.3 ± 0.3       16.2 ± 0.7
100              15.4 ± 0.2       17.1 ± 0.8
200              16.5 ± 0.2       17.0 ± 0.6
400              17.9 ± 0.1       18.7 ± 0.5

Table 2.3: CiteULike180 Baseline Performance

Training Docs    w/o Wikipedia    with Wikipedia
50               28.8 ± 0.2       29.2 ± 0.8
100              31.1 ± 0.3       31.2 ± 0.7

Table 2.4: Semeval Baseline Performance

Training Docs    w/o Wikipedia    with Wikipedia
50               15.9 ± 0.2       16.7 ± 0.6
100              17.7 ± 0.2       17.4 ± 0.6
200              18.7 ± 0.4       18.6 ± 1.8

Table 2.5: Summary Table

                  Collabgraph    CiteULike180    Semeval
w/o Wikipedia     16.5 ± 0.1     31.1 ± 0.2      18.2 ± 0.1
with Wikipedia    17.5 ± 0.2     31.2 ± 0.2      18.3 ± 0.2
From these tables, we can see that increas-
ing the number of training documents has a substantial positive effect. The use of
Wikipedia features improves performance on the Collabgraph and Semeval datasets.
These improvements are more pronounced when the training sets are smaller. The use of Wikipedia does not provide any substantial performance improvement on the CiteULike180 dataset.
Table 2.5 shows the baseline performance of Maui on the Collabgraph, CiteULike180, and Semeval datasets for 200, 100, and 150 training documents, respectively.
These are the training document counts we will be using for the rest of this thesis.
As such, these are the performance numbers that all improvements will be compared
to.
Chapter 3
Ranking Methods
In this chapter, we present several advancements to the ranking methods used for
keyphrase extraction. In Maui, bagged decision trees are used to rank the candidate
keyphrases generated by the filtering stage.
Our first advancement is a modified
training procedure where the weight of positive examples is increased during bagging.
This is seen to improve both performance and runtime.
We also discuss how this
improvement relates to the tuning of bagging parameters. Our second advancement
is the use of a second set of bagged decision trees for reranking. We also propose a
number of extensions to this technique.
3.1
Reweighting and Bagging
Maui, HUMB, and a variety of other keyphrase extraction systems rely on bagged
decision trees to rank candidate keyphrases [20][17]. In the training process, these
trees are trained on the candidate keyphrases that the filtering stage generates from
the training documents. Here, the number of negative examples vastly exceeds the
number of positive examples, due to the simplicity of the candidate extraction process and
the small number of gold standard keyphrases per document. Maui simply extracts
N-grams that do not begin or end with a stopword. This results in thousands to
tens of thousands of candidate keyphrases per document. In all three datasets, the
average document has more than 100 times as many negative candidates as it has
positive candidates.
A second complementary issue is that during bagging some of the original data points may be omitted. In bagging, m new datasets D_i are drawn with replacement from the original dataset D, such that |D_i|/|D| = p. The probability that some candidate c ∈ D is not in any dataset D_i is:

P(c ∉ ∪_i D_i) = (1 − 1/|D|)^(mp|D|)

Assuming that |D| is large, this can be approximated as

P(c ∉ ∪_i D_i) ≈ e^(−mp)

By default, Maui uses parameters m = 10 and p = 0.1, meaning that P(c ∉ ∪_i D_i) ≈ e^(−1) ≈ 0.37. So on average, 37% of data points are not used to train any of the decision trees. Naturally, this can be addressed by increasing the values of m and p, however this comes at the cost of runtime.
3.1.1 Reweighting positive examples

The samples in D can be reweighted before the bagged sets, {D_i}, are drawn. After upweighting the positive examples to w times the weight of the negative examples, the probability of a positive candidate c_+ appearing in none of the new datasets D_i is

P(c_+ ∉ ∪_i D_i) = (1 - w/(w|D_+| + |D_-|))^{mp|D|}

where D_+ and D_- are the sets of positive and negative candidates, respectively. Assuming that |D| is large and w|D_+| ≪ |D|,

P(c_+ ∉ ∪_i D_i) ≈ (1 - w/|D|)^{mp|D|} ≈ e^{-wmp}
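As a concrete check of these expressions, the short sketch below (our own illustration, not part of Maui) computes both the exact omission probability and the e^{-wmp} approximation for the default m = 10 and p = 0.1; the candidate counts are hypothetical.

    import math

    def omit_prob_exact(w, n_pos, n_neg, m=10, p=0.1):
        # Exact probability that an upweighted positive candidate is absent
        # from all m bags, following the expression above.
        n = n_pos + n_neg
        total_weight = w * n_pos + n_neg
        draws = round(m * p * n)  # total number of draws across all bags
        return (1 - w / total_weight) ** draws

    def omit_prob_approx(w, m=10, p=0.1):
        # Approximation e^{-wmp}, valid for large |D| with w|D+| << |D|.
        return math.exp(-w * m * p)

    # Hypothetical document set: 50 positive and 5000 negative candidates.
    for w in (1, 2, 4, 8):
        print(w, round(omit_prob_exact(w, 50, 5000), 5), round(omit_prob_approx(w), 5))

For w = 1 both quantities are close to 0.37, and they fall rapidly as w grows, matching the values reported in Table 3.1.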
Table 3.1: Reweighting Tuning

w     P(c_+ ∉ ∪_i D_i)    F-score
1     0.37                16.5 ± 0.2
2     0.14                16.9 ± 0.2
4     0.018               17.4 ± 0.1
8     0.00034             17.3 ± 0.1
16    1.1e-07             16.9 ± 0.2
32    1.3e-14             16.1 ± 0.2
64    1.6e-28             13.1 ± 0.4
Table 3.1 shows the effect of varying w on the Collabgraph dataset. Empirically, we see that increasing w can dramatically improve performance at negligible cost to runtime. We see that w = 4 yields the highest performance. Note that for w = 4, P(c_+ ∉ ∪_i D_i) ≈ 0.02, so on average all but 2% of the positive examples will be used at least once.
This performance improvement makes intuitive sense, especially with the default bagging parameters of m = 10 and p = 0.1. This reweighting ensures that the precious positive data points are used in at least one decision tree with high probability. The negative data points are still omitted, but the information encoded by the negative data points is not particularly unique. A second benefit is that reweighting results in deeper decision trees, because there are more positive candidates in the training data for each tree. Deeper decision trees express the roles of a greater number of features. Since C4.5 decision trees are used, less predictive features will only appear in the decision trees if the trees are sufficiently deep. This is because the most predictive features occupy the first few layers of the trees.
3.1.2 Comparison to Tuning of Bagging Parameters

The bagging parameters, bag size and bag count, can also be tuned for performance. Table 3.2 shows the effects of tuning the bagging parameters. Increasing the number of bags and the bag size yields substantial performance improvements. However, these improvements come at a high runtime cost, with runtime scaling linearly in the number of bags and super-linearly in the bag size.
Table 3.2: Tuning Bagging Parameters

Number of Bags \ Bag Size    0.1           0.2           0.4
10                           16.4 ± 0.3    17.1 ± 0.3    17.0 ± 0.2
20                           17.2 ± 0.3    17.5 ± 0.2    17.3 ± 0.3
40                           17.8 ± 0.3    17.8 ± 0.2    18.0 ± 0.2
80                           18.1 ± 0.3    18.1 ± 0.2    17.9 ± 0.2
Runtime is an important constraint in
this problem. Generating all of the experimental results for this thesis took upward
of 3 days of runtime on a single machine (Quad core i7-4770k at 3.7GHz running on
6 threads, using more than 12GB of RAM). As such, increasing bag number and size
is undesirable.
Hence, reweighting is very useful, as it improves performance with
almost no cost to runtime.
Note that the effect of reweighting is similar to the effect of increasing the bag size. Doubling the bag size and doubling the weight of the positive examples both double the expected number of positive examples per decision tree. The performance increases from scaling up the bag size in Table 3.2 are comparable to the improvements seen from reweighting in Table 3.1, indicating that there is some validity to this interpretation.
Reweighting positive examples can be effectively combined with the tuning of the bagging parameters. Setting the number of bags to 20, the bag size to 0.1, and w to 4 gives an F-score of 18.1 ± 0.2. This is equal to the highest F-score seen from tuning the bagging parameters, but requires substantially less runtime. So, the reweighting of positive examples reduces the runtime required to achieve maximum performance.
3.1.3 Summary of Reweighting
Reweighting provides a substantial performance improvement on all three datasets.
Table 3.3 shows the improvement from upweighting positive examples by a factor of
4 on each dataset. Unlike increasing bag size or bag count, this performance gain
comes at minimal cost to runtime. Reweighting can also be productively combined
Table 3.3: Reweighting Summary

         Collabgraph    CiteULike180    Semeval
w = 1    16.5 ± 0.1     31.1 ± 0.2      18.2 ± 0.1
w = 4    17.3 ± 0.1     32.4 ± 0.2      18.8 ± 0.2
with the tuning of bagging parameters to improve performance and reduce runtime.
3.2 Reranking
In reranking, the top k results from the initial ranking step are passed to a reranker
which rescores them.
The reranker is trained on the output of the ranker on the
training set.
Reranking the output of an initial ranking step with a more sophisticated scorer
has been used with great success in parsing tasks [3].
In decoding tasks, rerank-
ing allows for the use of more sophisticated scoring functions which cannot be used
during the initial ranking step due to computational limitations. The exponential
space of parse trees makes it difficult to use global features during the ranking stage.
Unlike decoding, the candidate space in keyphrase extraction is not exponentially
large, typically consisting of several thousand candidates. However, this space is sufficiently large that computing expensive features, such as Wikipedia features, can be
prohibitively slow.
3.2.1 Bagged Decision Trees
Bagged decision trees can be used for reranking. Even with the same set of features
at the ranking and reranking stages, the chaining of two sets of bagged decision trees
has advantages over a single bagged decision tree ranker. Since the filtering step used
by Oahu is simple, it generates many terrible candidates. For example, Oahu would
generate the candidate "filtering step used by Oahu" from the previous sentence.
This is clearly a poor candidate, but would affect the training of the bagged decision
trees used for ranking. The bagged decision trees in the ranking step learn which
features discriminate between positive candidates and all negative candidates. When
the reranking step is added, it is trained on the output of the ranking stage, the good
candidates. Poor candidates such as "filtering step used by Oahu" are filtered out by
the ranking stage. Since the reranker takes the top candidates as its input, it learns
which features discriminate between the positive candidates and the good candidates
which did not appear in the gold standard lists.
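The sketch below illustrates this two-stage arrangement using scikit-learn's bagged decision trees as a stand-in for Maui's and Oahu's Weka-based bagged C4.5 trees; the per-document feature matrices, labels, and helper names are assumptions of ours rather than the systems' actual interfaces.

    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    def make_bagged_trees(n_bags=10, bag_size=0.1):
        # Stand-in for bagged C4.5 trees (scikit-learn grows CART trees instead).
        return BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=n_bags, max_samples=bag_size)

    def train_two_stage(docs, k=160):
        # docs: list of (features, labels) numpy-array pairs, one per training document.
        X = np.vstack([feats for feats, _ in docs])
        y = np.concatenate([labels for _, labels in docs])
        ranker = make_bagged_trees().fit(X, y)

        # The reranker sees only each document's top-k candidates from the ranker,
        # so it learns to separate keyphrases from the other *good* candidates.
        top_feats, top_labels = [], []
        for feats, labels in docs:
            scores = ranker.predict_proba(feats)[:, 1]
            top = np.argsort(scores)[::-1][:k]
            top_feats.append(feats[top])
            top_labels.append(labels[top])
        reranker = make_bagged_trees(n_bags=160, bag_size=0.2).fit(
            np.vstack(top_feats), np.concatenate(top_labels))
        return ranker, reranker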
There are four hyperparameters for bagged decision trees in reranking. The first
two parameters are the number of candidates passed to the reranker during training
and the number of candidates passed to the reranker during testing. The second two
parameters are the bagging parameters, bag size and bag count.
Table 3.4 shows performance as a function of the number of candidates passed to the reranker during training and testing. Traditionally, the same number of candidates is passed to the reranker during training and testing. We consider the possibility of using different values, since it is theoretically possible that using different numbers of candidates for training and testing could improve performance. For example, decreasing the number of testing candidates could improve performance by increasing the average quality of the candidates passed to the reranker.
Based on the results in Table 3.4, we see that there is a large set of parameter values that maximize performance. We choose 160 training candidates and 160 testing candidates because having the same number of training and testing candidates increases the symmetry between the training and testing steps, which is useful when we consider features like semantic relatedness later in this chapter. Although 320 candidates would also work, we choose 160 because fewer reranking candidates result in lower runtime.
We also tune the bagging parameters. Table 3.5 shows F-score for various reranking bag counts and bag sizes. This experiment was performed with 160 reranking candidates on the Collabgraph dataset. It shows that maximum performance is achieved
with 160 bags with a bag size of 0.2.
Reranking improves the F-score dramatically on all three datasets. As can be seen in Table 3.6, reranking improves the F-score by 2 to 4 points. Table 3.6 was generated with the hyperparameters chosen above and no Wikipedia features.
Table 3.4: Tuning Reranker Candidate Counts

Training \ Testing    40            80            160           320
40                    18.2 ± 0.2    17.4 ± 0.2    16.6 ± 0.2    16.4 ± 0.2
80                    18.3 ± 0.2    18.2 ± 0.2    18.2 ± 0.2    18.2 ± 0.2
160                   18.4 ± 0.2    18.5 ± 0.2    18.4 ± 0.2    18.4 ± 0.1
320                   18.5 ± 0.2    18.6 ± 0.2    18.4 ± 0.2    18.5 ± 0.1
Table 3.5: Tuning Reranker Bagging Parameters

Bag Count \ Bag Size    0.1           0.2           0.4           0.8
40                      18.1 ± 0.2    18.2 ± 0.2    18.1 ± 0.1    17.8 ± 0.2
80                      18.3 ± 0.2    18.3 ± 0.1    18.5 ± 0.1    18.1 ± 0.1
160                     18.4 ± 0.2    18.6 ± 0.2    18.7 ± 0.2    18.0 ± 0.1
320                     18.4 ± 0.1    18.4 ± 0.1    18.4 ± 0.1    18.1 ± 0.2
3.2.2 Support Vector Machines

We also experimented with the use of support vector machines for reranking, using the SVMlight package [10]. Linear, polynomial, radial, and sigmoidal kernels were tested,
with a variety of hyperparameter values. The reweighting of positive examples was
also considered to address the imbalanced dataset. However, in all cases, reranking
with SVMs was worse than no reranking. For the simpler linear kernel, performance
decreased dramatically as the SVM failed to split the data. For the other kernels,
the performance was often only slightly reduced relative to no reranking. However, a
substantial fraction of the time, no good fit was found and performance decreased dramatically.
Table 3.6: Reranking Summary

                  Collabgraph    CiteULike180    Semeval
w/o Reranking     16.5 ± 0.1     31.1 ± 0.2      18.2 ± 0.1
with Reranking    18.5 ± 0.1     35.2 ± 0.2      20.6 ± 0.2
The failure of support vector machines can be explained largely by the noise present in the data. As evidenced by the low inter-indexer consistency seen between humans, a single human indexer will omit many good keyphrases [19]. As such, training data generated by a single human indexer marks only a fraction of the candidates that would make good keyphrases as positive examples; the remaining good candidates are labeled as negatives, mixing considerable noise into the training data. The data is also highly unbalanced and quite complex, which makes training difficult.
3.2.3 Extensions

New features can be added at the reranking stage. This is useful for two categories of features: features which are too computationally expensive to be computed for all candidates, and features that are computed using the ranked list of candidates generated by the ranker.
Delayed Feature Computation
When Wikipedia features are used, they dominate the runtime because their computation is very expensive. As such, a natural optimization is to only compute Wikipedia
features for candidates that are selected for reranking.
Table 3.7 shows the effects
of adding Wikipedia features at different stages of the ranking and reranking process.
This is on the Collabgraph dataset with the reranking hyperparameters discussed in
Section 3.2.1. As can be seen in the table, delaying the computation of Wikipedia
features until after ranking has minimal effect on the F-score, but decreases runtime
by a factor of 5. As such, computing Wikipedia features before reranking instead of
before ranking is suitable for runtime sensitive environments. In Chapter 4, when we
explore the effects of removing features, we will employ this optimization to make
experimentation with Wikipedia features feasible.
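A minimal sketch of this optimization, under our own naming assumptions, is shown below: cheap_features and expensive_features are hypothetical callables standing in for the inexpensive features and the Wikipedia-based features, and ranker and reranker are assumed to be already-trained models whose expected feature sets match what is computed here.

    import numpy as np

    def rank_then_rerank(candidates, ranker, reranker, cheap_features,
                         expensive_features, k=160):
        # Cheap features are computed for every candidate; the ranker scores them.
        X_cheap = cheap_features(candidates)
        scores = ranker.predict_proba(X_cheap)[:, 1]
        top = np.argsort(scores)[::-1][:k]
        top_candidates = [candidates[i] for i in top]

        # Expensive (e.g. Wikipedia-based) features are computed for only k
        # candidates instead of thousands, then appended for the reranker.
        X_top = np.hstack([X_cheap[top], expensive_features(top_candidates)])
        rerank_scores = reranker.predict_proba(X_top)[:, 1]
        order = np.argsort(rerank_scores)[::-1]
        return [top_candidates[i] for i in order]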
Table 3.7: Adding Wikipedia Features at Different Stages

Wikipedia Features           F-score       Runtime (normalized)
None                         18.7 ± 0.2    1.0
Computed before Ranking      20.4 ± 0.2    17.2
Computed before Reranking    20.5 ± 0.3    3.4
Semantic Relatedness

Medelyan et al. introduce a semantic relatedness feature, semRel, computed using Wikipedia [19]. This feature is the average semantic relatedness of the Wikipedia article of the candidate to the articles of the other candidates in the document. This semantic relatedness feature is prohibitively computationally intensive when there are a substantial number of candidates, since its computation is quadratic in the number of candidates. Maui supports the use of this semantic relatedness feature when Wikipedia is used as a thesaurus. The use of Wikipedia as a thesaurus reduces the number of candidates, and hence the time required to compute semantic relatedness. We are not using Wikipedia as a thesaurus; we allow for keyphrases that are not the titles of Wikipedia articles.
We tried adding this semantic relatedness feature after ranking. Instead of computing the average of the semantic relatedness of the candidate to all other candidates in the document, we restricted the average to the top k candidates generated by ranking. However, we found that the addition of this semantic relatedness feature at the reranking stage gave no meaningful increase in F-score on any of the three datasets, and increased runtime by more than an order of magnitude.
Number of Higher Ranked Superstrings and Substrings
Author assigned keyphrase lists rarely contain two phrases such that one phrase contains the other. If an author chooses the keyphrase "information theory", they are unlikely to also choose "information" as a keyphrase. To allow the keyphrase extraction system to learn this, we introduce two new reranking features, numSuper and numSub: the number of superphrases and the number of subphrases, respectively, that are ranked higher than the candidate by the ranker.
Table 3.8: Effects of numSuper and numSub Features

            Collabgraph    CiteULike180    Semeval
Neither     18.6 ± 0.1     35.1 ± 0.1      20.5 ± 0.1
numSuper    18.9 ± 0.1     35.0 ± 0.1      20.8 ± 0.1
numSub      18.7 ± 0.1     35.0 ± 0.1      20.4 ± 0.1
Both        19.0 ± 0.1     35.2 ± 0.1      20.7 ± 0.1
String containment testing is done after normalization and stemming.
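A small sketch of how these counts could be computed from the ranked candidate list is shown below; the normalize argument is a hypothetical stand-in for the normalization and stemming step.

    def num_super_sub(ranked_candidates, normalize=lambda s: s.lower()):
        # For each candidate (best first), count how many higher-ranked candidates
        # contain it (numSuper) or are contained in it (numSub), after normalization.
        norm = [normalize(c) for c in ranked_candidates]
        features = []
        for i, cand in enumerate(norm):
            higher = norm[:i]
            num_super = sum(1 for h in higher if cand in h and cand != h)
            num_sub = sum(1 for h in higher if h in cand and cand != h)
            features.append((num_super, num_sub))
        return features

    # Example: "information" has one higher-ranked superphrase.
    print(num_super_sub(["information theory", "information", "entropy"]))
    # -> [(0, 0), (1, 0), (0, 0)]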
Table 3.8 shows the effects of adding the numSuper and numSub features on
all three datasets.
Individually, the features either improve performance or have
no significant effect on it. Together, the features improve performance on both the
Collabgraph and Semeval datasets.
The CiteULike180 dataset is not affected by the features in any significant fashion. This is consistent with the high frequency of single word keyphrases in the CiteULike180 dataset. Since the majority of keyphrases in the CiteULike180 dataset are a single word, there are fewer interactions between keyphrases of different lengths.
Anti-Keyphraseness

Some phrases may be unsuitable as keyphrases even though they appear to be suitable keyphrases based on the values of their features. For example, in physics papers, the phrase "electron" may appear to be a good keyphrase based on its tf-idf, position, and so on. However, physicists may not consider "electron" to be a suitable keyphrase because it is too general. As a result, "electron" is never chosen as a keyphrase in the training data. However, the keyphrase extraction algorithm will choose "electron" as a keyphrase on both the training and testing data.
To attempt to address this issue, we introduce an anti-keyphraseness feature,
antiKeyphr, which indicates how often a phrase was chosen as a keyphrase by the
algorithm but was not actually a keyphrase. Anti-keyphraseness is the fraction of
the times that the phrase was in the top k candidates selected by the ranker but was
not actually in the gold standard list. This feature is then added and used by the reranker. Formally,

antiKeyphr(c) = ( Σ_{d' ∈ D_R(c) \ {d}} (1 - K(c, d')) ) / |D_R(c) \ {d}|

where D_R(c) is the set of documents where c is in the top k candidates selected by the ranker, and d is the document containing c. K(c, d') is an indicator function which is 1 if c is a keyphrase of d' and 0 otherwise. If a phrase has not been encountered previously, it has an anti-keyphraseness of 1.
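The sketch below computes this quantity over a training collection; for simplicity it aggregates over all training documents rather than excluding the current document d as the formula does, and the input containers are assumptions of ours.

    from collections import defaultdict

    def anti_keyphraseness(top_k_by_doc, gold_by_doc):
        # top_k_by_doc: {doc_id: set of phrases in the ranker's top k}
        # gold_by_doc:  {doc_id: set of gold-standard keyphrases}
        # Returns, per phrase, the fraction of documents in which it was a top-k
        # candidate but not a gold keyphrase. Unseen phrases default to 1.
        appearances = defaultdict(int)
        misses = defaultdict(int)
        for doc_id, phrases in top_k_by_doc.items():
            for phrase in phrases:
                appearances[phrase] += 1
                if phrase not in gold_by_doc.get(doc_id, set()):
                    misses[phrase] += 1
        return {p: misses[p] / appearances[p] for p in appearances}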
The addition of the antiKeyphr feature substantially decreased performance. We explored a wide range of values of k and a few close variants of anti-keyphraseness and did not see any improvement. This feature often appeared high in the decision trees. We believe that this feature fails because it is tightly connected to other features and has high noise, so it decreases the accuracy of subsequent splits. Note that anti-keyphraseness is linked to domain keyphraseness because a phrase will have non-zero domain keyphraseness if and only if it has an anti-keyphraseness of less than 1. Since the keyphrase lists used for training are not complete, there is a high degree of noise, and suitable phrases may still be assigned high anti-keyphraseness.
Although the anti-keyphraseness feature did not prove effective, we believe that
the concept may still have value. It may be possible to incorporate this information
more effectively with a heuristic step following reranking or a different feature.
3.2.4 Summary of Reranking

Reranking proved to be a highly effective strategy, both as a stand-alone improvement and by enabling further extensions. Delaying the computation of expensive Wikipedia features until the reranking stage dramatically reduced runtime at minimal cost to F-score. The new numSuper and numSub features, computed from the ranked list output by the ranker, were seen to improve performance on the Collabgraph and Semeval datasets.
Chapter 4
Features
In this chapter, we work to improve the set of features used during the ranking step.
We introduce two new features, augmented domain keyphraseness and average word
length.
These two features are seen to substantially improve the accuracy of our
keyphrase assignment system. Poorly chosen features can have a dramatic negative
effect on the performance of bagged decision trees. As such, we also explore the effects
of feature removal across all three datasets.
4.1 Augmented Domain Keyphraseness

Maui utilizes a domain keyphraseness feature which indicates the number of times a candidate appears as a keyphrase in the training set. This feature is used during the ranking step, not the filtering step. Hence it does not prevent candidates that have a keyphraseness of zero from being chosen during keyphrase extraction. Formally, the domain keyphraseness of a candidate c from document d is:

DomainKeyphr(c) = Σ_{d' ∈ D(c) \ {d}} 1

where D(c) is the set of documents in the training set that have c as a keyphrase. Here we assume that for each document, each keyphrase has a frequency of 0 or 1. This is the case when each document has only a single indexer or when the keyphrase
lists of multiple indexers are merged and any frequency information is discarded. This is the case for all of our datasets. When this is not the case, we could either keep keyphraseness as is, or modify it to

DomainKeyphr(c) = Σ_{d' ∈ D(c) \ {d}} f(c, d')

where f(c, d') is the number of times c is a keyphrase for document d'. The modified form may be substantially worse, since it seems that the same keyphrase being assigned to multiple documents is substantially more significant than multiple indexers assigning the same keyphrase to one document. However, this depends on the specifics of the corpus and how the training keyphrases were generated. We will not explore this issue, since it is not relevant on our corpora.
The domain keyphraseness feature improves performance on datasets where the
same phrases are used as keyphrases many times. However, on datasets which span
multiple domains, the keyphraseness feature can result in poor choices of keyphrases.
For example, the CiteULike180 dataset contains both math papers and biology papers. The phrase "graph" has high keyphraseness because it is frequently used as a
keyphrase for math papers in the training set. As a result, "graph" is often incorrectly
selected by the extraction system as a keyphrase for biology papers.
To address this shortcoming of domain keyphraseness, we propose an augmented
form of domain keyphraseness which is aware of which documents are similar. Instead
of weighing all appearances of a candidate as a keyphrase equally, we weight them by
the inverse of the pair-wise document distance between the document of the candidate
and the training document where the candidate appeared as a keyphrase. The new
formula for keyphraseness is,
AugmDomainKeyphr(c) = Σ_{d' ∈ D(c) \ {d}} 1/||d, d'||

where ||d, d'|| is the pairwise distance between documents d and d'. Note that we
have not yet specified our distance metric.
The document distance metric should place documents from the same domain
close together. Suppose we have a hierarchical document clustering, with N clusters
of documents and a binary tree with the clusters as leaves. We assume these clusters
are composed of similar documents, and the tree is arranged with similar clusters close
together. Then similar documents will be close together in the tree, while dissimilar
documents will be far away. Hence, a suitable metric is,
||d, d'|| = 1 / max(T - lcad(d, d'), 0)

where lcad(d, d') is the distance to the lowest common ancestor of the clusters of d and d', and T is a thresholding constant. Let a be the lowest common ancestor of the clusters of d and d'. If a is the kth and k'th ancestor of the clusters of d and d', respectively, then

lcad(d, d') = max(k, k')

Lowest common ancestor distance, lcad, is a measure of the distance between d and d'. However, having a scaling parameter is desirable, which is why we do not simply use lcad as our distance metric.
If T = 1, then ||d, d'|| = 1 if d and d' are from the same cluster, and ||d, d'|| = ∞ otherwise. Hence for T = 1, augmented domain keyphraseness is equivalent to domain keyphraseness, but restricted to documents in the same cluster. If T > 1, then the documents from the same cluster as d are given weight T, and all other clusters are weighted less based on how far they are from the cluster of d. Sufficiently far away clusters are weighted at 0.
Note that as with all features, any constant
scaling does not matter, since bagged decision trees are used for ranking.
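The sketch below shows how lcad and the resulting weight max(T - lcad, 0) could be computed from a cluster tree; the parent mapping and cluster_of assignment are hypothetical stand-ins for the hierarchical clustering described next.

    def lcad(cluster_a, cluster_b, parent):
        # Distance to the lowest common ancestor of two leaf clusters.
        # `parent` maps each tree node to its parent (None or absent at the root).
        def ancestors(node):
            path = [node]
            while parent.get(node) is not None:
                node = parent[node]
                path.append(node)
            return path
        path_a, path_b = ancestors(cluster_a), ancestors(cluster_b)
        common = set(path_a) & set(path_b)
        k = min(i for i, n in enumerate(path_a) if n in common)
        k_prime = min(i for i, n in enumerate(path_b) if n in common)
        return max(k, k_prime)

    def augm_domain_keyphraseness(cand_doc, docs_with_keyphrase, cluster_of, parent, T=8):
        # Sum of max(T - lcad, 0) over the other training documents that have the
        # candidate as a keyphrase, i.e. the weight 1/||d, d'|| from the formula above.
        total = 0.0
        for d_prime in docs_with_keyphrase:
            if d_prime == cand_doc:
                continue
            dist = lcad(cluster_of[cand_doc], cluster_of[d_prime], parent)
            total += max(T - dist, 0)
        return total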
To generate a hierarchical document clustering, we use the CLUTO software package [32]. The -showtree flag is used to generate a hierarchical clustering, otherwise
the default parameters are used. Doc2mat is used to preprocess the documents into
the vector space format used by CLUTO. The default doc2mat options are used. The
clustering generated is a set of numCluster clusters, and then a hierarchical tree built
on top of the clusters.
Table 4.1: Augmented Domain Keyphraseness Parameter Tuning

T \ N    10            20            40
1        30.1 ± 0.2    29.4 ± 0.2    28.1 ± 0.2
2        30.8 ± 0.2    29.8 ± 0.2    28.8 ± 0.2
4        31.4 ± 0.2    31.5 ± 0.2    30.5 ± 0.2
8        31.5 ± 0.2    31.6 ± 0.2    31.6 ± 0.2
16       31.4 ± 0.2    31.4 ± 0.2    31.4 ± 0.2
We consider the effects of replacing the domain keyphraseness feature with the augmented domain keyphraseness feature on the CiteULike180 dataset. Table 4.1 shows performance for a range of values of T and N. We observe that for T = 1, this augmentation worsens performance. As mentioned earlier, for T = 1, augmented domain keyphraseness is equivalent to domain keyphraseness but restricted to the clusters. Since the clusters are small, this reduces performance. For T = 1, performance decreases as N increases, since the clusters are getting smaller and smaller, decreasing the usefulness of keyphraseness. For greater values of T, we see that this augmentation improves performance. The gain in performance is not particularly sensitive to the values of N and T. We select N = 20 and T = 8, since they lie within the region of maximum performance. This gives an improvement of about 0.5 to the F-score.
We performed the same exploration on both the Collabgraph and Semeval datasets. We did not see any meaningful improvement on these datasets from the augmentation. However, the use of augmented domain keyphraseness also did not have an adverse effect on performance. This augmentation was introduced to avoid errors where the extraction algorithm selects candidates that appeared frequently as keyphrases, but for documents from other domains. Without augmented domain keyphraseness, this form of error occurs frequently in the CiteULike180 dataset. Even without augmented domain keyphraseness, this form of error occurs very rarely in the Collabgraph and Semeval datasets. This difference is due to the dramatically shorter keyphrases in the CiteULike180 dataset. Table 4.2 shows the average number of words in keyphrases in each dataset. Since the keyphrases in the CiteULike180
Table 4.2: Keyphrase Statistics for Datasets

                            Collabgraph    CiteULike180    Semeval
Average Keyphrase Length    1.80           1.16            1.96
Zero Keyphraseness          75%            30%             68%
dataset are so short, they often appear in the texts of documents from other domains. A phrase such as "algorithm" will frequently appear in non-computer-science papers, and hence may be incorrectly chosen as a keyphrase for a biology paper. On the other hand, longer keyphrases, such as "ray tracing algorithm", are unlikely to appear in papers outside of the domain of computer science. As such, the keyphraseness feature is unlikely to cause "ray tracing algorithm" to be chosen as a keyphrase for a biology paper, since "ray tracing algorithm" is very unlikely to appear in the text of a biology paper. Hence, the longer keyphrases of Collabgraph and Semeval cause there to be fewer keyphraseness-induced errors. Additionally, the Collabgraph and Semeval keyphrases tend to be fairly specific, even when they are a single word, further reducing the likelihood of keyphraseness errors. This can be seen by examining the sample keyphrases in Appendix A.
Table 4.2 also shows the percentage of keyphrases that have zero keyphraseness. A keyphrase is said to have zero keyphraseness if it appears as a keyphrase only once in the training data. Few keyphrases in the CiteULike180 dataset have zero keyphraseness, so keyphraseness is a highly important feature on this dataset. Keyphraseness plays a lesser role on the other datasets, where most keyphrases have zero keyphraseness.
If the Collabgraph or Semeval datasets were larger, keyphraseness-induced errors would become more important. A larger corpus increases the number of keyphrases in the training data. In turn, this increases the chance of a keyphrase from one document appearing in a document from another domain. Although the Collabgraph and Semeval datasets have mostly long, specific keyphrases, they also have shorter, less specific keyphrases, as with CiteULike180. Augmented keyphraseness can help eliminate the interference of these short, non-specific keyphrases.
Table 4.3: Average Word Length

                            Collabgraph    CiteULike180    Semeval
w/o Average Word Length     16.5 ± 0.1     31.1 ± 0.2      18.2 ± 0.1
with Average Word Length    17.3 ± 0.1     32.4 ± 0.2      18.8 ± 0.2

4.2 Average Word Length
We discovered that a new simple feature, average word length, substantially improves performance on all three datasets. Average word length is the average number of characters per word in a candidate. This excludes spaces, although the feature is
equally effective with spaces included in its computation. Table 4.3 shows the effect
of adding this feature to each of the datasets, when Wikipedia features are disabled.
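For concreteness, a minimal sketch of the feature on a hypothetical candidate:

    def average_word_length(candidate):
        # Average number of characters per word, with spaces excluded.
        words = candidate.split()
        return sum(len(w) for w in words) / len(words)

    print(average_word_length("support vector machine"))  # 20 letters / 3 words ~= 6.67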
If we instead add a feature for the total number of characters, there is no performance improvement. This is interesting because the average number of characters
per word is just the total number of characters divided by the number of words. The
number of words in a candidate is already a feature, so one might expect the total
number of characters feature to yield an equal improvement. However, decision trees cannot effectively represent division-based effects, so the combination of two existing features can improve performance beyond that of the individual features [7].
4.3 Feature Scaling

The ranking step in keyphrase extraction selects the best candidate keyphrases from the list of candidates for each document. The ranker is trained on the candidates for all documents pooled together. As such, a candidate with a high term frequency compared to other candidates for its document may have a low or moderate term frequency relative to the pooled candidates. Scaling features on a per-document basis can address this issue. In this section, we consider the effectiveness of various rescaling schemes, and their interaction with reranking.
Note that due to the use of decision trees in the ranking step, the application of the same monotonic function to the feature values of all candidates would have no effect on the ranking step. Hence, if feature scaling were performed across all documents, instead of on a per-document basis, it would be equivalent to performing no scaling at all.
4.3.1 Scaling Features

The intuition behind feature scaling is that feature values are most meaningful when compared to the feature values of the other candidates from the same document. This intuition is more applicable to frequency-based features, such as tf or tf-idf, than to features such as the number of words in the candidate. Due to the simple filtering process, the distribution of the number-of-words feature is determined entirely by the distribution of stopwords in the document. As such, we would not expect the scaling of the length feature to improve performance. In fact, by adding additional noise, this scaling could potentially have an adverse effect on performance.
We consider two standard methods for feature scaling: scaling to unit range and scaling to unit variance. To scale features to unit range, each feature is linearly scaled on a per-document basis, such that the feature values for each document range from 0 to 1. Formally, we scale feature i of candidate c of document d as follows,

(c_i - min_{c' ∈ d} c'_i) / (max_{c' ∈ d} c'_i - min_{c' ∈ d} c'_i)

Scaling to unit variance is analogous: each feature is linearly scaled so that the feature values for each document have unit variance. Equivalently, each feature is mapped to its z-score.
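A per-document sketch of both schemes (our own illustration) is shown below; X is one document's candidate-by-feature matrix and columns selects which features, e.g. the frequency-based ones, to rescale.

    import numpy as np

    def scale_per_document(X, columns, method="range"):
        # Rescale the given feature columns of one document's candidate matrix X.
        # method="range" maps each column to [0, 1]; method="variance" maps it to
        # z-scores (zero mean, unit variance). Other columns are left untouched.
        X = X.astype(float).copy()
        for j in columns:
            col = X[:, j]
            if method == "range":
                span = col.max() - col.min()
                X[:, j] = (col - col.min()) / span if span > 0 else 0.0
            else:
                std = col.std()
                X[:, j] = (col - col.mean()) / std if std > 0 else 0.0
        return X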
Table 4.4 shows the effects of scaling the features on Collabgraph. We consider the scaling of two different sets of features: the set of all features, and a more restricted set that only includes the three frequency-based features, tf, idf, and tf-idf. Due to the sensitivity of both scaling techniques to outliers, we additionally consider the effects of truncating the top and bottom t% of the candidates for each document when computing the linear transforms.
Table 4.4: Scaling Methods

Percent Truncation                   0%            1%            5%
Unit Variance, All Features          15.5 ± 0.2    15.3 ± 0.2    14.5 ± 0.4
Unit Range, All Features             16.7 ± 0.2    16.7 ± 0.1    15.8 ± 0.2
Unit Variance, Frequency Features    16.9 ± 0.2    16.8 ± 0.1    16.7 ± 0.1
Unit Range, Frequency Features       16.8 ± 0.1    16.8 ± 0.2    16.5 ± 0.1
Rescaling only the frequency-based features consistently outperforms scaling all
of the features. As mentioned previously, some features, such as length, have no need
for scaling. The scaling of such features introduces noise, which can have an adverse
effect on performance. When the frequency feature set is used, the choice of scaling to
unit variance vs unit range has minimal effect. Similarly, the level of truncation has
no significant effect that is distinguishable from noise. As such, the use of truncation
is undesirable, as it introduces additional parameters.
Rescaling the frequency features to unit variance or unit range with no truncation improves the F-score by about 0.3 on Collabgraph. This is a small but non-trivial improvement.
4.3.2 Scaling and Reranking
When reranking is being used, feature scaling can be applied before the ranking step,
reranking step, or both.
Before the reranking step, the features of the reranking
candidates can be rescaled, just as the features of the ranking candidates can be
scaled.
Table 4.5 shows performance for all four possible configurations. In all cases, the
scaling method is scaling to unit variance with no truncation. Unfortunately, no use
of scaling improves performance beyond vanilla reranking. As such, we will not use
rescaling in our final system. It is possible that a more sophisticated scaling scheme
may offer performance improvements even when combined with reranking.
Table 4.5: Scaling and Reranking on Collabgraph

Rescaling         F-score
None              18.6 ± 0.1
Before Ranking    18.3 ± 0.2
Before Rerank     18.4 ± 0.2
Before Both       18.4 ± 0.2
4.4 Feature Selection
Poor choices of features can have dramatic negative effects on performance. In experimenting with various potential new features, we frequently encountered this effect.
This was often seen with predictive but complicated features, which would appear
high in decision trees, but prevent other features from being used effectively.
To better understand the set of features, we consider the effects of removing individual features. Table 4.6 presents the effects of feature removal when no Wikipedia
features are active. Table 4.7 shows the effects of removing features when the Wikipedia
features are active. All results are with reranking enabled. Due to the high cost of the computation of Wikipedia features, we employ the optimization discussed in Chapter 3, and only compute Wikipedia features before reranking.
From Tables 4.6 and 4.7 we see that no features have a statistically significant
adverse effect on performance. Interestingly, we see that a number of features have
minimal positive impact. For example, the removal of tf-idf has no significant effect
on any of the datasets.
This is surprising given that tf-idf often occurs high in
the bagged decision trees.
Evidently, in the absence of tf-idf, other features are
able to provide the same information. As mentioned previously in Section 4.1, the
domain keyphraseness feature is most important on the CiteULike180 dataset. On
the Collabgraph dataset, where 75% of keyphrases only occur once in the training
set, keyphraseness has no significant effect. The lastOccur and spread features are
seen to have minimal effect while the firstOccur feature provides substantial value. We believe this is because firstOccur can be used to determine if the keyphrase appears in the abstract. spread and lastOccur are tied to term frequency and to each other, so they do not provide that much value individually.
Table 4.6: Non-Wikipedia Performance with Features Removed

Feature Removed      Collabgraph    CiteULike180    Semeval
none                 19.4 ± 0.1     35.4 ± 0.2      20.7 ± 0.2
tf                   19.5 ± 0.2     35.0 ± 0.2      20.7 ± 0.2
tfidf                19.4 ± 0.1     35.0 ± 0.2      20.8 ± 0.2
idf                  19.2 ± 0.1     34.4 ± 0.2      20.8 ± 0.2
firstOccur           18.9 ± 0.2     32.9 ± 0.2      20.4 ± 0.2
lastOccur            19.6 ± 0.2     35.1 ± 0.2      20.7 ± 0.2
spread               19.5 ± 0.2     35.1 ± 0.2      20.8 ± 0.2
length               19.1 ± 0.2     34.9 ± 0.2      20.5 ± 0.2
domainKeyph          18.7 ± 0.1     28.0 ± 0.2      19.2 ± 0.2
averageWordLength    18.7 ± 0.2     35.2 ± 0.2      20.5 ± 0.2
Table 4.7: Wikipedia Performance with Features Removed

Feature Removed     Collabgraph    CiteULike180    Semeval
none                21.3 ± 0.2     35.9 ± 0.2      21.5 ± 0.2
totalWikipKeyphr    21.2 ± 0.2     35.8 ± 0.2      21.4 ± 0.2
generality          21.4 ± 0.2     35.7 ± 0.2      21.6 ± 0.2
invWikipFreq        21.4 ± 0.2     35.8 ± 0.2      21.5 ± 0.2
wikipKeyphr         21.3 ± 0.1     35.4 ± 0.2      21.6 ± 0.2
Given that the set of
features which can be removed without adversely impacting performance varies from
corpus to corpus, and no significant improvements are seen from feature removal, we
recommend the use of the full set of features on all datasets.
Chapter 5
Conclusion
We have presented a number of advancements to the state of the art in keyphrase
extraction. In this final chapter, we report the combined effects of these improvements
and compare the accuracy of our system to that of human indexers.
Oahu, our new system, combines all of the advancements discussed in the previous chapters. During the training of the ranker, positive examples are reweighted. Reranking is used, as are the new numSuper and numSub features. Although the computation of Wikipedia features can be delayed to reduce runtime, we will not do so in this chapter, since we want to report the maximum accuracy achievable by our system. The new average word length feature is used on all three datasets, and the new augmented domain keyphraseness feature is employed with the CiteULike180 dataset.
5.1 Combined Performance

Table 5.1 compares the performance of Maui and Oahu on all three datasets. We see that Oahu substantially improves upon the performance of Maui on all datasets. Even without the use of Wikipedia, Oahu is able to outperform both Wikipedia and non-Wikipedia Maui. Interestingly, there is a difference between the performance of Oahu with and without Wikipedia on the CiteULike180 and Semeval datasets, even though no substantial difference exists with Maui. Evidently, the more powerful Oahu
Table 5.1: Maui vs Oahu

                    Collabgraph    CiteULike180    Semeval
Maui (w/o Wiki)     16.5 ± 0.1     31.1 ± 0.2      18.2 ± 0.1
Maui (with Wiki)    17.5 ± 0.2     31.2 ± 0.2      18.3 ± 0.2
Oahu (w/o Wiki)     19.2 ± 0.1     35.6 ± 0.2      20.7 ± 0.2
Oahu (with Wiki)    21.0 ± 0.3     36.4 ± 0.4      21.8 ± 0.4
ranking and reranking system is able to better exploit the Wikipedia features.
Overall, Oahu is able to improve upon the performance of Maui with no substantial increase in runtime.
Through the delaying of the computation of Wikipedia fea-
tures, these gains can be achieved while reducing total runtime by nearly an order of
magnitude.
5.2 Comparison to Human Indexers
In Chapter 2, we described a new metric for the comparison of automatic keyphrase
extraction systems to human indexers. This metric has two forms. The first evaluates
the relative performance of a human reader and an extraction algorithm by comparing
their consistency with the author assigned keyphrases. The second metric compares
the internal consistency of multiple human indexers to the consistency of an extraction
algorithm with the same set of human indexers. The first form can be used with the
Semeval dataset, where both human and author assigned keyphrases are available.
The second form can be used with the CiteULike180 dataset, where keyphrases from
multiple human indexers are available.
Table 5.2 shows the consistency of human indexers, Maui, and Oahu on the CiteULike180 and Semeval datasets, with Wikipedia features enabled. For these consistency measures, 150 and 200 training documents were used for CiteULike180 and Semeval, respectively.
Oahu is competitive with the human indexers on both datasets.
On
CiteULike180, Oahu achieves the same level of consistency as the human indexers.
On Semeval, Oahu has a higher consistency than the human indexers. This means
that the keyphrase lists extracted by Oahu are more similar to the author assigned
Table 5.2: Automatic vs Human Consistency

         CiteULike180     Semeval
Human    0.497            0.179
Maui     0.453 ± 0.012    0.169 ± 0.009
Oahu     0.495 ± 0.012    0.191 ± 0.009
keyphrase lists than the reader assigned lists are. The readers of the Semeval dataset only had 15 minutes per paper, so they may have been able to achieve higher
consistency if they were given more time per paper. Nonetheless, Oahu's ability to
perform at levels competitive to human indexers indicates that its performance is not
far from the theoretical maximum.
Appendix A
Sample Keyphrases
Table A.1: Sample Keyphrases

Collabgraph:
  affect, affective computing, user interface
  hallway, left, railing, left, hallway, left, computers, right, conference room, bridge, right
  argumentation, negotiation

CiteULike180:
  baffled microbial fuel cell, stacking, electricity generation, organic wastewater
  associative memory, mutant mice, amygdala, c fos
  coding, correlation
  maria, volcanism, lunar interior, thermochemical properties, convection

Semeval:
  wireless sensor network, localization
  inference, review, networks
  xml, rank, information retrieval
  text mining
  content addressable storage, relational database system, database cache, wide area network, bandwidth optimization
  inferred regions, evaluation
Bibliography

[1] Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Advances in Artificial Intelligence, pages 40-52. Springer, 2000.

[2] Gábor Berend and Richard Farkas. Sztergak: Feature engineering for keyphrase extraction. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 186-189. Association for Computational Linguistics, 2010.

[3] Michael Collins and Terry Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25-70, 2005.

[4] Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management, CIKM '98, pages 148-155, New York, NY, USA, 1998. ACM.

[5] Samhaa R El-Beltagy and Ahmed Rafea. Kp-miner: A keyphrase extraction system for English and Arabic documents. Information Systems, 34(1):132-144, 2009.

[6] Eibe Frank, Gordon W Paynter, Ian H Witten, Carl Gutwin, and Craig G Nevill-Manning. Domain-specific keyphrase extraction. 1999.

[7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning, volume 2. Springer, 2009.

[8] Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 216-223. Association for Computational Linguistics, 2003.

[9] Mario Jarmasz and Caroline Barriere. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. Proceedings of CLINE, 2004.

[10] Thorsten Joachims. Making large scale SVM learning practical. 1999.

[11] Steve Jones and Malika Mahoui. Hierarchical document clustering using automatically extracted keyphrases. 2000.

[12] Arash Joorabchi and Abdulhussain E Mahdi. Automatic keyphrase annotation of scientific documents using wikipedia and genetic algorithms. Journal of Information Science, 39(3):410-426, 2013.

[13] Su Nam Kim, Timothy Baldwin, and Min-Yen Kan. Evaluating n-gram based evaluation metrics for automatic keyphrase extraction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 572-580. Association for Computational Linguistics, 2010.

[14] Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21-26. Association for Computational Linguistics, 2010.

[15] Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. Automatic keyphrase extraction from scientific articles. Language Resources and Evaluation, 47(3):723-742, 2013.

[16] Patrice Lopez. Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries, pages 473-474. Springer, 2009.

[17] Patrice Lopez and Laurent Romary. Humb: Automatic key term extraction from scientific articles in grobid. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 248-251. Association for Computational Linguistics, 2010.

[18] Minh-Thang Luong, Thuy Dung Nguyen, and Min-Yen Kan. Logical structure recovery in scholarly articles with rich document features. International Journal of Digital Library Systems (IJDLS), 1(4):1-23, 2010.

[19] Olena Medelyan. Human-competitive automatic topic indexing. PhD thesis, The University of Waikato, 2009.

[20] Olena Medelyan, Eibe Frank, and Ian H Witten. Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, pages 1318-1327. Association for Computational Linguistics, 2009.

[21] Olena Medelyan and Ian H Witten. Measuring inter-indexer consistency using a thesaurus. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 274-275. ACM, 2006.

[22] Olena Medelyan and Ian H Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 296-297. ACM, 2006.

[23] Olena Medelyan, Ian H Witten, and David Milne. Topic indexing with wikipedia.

[24] Thuy Dung Nguyen and Min-Yen Kan. Keyphrase extraction in scientific publications. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, pages 317-326. Springer, 2007.

[25] Thuy Dung Nguyen and Minh-Thang Luong. Wingnus: Keyphrase extraction utilizing document logical structure. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 166-169. Association for Computational Linguistics, 2010.

[26] Gordon W Paynter, IH Witten, and SJ Cunningham. Evaluating extracted phrases and extending thesauri. In Proceedings of the Third International Conference on Asian Digital Libraries, pages 131-138. Citeseer, 2000.

[27] L Rolling. Indexing consistency, quality and efficiency. Information Processing & Management, 17(2):69-76, 1981.

[28] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1-47, 2002.

[29] Pucktada Treeratpituk, Pradeep Teregowda, Jian Huang, and C. Lee Giles. Seerlab: A system for extracting keyphrases from scholarly documents. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 182-185, Uppsala, Sweden, July 2010. Association for Computational Linguistics.

[30] Ian H Witten, Gordon W Paynter, Eibe Frank, Carl Gutwin, and Craig G Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries, pages 254-255. ACM, 1999.

[31] Wei You, Dominique Fontaine, and Jean-Paul Barthes. An automatic keyphrase extraction system for scientific documents. Knowledge and Information Systems, 34(3):691-724, 2013.

[32] Ying Zhao, George Karypis, and Usama Fayyad. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141-168, 2005.