Ranking Techniques for Keyphrase Extraction

by

David Field

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Masters of Engineering in Computer Science and Engineering at the Massachusetts Institute of Technology, September 2014.

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, July 23, 2014
Certified by: Regina Barzilay, Professor, Thesis Supervisor
Accepted by: Albert R. Meyer, Chair, Masters of Engineering Thesis Committee

Ranking Techniques for Keyphrase Extraction

by

David Field

Submitted to the Department of Electrical Engineering and Computer Science on July 23, 2014, in partial fulfillment of the requirements for the degree of Masters of Engineering in Computer Science and Engineering

Abstract

This thesis focuses on the task of extracting keyphrases from research papers. Keyphrases are short phrases that summarize and characterize the contents of documents. They help users explore sets of documents and quickly understand the contents of individual documents. Most academic papers do not have keyphrases assigned to them, and manual keyphrase assignment is highly laborious. As such, there is a strong demand for automatic keyphrase extraction systems.

The task of automatic keyphrase extraction presents a number of challenges. Human indexers are heavily informed by domain knowledge and comprehension of the contents of the papers. Keyphrase extraction is an intrinsically noisy and ambiguous task, as different human indexers select different keyphrases for the same paper. Training data is limited in both quality and quantity.

In this thesis, we present a number of advancements to the ranking methods and features used to automatically extract keyphrases. We demonstrate that, through the reweighting of training examples, the quality of the learned bagged decision trees can be improved with negligible runtime cost. We use reranking to improve accuracy and explore several extensions thereof. We propose a number of new features, including augmented domain keyphraseness and average word length. Augmented domain keyphraseness incorporates information from a hierarchical document clustering to improve the handling of multi-domain corpora. We explore the technique of per-document feature scaling and discuss the impact of feature removal. Over three diverse corpora, these advancements substantially improve accuracy and runtime. Combined, they give keyphrase assignments that are competitive with those produced by human indexers.

Thesis Supervisor: Regina Barzilay
Title: Professor

Acknowledgments

First and foremost, I would like to thank my advisor, Regina Barzilay. Her guidance and vision were instrumental in my research. I could not have made the progress I did without her knowledge and intuition. When my ideas were failing, she was always there to provide advice and encouragement.

I'd also like to thank my group mates. Their advice and insights have been invaluable in the writing of this thesis. Our discussions always left me with a deeper understanding of natural language processing.

My friends at MIT have been a source of both wisdom and joy. I will always remember my time in Cambridge fondly.

Finally, this thesis would not have been possible without the inexhaustible support of my family.
For their love and support, I am forever grateful.

Contents

1 Introduction
  1.1 Keyphrases
  1.2 Common Keyphrase Annotation Techniques
  1.3 Challenges
  1.4 Advancements

2 Task Description
  2.1 Prior Work
    2.1.1 Keyphrase Extraction
    2.1.2 Keyphrase Assignment
  2.2 Evaluation
    2.2.1 Performance Metrics
    2.2.2 Cross-Validation
    2.2.3 Comparison to Human Indexers
    2.2.4 Use of Evaluation Systems
  2.3 Datasets
    2.3.1 Collabgraph
    2.3.2 CiteULike180
    2.3.3 Semeval 2010 Task 5
    2.3.4 Author vs Reader Selected Keyphrases
    2.3.5 Testing Datasets
  2.4 Baseline
    2.4.1 Baseline System
    2.4.2 Baseline Performance

3 Ranking Methods
  3.1 Reweighting and Bagging
    3.1.1 Reweighting positive examples
    3.1.2 Comparison to Tuning of Bagging Parameters
    3.1.3 Summary of Reweighting
  3.2 Reranking
    3.2.1 Bagged Decision Trees
    3.2.2 Support Vector Machines
    3.2.3 Extensions
    3.2.4 Summary of Reranking

4 Features
  4.1 Augmented Domain Keyphraseness
  4.2 Average Word Length
  4.3 Feature Scaling
    4.3.1 Scaling Features
    4.3.2 Scaling and Reranking
  4.4 Feature Selection

5 Conclusion
  5.1 Combined Performance
  5.2 Comparison to Human Indexers

A Sample Keyphrases

List of Tables

1.1 Example Document
2.1 Statistics for Datasets
2.2 Collabgraph Baseline Performance
2.3 CiteULike180 Baseline Performance
2.4 Semeval Baseline Performance
2.5 Summary Table
3.1 Reweighting Tuning
3.2 Tuning Bagging Parameters
3.3 Reweighting Summary
3.4 Tuning Reranker Candidate Counts
3.5 Tuning Reranker Bagging Parameters
3.6 Reranking Summary
3.7 Adding Wikipedia Features at Different Stages
3.8 Effects of numSuper and numSub Features
4.1 Augmented Domain Keyphraseness Parameter Tuning
4.2 Keyphrase Statistics for Datasets
4.3 Average Word Length
4.4 Scaling Methods
4.5 Scaling and Reranking on Collabgraph
4.6 Non-Wikipedia Performance with Features Removed
4.7 Wikipedia Performance with Features Removed
5.1 Maui vs Oahu
5.2 Automatic vs Human Consistency
A.1 Sample Keyphrases

Chapter 1

Introduction

1.1 Keyphrases

Keyphrases are short phrases which summarize and characterize documents. They can serve to identify the specific topics discussed in a document or place it within a broader domain. Table 1.1 shows the title, abstract, and author-assigned keyphrases of a paper from the Collabgraph dataset. Some of the keyphrases of this paper, such as "frustration" and "user emotion", describe the specific topics covered by the paper. On the other hand, the keyphrase "human-centred design" is more general, and describes a broader domain which the paper falls within. These two roles are not exclusive; most keyphrases act in both roles, providing information on the specifics of the paper and how it connects to other papers.

Keyphrases have a wide variety of applications in the exploration of large document collections, the understanding of individual documents, and as input to other learning algorithms.

In large document collections, such as digital libraries, keyphrases can be used to enable the searching, exploration, and analysis of the collections' contents. When a collection of documents is annotated with keyphrases, users can search the collection using keyphrases. They can utilize keyphrases to expand out from a single document to find other documents about the same topics. Keyphrase statistics can be computed for an entire corpus, giving a high level view of the topics in the corpus and how they relate.

Keyphrases also help readers understand single documents by summarizing their contents. In effect, the task of keyphrase assignment is a more highly compressed variant of text summarization. Keyphrases allow readers to quickly understand what is discussed in a long paper, without reading the paper itself. In this regard, keyphrases play a similar role to the abstract of a paper, but with even greater compression. They can serve to augment the abstract of a paper by identifying which portions of the abstract are most important. Usually, the list of keyphrases contains phrases not found in the abstract and provides additional information about subjects covered in the paper.

Keyphrases can be used as the inputs to a wide variety of learning tasks. They have been seen to improve the performance in text clustering and categorization, text summarization, and thesaurus construction [11][1][26]. Naturally, these improvements are dependent on the availability of high quality keyphrase lists.

Keyphrases are typically chosen manually.
For academic papers, the authors generally select the keyphrases themselves. In other contexts, such as libraries, professional indexers may assign them from a fixed list of keyphrases. Unfortunately, the vast majority of documents have no keyphrases assigned or have an incomplete list of keyphrases. Manually assigning keyphrases to documents is a time consuming process which requires familiarity with the subject matter of the documents. For this reason, there is a strong demand for accurate automatic keyphrase annotation.

1.2 Common Keyphrase Annotation Techniques

There are two primary methods for automated keyphrase annotation: keyphrase extraction and keyphrase assignment [30]. In keyphrase extraction, the keyphrases are drawn from the text of the document itself, using various ranking and filtering techniques. In keyphrase assignment, the keyphrases are drawn from a controlled list of keyphrases and assigned to documents by per-keyphrase classifiers. The documents, availability of training data, and type of keyphrases desired all impact the relative suitability of these two techniques.

Table 1.1: Example Document

Title: Computers that Recognise and Respond to User Emotion: Theoretical and Practical Implications

Abstract: Prototypes of interactive computer systems have been built that can begin to detect and label aspects of human emotional expression, and that respond to users experiencing frustration and other negative emotions with emotionally supportive interactions, demonstrating components of human skills such as active listening, empathy, and sympathy. These working systems support the prediction that a computer can begin to undo some of the negative feelings it causes by helping a user manage his or her emotional state. This paper clarifies the philosophy of this new approach to human-computer interaction: deliberately recognising and responding to an individual user's emotions in ways that help users meet their needs. We define user needs in a broader perspective than has been hitherto discussed in the HCI community, to include emotional and social needs, and examine technology's emerging capability to address and support such needs. We raise and discuss potential concerns and objections regarding this technology, and describe several opportunities for future work.

Keywords: User emotion, affective computing, social interface, frustration, human-centred design, empathetic interface, emotional needs

In keyphrase assignment, keyphrases are assigned to documents from a controlled vocabulary of keyphrases. For each keyphrase in the controlled vocabulary, a classifier is trained and then used to assign the keyphrase to new documents. This means that only keyphrases that have been encountered in the training data will ever be assigned to new documents. In many cases, the controlled vocabulary is a canonical list of keyphrases from a journal or library. Keyphrase assignment is also referred to as text categorization or text classification in some other works [19].

In keyphrase extraction, keyphrases are chosen from the text of the documents, instead of from a controlled vocabulary. A single keyphrase extractor is trained, which takes in documents and identifies phrases from the texts which are likely to be keyphrases. This extractor can utilize a wide variety of information, such as the locations of the candidate phrases in the document and their term frequency.
Many keyphrases appear in the abstracts of papers, such as "frustration" and "user emotion" in the example paper from Table 1.1. Keyphrases often have high term frequency. For example, "frustration" occurs 34 times within the paper from Table 1.1. These patterns can be used to identify which candidate phrases are most likely to be keyphrases.

1.3 Challenges

Automatic keyphrase extraction presents a number of challenges. Human indexers rely on a deep understanding of both the contents of the documents and their domains. This is especially true when authors are assigning keyphrases to their own papers. Indexers are able to understand how the different concepts in the documents relate to each other, and how the paper fits within the broader context of other papers in the domain.

The keyphrase lists available for documents are incomplete and noisy. When authors assign keyphrases to their own papers, they usually choose a small number of keyphrases. Hence, the lists don't include all phrases that are good keyphrases for the document. The lists also express the individual biases of the authors that create them. Similar issues are present for reader assigned keyphrases. Additionally, human readers typically have a weaker understanding of the paper, and hence select worse keyphrases.

Only limited datasets are available for the training of keyphrase extraction systems. This presents a number of challenges. Firstly, only a small fraction of all possible keyphrases are encountered in the training set, so it is difficult to say if a candidate phrase is suitable to be a keyphrase. Secondly, for many domains no training data is available, forcing the use of training data from a different domain.

A final challenge is that often keyphrases don't occur in the text of the document. In our example document, the keyphrase "human-centred design" does not appear anywhere. To select this keyphrase, the human indexer utilized semantic and domain knowledge. Replicating this form of knowledge in an automated system is difficult, especially with limited training data.

1.4 Advancements

In this thesis, we present a number of advancements to the state of the art in keyphrase extraction. We build upon Maui, an open-source keyphrase extraction system [19]. We have released our improved system as an open source project called Oahu. Our contributions are as follows:

1. Reweighting: We present a technique for the reweighting of positive examples during training which improves both accuracy and runtime.

2. Reranking: We introduce a reranking system which improves performance and enables several additional improvements.

(a) Delayed Feature Computation: By delaying the computation of Wikipedia features until the reranking stage, we are able to dramatically reduce runtime with minimal cost to accuracy.

(b) Number of Higher Ranked Superstrings and Substrings: We compute the number of high ranked substrings and superstrings from the original ranking step and use them as features in the reranking stage. These new features are seen to improve performance by adding information on the relationships between keyphrases.

3. Augmented Domain Keyphraseness: We incorporate information from hierarchical document clusterings into the keyphraseness feature to improve the handling of corpora with documents from multiple domains.

4. Average Word Length Feature: We propose a new average word length feature which is seen to substantially improve performance.

5. Feature Scaling: We consider the rescaling of features on a per-document basis.
We explore a variety of schemes, some of which are seen to improve performance.

6. Feature Selection: We examine the effects of feature removal. We find that although several features offer minimal benefit, no features have a significant adverse effect on performance.

7. Evaluation on Multiple Datasets: We evaluate the performance of all our changes across three diverse datasets.

8. Comparison to Human Indexers: We propose a new system for comparison to human indexers, and use it to compare performance between human indexers and our system on the CiteULike and Semeval datasets.

9. Oahu: We release our improved system as an open-source project called Oahu.¹ In addition to the various improvements to the keyphrase extraction system discussed in this thesis, Oahu also offers a variety of improvements over Maui not directly related to the quality of the keyphrases extracted. These include a built in system for cross validation, multi-threading support, and improved abstractions.

¹ https://github.com/ringwith/oahu

In Chapter 2, we will describe the task of keyphrase extraction and our procedures for evaluation. In Chapter 3, we will present our advancements in the ranking methods used for keyphrase extraction. In Chapter 4, we will describe our new features. In Chapter 5, we report the combined effect of these improvements and compare our system's performance to that of human indexers.

Chapter 2

Task Description

In this chapter, we will describe the task of keyphrase extraction in greater detail. First, we will review prior work in keyphrase annotation. Then we will focus on the methods that have been used for the evaluation of keyphrase extraction systems, and describe the evaluation methods we will be employing. We will also describe the three corpora that we will evaluate performance on. Finally, we will describe Maui, the baseline system, and report its performance on the three datasets.

2.1 Prior Work

Most automatic keyphrase annotation methods perform either keyphrase extraction or keyphrase assignment. In keyphrase extraction, the keyphrases are drawn from the texts of the documents by a single extractor. In keyphrase assignment, the keyphrases are drawn from a controlled list of keyphrases and then assigned to documents by per-keyphrase classifiers. Although there is a potential for systems that are a hybrid of these two techniques, or lie wholly outside of these categories, thus far there has been minimal exploration of such systems. Since keyphrase assignment is a relatively straightforward task, recent work has focused primarily on keyphrase extraction.

2.1.1 Keyphrase Extraction

Keyphrase extraction techniques typically rely on a two step system, consisting of a heuristic filtering stage to select candidate keyphrases from the text and a trained ranking stage to select the top candidates. The ranking stage utilizes a variety of features which vary substantially from system to system.

KEA, the Keyphrase Extraction Algorithm, is a representative example of a keyphrase extraction system [30] [6]. In KEA, a list of candidate keyphrases is filtered from articles by regularizing the text, and then selecting phrases of at most 3 words that are not proper nouns and do not begin or end with a stop word. Additionally, the keyphrases are stemmed, and their stemmed forms are used when evaluating features and comparing to the human assigned keyphrases.
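As a concrete illustration of this style of candidate filtering, the following sketch generates stemmed candidate phrases of at most three words that do not begin or end with a stop word. It is a minimal approximation under stated assumptions: the stopword list and suffix-stripping stemmer are small illustrative stand-ins for KEA's full English stopword list and Porter stemmer, and the proper-noun check is omitted.

import re
from collections import Counter

# Illustrative stopword list; KEA uses a much larger fixed English list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "or", "to",
             "with", "is", "are", "from", "by"}

def stem(word):
    # Crude suffix stripping as a stand-in for the Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def candidate_phrases(text, max_words=3):
    """Return stemmed candidate phrases of at most max_words words that do
    not begin or end with a stopword, together with their frequencies."""
    tokens = re.findall(r"[a-z][a-z\-]*", text.lower())
    candidates = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            phrase = tokens[i : i + n]
            if phrase[0] in STOPWORDS or phrase[-1] in STOPWORDS:
                continue
            candidates[" ".join(stem(w) for w in phrase)] += 1
    return candidates

if __name__ == "__main__":
    doc = "Keyphrase extraction systems rank candidate phrases extracted from the text."
    print(candidate_phrases(doc).most_common(5))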
The filtered lists of candidate keyphrases are ranked using a Naive Bayesian classifier with two features, tf-idf and the location of the first occurrence of the keyphrase.

Maui is an open source keyphrase extraction system developed by Medelyan et al. [19]. As will be discussed in Section 2.4, this system gives performance that is at or near the state of the art for author assigned keyphrases. As with KEA, the filtering stage outputs all phrases of n words or less that don't start or end with a stop word. The ranking step is performed using bagged decision trees that are generated from the training set. Maui utilizes a number of features, including term frequency, phrase length, and position of first occurrence. Maui also allows for the use of domain specific thesauruses or Wikipedia to compute additional features.

Integrating external data sources, such as domain specific thesauruses or Wikipedia, has been seen to considerably improve the performance of keyphrase extraction techniques. Thesauruses can be used as a fixed vocabulary, restricting the extracted keyphrases to the set of phrases in the thesaurus. The use of the Agrovoc thesaurus of food and agriculture terms dramatically improves KEA++'s performance on agricultural documents [22] [19]. High quality thesauruses are not available for all fields, but other sources, such as Wikipedia, can offer similar benefits. Several researchers have investigated this approach, and achieved inter-indexer consistency levels comparable to human indexers on the Wiki-20 and CiteULike180 corpora [23] [12]. However, these gains depend substantially on the corpora used.

A number of keyphrase extraction systems identify the different sections of the documents, and utilize this information to compute additional features. Typically, the additional features are binary features that indicate which sections of the paper the candidate appears in. These techniques yield the greatest performance improvements on reader selected keyphrases [14][31]. Some systems, such as SZTERGAK and SEERLAB, utilize simple rule based section identification on the text dumps [2][29]. In contrast, WINGUS and HUMB both perform boundary detection on the original PDF files that were used to generate the text dumps [17][31]. To do so, WINGUS and HUMB utilize SectLabel and GROBID, respectively. SectLabel and GROBID are trainable section identification systems [18][16]. WINGUS and HUMB are seen to have the highest performance on the reader assigned keyphrases in Semeval 2010 Task 5, indicating that their sophisticated section parsing may provide additional value beyond simpler rule based parsing [15]. At a minimum, trainable document parsing systems such as GROBID and SectLabel are more easily adapted to new corpora than rule based systems.

Some systems have a more sophisticated filtering step, selecting a more restricted list of candidates than KEA or Maui. SZTERGAK restricts to candidates matching predefined part of speech patterns [2]. KP-Miner filters candidates by frequency and position of first occurrence [5]. The WINGUS system abridges the input documents, ignoring all sentences after the first s in each body paragraph of the paper [18]. This abridging technique was seen to improve performance on reader chosen keyphrases, possibly reflecting a tendency of readers to choose keyphrases from the first few sentences of paragraphs.
However, these experiments were performed on a single split of a single small dataset, so these results may be due to noise or may generalize poorly to other corpora.

Some keyphrase extraction systems utilize heuristic postranking schemes to improve performance by modifying the candidate scores computed in the ranking step. The HUMB system utilizes statistics from the HAL research archive to update the scores of candidates after the scoring step [17].

A wide variety of methods have been proposed for the evaluation of keyphrase extraction systems. The standard method for evaluation is having the extraction system select N keyphrases for each document and then computing the F-score of the extracted keyphrases on the gold standard lists [14][6]. Other metrics have been proposed to address issues of near misses and semantic similarity [13] [21]. Inter-indexer consistency measures, such as Rolling's consistency metric, have been used to compare the performance of automatic extraction systems to that of human indexers [20]. We will discuss these issues in greater detail in Section 2.2.

Previous papers in keyphrase extraction have utilized a number of different datasets for evaluation. The Semeval dataset is a set of 244 ACM conference and workshop papers with both author and reader assigned keyphrases [14]. It was used in Semeval 2010 Task 5 to evaluate the performance of 19 different keyphrase extraction systems. CiteULike180 is a dataset of 180 documents with keyphrases assigned by the users of the website citeulike.org [19]. Hulth (2003) released a corpus of 2000 abstracts of journal articles from Inspec with keyphrases assigned by professional indexers [8]. Nguyen and Kan (2007) contributed a corpus of 120 computer science documents with author and reader assigned keyphrases [24].

For the task of keyphrase extraction with a controlled thesaurus vocabulary, several dataset-thesaurus pairs are available [22][19]. The FAO-780 dataset consists of 780 documents from the Food and Agriculture Organization of the United Nations with keyphrases assigned from the Agrovoc agricultural thesaurus. NLM-500 is a dataset of 500 medical documents indexed with terms from MeSH, a thesaurus of medical subject headings. CERN-290 is a set of 290 physics documents indexed with terms from HEP, a high energy physics thesaurus. WIKI-20 is a small set of 20 documents with Wikipedia article titles assigned as keyphrases. Wikipedia acts as the thesaurus for this dataset.

2.1.2 Keyphrase Assignment

Keyphrase assignment utilizes a separate classifier for each keyphrase. Techniques such as support vector machines, boosting, and multiplicative weight updating are all effective given a sufficiently large training set [28]. Dumais et al. achieved high accuracy assigning keyphrases to the Reuters-21578 dataset using support vector machines [4]. They utilized the tf-idf term weights of the words in the documents. For each keyphrase, they used only the 300 features that maximized mutual information. The Reuters-21578 dataset contains 12,902 stories and 118 keyphrases (categories), with all keyphrases assigned to at least one story.

Unfortunately, most datasets have substantially fewer training documents and more keyphrases. In many datasets, a given keyphrase is never encountered or is only encountered a few times, so a per-keyphrase classifier cannot be trained. This is particularly true when assigning keyphrases from domain-specific thesauri, which are often very large.
For example, the MeSH controlled vocabulary of medical subject headings contains over 24,000 subject descriptors, so an enormous training set would be required to ensure that each keyphrase was encountered multiple times in the training set [19].

2.2 Evaluation

A number of metrics have been proposed for the evaluation of keyphrase extraction algorithms. Generally, performance is computed by comparing the keyphrases extracted by the algorithm to a list of gold keyphrases generated by the author or readers of the papers. This comparison can be done by comparing the strings directly, or using more sophisticated techniques that address near misses and semantic similarity. In Section 2.2.1, we review prior work in this area, and describe the main performance metric that we will use in this thesis.

Given the small corpora available for training and testing, the use of cross validation is essential. In Section 2.2.2, we describe our use of cross validation and some issues with the lack of cross validation in prior work on keyphrase extraction.

In evaluating keyphrase extraction algorithms, it is important to understand how their performance compares with that of human indexers. The level of consistency among human indexers can be compared with the level of consistency between automatic indexers and human indexers. In Section 2.2.3, we review prior work in this area and present an improved measure of consistency that addresses some of the shortcomings of previous metrics.

2.2.1 Performance Metrics

The standard method for evaluating the performance of automatic keyphrase extraction systems is having the extraction system select N keyphrases for each document and then computing the F-score based on the number of matches with the gold standard keyphrase lists [14][6]. The comparisons with the gold standard lists are done after normalizing and stemming the keyphrases.

Measuring performance by checking for exact matches of the stemmed and normalized keyphrases fails to handle keyphrases that are similar, but not identical. For example, "effective grid computing algorithm" and "grid computing algorithm" are two similar keyphrases, but would be treated as complete misses. N-gram based evaluations can be used to address the effects of near misses [13]. Another possibility is measuring the semantic similarity of keyphrases and incorporating this into the performance metric. Medelyan and Witten (2006) propose an alternative thesaurus-based consistency metric [21]. In their metric, semantic similarity is computed from the number of links between terms in the thesaurus.

However, these more sophisticated evaluation metrics have not achieved widespread adoption. As such, we will evaluate performance using a standard evaluation metric for automatic keyphrase extraction, the macro-averaged F-score (β = 1). Keyphrase comparison is done after stemming and normalizing the keyphrases by lowercasing and alphabetizing. Macro-averaged F-score is the harmonic mean of macro-averaged precision and recall. Precision is the fraction of correctly extracted keyphrases out of all keyphrases extracted, and recall is the fraction of correctly extracted keyphrases out of all keyphrases in the gold list. For example, if the gold standard keyphrases were {"ranking", "decision tree", "bagging"} and the keyphrases extracted were {"ranking", "trees", "keyphrase extraction", "bagging"}, then the recall would be 2/3 and the precision would be 2/4.
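A minimal sketch of this per-document computation, using the example above; it assumes the keyphrases have already been stemmed and normalized, so matching is plain set intersection.

def precision_recall_f1(extracted, gold):
    """Per-document precision, recall, and F-score (beta = 1) by exact match
    of already stemmed and normalized keyphrases."""
    extracted, gold = set(extracted), set(gold)
    matches = len(extracted & gold)
    precision = matches / len(extracted) if extracted else 0.0
    recall = matches / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1

gold = {"ranking", "decision tree", "bagging"}
extracted = {"ranking", "trees", "keyphrase extraction", "bagging"}
print(precision_recall_f1(extracted, gold))  # (0.5, 0.666..., 0.571...)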
The macro-averaged precision and recall are the averages of these statistics over all documents.

Previous work has varied in the use of micro- vs macro-averaged F-scores. Medelyan et al. use macro-averaged F-scores in their evaluation of Maui, Oahu's predecessor [19]. Semeval 2010 Task 5 claims to use micro-averaged F-score [14]. However, their evaluation script does not correctly compute micro- or macro-averaged F-score. The section of the script which claims to compute the micro-averaged F-score actually computes the average of the F-scores of the various documents. Micro-averaged F-score is actually the harmonic mean of micro-averaged precision and recall. The section which claims to compute the macro-averaged F-score actually computes the harmonic mean of the micro-averaged precision and recall. Macro-averaged F-score is actually the harmonic mean of macro-averaged precision and recall.

We evaluate on the top 10 keyphrases extracted for each document. The restriction to extracting a fixed number of keyphrases for each document keeps the focus on extracting high quality keyphrases, instead of predicting the number of gold standard keyphrases there will be for each document. If the number of gold standard keyphrases for each document were known ahead of time, then F-score could be increased by predicting more keyphrases on documents with longer gold standard lists. On most keyphrase extraction corpora, the gold standard lists do not include all good keyphrases, and as such their lengths do not reflect how many keyphrases should be assigned to their documents. Instead, they reflect variations in the keyphrase assigning styles of different authors and journals. As such, predicting the lengths of the gold standard lists is not productive. Additionally, in practice, F-score is maximized when only a few keyphrases are predicted for each document. However, for most applications longer complete lists of keyphrases are preferable over shorter incomplete keyphrase lists. The evaluation system should reflect that and not encourage the generation of short lists. As such, allowing a variable number of keyphrases to be extracted is undesirable when F-score is used for evaluation. The desire for longer keyphrase lists is also why we chose to evaluate the top 10 keyphrases instead of the top 5 keyphrases.

2.2.2 Cross-Validation

Due to the small size of the available datasets, it is necessary to perform cross-validation. We use repeated random sub-sampling validation. This approach was chosen because it allows the number of trials and the size of the training sets to be chosen independently. In contrast, k-fold validation does not offer that freedom. The F-score is computed for each sub-sampling and then averaged over the sub-samplings. For training sets on the order of 100 to 200 documents, the standard deviation of the F-score for a single sub-sample typically exceeds 1. As such, the use of cross-validation to eliminate noise is essential. Cross-validation also helps address and avoid issues of overfitting.

We let σ denote the standard deviation of the F-scores from the sub-samplings. The standard deviation of the averaged F-score computed from these sub-samplings is estimated to be

\sigma_{\text{average}} = \frac{\sigma}{\sqrt{\text{number of sub-samplings}}}

When reporting our results, we report error to be twice the estimated standard deviation of the averaged F-score. So a reported performance of 43 ± 12 corresponds to an averaged F-score of 43 with an estimated standard deviation, σ_average, of 6.
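The following sketch illustrates this protocol under stated assumptions: evaluate_fscore is a hypothetical placeholder that trains the extractor on the training split and returns its macro-averaged F-score on the test split; it is not part of Oahu's actual interface.

import random
import statistics

def subsample_validate(documents, num_train, num_trials, evaluate_fscore, seed=0):
    """Repeated random sub-sampling validation (assumes num_trials >= 2).
    Returns the averaged F-score and the reported error, 2 * sigma_average."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_trials):
        docs = documents[:]
        rng.shuffle(docs)
        scores.append(evaluate_fscore(docs[:num_train], docs[num_train:]))
    sigma_average = statistics.stdev(scores) / len(scores) ** 0.5
    return statistics.mean(scores), 2 * sigma_average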
Previous papers in keyphrase extraction vary in their use of cross-validation. Medelyan et al. (2009) utilize 10-fold cross-validation in their experiments [19]. In contrast, Semeval 2010 Task 5 uses a single fixed split between training and test data [14]. Due to the small dataset, this introduces substantial noise into the reported performance statistics. Further exacerbating this issue, each team was allowed to submit the results of up to three runs of their system. As such, three different parameter configurations could be used, causing performance to be overestimated. A similar issue is present in the WINGUS system, where a single split of the training data was used for feature selection [25]. The lack of cross validation in the feature selection process may have led to poor choices of features, due to over-fitting to the single split.

2.2.3 Comparison to Human Indexers

Creating a keyphrase extraction system that is able to perfectly predict the gold standard keyphrase lists is not a plausible goal. There are many good keyphrases for any paper, and the authors and readers only select a subset thereof. Different human indexers select different sets of keyphrases for the same papers. Hence achieving an F-score of 100 is not feasible. To understand how the performance of keyphrase extraction systems compares to the theoretical maximum performance, we can compare them with skilled human indexers.

Medelyan et al. compare the performance of Maui to that of human taggers by comparing inter-indexer consistency on the CiteULike180 dataset [20]. They use Rolling's metric to evaluate the level of consistency between indexers. Rolling's consistency metric is:

RC(I_1, I_2) = \frac{2C}{A + B}

where C is the number of tags that indexers I_1 and I_2 have in common, and A and B are the number of tags assigned by I_1 and I_2, respectively [27]. They observed that the inter-indexer consistency among the best human indexers is comparable to the inter-indexer consistency between Maui and the best human indexers.

One shortcoming of this approach is that Medelyan et al. are forced to choose an arbitrary cutoff point for who qualifies as the "best human indexers." They select the top 50% of the human indexers as their set of "best human indexers". Before this step, they pre-filter the set of human indexers to indexers that are consistent with other indexers on several tags. Due to these arbitrary cutoffs, it isn't possible to use this procedure to make precise comparisons between the performance of human indexers and the performance of automatic keyphrase extraction systems. A second issue is that their automatic keyphrase extraction algorithm selects the same number of keyphrases for all documents, while the human indexers select a variable number of keyphrases for each document. This introduces some biases into the evaluation. As discussed previously, the number of keyphrases extracted per document affects performance, so restricting the keyphrase extraction algorithm likely puts it at a disadvantage relative to the human indexers.

We propose an improved evaluation metric that addresses some of the aforementioned issues. As will be discussed in Section 2.3, the Semeval dataset has both reader and author assigned keyphrases. As such, we can compare the quality of the reader and algorithm assigned keyphrases by evaluating their consistency with the author assigned keyphrases.
For each document, d, we select the top |Reader_d| keyphrases generated by the keyphrase extraction algorithm, where |Reader_d| is the number of reader assigned keyphrases for document d. This ensures that neither the reader nor the algorithm are given an advantage due to the number of keyphrases they select. We evaluate consistency on a single document using Rolling's metric. To compute overall consistency, we average Rolling's metric over all documents. Formally,

Consistency(A, B) = \frac{\sum_{d \in D} RC(A_d, B_d)}{|D|}

where D is the set of documents, and A and B are corresponding sets of keywords. We compare Consistency(Author, Reader) to Consistency(Author, Algorithm), where |Algorithm_d| = |Reader_d| for all d. As with the F-score, we perform cross validation, and average the inter-indexer consistency scores across runs.

This approach merges all of the readers into one indexer, effectively reducing the number of human indexers to 2. On the Semeval dataset, the mapping between readers and keyphrases is not reported, so this is sufficient. On other datasets, such as the CiteULike dataset, the mapping between readers and the keyphrases they assigned is available. The above technique can be generalized to utilize this additional information. We can evaluate the internal consistency of human indexers by computing the average inter-indexer consistency between pairs of indexers. Then this can be averaged over all documents. Formally,

HumanConsistency = \frac{1}{|D|} \sum_{d \in D} \frac{\sum_{l_1 \in L_d} \sum_{l_2 \in L_d \setminus \{l_1\}} RC(l_1, l_2)}{|L_d| (|L_d| - 1)}

where L_d is the set of lists of keyphrases for document d. To evaluate the consistency of the algorithm with the human indexers, we compute the average of the pairwise consistencies per document as before, but when computing consistency for each pair of indexers, we replace the keyphrase list of the first human indexer with the keyphrase list of the algorithm. Formally,

AlgorithmConsistency = \frac{1}{|D|} \sum_{d \in D} \frac{\sum_{l_1 \in L_d} \sum_{l_2 \in L_d \setminus \{l_1\}} RC(\text{top } |l_1| \text{ results of the algorithm on } d,\; l_2)}{|L_d| (|L_d| - 1)}

This generalization enables the comparison of inter-indexer consistency between multiple human indexers and automatic keyphrase extraction systems. It avoids giving unfair advantages to either the human or computer indexers by restricting the algorithm to submitting the same number of keyphrases as the human indexers.
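A minimal sketch of these consistency measures as defined above; extract_top(doc, k) is a hypothetical stand-in for running the trained extractor on a document and keeping its top k keyphrases, and every document is assumed to have at least two human keyphrase lists.

def rolling_consistency(tags_a, tags_b):
    """Rolling's inter-indexer consistency, RC = 2C / (A + B)."""
    a, b = set(tags_a), set(tags_b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def human_consistency(lists_by_doc):
    """Average pairwise consistency among human indexers. lists_by_doc maps
    each document to the list of keyphrase lists assigned by its indexers."""
    per_doc = []
    for lists in lists_by_doc.values():
        pairs = [(l1, l2) for l1 in lists for l2 in lists if l1 is not l2]
        per_doc.append(sum(rolling_consistency(l1, l2) for l1, l2 in pairs) / len(pairs))
    return sum(per_doc) / len(per_doc)

def algorithm_consistency(lists_by_doc, extract_top):
    """As above, but the first indexer of each pair is replaced by the
    algorithm, restricted to the same number of keyphrases."""
    per_doc = []
    for doc, lists in lists_by_doc.items():
        pairs = [(l1, l2) for l1 in lists for l2 in lists if l1 is not l2]
        total = sum(rolling_consistency(extract_top(doc, len(l1)), l2)
                    for l1, l2 in pairs)
        per_doc.append(total / len(pairs))
    return sum(per_doc) / len(per_doc)

Because each ordered pair of distinct lists is counted once, the per-document denominator matches the |L_d|(|L_d| - 1) term in the formulas above.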
2.2.4 Use of Evaluation Systems

In summary, we have presented two schemes for the evaluation of keyphrase extraction systems. The first, macro-averaged F-score, is consistent with the evaluation methods used in earlier works on keyphrase extraction. Our use of repeated random sub-sampling validation in the computation of F-score ensures properly cross-validated results. We will use this metric throughout this thesis to measure the effects of our improvements. The second evaluation metric is a new inter-indexer consistency metric which allows for the comparison of the relative performance of human indexers and automatic extraction systems. We will use this metric to compare the performance of our system to human indexers in Chapter 5.

2.3 Datasets

We will focus on three datasets of research documents: Collabgraph, CiteULike180, and Semeval 2010 Task 5. As mentioned in Section 2.1, the CiteULike180 and Semeval datasets are preexisting keyphrase extraction datasets which have been used in other papers. The Collabgraph dataset is a new dataset which we compiled for this thesis. These three datasets vary considerably in terms of the types of papers, the quality of their textual data, and the processes used to select the gold standard keyphrases. Appendix A contains example keyphrases from representative documents from each dataset.

Table 2.1 shows summary statistics for the three datasets. Average document length in characters, the average number of keyphrases per document, the average number of words per keyphrase, and oracle accuracy are reported. Oracle accuracy is the fraction of keyphrases that appear in the text of their document. Oracle accuracy is evaluated after normalization by stemming and lowercasing.

Table 2.1: Statistics for Datasets

                      Collabgraph   CiteULike180   Semeval
Doc Length            49453         35807          48185
Keyphrases per Doc    5.00          5.22           3.88
Keyphrase Length      1.80          1.16           1.96
Oracle Accuracy       82%           85%            83%

2.3.1 Collabgraph

The Collabgraph dataset consists of 942 research papers with author assigned keyphrases. They are drawn from a wide variety of fields and journals. Each paper has at least one author from MIT. The keyphrase lists were generated by running a simple script to identify lists of keyphrases in the texts of a larger set of documents. Not all of these documents contained lists of keyphrases, and so they are omitted from the dataset. After their identification, the keyphrase lists were removed from the documents so that they did not interfere with the training or evaluation process. The scripts for this document processing are included with Oahu.

Since this dataset is not comprised of freely available papers, we are not able to provide the textual data. Instead, we have provided a list of the papers used, in BibTeX format, so that the dataset can be downloaded by anyone with access to the papers from the dataset. This list of papers is included in the Oahu project on GitHub.

2.3.2 CiteULike180

The CiteULike180 corpus is 180 documents with tags assigned by users of the website citeulike.org [19]. CiteULike is an online bookmarking service for scholarly papers, where users can mark papers with tags. Medelyan et al. generated a corpus from the papers on CiteULike. To address issues of noise, they restricted to papers which had at least three tags that at least two users agreed upon. For taggers, they restricted to taggers with at least two co-taggers, where two users are said to be "co-taggers" if they have tagged at least one common document. They also restricted to papers from HighWire and Nature, since those journals had easily accessible full text PDF files.

As in Medelyan et al. (2009), we consider only tags which at least two users agreed upon. This gives a total of 180 documents, with 946 keyphrases assigned. The keyphrases seen in this dataset are notably shorter than the keyphrases from the other datasets, typically only a single word in length. The papers are primarily about biology, with a smaller number of math and computer science papers mixed in.

There are some issues with the cleanliness of this data. A number of the documents contain sections from other papers at their beginning or end. This is because the text of the documents was generated by running pdf2text on PDF files downloaded from citeulike.org. Some of these PDFs are of pages from journals, such as Nature, where the first or last page of an article often contains the end or beginning of another article in the journal, respectively. This form of noise has an impact on the accuracy of keyphrase extraction by introducing spurious candidates and padding the length of the documents.
As discussed in Section 2.1, a number of keyphrase extraction systems utilize the locations of the first and last occurrence, which are affected by these extraneous passages.

2.3.3 Semeval 2010 Task 5

The corpus from Semeval 2010 Task 5 consists of 244 conference and workshop papers from the ACM Digital Library [14][15]. The documents come from four 1998 ACM classifications: C2.4 (Distributed Systems), H3.3 (Information Search and Retrieval), I2.11 (Distributed Artificial Intelligence - Multiagent Systems) and J4 (Social and Behavioral Sciences - Economics). There are an equal number of documents for each classification, and the documents are marked with their classification. Each document has author assigned keyphrases and reader assigned keyphrases. The reader assigned keyphrases were assigned by 50 students from the Computer Science department of the National University of Singapore. Each reader was given 10 to 15 minutes per paper.

Evaluation in Semeval 2010 Task 5

Semeval 2010 Task 5 specifies a standard evaluation procedure. The corpus is split into 144 training documents and 100 testing documents. As previously mentioned in Section 2.2, there are some issues with the procedure they specify and the script provided to perform this procedure. Instead of evaluating using their split between training and testing documents, we merge the two sets, and evaluate using cross validation.

2.3.4 Author vs Reader Selected Keyphrases

Intuitively, we would expect the quality of author assigned keyphrases to exceed that of reader assigned keyphrases. Authors have a deep understanding of the papers they write, and hence are in a good position to assign keyphrases. A direct examination of the Semeval dataset reveals that the keyphrases chosen by readers have some shortcomings. The readers assigning keyphrases to the Semeval corpus were given 10 to 15 minutes per document, a fairly limited amount of time [14]. As such, we would expect that location within the document would strongly inform their keyphrase selection. This hypothesis is confirmed by the analysis of this corpus by Nguyen et al. [25]. In Semeval 2010 Task 5, systems that utilize document section analysis, such as HUMB, WINGUS, and SEERLAB, substantially outperform systems such as Maui which lack such features [15]. On the author assigned keywords, document section analysis plays a much less important role. Since author selected keyphrases are high quality, due to being generated without time pressure by people who have fully read the papers, we will focus on author selected keyphrases.

2.3.5 Testing Datasets

Throughout this paper we will focus on the Collabgraph dataset. As discussed above, author selected keyphrases are preferable to reader selected keyphrases. For this reason, we did not focus on the CiteULike dataset. The Semeval and Collabgraph datasets both have author assigned keyphrases, but the Collabgraph dataset is substantially larger. Since a larger corpus reduces the risks of overfitting, we focus on the Collabgraph corpus.

When measuring performance, we use 200, 100, and 150 training documents for the Collabgraph, CiteULike180, and Semeval datasets, respectively. These training document counts were chosen so that the number of testing documents was not so small as to substantially increase the variance of the F-score. Empirically, at least 50 testing documents were needed to avoid this issue. This form of noise can be resolved by running additional trials, but runtimes were already very long, so this was undesirable.
The numbers of training documents were also chosen to be similar so as to lessen the differences between the datasets. For this reason, the Collabgraph document count was chosen to be 200 instead of 400. Additionally, using 400 training documents would have substantially slowed the running of experiments.

2.4 Baseline

In this thesis, we build upon Maui, an open source keyphrase extraction system. Maui achieves performance at or very near the state of the art on author assigned keyphrases. In this section, we will discuss Maui in detail and report its performance on all three datasets.

2.4.1 Baseline System

Maui performs keyphrase extraction in the standard fashion, first heuristically filtering to a list of candidate keyphrases and then selecting the top candidates with a trained ranker [19]. Filtering is performed by selecting phrases of n words or less that do not begin or end with a stopword. The ranking step is performed using bagged decision trees generated from the training set. Maui utilizes a number of features:

1. TF, IDF, and TF-IDF of the candidate keyphrase.

2. Phrase Length, the number of words in the candidate.

3. Position of First Occurrence and Position of Last Occurrence, both normalized by document length. Spread, the distance between the first and last occurrences, also normalized by the document length.

4. Domain Keyphraseness, the number of times the candidate was chosen as a keyphrase in the training dataset.

Maui also allows for the use of domain specific thesauruses or Wikipedia to compute additional features. In this thesis, we will only use the Wikipedia features:

1. Wikipedia Keyphraseness, the probability of an appearance of the candidate in an article being an anchor. An anchor is the visible text in a hyperlink to another article.

2. Inverse Wikipedia Keyphraseness, the probability of a candidate's article being used in the text of other articles. Given a candidate, c, with corresponding article A, the inverse Wikipedia keyphraseness is the number of incoming links, inLinksTo(A), divided by the total number of Wikipedia articles, N. Then -\log_2 is applied, giving:

IWF(c) = -\log_2 \frac{\mathrm{inLinksTo}(A)}{N}

Since bagged decision trees are used, the normalization used here is immaterial. This feature is equivalent to inLinksTo(A).

3. Total Wikipedia Keyphraseness, the sum of the Wikipedia keyphrasenesses of the corresponding articles for each phrase in the document that was mapped to the candidate.

4. Generality, the distance between the category corresponding to the candidate's Wikipedia article and the root of the category tree.

By default, Maui filters to candidates that occur at least twice and uses 10 bags of size 0.1 during ranking.

The performance of Maui in Semeval 2010 Task 5 indicates it is at or very close to the state of the art for extraction of author assigned keyphrases. The only system which outperformed Maui on this task was HUMB. As previously discussed in Section 2.2, there are a number of issues with the Semeval 2010 Task 5 evaluation system which make it difficult to determine if this outperformance was due to noise or a genuine advantage. A trial of Maui on the particular split used by Semeval 2010 Task 5 indicates that the training-test split used by the task may have been particularly bad for Maui. Furthermore, HUMB utilized an additional 156 training documents compared to the 144 documents used by the other systems evaluated in this task [17][14]. This more than doubled the number of training documents available to HUMB.
As can be seen below, the number of training documents has a substantial effect on the performance of keyphrase extraction systems. Although this was fair within the rules of Semeval 2010 Task 5, it makes it difficult to determine if HUMB has any real advantage over Maui, or whether its outperformance was simply due to a larger training corpus. Unfortunately, the results of Semeval 2010 Task 5 are the only published performance numbers for HUMB, and the system is not open-source. As such, we will not compare our performance with HUMB.

2.4.2 Baseline Performance

Tables 2.2, 2.3, and 2.4 show the performance of Maui with and without Wikipedia features, for a range of training document counts. Note that we run Maui with all parameters set at their default values. This includes only considering candidate keyphrases that appear at least twice.

Table 2.2: Collabgraph Baseline Performance

Training Docs    w/o Wikipedia   with Wikipedia
50               14.3 ± 0.3      16.2 ± 0.7
100              15.4 ± 0.2      17.1 ± 0.8
200              16.5 ± 0.2      17.0 ± 0.6
400              17.9 ± 0.1      18.7 ± 0.5

Table 2.3: CiteULike180 Baseline Performance

Training Docs    w/o Wikipedia   with Wikipedia
50               28.8 ± 0.2      29.2 ± 0.8
100              31.1 ± 0.3      31.2 ± 0.7

Table 2.4: Semeval Baseline Performance

Training Docs    w/o Wikipedia   with Wikipedia
50               15.9 ± 0.2      16.7 ± 0.6
100              17.7 ± 0.2      17.4 ± 0.6
200              18.7 ± 0.4      18.6 ± 1.8

From these tables, we can see that increasing the number of training documents has a substantial positive effect. The use of Wikipedia features improves performance on the Collabgraph and Semeval datasets. These improvements are more pronounced when the training sets are smaller. The use of Wikipedia does not provide any substantial performance improvement on the CiteULike180 dataset.

Table 2.5 shows the baseline performance of Maui on the Collabgraph, CiteULike180, and Semeval datasets for 200, 100, and 150 training documents, respectively. These are the training document counts we will be using for the rest of this thesis. As such, these are the performance numbers that all improvements will be compared to.

Table 2.5: Summary Table

                 Collabgraph   CiteULike180   Semeval
w/o Wikipedia    16.5 ± 0.1    31.1 ± 0.2     18.2 ± 0.1
with Wikipedia   17.5 ± 0.2    31.2 ± 0.2     18.3 ± 0.2

Chapter 3

Ranking Methods

In this chapter, we present several advancements to the ranking methods used for keyphrase extraction. In Maui, bagged decision trees are used to rank the candidate keyphrases generated by the filtering stage. Our first advancement is a modified training procedure where the weight of positive examples is increased during bagging. This is seen to improve both performance and runtime. We also discuss how this improvement relates to the tuning of bagging parameters. Our second advancement is the use of a second set of bagged decision trees for reranking. We also propose a number of extensions to this technique.

3.1 Reweighting and Bagging

Maui, HUMB, and a variety of other keyphrase extraction systems rely on bagged decision trees to rank candidate keyphrases [20] [17]. In the training process, these trees are trained on the candidate keyphrases that the filtering stage generates from the training documents. Here, the number of negative examples vastly exceeds the number of positive examples, due to the simplicity of the candidate extraction process and the small number of gold standard keyphrases per document. Maui simply extracts N-grams that do not begin or end with a stopword. This results in thousands to tens of thousands of candidate keyphrases per document.
In all three datasets, the average document has more than 100 times as many negative candidates as it has positive candidates.

A second, complementary issue is that during bagging some of the original data points may be omitted. In bagging, m new datasets D_i are drawn with replacement from the original dataset D, such that |D_i|/|D| = p. The probability that some candidate c ∈ D is not in any dataset D_i is:

P(c \notin \bigcup_j D_j) = (1 - 1/|D|)^{mp|D|}

Assuming that |D| is large, this can be approximated as

P(c \notin \bigcup_j D_j) \approx e^{-mp}

By default, Maui uses parameters m = 10 and p = .1, meaning that P(c \notin \bigcup_j D_j) \approx e^{-1} \approx .37. So on average, 37% of data points are not used to train any of the decision trees. Naturally, this can be addressed by increasing the values of m and p, however this comes at the cost of runtime.

3.1.1 Reweighting positive examples

The samples in D can be reweighted before the bagged sets, {D_j}, are drawn. After upweighting the positive examples to w times the weight of the negative examples, the probability of a positive candidate c_+ appearing in none of the new datasets D_i is:

P(c_+ \notin \bigcup_j D_j) = \left(1 - \frac{w}{w|D_+| + |D_-|}\right)^{mp|D|}

where D_+ and D_- are the sets of positive and negative candidates, respectively. Assuming that |D| is large and w|D_+| \ll |D|,

P(c_+ \notin \bigcup_j D_j) \approx \left(1 - \frac{w}{|D|}\right)^{mp|D|} \approx e^{-wmp}

Table 3.1: Reweighting Tuning

w     P(c_+ \notin \bigcup_j D_j)   F-score
1     0.37                          16.5 ± 0.2
2     0.14                          16.9 ± 0.2
4     0.018                         17.4 ± 0.1
8     0.00034                       17.3 ± 0.1
16    1.1e-07                       16.9 ± 0.2
32    1.3e-14                       16.1 ± 0.2
64    1.6e-28                       13.1 ± 0.4

Table 3.1 shows the effect of varying w on the Collabgraph dataset. Empirically, we see that increasing w can dramatically improve performance at negligible cost to runtime. We see that w = 4 yields the highest performance. Note that for w = 4, P(c_+ \notin \bigcup_j D_j) ≈ .02, so on average all but 2% of the positive examples will be used at least once.

This performance improvement makes intuitive sense, especially with the default bagging parameters of m = 10 and p = .1. This reweighting ensures that the precious positive data points are used in at least one decision tree with high probability. The negative data points are still omitted, but the information encoded by the negative data points is not particularly unique. A second benefit is that the reweighting results in deeper decision trees, because there are more positive candidates in the training data for each tree. Deeper decision trees express the roles of a greater number of features. Since C4.5 decision trees are used, less predictive features will only appear in the decision trees if the trees are sufficiently deep. This is because the most predictive features occupy the first few layers of the trees.
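The sketch below shows how such reweighted bags could be drawn. It is an illustration of the idea under stated assumptions, not Oahu's actual Weka-based implementation, and the function and parameter names are hypothetical.

import random

def draw_reweighted_bags(candidates, num_bags=10, bag_fraction=0.1, w=4, seed=0):
    """Draw bagged training sets with positive examples upweighted by w.

    candidates is a list of (feature_vector, is_positive) pairs. Each bag
    holds bag_fraction * len(candidates) examples drawn with replacement,
    with a positive example w times as likely to be drawn as a negative one.
    Each bag would then be used to train one C4.5-style decision tree."""
    rng = random.Random(seed)
    weights = [w if is_positive else 1 for _, is_positive in candidates]
    bag_size = max(1, int(bag_fraction * len(candidates)))
    return [rng.choices(candidates, weights=weights, k=bag_size)
            for _ in range(num_bags)]

With the default m = 10 bags of size p = 0.1 and w = 4, a positive candidate is left out of every bag with probability roughly e^{-wmp} ≈ 0.02, matching Table 3.1.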
3.1.2 Comparison to Tuning of Bagging Parameters

The bagging parameters, bag size and bag count, can also be tuned for performance. Table 3.2 shows the effects of tuning the bagging parameters. Increasing the number of bags and the bag size yields substantial performance improvements. However, these improvements come at a high runtime cost, with runtime scaling linearly in the number of bags and super-linearly in the bag size.

Table 3.2: Tuning Bagging Parameters

    Bag Size \ Number of Bags     10           20           40           80
    0.1                           16.4 ± 0.2   17.2 ± 0.3   17.8 ± 0.3   18.1 ± 0.3
    0.2                           17.1 ± 0.3   17.5 ± 0.2   17.8 ± 0.2   18.1 ± 0.2
    0.4                           17.0 ± 0.2   17.3 ± 0.3   18.0 ± 0.2   17.9 ± 0.2

Runtime is an important constraint in this problem. Generating all of the experimental results for this thesis took upwards of 3 days of runtime on a single machine (quad-core i7-4770k at 3.7 GHz running on 6 threads, using more than 12 GB of RAM). As such, increasing the bag count and bag size is undesirable. Hence, reweighting is very useful, as it improves performance with almost no cost to runtime.

Note that the effect of reweighting is similar to the effect of increasing bag size. Doubling the bag size and doubling the weight of the positive examples both double the number of positive examples per decision tree. The performance increases from scaling up bag size in Table 3.2 are comparable to the improvements seen from reweighting in Table 3.1, indicating there is some validity to this interpretation.

Reweighting positive examples can be effectively combined with the tuning of the bagging parameters. Setting the number of bags to 20, the bag size to 0.1, and w to 4 gives an F-score of 18.1 ± 0.2. This is equal to the highest F-score seen from tuning the bagging parameters, but requires substantially less runtime. So, the reweighting of positive examples reduces the runtime required to achieve maximum performance.

3.1.3 Summary of Reweighting

Reweighting provides a substantial performance improvement on all three datasets. Table 3.3 shows the improvement from upweighting positive examples by a factor of 4 on each dataset. Unlike increasing bag size or bag count, this performance gain comes at minimal cost to runtime. Reweighting can also be productively combined with the tuning of bagging parameters to improve performance and reduce runtime.

Table 3.3: Reweighting Summary

    Dataset          w = 1        w = 4
    Collabgraph      16.5 ± 0.1   17.3 ± 0.1
    CiteULike180     31.1 ± 0.2   32.4 ± 0.2
    Semeval          18.2 ± 0.1   18.8 ± 0.2

3.2 Reranking

In reranking, the top k results from the initial ranking step are passed to a reranker which rescores them. The reranker is trained on the output of the ranker on the training set. Reranking the output of an initial ranking step with a more sophisticated scorer has been used with great success in parsing tasks [3]. In decoding tasks, reranking allows for the use of more sophisticated scoring functions which cannot be used during the initial ranking step due to computational limitations. The exponential space of parse trees makes it difficult to use global features during the ranking stage. Unlike decoding, the candidate space in keyphrase extraction is not exponentially large, typically consisting of several thousand candidates. However, this space is sufficiently large that computing expensive features, such as Wikipedia features, can be prohibitively slow.

3.2.1 Bagged Decision Trees

Bagged decision trees can be used for reranking. Even with the same set of features at the ranking and reranking stages, the chaining of two sets of bagged decision trees has advantages over a single bagged decision tree ranker. Since the filtering step used by Oahu is simple, it generates many terrible candidates. For example, Oahu would generate the candidate "filtering step used by Oahu" from the previous sentence. This is clearly a poor candidate, but it would affect the training of the bagged decision trees used for ranking. The bagged decision trees in the ranking step learn which features discriminate between the positive candidates and all negative candidates. When the reranking step is added, it is trained on the output of the ranking stage, the good candidates. Poor candidates such as "filtering step used by Oahu" are filtered out by the ranking stage. Since the reranker takes the top candidates as its input, it learns which features discriminate between the positive candidates and the good candidates which did not appear in the gold standard lists.
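The overall shape of this rank-then-rerank training procedure can be sketched as follows. This is an illustrative Python reimplementation using scikit-learn, not Maui's or Oahu's actual Weka-based code: the function and variable names are ours, CART trees stand in for C4.5/J48, and the reranker's candidate count (k = 160) and bagging parameters (160 bags of size 0.2) anticipate the tuned values reported below.

    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    def bagged_trees(n_bags=10, bag_size=0.1):
        # Bagged decision trees; Maui uses Weka's C4.5 (J48), sklearn's CART is a stand-in.
        return BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_bags,
                                 max_samples=bag_size, bootstrap=True)

    def train_rank_then_rerank(X, y, doc_ids, w=4, k=160):
        """X: candidate feature matrix, y: 1 for gold-standard keyphrases,
        doc_ids: document index of each candidate row."""
        X, y, doc_ids = np.asarray(X), np.asarray(y), np.asarray(doc_ids)

        # Stage 1: ranker trained on all candidates, positives upweighted by w.
        ranker = bagged_trees()
        ranker.fit(X, y, sample_weight=np.where(y == 1, float(w), 1.0))

        # Stage 2: keep each document's top-k candidates by ranker score and
        # train a second set of bagged trees on just those "good" candidates.
        scores = ranker.predict_proba(X)[:, 1]
        keep = np.zeros(len(y), dtype=bool)
        for d in np.unique(doc_ids):
            idx = np.flatnonzero(doc_ids == d)
            keep[idx[np.argsort(-scores[idx])[:k]]] = True

        reranker = bagged_trees(n_bags=160, bag_size=0.2)
        reranker.fit(X[keep], y[keep])
        return ranker, reranker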
There are four hyperparameters for bagged decision trees in reranking. The first two are the number of candidates passed to the reranker during training and the number of candidates passed to the reranker during testing. The second two are the bagging parameters, bag size and bag count.

Table 3.4 shows performance as a function of the number of candidates passed to the reranker during training and testing. Traditionally, the same number of candidates is passed to the reranker during training and testing. We consider the possibility of using different values, since it is theoretically possible that using different numbers of candidates for training and testing could improve performance. For example, decreasing the number of testing candidates could improve performance by increasing the average quality of the candidates passed to the reranker.

Table 3.4: Tuning Reranker Candidate Counts

    Training \ Testing     40           80           160          320
    40                     18.2 ± 0.2   18.3 ± 0.2   18.4 ± 0.2   18.5 ± 0.2
    80                     17.4 ± 0.2   18.2 ± 0.2   18.5 ± 0.2   18.6 ± 0.2
    160                    16.6 ± 0.2   18.2 ± 0.2   18.4 ± 0.2   18.4 ± 0.2
    320                    16.4 ± 0.2   18.2 ± 0.2   18.4 ± 0.1   18.5 ± 0.1

Based on the results in Table 3.4, we see that there is a large set of parameter values that maximize performance. We choose 160 training candidates and 160 testing candidates, because having the same number of training and testing candidates increases the symmetry between the training and testing steps, which is useful when we consider features like semantic relatedness later in this chapter. Although 320 candidates would also work, we choose 160 because fewer reranking candidates result in lower runtimes.

We also tune the bagging parameters. Table 3.5 shows F-score for various reranking bag counts and bag sizes. This experiment was performed with 160 reranking candidates on the Collabgraph dataset. It shows that maximum performance is achieved with 160 bags and a bag size of 0.2.

Table 3.5: Tuning Reranker Bagging Parameters

    Bag Size \ Bag Count     40           80           160          320
    0.1                      18.1 ± 0.2   18.3 ± 0.2   18.4 ± 0.2   18.4 ± 0.1
    0.2                      18.2 ± 0.2   18.3 ± 0.1   18.6 ± 0.2   18.4 ± 0.1
    0.4                      18.1 ± 0.1   18.5 ± 0.1   18.7 ± 0.2   18.4 ± 0.1
    0.8                      17.8 ± 0.2   18.1 ± 0.1   18.0 ± 0.1   18.1 ± 0.2

Reranking improves the F-score dramatically on all three datasets. As can be seen in Table 3.6, reranking improves F-score by 2 to 4 points. Table 3.6 was generated with the hyperparameters chosen above, and no Wikipedia features.

Table 3.6: Reranking Summary

    Dataset          w/o Reranking   with Reranking
    Collabgraph      16.5 ± 0.1      18.5 ± 0.1
    CiteULike180     31.1 ± 0.2      35.2 ± 0.2
    Semeval          18.2 ± 0.1      20.6 ± 0.2

3.2.2 Support Vector Machines

We also experimented with the use of support vector machines for reranking, using the SVMlight package [10]. Linear, polynomial, radial, and sigmoidal kernels were tested, with a variety of hyperparameter values. The reweighting of positive examples was also considered, to address the imbalanced dataset. However, in all cases, reranking with SVMs was worse than no reranking. For the simpler linear kernel, performance decreased dramatically as the SVM failed to split the data. For the other kernels, the performance was often only slightly reduced relative to no reranking; however, a substantial fraction of the time, no good fit was found and performance decreased dramatically.

The failure of support vector machines can be explained largely by the noise present in the data. As evidenced by the low inter-indexer consistency seen between humans, a single human indexer will omit many good keyphrases [19].
As such, training data generated by a single human indexer has only a fraction of the candidates that would be good keyphrases marked as keyphrases. This results in many negative examples mixed in with the positive examples. The data is also highly unbalanced and quite complex, which makes training difficult.

3.2.3 Extensions

New features can be added at the reranking stage. This is useful for two categories of features: features which are too computationally expensive to be computed for all candidates, and features that are computed using the ranked list of candidates generated by the ranker.

Delayed Feature Computation

When Wikipedia features are used, they dominate the runtime because their computation is very expensive. As such, a natural optimization is to only compute Wikipedia features for candidates that are selected for reranking. Table 3.7 shows the effects of adding Wikipedia features at different stages of the ranking and reranking process. This is on the Collabgraph dataset with the reranking hyperparameters discussed in Section 3.2.1. As can be seen in the table, delaying the computation of Wikipedia features until after ranking has minimal effect on the F-score, but decreases runtime by a factor of 5. As such, computing Wikipedia features before reranking instead of before ranking is suitable for runtime-sensitive environments. In Chapter 4, when we explore the effects of removing features, we will employ this optimization to make experimentation with Wikipedia features feasible.

Table 3.7: Adding Wikipedia Features at Different Stages

    Wikipedia Features          F-score      Runtime (normalized)
    None                        18.7 ± 0.2   1.0
    Computed before Ranking     20.4 ± 0.2   17.2
    Computed before Reranking   20.5 ± 0.3   3.4

Semantic Relatedness

Medelyan et al. introduce a semantic relatedness feature, semRel, computed using Wikipedia [19]. This feature is the average semantic relatedness of the Wikipedia article of the candidate to the articles of the other candidates in the document. This semantic relatedness feature is prohibitively computationally intensive when there are a substantial number of candidates, since its computation is quadratic in the number of candidates. Maui supports the use of this semantic relatedness feature when Wikipedia is used as a thesaurus. The use of Wikipedia as a thesaurus reduces the number of candidates, and hence the time required to compute semantic relatedness. We are not using Wikipedia as a thesaurus; we allow for keyphrases that are not the titles of Wikipedia articles.

We tried adding this semantic relatedness feature at the reranking stage. Instead of computing the average of the semantic relatedness of the candidate to all other candidates in the document, we restricted the computation to the top k candidates generated by the ranker. However, we found that the addition of this semantic relatedness feature at the reranking stage gave no meaningful increase in F-score on any of the three datasets, and increased runtime by more than an order of magnitude.

Number of Higher Ranked Superstrings and Substrings

Author-assigned keyphrase lists rarely contain two phrases such that one phrase contains the other. If an author chooses the keyphrase "information theory", they are unlikely to also choose "information" as a keyphrase. To allow the keyphrase extraction system to learn this, we introduce two new reranking features, numSuper and numSub. They are the number of superphrases and the number of subphrases of a candidate that are ranked higher than the candidate by the ranker. String containment testing is done after normalization and stemming.
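A minimal sketch of how these two counts can be read off a ranked candidate list is given below; the function name is ours, and plain substring containment over the stemmed forms stands in for whatever normalization Maui applies.

    def super_sub_counts(ranked_candidates):
        """ranked_candidates: stemmed candidate strings, best first (e.g. the top-k
        list passed to the reranker). For each candidate, count how many
        higher-ranked candidates contain it (numSuper) or are contained in it (numSub)."""
        counts = []
        for i, cand in enumerate(ranked_candidates):
            higher = ranked_candidates[:i]
            num_super = sum(1 for h in higher if cand in h and cand != h)
            num_sub = sum(1 for h in higher if h in cand and cand != h)
            counts.append((num_super, num_sub))
        return counts

    # super_sub_counts(["information theory", "information", "graph coloring"])
    # -> [(0, 0), (1, 0), (0, 0)]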
Table 3.8 shows the effects of adding the numSuper and numSub features on all three datasets. Individually, the features either improve performance or have no significant effect on it. Together, the features improve performance on both the Collabgraph and Semeval datasets. The CiteULike180 dataset is not affected by the features in any significant fashion. This is consistent with the high frequency of single-word keyphrases in the CiteULike180 dataset. Since the large majority of keyphrases in the CiteULike180 dataset are a single word, there are fewer interactions between keyphrases of different lengths.

Table 3.8: Effects of numSuper and numSub Features

    Features       Collabgraph   CiteULike180   Semeval
    Neither        18.6 ± 0.1    35.1 ± 0.1     20.5 ± 0.1
    numSuper       18.9 ± 0.1    35.0 ± 0.1     20.8 ± 0.1
    numSub         18.7 ± 0.1    35.0 ± 0.1     20.4 ± 0.1
    Both           19.0 ± 0.1    35.2 ± 0.1     20.7 ± 0.1

Anti-Keyphraseness

Some phrases may be unsuitable as keyphrases even though they appear to be suitable keyphrases based on the values of their features. For example, in physics papers, the phrase "electron" may appear to be a good keyphrase based on its tf-idf, position, and so on. However, physicists may not consider "electron" to be a suitable keyphrase because it is too general. As a result, "electron" is never chosen as a keyphrase in the training data. However, the keyphrase extraction algorithm will choose "electron" as a keyphrase on both the training and testing data.

To attempt to address this issue, we introduce an anti-keyphraseness feature, antiKeyphr, which indicates how often a phrase was chosen as a keyphrase by the algorithm but was not actually a keyphrase. Anti-keyphraseness is the fraction of the times that the phrase was in the top k candidates selected by the ranker but was not actually in the gold standard list. This feature is then added and used by the reranker. Formally,

    antiKeyphr(c) = ( Σ_{d' ∈ D_R(c)\{d}} (1 − K(c, d')) ) / |D_R(c) \ {d}|

where D_R(c) is the set of documents where c is in the top k candidates selected by the ranker, and d is the document containing c. K(c, d') is an indicator function which is 1 if c is a keyphrase of d' and 0 otherwise. If a phrase has not been encountered previously, it has an anti-keyphraseness of 1.
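The feature is straightforward to compute once the ranker's per-document top-k lists are available. The sketch below is ours (the container names are hypothetical) and mirrors the formula above, including the default value of 1 for phrases that have not been seen before.

    def anti_keyphraseness(phrase, doc_id, top_k_lists, gold):
        """top_k_lists: doc -> set of phrases the ranker placed in its top k.
        gold: doc -> set of gold-standard keyphrases for that doc.
        Returns the fraction of *other* documents where `phrase` reached the
        top k but was not a gold keyphrase."""
        docs = [d for d, top in top_k_lists.items() if phrase in top and d != doc_id]
        if not docs:
            return 1.0  # phrases never seen before default to an anti-keyphraseness of 1
        return sum(1 for d in docs if phrase not in gold[d]) / len(docs)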
The addition of the antiKeyphr feature substantially decreased performance. We explored a wide range of values of k and a few close variants of anti-keyphraseness, and did not see any improvement. This feature often appeared high in the decision trees. We believe that this feature fails because it is tightly connected to other features and has high noise, so it decreases the accuracy of subsequent splits. Note that anti-keyphraseness is linked to domain keyphraseness, because a phrase will have non-zero domain keyphraseness if and only if it has an anti-keyphraseness of less than 1. Since the keyphrase lists used for training are not complete, there is a high degree of noise, and suitable phrases may still be assigned high anti-keyphraseness. Although the anti-keyphraseness feature did not prove effective, we believe that the concept may still have value. It may be possible to incorporate this information more effectively with a heuristic step following reranking or a different feature.

3.2.4 Summary of Reranking

Reranking proved to be a highly effective strategy, both as a stand-alone improvement and by enabling further extensions. Delaying the computation of expensive Wikipedia features until the reranking stage dramatically reduced runtime at minimal cost to F-score. The new numSuper and numSub features, computed from the ranked list output by the ranker, were seen to improve performance on all three datasets.

Chapter 4

Features

In this chapter, we work to improve the set of features used during the ranking step. We introduce two new features, augmented domain keyphraseness and average word length. These two features are seen to substantially improve the accuracy of our keyphrase assignment system. Poorly chosen features can have a dramatic negative effect on the performance of bagged decision trees. As such, we also explore the effects of feature removal across all three datasets.

4.1 Augmented Domain Keyphraseness

Maui utilizes a domain keyphraseness feature which indicates the number of times a candidate appears as a keyphrase in the training set. This feature is used during the ranking step, not the filtering step. Hence it does not prevent candidates that have a keyphraseness of zero from being chosen during keyphrase extraction. Formally, the domain keyphraseness of a candidate c from document d is:

    DomainKeyphr(c) = Σ_{d' ∈ D(c)\{d}} 1

where D(c) is the set of documents in the training set that have c as a keyphrase. Here we assume that for each document, each keyphrase has a frequency of 0 or 1. This is the case when each document has only a single indexer, or when the keyphrase lists of multiple indexers are merged and any frequency information is discarded. This is the case for all of our datasets. When this is not the case, we could either keep keyphraseness as is, or modify it to

    DomainKeyphr(c) = Σ_{d' ∈ D(c)\{d}} f(c, d')

where f(c, d') is the number of times c is a keyphrase for document d'. The modified form may be substantially worse, since it seems that the same keyphrase being assigned to multiple documents is substantially more significant than multiple indexers assigning the same keyphrase to one document. However, this depends on the specifics of the corpus and how the training keyphrases were generated. We will not explore this issue, since it is not relevant on our corpora.

The domain keyphraseness feature improves performance on datasets where the same phrases are used as keyphrases many times. However, on datasets which span multiple domains, the keyphraseness feature can result in poor choices of keyphrases. For example, the CiteULike180 dataset contains both math papers and biology papers. The phrase "graph" has high keyphraseness because it is frequently used as a keyphrase for math papers in the training set. As a result, "graph" is often incorrectly selected by the extraction system as a keyphrase for biology papers.

To address this shortcoming of domain keyphraseness, we propose an augmented form of domain keyphraseness which is aware of which documents are similar. Instead of weighing all appearances of a candidate as a keyphrase equally, we weight them by the inverse of the pairwise document distance between the document of the candidate and the training document where the candidate appeared as a keyphrase. The new formula for keyphraseness is

    AugmDomainKeyphr(c) = Σ_{d' ∈ D(c)\{d}} 1 / ||d, d'||

where ||d, d'|| is the pairwise distance between documents d and d'. Note that we have not yet specified our distance metric. The document distance metric should place documents from the same domain close together.
Suppose we have a hierarchical document clustering, with N clusters of documents and a binary tree with the clusters as leaves. We assume these clusters are composed of similar documents, and that the tree is arranged with similar clusters close together. Then similar documents will be close together in the tree, while dissimilar documents will be far apart. Hence, a suitable metric is

    ||d, d'|| = 1 / max(T − lcad(d, d'), 0)

where lcad(d, d') is the distance to the lowest common ancestor of the clusters of d and d', and T is a thresholding constant. Let a be the lowest common ancestor of the clusters of d and d'. If a is the kth and k'th ancestor of the clusters of d and d' respectively, then

    lcad(d, d') = max(k, k')

Lowest common ancestor distance, lcad, is a measure of the distance between d and d'. However, having a scaling parameter is desirable, which is why we do not simply use lcad as our distance metric. If T = 1, then ||d, d'|| = 1 if d and d' are from the same cluster, and ||d, d'|| = ∞ otherwise. Hence for T = 1, augmented domain keyphraseness is equivalent to domain keyphraseness, but restricted to documents in the same cluster. If T > 1, then the documents from the same cluster as d are given weight T, and all other clusters are weighted less based on how far they are from the cluster of d. Sufficiently far away clusters are weighted at 0. Note that, as with all features, any constant scaling does not matter, since bagged decision trees are used for ranking.

To generate a hierarchical document clustering, we use the CLUTO software package [32]. The -showtree flag is used to generate a hierarchical clustering; otherwise the default parameters are used. Doc2mat is used to preprocess the documents into the vector space format used by CLUTO. The default doc2mat options are used. The clustering generated is a set of N (numCluster) clusters, with a hierarchical tree built on top of the clusters.

We consider the effects of replacing the domain keyphraseness feature with the augmented domain keyphraseness feature on the CiteULike180 dataset. Table 4.1 shows performance for a range of values of T and N.

Table 4.1: Augmented Domain Keyphraseness Parameter Tuning

    N \ T     1            2            4            8            16
    10        30.1 ± 0.2   30.8 ± 0.2   31.4 ± 0.2   31.5 ± 0.2   31.4 ± 0.2
    20        29.4 ± 0.2   29.8 ± 0.2   31.5 ± 0.2   31.6 ± 0.2   31.4 ± 0.2
    40        28.1 ± 0.2   28.8 ± 0.2   30.5 ± 0.2   31.6 ± 0.2   31.4 ± 0.2

We observe that for T = 1, this augmentation worsens performance. As mentioned earlier, for T = 1, augmented domain keyphraseness is equivalent to domain keyphraseness but restricted to the clusters. Since the clusters are small, this reduces performance. For T = 1, performance decreases as N increases, since the clusters become smaller and smaller, decreasing the usefulness of keyphraseness. For greater values of T, we see that this augmentation improves performance. The gains in performance are not particularly sensitive to the values of N and T. We select N = 20 and T = 8, since they lie within the region of maximum performance. This gives an improvement of about 0.5 to the F-score.
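A minimal sketch of this computation, assuming the cluster tree is available as a parent map and that we already know which training documents list each candidate as a gold keyphrase (all function and container names below are ours, not CLUTO's or Maui's):

    def lcad(cluster_a, cluster_b, parent):
        """Distance to the lowest common ancestor of two leaf clusters.
        `parent` maps each node of the cluster tree to its parent (root -> None)."""
        def path_to_root(node):
            path = [node]
            while parent[node] is not None:
                node = parent[node]
                path.append(node)
            return path
        path_a, path_b = path_to_root(cluster_a), path_to_root(cluster_b)
        ancestors_b = set(path_b)
        lca = next(n for n in path_a if n in ancestors_b)
        return max(path_a.index(lca), path_b.index(lca))

    def augm_domain_keyphraseness(cand, doc, cluster_of, parent, keyphrase_docs, T=8):
        """Sum of 1/||d, d'|| = max(T - lcad(d, d'), 0) over the training documents
        (other than `doc`) that have `cand` as a gold-standard keyphrase."""
        total = 0
        for d2 in keyphrase_docs.get(cand, ()):
            if d2 != doc:
                total += max(T - lcad(cluster_of[doc], cluster_of[d2], parent), 0)
        return total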
We performed the same exploration on both the Collabgraph and Semeval datasets. We did not see any meaningful improvement on these datasets from the augmentation. However, the use of augmented domain keyphraseness also did not have an adverse effect on performance. This augmentation was introduced to avoid errors where the extraction algorithm selects candidates that appeared frequently as keyphrases, but for documents from other domains. Without augmented domain keyphraseness, this form of error occurs frequently in the CiteULike180 dataset. Even without augmented domain keyphraseness, this form of error occurs very rarely in the Collabgraph and Semeval datasets.

This difference is due to the dramatically shorter keyphrases in the CiteULike180 dataset. Table 4.2 shows the average number of words in keyphrases in each dataset.

Table 4.2: Keyphrase Statistics for Datasets

    Dataset          Average Keyphrase Length   Zero Keyphraseness
    Collabgraph      1.80                       75%
    CiteULike180     1.16                       30%
    Semeval          1.96                       68%

Since the keyphrases in the CiteULike180 dataset are so short, they often appear in the texts of documents from other domains. A phrase such as "algorithm" will frequently appear in non-computer-science papers, and hence may be incorrectly chosen as a keyphrase for a biology paper. On the other hand, longer keyphrases, such as "ray tracing algorithm", are unlikely to appear in papers outside of the domain of computer science. As such, the keyphraseness feature is unlikely to cause "ray tracing algorithm" to be chosen as a keyphrase for a biology paper, since "ray tracing algorithm" is very unlikely to appear in the text of a biology paper. Hence, the longer keyphrases of Collabgraph and Semeval lead to fewer keyphraseness-induced errors. Additionally, the Collabgraph and Semeval keyphrases tend to be fairly specific, even when they are a single word, further reducing the likelihood of keyphraseness errors. This can be seen by examining the sample keyphrases in Appendix A.

Table 4.2 also shows the percentage of keyphrases that have zero keyphraseness. A keyphrase is said to have zero keyphraseness if it appears as a keyphrase only once in the training data. Few keyphrases in the CiteULike180 dataset have zero keyphraseness, so keyphraseness is a highly important feature on this dataset. Keyphraseness plays a lesser role on the other datasets, where most keyphrases have zero keyphraseness. If the Collabgraph or Semeval datasets were larger, keyphraseness-induced errors would become more important. A larger corpus increases the number of keyphrases in the training data. In turn, this increases the chance of a keyphrase from one document appearing in a document from another domain. Although the Collabgraph and Semeval datasets have mostly long, specific keyphrases, they also contain shorter, less specific keyphrases, as does CiteULike180. Augmented keyphraseness can help eliminate the interference of these short, non-specific keyphrases.

4.2 Average Word Length

We discovered that a new, simple feature, average word length, substantially improves performance on all three datasets. Average word length is the average number of characters per word in a candidate. This excludes spaces, although the feature is equally effective with spaces included in its computation.
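A minimal sketch of the feature computation (the function name is ours):

    def average_word_length(candidate):
        """Average number of characters per word in a candidate phrase, spaces excluded."""
        words = candidate.split()
        return sum(len(w) for w in words) / len(words)

    # average_word_length("ray tracing algorithm") -> 19 / 3, roughly 6.3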
Table 4.3 shows the effect of adding this feature to each of the datasets, when Wikipedia features are disabled.

Table 4.3: Average Word Length

    Dataset          w/o Average Word Length   with Average Word Length
    Collabgraph      16.5 ± 0.1                17.3 ± 0.1
    CiteULike180     31.1 ± 0.2                32.4 ± 0.2
    Semeval          18.2 ± 0.1                18.8 ± 0.2

If we instead add a feature for the total number of characters, there is no performance improvement. This is interesting, because the average number of characters per word is just the total number of characters divided by the number of words. The number of words in a candidate is already a feature, so one might expect the total number of characters feature to yield an equal improvement. However, decision trees cannot effectively represent division-based effects, so the combination of two existing features can improve performance beyond that of the individual features [7].

4.3 Feature Scaling

The ranking step in keyphrase extraction selects the best candidate keyphrases from the list of candidates for each document. The ranker is trained on the candidates for all documents pooled together. As such, a candidate with a high term frequency compared to the other candidates for its document may have a low or moderate term frequency relative to the pooled candidates. Scaling features on a per-document basis can address this issue. In this section, we consider the effectiveness of various rescaling schemes, and their interaction with reranking.

Note that, due to the use of decision trees in the ranking step, the application of the same monotonic function to the feature values of all candidates would have no effect on the ranking step. Hence, if feature scaling were performed across all documents, instead of on a per-document basis, it would be equivalent to performing no scaling at all.

4.3.1 Scaling Features

The intuition behind feature scaling is that feature values are most meaningful when compared to the feature values of the other candidates from the same document. This intuition is more applicable to frequency-based features, such as tf or tf-idf, than to features such as the number of words in the candidate. Due to the simple filtering process, the distribution of the number-of-words feature is determined entirely by the distribution of stopwords in the document. As such, we would not expect the scaling of the length feature to improve performance. In fact, by adding additional noise, this scaling could potentially have an adverse effect on performance.

We consider two standard methods for feature scaling: scaling to unit range and scaling to unit variance. To scale features to unit range, each feature is linearly scaled on a per-document basis, such that the feature values for each document range from 0 to 1. Formally, we scale feature i of candidate c of document d as follows:

    c_i ← (c_i − min_{c' ∈ d} c'_i) / (max_{c' ∈ d} c'_i − min_{c' ∈ d} c'_i)

Scaling to unit variance is analogous: each feature is linearly scaled so that the feature values for each document have unit variance. Equivalently, each feature is mapped to its per-document z-score.
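A minimal sketch of per-document scaling for a chosen subset of feature columns (the function and parameter names are ours, not Maui's):

    import numpy as np

    def scale_per_document(X, doc_ids, columns, method="range"):
        """Scale the given feature columns separately within each document.
        X: (n_candidates, n_features) array; doc_ids: document id per candidate row."""
        X = np.array(X, dtype=float)          # work on a copy
        doc_ids = np.asarray(doc_ids)
        for d in np.unique(doc_ids):
            rows = doc_ids == d
            for j in columns:                 # e.g. the tf, idf and tf-idf columns
                col = X[rows, j]
                if method == "range":         # map the document's values onto [0, 1]
                    span = col.max() - col.min()
                    X[rows, j] = (col - col.min()) / span if span > 0 else 0.0
                else:                         # "variance": per-document z-score
                    std = col.std()
                    X[rows, j] = (col - col.mean()) / std if std > 0 else 0.0
        return X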
Table 4.4 shows the effects of scaling the features on Collabgraph. We consider the scaling of two different sets of features: the set of all features, and a more restricted set that only includes the three frequency-based features, tf, idf, and tf-idf. Due to the sensitivity of both scaling techniques to outliers, we additionally consider the effects of truncating the top and bottom t% of the candidates for each document when computing the linear transforms.

Table 4.4: Scaling Methods

    Percent Truncation                    0%           1%           5%
    Unit Variance, All Features           15.5 ± 0.2   15.3 ± 0.2   14.5 ± 0.4
    Unit Range, All Features              16.7 ± 0.2   16.7 ± 0.1   15.8 ± 0.2
    Unit Variance, Frequency Features     16.9 ± 0.2   16.8 ± 0.1   16.7 ± 0.1
    Unit Range, Frequency Features        16.8 ± 0.1   16.8 ± 0.2   16.5 ± 0.1

Rescaling only the frequency-based features consistently outperforms scaling all of the features. As mentioned previously, some features, such as length, have no need for scaling. The scaling of such features introduces noise, which can have an adverse effect on performance. When the frequency feature set is used, the choice of scaling to unit variance vs unit range has minimal effect. Similarly, the level of truncation has no significant effect that is distinguishable from noise. As such, the use of truncation is undesirable, as it introduces additional parameters. Rescaling the frequency features to unit variance or unit range with no truncation improves F-score by about 0.3 on Collabgraph. This is a small but non-trivial improvement.

4.3.2 Scaling and Reranking

When reranking is being used, feature scaling can be applied before the ranking step, before the reranking step, or both. Before the reranking step, the features of the reranking candidates can be rescaled, just as the features of the ranking candidates can be scaled. Table 4.5 shows performance for all four possible configurations. In all cases, the scaling method is scaling to unit variance with no truncation. Unfortunately, no use of scaling improves performance beyond vanilla reranking. As such, we will not use rescaling in our final system. It is possible that a more sophisticated scaling scheme may offer performance improvements even when combined with reranking.

Table 4.5: Scaling and Reranking on Collabgraph

    Rescaling          F-score
    None               18.6 ± 0.1
    Before Ranking     18.3 ± 0.2
    Before Reranking   18.4 ± 0.2
    Before Both        18.4 ± 0.2

4.4 Feature Selection

Poor choices of features can have dramatic negative effects on performance. In experimenting with various potential new features, we frequently encountered this effect. This was often seen with predictive but complicated features, which would appear high in the decision trees but prevent other features from being used effectively. To better understand the set of features, we consider the effects of removing individual features. Table 4.6 presents the effects of feature removal when no Wikipedia features are active. Table 4.7 shows the effects of removing features when the Wikipedia features are active. All results are with reranking enabled. Due to the high cost of the computation of Wikipedia features, we employ the optimization discussed in Chapter 3, and only compute Wikipedia features before reranking.

From Tables 4.6 and 4.7, we see that no feature has a statistically significant adverse effect on performance; that is, removing any single feature never yields a significant improvement. Interestingly, we see that a number of features have minimal positive impact. For example, the removal of tf-idf has no significant effect on any of the datasets. This is surprising given that tf-idf often occurs high in the bagged decision trees. Evidently, in the absence of tf-idf, other features are able to provide the same information. As mentioned previously in Section 4.1, the domain keyphraseness feature is most important on the CiteULike180 dataset. On the Collabgraph dataset, where 75% of keyphrases only occur once in the training set, keyphraseness has no significant effect. The lastOccur and spread features are seen to have minimal effect, while the firstOccur feature provides substantial value. We believe this is because firstOccur can be used to determine if the keyphrase appears in the abstract.
spread and lastOccur are tied to term frequency and to each other, so they don't provide as much value individually.

Table 4.6: Non-Wikipedia Performance with Features Removed

    Feature Removed       Collabgraph   CiteULike180   Semeval
    none                  19.4 ± 0.1    35.4 ± 0.2     20.7 ± 0.2
    tf                    19.5 ± 0.2    35.0 ± 0.2     20.7 ± 0.2
    tfidf                 19.4 ± 0.1    35.0 ± 0.2     20.8 ± 0.2
    idf                   19.2 ± 0.1    34.4 ± 0.2     20.8 ± 0.2
    firstOccur            18.9 ± 0.2    32.9 ± 0.2     20.4 ± 0.2
    lastOccur             19.6 ± 0.2    35.1 ± 0.2     20.7 ± 0.2
    spread                19.5 ± 0.2    35.1 ± 0.2     20.8 ± 0.2
    length                19.1 ± 0.2    34.9 ± 0.2     20.5 ± 0.2
    domainKeyph           18.7 ± 0.1    28.0 ± 0.2     19.2 ± 0.2
    averageWordLength     18.7 ± 0.2    35.2 ± 0.2     20.5 ± 0.2

Table 4.7: Wikipedia Performance with Features Removed

    Feature Removed       Collabgraph   CiteULike180   Semeval
    none                  21.3 ± 0.2    35.9 ± 0.2     21.5 ± 0.2
    totalWikipKeyphr      21.2 ± 0.2    35.8 ± 0.2     21.4 ± 0.2
    generality            21.4 ± 0.2    35.7 ± 0.2     21.6 ± 0.2
    invWikipFreq          21.4 ± 0.2    35.8 ± 0.2     21.5 ± 0.2
    wikipKeyphr           21.3 ± 0.1    35.4 ± 0.2     21.6 ± 0.2

Given that the set of features which can be removed without adversely impacting performance varies from corpus to corpus, and no significant improvements are seen from feature removal, we recommend the use of the full set of features on all datasets.

Chapter 5

Conclusion

We have presented a number of advancements to the state of the art in keyphrase extraction. In this final chapter, we report the combined effects of these improvements and compare the accuracy of our system to that of human indexers.

Oahu, our new system, combines all of the advancements discussed in the previous chapters. During the training of the ranker, positive examples are reweighted. Reranking is used, as are the new numSuper and numSub features. Although the computation of Wikipedia features can be delayed to reduce runtime, we do not do so in this chapter, since we want to report the maximum accuracy achievable by our system. The new average word length feature is used on all three datasets, and the new augmented domain keyphraseness feature is employed with the CiteULike180 dataset.

5.1 Combined Performance

Table 5.1 compares the performance of Maui and Oahu on all three datasets. We see that Oahu substantially improves upon the performance of Maui on all datasets. Even without the use of Wikipedia, Oahu is able to outperform both Wikipedia and non-Wikipedia Maui. Interestingly, there is a difference between the performance of Oahu with and without Wikipedia on the CiteULike180 and Semeval datasets, even though no substantial difference exists with Maui. Evidently, the more powerful Oahu ranking and reranking system is able to better exploit the Wikipedia features.

Table 5.1: Maui vs Oahu

    System              Collabgraph   CiteULike180   Semeval
    Maui (w/o Wiki)     16.5 ± 0.1    31.1 ± 0.2     18.2 ± 0.1
    Maui (with Wiki)    17.5 ± 0.2    31.2 ± 0.2     18.3 ± 0.2
    Oahu (w/o Wiki)     19.2 ± 0.1    35.6 ± 0.2     20.7 ± 0.2
    Oahu (with Wiki)    21.0 ± 0.3    36.4 ± 0.4     21.8 ± 0.4

Overall, Oahu is able to improve on the performance of Maui with no substantial increase in runtime. Through the delaying of the computation of Wikipedia features, these gains can be achieved while reducing total runtime by nearly an order of magnitude.

5.2 Comparison to Human Indexers

In Chapter 2, we described a new metric for the comparison of automatic keyphrase extraction systems to human indexers. This metric has two forms. The first evaluates the relative performance of a human reader and an extraction algorithm by comparing their consistency with the author-assigned keyphrases. The second compares the internal consistency of multiple human indexers to the consistency of an extraction algorithm with the same set of human indexers.
The first form can be used with the Semeval dataset, where both reader- and author-assigned keyphrases are available. The second form can be used with the CiteULike180 dataset, where keyphrases from multiple human indexers are available.

Table 5.2 shows the consistency of human indexers, Maui, and Oahu on the CiteULike180 and Semeval datasets, with Wikipedia features enabled. For these consistency measures, 150 and 200 training documents were used for CiteULike180 and Semeval, respectively.

Table 5.2: Automatic vs Human Consistency

    Dataset          Human    Maui            Oahu
    CiteULike180     0.497    0.453 ± 0.012   0.495 ± 0.012
    Semeval          0.179    0.169 ± 0.009   0.191 ± 0.009

Oahu is competitive with the human indexers on both datasets. On CiteULike180, Oahu achieves the same level of consistency as the human indexers. On Semeval, Oahu has a higher consistency than the human indexers. This means that the keyphrase lists extracted by Oahu are more similar to the author-assigned keyphrase lists than the reader-assigned lists are. The readers of the Semeval dataset only had 15 minutes per paper, so they may have been able to achieve higher consistency if they were given more time per paper. Nonetheless, Oahu's ability to perform at levels competitive with human indexers indicates that its performance is not far from the theoretical maximum.

Appendix A

Sample Keyphrases

Table A.1: Sample Keyphrases (from the Collabgraph, CiteULike180, and Semeval datasets)

affect, affective computing, user interface; baffled microbial fuel cell, stacking, electricity generation, organic wastewater; hallway, left, railing, left, hallway, left, computers, right, conference room; associative memory, mutant mice, amygdala, c fos; coding, correlation; argumentation, negotiation; bridge, right, wireless sensor network, localization; inference, review, networks; xml, rank, information retrieval text mining; content addressable storage, relational database system, database cache, wide area network, bandwidth optimization; inferred regions, evaluation; maria, volcanism, lunar interior, thermochemical properties, convection

Bibliography

[1] Ken Barker and Nadia Cornacchia. Using noun phrase heads to extract document keyphrases. In Advances in Artificial Intelligence, pages 40-52. Springer, 2000.

[2] Gábor Berend and Richard Farkas. Sztergak: Feature engineering for keyphrase extraction. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 186-189. Association for Computational Linguistics, 2010.

[3] Michael Collins and Terry Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25-70, 2005.

[4] Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the Seventh International Conference on Information and Knowledge Management, CIKM '98, pages 148-155, New York, NY, USA, 1998. ACM.

[5] Samhaa R. El-Beltagy and Ahmed Rafea. KP-Miner: A keyphrase extraction system for English and Arabic documents. Information Systems, 34(1):132-144, 2009.

[6] Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G. Nevill-Manning. Domain-specific keyphrase extraction. 1999.

[7] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.

[8] Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 216-223.
Association for Computational Linguistics, 2003.

[9] Mario Jarmasz and Caroline Barriere. Using semantic similarity over tera-byte corpus, compute the performance of keyphrase extraction. Proceedings of CLINE, 2004.

[10] Thorsten Joachims. Making large scale SVM learning practical. 1999.

[11] Steve Jones and Malika Mahoui. Hierarchical document clustering using automatically extracted keyphrases. 2000.

[12] Arash Joorabchi and Abdulhussain E. Mahdi. Automatic keyphrase annotation of scientific documents using Wikipedia and genetic algorithms. Journal of Information Science, 39(3):410-426, 2013.

[13] Su Nam Kim, Timothy Baldwin, and Min-Yen Kan. Evaluating n-gram based evaluation metrics for automatic keyphrase extraction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 572-580. Association for Computational Linguistics, 2010.

[14] Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 21-26. Association for Computational Linguistics, 2010.

[15] Su Nam Kim, Olena Medelyan, Min-Yen Kan, and Timothy Baldwin. Automatic keyphrase extraction from scientific articles. Language Resources and Evaluation, 47(3):723-742, 2013.

[16] Patrice Lopez. Grobid: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries, pages 473-474. Springer, 2009.

[17] Patrice Lopez and Laurent Romary. Humb: Automatic key term extraction from scientific articles in Grobid. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 248-251. Association for Computational Linguistics, 2010.

[18] Minh-Thang Luong, Thuy Dung Nguyen, and Min-Yen Kan. Logical structure recovery in scholarly articles with rich document features. International Journal of Digital Library Systems (IJDLS), 1(4):1-23, 2010.

[19] Olena Medelyan. Human-competitive automatic topic indexing. PhD thesis, The University of Waikato, 2009.

[20] Olena Medelyan, Eibe Frank, and Ian H. Witten. Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, pages 1318-1327. Association for Computational Linguistics, 2009.

[21] Olena Medelyan and Ian H. Witten. Measuring inter-indexer consistency using a thesaurus. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 274-275. ACM, 2006.

[22] Olena Medelyan and Ian H. Witten. Thesaurus based automatic keyphrase indexing. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 296-297. ACM, 2006.

[23] Olena Medelyan, Ian H. Witten, and David Milne. Topic indexing with Wikipedia.

[24] Thuy Dung Nguyen and Min-Yen Kan. Keyphrase extraction in scientific publications. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, pages 317-326. Springer, 2007.

[25] Thuy Dung Nguyen and Minh-Thang Luong. Wingnus: Keyphrase extraction utilizing document logical structure. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 166-169. Association for Computational Linguistics, 2010.

[26] Gordon W. Paynter, I. H. Witten, and S. J. Cunningham. Evaluating extracted phrases and extending thesauri. In Proceedings of the Third International Conference on Asian Digital Libraries, pages 131-138. Citeseer, 2000.
[27] L. Rolling. Indexing consistency, quality and efficiency. Information Processing & Management, 17(2):69-76, 1981.

[28] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1-47, 2002.

[29] Pucktada Treeratpituk, Pradeep Teregowda, Jian Huang, and C. Lee Giles. Seerlab: A system for extracting keyphrases from scholarly documents. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 182-185, Uppsala, Sweden, July 2010. Association for Computational Linguistics.

[30] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries, pages 254-255. ACM, 1999.

[31] Wei You, Dominique Fontaine, and Jean-Paul Barthes. An automatic keyphrase extraction system for scientific documents. Knowledge and Information Systems, 34(3):691-724, 2013.

[32] Ying Zhao, George Karypis, and Usama Fayyad. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141-168, 2005.