Keyword Extraction Algorithm Contents 1. 2. 3. Introduce ..................................................................................................................................................................... 2 1.1 Keyword ............................................................................................................................................................... 2 1.2 Keyword Extraction system................................................................................................................................. 2 Advantage and disadvantage ....................................................................................................................................... 3 2.1. Rapid Automatic Keyword Extraction .................................................................................................................. 3 2.2. Keyword extraction for text characterization ...................................................................................................... 4 2.3. Automatic keyword extraction from documents using conditional random fields ............................................. 5 2.4. Keyword extraction from a single document using word co-occurrence statistical information ....................... 7 Conclusion .................................................................................................................................................................... 9 1. 1.1 Introduce Keyword Keywords, which we define as a sequence of one or more words, provide a compact representation of a document’s content. Ideally, keywords represent in condensed form the essential content of a document. Keywords are widely used to define queries within information retrieval (IR) systems as they are easy to define, revise, remember, and share. In comparison to mathematical signatures, keywords are independent of any corpus and can be applied across multiple corpora and IR systems. Keywords have also been applied to improve the functionality of IR systems. Here are some characters that help us to define a keyword: - position method which defines relevant words according to their text position ( heading, title). - cue phrase indicator criteria (specific text items signal that the following/previous words are relevant). - frequency criteria (words which are infrequent in the whole collection but relatively - frequent in the given text, are relevant for this text). - connectedness criteria (like repetition, co-reference, synonymy, semantic association). 1.2 Keyword Extraction system To compare and searching document’s content, the best way is use keyword. Whit this way we can increase performance and still assurance the quality. To create capstone project management, we must build a keyword extraction system support searching content. This system runs when upload or update a project to extract keyword and insert to data. When searching, we don’t need run system to extract data again so this system don’t need to fast, but must has high accurate and high reliability when extract large document. To find out the suitable algorithm for this system, we study some keyword extraction algorithms which extract keyword in large individual document. Then compare advantages and disadvantages of each algorithm to pick one which most reliable. 2. Advantage and disadvantage 2.1. Rapid Automatic Keyword Extraction RAKE is based on our observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words and, the, and of, or other words with minimal lexical meaning. Stop words are typically dropped from indexes within IR systems and not included in various text analyses as they are considered to be uninformative or meaningless. This reasoning is based on the expectation that such words are too frequently and broadly used to aid users in their analyses or search tasks. Words that do carry meaning within a document are described as content bearing and are often referred to as content words. The input parameters for RAKE comprise a list of stop words (or stop-list), a set of phrase delimiters, and a set of word delimiters. RAKE uses stop words and phrase delimiters to partition the document text into candidate keywords, which are sequences of content words as they occur in the text. Co-occurrences of words within these candidate keywords are meaningful and allow us to identify word co-occurrence without the application of an arbitrarily sized sliding window. Word associations are thus measured in a manner that automatically adapts to the style and content of the text, enabling adaptive and fine-grained measurement of word co-occurrences that will be used to score candidate keywords. A. Advantages: This algorithm gets keywords only by compare frequency of keywords, it use the simple and basic formula like tf/idf so the complex is low, this make system run faster than others when work with large document, extract more keyword and less bug. For example, compare this algorithm with TextRank, when extracted keywords from the 500 abstracts used the same hard ware, RAKE extracted in 160 milliseconds and TextRank extracted keywords in 1002 milliseconds, over 6 times the time of RAKE. B. Disadvantages: Because the keyword not only determine by frequency but also by it’s mean. In addition this algorithm makes compound word get higher weight so they are nor accurate enough. 2.2. Keyword extraction for text characterization We use quadgram-based to extraction keyword. In this way, the vector space model is used for representing textual documents and queries and N-grams is used to calculate word weight. N-grams are tolerant of textual errors, but also well-suited for inflectional rich languages like German. Computation of N-grams is fast, robust, and completely independent of language or domain. In the following, we only consider quadgrams because in various experiments on our text collections, they have overtopped trigrams. A. Advantages: * Use vector space model, the weights of text are usually computed by measures like tf/idf * N-grams are well-suited for inflectional rich languages like German. Computation of N-grams is fast, robust, and completely independent of language or domain. * Use basic algorithm as tf/idf so algorithm’s complex is O(n*log(n)). As exemplary time effort (on Pentium II, 400 MHz): for the analysis of 45000 intranet documents, the step of keyword extraction needs 115 seconds using tf/idf-keywords and 848 seconds using quadgram-based keywords. B. Disadvantages: * Not accurate enough. Result is similar tf/idf so there are many words that are not keyword had listed. * N-grams are tolerant of textual errors. 2.3. Automatic keyword extraction from documents using conditional random fields Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices. The underlying idea is that of defining a conditional probability distribution over label sequences given a particular observation sequence, rather than a joint distribution over both label and observation sequences. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world tasks in many fields, including bioinformatics, computational linguistics and speech recognition. A. Advantages: * A key advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the input. * Automated feature induction enables not only improved accuracy and dramatic reduction in parameter count, but also the use of larger cliques, and more freedom to liberally hypothesize atomic input variables that may be relevant to a task. * Many tasks are best performed by models that have the flexibility to use arbitrary, overlapping, multi-granularity and non-independent features. * Conditional Random Fields are undirected graphical models, trained to maximize the conditional probability of the outputs given the inputs. * CRFs have achieved empirical success recently in POS tagging (Lafferty et al., 2001), noun phrase segmentation (Sha & Pereira, 2003) and table extraction from government reports (Pinto et al., 2003). B. Disadvantages: * Even with this many parameters, the feature set is still restricted. For example, in some cases capturing a word tri-gram is important, but there is not sufficient memory or computation to include all word tri-grams. As the number of overlapping atomic features increases, the difficulty and importance of constructing only select feature combinations grows. * Even after a new conjunction is added to the model, it can still have its weight changed; this is quite significant because one often sees Boosting inefficiently “relearning” an identical conjunction solely for the purpose of “changing its weight”; and furthermore, when many induced features have been added to a CRF model, all their weights can efficiently be adjusted in concert by a quasi-Newton method such as BFGS. * Boosting has been applied to CRF-like models (Altun et al., 2003), however, without learning new conjunctions and with the inefficiency of not changing the weights of features once they are added. Other work (Dietterich, 2003) estimates parameters of a CRF by building trees (with many conjunctions), but again without adjusting weights once a tree is incorporated. Furthermore it can be expensive to add many trees, and some tasks may be diverse and complex enough to inherently require several thousand features. 2.4. Keyword extraction from a single document using word co- occurrence statistical information We present a new keyword extraction algorithm that applies to a single document without using a corpus. Frequent terms are extracted first, and then a set of co-occurrence between each term and the frequent terms, i.e., occurrences in the same sentences, is generated. Co-occurrence distribution shows importance of a term in the document as follows. If probability distribution of co-occurrence between term a and the frequent terms is biased to a particular subset of frequent terms, then term a is likely to be a keyword. The degree of biases of distribution is measured by the χ2measure. Our algorithm shows comparable performance to tfidf without using a corpus. A. Advantages: * We show that our keyword extraction performs well without the need for a corpus. * Coverage using our method exceeds that of tf and KeyGraph and is comparable to that of tfidf; both tf and tfidf selected terms which appeared frequently in the document (although tfidf considers frequencies in other documents). On the other hand, our method can extract keywords even if they do not appear frequently. The frequency index in the table shows average frequency of the top 15 terms. Terms extracted by tf appear about 28.6 times, on average, while terms by our method appear only 11.5 times. Therefore, our method can detect “hidden” keywords. We can use χ2 value as a priority criterion for keywords because precision of the top 10 terms by our method is 0.52, that of the top 5 is 0.60, while that of the top 2 is as high as 0.72. Though our method detects keywords consisting of two or more words well, it is still nearly comparable to tfidf if we discard such phrases. * The system is implemented in C++ on a Linux OS, Celeron 333MHz CPU machine. Computational time increases approximately linearly with respect to the number of terms; the process completes itself in a few seconds if the given number of terms is less than 20,000. * Main advantages of our method are its simplicity without requiring use of a corpus and its high performance comparable to tfidf. As more electronic documents become available, we believe our method will be useful in many applications, especially for domain-independent keyword extraction. B. Disadvantages: The number of keyword extracted by this algorithm is less than tfidf method. 3. Conclusion After view advantages and disadvantages of methods above we chose the method “Keyword extraction from a single document using word co-occurrence statistical information” to deployment. Because our system need extract keyword as well as possible and don’t need too fast so this algorithm is suitable. In addition this algorithm is not to complex, it reasonably with us. Now we will overview this algorithm. Step 1. Preprocessing: Stem words by Porter algorithm (Porter 1980) and extract phrases based on the APRIORI algorithm (Furnkranz 1998). Discard stop words included in stop list used in SMART system (Salton 1988). Step 2. Selection of frequent terms: Select the top frequent terms up to 30% of the number of running terms, Ntotal. Ex: Step 3. Clustering frequent terms: Cluster a pair of terms whose Jensen-Shannon divergence is above the threshold (0.95x log 2). Cluster a pair of terms whose mutual information is above the threshold (log(2.0)). The obtained clusters are denoted as C. Step 4. Calculation of expected probability: Count the number of terms co-occurring with c ∈ C, denoted as nc, to yield the expected probability pc = nc/Ntotal. Step 5. Calculation of χ’2value: For each term w, count co-occurrence frequency with c ∈ C, denoted as freq (w, c). Count the total number of terms in the sentences including w, denoted as nw. Calculate χ’2value following (2). o Pg as (the sum of the total number of terms in sentences where c appears) divided by (the total number of terms in the document). o nw as the total number of terms in sentences where c appears. Step 6. Output keywords: Show a given number of terms having the largest χ’2value.