Team Members: Xin Li, Yuanpei Zhang
EE322 Assignment 4
2/18/2014

Automatic Keywords Extraction Chrome Extension

Section 1

Our group consists of two members, Xin Li and Yuanpei Zhang, and the project we intend to do is an Automatic Keywords Extraction extension on the Google Chrome platform. For this assignment, each of us was given specific tasks to complete. Xin Li is responsible for finding information and doing research on developing extension programs for Google Chrome. Yuanpei Zhang is in charge of doing research on automatic keyword extraction algorithms such as TF-IDF.

Percentage of effort towards this assignment:
Yuanpei Zhang: 50%
Xin Li: 50%

Section 2

Our project is a plug-in program that can be easily installed on Google Chrome. It helps users quickly extract the keywords of selected text fragments. These extracted keywords capture the major content of the text, helping users grasp its essential topic at a glance and decide what to read. The project consists of two major parts: the algorithm and the Chrome extension program development.

TF-IDF Algorithm

The algorithm is the essence of this project. It is required to efficiently find keywords that accurately represent the selected text fragments. One commonly used algorithm in automatic keyword extraction is Term Frequency – Inverse Document Frequency (TF-IDF), a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. This algorithm finds the keywords of a piece of text by calculating not only the number of occurrences of each word, but also the frequency of each word across the corpus, which controls for the fact that some words are generally more common than others.

The implementation of this algorithm is fairly simple. It is composed of two terms. One first computes the normalized term frequency, which is the number of times a word appears in a document divided by the total number of words in that document. The second term is the inverse document frequency, computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term t appears. Each word in a piece of text is assigned a tf-idf value, which is the product of its term frequency and its inverse document frequency.

Various ways exist for determining the exact values of both statistics. In the case of the term frequency tf(t, d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d. If we denote the raw frequency of t by f(t, d), then the simple tf scheme is tf(t, d) = f(t, d). To prevent a bias towards longer documents, the raw frequency can be divided by the maximum frequency of any term in the document:

    tf(t, d) = 0.5 + 0.5 × f(t, d) / max{ f(w, d) : w ∈ d }

The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents N by the number of documents containing the term, and then taking the logarithm of that quotient:

    idf(t, D) = log( N / |{ d ∈ D : t ∈ d }| )

The tf-idf value of a word is then calculated as:

    tfidf(t, d, D) = tf(t, d) × idf(t, D)

A high tf-idf weight is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
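To make the formulas above concrete, the following is a minimal sketch in Python of how the scoring could be implemented. The function names (term_frequency, inverse_document_frequency, tf_idf) and the toy corpus are our own illustrative choices for this sketch, not part of any library or of the extension itself:

import math
from collections import Counter

def term_frequency(term, document):
    # Augmented scheme from above: tf = 0.5 + 0.5 * f(t,d) / max f(w,d)
    counts = Counter(document)  # document is a list of words
    if not counts:
        return 0.0
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def inverse_document_frequency(term, corpus):
    # idf(t, D) = log( N / |{d in D : t in d}| )
    matches = sum(1 for document in corpus if term in document)
    if matches == 0:
        return 0.0  # term never appears in the corpus
    return math.log(len(corpus) / matches)

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Score every distinct word of one document and print the top keywords.
corpus = [
    "the wright brothers invented the airplane".split(),
    "the airplane was invented by the wright brothers".split(),
    "the history of powered flight is long".split(),
]
document = corpus[0]
scores = {w: tf_idf(w, document, corpus) for w in set(document)}
print(sorted(scores, key=scores.get, reverse=True)[:3])

Note how a word such as "the", which appears in every document of this toy corpus, gets idf = log(3/3) = 0 and therefore a tf-idf of 0 even though it has the highest term frequency; this is exactly the filtering of common terms described above.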
Application Extension

The Automatic Keywords Extraction can be developed as a standalone program on a computer. If users want to know the keywords of a specific article, they can simply copy and paste it into the program and get the keywords. The program would also include a function that calculates the similarity of two articles and displays the similar parts. In addition, it can be developed as an extension of other software, such as Google Chrome. When users feel that an article or novel is too long to read, they can use the extension to pick out the keywords, which helps reduce the reading task.

Moreover, this algorithm can be put to even better use in a search engine, especially in web search. For example, Google search handles about 5.3 million searches per day, and there are billions of sources on the Internet; making search efficient and effective is therefore a very attractive problem. The algorithm's ability to compare the similarity of two articles is well suited to this problem. The search process would be as follows: compare the similarity of each pair of items and record it in a database; after all the items have been compared, there is a table of cosine similarity results. The item whose result is closest to 1 appears on the first page of results, and the rest appear according to how close their results are to 1.

Identifying the keywords is also useful for web image search. When users are doing a web search, "a web page is segmented into several text blocks based on semantic cohesion. The text blocks that contain web images are taken as the associated texts of corresponding images and TF-IDF model is initially used to index those web images. Then, for each keyword, both relevant web image set and irrelevant web image set are selected according to their TF-IDF values" (Gong). The modern world is full of information, and people are always trying to improve the speed of finding the information they want.

Other Techniques May Be Used

Cosine similarity is another technique that may be used in the automatic keyword extraction algorithm. It provides the secondary function of the algorithm, which is to compare the similarity between two articles. Both the keyword extraction and the determination of how similar two articles are depend on the frequency of keywords. For example, suppose two articles both talk about the history of airplanes. They would certainly mention the inventor of the airplane, perhaps in different phrasings; no matter how different the phrasings are, both would mention the name Wright Brothers. One article might say "The Wright Brothers invented the airplane eventually." and the other "The airplane was invented by the Wright Brothers." Considering the frequency of the words, we first split these two sentences into words such as "the", "airplane", "invented", and "Wright Brothers". If a word appears once in a sentence, we assign it the number 1; if it appears twice, we assign it the number 2, and so on. What the algorithm then does is, in effect, create two vectors starting from the origin in a coordinate space with one dimension per word. It is hard to visualize vectors in four or more dimensions, but they can be analyzed in exactly the same way.

According to the law of cosines for a triangle with sides a, b, and c,

    cos θ = (a² + b² − c²) / (2ab)

we can get the angle between two vectors in two dimensions, and the cosine formula extends to any number of dimensions. In the previous example we would have two vectors in, say, five dimensions, denoted A = (A1, A2, A3, A4, A5) and B = (B1, B2, B3, B4, B5); individual entries of A and B may be 0, when a word appears in one sentence but not the other. The formula then becomes:

    cos θ = Σ (Ai × Bi) / ( √(Σ Ai²) × √(Σ Bi²) ),  with each sum running over i = 1, …, 5

If cos θ is close to 1, the two vectors point in nearly the same direction, which means the two sentences are nearly the same: they provide the same information with a slightly different order of words. This is the so-called cosine similarity.
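As an illustration, here is a minimal sketch in Python of the word counting and cosine similarity computation described above. The helper name cosine_similarity and the use of Counter objects as word-count vectors are our own choices for this sketch; the two sentences are the Wright Brothers example:

import math
from collections import Counter

def cosine_similarity(a, b):
    # cos θ = Σ AiBi / ( √(Σ Ai²) × √(Σ Bi²) ), one dimension per distinct word
    vocabulary = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocabulary)  # Counter returns 0 for absent words
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # guard against an all-zero word-count vector
    return dot / (norm_a * norm_b)

s1 = Counter("the wright brothers invented the airplane eventually".split())
s2 = Counter("the airplane was invented by the wright brothers".split())
print(cosine_similarity(s1, s2))  # close to 1: nearly the same sentence

The same routine applies unchanged to whole articles, since a word-count table built over an article is simply a vector with more dimensions.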
Reference

1. http://en.wikipedia.org/wiki/Tf%E2%80%93idf
2. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04063591
3. http://aimotion.blogspot.com/2011/12/machine-learning-with-pythonmeeting-tf.html
4. Robertson, Stephen. "Understanding Inverse Document Frequency: On Theoretical Arguments for IDF." Journal of Documentation 60.5 (2004): 503-20. ProQuest. Web. 18 Feb. 2014.
5. Gong, Zhiguo, and Qian Liu. "Improving Keyword Based Web Image Search with Visual Feature Distribution and Term Expansion." Knowledge and Information Systems 21.1 (2009): 113-32. ProQuest. Web. 18 Feb. 2014.
6. http://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/