Zhang_Automatic_Keyword_Extraction_hw4

Team Members:
Xin Li
Yuanpei Zhang
EE322 Assignment 4
2/18/2014
Automatic Keywords Extraction Chrome Extension
Section 1
Our group consists of two members, Xin Li and Yuanpei Zhang, and the project
we intend to do is an Automatic Keywords Extraction extension for Google
Chrome.
For this assignment, each of us was given specific tasks to complete.
Xin Li is responsible for researching how to develop extension programs on
the Google Chrome platform. Yuanpei Zhang is in charge of researching
automatic keyword extraction algorithms such as TF-IDF.
Team member      Percentage of effort towards this assignment
Yuanpei Zhang    50%
Xin Li           50%
Section 2
Our project is a plug-in program that can be easily installed on Google
Chrome. It helps users quickly extract the keywords from selected text
fragments. These extracted keywords capture the major content of the text,
helping users grasp its essential topic at a glance and decide what to read.
The project consists of two major parts: the algorithm and the development
of the Chrome extension program.
 TF-IDF Algorithm
The algorithm is the essence of this project. It must efficiently find the
keywords that accurately represent the selected text fragments. One commonly
used algorithm in automatic keyword extraction is term frequency-inverse
document frequency (TF-IDF), a numerical statistic intended to reflect how
important a word is to a document in a collection or corpus. It finds the
keywords in a piece of text by considering not only how often each word
occurs in that text, but also how many documents in the corpus contain the
word. This controls for the fact that some words are generally more common
than others.
The implementation of this algorithm is fairly simple. It is composed of two
terms. The first is the normalized term frequency, which is the number of
times a word appears in a document divided by the total number of words in
that document. The second is the inverse document frequency, which is
computed as the logarithm of the number of documents in the corpus divided
by the number of documents in which the term t appears.
Each word in a piece of text is assigned a tf-idf value, which is the
product of the term frequency and the inverse document frequency. Various
ways of determining the exact values of both statistics exist. In the case of
the term frequency tf(t,d), the simplest choice is to use the raw frequency
of a term in a document, i.e. the number of times that term t occurs in
document d. If we denote the raw frequency of t by f(t,d), then the simple tf
scheme is tf(t,d)=f(t,d). To prevent a bias towards longer documents, the raw
frequency can instead be scaled by the maximum raw frequency of any term in
the document:
$$
tf(t, d) = 0.5 + \frac{0.5 \times f(t, d)}{\max\{f(w, d) : w \in d\}}
$$
The inverse document frequency is a measure of whether the term is common
or rare across all documents. It is obtained by dividing the total number of
documents by the number of documents containing the term, and then taking
the logarithm of that quotient:
$$
idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}
$$
Then the tf-idf value of a word is calculated as:
$$
tfidf(t, d, D) = tf(t, d) \times idf(t, D)
$$
A high tf-idf weight is reached by a high term frequency (in the given
document) and a low document frequency of the term in the whole collection of
documents; the weights hence tend to filter out common terms. Since the ratio
inside the idf's log function is always greater than or equal to 1, the value
of idf (and tf-idf) is greater than or equal to 0. As a term appears in more
documents, the ratio inside the logarithm approaches 1, bringing the idf and
tf-idf closer to 0.
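The formulas above can be tried out with a minimal Python sketch. It is only an illustration under stated assumptions, not the extension's actual implementation: the function names, the whitespace tokenization, and the three-sentence toy corpus are all hypothetical, and a term that appears in no document is simply given an idf of 0.

```python
import math
from collections import Counter


def augmented_tf(term, doc_tokens):
    """tf(t, d) = 0.5 + 0.5 * f(t, d) / max{f(w, d) : w in d}."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return 0.5 + 0.5 * counts[term] / max(counts.values())


def idf(term, corpus):
    """idf(t, D) = log(N / |{d in D : t in d}|)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term seen in no document carries no weight
    return math.log(len(corpus) / containing)


def tfidf(term, doc_tokens, corpus):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D)."""
    return augmented_tf(term, doc_tokens) * idf(term, corpus)


def top_keywords(doc_tokens, corpus, k=3):
    """Return the k terms of the document with the highest tf-idf value."""
    scores = {t: tfidf(t, doc_tokens, corpus) for t in set(doc_tokens)}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each document is a list of lowercase words.
corpus = [
    "the wright brothers invented the airplane".split(),
    "the train is older than the car".split(),
    "the car and the airplane are fast".split(),
]

print(top_keywords(corpus[0], corpus, k=3))  # ['brothers', 'invented', 'wright']
print(tfidf("the", corpus[0], corpus))       # 0.0
```

Here "the" occurs in every document, so its idf is log(3/3) = 0 and it is filtered out even though it is the most frequent word in the first sentence.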
 Application Extension
Automatic keyword extraction can be developed as a standalone program on a
computer. If users want to know the keywords of a specific article, they can
simply copy and paste it into the program to get the keywords. The program
would also include a function that calculates the similarity of two articles
and displays the similar parts. In addition, it can be developed as an
extension of other software, such as Google Chrome. When users feel that an
article or novel is too long to read, they can use the extension to highlight
the keywords, which reduces the amount they need to read. Moreover, this
algorithm is well suited to search engines, especially web search. For
example, Google Search handles billions of queries per day. There are
billions of sources on the Internet, so returning search results efficiently
and effectively is an important problem. The ability to compare the
similarity of two articles helps here. The following picture shows the
process the search would carry out.
The similarity of each pair of items is computed and recorded in a database.
After all the items have been compared, there is a table of cosine similarity
results. The closer an item's result is to 1, the earlier it appears: the
closest results appear on the first web page, and the rest are ordered
according to how close they are to 1.
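The ranking step just described can be sketched in a few lines of Python. This is a hedged illustration only: the `rank` function, the sample query, and the three sample items are hypothetical, raw word counts stand in for whatever weights a real search engine would store, and an item that shares no words with the query is given a score of 0.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, items):
    """Return (score, item) pairs ordered from the score closest to 1 down to 0."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), text) for text in items]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical items standing in for indexed web pages.
items = [
    "cooking pasta at home",
    "airplane tickets",
    "the history of the airplane",
]

for score, text in rank("history of the airplane", items):
    print(f"{score:.3f}  {text}")
```

The item about the history of the airplane scores closest to 1 and is listed first, while the item with no words in common with the query scores 0 and is listed last.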
Identifying the keywords is also useful for web images. When users are
doing web search, “a web page is segmented into several text blocks based on
semantic cohesion. The text blocks that contain web images are taken as the
associated texts of corresponding images and TF-IDF model is initially used to
index those web images. Then, for each keyword, both relevant web image set
and irrelevant web image set are selected according to their TF-IDF values”
(Gong and Liu). The modern world holds a huge amount of information, and
people are always trying to find the information they want faster.
 Other Techniques That May Be Used
Cosine similarity is another technique that may be used in an automatic
keyword extraction algorithm. It serves the secondary function of the
algorithm, which is to compare the similarity between two articles. Both
keyword extraction and the similarity of two articles depend on the frequency
of words. For example, suppose two articles both discuss the history of the
airplane. They will almost certainly mention who invented the airplane,
perhaps in different phrases. No matter how different the phrases are, they
will mention the name Wright Brothers. One may say “Wright Brothers invented
the airplane eventually.” and the other “The airplane was invented by Wright
Brothers”. To consider the frequency of the words, we first divide these two
sentences into words such as the, airplane, invented, and Wright Brothers. If
a word appears once in a sentence, we assign it the number 1; if it appears
twice, we assign it the number 2, and so on. What the algorithm does is then
like creating two vectors starting from the origin of a coordinate system
with one dimension per word. It is quite hard to visualize vectors in four or
more dimensions, but it is possible to analyze them theoretically. According
to the law of cosines:
$$
\cos\theta = \frac{a^2 + b^2 - c^2}{2ab}
$$
This gives the angle θ between two vectors in two dimensions, where a and b
are the lengths of the vectors and c is the distance between their endpoints.
The same formula extends to more than two dimensions. In the previous
example, we might have two 5-dimensional vectors, denoted
$A = (A_1, A_2, A_3, A_4, A_5)$ and $B = (B_1, B_2, B_3, B_4, B_5)$.
Individual components of A and B may be 0, but neither may be the zero
vector, because the denominator would then be 0. The formula then becomes:
$$
\cos\theta = \frac{\sum_{i=1}^{5} (A_i \times B_i)}{\sqrt{\sum_{i=1}^{5} (A_i)^2} \times \sqrt{\sum_{i=1}^{5} (B_i)^2}}
$$
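The step from the law of cosines to this vector formula can be written out as a short derivation. It is a sketch for n-dimensional vectors, taking a and b to be the lengths of A and B and c to be the length of A − B (these identifications are the standard ones, stated here as assumptions since the text does not spell them out):

```latex
\begin{align*}
a^2 &= \sum_{i=1}^{n} A_i^2, \qquad b^2 = \sum_{i=1}^{n} B_i^2, \\
c^2 &= \sum_{i=1}^{n} (A_i - B_i)^2
     = \sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 - 2 \sum_{i=1}^{n} A_i B_i
     = a^2 + b^2 - 2 \sum_{i=1}^{n} A_i B_i, \\
\cos\theta &= \frac{a^2 + b^2 - c^2}{2ab}
            = \frac{2 \sum_{i=1}^{n} A_i B_i}{2ab}
            = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}.
\end{align*}
```

With n = 5 this is exactly the formula above.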
If cos θ is close to 1, the two vectors point in nearly the same direction,
which means the two sentences are nearly the same: they provide the same
information with a slightly different order of words. This is the so-called
cosine similarity.
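The Wright Brothers example can be worked through with a short Python sketch. It is an illustration under assumptions, not a definitive implementation: the helper names are hypothetical, words are split on whitespace after removing periods, and "Wright" and "Brothers" are counted as two separate words rather than as one term.

```python
import math


def tokenize(sentence):
    """Lowercase the sentence, drop periods, and split on whitespace."""
    return sentence.lower().replace(".", "").split()


def to_vector(words, vocabulary):
    """Count how many times each vocabulary word occurs in the sentence."""
    return [words.count(term) for term in vocabulary]


def cosine_similarity(a, b):
    """cos(theta) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


words_1 = tokenize("Wright Brothers invented the airplane eventually.")
words_2 = tokenize("The airplane was invented by Wright Brothers.")

# One dimension per distinct word that occurs in either sentence.
vocabulary = sorted(set(words_1) | set(words_2))
vector_a = to_vector(words_1, vocabulary)
vector_b = to_vector(words_2, vocabulary)

similarity = cosine_similarity(vector_a, vector_b)
print(round(similarity, 3))  # 0.772
```

The two sentences share five of their words, so the result is 5 / (√6 × √7) ≈ 0.772. The words "eventually", "was", and "by" each occur in only one sentence and keep the value below 1.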
Reference
1. http://en.wikipedia.org/wiki/Tf%E2%80%93idf
2. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04063591
3. http://aimotion.blogspot.com/2011/12/machine-learning-with-pythonmeeting-tf.html
4. Robertson, Stephen. "Understanding Inverse Document Frequency: On
Theoretical Arguments for IDF." Journal of Documentation 60.5 (2004): 503-20.
ProQuest. Web. 18 Feb. 2014.
5. Gong, Zhiguo, and Qian Liu. "Improving Keyword Based Web Image Search with
Visual Feature Distribution and Term Expansion." Knowledge and Information
Systems 21.1 (2009): 113-32. ProQuest. Web. 18 Feb. 2014.
6. http://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/