Zhang_Automatic_Keywords_Extraction_hw6

E 322 Engineering Designs 6
Group Leader Name: Yuanpei Zhang
Group Member Name: Xin Li
Homework 6
I pledge my honor that I have abided by the Stevens Honor System
Section 1
Summary of Assignments
For this project, our main goal is to implement the TF-IDF algorithm in different
ways. Yuanpei Zhang handles the software implementation in the C/C++ programming
language. He does the research, works out the block diagram, and identifies the means he
needs for the software part. Xin Li takes charge of the extension development
because he has web programming experience, which makes it easier for him to work out the
web development. The block diagram is drawn separately for each part, and the parts are
then combined into one block diagram. Whoever takes charge of a part is
responsible for all of the report writing for that part.
Percentage of effort towards this assignment
Yuanpei Zhang: 50%
Xin Li: 50%
Section 2
(Figure 1 black box. Automatic Keywords Extraction. Individual tasks: software
implementation with C language, Yuanpei Zhang; web browser and software extension
development, Xin Li. Work together: TF-IDF algorithm.)
This is the overall outline of our project. The goals of this project are
to implement the TF-IDF algorithm and to develop an application for automatic keyword
extraction. As Figure 1 shows, we divide the tasks into two parts, and any
research and work related to a part is done by the person assigned to it. For example,
group member Xin Li takes charge of any work related to the extension development.
The algorithm, however, is the core of our project: we have to be very careful
with it, and it is hard enough that our team needs to work together to figure it out.
The following diagram shows how the application works.
(Figure 2 Transparent Box inside the TF-IDF algorithm. Box labels: Input Text; Detect the
reasonable input; Put human unreadable text into a certain position; Send to recycle
station; Set up database; Human readable words stored in database; Calculate the
frequency of each word in the input; Compare to database for the high frequency but
useless words such as ‘the’; Database existing no meaning word; Delete it or put it in
recycle station; New vocabulary; Frequently good keywords; Implement the TF-IDF
Algorithm; Output Keyword.)
This figure shows in more detail how the software or extension works internally. The
first step is to develop an interface for users to input the text. The software detects the
input text and determines whether the user can understand it, according to the
language set in the detection device. There is also a translation option for users to
translate the input text into an appropriate language. The next step is to implement the
TF-IDF algorithm. In the database, a certain memory location stores the high-frequency
but useless words such as “the”, “a”, “an” and so on. If a calculated high-frequency
term is in the useless section, it is deleted or sent to the recycle station
so that it does not bother users. If the calculated words are not in the useless section,
they are good keywords. The software finally outputs the keywords and
locates them in the input text.
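The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our actual implementation: the stop-word set, the function names `tokenize` and `extract_keywords`, and the sample sentence are all made up for the example, and raw frequency stands in for the full TF-IDF score.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative "useless word" list. The report only names
# "the", "a", and "an"; a real database would hold many more entries.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "on"}


def tokenize(text):
    """Return (lowercased word, start offset) pairs for each word in the text."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"[A-Za-z]+", text)]


def extract_keywords(text, top_n=3):
    """Return (word, frequency, positions) for the most frequent useful words."""
    # Drop the high-frequency but useless words before counting.
    kept = [(word, pos) for word, pos in tokenize(text) if word not in STOP_WORDS]
    counts = Counter(word for word, _ in kept)
    result = []
    for word, freq in counts.most_common(top_n):
        # Record where each keyword occurs so it can be located in the input.
        positions = [pos for w, pos in kept if w == word]
        result.append((word, freq, positions))
    return result


sample = "The cat sat on the mat. The cat is a happy cat."
for word, freq, positions in extract_keywords(sample, top_n=1):
    print(word, freq, positions)  # cat 3 [4, 28, 43]
```

The positions are character offsets into the input, which is one simple way to let the interface highlight each keyword where it appears.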
(Figure 3 Function-means tree. Node labels: Extract Keywords Automatically; TF-IDF
Algorithm; Mathematical details; Cosine Similarity; Multidimensional application of law of
cosines; Software Development skill; Interface design; Functionality design;
C/Java/C++/Python programming language; Debugging; Web/Software Extension skill; Web
Development; Extension Development; HTML; CSS; JavaScript; PHP; JAVA.)
The primary goal of this project is to develop applications that extract
keywords automatically. There are several approaches to this
topic. “The traditional algorithm implementing method is called chi-square test. This
method is actually an inefficient and time-consuming one. However, the algorithm of this
method based on distributed computing model is developed to solve the time problem
(Nugumanova)”. Another statistical approach is the TF-IDF algorithm,
which our team uses to develop the software and the extension for other software. The
TF-IDF algorithm is not hard to understand. It is a set of
mathematical calculations that score each term by how often it appears in the input text,
weighted against how common the term is across a collection of documents.
“The TF part intends to give a higher score to a document that has more occurrences of a
term, while the IDF part is to penalize words that are popular in the whole collection. The
further factors such as position of the word in a document or the length of a document are
not comparable, as the database entries are much shorter (Ing).” Cosine similarity is a
measure used alongside the TF-IDF algorithm when multiple inputs are compared: each
input is represented as a vector of term weights, and the similarity is the cosine of the
angle between two such vectors. The formula follows from the law of cosines applied in
many dimensions.
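As a rough illustration of these calculations, the sketch below computes TF-IDF weights for a small collection and then compares two inputs with cosine similarity. It assumes the basic unsmoothed definitions, where TF is the term count divided by document length and IDF is log(N / df). The function names and the three toy documents are hypothetical, and other weighting variants exist.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: occurrences of each term divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(documents):
    """IDF: log(N / df), where df is the number of documents containing the term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    idf = inverse_document_frequency(documents)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in documents
    ]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical toy collection, already split into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
best_keyword = max(weights[0], key=weights[0].get)
print(best_keyword)                              # mat
print(weights[0]["the"])                         # 0.0
print(cosine_similarity(weights[0], weights[1]))
```

In this toy collection “the” appears in every document, so its IDF is log(3/3) = 0 and it drops out without a stop-word list, while “mat” appears in only one document and receives the highest weight in the first input.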
The hardest part is to figure out the algorithm. The rest is relatively easy, but it still
requires a lot of time and skill. The software is meant for users at the most basic level,
which means that users do not have to know the internal technology. All users need to
know is how to input text into a box and start the program,
so the interface has to be simple to use. The functionality design uses a
programming language to implement the TF-IDF algorithm. It looks simple, but it
requires programming skills in a language such as Java, C++ or Python. Programming is a
long process because it is hard to find small errors in many chunks of
code, so debugging and testing are very important. We cannot be sure that the software
works properly when it is first finished, so our team still needs to do a lot of testing.
The last part of this project is to give web browsers such as Google Chrome,
or any document software, a function that performs the TF-IDF algorithm as an extension.
In addition, this kind of software will be able to capture the input text when the user
selects it with the mouse. It is also feasible to develop a website for it. Such a website
works like the software but in web format, and it is not the same as the web extension.
For example, a PDF converter website converts any input document into PDF format. Web
development requires different programming skills from software programming. There are
several ways of programming a website, but we choose the four basic elements:
HTML, CSS, JavaScript, and PHP. It can also be done in C#, but that is beyond our
knowledge. The extension development is easier than the software development. There is
an existing website designed especially for Chrome extension development. Once the
extension is applied to one web browser, it is available everywhere that browser is used.
At this point, it is clear that this is entirely a software project;
nothing relates to hardware implementation. This is an advantage because we do not
need to consider errors from hardware. Automatic keyword extraction
probably works well with the TF-IDF algorithm, but as mentioned before, this is not the
only approach. It is not even a good approach, because statistics-based methods
eventually prove inefficient when used in the real world. Consequently, after the
application is developed, it needs to be revised and updated to achieve better performance.
References:
1. Ing. Wolfgang Nejdl. Automatic Keyword Extraction for Database Search.
Hannover, 27 February 2009.
http://www.l3s.de/~demidova/students/thesis_oelze.pdf
2. Nugumanova, A.; Novosselov, A.; Baiburin, Y.; Karimov, A., "Automatic
keywords extraction from the domain texts: Implementation of the algorithm
based on the MapReduce model," Current Trends in Information Technology
(CTIT), 2013 International Conference on, pp. 186-189, 11-12 Dec. 2013.