E 322 Engineering Designs 6
Group Leader Name: Yuanpei Zhang
Group Member Name: Xin Li
Homework 6

I pledge my honor that I have abided by the Stevens Honor System

Section 1 Summary of Assignments

For this project, our main goal is to implement the TF-IDF algorithm in different ways. Yuanpei Zhang handles the software implementation in the C/C++ programming language. He does the research, works out the block diagram, and identifies the means he needs for the software part. Xin Li takes charge of the extension development because his web programming experience makes the web development easier for him. A block diagram is drawn separately for each part, and the two are then combined into one block diagram. Whoever takes charge of a part is responsible for all the report writing on that part.

Percentage of effort towards this assignment
Yuanpei Zhang 50%
Xin Li 50%

Section 2

(Figure 1 black box. Labels: Automatic Keywords Extraction; Work together: TF IDF Algorithm; Individual Task: Software Implementation with C language (Yuanpei Zhang); Individual Task: Web Browser and software extension development (Xin Li))

This is the overall outline of our project. The goals of this project are to implement the TF-IDF algorithm and to develop an application for automatic keyword extraction. As figure 1 shows, we divide the tasks into two parts, so that any related research and work is done by the person whose name is assigned to that subtopic. For example, group member Xin Li takes charge of any work related to the extension development. However, the algorithm is the core of our project. It demands care, and it is hard enough that our team needs to work together to figure it out. The following diagram shows how the application works.

(Figure 2 Transparent Box inside the TF-IDF algorithm. Box labels: Input Text; Send to recycle station; Put human unreadable text into a certain position; Detect the reasonable input; Human readable words stored in database; Set up database; Calculate the frequency of each word in the input; Database existing no meaning word; Delete it or put it in recycle station; Compare to database for the high frequency but useless words such as ‘the’; New vocabulary; Frequently good keywords; Implement the TF IDF Algorithm; Output Keyword)

Figure 2 shows in more detail how the software or extension works internally. The first step is to develop an interface for users to input the text. The software detects the input text and determines whether the user can understand the input, according to the language used in the detection device. There is also a translation option for users to translate the input text into an appropriate language. The next step is to implement the TF-IDF algorithm. In the database, a certain memory location stores the high-frequency but useless words such as “the”, “a”, “an” and so on. If a term calculated to have a high frequency is in the useless section, it is deleted or sent to the recycle station so that it does not bother users. If the calculated words are not in the useless section, they are good keywords. Finally, the software outputs the keywords and locates them in the input text.

(Figure 3 Function-means tree. Node labels: Mathematical details; TF-IDF Algorithm; Cosine Similarity; Multidimensional application of law of cosines; Interface design; Extract Keywords Automatically; Software Development skill; Functionality design; C/Java/C++/Python programming language; Debugging; HTML; CSS; Web Development; Web/Software Extension skill; JavaScript; PHP; JAVA; Extension Development; PHP)

The primary goal of this project is to develop applications that extract keywords automatically. There are several approaches to this topic.
“The traditional algorithm implementing method is called chi-square test. This method is actually an inefficient and time-consuming one. However, the algorithm of this method based on distributed computing model is developed to solve the time problem” (Nugumanova). Another approach, also statistical, is the TF-IDF algorithm, which our team uses to develop the software and the extension for other software.

The TF-IDF algorithm is not hard to understand. It is a set of mathematical calculations that weights the frequency of each term in the input text by how rare that term is across a collection of documents. “The TF part intends to give a higher score to a document that has more occurrences of a term, while the IDF part is to penalize words that are popular in the whole collection. The further factors such as position of the word in a document or the length of a document are not comparable, as the database entries are much shorter” (Ing). Cosine similarity is a measure used together with the TF-IDF algorithm when multiple inputs are compared: each input is represented as a vector of TF-IDF weights, and the cosine of the angle between two vectors measures how similar the inputs are. It can be derived from the law of cosines applied in multiple dimensions. The hardest part is figuring out the algorithm. The rest is relatively easy, but it still requires a lot of time and skill.

The software is meant for users at the most basic level, which means that users do not have to know the internal technology. All users need to know is how to enter text into a box and start the program, so the interface has to be simple to use. The functionality design uses a programming language to implement the TF-IDF algorithm. It looks simple, but it requires programming skills in a language such as Java, C++ or Python. Programming is a long process, because small errors are hard to find in large amounts of code, so debugging and testing are very important. We cannot be sure that the software works properly when it is first written, so our team still needs to do a lot of testing.
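
The TF-IDF calculation, the removal of useless words, and the cosine similarity described above can be sketched in a few lines. This is only a minimal sketch in Python, one of the languages listed in figure 3, and not our final implementation. It assumes the common definitions tf = (count of the term) / (number of terms in the document) and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The function names, the stop-word list and the sample sentences are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the "useless word" section of the database.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_frequency(tokens):
    """TF: count of each term divided by the number of terms in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


def inverse_document_frequency(all_tokens):
    """IDF: log(N / df), where df is the number of documents containing the term."""
    n_docs = len(all_tokens)
    df = Counter(term for tokens in all_tokens for term in set(tokens))
    return {term: math.log(n_docs / d) for term, d in df.items()}


def tfidf(tokens, idf_table):
    """TF-IDF weight of each term in one document (unknown terms get weight 0)."""
    return {t: f * idf_table.get(t, 0.0) for t, f in term_frequency(tokens).items()}


def top_keywords(weights, k=3):
    """The k highest-weighted terms; ties are broken alphabetically."""
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty or all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical sample inputs.
    docs = [
        "The cat sat on the mat",
        "The dog chased the cat",
        "A bird sang in the tree",
    ]
    all_tokens = [tokenize(d) for d in docs]
    idf_table = inverse_document_frequency(all_tokens)
    vectors = [tfidf(tokens, idf_table) for tokens in all_tokens]
    print(top_keywords(vectors[0], k=2))
    print(cosine_similarity(vectors[0], vectors[1]))
```

In this sketch the word "cat" appears in two of the three sample sentences, so its IDF is lower than that of words appearing in only one sentence, and it ranks below them as a keyword. A term that appears in every document gets an IDF of log(1) = 0, which is how the IDF part penalizes popular words even without a stop-word list.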

The last part of this project is to give web browsers such as Google Chrome, or any document software, the ability to run the TF-IDF algorithm as an extension. In addition, this kind of software will be able to capture the input text when the user selects it with the mouse. It is also feasible to develop a website for the same purpose. Such a website works like the software but in web format, and it is not the same as the web extension. For example, an online PDF converter converts any input document into PDF format. Web development requires different programming skills from software programming. There are several ways of programming a website, but we choose the four basic elements: HTML, CSS, JavaScript, and PHP. It can also be done in C#, but that is beyond our knowledge. The extension development is easier than the software. There is an existing website designed especially for Chrome extensions. Once the extension is installed in a web browser, it is available everywhere in that browser.

At this point, it is clear that this is entirely a software project, with nothing related to hardware implementation. This is a good thing, because we do not need to consider errors from the hardware. Automatic keyword extraction probably works well with the TF-IDF algorithm, but as mentioned before, this is not the only approach. It is not even a good approach, because statistics-based methods eventually become inefficient when used in the real world. Consequently, after the application is developed, it needs to be revised and updated for better performance.

Reference:
1. Ing. Wolfgang Nejdl. Automatic Keyword Extraction for Database Search. Hannover, 27 February 2009. http://www.l3s.de/~demidova/students/thesis_oelze.pdf
2. Nugumanova, A.; Novosselov, A.; Baiburin, Y.; Karimov, A., "Automatic keywords extraction from the domain texts: Implementation of the algorithm based on the MapReduce model," Current Trends in Information Technology (CTIT), 2013 International Conference on, pp. 186-189, 11-12 Dec. 2013