International Journal of Engineering Trends and Technology (IJETT) – Volume 35 Number 2 – May 2016
ISSN: 2231-5381, http://www.ijettjournal.org, Page 91

An approach towards Intelligent Text Extraction for Document Mining

Hanishree.N, Siddalingesh, Anish Agarwal

Abstract – With the increasing amount of machine-readable information available, Internet users face the challenge of selecting and sorting out the data relevant to the subject they are really interested in. This project demonstrates methods for creating text extracts from a verbose context, helping the user by providing selective and compact text on the subject of interest. Text mining is the discipline that deals with such intelligent text extraction [9]. This paper proposes and implements text mining strategies that help the user find the information of interest without having to go through the complete document.

I. INTRODUCTION

The Internet is the El Dorado of knowledge. In this Web 2.0 age, thousands of websites and millions of web pages are added to the Internet every day. A user who wanders through web pages searching for information gets perplexed, since so much research has been done that it is impossible to read all the material and get the information required. It has also made life more difficult, as making associations with other research is not easy. Hence text mining, which is categorized under Artificial Intelligence, can be used to assist such a user.

Artificial Intelligence is the study of mental faculties through the use of computational models. It is the branch of Computer Science that deals with writing computer programs that can solve problems creatively.

Text mining, also known as intelligent text analysis, text data mining or knowledge discovery in text (KDT), is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. It is different from what we are familiar with in web search. In web search, the user is typically looking for something that is already known and has been written by someone else. In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down.

Text mining processes unstructured inputs from different sources, mostly from the web, such as news articles, research papers, e-books, digital libraries, e-mails and web pages. Though corporations and governments store data mostly in structured formats (databases), some data, like employee feedback, is free-flowing text. If a company wants to know the overall picture or message from its employees, it would be difficult without implementing text mining strategies.

II. PROPOSED SYSTEM

The main aim of the project is to create a web application that accepts large texts, such as news articles and research papers, and provides a summary of the input text. The summary produced should give only the important parts of the whole text. Key Word Based and Word Weight Based algorithms will be implemented to perform text mining and provide a summary of the large text.

A. Text Mining using Key Word Based Algorithm

Fig. 1 – Block diagram of Key Word Based Text Mining. Blocks: Input Document (Research Paper, News Article); Key Words; WordWeb; Summarization - Key Word Based Algorithm (Get Key Word Count in Sentence, Get Synonyms Count in Sentence, Select sentences for Summarization); Summary of input Document; Decision Making.

Input Document – The document on which the text mining algorithm will be executed and for which a summary will be produced. Any document, such as a research paper, a news article or text written in a blog, can be given as input.
Key Words – A few words that form the heading of the input document can be considered as key words. Alternatively, any words on which the user wants the text mining to be based can be given. If there is more than one key word, they should be given as a comma-separated list.

Thesaurus – Thesaurus is a web service providing search capability for synonyms in different languages. The synonyms are retrieved from the thesauri dictionaries of OpenOffice according to the relevant licenses.

Summarization – The Key Word Based algorithm is used to summarize the input document based on the key words. It has the following sub-modules.

Get Key Word Count – For each sentence in the input document, get the key word count and store it.

Get Synonym Count – For each sentence in the input document, get the synonym count and store it.

Provide Summary – For each sentence in the input document, compare the number of key words and synonyms and select the sentences to produce the final summary.

Summary of document and Decision Making – This is the final output of the Key Word Based text mining algorithm. Based on the summary, the user can make valid decisions.

The Key Word Based algorithm performs the following important steps.

1. The number of key words and related words present in each sentence in the context is calculated and stored.
2. The user who needs to extract sentences from the context is given the choice of specifying the amount of text to extract.
3. The assumption is that the user needs to view 50% of the input text.
4. The sentences are extracted according to the following criteria. The sentence with the highest number of key words is extracted. If any two sentences have an equal number of key words, the one with the larger number of synonyms for the key words is extracted. If any two sentences have an equal number of key words and an equal number of synonyms, the sentence that appears first in the context is selected and extracted.

B. Text Mining using Word Weight Based Algorithm

Fig. 2 – Block diagram of Word Weight Based Text Mining. Blocks: Input Document (Research Paper, News Article); Relevant Sentences; Summarization - Word Weight Based Algorithm (Calculate Word Weight, Calculate Sentence Weight for input doc, Calculate Sentence Weight for Context, Compare and select sentences for Summarization); Summary of input Document; Decision Making.

Input Document – The document on which the text mining algorithm will be executed and for which a summary will be produced. Any document, such as a research paper, a news article or text written in a blog, can be given as input.

Relevant Sentences – One or more sentences from the document that the user considers important and selects at the initial stage.

Summarization – The Word Weight Based algorithm is used to summarize the input document based on the context. It has the following sub-modules.

Calculate Word Weight – For each word in the input document, get the word count and store it.

Calculate sentence weight of input document – For each sentence in the input document, get the sentence weight based on the word count.

In the second approach, called the Word Weight Based method, we assign different weights to the words and then, based on these weights, calculate the sentence weight as per the algorithm. We then extract the sentences whose weight is above a particular value called the average threshold value.

The Word Weight Based algorithm performs the following important steps.

1. The given context is taken as input for extraction.
2. The sentences that are selected at the initial stage of extraction are termed relevant sentences or relevant passages; all others are termed irrelevant passages.
3. The weight of each word in the context is calculated, irrespective of whether it occurs in a relevant or an irrelevant passage, by the following formula.

[formula not recoverable from the source text]

where,
w - Weight of the word.
p - Number of relevant passages in which the word appears.
n - Number of irrelevant passages in which the word appears.
fp - Absolute frequency of a word in relevant sentences or passages.
fn - Absolute frequency of a word in irrelevant sentences or passages.

4. Weights of stop words (like prepositions, conjunctions, and also the words that appear at a high frequency in the English language) are not calculated.
5. The relevant sentences are then extracted into an intermediate file.
6. The weight of the sentences present in the intermediate file is calculated by the formula.

[formula not recoverable from the source text]

7. The weights of sentences are compared to a threshold value to decide whether the sentence should be selected or not. The average threshold value is calculated as below.

[formula not recoverable from the source text]

8. The sentences that have weights greater than the average threshold value are then extracted to the final output file.

For the text mining application that we are proposing, below is the flow diagram.

Flow diagram: Start; Open Text Mining Web Page; decision: is the Key Word Based Algorithm selected? Key Word Based branch: Provide Key Words and Large Text, then Perform Key Word Based Text Mining (with WordWeb). Word Weight Based branch: Provide Context and Large Text, then Perform Word Weight Based Text Mining. Both branches: Display Summary (Text Mining Results); End.

III. MODULES

In this project implementation we have identified three main modules.

User Interface Module
Application Logic Server Module
External Tools Module

Fig. 3 – Module diagram of Text Mining Web Application. Blocks: User Interface Module (Text_Mining.HTML); Application Logic Server (Text_Mining.Service); External Tools (WordWeb.Plugin).

User Interface Module – With this module the user can provide inputs for text mining and view the results. This module will be developed using HTML. It will call the text mining programs implemented on the application logic server and receive the results from it to display to the user.

Application Logic Server Module – In this module, the text mining algorithms will be implemented in Java. It will respond to the calls made by the User Interface module for text mining and send the results back to it. It contains the core logic of text mining.

External Tools Module – Thesaurus is an English online dictionary. The Key Word Based algorithm makes use of synonyms to produce the summary. Hence this module mainly provides the synonyms for the key words provided by the user. It accepts requests from the application logic server module and sends the synonyms back to it.

IV. CONCLUSION

Text mining technologies, which go beyond simple searching methods, are the key to information discovery and have a promising outlook for application in all areas of work. The methods proposed and implemented in this project can be used innovatively by researchers in different fields, such as biomedicine, the biosciences, and research and development. Using these methods saves their time.

Another area that benefits from the text mining methods implemented in this project is education, where students, educators and researchers can find more information relating to their topics at faster speeds than with traditional ad hoc searching. The methods can also be used in the extraction of data present in web pages, to reduce the length/size to be displayed on certain mobile devices.

Finally, in this fast world where everyone runs out of time, the text mining methods implemented in this project come in handy when they are utilized and applied in the right way.

ACKNOWLEDGMENT

We would like to thank Raghuveer for his great idea, which gave us the motivation to implement this project. We would also like to thank our guide, Mrs. Sumathi, who supported us in every way.

REFERENCES

[1] Choi (2000) Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics, Seattle, pp. 26-33.
[2] Dunning T. (1993) Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19, pp. 61-74.
[3] Eduard Hovy (2002) Automated text summarization. DUC 2002.
[4] Hearst M.A. (1997) TextTiling: Segmenting text into subtopic passages. Computational Linguistics 23, pp. 33-44.
[5] Joel Larocca Neto (2003) Document clustering and text summarization. DUC 2003.
[6] Meru Brunn (2001) Text summarization using lexical chains.
[7] Moen M. & De Busser R. (2001) Generic topic segmentation of document text. In Proceedings of the 24th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, pp. 418-419.
[8] Morris, Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, pp. 21-48.
[9] R. Barzilay and M. Elhadad (1997) Using lexical chains for text summarization. Intelligent Text Summarization, ACL, 1997.
[10] Vishal Gupta (Panjab University, Chandigarh, India) and Gurpreet S. Lehal (Professor & Head, Department of Computer Science, Punjabi University, Patiala, India), A Survey of Text Mining Techniques and Applications.
[11] Weiguo Fan, Linda Wallace, Stephanie Rich, and Zhongju Zhang (2005) "Tapping into the Power of Text Mining", Journal of ACM, Blacksburg.
[12] Webpages:
http://people.ischool.berkeley.edu/~hearst/text-mining.html
http://en.wikipedia.org/wiki/Text_mining
http://summly.com/
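As an illustration of the selection rule of the Key Word Based algorithm (Section II.A), the following is a minimal sketch, written in Python for brevity rather than the Java of the actual implementation. It ranks sentences by key word count, breaks ties by synonym count and then by position, and keeps 50% of the sentences by default. Several points are assumptions and not part of the paper: the function names, the naive sentence splitter, the restriction to single-word key words, the `synonyms` dictionary that stands in for the thesaurus lookup, and the choice to return the selected sentences in their original order.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def count_terms(sentence, terms):
    """Count how many word tokens of the sentence belong to the term set."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return sum(1 for t in tokens if t in terms)


def keyword_summary(text, keywords, synonyms=None, fraction=0.5):
    """Select sentences by key word count, then synonym count, then position.

    keywords: comma-separated list of single words, as the paper describes.
    synonyms: hypothetical dict {key word: [synonyms]} standing in for the
              thesaurus lookup, which is not reproduced here.
    fraction: share of sentences to keep (the paper assumes 50%).
    """
    keys = {k.strip().lower() for k in keywords.split(",") if k.strip()}
    syns = set()
    for key in keys:
        syns.update(s.lower() for s in (synonyms or {}).get(key, []))
    syns -= keys  # a key word is not counted again as a synonym

    sentences = split_sentences(text)
    scored = [
        (count_terms(s, keys), count_terms(s, syns), i)
        for i, s in enumerate(sentences)
    ]
    # More key words first, then more synonyms, then earlier position.
    ranked = sorted(scored, key=lambda r: (-r[0], -r[1], r[2]))
    keep = math.ceil(len(sentences) * fraction)
    chosen = sorted(r[2] for r in ranked[:keep])
    # Assumption: the summary keeps the original sentence order.
    return [sentences[i] for i in chosen]


text = (
    "Text mining finds new information. "
    "The weather was pleasant today. "
    "Mining of documents yields a summary. "
    "Lunch was served at noon."
)
summary = keyword_summary(
    text,
    keywords="mining, summary",
    synonyms={"summary": ["abstract", "synopsis"]},
)
for sentence in summary:
    print(sentence)
```

With the four-sentence sample, two sentences are kept: the ones that mention the key words. Sorting on the tuple (key word count, synonym count, position) encodes all three criteria of step 4 in one comparison, so no separate tie-breaking pass is needed.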
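The Word Weight Based algorithm (Section II.B) can be sketched in a similar way, again in Python and under stated assumptions. The quantities p, n, fp and fn follow the definitions given in step 3, and stop words are skipped as in step 4. The formulas for the word weight, the sentence weight and the average threshold value are not available in this text, so the sketch does not claim to reproduce them: `illustrative_weight` is a placeholder supplied only so that the code runs, the sentence weight is assumed to be the sum of its word weights, and the threshold is assumed to be the mean sentence weight. All function names and the tiny stop word list are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); a real list would be larger.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "was"}


def tokenize(sentence):
    """Lower-case word tokens with stop words removed (step 4)."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def word_statistics(relevant, irrelevant):
    """Return {word: (p, n, fp, fn)} using the definitions of step 3."""
    p, n, fp, fn = Counter(), Counter(), Counter(), Counter()
    for passage in relevant:
        tokens = tokenize(passage)
        fp.update(tokens)        # absolute frequency in relevant passages
        p.update(set(tokens))    # number of relevant passages containing it
    for passage in irrelevant:
        tokens = tokenize(passage)
        fn.update(tokens)        # absolute frequency in irrelevant passages
        n.update(set(tokens))    # number of irrelevant passages containing it
    return {w: (p[w], n[w], fp[w], fn[w]) for w in set(fp) | set(fn)}


def illustrative_weight(p, n, fp, fn):
    """Placeholder weight, NOT the paper's formula: favours words that are
    frequent in relevant passages and rare in irrelevant ones."""
    return (p + fp) / (1 + n + fn)


def word_weight_summary(sentences, relevant_idx, weight_fn=illustrative_weight):
    """Keep sentences whose weight exceeds the average sentence weight."""
    if not sentences:
        return []
    relevant = [s for i, s in enumerate(sentences) if i in relevant_idx]
    irrelevant = [s for i, s in enumerate(sentences) if i not in relevant_idx]
    stats = word_statistics(relevant, irrelevant)
    weights = {w: weight_fn(*v) for w, v in stats.items()}
    # Assumption: a sentence's weight is the sum of its word weights.
    scores = [sum(weights[w] for w in tokenize(s)) for s in sentences]
    # Assumption: the "average threshold value" is the mean sentence weight.
    threshold = sum(scores) / len(scores)
    return [s for s, score in zip(sentences, scores) if score > threshold]


sentences = [
    "Text mining extracts information.",
    "The cat sat on the mat.",
    "Mining text produces a summary of information.",
    "Dinner was late.",
]
# The user marks the first sentence as relevant at the initial stage.
selected = word_weight_summary(sentences, relevant_idx={0})
for sentence in selected:
    print(sentence)
```

Passing the weight as a function (`weight_fn`) keeps the placeholder isolated: a reader with access to the original formula can supply it without touching the rest of the sketch.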