International Journal of Engineering Trends and Technology (IJETT) – Volume 35 Number 2- May 2016
An approach towards Intelligent Text Extraction
for Document Mining
Hanishree.N
Siddalingesh
Abstract— With the increasing amount of machine
readable information available, Internet users face the
challenge of selecting and sorting out the data relevant
to the subject they are really interested in. This project
demonstrates methods for creating text extracts from a
lengthy context, helping the user by providing selective
and compact text on the topic of interest. Text mining is
the discipline that deals with such intelligent text
extraction [9]. This paper proposes and implements text
mining strategies that help the user find the information
of interest without having to go through the complete
document.
I. INTRODUCTION
The Internet is the El Dorado of knowledge. In this
Web 2.0 age, thousands of websites and millions of web
pages are added to the Internet every day. A user who
wanders through web pages searching for information
gets perplexed, since so much research has been done
that it is impossible for him/her to read all the material
and find the information required. It also makes it
difficult to relate one piece of research to another.
Hence the concept of text mining, which is categorized
under Artificial Intelligence, can be used to assist such
users. Artificial Intelligence is the study of mental
faculties through the use of computational models. It is
the branch of Computer Science that deals with writing
computer programs that can solve problems creatively.
Text mining, also known as intelligent text analysis,
text data mining or knowledge discovery in text
(KDT), is the discovery by computer of new, previously
unknown information, by automatically extracting
information from different written resources. It is
different from what we are familiar with in web search.
In web search, the user is typically looking for
something that is already known and has been written
by someone else. In text mining, the goal is to discover
heretofore unknown information, something that no one
yet knows and so could not have yet written down.
Text mining processes unstructured inputs from
different sources, mostly from the web, such as news
articles, research papers, e-books, digital libraries,
e-mails and web pages. Though corporations and
governments store data mostly in structured formats
(databases), some data, such as employee feedback, is
free-flowing text. If a company wants to know the overall
picture or message from its employees, it would be
difficult without implementing text mining strategies.
ISSN: 2231-5381
Anish Agarwal
II. PROPOSED SYSTEM
The main aim of the project is to create a web
application that accepts large texts, such as news
articles and research papers, and provides a summary of
the input text. The summary should contain only the
important parts of the whole text. Key Word Based and
Word Weight Based algorithms will be implemented to
perform text mining and provide a summary of the large
text.
A. Text Mining using Key Word Based Algorithm
Fig. 1 – Block diagram of Key Word Based Text Mining
(blocks: Input Document (Research Paper, News Article);
Key Words; WordWeb; Summarization - Key Word Based
Algorithm, containing Get Key Word Count in Sentence,
Get Synonyms Count in Sentence and Select sentences
for Summarization; Summary of input Document;
Decision Making)
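The pipeline of Fig. 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in this project: the function names are hypothetical, the synonyms are passed in as a plain list standing in for a WordWeb/Thesaurus lookup, sentences are split naively on end punctuation, only single-word key words are matched, and returning the selected sentences in document order is our assumption. The ranking follows the selection criteria of this section: most key words first, then most synonyms, then earliest position, with 50% of the sentences kept.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def count_terms(sentence, terms):
    """Count whole-word, case-insensitive occurrences of any single-word term."""
    wanted = {t.strip().lower() for t in terms}
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return sum(1 for w in words if w in wanted)


def keyword_summary(text, keywords, synonyms, fraction=0.5):
    """Select a fraction of the sentences, ranked by key word and synonym counts."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    # One (key word count, synonym count, position) triple per sentence.
    scored = [
        (count_terms(s, keywords), count_terms(s, synonyms), i)
        for i, s in enumerate(sentences)
    ]
    # More key words first; ties go to more synonyms, then to the earlier sentence.
    ranked = sorted(scored, key=lambda t: (-t[0], -t[1], t[2]))
    keep = max(1, math.ceil(len(sentences) * fraction))
    chosen = sorted(i for _, _, i in ranked[:keep])  # document order (assumption)
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    sample = (
        "Text mining finds patterns. The weather is nice. "
        "Mining of text uses extraction. Dogs bark."
    )
    # Key words arrive as a comma-separated list, as described in this section.
    key_words = "mining, text".split(",")
    for line in keyword_summary(sample, key_words, ["extraction"]):
        print(line)
```

With `fraction=0.25` only one sentence is kept, and the synonym tie-break decides between the two sentences that each contain both key words.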
Input Document – The document on which the text
mining algorithm will be executed and from which the
summary will be produced. Any document, such as a
research paper, a news article or text written in a blog,
can be given as input.
Key Words – A few words that form the heading of the
input document can be considered as key words. Any
other words on which the user wants to base the text
mining can also be given. If there is more than one key
word, they should be given as a comma-separated list.
Thesaurus – Thesaurus is a web service providing
search capability for synonyms in different languages.
The synonyms are retrieved from the thesauri
dictionaries of OpenOffice according to the relevant
licenses.
Summarization – The Key Word Based algorithm is used
to summarize the input document based on the key
words. It mainly has the sub modules below.
Get Key Word Count – For each sentence in the input
document, get the key word count and store it.
Get Synonym Count – For each sentence in the input
document, get the synonym count and store it.
Provide summary – For each sentence in the input
document, compare the number of key words and
synonyms and select the sentences to produce the final
summary.
Summary of document and Decision Making – This
is the final output of the Key Word Based text mining
algorithm. Based on the summary, the user can make
valid decisions.
The Key Word Based algorithm performs the following
important steps.
1. The number of keywords and related words present in
each sentence in the context is calculated and stored.
2. The user who needs to extract sentences from the
context is given the choice of specifying the amount of
text to extract.
3. The assumption is that the user needs to view 50% of
the input text.
4. The sentences are extracted depending upon the
following criteria:
The sentence with the highest number of keywords is
extracted.
If any two sentences have an equal number of keywords,
the one with the greater number of synonyms for the
key words is extracted.
If any two sentences have an equal number of keywords
and an equal number of synonyms, the sentence that
appears first in the context is selected and extracted.
B. Text Mining using Word Weight Based Algorithm
Fig. 2 - Block diagram of Word Weight Based Text
Mining (blocks: Input Document (Research Paper, News
Article); Relevant Sentences; Summarization - Word
Weight Based Algorithm, containing Calculate Word
Weight, Calculate Sentence Weight for input doc,
Calculate Sentence Weight for Context and Compare and
select sentences for Summarization; Summary of input
Document; Decision Making)
Input Document – The document on which the text
mining algorithm will be executed and from which the
summary will be produced. Any document, such as a
research paper, a news article or text written in a blog,
can be given as input.
Relevant Sentences – One or more sentences from the
document that the user considers important and selects
at the initial stage.
Summarization – The Word Weight Based algorithm is
used to summarize the input document based on the
context. It mainly has the sub modules below.
Calculate Word Weight – For each word in the input
document, get the word count and store it.
Calculate sentence weight of input document – For each
sentence in the input document, get the sentence weight
based on the word count.
In the second approach, called the Word Weight Based
Method, we assign different weights to the words and
then, based on these weights, calculate the sentence
weight as per the algorithm. We then extract the
sentences whose weight is above a particular value
called the average threshold value.
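The thresholding idea can be illustrated with a small sketch. It is only an approximation under explicit assumptions, since it does not reproduce the weighting formulas of this paper: here a word's weight is simply its count in the document (as in the Calculate Word Weight sub module), a sentence's weight is the sum of the weights of its words, the average threshold is the mean sentence weight, and the stop-word list is a tiny illustrative one. All function names are hypothetical, and the distinction between relevant and irrelevant passages is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "in", "on", "to",
    "is", "are", "for", "with",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lower-case words of a sentence with stop words removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def word_weight_summary(text):
    """Keep the sentences whose weight exceeds the average sentence weight."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    # Word weight: occurrence count over the whole document (assumption).
    word_weight = Counter(w for s in sentences for w in tokenize(s))
    # Sentence weight: sum of the weights of its non-stop words (assumption).
    sentence_weight = [sum(word_weight[w] for w in tokenize(s)) for s in sentences]
    # Average threshold: mean sentence weight (assumption).
    threshold = sum(sentence_weight) / len(sentence_weight)
    return [s for s, wt in zip(sentences, sentence_weight) if wt > threshold]


if __name__ == "__main__":
    sample = "Mining text is mining data. Cats sleep. Text mining helps."
    for line in word_weight_summary(sample):
        print(line)
```

In the sample, the three sentences weigh 9, 2 and 6, so the threshold is 17/3 and the first and third sentences are kept.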
The Word Weight Based algorithm performs the
following important steps.
1. The given context is taken as input for extraction.
2. The sentences that are selected at the initial stage of
extraction are termed relevant sentences or relevant
passages; all others are termed irrelevant passages.
3. The weight of each word in the context is calculated,
irrespective of whether it occurs in a relevant or an
irrelevant passage, by the following formula.
[formula not available]
where,
w - Weight of the word.
p - Number of relevant passages in which the word
appears.
n - Number of irrelevant passages in which the word
appears.
fp - Absolute frequency of a word in relevant sentences
or passages.
fn - Absolute frequency of a word in irrelevant
sentences or passages.
4. Weights of stop words (like prepositions,
conjunctions, and other words that appear at a high
frequency in the English language) are not calculated.
5. The relevant sentences are then extracted into an
intermediate file.
6. The weight of the sentences present in the
intermediate file is calculated by the formula.
[formula not available]
7. The weights of sentences are compared to a threshold
value to decide whether the sentence should be selected
or not. The average threshold value is calculated as
below.
[formula not available]
8. Then the sentences that have weights greater than the
average threshold value are extracted to the final output
file.
For the text mining application that we are proposing,
below is the flow diagram.
[Flow diagram: Start; Open Text Mining Web Page; If Key
Word Based Algorithm is selected?; Key Word Based
branch: Provide Key Words and Large Text, Perform Key
Word Based Text Mining, WordWeb; Word Weight Based
branch: Provide Context and Large Text, Perform Word
Weight Based Text Mining; Display Summary (Text Mining
Results); End]
III. MODULES
In this project implementation we have identified three
main modules.
User Interface Module
Application Logic Server Module
External Tools Module
Fig. 3 - Module diagram of Text Mining Web
Application (modules: User Interface Module:
Text_Mining.HTML; Application Logic Server:
Text_Mining.Service; External Tools: WordWeb.Plugin)
User Interface Module – With this module the user can
provide inputs for text mining and view the results of
text mining. This module will be developed using HTML.
It will call the text mining programs implemented on the
application logic server and receive the results from it
to display to the user.
Application Logic Server Module – In this module, the
text mining algorithms will be implemented in the Java
language. It will respond to the calls made by the User
Interface module for text mining and send the results
back to it. It has the core logic of text mining.
External Tools Module – Thesaurus is an English
online dictionary. The Key Word Based algorithm makes
use of synonyms to produce the summary. Hence this
module mainly provides the synonyms for the key
words provided by the user. It accepts requests from the
application logic server module and sends the synonyms
back to it.
IV. CONCLUSION
Text mining technologies, which go beyond simple
searching methods, are the key to information discovery
and have a promising outlook for application in all areas
of work. The methods proposed and implemented in this
project can be used innovatively by researchers in
different fields such as biomedicine, biosciences, and
research and development. Using these methods saves
their time.
Another area that benefits from the text mining
methods implemented in this project is education, where
students, educators and researchers can find more
information relating to their topics at faster speeds than
with traditional ad hoc searching. The methods can also
be used in the extraction of data present in web pages to
reduce the length/size to be displayed on certain mobile
devices.
Finally, in this fast world where everyone runs out
of time, the text mining methods implemented in this
project come in handy when they are utilized and applied
in the right way.
ACKNOWLEDGMENT
We would like to thank Raghuveer for his great idea,
which gave us the motivation to implement this project.
We would also like to thank our guide, Mrs. Sumathi,
who supported us in all ways.
REFERENCES
[1] Choi (2000) Advances in domain independent linear text
segmentation. In Proceedings of the 1st North American Chapter of
the Association for Computational Linguistics, Seattle, pp 26-33.
[2] Dunning T (1993) Accurate methods for the statistics of surprise
and coincidence. Computational Linguistics 19, pp 61-74.
[3] Eduard Hovy (2002) Automated text summarization, DUC 2002.
[4] Hearst M.A (1997) TextTiling: Segmenting text into subtopic
passages. Computational Linguistics 23, pp 33-44.
[5] Joel Larocca Neto (2003) Document clustering and text
summarization, DUC 2003.
[6] Meru Brunn (2001) Text summarization using lexical chains.
[7] Moen M & De Busser R (2001) Generic topic segmentation of
document text. In Proceedings of the 24th Annual ACM SIGIR
Conference on Research and Development in Information Retrieval,
ACM, New York, pp 418-419.
[8] Morris, Lexical cohesion computed by thesaural relations as an
indicator of the structure of text. Computational Linguistics, pp 21-48.
[9] R. Barzilay and M. Elhadad (1997) Using lexical chains for text
summarization. Intelligent Text Summarization, ACL, 1997.
[10] Vishal Gupta (Panjab University, Chandigarh, India) and
Gurpreet S. Lehal (Professor & Head, Department of Computer
Science, Punjabi University, Patiala, India) A Survey of Text Mining
Techniques and Applications.
[11] Weiguo Fan, Linda Wallace, Stephanie Rich, and Zhongju
Zhang (2005) Tapping into the Power of Text Mining, Journal of
ACM, Blacksburg.
[12] Webpages:
http://people.ischool.berkeley.edu/~hearst/text-mining.html
http://en.wikipedia.org/wiki/Text_mining
http://summly.com/
http://www.ijettjournal.org