Text Mining and Applications

Nour Khalid Khalil, Hajar Faisal Al-Katheery
Computer Information Science Dept., Prince Sultan College for women
Riyadh
Abstract: A wide range of technologies now focus on human natural language, speech recognition among them. Text mining is one of the technologies that take human language as input. This paper explains what text mining does, what its applications are, and which software uses it.
I. Introduction
The main activity of text mining is extracting meaningful information from texts, but it serves many other purposes as well.
The StatSoft website [reference 5] states that the purpose of text mining is “to process unstructured (textual) information, extract meaningful numeric indices from the text”, for example by counting how often a word is mentioned or by measuring how a document is related to other sources. The individual purposes are explained in the numbered list that accompanies Figure 1.
Analysis of natural-language text is needed today, given the number of documents and research papers and the richness of the textual data in use. According to Gartner Group,
“almost 90% of knowledge available at an organization today
is dispersed throughout piles of documents buried within
unstructured text. Books, magazine articles, research papers,
product manuals, memorandums, e-mails, and of course the
Web, all contain textual information in the natural language
form”.
A text consists of words organized into sentences and paragraphs. It is easy for a human to process a text and find something in it when the text is limited to a paragraph or two. With a document of two or three pages, or much more, even counting how many words it contains becomes tedious. This is where text mining is useful.
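As a minimal sketch of the word-counting task just described, the Python snippet below counts the total number of words in a text and how often each word occurs. The function name `word_frequencies`, the sample sentence, and the rule for what counts as a word are assumptions made for illustration, not part of any particular text mining product.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Assumption: a word is any run of letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


# Illustrative sample text (hypothetical).
sample = "Text mining extracts information from text. Mining text saves time."
freq = word_frequencies(sample)
total_words = sum(freq.values())

print(total_words)          # total number of words in the sample
print(freq.most_common(2))  # the two most frequent words with their counts
```

The same function works unchanged on a document of any length, which is the point of handing the counting to software.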
Special algorithms are therefore implemented in text mining software, and they differ from one tool to another based on the purpose. The main functions of text mining, however, are handling texts, clarifying their meaning, and creating summaries of documents, as shown in Figure 1.
II. What is text mining?
Text mining, also called text analytics, is, as Wikipedia [reference 4] puts it, “the process of deriving high quality information from text”. That means discovering previously unknown information in textual data.
Figure 1: Functionality of text mining.
1. Summarization is reducing a text to the information that is needed. A related task is sentiment analysis: for example, given a large number of reviews of a book, text mining can extract the overall verdict, whether the book was rated well or badly.
2. Distilling the meaning is the process of purifying the information in a text.
3. Text-based navigation is tracking information.
4. Topic-structured explication is clarifying the topic of the text.
5. Clustering is, as described in Wikipedia, the assignment of “a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense”.
6. Semantic information retrieval is retrieving information in a way that is meaningful to humans.
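Of the purposes listed here, clustering is perhaps the easiest to illustrate in a few lines. The sketch below groups documents whose word sets overlap strongly. The similarity measure (Jaccard overlap), the greedy grouping rule, the `threshold` value, and the sample documents are all assumptions chosen for illustration. Real text mining tools may use quite different methods.

```python
def jaccard(a, b):
    """Similarity of two word sets: size of the overlap divided by size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def cluster(documents, threshold=0.3):
    """Greedily place each document in the first cluster whose first member is similar enough."""
    word_sets = [set(doc.lower().split()) for doc in documents]
    clusters = []  # each cluster is a list of indices into `documents`
    for i, words in enumerate(word_sets):
        for members in clusters:
            if jaccard(words, word_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters


# Illustrative sample documents (hypothetical).
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices rose sharply today",
    "stock prices fell sharply today",
]
groups = cluster(docs)
print(groups)  # indices of documents that were grouped together
```

With these sample documents, the two sentences about the cat fall into one cluster and the two about stock prices into another, because each pair shares most of its words.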
III. Text mining and searching
Marti Hearst [reference 1] explains that “In search, the user is typically looking for something that is already known and has been written by someone else”, whereas text mining, as its definition says, looks for unknown information.
IV. Data mining and text mining
The major difference between the two lies in their input. Data mining turns large quantities of data held in databases into useful analysis. Text mining, on the other hand, is a smaller division of data mining: it takes terms from unorganized texts and forms them into a database of useful information. This makes text mining an important step in data mining whenever the source material is text.
V. Applications
There are many applications of text mining, but all aim to extract useful information from textual data.
One of the interesting ways I saw text mining applied was in social networks, where posts are processed to detect what is being talked about. Jacobus van Eeden states on the Memeburn website [reference 11] that the graph in Figure 2 identifies the words that occur most frequently in popular posts.
Figure 2: Most frequently occurring words in social media
Another fascinating use of text mining is email filtering. Text mining sorts email messages into categories such as spam, for example by finding words that are unlikely to appear in a regular message.
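The idea of flagging messages by unusual words can be sketched as follows. The word list `SUSPICIOUS_WORDS`, the function name `is_spam`, and the rule of requiring at least two hits are hypothetical choices for illustration. Production spam filters typically rely on more sophisticated statistical methods.

```python
import re

# Hypothetical list of words assumed to be rare in regular messages.
SUSPICIOUS_WORDS = {"winner", "prize", "free", "lottery", "urgent"}


def is_spam(message, min_hits=2):
    """Flag a message as spam when it contains at least `min_hits` suspicious words."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return len(words & SUSPICIOUS_WORDS) >= min_hits


# Illustrative sample messages (hypothetical).
spam_result = is_spam("URGENT: you are a winner, claim your free prize now")
regular_result = is_spam("Are you free for the project meeting tomorrow?")
print(spam_result, regular_result)
```

Requiring more than one hit keeps an ordinary message that happens to contain a single listed word, such as "free", from being flagged.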
The next application of text mining is analyzing open-ended survey responses. The results of an open-ended survey are usually paragraphs of description or opinion, and a researcher may find it hard to analyze thousands of them. Text mining can find the terms or words that are used frequently, for example to identify the pros and cons of a product, as described on the StatSoft website [reference 5].
One of the most widely used applications in academic institutions such as high schools and colleges is detecting plagiarism. A special algorithm is used for it: the process starts by scanning the paper, then looks online for similar combinations of words to detect similarity, and highlights what is found.
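One simple way to look for "similar combinations of words" is to compare runs of consecutive words (word n-grams) between a submission and a candidate source. The sketch below assumes the source text has already been retrieved and uses runs of three words. The function names and sample texts are hypothetical, and actual plagiarism checkers may work differently.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of all runs of `n` consecutive words in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_fraction(submission, source, n=3):
    """Fraction of the submission's word n-grams that also appear in the source."""
    sub = word_ngrams(submission, n)
    if not sub:
        return 0.0
    return len(sub & word_ngrams(source, n)) / len(sub)


# Illustrative sample texts (hypothetical submissions checked against one source).
source = "Text mining is the process of deriving high quality information from text."
copied = "Text mining is the process of deriving useful facts."
original = "Our survey asked students about their study habits."

print(shared_fraction(copied, source))    # high: most three-word runs match the source
print(shared_fraction(original, source))  # zero: no three-word run matches the source
```

The matching n-grams themselves, available from the set intersection, are what a tool would highlight in the submitted paper.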
Finally, one of the best-known applications is biomedical text mining. As defined in Wikipedia [reference 10], it “refers to text mining applied to texts and literature of the biomedical and molecular biology domain”. Biomedical text mining is used because of the large quantity of articles, research papers, and other literature in the medical field, from which it extracts only the topics that are needed.
VI. Challenges to text mining
Text mining has some difficulties and open issues. Kuan C. Chen of Purdue University Calumet, US [reference 8], mentions several of them, including result interpretation. As he says,
“It is a difficult aspect because result interpretation is
dependent on the skill of the software technician. The greater
the skill of the technician, the more effective the data or text
mine. Even if a skilled technician is very successful with the
data mine, the data mine still may not reach its potential as the
user may not have the analytical skills to interpret the results
of the text mine”.
Another challenge is that most text mining tools work with English, so they are of limited use for text in other languages.
Other challenges arise from the ambiguity of language: at the semantic or lexical level, the context or the spelling of a word may affect the results.
VII. Text mining tools
A lot of software uses text mining techniques. There are two kinds of tools. The first kind is online text mining tools such as Ranks, Vivisimo/Clusty, and Wordle, which makes an interesting use of text mining. The other kind is regular text mining software such as Basis Technology, Clarabridge, and Compare Suite. Links to all of these tools are listed in the references [reference 12].
VIII. Conclusion
In conclusion, technology always evolves to ease tasks for mankind. Text mining is one of the technologies that assist people with textual data. It helps institutions and individuals organize information into an understandable format while saving time, effort, and manpower.
IX. References
[1] Marti Hearst, October 17, 2003, http://people.ischool.berkeley.edu/~hearst/textmining.html
[2] Miloš Radovanovic, Mirjana Ivanovic, 2008, http://www.emis.de/journals/NSJOM/Papers/38_3/NSJOM_38_3_227_234.pdf
[3] Erin Scroggins, May 15, 2008, http://erinscroggins.blogspot.com/2008/05/applications-of-text-mining.html
[4] http://en.wikipedia.org/wiki/Text_mining#Academic_applications
[5] http://www.statsoft.com/textbook/text-mining/
[6] Mark Sharp, December 11, 2001, http://comminfo.rutgers.edu/~msharp/text_mining.htm
[7] Anne Kao, Steve Poteet, http://www.sigkdd.org/explorations/issues/7-1-200506/1-Intro.pdf
[8] Kuan C. Chen, May 2009, http://www.cluteinstituteonlinejournals.com/PDFs/1483.pdf
[9] http://www.kdnuggets.com/software/text.html
[10] http://en.wikipedia.org/wiki/Biomedical_text_mining
[11] Jacobus van Eeden, September 30, 2010, http://memeburn.com/2010/09/text-mining-reveals-best-and-worst-words-used-in-social-media/
[12] Text mining tools:
a. http://ranks.nl/
b. http://search.yippy.com/
c. http://www.wordle.net/
d. http://www.basistech.com/
e. http://www.clarabridge.com/
f. http://comparesuite.com/
[13] Michael W. Berry, Jacob Kogan, “Text Mining: Applications and Theory”, 2010.