Automated process chain for Text Mining with RSS and Corpora Builder
Marcus KNAPPE
HS-Anhalt, Lange Straße 52, Köthen, 06366, Germany
Tel: +49 172 1854616, Email: Marcus.Knappe@inf.hs-anhalt.de
Abstract: Conventional Data Mining methods cannot use all the information of a
company or project. Text Mining makes many correlations in texts visible and
provides contained information that is not apparent at first view. In this
paper, the general process chain of Text Mining is shown, as well as a special
process chain with RSS, which was designed to collect large amounts of text for
analysis with little effort. Moreover, the use of the analysis results and their
visualization with the Java framework Prefuse is discussed.
Keywords: Corpora, Feed, Prefuse, RSS, Text Mining.
1. Introduction
A study by IBM shows that 80 % of all existing data is in unstructured form, for
example emails, call center notes, free texts, news, blogs, Twitter, surveys,
web forms, company-internal documents and other text sources or documents [1].
Normal Data Mining cannot use the patterns and trends contained in such texts
for analysis, like paradigmatic relations, syntagmatic relations, semantic
relations, logical relations and technical terms.
In the tradition of linguistic structuralism, a paradigmatic relation is the
name for the occurrence of two word forms in similar contexts, while a
syntagmatic relation is the name for the common occurrence of two word forms in
a sentence or text window. Semantic relations are sets of pairs of word forms of
a language; between two word forms a semantic relation exists only if they are
in a paradigmatic or syntagmatic relation. The semantic relations of adjacent
word forms include in particular category or functional specification, unit or
qualification, modification and change. Logical relations are semantic relations
that support logical conclusions. These include in particular broader and
narrower relationships, synonyms, opposites, antonyms, complementary concepts
and converses. The notion of logical relations can be defined set-theoretically.
A technical term is a word or a phrase that is characteristic of a given field.
To locate technical terms, a reference corpus is required. The reference corpus
is usually composed of a number of texts that reflect general linguistic usage.
General linguistic usage can also include some well-known technical terms from
the various technical languages; these technical terms, however, occur
relatively rarely in the reference corpus compared with a corpus of the relevant
technical language [2][3][4].
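The comparison with a reference corpus can be made concrete with a simple
frequency ratio. The paper names no specific formula, so the following Java
sketch uses one common, assumed measure: the relative frequency of a term in the
domain corpus divided by its relative frequency in the reference corpus, where
large values mark candidate technical terms.

import java.util.Map;

public class TechnicalTermRatio {
    // Ratio of relative frequencies: domain corpus vs. reference corpus.
    // Add-one smoothing keeps terms that are absent from the reference
    // corpus from causing a division by zero.
    static double ratio(String term,
                        Map<String, Integer> domainCounts, long domainTokens,
                        Map<String, Integer> referenceCounts, long referenceTokens) {
        double domainFreq = domainCounts.getOrDefault(term, 0) / (double) domainTokens;
        double referenceFreq = (referenceCounts.getOrDefault(term, 0) + 1.0)
                / (referenceTokens + 1.0);
        return domainFreq / referenceFreq;
    }
}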
This means that normal Data Mining, which concentrates on structured data like
gender, address, buying behavior and properties of a product, involves less than
20 % of the available information. The information extracted by Text Mining,
however, can be used by Data Mining [1][4].
2. General process chain
The next figure (fig. 1) shows the general process chain of Text Mining.
Figure 1: General process chain of Text Mining [4]
The first step in the general process chain is the text acquisition. This
initial step includes all mechanisms to collect and extract texts and to get
them into machine-readable form, for example collecting all threads in a forum
or blog about a product of a company, extracting the text from these threads and
saving it for the next step.
The second step is the key step in Text Mining, the Corpora Builder. It converts
the unstructured information of the texts into structured data: it builds
co-occurrences and extracts important information.
The last step is the usage step. The data extracted in step two can now be used
in other applications or be visualized. The other applications can be methods
from Data Mining or a data warehouse. For visualizing term relations, a
sociogram (fig. 2) is best practice; this is a special network diagram with
terms as nodes, in which the relation of two terms is represented by the edge
between the two nodes. For trend visualization, a bar or line chart with time as
abscissa and the significance of a term at that time as ordinate is used.
Figure 2: Sociogram [4]
3. Special process chain with RSS
A big and growing source of texts is Really Simple Syndication (RSS), an
XML-based data format for the distribution of texts [5]. The technique uses
client-server connections: the user subscribes to an RSS channel and the client
checks at regular intervals for new RSS feeds. These feeds contain the headline,
the language, the date, a short text excerpt and the URL of the full text
version [5].
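Such a feed can be read with the XML tools that ship with Java. The following
sketch is only an illustration of the feed structure described above (RSS 2.0
element names such as item, pubDate, description and link); it is not the
crawler actually used in the process chain.

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FeedReader {
    // One feed entry with the fields named above; in RSS 2.0 the language
    // is usually given once per channel, not per item.
    record FeedItem(String title, String language, String date, String excerpt, String url) {}

    public static List<FeedItem> read(String feedUrl) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new URL(feedUrl).openStream());
        NodeList langNodes = doc.getElementsByTagName("language");
        String language = langNodes.getLength() > 0
                ? langNodes.item(0).getTextContent().trim() : "";
        List<FeedItem> items = new ArrayList<>();
        NodeList itemNodes = doc.getElementsByTagName("item");
        for (int i = 0; i < itemNodes.getLength(); i++) {
            Element item = (Element) itemNodes.item(i);
            items.add(new FeedItem(
                    text(item, "title"),
                    language,
                    text(item, "pubDate"),
                    text(item, "description"),
                    text(item, "link")));
        }
        return items;
    }

    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() > 0 ? list.item(0).getTextContent().trim() : "";
    }
}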
Figure 3: Special process chain with RSS [4]
The following description refers to figure 3.
Step 1.1: The database covers around 700 RSS sources, like Wall Street, BBC,
CNN, FAZ and Spiegel. The sources are classified into subject areas, and each
source has a name, a URL and a next crawl rotation. The subject areas, for
example politics, science, sport, engineering, finance, society, business,
technology and energy, are used to generate subject area corpora in steps 1.3
and 2. The URL, from which the new feeds can be downloaded, and the next crawl
rotation, which estimates when a source will have new feeds, are important for
the next step 1.2.
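One entry in this source database can be sketched as a minimal Java record; the
field names are assumptions derived from the properties just listed.

// One RSS source as described in step 1.1: classified by subject area,
// with the feed URL and the adaptive next crawl rotation (in minutes).
record RssSource(String name, String url, String subjectArea, int nextCrawlRotationMinutes) {}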
Step 1.2: An automated program starts a new crawl every two hours. To collect
all feeds from the 700 RSS sources, the program needs around 14 hours if all
feeds are new; in practice there are around 1,000 new feeds every two hours. One
RSS source contains between 5 and 100 feeds. If a "good" source normally
contains, for example, 10 feeds but publishes 25 new feeds on one day, a crawl
of only once per day misses some feeds. Some sources are "good" sources because
they publish some feeds every two hours, like BBC, and there are some "bad"
sources, too, which publish only few feeds per day or per week, like "Swiss info
economy in German". If a crawl tested all 700 RSS sources for new feeds, it
would take more than two hours; thus a continuous learning process is needed to
distinguish between "good" and "bad" sources. "Good" and "bad" say nothing about
the quality of the texts, only about the publishing ratio. The default value for
the next crawl rotation is 60 minutes. If an RSS source is a "good" source and
has new feeds, its next crawl rotation is decreased by the factor 2/3. If, on
the other side, an RSS source is a "bad" source and has zero new feeds in this
crawl, its next crawl rotation is raised by the factor 3/2. The new feeds are
saved in a database, too. Around 12,000 new entries are collected on a weekday
and around 7,000 on a weekend day. These entries have important properties, like
language and date, with which trends or language-specific properties can be
analyzed. The collected URL is the interface to the complete message and is used
in the next step 1.3.
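The described adjustment of the next crawl rotation can be sketched as follows.
The factors 2/3 and 3/2 and the default of 60 minutes are taken from the text;
the lower and upper bounds are assumptions, since an unbounded rotation would be
impractical.

public class CrawlScheduler {
    static final int DEFAULT_MINUTES = 60;       // default next crawl rotation
    static final int MIN_MINUTES = 10;           // assumed lower bound
    static final int MAX_MINUTES = 7 * 24 * 60;  // assumed upper bound: one week

    // "Good" sources (new feeds found) are visited sooner,
    // "bad" sources (zero new feeds) less often.
    static int nextRotation(int currentMinutes, int newFeeds) {
        int next = newFeeds > 0
                ? currentMinutes * 2 / 3    // decrease by the factor 2/3
                : currentMinutes * 3 / 2;   // raise by the factor 3/2
        return Math.max(MIN_MINUTES, Math.min(MAX_MINUTES, next));
    }
}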
In the RSS feed to text extractor (step 1.3), the desired feeds can be selected
by subject area, language and/or date; they can also be limited by count. After
this configuration, the download process is started and the text content of the
selected RSS feeds is extracted. Each text content is saved to its own text
file, and the files are grouped into directories by selection, as sketched
below.
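The following sketch illustrates this grouping. The directory layout and the
naive tag-stripping extractor are assumptions, since the paper does not name the
concrete extraction method.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextExporter {
    // Saves one extracted text to its own file, grouped by the selection
    // criteria (here: subject area and language) as directories.
    static void export(Path baseDir, String subjectArea, String language,
                       String feedId, String html) throws Exception {
        Path dir = baseDir.resolve(subjectArea).resolve(language);
        Files.createDirectories(dir);
        Files.writeString(dir.resolve(feedId + ".txt"),
                extractText(html), StandardCharsets.UTF_8);
    }

    // Naive placeholder: strips tags and collapses whitespace. A real
    // extractor must also remove navigation, advertisements and other
    // page boilerplate.
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}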
The Corpora Builder (step 2) is the core of the process chain. It converts the
extracted text from the text files into useful information. For this, the
Corpora Builder comprises steps like sentence segmentation, tokenization,
optional lemmatization and stop word reduction; it counts words and generates
co-occurrences. The sentence segmentation scans the text for sentence end
characters, like '.', '!' and '?'. Here the abbreviation dot is a problem. For
example, in "Google and Microsoft are U.S. companies." the first and second dots
are not sentence end dots but abbreviation dots, while the third dot is a real
sentence end dot.
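One simple way to handle the abbreviation dot is an explicit abbreviation list,
as in the following sketch. The list shown is only illustrative; a production
segmenter needs a much larger, language-specific list.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SentenceSegmenter {
    // Illustrative abbreviations; "U.S" is the token before its final dot.
    private static final Set<String> ABBREVIATIONS = Set.of("U.S", "e.g", "i.e", "Dr", "Inc");

    public static List<String> segment(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);
            if (c == '.' || c == '!' || c == '?') {
                boolean endOfText = i == text.length() - 1;
                boolean followedBySpace = !endOfText && text.charAt(i + 1) == ' ';
                boolean abbreviation = c == '.' && ABBREVIATIONS.contains(lastToken(current));
                if (!abbreviation && (endOfText || followedBySpace)) {
                    sentences.add(current.toString().trim());
                    current.setLength(0);
                }
            }
        }
        if (!current.toString().isBlank()) sentences.add(current.toString().trim());
        return sentences;
    }

    // The token preceding the sentence end character just appended.
    private static String lastToken(StringBuilder s) {
        String withoutDot = s.substring(0, s.length() - 1);
        return withoutDot.substring(withoutDot.lastIndexOf(' ') + 1);
    }

    public static void main(String[] args) {
        // Prints two sentences; "U.S." does not end the first one.
        segment("Google and Microsoft are U.S. companies. They are large.")
                .forEach(System.out::println);
    }
}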
The tokenization scans the separated sentence for morphemes, the smallest
meaningful units and building blocks of the sentence; for the previous example
this yields {Google, and, Microsoft, are, U, ., S, ., companies, .}. The
lemmatization is an optional feature; it is the reduction of a term to its basic
form, so "companies", for example, can be reduced to "company". Some words like
"the", "a", "I", "and" and "or" are stop words, i.e. words that should be
excluded from the analysis. These include, in general, words or word forms from
the closed word classes, such as articles, conjunctions and prepositions. A
co-occurrence refers to the statistically significant common appearance of two
word forms in one sentence or text window (context).
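Counting sentence co-occurrences can be sketched as follows. The raw pair counts
are only the first step; a significance measure applied to them (the paper does
not name which one) decides which co-occurrences are statistically significant.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class CooccurrenceCounter {
    private final Map<String, Integer> pairCounts = new HashMap<>();

    // Counts each unordered pair of distinct tokens once per sentence.
    public void addSentence(List<String> tokens) {
        List<String> unique = new ArrayList<>(new TreeSet<>(tokens));
        for (int i = 0; i < unique.size(); i++) {
            for (int j = i + 1; j < unique.size(); j++) {
                pairCounts.merge(unique.get(i) + "|" + unique.get(j), 1, Integer::sum);
            }
        }
    }

    public int count(String a, String b) {
        String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
        return pairCounts.getOrDefault(key, 0);
    }
}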
With all this information a general reference corpus can be built, or more
specific reference corpora for a selection of language, subject area or date.
With this, the analysis of patterns is possible: technical terms and synonyms
can be located easily. With some selected date corpora, trends can be analyzed;
the corpora of different times are compared with each other, and the differences
and similarities are located. Now the new information from Text Mining can also
be used by Data Mining methods (step 3.1) [2][3][4][5].
Figure 4: Example with Prefuse [4]
Prefuse is a Java framework for graph visualization (step 3.2). Figure 4 shows
an example for the German company DHL. In the middle the search term, DHL, is
located, and the most significant terms are linked with it. The border width
represents the significance of two linked terms, and with a mouse-over the
absolute significance can be shown in a little textbox. If there are synonyms
for a term, like FDX, FEDEX and FEDEX EXPRESS for FEDERAL EXPRESS, they can also
be visualized with a mouse-over in a textbox. If there are other interesting
terms, clicking on them navigates between them [4][6].
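A minimal term graph along these lines can be built with a few prefuse classes,
roughly as in the sketch below. It follows the pattern of the prefuse example
applications and is not the visualization actually used for figure 4; the node
labels are invented for illustration, and mapping edge width to significance as
well as the mouse-over textboxes would need additional actions and controls.

import javax.swing.JFrame;
import prefuse.Display;
import prefuse.Visualization;
import prefuse.action.ActionList;
import prefuse.action.RepaintAction;
import prefuse.action.assignment.ColorAction;
import prefuse.action.layout.graph.ForceDirectedLayout;
import prefuse.activity.Activity;
import prefuse.controls.DragControl;
import prefuse.data.Graph;
import prefuse.data.Node;
import prefuse.render.DefaultRendererFactory;
import prefuse.render.LabelRenderer;
import prefuse.util.ColorLib;
import prefuse.visual.VisualItem;

public class TermGraph {
    public static void main(String[] args) {
        // Term graph: the search term in the middle, linked to other terms.
        Graph g = new Graph();
        g.getNodeTable().addColumn("term", String.class);
        Node center = g.addNode();
        center.setString("term", "DHL");
        for (String term : new String[] {"FEDERAL EXPRESS", "UPS", "DEUTSCHE POST"}) {
            Node n = g.addNode();             // example terms, invented for illustration
            n.setString("term", term);
            g.addEdge(center, n);
        }

        Visualization vis = new Visualization();
        vis.add("graph", g);
        vis.setRendererFactory(new DefaultRendererFactory(new LabelRenderer("term")));

        // Continuous force-directed layout plus simple coloring.
        ActionList layout = new ActionList(Activity.INFINITY);
        layout.add(new ForceDirectedLayout("graph"));
        layout.add(new ColorAction("graph.nodes", VisualItem.FILLCOLOR, ColorLib.rgb(220, 220, 255)));
        layout.add(new ColorAction("graph.nodes", VisualItem.TEXTCOLOR, ColorLib.gray(0)));
        layout.add(new ColorAction("graph.edges", VisualItem.STROKECOLOR, ColorLib.gray(150)));
        layout.add(new RepaintAction());
        vis.putAction("layout", layout);

        Display display = new Display(vis);
        display.setSize(500, 400);
        display.addControlListener(new DragControl());

        JFrame frame = new JFrame("Term sociogram");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.add(display);
        frame.pack();
        frame.setVisible(true);
        vis.run("layout");
    }
}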
4. Conclusions
Text Mining has a great future and will achieve very good synergies together
with Data Mining. The path shown here with RSS enables a simple, straightforward
and not memory-intensive collection of texts. Time-based analysis is very easy
because the date of the feeds is stored. For the future, a better RSS source
collector is needed: the RSS sources used here were collected manually, whereas
future sources should be collected automatically, rated by their publishing
rotation, and orphaned sources should be deleted. The freeware visualization
toolkit can perform more and should be optimized.
References
[1] SPSS Text Mining Tage in Hamburg (24.03.2010)
[2] Heyer, G. et al. (2006), Text Mining: Wissensrohstoff Text, Bochum: W3L-Verlag GmbH
[3] Ferber, R. (2003), Information Retrieval: Suchmodelle und Data-Mining-Verfahren für Textsammlungen und das Web, Heidelberg: dpunkt-Verlag GmbH
[4] Internal paper of the TextTech Informationsmanagement und Texttechnologie GmbH
[5] Really Simple Syndication and the RSS icon, http://www.rss-nachrichten.de/
[6] Prefuse visualization toolkit, http://www.prefuse.org/