International Journal of Engineering Trends and Technology (IJETT) – Volume 4, Issue 5 – May 2013
A Proximity Phrase Association Pattern Mining of
unstructured text
Talluri Nalini*, Satyanarayana Mummana#
Final M.Tech Student, Dept. of Computer Science and Engineering, Avanthi Institute of Engineering & Technology, Narsipatnam, Andhra Pradesh
Assistant Professor, Dept. of Computer Science and Engineering, Avanthi Institute of Engineering & Technology, Narsipatnam, Andhra Pradesh
Abstract:- Text and Web mining are complex tasks for large collections of unstructured text. We therefore introduce a simple method for finding patterns in unstructured text, built on a framework for optimized pattern discovery. We introduce a class of proximity phrase association patterns and search for the patterns that optimize a given statistical measure. Using this framework, large sets of documents and Web pages can be organized easily.
I. INTRODUCTION

Text mining, sometimes alternately referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, and data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis via application of natural language processing (NLP) and analytical methods.
In 2001, Dow Chemical merged with Union Carbide Corporation (UCC), requiring a massive integration of over 35,000 of UCC's reports into Dow's document management system. Dow chose Clear Forest, a leading developer of text-driven business solutions, to help integrate the document collection. Using technology they had developed, Clear Forest indexed the documents and identified chemical substances, products, companies, and people. This allowed Dow to add more than 80 years' worth of UCC's research to their information management system and approximately 100,000 new chemical substances to their registry. When the project was complete, it was estimated that Dow had spent almost $3 million less than what they would have if they had used their own existing methods for indexing documents. Dow also reduced the time spent sorting documents by 50% and reduced data errors by 10-15%.
Text mining is similar to data mining, except that data mining tools are designed to handle structured data from databases or XML files, whereas text mining can work with unstructured or semi-structured data sets such as e-mails, full-text documents, and HTML files. As a result, text mining is a much better solution for companies, such as Dow, where large volumes of diverse types of information must be merged and managed. To date, however, most research and development efforts have centered on data mining using structured data. The problem introduced by text mining is obvious: natural language was developed for humans to communicate with one another and to record information, and computers are a long way from comprehending natural language. Humans have the ability to distinguish and apply linguistic patterns to text, and humans can easily overcome obstacles that computers cannot easily handle, such as slang, spelling variations, and contextual meaning. However, although our language capabilities allow us to comprehend unstructured data, we lack the computer's ability to process text in large volumes or at high speeds. Herein lies the key to text mining: creating technology that combines a human's linguistic capabilities with the speed and accuracy of a computer.
Figure 1 depicts a generic process model for a text mining application. Starting with a collection of documents, a text mining tool would retrieve a particular document and preprocess it by checking format and character sets. Then it would go through a text analysis phase, sometimes repeating techniques until information is extracted. Three text analysis techniques are shown in the example, but many other combinations of techniques could be used depending on the goals of the organization. The resulting information can be placed in a management information system, yielding an abundant amount of knowledge for the user of that system.
Figure 1. A generic text mining process model: documents are retrieved from the collection and preprocessed, analyzed (information extraction, clustering, and related techniques), and the resulting information is stored in a management information system, yielding knowledge for the user.
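As a rough illustration of this process model, the following Python sketch strings the stages together. The function names (preprocess, analyze, text_mining_pipeline) and the simple term-frequency analysis are illustrative assumptions, not part of any system described in this paper.

import re
from collections import Counter

def preprocess(raw_text):
    """Character-set/format cleanup and tokenization (here: lowercase word tokens)."""
    return re.findall(r"[a-z0-9]+", raw_text.lower())

def analyze(tokens):
    """One possible analysis pass: simple term-frequency extraction."""
    return {"term_frequencies": Counter(tokens)}

def text_mining_pipeline(documents):
    """Retrieve -> preprocess -> analyze each document; the resulting records are
    what a management information system could store."""
    return [analyze(preprocess(doc)) for doc in documents]

if __name__ == "__main__":
    docs = ["Tankers in the Gulf were attacked.", "Seamen were still on strike."]
    print(text_mining_pipeline(docs)[0]["term_frequencies"].most_common(3))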
II. RELATED WORK

A. Fast and robust text mining algorithm for ordered patterns
If the maximum number of phrases in a pattern is bounded by a constant d, then the frequent pattern problem for both unordered and ordered proximity phrase association patterns is solvable by the Enumerate-Scan algorithm [12], a modification of a naive generate-and-test algorithm, in O(n^{d+1}) time and O(n^d) scans, although this is still too slow to apply to real-world problems. Adopting the framework of optimized pattern discovery, we have developed an efficient algorithm, called Split-Merge, that finds all optimal patterns in the class of ordered k-proximity d-phrase association patterns for various measures, including the classification error and information entropy [2, 3]. The algorithm quickly searches the hypothesis space using dynamic reconstruction of a content index, called a suffix array, combined with several techniques from computational geometry and string algorithms.

We showed that the Split-Merge algorithm runs in almost linear time on average, more precisely in O(k^{d+1} N (log N)^{d+1}) time using O(k^{d+1} N) space for nearly random texts of size N [3]. We also showed that the problem of finding one of the best phrase patterns with arbitrarily many strings is MAX SNP-hard [3]. Thus, there is no efficient approximation algorithm with arbitrarily small error for this problem when the number d of phrases is unbounded. In experiments on an English text collection of 7 MB, the algorithm finds the best 600 patterns under the entropy measure in seconds for d = 2 and in a few minutes for d = 4, with k = 2 words, using a few hundred megabytes of main memory on a Sun Ultra 60 (UltraSPARC II 300 MHz, g++ on Solaris 2.6) [6].
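To make the pattern class concrete, the sketch below checks whether an ordered d-phrase pattern with proximity k matches a tokenized document, i.e., whether the phrases occur in the stated order with at most k words between the end of one phrase and the start of the next. This is only a naive illustration of the matching predicate under our reading of the gap constraint; it is not the Split-Merge algorithm, which searches the pattern space via a suffix array.

def phrase_positions(words, phrase):
    """Start indices where `phrase` (a list of words) occurs in `words`."""
    n, m = len(words), len(phrase)
    return [i for i in range(n - m + 1) if words[i:i + m] == phrase]

def matches_ordered(words, phrases, k):
    """True if `phrases` occur in `words` in the given order, each phrase starting
    at most k words after the end of the previous one."""
    def search(min_start, max_start, remaining):
        if not remaining:
            return True
        head, rest = remaining[0], remaining[1:]
        for pos in phrase_positions(words, head):
            if pos < min_start:
                continue
            if max_start is not None and pos > max_start:
                break  # positions are sorted; later ones only violate the proximity bound
            end = pos + len(head)
            if search(end, end + k, rest):
                return True
        return False
    return search(0, None, phrases)   # the first phrase may start anywhere

doc = "the seamen said they were still on strike".split()
pattern = [["seamen"], ["were"], ["still", "on", "strike"]]
print(matches_ordered(doc, pattern, k=2))   # True: each gap is at most 2 words
print(matches_ordered(doc, pattern, k=0))   # False: "said they" separates seamen/were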
B. Developing a scan-based algorithm for unordered patterns
In Web mining, the unordered version of the phrase patterns is more suitable than the ordered version. Besides this, we also have to deal with huge text databases that cannot fit into main memory. For this purpose, we developed another pattern discovery algorithm, called Levelwise-Scan, for mining unordered phrase patterns from large disk-resident text data [4]. Based on the design principle of the Apriori algorithm of Agrawal [1], the Levelwise-Scan algorithm quickly discovers the most frequent unordered patterns with d phrases and proximity k in O(n^2 + N (log n)^d) time and O(n log n + R) space on nearly random texts, using a random sample of size n, where N is the total size of the input text and R is the output size [4]. To cope with the huge feature space of phrase patterns, the algorithm combines the techniques of random sampling, the generalized suffix tree, and the pattern matching automaton. In computer experiments on large text data, the Levelwise-Scan algorithm quickly finds patterns for various ranges of parameters and scales up to large disk-resident text databases.
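The sketch below illustrates only the level-wise (Apriori-style) candidate generation that underlies this approach: frequent (d-1)-phrase sets are extended by one phrase and pruned before counting. It deliberately omits the proximity constraint, the random sampling, the generalized suffix tree, and the disk-resident scanning of the actual Levelwise-Scan algorithm, so it should be read as an assumed simplification of the principle rather than the algorithm itself.

from itertools import combinations

def levelwise_phrase_sets(docs, phrases, min_support, max_d):
    """docs: list of token lists; phrases: candidate single phrases (tuples of words).
    Returns frequent unordered phrase sets of size 1..max_d as frozensets."""
    def contains(doc, phrase):
        m = len(phrase)
        return any(tuple(doc[i:i + m]) == phrase for i in range(len(doc) - m + 1))

    def support(phrase_set):
        return sum(all(contains(doc, p) for p in phrase_set) for doc in docs) / len(docs)

    frequent = {1: {frozenset([p]) for p in phrases
                    if support(frozenset([p])) >= min_support}}
    for d in range(2, max_d + 1):
        # Candidate generation: unions of two frequent (d-1)-sets of total size d,
        # kept only if every (d-1)-subset is frequent (the Apriori pruning step).
        candidates = {a | b for a in frequent[d - 1] for b in frequent[d - 1]
                      if len(a | b) == d}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[d - 1]
                             for s in combinations(c, d - 1))}
        frequent[d] = {c for c in candidates if support(c) >= min_support}
        if not frequent[d]:
            break
    return frequent

docs = [["the", "seamen", "were", "on", "strike"],
        ["the", "union", "of", "seamen", "called", "a", "strike"],
        ["gulf", "shipping", "was", "hit"]]
phrases = [("seamen",), ("strike",), ("union",), ("gulf",)]
print(levelwise_phrase_sets(docs, phrases, min_support=0.5, max_d=2)[2])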
III. PROPOSED SYSTEM
We applied our text mining method to interactive document browsing. Based on the Split-Merge algorithm [3], we developed a prototype system on a Unix workstation and ran experiments on a medium-sized text collection, the Reuters-21578 data set [8], which consists of 27 MB of news articles on international affairs and trade from February to August 1987. The sample set is a collection of unstructured texts of total size 15.2 MB obtained from Reuters-21578 by removing all but the category tags. The target (positive) set consists of 285 articles with category ship, and the background (negative) set consists of 18,758 articles with other categories such as trade, grain, metal, and so on. The average length of an article is 799 letters.

Table 1: Phrase mining with entropy optimization to capture the ship category. To see the effectiveness of entropy optimization, we mine only patterns with d = 1, i.e., single phrases. Left: short phrases of smallest rank (1-10). Right: long phrases of middle rank (261-270). The data set consists of 19,043 articles (15.2 MB) from Reuters newswires in 1987.
Left (ranks 1-10)              Right (ranks 261-270)
 1   <gulf>                    261  <mahi>
 2   <ships>                   262  <mclean>
 3   <shipping>                263  <Lloyds shipping intelligence>
 4   <Iranian>                 264  <Iranian oil platform>
 5   <iran>                    265  <herald of free>
 6   <port>                    266  <began on>
 7   <the gulf>                267  <bagged>
 8   <strike>                  268  <244->
 9   <vessels>                 269  <18->
10   <attack>                  270  <120 pct>
Table 2: Document browsing by optimized pattern discovery. (a) First, the user mines the original target set by optimized pattern discovery over phrases against the background set (d = 1, k = 0) and selects the term union of rank 50. (b) Next, the user draws the subset of articles relating to union and mines this set again with optimized phrases (d = 1, k = 0), obtaining topic terms on seamen. (c) Finally, the user discovers optimized patterns starting with seamen on the same target set (d = 4, k = 2 words). The underlined pattern indicates a strike by a union of seamen.
(a) Stage 1
Rank  Pattern
46    (loading)
47    (us flag)
48    (platforms)
49    (s flag)
50    (the strike)
51    (union)
52    (Kuwait)
53    (waters)
54    (missiles)
55    (navy)

(b) Stage 2
Rank  Pattern
1     (union)
2     (us)
3     (seamen)
4     (strike)
5     (pay)
6     (gulf)
7     (employers)
8     (labor)
9     (de)
10    (redundancies)

(c) Stage 3
Rank  Pattern
1     (seamen) (and) (the)
2     (seamen) (a) (pay)
3     (seamen) (were) (strike)
4     (seamen) (were) (still on strike)
5     (seamen) (were) (on strike)
6     (seamen) (to) (after)
7     (seamen) (said) (were)
8     (seamen) (pct) (pay)
9     (seamen) (on) (strike)
10    (seamen) (leaders) (of)
Table 1 shows the single phrases that capture the category ship relative to the other categories of the Reuters newswire. The patterns of smallest rank (1-10) contain the topic keywords of the major news stories for the period in 1987 (Table 1, left). Such keywords are hard to find by traditional frequent pattern discovery because of the presence of high-frequency words such as "the" and "are." The patterns of medium rank (261-270) are long phrases, such as "lloyds shipping intelligence" and "iranian oil platform", which serve as summaries (Table 1, right) and cannot be represented by any combination of non-contiguous keywords.
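As a rough illustration of the entropy measure used for this ranking, the following sketch scores a phrase by the class entropy of the split it induces between documents that contain it and documents that do not; the exact weighting used by the Split-Merge implementation may differ in detail, so this is an assumed, simplified form intended only to show why class-specific phrases like <gulf> outrank high-frequency stopwords.

from math import log2

def class_entropy(pos, neg):
    """Entropy of the positive/negative class distribution in a document group."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * log2(p)
    return h

def split_entropy(pos_with, neg_with, pos_total, neg_total):
    """Weighted entropy after splitting on 'document contains the phrase' (lower is better)."""
    pos_without, neg_without = pos_total - pos_with, neg_total - neg_with
    n = pos_total + neg_total
    return ((pos_with + neg_with) / n) * class_entropy(pos_with, neg_with) + \
           ((pos_without + neg_without) / n) * class_entropy(pos_without, neg_without)

# Hypothetical counts in the ship-category setting (285 positive, 18,758 negative articles):
# a phrase in 200 positive and 300 negative articles scores much lower (better) than a
# stopword-like phrase appearing in almost every article of both classes.
print(split_entropy(200, 300, 285, 18758))     # discriminative phrase: low weighted entropy
print(split_entropy(280, 18500, 285, 18758))   # stopword-like phrase: close to the class prior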
The second experiment, shown in Table 2, illustrates interactive text browsing. Suppose a user is looking for articles related to a labor problem but does not know specific keywords that identify such articles. (a) Starting with the original target and background sets for the ship category from the last section, the user first finds topic keywords in the original target set using optimized pattern mining with d = 1, i.e., phrase mining. Let union be a keyword found in this way. (b) Then, the user builds a new target set by drawing all articles containing the keyword union from the original target set; the previous target set is used as the new background set. As a result, we obtain a list of topic phrases related to union. (c) Finally, using the term seamen found in the last stage, we search for long patterns consisting of four phrases whose first phrase is seamen, with proximity k = 2 words. In the table, the pattern (seamen) (were) (still on strike) found by the algorithm indicates that there is a strike by a union of seamen.
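The three-stage procedure can be summarized by the sketch below. Here mine_optimized_phrases is a hypothetical helper standing in for the prototype's mining routine (it is not an API described in the paper), and the keywords union and seamen are hard-coded only to mirror the session above.

def browse(target_docs, background_docs, mine_optimized_phrases):
    """target_docs/background_docs: lists of token lists; returns stage-(c) patterns."""
    # Stage (a): single-phrase mining over the original target/background sets.
    ranked = mine_optimized_phrases(target_docs, background_docs, d=1, k=0)
    keyword = "union"          # in Table 2(a) the user picks this term from `ranked`

    # Stage (b): articles containing the keyword become the new target set,
    # and the previous target set becomes the new background set.
    new_target = [doc for doc in target_docs if keyword in doc]
    ranked = mine_optimized_phrases(new_target, target_docs, d=1, k=0)
    seed = "seamen"            # a topic term found in this ranking (Table 2(b))

    # Stage (c): longer patterns (d = 4, k = 2) whose first phrase is the seed term;
    # patterns are assumed to be tuples of phrase tuples.
    patterns = mine_optimized_phrases(new_target, target_docs, d=4, k=2)
    return [p for p in patterns if p and p[0] == (seed,)]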
C. Mining the Web
We also applied our prototype text mining system to Web mining. The problem is to discover a set of phrases that characterizes a community of Web pages. In the experiments, the mining system receives two keywords, called the target and the background, as inputs, and then discovers the phrases that characterize the base set of the target keyword relative to the base set of the background keyword, where the base set of a keyword w is a community of Web pages of around 5 MB that are found by keyword search with w and are connected to each other by Web links.
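The following sketch shows one way such a base set could be assembled: start from the pages returned by a keyword search and grow the set along Web links up to a size budget. The functions search_pages and linked_pages are hypothetical stand-ins, since the paper does not describe the crawler it used.

from collections import deque

def build_base_set(keyword, search_pages, linked_pages, size_limit=5 * 2**20):
    """Grow a base set of (url, text) pages for `keyword`: start from the keyword
    search results and expand along Web links until roughly size_limit bytes."""
    base, seen, total = [], set(), 0
    frontier = deque(search_pages(keyword))        # root set from the keyword search
    while frontier and total < size_limit:
        url, text = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        base.append((url, text))
        total += len(text)
        frontier.extend(linked_pages(url))         # pages connected by Web links
    return base

# The target base set then serves as the positive documents and the background
# base set as the negative documents for the phrase mining step.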
(a) Measure = Frequency, d = 1, POS = Honda
Rank  Freq   Phrase
1     17893  <the>
2     12377  <and>
3     11904  <to>
4     11291  <a>
5     8728   <of>
6     8239   <for>
7     7278   <in>
8     5752   <is>
9     5626   <i>
10    5477   <Honda>
11    4838   <on>
12    4584   <s>
13    4571   <with>
14    4545   <you>
15    3880   <it>
16    3650   <or>
17    3447   <this>
18    3279   <that>
19    315    <are>
20    3085   <99>
(b) Measure = Entropy, d = 1, POS = honda, NEG = Softbank
Rank  POS freq  NEG freq  Phrase
1     5477      4         <honda>
2     2125      0         <prelude>
3     5626      1099      <i>
4     1863      68        <car>
5     1337      24        <parts>
6     1472      92        <engine>
7     862       1         <rear>
8     744       0         <vtec>
9     3085      769       <99>
10    718       0         <exhaust>
11    754       16        <miles>
12    629       2         <motorcycle>
13    662       6         <bike>
14    734       20        <racing>
15    732       29        <si>
16    802       45        <black>
17    526       1         <tires>
18    586       13        <fuel>
19    556       9         <civic>
20    1307      219       <99>
(c) Measure = Entropy, d = 1, POS = Honda, NEG = Toyota
Rank  POS freq  NEG freq  Phrase
1     5477      302       <honda>
2     2125      9         <prelude>
3     5626      5         <vtec>
4     1863      16        <si>
5     1337      23        <bike>
6     1472      23        <motorcycle>
7     862       942       <99>
8     744       0         <prelude si>
9     3085      9         <the honda>
10    718       32        <civic>
11    754       5         <Honda prelude>
12    629       0         <valkyrie>
13    662       0         <99 time>
14    734       12        <Honda s>
15    732       266       <98>
16    802       0         <scooters>
17    526       17        <rims>
18    586       0         <date 10>
19    556       3         <92 96>
20    1307      29        <accord>
In the above tables, we show the best 20 phrases found in the target set honda while varying the background set. Table (a) shows the phrases found by traditional frequent pattern mining with an empty background [1]. The next two tables show the phrases found by mining based on entropy minimization, where the target is honda and the background varies. We see that (b) the mining system found more general phrases, e.g., car, parts, engine, in the target for the background softbank, while (c) it found more specific phrases, e.g., prelude si, valkyrie, accord, for the background keyword toyota, which is close to the target.

IV. CONCLUSION
In our proposed framework we achieve optimized pattern discovery, and we also tested it on Web pages, as shown in the experimental results. In Web mining, by finding patterns in both positive and negative documents, we obtained better results than traditional mining algorithms, which do not take negative documents into consideration. In text mining, the user searches for patterns among the extracted optimized terms.
REFERENCES
[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, In Proc. VLDB'94, 487-499, 1994.
[2] H. Arimura, A. Wataki, R. Fujino, S. Arikawa, A fast algorithm for discovering optimal string patterns in large text databases, In Proc. ALT'98, LNAI 1501, 247-261, 1998.
[3] H. Arimura, S. Arikawa, S. Shimozono, Efficient discovery of optimal word-association patterns in large text databases, New Generation Computing, 18, 49-60, 2000.
[4] R. Fujino, H. Arimura, S. Arikawa, Discovering unordered and ordered phrase association patterns for text mining, In Proc. PAKDD 2000, LNAI 1805, 281-293, 2000.
[5] T. Fukuda, Y. Morimoto, S. Morishita, T. Tokuyama, Data mining using two-dimensional optimized association rules, In Proc. SIGMOD'96, 13-23, 1996.
[6] T. Kasai, T. Itai, H. Arimura, S. Arikawa, Exploratory document browsing using optimized text data mining, In Proc. Data Mining Workshop, 24-30, 1999 (in Japanese).
[7] M. J. Kearns, R. E. Schapire, L. M. Sellie, Toward efficient agnostic learning, Machine Learning, 17(2-3), 115-141, 1994.
[8] D. Lewis, Reuters-21578 text categorization test collection, Distribution 1.0, AT&T Labs-Research, http://www.research.att.com, 1997.
[9] S. Morishita, On classification and regression, In Proc. DS'98, LNAI 1532, 49-59, 1998.
[10] W. W. Cohen, Y. Singer, Context-sensitive learning methods for text categorization, J. ACM, 17(2), 141-173, 1999.
[11] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery, Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence, 118, 69-114, 2000.
[12] J. T. L. Wang, G. W. Chirn, T. G. Marr, B. Shapiro, D. Shasha, K. Zhang, Combinatorial pattern discovery for scientific data: Some preliminary results, In Proc. SIGMOD'94, 115-125, 1994.
BIOGRAPHIES
Talluri Nalini completed her M.Sc. at PNC & KR College, Narsaraopet, Guntur (Dist.). She is currently pursuing her M.Tech (Software Engineering) at Avanthi Institute of Engineering and Technology. Her area of interest is data mining.
Mr. Satyanarayana Mummana is working as an Assistant Professor at Avanthi Institute of Engineering & Technology, Visakhapatnam, Andhra Pradesh. He received his Master's degree (MCA) from Gandhi Institute of Technology and Management (GITAM), Visakhapatnam, and his M.Tech (CSE) from Avanthi Institute of Engineering & Technology, Visakhapatnam, Andhra Pradesh. His research areas include Image Processing, Computer Networks, Data Mining, Distributed Systems, and Cloud Computing.