International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue5 - May 2013 A Proximity Phrase Association Pattern Mining of unstructured text Talluri Nalini*, Satyanarayana Mummana# Fianl M.Tech Student, Dept of Computer Science and Engineering, Avanthi Institute of Engineering & Technology,Narsipatnam, Andhra Pradesh Assistant Professor, Dept of Computer Science and Engineering, Avanthi Institute of Engineering & Technology,Narsipatnam, Andhra Pradesh Abstract:- Text and web mining are too complex task for large unstructured text. So we introduced a simple method to find patterns for unstructured text. It is a framework for optimized pattern discovery. We introduced a class of proximity phrase association patterns for finding patterns that optimizes the given statistical measure. By using this framework we can easily organize the large set of documents and web pages. I.INTRODUCTION Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. Highquality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods. In 2001, Dow Chemical’s merged with Union Carbide Corporation (UCC), requiring a massive integration of over 35,000 of UCC’s reports into Dow’s document management system. Dow chose Clear Forest, a leading developer of textdriven business solutions, to help integrate the document ISSN: 2231-5381 collection. Using technology they had developed, Clear Forest indexed the documents and identified chemical substances, products, companies, and people. This allowed Dow to add more than 80 years’ worth of UCC’s research to their information management system and approximately 100,000 new chemical substances to their registry. When the project was complete, it was estimated that Dow spent almost $3 million less than what they would have if they had used their own existing methods for indexing documents. Dow also reduced the time spent sorting documents by 50% and reduced data errors by 10-15%. Text mining is similar to data mining, except that data mining tools are designed to handle structured data from databases or XML files, but text mining can work with unstructured or semi-structured data sets such as emails, fulltext documents, HTML files, etc. As a result, text mining is a much better solution for companies, such as Dow, where large volumes of diverse types of information must be merged and managed. To date, however, most research and development efforts have centered on data mining efforts using structured data. The problem introduced by text mining is obvious: natural language was developed for humans to communicate with one another and to record information, and computers are a long way from comprehending natural language. Humans have the ability to distinguish and apply linguistic patterns to text and humans can easily overcome obstacles that computers cannot easily handle such as slang, spelling variations and contextual meaning. However, although our language capabilities allow us to comprehend unstructured data, we lack the computer’s ability to process text in large volumes or at high speeds. Herein lays the key to text mining: creating technology that combines a human’s linguistic capabilities with the speed and accuracy of a computer. Figure 1 depicts a generic process model for a text mining application. Starting with a collection of documents, a text mining tool would retrieve a particular document and preprocess it by checking format and character sets. Then it would go through a text analysis phase, sometimes repeating techniques until information is extracted. Three text analysis techniques are shown in the example, but many other http://www.ijettjournal.org Page 1792 International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue5 - May 2013 combinations of techniques could be used depending on the goals of the organization. The resulting information can be placed in a management information system, yielding an abundant amount of knowledge for the user of that system. Discrete Collection Retrieve and process Management info system Knowledge Analysize Text Information Extraction Clustering Seriazation II.RELATED WORK A.Fast and robust text mining algorithm for ordered patterns If the maximum number of phrases in a pattern is bounded by a constant d then the frequent pattern problem for both unordered and ordered proximity phrase association patterns is solvable by Enumerate-Scan algorithm [12], a modification of a naive generate-and-test algorithm, in O(nd+1) time and O(nd) scans although it is still too slow to apply real world problems. Adopting the framework of optimized pattern discovery, we have developed an efficient algorithm, called Split-Merge, that finds all the optimal patterns for the class of ordered k-proximity d-phrase association patterns for various measures including the classification error and information entropy [2, 3]. The algorithm quickly searches the hypothesis space using dynamic reconstruction of the content index, called a suffix array with combining several techniques from computational geometry and string algorithms. We showed that the Split-Merge algorithm runs in almost linear time in average, more precisely in O(kd+1N(logN)d+1) time using O(kd+1N) space for nearly random texts of size N [3]. We also show that the problem to find one of the best phrase patterns with arbitrarily many strings is MAX SNPhard [3]. Thus, we see that there is no efficient approximation algorithm with arbitrary small error for the problem when the number d of phrases is unbounded. By experiments on an English text collection of 7MB, the algorithm finds the best 600 patterns at the entropy measure in seconds for d = 2 and a few minutes for d = 4 and with k = 2 words using a few hundred mega-bytes of main memory on Sun Ultra 60 (Ultra SPARC II 300MHz, g++ on Solaris 2.6) [6]. B.Developing a scan-based algorithm for unordered patterns In Web mining, the unordered version of the phrase patterns are more suitable than the ordered version. Besides this, we also have to deal with huge text databases that cannot _t into main memory. For the purpose, we developed another pattern discovery algorithm, called Level wise Scan, for mining unordered phrase patterns from large disk-resident text data [4]. Based on the design principle of the Apriori ISSN: 2231-5381 algorithm of Agrawal [1], the Level wise-Scan algorithm quickly discovers most frequent unordered patterns with d phrases and proximity k in time O(n2 +N(log n)d) and space O(n log n + R) on nearly random texts using a random sample of size n, where N is the total size of input text and R is the output size [4]. To cope with the problem of the huge feature space of phrase patterns, the algorithm combines the techniques of random sampling, the generalized suffix tree, and the pattern matching automaton. By computer experiments on large text data, the Level wise-Scan algorithm quickly finds patterns for various ranges of parameters and scales up on a large disk-resident text database. III.PROPOSED SYSTEM We applied our text mining method to interactive document browsing. Based on the Split-Merge algorithm [3], we developed a prototype system on a unix workstation, and run experiments on a medium sized text collection, called Reuters-21578 data [8], which consists of news articles of 27MB on international affairs and trades from February to August in 1987. The sample set is a collection of unstructured texts of the total size 15.2MB obtained from Reuters-21578 by removing all but category tags. The target (positive) set consists of 285 articles with category ship and the background (negative) set consists of 18,758 articles with other categories such as trade, grain, metal, and so on. The average length of articles is 799 letters. Table 1: Phrase mining with entropy optimization to capture ship category. To see the effectiveness of entropy optimization, we mine only patterns with d = 1, i.e., single phrases. Left: Short phrases of smallest rank, 1 _ 10. Right: long phrases of middle rank, 261 _ 270. The data set consists of 19,043 articles of 15.2MB from Reuters Newswires in 1987. 1 2 3 <gulf> <ships> <shipping> 4 5 6 7 8 9 10 <Iranian> <iran> <port> <the gulf> <strike> <vessels> <attack> 261 <mahi> 262 <mclean> 263 <Lloyds shopping intelligence> 264 <Iranian oil platform> 265 <herald of free> 266 <began on> 267 <bagged> 268 <244-> 269 <18-> 270 <120 pct> Table 2: Document browsing by optimized pattern discovery. (a) First, a user tries to mine the original target set using optimized pattern discovery over phrases using the background set (d = 1; k = 0). The user selected a term union of rank 50. (b) Next, the user mines a subset of articles relating to union and mine this set again by optimized phrases (d = 1; k = 0). He obtained topic terms on seamen. (c) Finally, the user tries to discover optimized patterns starting with http://www.ijettjournal.org Page 1793 International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue5 - May 2013 seamen on the same target set (d = 4; k = 2 words). The underlined pattern indicates a strike by a union of seamenn. (a) stage 1 Bank 46 47 48 49 50 51 52 53 54 55 Pattern (loading) (us flag) (platforms) (s flag) (the strike) (union) (Kuwait) (waters) (missiles) (navy) Bank 1 2 3 4 5 6 7 8 9 10 (b)stage 2 Pattern (union) (us) (seamen) (strike) (pay) (gulf) (employers) (labor) (de) (redundancies) Bank 1 2 3 4 5 6 7 8 9 10 (b) stage 3 Pattern (seamen) (and) (the) (seamen) (a) (pay) (seamen) (were) (strike) (seamen)(were)(still on strike) (seamen)(were)(on strike) (seamen)(to)(after) (seamen)(said)(were) (seamen)(pct)(pay) (seamen)(on)(strike) (seamen)(laesders)(of) which capture the category ship relative to other categories of Reuters newswire. The patterns of smallest rank (1 _ 10) contains the topic keywords in the major news stories for the period in 1987 (Fig. 4:Left). such keywords are hard to find by traditional frequent pattern discovery because of the existence of the high frequency words such as \the" and \are." The patterns of medium rank (261 _ 270) are long phrases, such that \lloyds shipping intelligence" and \iranian oil ISSN: 2231-5381 platform", as a summary (Table 4:Right), which cannot be represented by any combination of non-contiguous keywords. The second experiment. Table 2 shows an experiment on interactive text browsing. First, we suppose that a user is looking for articles related to the labor problem, but he does not know any specific keywords enough to identify such articles. (a) Starting with the original target and the background sets related to ship category in the last section, the user first finds topic keywords in the original target set using optimized pattern mining with d = 1, i.e., phrase mining. Let union be a keyword found. (b) Then, he builds a new target set by drawing all articles with keyword union from the original target set. The last target set is used as the new background set. As a result, we obtained a list of topic phrases concerned to union. (c) Using a term seamen found in the last stage, we try to find long patterns consisting of four phrases such that the first phrase is seamen using proximity k = 2 words. In the table, a pattern (seamen)( were) ( still on strike) found by the algorithm indicates that there is a strike by a union of seamen. C. Mining the Web We also apply our prototype text mining system to the Web mining. The problem is to discover a set of phrases that characterize a community of web pages. In the experiments, the mining system receives two keywords, called the target and the background as inputs, and then discovers the phrases that characterize the base set of the target keyword relative to the base set of the background keyword, where a base set of a keyword w is a community of Web pages of around 5 MB that are found by the keyword search with w and connected each other by Web links. (a) Measure =Frequency ,d=1 POS=Honda 1 17893 <the> 2 12377 <and> 3 11904 <to> 4 11291 <a> 5 8728 <of> 6 8239 <for> 7 7278 <in> 8 5752 <is> 9 5626 <i> 10 5477 <Honda> 11 4838 <on> 12 4584 <s> 13 4571 <with> 14 4545 <you> 15 3880 <it> 16 3650 <or> 17 3447 <this> 18 3279 <that> 19 315 <are> 20 3085 <99> (b) Measure=Entropy,d=1 POS=honda,NEG=Softbank 1 5477 4 <honda http://www.ijettjournal.org Page 1794 International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue5 - May 2013 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 2125 5626 1863 1337 1472 862 744 3085 718 754 629 662 734 732 802 526 586 556 1307 0 1099 68 24 92 1 0 769 0 16 2 6 20 29 45 1 13 9 219 <prelude> <i> <car> <parts> <engine> <rear> <vtec> <99> <exhaust> <miles> <motorcycle> <bike> <racing> <si> <black> <tires> <fuel> <civic> <99> (c) Measure=Entropy, d=1 POS=Honda,NEG=Toyota 1 5477 302 <honda 2 2125 9 <prelude> 3 5626 5 <vtec> 4 1863 16 <si> 5 1337 23 <bike> 6 1472 23 <motorcycle> 7 862 942 <99> 8 744 0 <prelude si> 9 3085 9 <the honda> 10 718 32 <civic> 11 754 5 <Honda prelude> 12 629 0 <valkyrie> 13 662 0 <99 time> 14 734 12 <Honda s> 15 732 266 <98> 16 802 0 <scooters> 17 526 17 <rims> 18 586 0 <date 10> 19 556 3 <92 96> 20 1307 29 <accord> IV.CONCLUSION In our proposed framework, we achieved optimized pattern discovery and also we tested on web pages shown in the experimental results. In web mining by finding pattern in positive and negative documents we found best results than traditional mining algorithms. In the existing pattern mining algorithms there is no consideration of negative documents. In text mining user searches the patterns in extracted optimized terms. REFERENCES [1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, In Proc. VLDB'94, 487{499, 1994. [2] Arimura, H., Wataki, A., Fujino, R., Arikawa, S., A fast algorithm for discovering optimal string patterns in large text databases, In Proc. ALT'98, LNAI 1501, 247{261, 1998. [3] H. Arimura, S. Arikawa, S. Shimozono, Efficient discovery of optimal word-association patterns in large text databases New Generation Computing, 18, 49{60, 2000. [4] R. Fujino, H. Arimura, S. Arikawa, Discovering unordered and ordered phrase association patterns for text mining. Proc. PAKDD2000, LNAI 1805, 281{293, 2000. [5] T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama, Data mining using two-dimensional optimized association rules, In Proc. SIGMOD'96, 13{23, 1996. [6] T. Kasai, T. Itai, H. Arimura, S. Arikawa, Exploratory document browsing using optimized text data mining, In Proc. Data Mining Workshop, 24{30, 1999 (In Japanese). [7] M. J. Kearns, R. E. Shapire, L. M. Sellie, Toward efficient agnostic learning. Machine Learning, 17(2{3), 115{141, 1994. [8] D. Lewis, Reuters-21578 text categorization test collection, Distribution 1.0, AT&T Labs-Research, http://www.research/.att.com, 1997. [9] S. Morishita, On classi_cation and regression, Proc. DS'98, LNAI 1532, 49{59, 1998. [10] W. W. Cohen, Y. Singer, Context-sensitive learning methods for text categorization, J. ACM, 17(2), 141{173, 1999. [11] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery, Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence, 118, 69{114, 2000. [12] J. T. L. Wang, G. W. Chirn, T. G. Marr, B. Shapiro, D. Shasha and K. Zhang, Combinatorial pattern discovery for scientific data: Some preliminary results, In Proc. SIGMOD'94, 115{125, 1994. BIOGRAPHIES In above Tables, we show the best 20 phrases found in the target set honda varying the background set. In the table (a), we show the phrases found by traditional frequent pattern mining with the empty background [1]. In the next two tables, we show the phrases found by mining based on entropy minimization, where the target is honda and the background varies. We see that (b) the mining system found more general phrases, e.g. cars, parts, engine, in the target for the background soft bank, while (c) it found more specific phrases, e.g. prelude si,valkyrie, accord, for the background keyword toyota which is close to the target. ISSN: 2231-5381 http://www.ijettjournal.org Talluri Nalini completed her MSC in PNC & KR College, Narsaraopet Guntur (Dist). Later she is studying M.Tech(Software Engineering) in Avanthi Institute of Engineering and technology. Her interested areas are data mining. Mr.Satyanarayana Mummana is working as an Asst. Professor in Avanthi Institute of Engineering & Technology, Visakhapatnam, Andhra Pradesh. He has received his Masters degree (MCA) from Gandhi Institute of Technology and Management (GITAM), Visakhapatnam and M.Tech (CSE) from Avanthi Institute of Engineering & Technology, Visakhapatnam. Andhra Pradesh. His research areas include Image Processing, Computer Networks, Data Mining, Distributed Systems, Cloud Computing. Page 1795