International Journal of Engineering Trends and Technology (IJETT) – Volume 13 Number 1 – Jul 2014, ISSN: 2231-5381

Discovering Patterns in Text Mining: A Survey

Mr. A. P. Katade 1, Prof. L. J. Sankpal 2
1 Research Scholar, Dept. of Computer Engineering, Sinhgad Academy of Engineering, Kondhwa (Bk), University of Pune, India
2 Associate Professor, Dept. of Computer Engineering, Sinhgad Academy of Engineering, Kondhwa (Bk), University of Pune, India

Abstract:- Text mining is a branch of data mining that deals with extracting useful information from large collections of text documents. Many approaches search text documents based on the terms provided to them, but such term-based approaches suffer from polysemy and synonymy; we therefore use a pattern-based approach, and results have shown that pattern-based approaches outperform term-based ones. Different techniques have been proposed for discovering patterns in text. Searching often returns meaningless patterns, and some unidentified patterns are retrieved as well; to prune these patterns we use the Pattern Taxonomy Model (PTM), which illustrates the relationships between patterns in documents, improves the performance of discovered patterns and yields more semantic information. This paper presents pattern taxonomy techniques for pruning meaningless patterns.

Keywords: Closed sequential pattern, Information filtering, Pattern mining, Pattern evolution, Text mining.

1. INTRODUCTION
Text mining is an approach which helps end users get useful information from the huge amount of data available on the Web in the form of digital text documents. It is a difficult task to get exactly the text information the user wants using different text mining models. In [7], web mining is described as a system which includes two phases: filtering and sophisticated data processing. Filtering selects the important data to be searched, and its purpose is to speed up the process of extraction.
The sophisticated data processing phase minimizes the difficulties of inequality by adopting various mining techniques: association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining and closed pattern mining. Information retrieval (IR) has the ability to access proper documents simultaneously from among many possibly relevant as well as irrelevant documents. Nowadays, with the help of different technologies, users pay attention to knowledge discovery and data mining because of their ability to discover useful information from large amounts of data, which helps in many sectors such as market analysis, knowledge extraction and business management. They also yield previously unidentified information that is useful, obtained through the retrieval of information and the extraction of data from datasets; in this sense, data mining helps in discovering knowledge from garbage data, and many data mining techniques have already been used to mine data for end users. Most of the techniques use a term-based approach while others use a pattern-based method, even though phrases carry less ambiguity in meaning than single words. The main reasons for the discouraging performance of phrase-based methods are: 1) phrases have inferior statistical properties to single words; 2) they appear less frequently; and 3) there are large numbers of redundant and noisy patterns among them [1]. In [7] it is explained that pattern-based representations were not found to be important, as only an insignificant performance improvement was observed over eight different representations based on terms, patterns, synonyms and hypernyms. The experimental results in [7] presented a method for phrase searching, the pattern taxonomy model (PTM), as a feasible way of applying data mining to text mining in order to obtain effective patterns. In the pattern taxonomy model, the discovered terms should carry more semantics to be effective; it is therefore necessary to deploy the discovered patterns, and to do so a pattern deploying algorithm is used.
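As a concrete illustration of one of the mining techniques listed above, frequent itemset mining can be sketched in a few lines. The following toy example is a naive enumeration (not the optimized Apriori or FP-growth algorithms cited later); the baskets and the minimum support value are hypothetical:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_sup, max_size=2):
    """Naive frequent-itemset mining: count every candidate itemset up to
    max_size items and keep those whose relative support reaches min_sup."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for items in combinations(sorted(t), k):
                counts[items] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_sup}

baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
# Itemsets occurring in at least 2 of the 3 baskets survive.
print(frequent_itemsets(baskets, min_sup=2 / 3))
```

Real systems avoid enumerating all candidates by pruning supersets of infrequent itemsets, which is the key idea behind Apriori [6].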
This algorithm is robust and important to implement, and hence the pattern deploying algorithm [7] is proposed to use the extracted patterns effectively. In [9], for example, the keyword LIB may have a higher term weight than JDK in some data collections; but keyword JDK is more specific than keyword LIB with respect to the Java programming language, while keyword LIB is more general than keyword JDK because LIB is also commonly used in C and C++. Therefore, it is inadequate to evaluate keyword weights based only on their distribution in documents, even though this method of evaluation has been commonly used in developing IR models. To resolve this inconsistency, an approach to discovering text patterns is proposed which first builds a pattern taxonomy model; here a probability distribution function is used to distribute terms using Bayesian networks. The pattern taxonomy model also has apriori and post-apriori components: it first calculates the discovered specificities of phrases, and it then evaluates keyword weights (term supports) according to the distribution of keywords in the discovered phrases rather than their distribution in the whole data, in order to solve the misinterpretation problem. It also considers the influence of phrases from the negative training examples to find ambivalent (noisy) phrases, and tries to minimize their effect on the low-frequency problem. The process of improving ambivalent phrases is referred to as phrase improvement. Information retrieval techniques have confirmed that keywords are important in text documents, and weight them with, e.g., the tf*idf (term frequency times inverse document frequency) scheme.
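The tf*idf scheme mentioned above can be illustrated with a small sketch. The toy corpus below mirrors the JDK/LIB example from the text; the documents and tokenization are hypothetical:

```python
import math

def tf_idf(term, doc, docs):
    """Toy tf*idf: term frequency in one document times the
    inverse document frequency over the whole collection."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

docs = [
    ["java", "jdk", "compiler"],
    ["java", "lib", "runtime"],
    ["c", "lib", "linker"],
]
# "jdk" appears in fewer documents than "lib", so it earns a higher idf,
# matching the paper's point that JDK is the more specific keyword.
print(tf_idf("jdk", docs[0], docs) > tf_idf("lib", docs[1], docs))  # True
```

Note that tf*idf alone still ranks terms by their distribution over documents, which is exactly the limitation the pattern-based weighting described above tries to overcome.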
However, many terms with large weights are general terms, since they are used frequently in both positive and negative information. In [11], the presented approach can improve the accuracy of evaluating term weights because discovered phrases are more exact than whole documents. This paper presents one module of that work, the pattern taxonomy model. The rest of this paper is structured as follows: Section 2 describes the related work; Section 3 provides terminology such as absolute support, closed sequential pattern, the pattern taxonomy model, the pattern deploying method and inner pattern evolution; Section 4 demonstrates the proposed system; and the final section gives the conclusion.

2. RELATED WORK
In text mining, pattern mining techniques can be used to find different text patterns, such as sequential patterns, frequent itemsets, co-occurring terms and multiple grams, and to build a representation with new features. The important issue is how to use the different discovered patterns effectively [9]. In [7], a model called pattern taxonomy extraction is proposed for discovering frequently occurring sequential patterns. The pattern taxonomy model arranges terms in an 'is-a' relationship; using this method a tree is formed and sequential patterns are extracted, which can be achieved together with the help of a filtering system. This system is related to web mining and has two phases: filtering and sophisticated data processing. The process forms a tree-like structure which shows the association between the patterns and phrases extracted from the dataset. In the next phase, meaningless patterns are removed. In [5] an approach is proposed which is based on positive and negative documents.
A positive document is a document which contains the pattern we want to extract; a negative document is a document which does not contain any such pattern or phrase. This technique uses an algorithm for calculating the minimum support. It also proposes another technique called PDR; in the pattern deploying method, the support of each term is calculated, and semantic patterns are obtained. In [8], the problem of finding which items are bought together in a transaction over a complete dataset is presented. The problem of finding sequential patterns then relates to finding inter-transaction phrases. Discovering phrases in sequences is an active research area, and the focus of this work is searching a given sequence to predict a reasonable continuation, i.e., rules that predict which terms will come next in a given sequence. The term weighting scheme is an improvement for achieving effective search performance. It depends on two factors: 1) terms retrieved from the document should be relevant to the user; and 2) additional terms must be discarded. These two ways of assessing related data and discarding unrelated data are known as recall and precision. In [9], work related to term frequency and inverse document frequency (tf*idf) is presented; furthermore, a weighting scheme extending tf*idf is proposed, as noted above [11], which improves effectiveness. There are also problems in selecting essential features from a dataset so as to improve performance and avoid overfitting; to improve performance, term weights are used.
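The recall and precision measures described above can be computed as follows; this is a minimal sketch with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

# 4 documents retrieved, of which 2 (docs 2 and 4) are actually relevant;
# one relevant document (doc 5) was missed.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # 0.5 and about 0.667
```

The two factors from the text map directly onto these measures: discarding additional (irrelevant) terms raises precision, while retrieving all relevant material raises recall.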
Each term has a global weight which indicates its importance and can be applied to all keywords. In [6], term-based mining methods provided views for text representations. Pattern mining has been widely studied in the data mining community for many years [9]. Table 1 shows the comparison of the different techniques.

Table 1: Comparative study of different techniques

| Paper Title | Concept | Advantages | Limitations |
|---|---|---|---|
| An information filtering model on the web and its application in job agent [13] | Term-based approach | Better performance | Polysemy, synonymy |
| Automatic pattern taxonomy for web mining [8] | Pattern-based approach | Resolves polysemy and synonymy | Meaningless, ambiguous patterns with low capability |
| Deploying approaches for pattern refinement in text mining [9] | Pattern deploying | Effective patterns are discovered | Noisy data, misinterpretation, low frequency |
| A novel approach in text mining for discovering useful patterns [5] | Pattern-based | Addresses misinterpretation, low frequency and noisy data | — |

3. TERMINOLOGIES
3.1) Absolute support – In a particular document d, if a sequence q occurs in a paragraph of d (q ∈ d), then q is called a sequential pattern. The absolute support, denoted suppa(q), is the number of occurrences of q in d. The relative support is the fraction of the number of occurrences of q in document d, denoted suppr(q) = suppa(q) / |d| [7].
3.2) Frequent sequential pattern – A pattern is said to be a frequent sequential pattern if its relative support suppr(q) is greater than or equal to the minimum support value min_sup.
3.3) Closed sequential pattern – A frequent sequential pattern P1 is said to be closed if and only if there is no super-sequence P2 of P1 (P1 ⊂ P2) with the same absolute support [10].
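These definitions can be illustrated with a short sketch. The paragraphs and patterns below are hypothetical, and pattern occurrence is taken to mean in-order (not necessarily contiguous) subsequence matching:

```python
def is_subsequence(pattern, paragraph):
    """True if `pattern` occurs as an in-order subsequence of `paragraph`."""
    it = iter(paragraph)
    return all(term in it for term in pattern)

def absolute_support(pattern, paragraphs):
    """suppa(q): number of paragraphs of the document in which q occurs."""
    return sum(1 for p in paragraphs if is_subsequence(pattern, p))

def relative_support(pattern, paragraphs):
    """suppr(q) = suppa(q) / |d|, where |d| is the number of paragraphs."""
    return absolute_support(pattern, paragraphs) / len(paragraphs)

def is_closed(pattern, all_patterns, paragraphs):
    """A frequent pattern is closed if no proper super-sequence
    among the discovered patterns has the same absolute support."""
    sup = absolute_support(pattern, paragraphs)
    return not any(
        q != pattern and is_subsequence(pattern, q)
        and absolute_support(q, paragraphs) == sup
        for q in all_patterns
    )

doc = [["p", "q", "r"], ["p", "q"], ["q", "r"]]  # one document, three paragraphs
print(relative_support(("p", "q"), doc))          # 0.666... (= 2/3)
patterns = [("p", "q"), ("p", "q", "r")]
print(is_closed(("p", "q"), patterns, doc))       # True: supports differ (2 vs 1)
```

A pattern is then frequent when `relative_support(...) >= min_sup`, as in definition 3.2.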
3.4) Pattern Taxonomy Model – PTM [7] is a pattern-based model for representing text data. It is a pruned tree-like structure whose root node is the most frequently occurring pattern in the text documents and whose other nodes are its sub-patterns. Two basic issues regarding the performance of phrases are their low frequency of occurrence and false (noisy) terms: a phrase with a high support value is a general term which occurs frequently, and the more the minimum support is lowered, the more unidentified phrases are found. To avoid this, and to satisfy what the user actually wants, we use the pattern taxonomy model. A pattern taxonomy is a tree-like structure that shows the relations between the patterns discovered from text data; it is an 'is-a' relationship between the most relevant patterns and their subsequences. For example, pattern <P, Q> is a subsequence of pattern <P, Q, R>, and pattern <R> is a subsequence of pattern <P, R>. The node <P, Q, R> at the bottom level of the tree thus represents the most specific frequent pattern (the longest sequential pattern). Once the tree is structured, we are able to find links between different phrases and prune the meaningless patterns: for example, if the pattern <P, Q> occurs in every paragraph in which <P, Q, R> occurs, it carries no extra information and is considered a meaningless pattern, since many such subsequences can occur inside a super-sequence pattern. Frequently occurring subsequences of <P, Q, R> such as <P, Q>, <Q, R>, <P, R>, <P>, <Q> and <R> can therefore be regarded as less frequent or not useful patterns [7]. Fig. 1 shows the pattern taxonomy model. The sequential patterns that remain after pruning are composed using the pattern deploying method.

Fig. 1: Pattern Taxonomy Model

3.5) Pattern deploying method – In [7], the discovered patterns are used through a term weighting scheme: the patterns are deployed in sequential form, and since the relation among these patterns is an is-a relation, there are many overlaps among them; to represent this, the patterns are deployed from each document d onto the term space T. Let p1 and p2 be sets of term–support pairs; then p1 ⊕ p2 is called the composition of p1 and p2.

Algorithm PDM(D, min_sup)
Input: a list of positive documents, D; minimum support, min_sup.
Output: a set of d-patterns, one per document.
1: for each document d in D do begin
2:   PL = the set of paragraphs of d
3:   SP = SPMining(PL, min_sup)
4:   v = ∅
5:   for each pattern P in SP do begin
6:     v = v ⊕ P
7:   end for
8:   normalize v
9: end for

In order to effectively deploy the patterns from the different taxonomies of the different positive documents, the d-patterns are normalized using the following assignment: support(t) ← support(t) / Σt′∈T support(t′). The phrase enhancement method deploys the searched phrases, which are used to represent the concepts of documents. PDM (Pattern Deploying Method) takes the sequential phrases produced by the mining technique and deploys them using the pattern deploying algorithm.

4. PROPOSED SYSTEM
A term with a higher tf*idf value may not be useful in any deployed pattern, and because of the problems of polysemy and synonymy, an inner pattern evolution (IPE) technique is proposed to remove these drawbacks. An algorithm is applied for stemming and stop-word removal. In the IPE method, term supports are recalculated by reshuffling the supports of the keywords deployed from the normalized document, taking the negative documents in the dataset into account. This method is useful for minimizing the side effects of misinterpretation in phrases: here the term supports are changed within the same phrase. A threshold value is used to decide the significance of incoming documents [5].
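The deploying and normalization steps of Section 3.5 can be sketched as follows. How PTM distributes a pattern's support over its terms is an assumption here (each pattern's support is split evenly over its terms), but the normalization matches the assignment given above, support(t) ← support(t) / Σ support(t′):

```python
from collections import defaultdict

def deploy(patterns):
    """Compose discovered patterns into a single d-pattern: distribute each
    pattern's support over its terms (assumed: split evenly), then normalize
    so that the term supports sum to 1."""
    support = defaultdict(float)
    for pattern, sup in patterns:
        for term in pattern:
            support[term] += sup / len(pattern)
    total = sum(support.values())
    return {t: s / total for t, s in support.items()}

# Hypothetical output of SPMining on one positive document.
mined = [(("p", "q"), 0.75), (("q", "r"), 0.5)]
d_pattern = deploy(mined)
print(d_pattern)  # "q" gets the largest share: it appears in both patterns
```

The resulting d-pattern is the per-document term-weight vector on which inner pattern evolution later reshuffles supports.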
Using the d-patterns, the value of the threshold can be defined naturally: the threshold δ of the deployed patterns is calculated on the basis of the minimum term support of the terms t, and such a threshold is usually used to classify documents into relevant and irrelevant categories. A document is said to be negative (denoted nd) if the system identifies it incorrectly as positive. The weight factor of such negative documents, used for determining the noise in a document, is compared with a fixed threshold; this threshold will be made dynamic, so as to keep the error within the margin and minimize the noise. Parametric changes will be made to the experimental coefficient, the separator threshold and the variance component of the probability model. A probability distribution function and a matrix are used for the calculation of support in the pattern deploying method. To reduce the noise of the terms, we keep track of which d-patterns provide such noise; we call these patterns the miscreants of nd (the negative document). The probabilistic model will be constructed using a Bayesian network, which will be further pruned using suitable techniques. The constructed model will be based mainly on the overall structure of d-pattern mining, and will be built upon the apriori structure provided by the SPMining algorithm. Finally, the noise induction model, which is used in linear form in the paper, will be changed so as to make the system behave in a more realistic way: the induced noise will be modelled using the Poisson distribution, which provides a realistic model of the search space.

Fig. 2: System Architecture
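The threshold-based relevance decision described above can be sketched as follows; the d-pattern weights and the value of δ are hypothetical:

```python
def weight(doc_terms, d_pattern):
    """Sum of deployed term supports for the terms present in the document."""
    return sum(s for t, s in d_pattern.items() if t in doc_terms)

def classify(doc_terms, d_pattern, threshold):
    """A document is relevant iff its weight reaches the threshold delta."""
    return weight(doc_terms, d_pattern) >= threshold

d_pattern = {"p": 0.3, "q": 0.5, "r": 0.2}
delta = 0.5  # e.g. the minimum weight observed over positive training documents
print(classify({"q", "r"}, d_pattern, delta))  # True  (0.7 >= 0.5)
print(classify({"p"}, d_pattern, delta))       # False (0.3 <  0.5)
```

A document classified as relevant that is actually irrelevant is exactly the false-positive case (nd) on which inner pattern evolution operates.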
The inner pattern evolution takes the concept of the d-pattern, in which the rearrangement of the support metric takes place in a certain pre-defined manner referred to as shuffling. Fig. 2 shows the system architecture of the proposed system. The system has a pre-processing module which processes the raw data; it considers the impact of patterns from negative data to find ambivalent patterns, to which the d-pattern mining algorithm is applied. For negative documents retrieved after d-mining as false positives, inner pattern evolution is applied to the deployed data so as to avoid noisy patterns in the result. Out of these three methods, we propose an approach based on the pattern taxonomy model, where data is arranged using the 'is-a' relationship.

Fig. 3: Pattern Taxonomy Model

Using the above pattern taxonomy, we have calculated the support metric values as follows; the values are calculated on the basis of the text input terms provided, and the positive documents are used for building the PTM.

SuppMet[0]: 0.0625, SuppMet[1]: 0.75, SuppMet[2]: 0.75, SuppMet[3]: 0.75, SuppMet[4]: 0.75, SuppMet[5]: 0.75, SuppMet[6]: 0.75, SuppMet[7]: 0.125, SuppMet[8]: 0.75, SuppMet[9]: 0.75, SuppMet[10]: 0.125, SuppMet[11]: 0.75, SuppMet[12]: 0.75, SuppMet[13]: 0.75, SuppMet[14]: 0.75, SuppMet[15]: 0.0625

CONCLUSION
Data mining supports association rule mining, frequent itemset mining, sequential pattern mining and closed pattern mining. However, using these discovered patterns in text mining is challenging and often ineffective. The cause is that some long, specific patterns have low support values (the low-frequency problem), while not all short patterns are relevant; hence, ineffective performance in pattern discovery occurs. This paper gives an approach for using discovered patterns to minimize the low-frequency and misinterpretation problems. The paper shows the results in the form of term support metrics, presents the pattern taxonomy model, and gives a comparative study of the pattern taxonomy model against other techniques.

REFERENCES
[1] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[2] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, 2000.
[3] J. Han and K.C.-C. Chang, "Data Mining for Web Intelligence," Computer, vol. 35, no. 11, pp. 64-70, Nov. 2002.
[4] K. Aas and L. Eikvil, "Text Categorisation: A Survey," Technical Report NR 941, Norwegian Computing Center, 1999.
[5] N. Zhong, Y. Li, and S.-T. Wu, "Effective Pattern Discovery for Text Mining," IEEE Trans. Knowledge and Data Engineering, vol. 24, no. 1, Jan. 2012.
[6] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 478-499, 1994.
[7] S. Scott and S. Matwin, "Feature Engineering for Text Classification," Proc. ICML '99, pp. 379-388, 1999.
[8] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, "Automatic Pattern-Taxonomy Extraction for Web Mining," Proc. WI '04, pp. 242-248, 2004.
[9] S.-T. Wu, Y. Li, and Y. Xu, "Deploying Approaches for Pattern Refinement in Text Mining," Proc. IEEE Sixth Int'l Conf. Data Mining (ICDM '06), pp. 1157-1161, 2006.
[10] X. Li and B. Liu, "Learning to Classify Texts Using Positive and Unlabeled Data," Proc. Int'l Joint Conf. Artificial Intelligence (IJCAI '03), pp. 587-594, 2003.
[11] X. Yan, J. Han, and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets," Proc. SIAM Int'l Conf. Data Mining (SDM '03), pp. 166-177, 2003.
[12] Y. Huang and S. Lin, "Mining Sequential Patterns Using Graph Search Techniques," Proc. 27th Ann. Int'l Computer Software and Applications Conf., pp. 4-9, 2003.
[13] Y. Li, C. Zhang, and J.R. Swan, "An Information Filtering Model on the Web and Its Application in JobAgent," Knowledge-Based Systems, vol. 13, no. 5, pp. 285-296, 2000.