International Journal of Engineering Trends and Technology (IJETT) – Volume 13 Number 1 – Jul 2014
Discovering Patterns in Text Mining: A Survey
Mr. A. P. Katade 1, Prof. L. J. Sankpal 2
1 Research Scholar, Dept. of Computer Engineering, Sinhgad Academy of Engineering, Kondhwa (Bk), University of Pune, India
2 Associate Professor, Dept. of Computer Engineering, Sinhgad Academy of Engineering, Kondhwa (Bk), University of Pune, India
Abstract: There are many approaches that search text documents based on the terms provided to them. Text mining is a branch of data mining that deals with extracting useful information from large collections of text documents, but term-based approaches suffer from polysemy and synonymy; we therefore use a pattern-based approach, and results have shown that pattern-based approaches are better than term-based approaches. Different techniques have been proposed for discovering patterns in text. Searching often returns meaningless patterns, and some unidentified patterns are also returned; to prune such patterns we use PTM, the pattern taxonomy model, which illustrates the relationships between patterns in documents, improves the performance of the discovered patterns and captures more semantic information. This paper presents pattern taxonomy techniques for pruning meaningless patterns.
Keywords: Closed sequential pattern, Information filtering, Pattern mining, Pattern evolution, Text mining.
1. INTRODUCTION
Text mining is an approach that helps end users obtain useful information from the huge amount of data available in the form of digital text documents on the Web.
It is a difficult task to obtain the actual text information the user wants with the various text mining models. In [7], web mining is described as a system that includes two phases: filtering and sophisticated data processing. Filtering selects the important data that the system is searching for, and its purpose is to speed up extraction. The sophisticated data processing phase minimizes the difficulties of inequality by adopting various mining techniques: association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining and closed pattern mining. Information retrieval (IR) has the ability to access the proper documents from among many possibly relevant as well as irrelevant documents. Nowadays, with the help of different technologies, users pay attention to knowledge discovery and data mining because of their ability to discover useful knowledge in large amounts of data, which helps in many sectors such as market analysis, knowledge extraction and business management. It also yields previously unidentified information that is helpful to use, obtained through the retrieval of information and the extraction of data from datasets; data mining thus helps in discovering knowledge from raw data, and many data mining techniques have already been used to mine data for end users. Most of these techniques use a term-based approach, while others use a pattern-based method, even though phrases are less ambiguous in meaning than single words. The main reasons for the discouraging performance of phrases are that 1) phrases have inferior statistical properties compared with single words, 2) they appear less frequently, and 3) there are large numbers of redundant and noisy patterns [1]. In [7] it is reported that pattern-based representations are not always important, since only an insignificant performance improvement was found over eight different representations based on terms, patterns, synonyms and hypernyms. The experimental results in [7] show that the pattern taxonomy model (PTM), a method for phrase searching, is a feasible way of applying data mining to text mining in order to obtain effective patterns. In the pattern taxonomy model, the extracted terms should be more semantic for the model to be effective; it is therefore necessary to deploy the discovered patterns, and for this a pattern deploying algorithm is available. This algorithm is robust and important to implement, so the pattern deploying algorithm [7] is adopted to make effective use of the extracted patterns.
In [9], for example, the keyword LIB may have a larger term weight than JDK in some data collections, but we consider the keyword JDK more specific than the keyword LIB in relation to the Java programming language, and the keyword LIB more general than JDK, because LIB is also widely used in C and C++. It is therefore inadequate to evaluate keyword weights based only on their distributions in the documents, even though this method of evaluation has been commonly used in developing IR models. To resolve this inconsistency, we propose an approach for discovering text patterns which first builds a pattern taxonomy model; here a probability distribution function, based on Bayesian networks, is used for distributing the terms. In the pattern taxonomy model we also apply apriori and post-apriori components, which first calculate the discovered specificities of phrases and then evaluate keyword weights (term supports) according to the distribution of keywords in the discovered phrases rather than their distribution in the whole data, in order to solve the misinterpretation problem. The approach also considers the impact of phrases from the negative training examples in order to find ambivalent (noisy) phrases and to reduce their effect on the low-frequency problem. This process of improving ambivalent phrases can be referred to as phrase improvement. Information retrieval techniques have confirmed that keywords are important in text documents; however, many terms with large weights under the tf*idf (term frequency and inverse document frequency) scheme are general terms, since they are commonly used in both positive and negative information.
In [11], the presented approach can improve the accuracy of evaluating term weights, because discovered phrases are more exact than complete documents. This paper focuses on one module of that work, the pattern taxonomy model. The rest of this paper is structured as follows: Section 2 describes the related work; Section 3 provides the terminologies and surveys the techniques, namely absolute support, closed sequential patterns, the pattern taxonomy model, the pattern deploying method and inner pattern evolution; Section 4 presents the proposed system; and the final section gives the conclusion.
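As a concrete illustration of the tf*idf weighting mentioned above, the following minimal Python sketch scores terms over a toy corpus echoing the LIB/JDK discussion; the corpus and the unsmoothed idf formula are illustrative assumptions, not the exact weighting used in the cited papers.

import math
from collections import Counter

# Toy corpus: each document is a list of terms (illustrative only).
docs = [
    ["java", "jdk", "lib"],
    ["c", "lib", "compiler"],
    ["cpp", "lib", "header"],
]

def tf_idf(term, doc, corpus):
    tf = Counter(doc)[term]                       # term frequency in this document
    df = sum(1 for d in corpus if term in d)      # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(tf_idf("jdk", docs[0], docs))   # specific term: appears in one document, high idf
print(tf_idf("lib", docs[0], docs))   # general term: appears everywhere, idf = 0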
2. RELATED WORK
In text mining, pattern mining techniques can be used to find different text patterns, such as sequential patterns, frequent itemsets, co-occurring terms and multiple grams, and to build a representation with new features. The important issue is how to use the different discovered patterns effectively [9]. In [7], a model called pattern taxonomy extraction is presented for discovering frequently repeated sequential patterns. The pattern taxonomy model arranges terms in an is-a relationship: a tree is formed and sequential patterns are extracted from it, and this can be achieved with the help of a filtering system. Such a system is related to web mining and has two phases, filtering and sophisticated data processing. The process forms a tree-like structure that shows the association between the patterns and phrases extracted from the dataset; in the next phase, meaningless patterns are removed. In [5], an approach based on positive and negative documents is proposed. A positive document is a document that contains the pattern we want to extract; a negative document is one that does not contain any such pattern or phrase. The technique uses an algorithm for calculating the minimum support, and it also proposes another technique called PDR. In the pattern deploying method the support of each term is calculated, and semantic patterns are obtained.
In [8], the problem of finding which items are bought together in a transaction over a complete data set is presented; the problem of finding sequential patterns thus relates to finding inter-transaction phrases. Discovering phrases in sequences is an active research area, and the focus of that work is searching a given sequence in order to predict a reasonable continuation, i.e. a rule that predicts which terms will come next in the sequence. The term weighting scheme is an improvement in searching aimed at effective performance. It depends on two factors: 1) the terms retrieved from the document should be relevant to the user, and 2) additional terms must be discarded. The two measures usually used to assess how well related data is retained and unrelated data is discarded are recall and precision. In [9], work related to term frequency and inverse document frequency (tf*idf) is presented; in addition, a weighting scheme based on tf*idf, as mentioned above [11], is proposed, which improves effectiveness. Problems also arise in searching for the essential features in a dataset so as to improve performance and avoid overfitting. To improve performance we use term weights: each term has a global weight that indicates its importance and can be applied to all keywords.
In [6], term-based mining methods provided views for text representations. Pattern mining has been widely studied in the data mining community for many years [9].
Table 1 shows a comparison of the different techniques.

Table 1: Comparative study of different techniques

Paper Title | Concept | Advantages | Limitations
An information filtering model on the web and its application in job agent [13] | Term-based approach | Better performance | Polysemy, synonymy
Automatic pattern taxonomy for web mining [9] | Pattern-based approach | Resolves polysemy and synonymy | Meaningless, ambiguous patterns with low capability
Deploying approaches for pattern refinement in text mining [8] | Pattern deploying | Effective patterns are discovered | Noisy data, misinterpretation, low frequency
A novel approach in text mining for discovering useful patterns [5] | Pattern-based | Addresses misinterpretation, low frequency and noisy data | ------
3. TERMINOLOGIES
3.1) Absolute support - In a particular document d, if a sequence (a pattern drawn from a paragraph) q occurs in d, i.e. q ⊆ d, then q is called a sequential pattern. Its absolute support, denoted suppa(q), is the number of occurrences of q in d. The relative support is the fraction of the paragraphs of d in which q occurs, denoted suppr(q) = suppa(q) / |d| [7].
3.2) Frequent sequential pattern - A pattern p is said to be a frequent sequential pattern if its relative support suppr(p) is greater than or equal to the minimum support value (min_sup).
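The definitions in Sections 3.1 and 3.2 can be sketched in a few lines of Python. The representation below (a document as a list of paragraphs, each a list of terms, and a pattern as an ordered but not necessarily contiguous subsequence of terms) is an assumption for illustration only.

def contains(pattern, paragraph):
    """True if `pattern` occurs in `paragraph` as an ordered subsequence."""
    it = iter(paragraph)
    return all(term in it for term in pattern)

def absolute_support(pattern, document):
    """suppa(q): number of paragraphs of the document in which the pattern occurs."""
    return sum(1 for para in document if contains(pattern, para))

def relative_support(pattern, document):
    """suppr(q) = suppa(q) / |d|: fraction of paragraphs containing the pattern."""
    return absolute_support(pattern, document) / len(document)

def is_frequent(pattern, document, min_sup):
    """A pattern is a frequent sequential pattern if suppr reaches min_sup."""
    return relative_support(pattern, document) >= min_sup

# A document with four paragraphs.
doc = [["p", "q", "r"], ["p", "q"], ["q", "r"], ["p", "r", "q"]]
print(relative_support(["p", "q"], doc))   # 0.75
print(is_frequent(["p", "q"], doc, 0.5))   # True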
3.3) Closed sequential pattern - A frequent sequential pattern P1 is said to be a closed sequential pattern if there is no frequent super-pattern P2 of P1 (i.e. P1 a proper subset of P2) with the same support [10].
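A minimal sketch of the closedness test above, assuming the frequent sequential patterns and their supports have already been computed (the dictionary of supports below is hypothetical data):

def proper_supersequence(longer, shorter):
    """True if `shorter` is an ordered subsequence of `longer` and the two differ."""
    if len(longer) <= len(shorter):
        return False
    it = iter(longer)
    return all(term in it for term in shorter)

def is_closed(pattern, supports):
    """A frequent pattern is closed if no proper super-pattern has the same support."""
    return not any(
        proper_supersequence(other, pattern) and supports[other] == supports[pattern]
        for other in supports
    )

# Hypothetical supports of frequent sequential patterns.
supports = {("p",): 3, ("q",): 3, ("p", "q"): 3, ("p", "q", "r"): 2}
print(is_closed(("p",), supports))       # False: absorbed by ("p", "q") with equal support
print(is_closed(("p", "q"), supports))   # True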
3.4) Pattern Taxonomy Model - PTM [7] is a pattern-based model for representing text data. It is a tree-like structure in which the most frequently occurring pattern forms the root node and the other nodes are its sub-patterns in the text documents. The two basic issues affecting the performance of phrases are their low frequency of occurrence and the presence of false (noisy) terms. A phrase with a high support value is a general term that occurs frequently, and as the minimum support value is lowered, more unidentified phrases are found; to avoid this, and to satisfy what the user actually wants, the pattern taxonomy model is used. A pattern taxonomy is a tree-like structure that shows the relations between the patterns discovered from the text data; it is an 'is-a' relationship between the most relevant patterns and their sub-sequences. For example, pattern <P, Q> is a sub-sequence of pattern <P, Q, R>, and pattern <R> is a sub-sequence of pattern <P, R>. Thus the root of the tree, <P, Q, R>, represents the longest (most sequential) frequent pattern, with its sub-sequences at the lower levels. Once the tree is structured, we are able to find the links between different phrases and prune the meaningless patterns: it is obvious, for instance, that the pattern <P, Q> occurs in every paragraph in which <P, Q, R> occurs, so it is considered a meaningless pattern, since many sub-sequence patterns can occur inside a super-sequence pattern. The patterns <P, Q>, <Q, R>, <P, R>, <P>, <Q> and <R> derived from <P, Q, R> can therefore be regarded as less useful patterns [7]. Fig. 1 shows the pattern taxonomy model. The sequential patterns that remain after pruning are composed using the pattern deploying method.
Fig. 1: Pattern Taxonomy Model
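The taxonomy of Fig. 1 can be sketched as a mapping from each pattern to the sub-sequences ('is-a' links) that also appear in the pattern set. The patterns below are the <P, Q, R> example from the text; the dictionary-of-lists representation of the tree is an illustrative assumption.

from itertools import combinations

def subsequences(pattern):
    """All non-empty proper sub-sequences of a pattern (order preserved)."""
    for size in range(1, len(pattern)):
        for idx in combinations(range(len(pattern)), size):
            yield tuple(pattern[i] for i in idx)

def build_taxonomy(patterns):
    """Map each pattern to the sub-sequences of it that are also in the pattern set."""
    pattern_set = set(patterns)
    return {p: sorted(q for q in subsequences(p) if q in pattern_set) for p in pattern_set}

patterns = [("P", "Q", "R"), ("P", "Q"), ("Q", "R"), ("P", "R"), ("P",), ("Q",), ("R",)]
for parent, children in sorted(build_taxonomy(patterns).items(), key=lambda kv: -len(kv[0])):
    print(parent, "->", children)

Patterns whose every occurrence lies inside a longer pattern with the same support (the closedness test sketched earlier) can then be pruned as meaningless.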
3.5) Pattern deploying method
In [7], the discovered patterns are used through a term weighting scheme: the patterns are deployed in sequential form, and since the relation among these patterns is an is-a relation, there is considerable overlap among them; to represent this, the patterns discovered in a document d are deployed onto its term set T.
Algorithm PDM(D, min_sup)
Input: a list of positive documents, D; minimum support, min_sup.
Output: a set of d-patterns (one per document), DP.
Method:
1: DP <- {}
2: for each document d in D do begin
3:    SP <- SPMining(PL, min_sup)    // PL is the set of paragraphs of d
4:    v <- {}                        // empty d-pattern for d
5:    for each pattern P in SP do begin
6:       v <- v ⊕ P                  // compose the pattern into the d-pattern
7:    end for
8:    normalize v
9:    DP <- DP ∪ {v}
   end for
The phrase enhancement method deploys the discovered phrases, which are used to represent the concept of a document; PDM (Pattern Deploying Method) takes the sequential phrases produced by the mining technique and deploys them using the pattern deploying algorithm above.
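The pseudocode above leaves several steps implicit; the Python sketch below fills them in under stated assumptions: sp_mining merely stands in for SPMining (it filters given candidate patterns by relative support instead of generating them), and a d-pattern is represented as a dictionary of term supports that is composed from the discovered patterns and then normalized.

def contains(pattern, paragraph):
    it = iter(paragraph)
    return all(term in it for term in pattern)

def relative_support(pattern, paragraphs):
    return sum(contains(pattern, p) for p in paragraphs) / len(paragraphs)

def sp_mining(paragraphs, candidates, min_sup):
    """Stand-in for SPMining: keep the candidate patterns that are frequent in d."""
    return [p for p in candidates if relative_support(p, paragraphs) >= min_sup]

def deploy(patterns):
    """Compose the discovered patterns of one document into a d-pattern of term supports."""
    d_pattern = {}
    for pattern in patterns:
        for term in pattern:
            d_pattern[term] = d_pattern.get(term, 0) + 1
    return d_pattern

def normalize(d_pattern):
    """Normalize the term supports so that they sum to 1."""
    total = sum(d_pattern.values())
    return {t: s / total for t, s in d_pattern.items()} if total else d_pattern

def pdm(documents, candidates, min_sup):
    """PDM: one normalized d-pattern per positive document."""
    return [normalize(deploy(sp_mining(d, candidates, min_sup))) for d in documents]

# One positive document (four paragraphs) and three candidate patterns.
docs = [[["p", "q", "r"], ["p", "q"], ["q", "r"], ["p", "r", "q"]]]
cands = [("p", "q"), ("q", "r"), ("p",)]
print(pdm(docs, cands, 0.5))   # [{'p': 0.4, 'q': 0.4, 'r': 0.2}]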
4. PROPOSED SYSTEM
Let P1 and P2 be sets of term-number pairs; then P1 ⊕ P2 is called the composition of P1 and P2. In order to effectively deploy the patterns in the different taxonomies obtained from the different positive documents, the d-patterns are normalized using the following assignment:

P(t) = P(t) * 1 / Σ P(t), where the summation runs over all terms t in the d-pattern.
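A small worked example of the composition operator and the normalization assignment above; the term-number pairs are hypothetical.

def compose(p1, p2):
    """P1 ⊕ P2: merge two sets of term-number pairs by summing the numbers per term."""
    merged = dict(p1)
    for term, support in p2.items():
        merged[term] = merged.get(term, 0) + support
    return merged

def normalize(p):
    """P(t) = P(t) * 1 / sum of P(t) over all terms, as in the assignment above."""
    total = sum(p.values())
    return {t: s / total for t, s in p.items()}

p1 = {"carbon": 2, "emission": 1}                 # hypothetical term-number pairs
p2 = {"carbon": 1, "air": 1, "pollution": 1}
print(normalize(compose(p1, p2)))                 # carbon: 0.5, the rest: ~0.167 each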
A term with a higher tf*idf value may still not be useful in any deployed pattern, and because of the problems of polysemy and synonymy, an inner pattern evolution (IPE) step is proposed to remove these drawbacks. An algorithm is first applied for stop-word removal and stemming. In the IPE method, term supports are recalculated by reshuffling the supports of the terms deployed from the normalized document, taking the negative documents in the dataset into account. This method is useful for minimizing the side effects of misinterpreted phrases: the supports of the terms within the same phrase are adjusted. A threshold value is used to decide the significance of incoming documents [5].
Using the d-patterns, the value of the threshold can be defined naturally as follows (following [5]): the threshold of the deployed patterns is the minimum, over the d-patterns, of the total term support,

threshold(DP) = min_{d̂ ∈ DP} { Σ_{t ∈ d̂} support(t) }    (1)

and this threshold is used to classify documents into relevant and irrelevant categories. A negative document nd is a document that the system incorrectly identifies as positive; its weight, used as the factor for determining the noise in the document, is compared with a fixed threshold. This threshold will be made dynamic, so as to keep the error within the margin and to minimize the noise; parametric changes will also be made to the experimental coefficient, the separator threshold and the variance component of the probability model. A probability distribution function and a matrix are used for the calculation of support in the pattern deploying method. To reduce the noise contributed by the terms, we keep track of which d-patterns introduce such noise; these patterns are called the miscreants of nd (the negative document). The probabilistic model will be constructed using a Bayesian network, which will then be pruned using suitable techniques. The model will be based mainly on the overall structure of d-pattern mining and built upon the a-priori structure provided by the SPMining algorithm. Finally, the noise induction model, which is linear in the original formulation, will be changed so as to make the system behave in a more realistic way: the induced noise will be modelled using a Poisson distribution, which provides a realistic model of the search space. The inner pattern evolution step takes the d-patterns and rearranges the support metric in a certain pre-defined manner, which is referred to as shuffling.
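A minimal Python sketch of the threshold in Eq. (1) and of the noise-tracking step described above; the toy d-patterns, the merged term weights, and the simple scoring of an incoming document (summing the supports of the terms it contains) are illustrative assumptions, not the exact formulation of the cited papers.

def threshold(d_patterns):
    """Eq. (1): the minimum, over the deployed d-patterns, of the total term support."""
    return min(sum(dp.values()) for dp in d_patterns)

def score(doc_terms, term_weights):
    """Weight of an incoming document: sum of the supports of the terms it contains."""
    return sum(term_weights.get(t, 0.0) for t in set(doc_terms))

def find_miscreants(negative_docs, d_patterns, term_weights, thr):
    """Negative documents scored above the threshold are treated as noise, and the
    d-patterns sharing terms with them are the 'miscreants' to be reshuffled by IPE."""
    noisy_docs = [nd for nd in negative_docs if score(nd, term_weights) >= thr]
    miscreants = [dp for dp in d_patterns
                  if any(set(nd) & set(dp) for nd in noisy_docs)]
    return noisy_docs, miscreants

# Hypothetical deployed d-patterns and merged term weights.
d_patterns = [{"p": 0.4, "q": 0.4, "r": 0.2}, {"q": 0.6, "s": 0.4}]
term_weights = {"p": 0.4, "q": 1.0, "r": 0.2, "s": 0.4}
thr = threshold(d_patterns)                    # 1.0 for both toy d-patterns
negative_docs = [["q", "s", "x"], ["x", "y"]]
print(find_miscreants(negative_docs, d_patterns, term_weights, thr))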
The diagram below shows the system architecture of the proposed system.

Fig. 2: System Architecture
The system has a pre-processing module that processes the raw (garbage) data; the system then considers the impact of patterns from the negative data to find ambivalent patterns, to which the d-pattern mining algorithm is applied. For negative documents that are retrieved after d-pattern mining as false positives, inner pattern evolution is applied to the deployed data to avoid noisy patterns and obtain the result.
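The pre-processing module mentioned above typically removes stop words and stems the remaining terms. The sketch below uses a tiny hand-made stop list and a crude suffix-stripping stemmer purely for illustration; a real system would use a standard stop list and a proper stemmer such as Porter's.

import re

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}  # illustrative subset

def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, lowercase, drop stop words, and stem the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Filtering of the discovered patterns is applied to documents"))
# ['filter', 'discover', 'pattern', 'appli', 'document']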
Of these three methods, we propose an approach based on the pattern taxonomy model, in which the data is arranged using an "IS-A" relationship.
Fig. 3: Pattern Taxonomy Model
Using the above pattern taxonomy, we have calculated the support metric values as follows; the values were calculated on the basis of the text input terms provided to the system, and the positive documents are used for building the PTM.

SuppMet[0..15] = 0.0625, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.125, 0.75, 0.75, 0.125, 0.75, 0.75, 0.75, 0.75, 0.0625
CONCLUSION
Data mining supports association rule mining, frequent itemset mining and related pattern mining techniques. However, using these discovered patterns in text mining is challenging and often ineffective. The cause is that some long, specific patterns have low support values (the low-frequency problem), while not all short patterns are relevant, and hence pattern discovery performs poorly. This paper gives an approach for using discovered patterns so as to minimize the low-frequency and misinterpretation problems. The paper shows results in the form of term support metrics, presents the pattern taxonomy model, and gives a comparative study of pattern taxonomy techniques.
REFERENCES
[1] F. Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[2] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, 2000.
[3] J. Han and K.C.-C. Chang, "Data Mining for Web Intelligence", Computer, vol. 35, no. 11, pp. 64-70, Nov. 2002.
[4] K. Aas and L. Eikvil, "Text Categorisation: A Survey", Technical Report NR 941, Norwegian Computing Center, 1999.
[5] Ning Zhong, Yuefeng Li, and Sheng-Tang Wu, "Effective Pattern Discovery for Text Mining", IEEE Trans. Knowledge and Data Engineering, vol. 24, no. 1, Jan. 2012.
[6] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases", Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 478-499, 1994.
[7] S. Scott and S. Matwin, "Feature Engineering for Text Classification", Proc. 16th Int'l Conf. Machine Learning (ICML '99), pp. 379-388, 1999.
[8] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, "Automatic Pattern-Taxonomy Extraction for Web Mining", Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence (WI '04), pp. 242-248, 2004.
[9] S.-T. Wu, Y. Li, and Y. Xu, "Deploying Approaches for Pattern Refinement in Text Mining", Proc. IEEE Sixth Int'l Conf. Data Mining (ICDM '06), pp. 1157-1161, 2006.
[10] X. Li and B. Liu, "Learning to Classify Texts Using Positive and Unlabeled Data", Proc. Int'l Joint Conf. Artificial Intelligence (IJCAI '03), pp. 587-594, 2003.
[11] X. Yan, J. Han, and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets", Proc. SIAM Int'l Conf. Data Mining (SDM '03), pp. 166-177, 2003.
[12] Y. Huang and S. Lin, "Mining Sequential Patterns Using Graph Search Techniques", Proc. 27th Ann. Int'l Computer Software and Applications Conf., pp. 4-9, 2003.
[13] Y. Li, C. Zhang, and J.R. Swan, "An Information Filtering Model on the Web and Its Application in JobAgent", Knowledge-Based Systems, vol. 13, no. 5, pp. 285-296, 2000.