International Journal of Engineering Trends and Technology - Volume 3, Issue 2 - 2012
Balancing Exploration and Exploitation using Search Mining Techniques
Sonal Kapoor, Narendra Kumar and Alok Aggrawal
Singhania University, Jhunjhanu
ACEM, Agra
JIIT, Noida
Abstract: Search Mining is the process of extracting patterns from data, and it is becoming an increasingly important tool for transforming that data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery. In this paper we present an overview of balancing exploration and exploitation, and we also give an insight into search mining methods on the application side.
Introduction:
Search Mining techniques are becoming indispensable parts of business intelligence programs. Search Mining is used to uncover patterns in data, but it is often carried out only on samples of data. The mining process will be ineffective if the samples are not a good representation of the larger body of data: Search Mining cannot discover patterns that may be present in the larger body of data if those patterns are not present in the sample being "mined". The inability to find patterns may become a cause of disputes between customers and service providers. Search Mining is therefore not foolproof, but it may be useful if sufficiently representative data samples are collected. The discovery of particular patterns in a particular set of data does not necessarily mean that a pattern will be found elsewhere in the larger data from which that sample was drawn. An important part of this process is therefore the verification and validation of patterns on other samples of data. The related terms data dredging, data fishing and data snooping refer to the application of Search Mining techniques to sample sizes that are (or may be) too small for statistical inferences to be made about the validity of any patterns discovered. Data dredging may, however, be used to develop new hypotheses, which must then be validated against sufficiently large sample sets.
Evolution:
In addition to industry-driven demand for standards and interoperability, professional and academic activity has also made a considerable contribution to the evolution and rigour of the methods and models; an article published in a 2008 issue of the International Journal of Information Technology and Decision Making summarizes the results of a literature survey which traces and analyzes this evolution. The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 it has hosted an annual international conference and published its proceedings, and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".
Research Methodology:
• Clustering – is the task of discovering groups
and structures in the data that are in some way
or another “similar”, without using known
structure in the data.
• Classification – is the task of generalizing
known structure to apply to new data. For
example, an email programme might attempt to
classify an email as legitimate or spam.
Common algorithms include decision tree
learning, nearest neighbor, naïve Bayesian
classification and neural networks.
• Regression – Attempts to find a function which models the data with the least error.
• Association rule learning – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis; a minimal sketch follows this list.
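To make the association-rule idea concrete, the following is a minimal, self-contained Python sketch of market basket analysis. The transactions and the support threshold are invented for illustration; it simply counts how often pairs of items appear together and reports their support and confidence.

```python
from itertools import combinations
from collections import Counter

# Toy supermarket transactions (hypothetical data for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(transactions)
for (a, b), together in pair_counts.items():
    support = together / n                   # fraction of baskets with both items
    confidence = together / item_counts[a]   # estimate of P(b bought | a bought)
    if support >= 0.4:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```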
SEARCH MINING IN BUSINESS
Through the use of automated statistical analysis (or "data mining") techniques, businesses are discovering new trends and patterns of behavior that previously went unnoticed. Once they have uncovered this vital intelligence, it can be used in a predictive manner for a variety of applications. Brian James, assistant coach of the Toronto Raptors, uses Search Mining techniques to rack and stack his team against the rest of the NBA. The Bank of Montreal's business intelligence and knowledge discovery program is used to gain insight into customer behavior.
GATHERING DATA
The first step towards building a productive
Search Mining program is, of course, to gather
data! Most businesses already perform these data
gathering tasks to some extent – the key here is
to locate the data critical to your business, refine
it and prepare it for the Search Mining process.
If you are currently tracking customer data in a
modern DBMS, chances are you’re almost done.
Take a look at the article Mining Customer Data
from DB2 Magazine for a great feature on
preparing your data for the mining process.
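As a concrete illustration of this refinement step, here is a minimal Python sketch using pandas. The file name, column names and cleaning rules are assumptions for the sake of the example, not details from any particular database.

```python
import pandas as pd

# Hypothetical customer extract from an operational DBMS
# (file and column names are assumed for illustration).
raw = pd.read_csv("customers.csv")

# Basic data cleansing before mining: drop duplicate rows,
# normalize a categorical field, and filter anomalous records.
clean = (
    raw.drop_duplicates()
       .assign(region=lambda df: df["region"].str.strip().str.lower())
       .query("0 < age < 120")
)

clean.to_csv("customers_clean.csv", index=False)
```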
RESULTS VALIDATION
The final step of knowledge discovery from data is to verify that the patterns produced by the mining algorithms occur in the wider data set. Not all patterns produced by Search Mining algorithms are necessarily valid. It is common for Search Mining algorithms to find patterns in the training set which are not present in the general data set; this is called overfitting. To overcome this, the evaluation uses a test set of data on which the Search Mining algorithm was not trained. The learnt patterns are applied to this test set and the resulting output is compared to the desired output. For example, a Search Mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learnt patterns would be applied to the test set of emails on which it had not been trained; the accuracy of these patterns can then be measured by how many emails they correctly classify. A number of statistical methods, such as ROC curves, may be used to evaluate the algorithm.
If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and mining steps. If the learnt patterns do meet the desired standards, the final step is to interpret them and turn them into knowledge. A minimal sketch of this train-and-test evaluation follows.
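The sketch below uses Python with scikit-learn; a synthetic dataset stands in for labelled spam/legitimate emails, so the data and model parameters are assumptions. It trains a classifier on one subset and scores the held-out test set with accuracy and ROC AUC.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in for a labelled spam/legitimate email dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set that the algorithm is never trained on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Measure how well the learnt patterns hold beyond the training set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```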
SELECTING AN ALGORITHM
At this point, take a moment to pat yourself on
the back. You have a data warehouse! The next
step is to choose one or more Search Mining
algorithms to apply to your problem. If you’re
just starting out, it's probably a good idea to experiment with several techniques to get a feel for how they work; the short sketch below tries a few side by side. Your choice of algorithm will depend upon the data you have gathered, the problem you are trying to solve and the computing tools you have available to you.
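As one way of "experimenting with several techniques", the following Python sketch cross-validates three of the algorithms this paper mentions on a synthetic dataset (the data and parameters are assumptions, not a recommendation).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic dataset standing in for whatever data you have gathered.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Decision trees, nearest neighbours and naive Bayes are all named above.
for model in (DecisionTreeClassifier(random_state=0),
              KNeighborsClassifier(),
              GaussianNB()):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: mean accuracy {scores.mean():.3f}")
```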
REGRESSION
Regression is the oldest and most well-known
statistical technique that the Search Mining
community utilizes. Basically, regression takes
a numerical dataset and develops a
mathematical formula that fits the data. When
you are ready to use the results to predict future
behavior, you simply take your new data, plug it
into the developed formula and you have got a
prediction! The major limitation of this
technique is that it only works well with
continuous quantitative data (like weight, speed
or age). If you are working with categorical data
where order is not significant (like color, name
or gender) you’re better off choosing another
technique.
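A minimal sketch of this workflow in Python follows; the age and weight values are invented for illustration. It fits a formula to continuous data, prints the developed formula, and plugs in a new value to get a prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical continuous data: predicting weight from age (invented values).
age = np.array([[20], [30], [40], [50], [60]])
weight = np.array([62.0, 70.0, 75.0, 78.0, 80.0])

model = LinearRegression().fit(age, weight)

# The fitted model is the "developed formula": weight = coef * age + intercept.
print(f"weight ~= {model.coef_[0]:.2f} * age + {model.intercept_:.2f}")

# Plug new data into the formula to get a prediction.
print("predicted weight at age 45:", model.predict([[45]])[0])
```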
CLASSIFICATION
Working with categorical data or a mixture of
continuous numeric and categorical data?
Classification analysis might suit your needs
well. This technique is capable of processing a
wider variety of data than regression and is
growing in popularity. You’ll also find output
that is much easier to interpret. Instead of the
complicated mathematical formula given by the
regression technique you’ll receive a decision
tree that requires a series of binary decisions.
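To show the easier-to-interpret output the text describes, here is a small scikit-learn sketch (the bundled iris dataset stands in for your own data) that fits a classification tree and prints it as a series of binary decisions rather than a formula.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# A small bundled dataset stands in for your own mix of attributes.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Unlike a regression formula, the output reads as binary yes/no decisions.
print(export_text(tree, feature_names=list(iris.feature_names)))
```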
SEARCH MINING PRODUCTS
Search Mining products are taking the industry
by storm. The major database vendors have
already taken steps to ensure that their platforms
incorporate Search Mining techniques. Oracle’s
Search Mining Suite (Darwin) implements
classification and regression trees, neural
networks, k-nearest neighbors, regression
analysis and clustering algorithms. Microsoft’s
SQL Server also offers Search Mining
functionality through the use of classification
trees and clustering algorithms. If you’re
already working in a statistics environment,
you’re probably familiar with the Search Mining
algorithm implementations offered by the
advanced statistical packages SPSS, SAS, and
S-Plus.
Glossary of Terms:
Analytical Model: A structure and process for analyzing a dataset.
Anomalous Data: Data that result from errors.
Artificial Neural Networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
CART: Classification and Regression Trees.
CHAID: Chi-squared Automatic Interaction Detection.
Classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as close as possible to one another.
Data Cleansing: The process of ensuring that all values in a dataset are consistent and correctly recorded.
References:
1. Fayyad, Usama; Gregory Piatetsky-Shapiro; Padhraic Smyth (1996). Retrieved 2008-12-17.
2. Clifton, Christopher (2010). Retrieved 2010-12-09.
3. Ian H. Witten; Eibe Frank; Mark A. Hall (30 January 2011). Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.). Elsevier.
4. R.R. Bouckaert; E. Frank; M.A. Hall; G. Holmes; B. Pfahringer; P. Reutemann; I.H. Witten (2010). "WEKA - Experiences with a Java open-source project". Journal of Machine Learning Research 11: 2533–2541. "The original title, 'Practical machine learning', was changed [...]"
5. Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons.
6. International Conferences on Knowledge Discovery and Data Mining, ACM, New York.
7. Günnemann, S.; Kremer, H.; Seidl, T. (2011). "An extension of the PMML standard to subspace clustering models". Proceedings of the 2011 Workshop on Predictive Markup Language Modeling (PMML '11). pp. 48.
8. Ellen Monk; Bret Wagner (2006). Concepts in Enterprise Resource Planning, Second Edition. Thomson Course Technology, Boston, MA.
9. Roberto Battiti; Mauro Brunato. Reactive Search Srl, Italy, February 2011.
10. Battiti, Roberto; Andrea Passerini (2010). "Brain-Computer Evolutionary Multi-Objective Optimization (BC-EMO): a genetic algorithm adapting to the decision maker". IEEE Transactions on Evolutionary Computation 14 (5): 671–687.