Data Mining

The Mining Analogy
• Data mining gets its name, and to some degree its popularity, by playing
on the idea that the data you have stored is much like a ‘mountain’
and that buried within the mountain (just as buried within your data)
are certain ‘gems’ of great value.
• The problem is that there are also lots of non-valuable rocks and rubble
in the mountain that need to be mined through and discarded in order
to get that which is valuable.
• The trick is that both for mountains of rock and mountains of data you
need some power tools to unearth the value of the data
• For rock, this means earthmovers and dynamite; for data, this means
powerful computers and data mining software
What Data Mining Isn’t
• Statistics
– Statistical tools take longer to run
– Less robust on messy real world data
– Must often be wielded by a master craftsman
• OLAP (Online Analytical Processing)
– OLAP provides a tool for looking quickly anywhere
within the mountain
– What it doesn’t tell you is what is valuable and what
isn’t
• Data Warehousing
Data Mining has come of age
• What is data mining? And why are so many people talking about it in
both the computer industry and in direct marketing?
• The answer is simple: data mining helps end users extract useful
business information from large databases.
• What is so new about extracting information from data to make your
business run better?
• The allure of data mining is that it promises to fix the problem of
miscommunication between you and your data, and to allow you to ask
complex questions of your data such as:
– What has been going on?
– What is going to happen next and how can I profit?
Data Mining has come of age (cont’d)
• What has been going on?
– The answer to this question can be provided by the data warehouse and
multidimensional database technology that allow the user to easily
navigate and visualize the data
• What is going to happen next and how can I profit?
– The answer to this question can be provided by data mining tools
built on some of the latest computer algorithms: Decision Trees
(CART, CHAID, AID), Neural Networks, Nearest Neighbor, and
Rule Induction
Learning from Your Past
Mistakes
• “Those who cannot remember the past are
condemned to repeat it.” [G. Santayana]
• How does data mining work?
• It works the same way as a human being does.
• It uses historical information (experience) to learn
from the past.
• The trick to building a successful predictive model
is to have some data in your database that
describes what has happened in the past (see
details on page 96).
Measuring Data Mining Effectiveness –
Accuracy, Speed, Cost
• To make the right choice of data mining tool, users need
to evaluate it in comparison to existing statistical
techniques and also compare among the large number of
new data mining products that are currently on the market.
• Data mining technology is actually quite similar to
statistics in the way it builds a predictive model from data.
• Often, the accuracy of that prediction depends more on the
correct deployment of the technology and the quality of the
data than it does on the technology itself.
• The choice of a data mining tool should be driven by the
advantages that it brings to the bottom line of the entire
business process – not just the statistical predictive
accuracy.
Measuring Data Mining Effectiveness –
Accuracy, Speed, Cost (cont’d)
• Data mining techniques are also often measured by
speed.
• The reasoning is that the faster the tool runs, the larger
the data set to which it can be applied, and the larger the
database is, the better the accuracy of the predictive model
will be.
• To truly determine which technologies are best, it is
helpful to look at the big picture, which includes a much
larger business process than just data analysis. The full
process includes data collection, data analysis (data
mining), predictive model visualization, and the launching
of a marketing program against a customer set
Discovery versus Prediction
• Discovery – finding something that you weren’t looking for
– One of the obvious things about real mining is that when you come
across a diamond or vein of gold, you know that you have found it
– You can recognize the important properties of diamonds or gold because they have
been discovered before and you know what they look like and feel like
– Discovery systems work mainly by making three measurements:
• How strong is the association?
• How unexpected is it?
• How ubiquitous is it?
– The first measurement requires that the pattern in the data be a strong pattern (for example,
that it occurs 90% of the time).
– The second measurement ensures that the pattern is interesting to the user and not
obvious.
– The third measurement ensures that the pattern occurs often enough that it is
useful.
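The three measurements above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular product: the transactions are made up, and the names support, confidence, and lift are the standard association-rule terms, assumed here to correspond to ubiquity, strength, and unexpectedness respectively.

```python
# Hypothetical market-basket data: each set is one customer's purchases.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"eggs"},
    {"bread", "milk", "eggs"},
]

def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    cons = sum(1 for t in transactions if consequent in t)

    # Ubiquity: how often the whole pattern occurs in the data.
    support = both / n
    # Strength: how often the consequent appears when the antecedent does.
    confidence = both / ante if ante else 0.0
    # Unexpectedness: strength relative to how common the consequent is anyway.
    baseline = cons / n
    lift = confidence / baseline if baseline else 0.0
    return support, confidence, lift

support, confidence, lift = rule_measures(transactions, "bread", "milk")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items occur together more often than chance alone would explain; a lift near 1 suggests the rule, however strong, is probably obvious or uninformative.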
Discovery versus Prediction (cont’d)
• Prediction
– With prediction, you as the end user have a very specific event or attribute
for which you want to find an associated pattern.
– For instance, suppose you want to predict customer attrition.
– One of the most important parts of predicting customer attrition is having
historical information in your database about which customers have
attrited in the past.
– There may be many interesting patterns in your database – say, between
the age of your customers and their buying habits – that you might like to
discover, but in this case, you know very well that attrition is costing you
a lot of money.
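As a rough sketch of why historical attrition records matter, the example below predicts attrition for a new customer by looking at the most similar past customers (a nearest-neighbor approach, one of the techniques named earlier). The records, the two fields, and the function name `predict_attrition` are all hypothetical, and the features are left unscaled for simplicity.

```python
import math
from collections import Counter

# Hypothetical historical records: (age, months_as_customer, attrited?)
history = [
    (25, 3, True),
    (27, 5, True),
    (31, 4, True),
    (45, 36, False),
    (52, 48, False),
    (38, 30, False),
]

def predict_attrition(age, tenure, k=3):
    """Majority vote among the k most similar historical customers."""
    # Sort past customers by Euclidean distance to the new customer.
    by_distance = sorted(
        history,
        key=lambda r: math.dist((age, tenure), (r[0], r[1])),
    )
    votes = Counter(r[2] for r in by_distance[:k])
    return votes.most_common(1)[0][0]

print(predict_attrition(29, 6))    # resembles the customers who left
print(predict_attrition(48, 40))   # resembles the customers who stayed
```

Without the third column (who actually attrited in the past), there would be nothing for the prediction to vote on, which is the point made above.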
State of the Industry
• The current offerings in data mining software products
emphasize different important aspects of the algorithms
and their usage. The different emphases are driven by
differences in the targeted user and the types of
problems being solved. There are four main categories of
products:
– Targeted Solutions
– Business Tools
– Business Analyst Tools
– Research Analyst Tools
Data Mining Methodology
• The first of these is the concept of finding a
pattern in the data
• The second of these is that of sampling, or not
having to use all of the data in order to draw
significant conclusions about what might be
happening with other parts of the data
• The third is validating the predictive models that
arise out of data mining algorithms
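Sampling and validation can be illustrated together with a hold-out split. The sketch below is an assumption-laden toy: the data is generated, the 70/30 split and the seed are arbitrary choices, and the "model" is just a single threshold learned from the training sample.

```python
import random

# Hypothetical historical database: (months_as_customer, attrited?).
# In this made-up data, customers with under 12 months of tenure attrited.
records = [(tenure, tenure < 12) for tenure in range(1, 41)]

# Sampling: shuffle with a fixed seed, then hold out 30% for validation.
rng = random.Random(0)
shuffled = records[:]
rng.shuffle(shuffled)
cut = len(shuffled) * 7 // 10
train, holdout = shuffled[:cut], shuffled[cut:]

# "Model": a single tenure threshold learned from the training sample only.
longest_attrited = max((t for t, a in train if a), default=0)
shortest_retained = min((t for t, a in train if not a), default=longest_attrited)
threshold = (longest_attrited + shortest_retained) / 2

def predict(tenure):
    """Predict attrition for customers whose tenure is below the threshold."""
    return tenure < threshold

# Validation: measure accuracy on records the model never saw.
correct = sum(1 for t, a in holdout if predict(t) == a)
accuracy = correct / len(holdout)
print(f"threshold={threshold}, hold-out accuracy={accuracy:.2f}")
```

Measuring accuracy on the held-out records, rather than on the training sample, gives a more honest estimate of how the model might behave on new data.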
What is a pattern? What is a model?
• Although there are many ways to define patterns and
models, here is what they mean in the context of data
warehousing and data mining:
– Model. A description of the original historical database from which
it was built that can be successfully applied to new data in order to
make predictions about missing values or to make statements about
expected values.
– Pattern. An event or combination of events in a database that
occurs more often than expected. Typically, this means that its
actual occurrence is significantly different than what would be
expected by random chance
• What is the difference between a pattern and a model?
Visualizing a pattern
Figure 5-2 A Graphical Representation of Number Sequence
Visualizing a pattern
Figure 5-2 A Graphical Representation of Complex Number Sequence
appears much more understandable than the raw data
A note on terminology
• Database
• Record
• Field
• Predictor
• Prediction
• Value
A note on terminology (cont’d)