Data Mining

The Mining Analogy
• Data mining owes its name, and to some degree its popularity, to an analogy: the data you have stored is much like a ‘mountain’, and buried within that mountain (just as within your data) are certain ‘gems’ of great value.
• The problem is that the mountain also contains a great deal of worthless rock and rubble that must be mined through and discarded to reach what is valuable.
• The trick is that, for mountains of rock and mountains of data alike, you need power tools to unearth the value.
• For rock, this means earthmovers and dynamite; for data, this means powerful computers and data mining software.

What Data Mining Isn’t
• Statistics
  – Statistical tools take longer to run
  – Less robust on messy real-world data
  – Must often be wielded by a master craftsman
• OLAP (Online Analytical Processing)
  – OLAP provides a tool for looking quickly anywhere within the mountain
  – What it doesn’t tell you is what is valuable and what isn’t
• Data Warehousing

Data Mining has come of age
• What is data mining? And why are so many people talking about it in both the computer industry and in direct marketing?
• The answer is simple: data mining helps end users extract useful business information from large databases.
• What is so new about extracting information from data to make your business run better?
• The allure of data mining is that it promises to fix the problem of miscommunication between you and your data, and to let you ask complex questions of your data, such as:
  – What has been going on?
  – What is going to happen next and how can I profit?

Data Mining has come of age (cont’d)
• What has been going on?
  – The answer to this question can be provided by the data warehouse and multidimensional database technology, which allow the user to easily navigate and visualize the data
• What is going to happen next and how can I profit?
  – The answer to this question can be provided by data mining tools built on some of the latest computer algorithms: decision trees (CART, CHAID, AID), neural networks, nearest neighbor, and rule induction

Learning from Your Past Mistakes
• “Those who cannot remember the past are condemned to repeat it.” [G. Santayana]
• How does data mining work?
• It works the same way a human being does.
• It uses historical information (experience) to learn from the past.
• The trick to building a successful predictive model is to have some data in your database that describes what has happened in the past. (See details on p. 96.)

Measuring Data Mining Effectiveness: Accuracy, Speed, Cost
• To choose the right data mining tool, users need to evaluate it against existing statistical techniques and also compare it with the large number of new data mining products currently on the market.
• Data mining technology is actually quite similar to statistics in the way it builds a predictive model from data.
• Often, the accuracy of a prediction depends more on the correct deployment of the technology and the quality of the data than on the technology itself.
• The choice of data mining tool should be driven by the advantages it brings to the bottom line of the entire business process, not just by statistical predictive accuracy.

Measuring Data Mining Effectiveness: Accuracy, Speed, Cost (cont’d)
• The other way that data mining techniques are often measured is by speed.
• The reasoning is that the faster the tool runs, the larger the data set to which it can be applied; and the larger the database, the better the accuracy of the predictive model will be.
• To truly determine which technologies are best, it is helpful to look at the big picture, which includes a much larger business process than just data analysis.
• The full process includes data collection, data analysis (data mining), predictive model visualization, and the launch of a marketing program against a customer set.

Discovery versus Prediction
• Discovery: finding something that you weren’t looking for
  – One of the obvious things about real mining is that when you come across a diamond or a vein of gold, you know that you have found it
  – You can recognize the important properties of diamonds or gold because they have been discovered before, and you know what they look and feel like
  – Discovery systems work mainly by making three measurements:
    • How strong is the association?
    • How unexpected is it?
    • How ubiquitous is it?
  – The first measurement requires that the pattern in the data be strong (for example, that it holds 90% of the time)
  – The second measurement ensures that the pattern is interesting to the user and not obvious
  – The third measurement ensures that the pattern occurs often enough to be useful

Discovery versus Prediction (cont’d)
• Prediction
  – With prediction, you as the end user have a very specific event or attribute for which you want to find an associated pattern
  – For instance, suppose you want to predict customer attrition
  – One of the most important parts of predicting customer attrition is having historical information in your database about which customers have left in the past
  – There may be many interesting patterns in your database that you might like to discover (say, between the age of your customers and their buying habits), but in this case you know very well that attrition is costing you a lot of money

State of the Industry
• The current offerings in data mining software emphasize different aspects of the algorithms and their usage.
• The different emphases are driven by differences in the targeted user and the types of problems being solved.
• There are four main categories of products:
  – Targeted Solutions
  – Business Tools
  – Business Analyst Tools
  – Research Analyst Tools

Data Mining Methodology
• The first of these is the concept of finding a pattern in the data
• The second is that of sampling: not having to use all of the data in order to draw significant conclusions about what might be happening in other parts of the data
• The third is that of validating the predictive models that arise from data mining algorithms

What is a pattern? What is a model?
• Although there are many ways to define patterns and models, here is what they mean in the context of data warehousing and data mining:
  – Model. A description of the original historical database from which it was built that can be successfully applied to new data in order to make predictions about missing values or to make statements about expected values.
  – Pattern. An event or combination of events in a database that occurs more often than expected. Typically, this means that its actual occurrence is significantly different from what would be expected by random chance.
• What is the difference between a pattern and a model?

Visualizing a pattern
Figure 5-2 A Graphical Representation of Number Sequence

Visualizing a pattern
Figure 5-2 A Graphical Representation of Complex Number Sequence appears much more understandable than the raw data

A note on terminology
• Database
• Record
• Field
• Predictor
• Prediction
• Value

A note on terminology (cont’d)
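
As a rough illustration of the terms listed above, and of the earlier definition of a pattern as something that occurs more often than chance would suggest, the following Python sketch treats a database as a list of records and each record as a set of fields with values. One field plays the role of the predictor and another the role of the prediction. The field names (`age_group`, `attrited`), the toy data, and the function `pattern_measures` are all hypothetical. Reading "how ubiquitous", "how strong", and "how unexpected" as support, confidence, and lift is an assumption borrowed from common association-rule practice, not something the slides state.

```python
# A toy "database": a list of records. Each record maps a field name to a value.
# All field names and values are invented for illustration only.
database = [
    {"age_group": "young", "attrited": "yes"},
    {"age_group": "young", "attrited": "yes"},
    {"age_group": "young", "attrited": "yes"},
    {"age_group": "young", "attrited": "no"},
    {"age_group": "old", "attrited": "no"},
    {"age_group": "old", "attrited": "no"},
    {"age_group": "old", "attrited": "no"},
    {"age_group": "old", "attrited": "yes"},
]


def pattern_measures(records, predictor, predictor_value, prediction, prediction_value):
    """Measure the pattern 'predictor == predictor_value -> prediction == prediction_value'.

    Assumed mapping (not from the slides):
      support    ~ how ubiquitous the pattern is
      confidence ~ how strong the association is
      lift       ~ how unexpected it is compared with random chance
    """
    total = len(records)
    with_predictor = [r for r in records if r[predictor] == predictor_value]
    with_prediction = [r for r in records if r[prediction] == prediction_value]
    with_both = [r for r in with_predictor if r[prediction] == prediction_value]

    if not with_predictor or not with_prediction:
        raise ValueError("the predictor value or prediction value never occurs")

    support = len(with_both) / total                 # share of all records matching the pattern
    confidence = len(with_both) / len(with_predictor)  # how often the prediction holds given the predictor
    expected = len(with_prediction) / total          # rate of the prediction value by chance alone
    lift = confidence / expected                     # above 1 means more often than expected

    return {"support": support, "confidence": confidence, "lift": lift}


measures = pattern_measures(database, "age_group", "young", "attrited", "yes")
print(measures)  # {'support': 0.375, 'confidence': 0.75, 'lift': 1.5}
```

In this toy data the pattern holds for 75% of young customers, against a 50% attrition rate overall, so it occurs 1.5 times as often as chance would predict. Whether that counts as significantly different would still need a statistical test on real data, which this sketch does not attempt.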