DATA MINING
CSCI-453, Dr. Khalil
Research Paper
Presented by Tarek El-Gaaly (900-00-2342) and Ahmed Gamal el-Din (900-00-0901)

Outline
1. Introduction
2. Keywords
3. The Architecture of Data Mining
4. How does Data Mining Actually Work?
5. The Foundations of Data Mining
6. The Scope of Data Mining
7. Data Mining Techniques
8. Applications for Data Mining
9. Data Mining Case Studies
10. Conclusion of this Research

Abstract:
This research paper outlines and describes data mining as a concept and as an implemented practice. Real-life examples of implemented data mining projects are given to illustrate in deeper detail the nature and environment of this complicated and extensive field. The paper provides an introduction to the basic concepts and techniques of data mining, illustrates example applications that show its relevance to today's business and scientific environments, and gives a basic description of how a data mining architecture can evolve to deliver the value of data mining to end users. A large portion of this document is dedicated to the techniques used in data mining.

1. Introduction: What is Data Mining?
Data mining is the process of extracting implicit knowledge hidden in large volumes of raw data. It is an information extraction activity whose goal is to discover hidden information held within a database, using a combination of machine learning, statistical analysis, modeling techniques and database technology. Data mining finds patterns and relationships in data and derives rules that allow the prediction of future results. In layman's terms, it is a technique which, as its name indicates, involves the mining or analysis of data in databases in order to extract and establish relationships and patterns from that data.

Data mining is done by database applications that probe for hidden or undiscovered patterns in given collections of data. These applications use pattern recognition technologies as well as statistical and mathematical techniques, and can have a unique impact on businesses and scientific research. Data mining is not simple, and most companies have not yet extensively mined their data, though many have plans to do so in the future.

Data mining tools predict future trends and behaviors, allowing businesses to make educated, knowledge-based decisions. They can answer business questions that were traditionally too time-consuming to resolve. The importance of collecting data that reflects business or scientific activities in order to achieve competitive advantage is now widely recognized. Powerful systems for collecting data and managing it in large databases are in place today; the difficulty lies in turning this data into a market advantage. Human analysts without special tools can no longer make sense of the enormous volumes of data that must be processed in order to make educated business decisions. Data mining automates the process of finding relationships and patterns in raw data and delivers results that can either be utilized in an automated decision support system or assessed by a human analyst.

2. Keywords:
The following keywords must be understood before proceeding into the detailed description of data mining:

- Data Warehousing: A collection of data designed to support management decision making.
Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. Development of a data warehouse includes development of systems to extract data from operating systems, plus installation of a warehouse database system that gives managers flexible access to the data. The term data warehousing generally refers to combining many different databases from across an entire enterprise.

- Predictors: A predictor is a piece of information that supports a probabilistic estimate of future events.

3. The Architecture of Data Mining:
To be applied most effectively, advanced data mining techniques must be fully integrated with a data warehouse as well as with flexible, interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management and new product rollout. The diagram below illustrates an architecture for advanced analysis in a large data warehouse.

Figure: Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact, coupled with external market data about competitor activity. This data warehouse can be implemented on many relational database systems, such as Sybase, Oracle and Redbrick, and should be optimized for flexible and fast data access. An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. Its multidimensional structures allow users to analyze the data in the way they want to view their business.

The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure (ROI, return on investment, is a measure of a company's profitability). An advanced, process-centric metadata template defines the data mining objectives for specific business issues. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the company can continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP server by providing a dynamic metadata layer that represents a refined view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.
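To make the warehouse-plus-OLAP idea above more concrete, the following is a minimal sketch (not part of the original architecture description) of the kind of multidimensional summary query an OLAP layer might run against a warehouse fact table. The table name, columns, sample rows and the use of SQLite are all assumptions made purely for illustration; a production warehouse would run on a system such as Oracle or Sybase as mentioned above.

```python
import sqlite3

# Hypothetical miniature "sales" fact table standing in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("North", "A", "Q1", 120.0), ("North", "B", "Q1", 80.0),
     ("South", "A", "Q1", 200.0), ("South", "A", "Q2", 150.0)],
)

# An OLAP-style navigation step: aggregate the fact table along two
# dimensions (region and product) to obtain a high-level business view.
cursor = conn.execute(
    "SELECT region, product, SUM(revenue) AS total_revenue "
    "FROM sales GROUP BY region, product ORDER BY region, product"
)
for region, product, total in cursor:
    print(f"{region:5s} {product}  {total:8.1f}")
```

A data mining server in the architecture above would then take such aggregated or detailed records as its input, rather than leaving an analyst to inspect them by hand.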
4. How does Data Mining Actually Work?
How exactly is data mining able to tell you important things that you did not know, or what is going to happen in the future? The technique used to perform these feats is called modeling (a toy sketch of this train-then-apply workflow appears at the end of Section 6 below). Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you do not. The act of model building is something that people have been doing for a long time, certainly since before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded with lots of data about a variety of situations where the answer is known, and the data mining software then runs through that data and distills the characteristics of the data that should go into the model. Once the model is built, it can be used in similar situations where the answer is not known.

5. The Foundations of Data Mining
Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently has produced technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond data access and navigation to practical information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
- Massive data collection
- Powerful multiprocessor computers
- Data mining algorithms
The accompanying need for improved computational engines can now be met cost-effectively with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least ten years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods. Today, the maturity of these techniques, together with high-performance relational database engines, makes them practical for current data warehouse environments.

6. The Scope of Data Mining
Given databases of sufficient size and quality, data mining technology can generate new commercial and scientific opportunities by providing these capabilities:
- Automated prediction of trends and behaviors: Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data.
- Automated discovery of previously unknown patterns: Data mining tools sweep through databases and identify previously hidden patterns in one step. The history of the data stored in the database is analyzed to extract hidden patterns.
When data mining tools are implemented on high-performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data, and high speed makes it practical to analyze huge quantities of data. Larger databases, in turn, yield improved predictions because there is more historical data from which to construct predictive models.
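As a concrete illustration of the modeling workflow from Section 4 (learn from records where the outcome is already known, then apply the model to new records where it is not), here is a deliberately tiny, hypothetical sketch. The attribute names and records are invented purely for the example; a real project would learn from a data warehouse rather than a hard-coded list.

```python
from collections import Counter, defaultdict

# Historical records where the answer (did the customer renew?) is already known.
history = [
    {"plan": "prepaid",  "renewed": "no"},
    {"plan": "prepaid",  "renewed": "no"},
    {"plan": "prepaid",  "renewed": "yes"},
    {"plan": "contract", "renewed": "yes"},
    {"plan": "contract", "renewed": "yes"},
]

def build_model(records, predictor, target):
    """For each predictor value, remember the most common target value."""
    outcomes = defaultdict(Counter)
    for r in records:
        outcomes[r[predictor]][r[target]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in outcomes.items()}

model = build_model(history, predictor="plan", target="renewed")

# Apply the model to new records where the answer is not yet known.
new_customers = [{"plan": "prepaid"}, {"plan": "contract"}]
for customer in new_customers:
    print(customer["plan"], "->", model.get(customer["plan"], "unknown"))
```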
7. Data Mining Techniques:
In this section the most common data mining algorithms in use today are briefly outlined. The most commonly used techniques in data mining are:
- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.
- Clustering: As the word suggests, this technique groups similar data components together.
These techniques fall into two main categories:
- Classical techniques: statistics, neighborhoods and clustering
- Next generation techniques: trees, networks and rules

7.1 The Classical Techniques
The main techniques discussed here are the ones used 99.9% of the time on existing business problems. There are certainly many others, but in general the industry is converging on the techniques that work consistently and are understandable and explainable.

Statistics:
By strict definition, "statistics" or statistical techniques are not data mining; they were in use long before the term data mining was coined. Despite this, statistical techniques are driven by the data and are used to discover patterns and build predictive models. From the user's perspective, when solving a "data mining" problem you will be faced with a choice of whether to attack it with statistical methods or with other data mining techniques. For this reason it is important to have some idea of how statistical techniques work and how they can be applied, because data mining utilizes statistical techniques to a great extent.

Data, counting and probability:
One thing that is always true about statistics is that data is always involved, and usually enough data that the average person cannot keep track of it all in their head. Today people have to deal with up to terabytes of data, make sense of it, and pick out the important patterns. Statistics can help greatly in this process by helping to answer several important questions about your data:
- What patterns are there in my database?
- What is the chance that an event will occur?
- Which patterns are significant?
- What is a high-level summary of the data that gives me some idea of what is contained in my database?
Certainly statistics can do more than answer these questions, but for most people today these are the questions that statistics can help answer. Consider, for example, that a large part of statistics is concerned with summarizing data, and that this summarization mostly has to do with counting. One of the great values of statistics is in presenting a high-level, abstract view of the database that provides some useful information without requiring every record to be understood in detail. Statistics at this level is used in the reporting of important information from which people may be able to make useful decisions. There are many different parts of statistics, but the idea of collecting data and counting it is often at the base of even the more sophisticated techniques. The first step in understanding statistics, then, is to understand how data is collected into a higher-level form. One of the most famous ways of doing this is the histogram.
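As a small, hypothetical illustration of the counting idea behind a histogram, the sketch below tallies how many customers fall into each age band; the data values are invented for the example.

```python
from collections import Counter

# Invented customer ages; in practice these would come from the database.
ages = [23, 27, 31, 34, 35, 38, 41, 44, 45, 47, 52, 58, 63]

# Count how many records fall into each 10-year bin: this is the histogram.
bin_width = 10
histogram = Counter((age // bin_width) * bin_width for age in ages)

for low in sorted(histogram):
    print(f"{low}-{low + bin_width - 1}: {'*' * histogram[low]} ({histogram[low]})")
```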
Linear regression
Prediction in statistics is usually some form of regression. There are a variety of different types of regression in statistics, but the basic idea is that a model is created that maps values from predictors in such a way that the lowest error occurs in making a prediction. The simplest form of regression is simple linear regression, which contains just one predictor and one prediction. The relationship between the two can be mapped onto a two-dimensional space, with the records plotted using the prediction values along the Y axis and the predictor values along the X axis. The simple linear regression model is then the line that minimizes the error between the actual prediction value and the point on the line (the prediction from the model). The simplest form of regression thus seeks to build a predictive model that is a line mapping each predictor value to a prediction value. Of the many possible lines that could be drawn through the data, the one that minimizes the distance between the line and the data points is the one chosen for the predictive model. Linear regression is, in this sense, the task of finding the line that minimizes the total distance to a set of data points.

Nearest Neighbor and Clustering
Clustering and the nearest neighbor prediction technique are among the oldest techniques used in data mining. Most people have an intuition that they understand what clustering is: similar records are grouped or clustered together. Nearest neighbor is a prediction technique that is quite similar to clustering. Its essence is that, in order to predict the value of a target field in one record, you look for records with similar predictor values in the historical database and use the prediction value from the record that is "nearest" to the unclassified record.

Clustering
Clustering is the method by which like records are grouped together. Usually this is done to give the end user a high-level view of what is going on in the database. The term clustering is sometimes used to mean segmentation, which most marketing people will say is useful for coming up with a bird's-eye view of a company's business. The clustering information is then used by the end user to tag the customers in the database; once this is done, the business users can get a quick, high-level view of what is happening within each cluster. Once business users have worked with these cluster codes for some time, they also begin to build intuitions about how the different customer clusters will react to the marketing offers particular to their business. Sometimes clustering is performed not so much to keep similar records together as to make it easier to see when one record sticks out from the rest.

Figure: An example of elongated clusters created from data in a database.
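The following is a minimal sketch of the nearest neighbor idea described above: to classify a new record, find the most similar records in the historical data and borrow their prediction value. The records and fields are invented for illustration, and plain Euclidean distance over two numeric predictors stands in for whatever similarity measure a real tool would use.

```python
import math
from collections import Counter

# Invented historical records: (age, income in $1000s, outcome already known).
history = [
    (25, 30, "default"), (30, 45, "repaid"), (45, 80, "repaid"),
    (22, 20, "default"), (50, 90, "repaid"), (28, 25, "default"),
]

def nearest_neighbor_predict(record, history, k=3):
    """Predict the class of `record` from the k most similar historical records."""
    ranked = sorted(
        history,
        key=lambda h: math.dist(record, (h[0], h[1])),  # similarity = Euclidean distance
    )
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(nearest_neighbor_predict((27, 28), history))  # expected: "default"
print(nearest_neighbor_predict((48, 85), history))  # expected: "repaid"
```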
7.2 Next Generation Techniques: Trees, Networks and Rules
The data mining techniques in this section are among the most often used. They can be applied either for discovering new information within large databases or for building predictive models. Though older decision tree techniques such as CHAID are still heavily used, newer techniques such as CART are gaining wider acceptance.

Decision Trees
A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically, each branch of the tree is a classification question, and the leaves of the tree are partitions of the classified data in the database. For instance, if we were going to classify customers who do not renew their phone contracts in the cellular telephone industry, a decision tree might look like the one below.

Figure: Example decision tree for classifying customers who do not renew their contracts.

Some interesting facts about decision trees:
- They divide up the data at each branch point without losing any of it; the number of renewers and non-renewers is conserved as you move up or down the tree.
- It is fairly easy to understand how the model is being built.
- It would also be easy to use the model if you actually had to target those customers likely to renew with a targeted marketing offer.
Another way decision tree technology has been used is for preprocessing data for other prediction algorithms. Because the algorithm is fairly robust with respect to a variety of predictor types, and because it can be run relatively quickly, decision trees can be used on a first pass of a data mining run to create a subset of possibly useful predictors.

CART
CART stands for Classification and Regression Trees and is a data exploration and prediction algorithm. Researchers from Stanford University and the University of California at Berkeley showed how this algorithm could be used on a variety of different problems. In building the CART tree, each predictor is picked based on how well it separates the records with different predictions. For instance, one measure used to determine whether a given split point for a given predictor is better than another is the entropy metric.

CHAID
Another decision tree technology, equally popular to CART, is CHAID, or Chi-Square Automatic Interaction Detector. CHAID is similar to CART in that it builds a decision tree, but it differs in the way it chooses its splits. Instead of the entropy or Gini metrics for choosing optimal splits, the technique relies on the chi-square test used in contingency tables to determine which categorical predictor is furthest from independence with the prediction values. Because CHAID relies on contingency tables to form its test of significance for each predictor, all predictors must either be categorical or be forced into a categorical form.
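To make the entropy split criterion mentioned under CART above more concrete, here is a small sketch, with invented records, that computes the entropy of a set of class labels and the information gain of two candidate splits. This only illustrates the metric itself, not the full CART algorithm (which also considers other measures such as Gini and handles pruning).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(records, split, target):
    """Reduction in entropy of `target` obtained by splitting on predicate `split`."""
    left = [r[target] for r in records if split(r)]
    right = [r[target] for r in records if not split(r)]
    before = entropy([r[target] for r in records])
    after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(records)
    return before - after

# Invented churn records: does a customer renew, given age and handset type?
records = [
    {"age": 22, "handset": "old", "renew": "no"},
    {"age": 25, "handset": "old", "renew": "no"},
    {"age": 31, "handset": "new", "renew": "yes"},
    {"age": 38, "handset": "new", "renew": "yes"},
    {"age": 45, "handset": "old", "renew": "yes"},
    {"age": 52, "handset": "new", "renew": "yes"},
]

# Compare two candidate split questions; the tree keeps the better one.
print(information_gain(records, lambda r: r["age"] < 30, "renew"))          # higher gain
print(information_gain(records, lambda r: r["handset"] == "old", "renew"))  # lower gain
```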
Neural Networks
Neural networks have disadvantages that can be limiting in their ease of use and ease of deployment, but they also have some significant advantages, foremost among them the highly accurate predictive models that can be applied across a large number of different types of problems. To be more precise, the term "neural network" might better be stated as "artificial neural network". True neural networks are biological systems (brains) that detect patterns, make predictions and learn. The artificial ones are computer programs that implement sophisticated pattern detection and machine learning algorithms to build predictive models from large historical databases. Artificial neural networks derive their name from their historical development, which started from the premise that machines could be made to "think" if scientists found ways to mimic the structure and functioning of the human brain on a computer. To explain how neural networks detect patterns in a database, an analogy is often made that they "learn" to detect these patterns and make better predictions in a way similar to the way human beings do. This view is encouraged by the way historical training data is often supplied to the network, for example one record at a time.

Neural networks do "learn" in a very real sense, but under the hood the algorithms and techniques being deployed are not truly different from the techniques found in statistics or other data mining algorithms. It is, for instance, unfair to assume that neural networks can outperform other techniques because they "learn" and improve over time while the other techniques are static. The other techniques in fact "learn" from historical examples in exactly the same way, but often the examples (historical records) to learn from are processed all at once and in a more efficient manner than in neural networks, which often modify their model one record at a time.

Neural Networks for feature extraction
One of the important problems in all of data mining is determining which predictors are the most relevant and most important for building models that are accurate at prediction. A simple example of a feature in problems that neural networks work on is a vertical line in a computer image. The predictors, or raw input data, are just the colored pixels that make up the picture. Recognizing that the predictors (pixels) can be organized in such a way as to create lines, and then using the line as the input predictor, can dramatically improve the accuracy of the model and decrease the time needed to create it. Some features, like lines in computer images, are things humans are already quite good at detecting; in other problem domains it is more difficult to recognize the features. One fresh way that neural networks have been used to detect features is based on the idea that features are, in a sense, a compression of the training database.

Rule Induction
Rule induction is one of the major forms of data mining and is perhaps the most common form of knowledge discovery in unsupervised learning systems. It is also perhaps the form of data mining that most closely resembles the process most people imagine when they think about data mining, namely "mining" for gold through a vast database. The gold in this case would be a rule that is interesting, one that tells you something about your database that you did not already know and probably were not able to articulate explicitly. Rule induction on a database can be a massive undertaking in which all possible patterns are systematically pulled out of the data, and an accuracy and a significance are then attached to them that tell the user how strong the pattern is and how likely it is to occur again. The weakness of rule induction systems is also their strength: they retrieve all possible interesting patterns in the database. This is a strength in the sense that no stone is left unturned, but it can also be a weakness because the user can easily become overwhelmed by such a large number of rules that it becomes difficult to look through all of them. This overabundance of patterns can also be problematic for the simple task of prediction: because all possible patterns are gathered from the database, there may be conflicting predictions made by equally interesting rules. Automating the process of collecting the most interesting rules and of combining the recommendations of a variety of rules is handled well by many of the commercially available rule induction systems on the market today, and is also an area of active research.
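A minimal sketch of rule induction follows: it enumerates every simple "if attribute = value then class" pattern in a small, invented table and reports each rule's accuracy (how often the rule is right when it applies) and coverage (how often it applies), which is one common way of scoring how strong a pattern is. Real systems search far richer rule forms and apply statistical significance tests.

```python
from collections import defaultdict

# Invented shopping-basket style records; "bought_beer" is the target field.
records = [
    {"day": "fri", "item": "diapers", "bought_beer": "yes"},
    {"day": "fri", "item": "diapers", "bought_beer": "yes"},
    {"day": "mon", "item": "diapers", "bought_beer": "no"},
    {"day": "fri", "item": "milk",    "bought_beer": "yes"},
    {"day": "mon", "item": "milk",    "bought_beer": "no"},
    {"day": "tue", "item": "milk",    "bought_beer": "no"},
]

def induce_rules(records, target):
    """Enumerate rules of the form: if attribute == value then target == class."""
    rules = []
    attributes = [a for a in records[0] if a != target]
    for attr in attributes:
        groups = defaultdict(list)
        for r in records:
            groups[r[attr]].append(r[target])
        for value, outcomes in groups.items():
            predicted = max(set(outcomes), key=outcomes.count)
            accuracy = outcomes.count(predicted) / len(outcomes)
            coverage = len(outcomes) / len(records)
            rules.append((attr, value, predicted, accuracy, coverage))
    return rules

for attr, value, predicted, acc, cov in induce_rules(records, "bought_beer"):
    print(f"if {attr}={value} then bought_beer={predicted}  "
          f"(accuracy {acc:.2f}, coverage {cov:.2f})")
```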
Discovery
The claim to fame of rule induction systems lies much more in knowledge discovery in unsupervised learning systems than in prediction. These systems provide both a very detailed view of the data, in which significant patterns that occur only a small portion of the time and that can only be found by looking at the detailed data are exposed, and a broad overview of the data, in which some systems seek to deliver to the user an overall view of the patterns contained in the database. These systems thus offer a nice combination of both micro and macro views.

Prediction
After the rules are created and their value is measured, there is also a call for using the rules to perform prediction; rules that are never used to predict would prove almost useless. Each rule by itself can perform prediction: the consequent of the rule is the target, and the accuracy of the rule is the accuracy of the prediction. But because rule induction systems produce many rules for a given antecedent or consequent, there can be conflicting predictions with different accuracies. This is an opportunity for improving the overall performance of the system by combining the rules. This can be done in a variety of ways, for example by summing the accuracies as if they were weights, or simply by taking the prediction of the rule with the maximum accuracy.

8. Applications for Data Mining:
A wide range of companies have deployed successful applications of data mining. Early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, but the technology is applicable to any company looking to leverage a large data warehouse to better manage its customer relationships. Two critical factors for success with data mining are a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied. Some successful application areas include:
- A pharmaceutical company can analyze its recent sales force activity and its results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.
- A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.
- A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company can build a unique segmentation identifying the attributes of high-value prospects. Applying this segmentation to a general business database such as those provided by Dun & Bradstreet can yield a prioritized list of prospects by region.
- A large consumer packaged goods company can apply data mining to improve its sales process to retailers.
Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach its target customer segments.

Each of these examples has a clear common ground: they leverage the knowledge about customers gained from a data warehouse to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on their most important (most profitable) customers and prospects, and design targeted marketing strategies to best reach them.

9. Data Mining Case Studies:
The following case studies have been written to give the reader an in-depth, below-the-surface description of data mining in the real world. These real-life situations will allow the reader to relate to actual data mining being done today and to understand more about the concepts of data mining. The following are three examples of data mining in the real world.

9.1 Energy Usage in a Power Plant:
ICI Thornton Power Station in the United Kingdom produces steam for a range of processes on the site and generates electricity in a mix of primary pass-out and secondary condensing turbines. Total power output is approximately 50 MW. Fuel and water costs amount to about £5 million a year, depending on site steam demands. The objective of the data mining project was to identify opportunities to reduce power station operating costs. Costs include the cost of fuel (gas and oil to fire the boilers) and water (to make up for losses); electricity and steam are sold and represent revenue. Rule induction data mining was used for the project, with the outcome of the analysis being the net cost of steam per unit of steam supplied to the site (the cost of the product). The attributes fall into two categories: disturbances, such as ambient temperature and the site steam demand, over which the operators have no control, and control settings, such as pressures and steam rates. The benefits derived from the generated patterns include the identification of opportunities to improve process revenue considerably (by up to 5%). Mostly these involve altering control set points, which can be implemented without any further work and at no extra cost. Implementing some of the other opportunities identified would need additional controls and instrumentation; the payback period for the additional controls would be a few months.

9.2 Gas Processing Plant:
This project was carried out for an oil company and was based in a remote US oil field location. The process investigated was a very large gas processing plant which produces useful products, such as natural gas liquids, from the gas coming from the wells. The aim of the study was to use data mining techniques to analyze historical process data to find opportunities to increase the production rates, and hence increase the revenue generated by the process. Approximately 2000 data measurements for the process were captured every minute. Rule induction data mining was used to discover patterns in the data. The business goal for data mining was the revenue from the gas processing plant, while the attributes of the analysis fell into two categories:
- Disturbances, such as wind speed and ambient temperature, which have an impact on the way the process is operated and performs, but which have to be accepted by the operators.
- Control set points, which can be altered by the process operators or automatically by the control systems, and which include temperature and pressure set points, flow ratios, control valve positions, etc.
The tree generated by rule induction reveals patterns relating the revenue from the process to the disturbances and control settings of the process. The benefits derived from the generated patterns include the identification of opportunities to improve process revenue considerably (by up to 4%).

9.3 Biological Applications of Multi-Relational Data Mining:
Biological databases contain a wide variety of data types, often with rich relational structure. Consequently, multi-relational data mining techniques are frequently applied to biological data. This section presents an application of multi-relational data mining to biological data. Consider storing in a database the operational details of a single-cell organism. At a minimum, one would need to encode into the data warehouse or database the following:
- Genome: the DNA sequence and gene locations.
- Proteome: the organism's full complement of proteins, not necessarily a direct mapping from its genes.
- Metabolic pathways: linked biochemical reactions involving multiple proteins.
- Regulatory pathways: the mechanisms by which the expression of some genes into proteins is regulated.
The database contains other information besides the items listed above, such as operons. The data warehouse used with the data mining project stores data about genomes (DNA sequences and gene locations). From the data stored, patterns and relationships between the genes and genomes are fed into a neural network along with certain genetic laws, and gradually the data mining tool learns the framework and network of genes and genomes. The data mining tool extracts patterns and correlations hidden in the stored data and displays to the human analysts the dependencies and relationships in the genetic material. These hidden patterns could never have been discovered by humans alone because of the huge, information-intensive volume of stored data. This genetic biological application is a perfect application for data mining: it is an excellent example of a huge amount of data that is exceedingly hard to analyze using techniques other than data mining. Data mining has proved exceedingly useful in fields like this, where the amount of data is far beyond the scope of human comprehension.

10. Conclusion of this Research:
The basic conclusion that can be drawn from this research into the field of data mining, with all its concepts, techniques and implementations, is that it is a complex and growing field. It is being used more and more in the real world to extract economic value, knowledge and learning from the data stores we already have, which sometimes sit idle. The details of some of the data mining techniques have been omitted from this paper because they go down to the lowest level of abstraction, which is unnecessary for describing the overall concept of data mining.

References
1. Kurt Thearling, "An Introduction to Data Mining", www.thearling.com
2. Alex Berson, Stephen Smith and Kurt Thearling, "An Overview of Data Mining Techniques"
3. Kurt Thearling, "An Overview of Data Mining at Dun & Bradstreet"
4. G. J. Huba, "CHAID", http://www.themeasurementgroup.com/Definitions/chaid.htm
5. Kurt Thearling, "Understanding Data Mining: It's All in the Interaction"
6. Dr Akeel Al-Attar, "Data Mining Case Studies", http://www.attar.com/tutor/mining.htm
7. "GDM – Genome Data Mining Tools", http://gdm.fmrp.usp.br/projects.php