Course Name: Business Intelligence
Year: 2009

Knowledge Discovery and Data Mining
19th Meeting

Source of this Material
(2). Loshin, David (2003). Business Intelligence: The Savvy Manager’s Guide. Chapter 14

Bina Nusantara University

The Business Case

Data mining fills a niche in the BI arena where the data consumer is not necessarily sure what to look for. The kinds of knowledge that are discovered may or may not be directed toward a specific goal, but the methods of data mining are driven by finding patterns in the data that reflect more meaningful bits of knowledge. It would be unwise to build a BI program and ignore the promise of a data mining component.

Data Mining and The Data Warehouse

Knowledge discovery is a process that requires a lot of data, and that data needs to be in a reliable state before it can be subjected to the data mining process. An enterprise data warehouse whose contents have been properly validated, cleaned, and integrated provides the best source of data for knowledge discovery. Not only is the warehouse likely to incorporate the breadth of data needed for this component of the BI process, it probably also contains the historical data needed. Because a lot of data mining relies on using one set of data to train a process that is then tested on another set of data, having the historical information available for testing and evaluating hypotheses makes the warehouse even more valuable.

The Virtuous Cycle

As Berry and Linoff state in their book Data Mining Techniques, the process of mining data can be described as a virtuous cycle. The virtue lies in the continuous improvement of a business process, driven by discovering actionable knowledge and taking the actions prescribed by those discoveries.

• Identify The Business Problem
One of the more difficult tasks is identifying the business problem that needs to be solved.
Very often, other aspects of the BI program can feed into this process. Other kinds of business problems are actually part of the general business cycle.

• Mine The Data For Actionable Information
Depending on the problem, a number of different data mining techniques can be used to look for actionable knowledge. No matter which techniques are used, the process is the same: assemble the right set of information, prepare that information for mining, apply the algorithms, and analyze the results to find knowledge that is actionable.

The Virtuous Cycle (cont…)

• Take The Action
The next logical step is to take the actions suggested by the discoveries made during the data mining process. Keep track of which actions were taken, because that leads into the next stage.

• Measure Results
Measuring the results of the actions taken is important because it refines the process of addressing the original business problem. The goal here is to look at the expected response to each specific action and to determine the quality of each action.

Directed Versus Undirected Knowledge Discovery

There are two different approaches to knowledge discovery. The first applies when we already have the problem we want to solve and use data mining methods to discover the relationship between the variables under scrutiny in terms of the other available variables. This is called directed knowledge discovery, as opposed to undirected knowledge discovery.

Undirected knowledge discovery is the process of using data mining techniques to find interesting patterns within a data set as a way to highlight some potentially interesting issue. This approach is more likely to be used to recognize behavior or relationships, whereas directed knowledge discovery is used primarily to explain or describe those relationships once they have been found.

Six Basic Tasks of Data Mining

• Classification
A frequent data mining task is classification, which involves examining the attributes of a particular object and assigning it to a defined class. For example, classification can be used to divide a customer base into best, mediocre, and low value customers.

• Estimation
Estimation is the process of assigning a continuous numeric value to an object. Estimation can be used as part of the classification process (such as using an estimation model to guess a person’s annual salary as part of a market segmentation process). One benefit of estimation is that, because a value is assigned to a continuous variable, the resulting assignments can be ranked by score.

• Prediction
The subtle difference between prediction and the previous two tasks is that prediction is the attempt to classify objects according to some expected future behavior. Classification and estimation can be used for prediction by using historical data, where the classification is already known, to build a model (this is called training). That model can then be applied to new data to predict future behavior.

Six Basic Tasks of Data Mining (cont…)

• Affinity Grouping
Affinity grouping is the process of evaluating relationships or associations between data elements that demonstrate some kind of affinity between objects.

• Clustering
Clustering is the task of taking a large collection of objects and dividing them into smaller groups of objects that exhibit some similarity. The difference between clustering and classification is that, during the clustering task, the classes are not defined beforehand. Clustering can be used in concert with other data mining tasks as a way of identifying a business problem area to be explored further.
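As a rough illustration of the training idea described under Prediction, using the best, mediocre, and low value customer classes mentioned under Classification, the following Python sketch builds a model from historical records whose class is already known and then applies it to new data. The spending figures and the function names `train` and `predict` are invented for illustration, and the nearest-mean rule is only one simple way to build such a model, not a method prescribed by the source.

```python
from collections import defaultdict

def train(history):
    """Build a model from historical records whose class is already known.

    history: list of (annual_spend, label) pairs (hypothetical data).
    Returns a dict mapping each label to the mean spend of its members.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for spend, label in history:
        totals[label] += spend
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

def predict(model, spend):
    """Assign a new customer to the class whose mean spend is closest."""
    return min(model, key=lambda label: abs(model[label] - spend))

# Hypothetical historical data: (annual spend, known class).
history = [
    (9500, "best"), (8700, "best"),
    (4200, "mediocre"), (3900, "mediocre"),
    (600, "low"), (900, "low"),
]

model = train(history)
print(predict(model, 8000))   # closest to the "best" mean
print(predict(model, 1500))   # closest to the "low" mean
```

In practice the historical records would come from the data warehouse, and part of them would be held back to test the model before it is applied to new customers.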

• Description
The last of the six tasks is description, which is the process of trying to characterize what has been discovered or trying to explain the results of the data mining process. Being able to describe a behavior or a business rule is another step toward an effective intelligence program that can identify knowledge, articulate it, and then evaluate actions that can be taken.

Data Mining Techniques

Although many techniques are used for data mining, this section enumerates some that are frequently used, along with examples of how each is used.

• Market Basket Analysis
Market basket analysis is the process of clustering objects to look for groups of objects that frequently appear together. It is a good way to look for items that appear together or for a set of discrete events that take place in a particular sequence.

• Memory-Based Reasoning
Memory-based reasoning (MBR) is a process of using one data set to create a model from which predictions or assumptions can be made about newly introduced objects. There are two basic components to an MBR method. The first is the similarity (sometimes called distance) function, which measures how similar the members of any pair of objects are to each other. The second is the combination function, which combines the results from the set of neighbors to arrive at a decision.

Data Mining Techniques (cont…)

• Cluster Detection
There are two approaches to clustering. The first approach is to assume that a certain number of clusters are already embedded in the data; the goal is to break the data up into that number of clusters. The other approach, called agglomerative clustering, does not assume any predetermined number of clusters. Instead, every item starts out in its own cluster, and an iterative process attempts to merge clusters, again through a process of computing similarity.
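The agglomerative approach described under Cluster Detection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the data are one-dimensional numbers, similarity is measured as the gap between the closest members of two clusters (single linkage), and merging stops at a threshold called `max_gap`. The function names, the threshold, and the sample values are all hypothetical and are not taken from the source.

```python
def single_link_distance(a, b):
    """Distance between two clusters: the gap between their closest members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, max_gap):
    """Merge clusters bottom-up until the closest pair is farther apart than max_gap.

    points:  list of numbers (hypothetical one-dimensional data).
    max_gap: stopping threshold (an assumption of this sketch).
    """
    clusters = [[p] for p in points]      # every item starts in its own cluster
    while len(clusters) > 1:
        # Find the pair of clusters that are most similar (smallest distance).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_gap:
            break                         # remaining clusters are too dissimilar
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

result = agglomerate([1, 2, 3, 10, 11, 25], max_gap=2)
print(result)   # [[1, 2, 3], [10, 11], [25]]
```

Note that the number of clusters is an outcome of the process rather than an input: with a larger `max_gap` the same data would collapse into fewer clusters.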

• Link Analysis
Link analysis is the process of looking for and establishing links between objects within a data set, as well as characterizing the weight associated with any link between two objects. Link analysis is useful for analytical applications that rely on graph theory for drawing conclusions. Another analytical area for which link analysis is useful is process optimization.

• Rule Induction
Part of the knowledge discovery process is the identification of business rules that are embedded within data. The methods associated with rule induction are used for this discovery process. One approach to rule discovery is the use of decision trees.

Data Mining Techniques (cont…)

Another approach to rule induction is the discovery of association rules. Association rules specify a relation between attributes that appears more frequently than would be expected if the attributes were independent.

• Neural Networks
A neural network is an attempt to represent the model of a human brain as a collection of individual neurons connected within a network. A neural network essentially captures a set of statistical operations: a weighted combination function is applied to all inputs to a neuron to compute a single output value, which is then propagated to other neurons within the network.

Management Issues

Knowledge discovery and data mining are very valuable components of the BI program. To maintain the high value of the knowledge discovery process, keep the following management issues in mind.

• Buy Versus Build
The recommended approach is to determine which kinds of data mining techniques are most appropriate for the business problems that arise within your organization, then buy the tools that support those techniques and hire experienced engineers to work with those tools.

• Data Preparation
One issue that can destroy the effectiveness of any data mining activity is using data that has not been properly prepared for the task at hand.

• Understanding The Results
Some of the techniques described in this chapter are better suited for understanding results than others. Remember: to successfully draw conclusions from the results of data mining, you should have a good understanding of the data.

Management Issues (cont…)

• Managing Business Client Expectations
Remember that data mining is an exploratory process and that sometimes what we discover during an exploration is that there is nothing to discover. Although data mining can be a powerful value-adding technology, it does not always provide the expected magic bullet solution to all problems.

• Remember The Virtuous Cycle
The data mining and knowledge discovery process is a virtuous cycle, and the process will not have as much value if you do not identify actions to take, actually take those actions, and then measure the results. Determining which techniques provide the best insight into a business problem and figuring out the best ways to exploit discovered knowledge are the critical components of data mining success.

End of Slide