ICT619 Intelligent Systems
Topic 7: Data Mining

Introduction
Business Applications of Data Mining
Data Mining Activities
Data Mining Techniques
How to Apply Data Mining
Data Mining Development Methodology

References
Berry, M., & Linoff, G. Mastering Data Mining. Wiley Computer Publishing, New York, 2000.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. Discovering Data Mining: From Concept to Implementation. Prentice Hall, Englewood Cliffs, NJ, 1998.
Dhar, V., & Stein, R. "Deriving Rules from Data", in Seven Methods for Transforming Corporate Data into Business Intelligence. Prentice Hall, 1997, pp. 167-189, 251-258.
Ganti, V., Gehrke, J., & Ramakrishnan, R. "Mining Very Large Databases", IEEE Computer, Vol. 32, No. 8, August 1999, pp. 38-45.
Hirji, K. "Exploring Data Mining Implementation", Communications of the ACM, Vol. 44, No. 7, July 2001, pp. 87-93.
Web site on data mining and web mining: http://www.kdnuggets.com/software/suites.html

533562489 1 ICT619 S2-05

Introduction

Why data mining? The advent of information technology in the last four decades has resulted in an abundance of data generated or captured electronically. Among the sources of this proliferation of data are point-of-sale (POS) devices such as bar code scanners, customer call detail databases, and web log files in e-commerce. While the generation, storage and transmission of data are becoming more and more efficient, organisations are accumulating huge amounts of mostly day-to-day transaction data, stored away in database files for possible future use. A more organised and useful approach to storing data has given rise to organisational data warehouses, which are central repositories of cleaned and transformed data. As an example of the magnitude of business data, the US Library of Congress holds 17 million books, equivalent to about 17 terabytes of data, which is also the reported volume of data in the database of the UPS shipping company.
Data is mostly collected to improve the efficiency of the underlying operations, not for analysis and prediction. It has become obvious that businesses can gain competitive advantage if useful information on market and customer behaviour can be extracted by "mining" the data. Such information may reveal important underlying trends, associations or patterns in market behaviour, which can help answer questions like: "Based on past buying behaviour, which customers should be targeted for direct marketing?"

According to (Hirji 2001), "A practical and applied definition of data mining is the analysis and non-trivial extraction of data from databases for the purpose of discovering new and valuable information, in the form of patterns and rules, from relationships between data elements."

The term data mining stands for a process, rather than a product, of exploring and analysing data to discover new and useful information. The concept is not new: data analysis using statistical methods such as regression is a well-established practice. But statistical methods suffer from a scalability problem. They work well with relatively small data sets and a manageable number of variables, but not with millions of records and thousands of variables. The advent of intelligent techniques such as artificial neural networks and decision trees has made it possible to perform data mining on large volumes of data more effectively and efficiently. Current interest in data mining, in particular for gaining competitive advantage in business, is growing. Its application in areas such as direct target marketing campaigns, fraud detection, and the development of models to aid in financial predictions is likely to intensify in the coming years.
Data mining projects involve both the utilisation of established algorithms from machine learning, statistics, database systems and information visualisation, and the development of new methods and algorithms targeted at large data mining problems. Nowadays data mining efforts have gone beyond crunching databases of credit card usage or stored transaction records, and focus also on data collected in health care, art, design, medicine, biology and other areas of human endeavour.

Online Analytical Processing (OLAP), Data Mining and Knowledge Discovery

Unlike data warehouses, which aim to present a single centralised view of the data in an organisation, OLAP databases (known as data marts) offer improved speed and responsiveness by limiting themselves to a single view of the data specific to the department that owns the database. OLAP databases are organised in a multidimensional structure (called a cube) built on dimensions of the business such as time, product type and geography. This enables the analyst to segregate and analyse data along the different aspects of business activity. A data warehouse can sometimes consist of a collection of data marts.

Both data mining and OLAP are tools for decision support. While OLAP attempts to find out what is happening and how it is happening, data mining aims to uncover the underlying characteristics of the data in order to predict what is likely to happen. The difference between OLAP and data mining is best understood in terms of the questions they allow the user to ask. For example: "How has the sale of product X varied from quarter to quarter during the past year over all the geographic regions in which it is sold?" (OLAP), versus "Which customers are more likely to buy product X?" (data mining).
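An OLAP query of the kind above amounts to aggregating a measure along chosen cube dimensions. A minimal sketch of that idea, where the sales records, field names and figures are all invented for illustration:

```python
# Contrast an OLAP-style query with a data mining question, using a toy
# list of sales records (all field names and values are illustrative).
from collections import defaultdict

sales = [
    {"product": "X", "quarter": "Q1", "region": "East", "amount": 120.0},
    {"product": "X", "quarter": "Q1", "region": "West", "amount": 80.0},
    {"product": "X", "quarter": "Q2", "region": "East", "amount": 150.0},
    {"product": "Y", "quarter": "Q1", "region": "East", "amount": 60.0},
]

# OLAP: slice the cube along the time and geography dimensions, answering
# "How has the sale of product X varied from quarter to quarter per region?"
cube = defaultdict(float)
for rec in sales:
    if rec["product"] == "X":
        cube[(rec["quarter"], rec["region"])] += rec["amount"]

# cube now holds the descriptive answer directly:
# {("Q1", "East"): 120.0, ("Q1", "West"): 80.0, ("Q2", "East"): 150.0}
```

The data mining question, by contrast, cannot be answered by aggregation alone: it requires fitting a predictive model over customer attributes, as the techniques later in this topic illustrate.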
Data mining belongs to the wider field of knowledge discovery, which concerns the automated extraction of useful information from a diverse range of sources including textual, multimedia and web data. Data mining is knowledge discovery applied predominantly to commercial databases.

The main objectives of this topic are to:
- Understand the role of data mining in business
- Distinguish between different data mining techniques
- Understand how to go about making use of data mining

Business Applications of Data Mining

The data mining segment is one of the fastest growing segments of the business intelligence market. Companies are increasingly investigating the potential of data mining technology to deliver competitive advantage, and it is coming to be regarded as an integral and necessary component of an organisation's portfolio of analytical techniques. The use of data mining techniques as an intelligent system tool in a number of areas of business is outlined below.

Data mining for marketing

Many of the most successful applications of data mining are in marketing, where it is used both to reduce cost and to increase revenue. Databases contain collections of data on prospective targets of a marketing campaign. These data relate to customer behaviour and are obtained from operational systems such as point-of-sale systems, or from commercial databases available for a fee. Data mining can reduce marketing costs by eliminating calls and letters to people who are unlikely to respond to an offer.

Data mining for customer relationship management

Good customer relationship management involves anticipating customers' needs and responding to them proactively. In a large enterprise this is only practical through the application of data mining techniques to the records in its customer database.

Data mining in R&D

Data mining can lower costs during the research and development phase of the product life cycle.
One example is the pharmaceutical industry, which is characterised by the generation of large amounts of test data. These data have to be analysed using sophisticated prediction techniques to determine which chemicals are likely to produce useful drugs. An entire discipline, bioinformatics, has grown up to mine and interpret the data generated by high-throughput screening and other biological sources.

Data Mining Activities

The objectives of data mining fall into two broad groups: directed data mining and undirected data mining. In directed data mining, we know what we are looking for: we aim to find the value of a pre-identified target variable in terms of a collection of input variables, e.g. classifying insurance claims. Undirected data mining takes a bottom-up approach: it finds patterns in the data and leaves it to the user to determine the significance of those patterns. For example, an analysis of customer profiles may identify groups of customers with similar buying patterns.

The different types of data mining tasks are listed below:
- Classification
- Estimation
- Prediction
- Finding affinity grouping or association rules
- Clustering
- Description and visualisation

Classification, estimation and prediction are examples of directed data mining, while the remaining three tasks belong to the undirected data mining group.

Classification

Classification assigns a given object to a predefined category (class) based on the object's attributes (features). In the business data mining context, the objects to be classified are generally represented by database records.
Examples of classification tasks include:
o Assigning keywords to articles
o Classifying credit applicants as low, medium or high risk
o Assigning customers to predefined customer segments

Estimation

While classification produces discrete outcomes (yes or no; low, medium, high), estimation deals with continuously varying outcomes, such as income, the probability of a customer leaving (known in data mining circles as churn) or the average number of children in a family. The outcomes of estimation can also be used for classification, by ranking the values and applying a threshold to form categories.

Prediction

A prediction is a classification or estimation task performed to predict some future behaviour. Examples include:
o Predicting which customers will churn in the next six months
o Predicting the size of a balance that will be transferred

Finding affinity grouping or association rules

The task of affinity grouping is to find out which things go together. The prototypical example is finding which items go together in a supermarket shopping trolley. Supermarkets use such information for arranging items on shelves or in catalogues, and for identifying cross-selling opportunities (e.g. people who buy product X also buy products Y and Z).

Clustering

Clustering segments a group of diverse records into subgroups, or clusters, of similar records. There are no predefined classes in clustering; records are grouped on the basis of similarities in their attributes, e.g. people with similar buying habits. Different clusters of music CD purchases may indicate different age or cultural sub-groups. It is left to the data miner to interpret such clusters and decide what use can be made of them.

Description and visualisation

Data mining can be used simply to describe a database, to help increase our understanding of the people, products or processes that produced the data in the first place.
A good description can also suggest an explanation for the behaviour described, or at least indicate where to look for one. Data visualisation is a powerful form of descriptive data mining. Although visualisation is not meaningful in all cases, it can be very effective in explaining things by exploiting our ability to make use of visual clues.

Data Mining Techniques

The field of data mining spans a number of disciplines, including statistics, databases and machine learning. In this topic we aim to find out when to apply data mining techniques, how to interpret their results, and how to evaluate their performance. To do so, we need a basic understanding of the inner workings of these techniques. The three major approaches to data mining are decision trees, automatic cluster detection, and artificial neural networks (supervised and unsupervised).

Decision Trees

A decision tree may be regarded as the visual representation of a reasoning process, and is particularly suitable for solving classification problems. A decision tree consists of nodes, branches and leaves. Each internal node represents an attribute, called a splitting attribute, and each leaf node is labelled with a class label. The class label is decided by the class of the records that ended up in that leaf during training. A leaf node may instead hold a value, such as the average of the values of those records. Each edge originating from an internal node is labelled with a splitting predicate that involves only that node's splitting attribute. The splitting predicates have the property that any record takes a unique path from the root to exactly one leaf node.

[Figure 1: Sample decision tree for a catalogue mailing, with internal nodes splitting on Salary (<= 50K / > 50K), Age (<= 40 / > 40) and Employment (Academia, Industry / Self), and leaves labelled Group A to Group D.] (Ganti et al.
1999).

To help us understand how a decision tree works, we can view each record with N attributes as a point in an N-dimensional record space. Each branch in the decision tree is a test on a single variable that splits the space into two or more regions. After the split at the root node of the tree, each of these regions will contain a mix of records with different values for most, if not all, attributes except the one tested at the root. With each successive test and split, the resulting regions become more and more segregated, with increasing homogeneity among the records in each region. Ultimately, the leaf nodes contain the purest batches of records. For example, in the decision tree of Figure 1, any self-employed person aged less than 41 and earning a salary of more than $50,000 will be classified as belonging to group A.

It is possible to build a decision tree that correctly classifies every single record, provided no two records have the same set of attribute values (input variables) but belong to two different classes (target variables). This is not desirable, though, as it gives rise to the problem known as overfitting. Such a tree describes the training data very well but is unlikely to generalise to new data sets. As an example, a large decision tree could be built to identify every inhabitant of a small town, with each leaf node labelled with a name (assuming no two persons in the town share a name). But this tree would not be able to do anything useful, such as determining whether a given person belongs to the group of overweight male teenage students (there is no representative leaf node). To prevent overfitting, a test data set is used to prune the decision tree once it has been built from the training data set.
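The reasoning captured by a decision tree translates directly into nested tests. As an illustrative sketch, the catalogue mailing tree of Figure 1 might be coded as follows; only the group A case is stated explicitly in the text, so the pairing of the other branches with groups B, C and D is one plausible reading of the figure:

```python
def classify(record):
    """Assign a record to a group using the catalogue mailing tree of
    Figure 1 (branch-to-leaf pairing partly reconstructed from the text)."""
    if record["salary"] <= 50_000:
        return "D"                 # low-salary branch (assumed leaf)
    if record["age"] > 40:
        return "B"                 # high salary, older (assumed leaf)
    if record["employment"] == "Self":
        return "A"                 # stated in the text: self-employed,
                                   # age <= 40, salary > 50K, is group A
    return "C"                     # Academia or Industry (assumed leaf)

# The worked example from the text: a self-employed person aged under 41
# earning more than $50,000 falls into group A.
print(classify({"salary": 60_000, "age": 35, "employment": "Self"}))  # A
```

Each `if` corresponds to a splitting predicate on one attribute, so every record follows exactly one path from the root test to a single return statement, mirroring the unique root-to-leaf path property described above.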
There are a number of different types of decision trees, depending mainly upon the number of splits allowed at each level, how these splits are chosen as the tree is built, and how the tree is pruned to prevent overfitting. More broadly, decision trees can be grouped into classification trees (whose leaves represent classes) and regression trees (whose leaves represent numeric values). There are various algorithms for building decision trees; the most notable among them are CHAID, C4.5/C5.0 and CART. Typical data mining software tools these days allow the user to choose among several splitting criteria and pruning strategies, and to control parameters such as maximum tree depth, to approximate any of these algorithms.

How decision trees are built

Decision trees are built through a process known as recursive partitioning, an iterative process of splitting the data up into partitions (regions of the record space). Initially all the records are in the training set, the preclassified records that are used to determine the structure of the decision tree. An algorithm splits up the data, trying every possible binary split on every field of the records, and chooses the split that partitions the data into two parts that are purer than the original training set. The splitting process is then applied to each of the new parts, and so on, until no more useful splits can be found.

The most important task in building a decision tree is deciding which of the attributes (independent fields in a record) gives the best split. The best split is one that creates partitions in which a single class predominates. The measure used to evaluate a potential splitter is the reduction in diversity (equivalently, the increase in purity). There are several ways of calculating the index of diversity for a set of records. One measure, called the Gini index in data mining circles, is given for two classes by the formula 2p1(1 - p1), where p1 is the probability of class one.
Two other diversity indexes are:
- min(p1, p2)
- -(p1 log p1 + p2 log p2), known as entropy.

To choose the best splitter at a node, the decision tree algorithm considers each input field in turn and tries every possible split. The diversity measure is calculated for the two new partitions, and the best split is the one with the largest reduction in diversity. The field that yields the best split is chosen as the splitter for that node. If a field takes on only one value, it is eliminated from consideration, since there is no way it can be used to create a split. When no split can be found that significantly decreases the diversity of a given node, that node is a leaf node. Eventually only leaf nodes remain, and the full decision tree has been grown.

As mentioned earlier, the full decision tree needs to be pruned to improve its performance. Pruning is done by removing leaves and branches (edges leading to leaves) that fail to generalise. There are a number of pruning methods, one of which uses the actual performance of the tree on a separate set of preclassified data, called the test set. The tree is pruned back to the subtree that minimises the error on the test set.

Application of decision trees

Decision trees are useful when the data mining task is classification of records or prediction of outcomes, and when the goal is to assign each record to one of a few broad categories. Decision tree methods are also chosen for their ability to generate understandable rules, which can be explained and translated into SQL or a natural language. For any classified record, the rule for its classification can be generated simply by tracing the path from the root to the leaf where the record ended up; most decision tree tools provide this capability. Like artificial neural networks, decision trees represent a class of machine learning algorithm, since they are capable of generating rules from training data.
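The splitter search described above can be sketched for the simplest case of two classes and a single numeric field. This is only a sketch under those assumptions; the function names, the 0/1 class encoding and the toy data are all invented for illustration:

```python
import math

def gini(p1):
    """Gini diversity for a two-class node: 2 * p1 * (1 - p1)."""
    return 2.0 * p1 * (1.0 - p1)

def entropy(p1):
    """Entropy diversity for a two-class node; 0 when the node is pure."""
    if p1 in (0.0, 1.0):
        return 0.0
    p2 = 1.0 - p1
    return -(p1 * math.log2(p1) + p2 * math.log2(p2))

def best_split(records, field, target):
    """Try every binary split on `field` and return (threshold, reduction)
    for the split with the largest size-weighted reduction in Gini
    diversity.  A field with only one value yields (None, 0.0)."""
    n = len(records)
    p_parent = sum(r[target] for r in records) / n
    best = (None, 0.0)
    for threshold in sorted({r[field] for r in records}):
        left = [r for r in records if r[field] <= threshold]
        right = [r for r in records if r[field] > threshold]
        if not left or not right:
            continue
        p_left = sum(r[target] for r in left) / len(left)
        p_right = sum(r[target] for r in right) / len(right)
        child = (len(left) * gini(p_left) + len(right) * gini(p_right)) / n
        reduction = gini(p_parent) - child
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

# Toy data: buyers (target = 1) cluster at higher incomes, so the best
# split on income should separate the two classes cleanly.
data = [{"income": 20, "buy": 0}, {"income": 25, "buy": 0},
        {"income": 60, "buy": 1}, {"income": 70, "buy": 1}]
threshold, gain = best_split(data, "income", "buy")
```

Here the split at income <= 25 produces two pure partitions, so the diversity falls from 0.5 (a 50/50 mix) to 0, the largest possible reduction for this node.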
One major difference between neural networks and decision trees, however, is that unlike a neural net, whose rules (or input-output mappings) are implicit in its weights, the rules in a decision tree are explicit.

Automatic Cluster Detection

Cluster detection aims to discover structure in a complex data set as a whole, in order to carve it up into simpler groups. Examples of clustering are finding products that should be grouped together in a catalogue, or identifying groups of customers with similar tastes in music. There are many methods for finding clusters in data; a prominent one is described below.

K-means clustering

The K-means clustering algorithm is available in a wide variety of commercial data mining tools. It divides the data set into a predetermined number, k, of clusters, initially centred at random points in the record space. Records are assigned to clusters through an iterative process that moves the cluster means (also called cluster centroids) around until each one is actually at the centre of some cluster of records.

In the first step, k data points are selected, more or less arbitrarily, to be the seeds. Each seed is an embryonic cluster with only one element. In the example shown in Figure 2, k is 3.

[Figure 2: Initial cluster seeds, Seed 1 to Seed 3 (from Berry & Linoff 2000).]

In the second step, each record is assigned to the cluster whose centroid is nearest to it. This forms the three clusters shown in Figure 3, separated by the intercluster boundaries.

[Figure 3: Initial clusters and intercluster boundaries (from Berry & Linoff 2000).]

The centroids of the resulting clusters are then recalculated, the boundaries move, and records are reassigned. Note that the boxed record, which was assigned to cluster 2 (seed 2) initially, now becomes part of cluster 1.

[Figure 4: New clusters, their centroids marked by crosses, and intercluster boundaries (from Berry & Linoff 2000).]
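The assign-then-recompute loop illustrated in Figures 2 to 4 can be sketched in a few lines. The data points, the seed selection and the helper names below are all illustrative:

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Mean of each coordinate over a cluster of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(records, k, iterations=20, seed=0):
    """Sketch of k-means: pick k seeds, assign each record to the nearest
    centroid, move each centroid to the mean of its cluster, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(records, k)          # step 1: arbitrary seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for rec in records:                     # step 2: nearest centroid
            nearest = min(range(k),
                          key=lambda i: euclidean(rec, centroids[i]))
            clusters[nearest].append(rec)
        centroids = [centroid(c) if c else centroids[i]   # step 3: move
                     for i, c in enumerate(clusters)]     # the centroids
    return centroids, clusters

# Two obvious groups in the plane; with k = 2 the centroids settle near
# (1/3, 1/3) and (31/3, 31/3), the means of the two groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
```

A fixed iteration count keeps the sketch short; production implementations instead stop when an iteration no longer changes any assignment.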
The centroid of a cluster of records is calculated by taking the average of each field over all the records in that cluster. For measuring the distance between a record and a cluster's centroid, the Euclidean distance(1) is most commonly used by data mining software. In the k-means method, the original choice of the value of k determines the number of clusters that will be found. Unless advance knowledge is available about the likely number of clusters, experimentation with different values of k is needed. Best results are obtained when k matches the underlying structure of the data.

Interpreting clusters

A strength of automatic cluster detection is that it is an undirected data mining technique: we can look for something useful without knowing what we are looking for. But this also means we may not recognise it when we find something useful! The most frequently used approaches to understanding clusters are:
- Building a decision tree with the cluster labels as target variables, and using it to derive rules explaining how to assign new records to the correct cluster.
- Using visualisation to see how the clusters are affected by changes in the input variables.
- Examining the differences in the distributions of variables from cluster to cluster, one variable at a time.

Application of clusters

Cluster detection is used when it is suspected that there are natural groupings in the data, which may represent groups of customers or products that have a lot in common with each other. These may turn out to be commonly occurring customer segments for which customised marketing approaches are justified. Clustering is also useful when there are many competing patterns in the data, making it hard to spot any single pattern. Creating clusters of similar records reduces the complexity within clusters, so that other data mining techniques are more likely to succeed.

(1) The Euclidean distance between two points P(x1, x2, ..., xn) and Q(y1, y2, ..., yn) in n-dimensional space is sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2).

Artificial Neural Networks

Going back to the main data mining activities mentioned earlier, classification, estimation and prediction are the three types of task performed in directed data mining, where a (dependent) variable is described in terms of the values of a group of other (independent) variables. In practice, classification, where we look for a categorical answer (yes/no; low, medium, high), can be treated as an estimation problem in which a threshold is applied to the estimated output values to arrive at categories. Prediction, in turn, can be viewed as an estimation or classification task with the difference that the estimated value or class can only be verified at some future point.

As we found in Topic 3, the main generic application of artificial neural networks is pattern recognition or classification, which assigns an input object to a class based on the values of its attributes. The artificial neural network model best suited to classification is the backpropagation network (or multilayer perceptron), which learns to classify data through supervised training. The model particularly suited to the data mining task of clustering is the Kohonen net, or self-organising map (SOM), whose learning algorithm is unsupervised. Groups of similar input pattern vectors (data records) that are near each other in the N-dimensional record space are mapped to neurons that are also close to one another in the output layer, thus forming representative clusters of similar data records. The SOM can therefore serve as a clustering tool as well as a visualisation tool for high-dimensional data. See Topic 4 (ANN) for more details on both the backpropagation and SOM neural networks. SOMs have been claimed to be often more effective than classical clustering algorithms such as k-means.
While clustering techniques often have very restrictive assumptions or a limited ability to find complex shapes, SOMs are much better able to isolate clusters in high-dimensional space, particularly clusters that are not tight little balls of data but twisting, curving regions of N-dimensional space. SOMs take data in N-space and produce graphs in two-dimensional space that reveal key associations among the original input vectors in N-space.

Artificial neural networks can produce very good results, but they require extensive data preparation, involving normalisation and the conversion of categorical values to numeric values. The main drawback of ANNs is that they are difficult to understand: they represent complex non-linear models that, unlike decision trees, do not readily produce rules.

Application of neural nets

Neural networks are a good choice for most classification and prediction tasks when the results are more important than understanding how the model works. Neural nets do not work well when there are many hundreds or thousands of input features. Large numbers of features make it more difficult for the network to find patterns and can result in long training phases.

How to Apply Data Mining

There are essentially four ways of utilising data mining expertise in business:
1. Purchasing ready-made scores (such as credit-worthiness scores for loan applicants) from outside vendors.
2. Purchasing software that embodies data mining expertise designed for a particular application such as credit approval, fraud detection or churn prevention.
3. Hiring outside consultants to perform data mining for special projects.
4. Developing data mining skills within the business organisation.

The advantage of purchasing scores is that it is quick and easy; but with the intelligence limited to single score values, its usefulness is also limited.
Purchasing Software

Data mining expertise can be embodied in software in one of two ways. The software may be an actual model, in the form of a set of rules for decision support or a fully trained neural network, applied to a particular domain. Alternatively, it may embody knowledge of the process of building models appropriate to a particular domain, in the form of a model-creation wizard or template. Purchasing models developed elsewhere can work well, but only to the extent that the products, customers and market conditions match those that were used to develop the model.

Applications developed to meet the needs of a particular industry are often called vertical applications, because they attempt to cover every layer from data handling at the bottom to report generation at the top. A general-purpose data mining tool, on the other hand, is horizontal, since it provides broad applicability to many problems. One example of a vertical data mining application is Falcon, the embedded neural network model from the California-based company HNC for predicting credit card fraud. In 1998, Falcon monitored over 250 million card accounts worldwide.

Model-building software tools automate the process of creating candidate models and selecting the ones that perform best, once the proper input and output variables have been identified. Such software enables novice model builders to be taken through the process of creating models based on their own data. Although this gives the flexibility of reflecting local conditions, it leaves a number of significant tasks to the user (these are part of the overall data mining development methodology described below):
- Choosing a suitable business problem to be addressed by data mining.
- Identifying and collecting data that is likely to contain the information needed to answer the business question.
- Preprocessing the data so that the data mining tool can make use of it.
- Transforming the database so that the input variables needed by the model are available.
- Designing a plan of action based on the model and implementing it in the marketplace.
- Measuring the results of the actions and feeding them back into the database, where they will be available for future mining.

As one example of model-building software, Model 1 from Unica Technologies contains four modules, each of which addresses a particular direct marketing challenge: Response Modeller, Cross Seller, Customer Valuator and Customer Segmenter.

Hiring Outside Experts

This approach is recommended if an organisation is in the early stages of integrating data mining into its business, and especially if the data mining activity is to be a one-off exercise, e.g. fixing a manufacturing problem. If instead it is to be an ongoing process, e.g. data mining for customer relationship management, it is more worthwhile to develop the necessary skills among the organisation's own staff through in-house development. Outside expertise for data mining is likely to be available in three places:
- A data mining software vendor. If the data mining software has already been selected, the company providing it is the first place to look for help.
- Data mining centres. These are usually collaborations between universities and private companies.
- Consulting companies. Ideally, the consulting company chosen should have experience specifically in the area of interest to the organisation seeking help.

Developing In-house Expertise

Any business serious about converting corporate data into business intelligence should consider making data mining one of its core competencies. This applies particularly to companies with many products and customers.

Data Mining Development Methodology

According to (Hirji 2001), no quantitative or qualitative study of how to actually perform data mining has been undertaken.
Cabena et al. (Cabena et al. 1998) have proposed a five-stage model for implementing and using data mining. The stages in the model are:
1. Business objective determination
2. Data preparation
3. Data mining
4. Results analysis
5. Knowledge assimilation

Business objective determination is concerned with clearly identifying the business problem to be mined. Data preparation involves data selection, preprocessing and transformation. Data mining is concerned with algorithm selection and execution. Results analysis addresses the question of whether anything new or interesting has been found. Knowledge assimilation aims to formulate ways of exploiting the newly extracted information.

Case Study

A case study involving a large fast food outlet, described in (Hirji 2001), brought out some deficiencies of the methodology described above. A new set of stages for data mining development and use has been proposed:
1. Business objective determination
2. Data preparation
3. Data audit
4. Interactive data mining and results analysis
5. Back end data mining
6. Results synthesis and presentation

The case study used IBM's Intelligent Miner for Data on AIX as the data mining tool and took 20 actual days of effort across the six stages above. Back end data mining involves data enrichment and additional data mining algorithm execution by the data mining specialist. 45% of the total project effort was taken up by stages 4, 5 and 6, and 30% by the data preparation stage. The 30% of total time required for data preparation, as compared with the 70% predicted in the earlier model of Cabena et al., may be explained by the organisation's use of an existing data warehouse. For a data mining project without this advantage, more project resources would be required for tasks such as selecting, cleaning, transforming, coding and loading the data.
Important aspects of the interactive data mining and results analysis stage were linking data mining results with business strategy, and using application software such as spreadsheets to perform sensitivity analysis on the results obtained. The objective is to demonstrate how data mining results support business strategy. For example, patterns of fast food product combinations were identified as a basis for developing strategies for recombining some existing product offerings. Further quantitative studies involving other industries are required to validate the methodology outlined above. It is expected that a set of best practices for data mining implementation projects will emerge through further refinement of this methodology.