Copied from Adriaans, P. and Zantinge, D. (1996) Data Mining, Addison-Wesley, and Dunham, M.H. (2003) Data Mining: Introductory and Advanced Topics, Prentice Hall.

Data mining (DM) deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large databases. Knowledge discovery in databases (KDD) is the non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from a database. DM and KDD are often used interchangeably; lately, however, KDD has come to refer to a process consisting of many steps, of which DM is only one. In principle, KDD consists of six stages:

1. Data selection
2. Cleaning
3. Enrichment
4. Coding
5. Data mining
6. Reporting

As much as 80% of KDD is about preparing the data; the remaining 20% is about mining. We will walk through these steps with an example below.

SOME APPLICATION DOMAINS OF DM
- Banking sector (customer profiling, customer scoring, loan payment)
- Biomedical data analysis (detection of differentially expressed genes)
- Financial data analysis (early warning systems, detection of money laundering and other financial crimes)
- Insurance services
- Security services (crime and criminal detection)
- ...

Example: Suppose that we would like to work on the database of a magazine publisher. The publisher sells five types of magazine: on cars, houses, sports, music, and comics. Say we are interested in questions such as "What is the typical profile of a reader of a car magazine?" and "Is there any correlation between an interest in cars and an interest in comics?".

Data selection: We start with the choice of a database. The records consist of client number, name, address, date of subscription, and type of magazine. Part of the database is given in the following table:

Client number   Name      Address            Date of purchase   Magazine purchased
23003           Johnson   1 Downing Street   04-15-94           Car
23003           Johnson   1 Downing Street   06-21-93           Music
23003           Johnson   1 Downing Street   05-30-92           Comic
23009           Clinton   2 Boulevard        01-01-10           Comic
23019           Jonson    1 Downing Street   03-30-95           House

Cleaning: There might be problems in the data, such as duplicated records or out-of-range values. At this stage we simply clean the data. For example, there is both a Johnson and a Jonson in the database with the same address. These are probably the same person, and the typo can be corrected.

Enrichment: If we can "purchase" additional information on the customers, we can add it. In this example, we can add information on date of birth, income, and ownership of a car or house:

Client name   Date of birth   Income    Car owner   House owner
Johnson       04-13-76        $18,500   No          No
Clinton       10-20-71        $36,000   Yes         No

Coding: This is similar to data reconstruction. We add the enriched data as columns to the original data and remove any unnecessary columns. For example, the clients' names might be unnecessary and can be removed. We deal with missing cases. Some recoding, or the creation of new variables, might be necessary. For example, we need to convert the "car owner" information from yes/no to 1/0, and similarly for "house owner". We can divide income by 1000 to simplify. Date of birth and address might be too detailed; we can convert them to simpler values that may reveal a pattern. For example, birth year can be converted to age, and address can be converted to region using the postal code, if available.
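As an illustration of the coding step, here is a minimal sketch in Python. The record layout and field names are hypothetical, and a reference year of 2008 is assumed when converting birth year to age (the ages 32 and 37 in the coded table of the next step are consistent with that choice).

```python
# Hypothetical enriched records, mirroring the enrichment table above.
raw = [
    {"client": 23003, "birth_year": 1976, "income": 18500,
     "car_owner": "No",  "house_owner": "No"},
    {"client": 23009, "birth_year": 1971, "income": 36000,
     "car_owner": "Yes", "house_owner": "No"},
]

def code_record(r, ref_year=2008):
    """Recode one record: yes/no -> 1/0, income in thousands,
    birth year -> age (ref_year is an assumed reference year)."""
    return {
        "client": r["client"],
        "age": ref_year - r["birth_year"],
        "income": r["income"] / 1000,                      # 18500 -> 18.5
        "car_owner": 1 if r["car_owner"] == "Yes" else 0,  # yes/no -> 1/0
        "house_owner": 1 if r["house_owner"] == "Yes" else 0,
    }

coded = [code_record(r) for r in raw]
print(coded)
# [{'client': 23003, 'age': 32, 'income': 18.5, 'car_owner': 0, 'house_owner': 0},
#  {'client': 23009, 'age': 37, 'income': 36.0, 'car_owner': 1, 'house_owner': 0}]
```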
Instead of having one attribute "magazines" with five different possible values, we can create five binary attributes, one for every magazine. If the value of the attribute is 1, the reader is a subscriber; otherwise it is 0. This is called 'flattening'. Note that this is different from creating dummy variables. Why is flattening preferred to dummy variables here? Think! The final version of the data looks like the following table (after cleaning, Jonson's house-magazine subscription has been merged into Johnson's record):

Client number   Age   Income   Car owner   House owner   Region   Car mag.   House mag.   Sports mag.   Music mag.   Comic mag.
23003           32    18.5     0           0             1        1          1            0             1            1
23009           37    36       1           0             1        0          0            0             0            1

Data mining: Any technique that helps extract more out of your data is useful, so DM methods form quite a heterogeneous group, such as:

- Visualization
- Classification
- Linear and logistic regression, time series analysis
- Prediction
- Clustering
- Summarization
- Decision trees
- Association rules
- Neural networks
- Genetic algorithms

Before introducing these algorithms, we need to distinguish supervised from unsupervised algorithms.

Supervised versus unsupervised learning / algorithms: Supervised algorithms learn from training data in which the outcome (the class or target value) is already known for each record; unsupervised algorithms look for structure in the data without any such labels.

Supervised:
- k-nearest neighbor
- Regression models
- Decision trees
- Neural networks

Unsupervised:
- k-means clustering
- Hierarchical clustering
- Self-organizing maps (SOM)

Before we can apply more advanced pattern-analysis algorithms, we need to know some basic aspects and structures of the data set. A good way to start is to extract some simple statistical information, for example summary statistics. For the 1000 subjects in the data set:

Attribute         Average
Age               46.9
Income            20.8
Credit            34.9
Car owner         0.59
House owner       0.59
Car magazine      0.329
House magazine    0.702
Sports magazine   0.447
Music magazine    0.146
Comic magazine    0.081

Averages per magazine:

Magazine   Age    Credit   Income   Car owner   House owner
Car        29.3   27.3     17.1     0.48        0.53
House      48.1   35.5     21.1     0.58        0.76
Sports     42.2   31.4     24.3     0.70        0.60
Music      24.6   24.6     12.8     0.30        0.45
Comic      21.4   26.3     25.5     0.62        0.60

Graphs, such as histograms and scatter plots (including interactive ones), are also very useful here.

k-nearest neighbor: Records of the same type will be close to each other in the data space; they live in each other's neighborhood. If we want to predict the behavior of a certain individual, we look at the behavior of, for example, the ten individuals closest to him in the data space. We calculate the average behavior of these ten individuals, and this average is the prediction for our individual. The letter k in k-nearest stands for the number of neighbors we investigate. (A small sketch follows the decision-tree example below.)

Decision trees: Our database contains attributes such as age, income, and credit. If we want to predict a certain kind of customer, say one who will buy a car magazine, what helps us more: the age or the income of a person? It could be that age is more important. If so, the next thing to do is split on this attribute; that is, we investigate whether there is a certain age threshold that separates car-magazine buyers from non-buyers. In this way, we start with the first attribute, find a threshold, go on to the next attribute, find a threshold, and repeat this process until we have a correct classification of our customers. A simple decision tree for the car magazine (percentages are the share of non-subscribers in each branch):

Car magazine?
  AGE > 44.5   -> 99% non-subscribers
  AGE <= 44.5  -> 38% non-subscribers

Above the age of 44.5 only 1% of people subscribe to the car magazine, while below it 62% of people subscribe to it.
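The k-nearest-neighbor idea described above can be sketched in a few lines of Python. This is a toy illustration, not a production implementation: the records and values are invented, and in practice the features should be rescaled so that age and income contribute comparably to the distance.

```python
import math

def knn_predict(target, records, k, features, label):
    """Average the label of the k records closest to `target`
    (Euclidean distance over the given features)."""
    def dist(r):
        return math.sqrt(sum((r[f] - target[f]) ** 2 for f in features))
    nearest = sorted(records, key=dist)[:k]
    return sum(r[label] for r in nearest) / len(nearest)

# Invented toy data in the spirit of the magazine example.
records = [
    {"age": 25, "income": 20, "car_magazine": 1},
    {"age": 30, "income": 18, "car_magazine": 1},
    {"age": 28, "income": 22, "car_magazine": 1},
    {"age": 50, "income": 35, "car_magazine": 0},
    {"age": 55, "income": 40, "car_magazine": 0},
]
print(knn_predict({"age": 27, "income": 21}, records, k=3,
                  features=("age", "income"), label="car_magazine"))  # -> 1.0
```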
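Before looking at a more detailed tree, here is a sketch of how such an age threshold can be found: scan the midpoints between consecutive attribute values and keep the split that misclassifies the fewest records. Real tree-growing algorithms score splits with measures such as entropy or the Gini index, but the exhaustive threshold scan is the same idea. The data below are invented so that the best split lands at 44.5.

```python
def best_threshold(records, feature, label):
    """Return the (threshold, errors) split on `feature` that best
    separates subscribers (label 1) from non-subscribers (label 0)."""
    values = sorted({r[feature] for r in records})
    best = (None, float("inf"))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate: midpoint between consecutive values
        errors = 0
        for side in ([r for r in records if r[feature] <= t],
                     [r for r in records if r[feature] > t]):
            ones = sum(r[label] for r in side)
            errors += min(ones, len(side) - ones)  # majority vote per side
        if errors < best[1]:
            best = (t, errors)
    return best

records = [
    {"age": 25, "car_magazine": 1}, {"age": 30, "car_magazine": 1},
    {"age": 40, "car_magazine": 1}, {"age": 49, "car_magazine": 0},
    {"age": 55, "car_magazine": 0}, {"age": 60, "car_magazine": 0},
]
print(best_threshold(records, "age", "car_magazine"))  # -> (44.5, 0)
```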
A more detailed tree (again, percentages are the share of non-subscribers in each leaf):

Car magazine?
  AGE > 44.5
    AGE > 48.5       -> 100%
    AGE <= 48.5      -> 92%
  AGE <= 44.5
    INCOME > 34.5    -> 100%
    INCOME <= 34.5
      AGE > 31.5     -> 46%
      AGE <= 31.5    -> 0%

People with an income under 34.5 and an age under 31.5 are very likely to be interested in the car magazine.

Classification: Maps data into predefined groups or classes. The aim is to find a classification rule, based on measurements on a training sample, that can be used for future samples. Classification is often referred to as a supervised algorithm, since the classes are determined before the data are examined. Examples:

1. Classifying applicants for a bank credit as risky or not.
2. An airport security screening station is used to determine whether passengers are potential terrorists or criminals. The face of each passenger is scanned and its basic pattern (distance between the eyes, size and shape of the mouth, shape of the head, etc.) is identified. This pattern is compared to entries in a database to see whether it matches any patterns associated with known offenders.

Clustering: Similar to classification, except that the groups are not predefined; clustering is therefore referred to as unsupervised learning. Clustering is usually accomplished by determining the similarity of the data on predefined attributes; the most similar data are grouped into clusters.

Example: A national department store chain creates special catalogs targeted at various demographic groups based on attributes such as income, location, and physical characteristics of potential customers (age, height, weight, etc.). The results of the clustering are used by management to create new special catalogs and distribute them to the correct target populations.

Association rules: Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of uncovering relationships among data. The best example of this type of application is determining association rules. An association rule is a model that identifies specific types of data associations. (A small computational sketch follows at the end of this section.)

Example: A grocery store retailer is trying to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules showing what other products are frequently sold with bread. He finds that 60% of the time that bread is sold, pretzels are sold too, and that 70% of the time jelly is also sold. Based on these facts, he decides to place some pretzels and jelly at the end of the aisle where the bread is placed. In addition, he decides not to put either of these items on sale at the same time.

Be aware that association rules are not causal relationships. There is probably no causal relation making bread and pretzels be purchased together, and there is no guarantee that this association will hold in the future. Nevertheless, such rules can help the manager with effective advertising, marketing, and inventory control.

Neural networks: There are several different forms of neural network. Networks do not provide a rule that explains an association; they only show that a connection exists. Also, although they do learn, they do not provide us with a theory about what they have learned. They are black boxes that give answers but no clear idea of how they arrived at those answers.

Genetic algorithms: A genetic algorithm is a search technique used in computing to find exact or approximate solutions to optimization and search problems.
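The support and confidence computations behind association rules, such as the bread example above, fit in a few lines of Python. The baskets below are invented, so the confidences it prints (0.75 and 0.5) are properties of this toy data, not the 70%/60% figures from the retailer example.

```python
# Hypothetical market-basket transactions.
baskets = [
    {"bread", "jelly", "pretzels"},
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"bread", "jelly", "milk"},
    {"milk", "pretzels"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """How often the rule antecedent -> consequent holds when it applies."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"bread"}, {"jelly"}))     # 3 of 4 bread baskets -> 0.75
print(confidence({"bread"}, {"pretzels"}))  # 2 of 4 bread baskets -> 0.5
```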
Fuzzy logic: Fuzzy logic is a form of multi-valued logic derived from fuzzy set theory, designed to deal with reasoning that is approximate rather than precise. Just as set-membership values in fuzzy set theory can range (inclusively) between 0 and 1, in fuzzy logic the degree of truth of a statement can range between 0 and 1 and is not constrained to the two truth values {true, false} of classical logic.

Additional references:
www.kdnuggets.com
Hastie, T., Tibshirani, R., and Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, Springer, New York.