Chapter 4: Basic Data Mining Technique

Content
• What is classification?
• What is prediction?
• Supervised and Unsupervised Learning
• Decision trees
• Association rule
• K-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithm
• Rough set approach
• Fuzzy set approaches

Data Warehouse and Data Mining

[Figure: Data Mining Process]

[Figure: Data Mining Strategies]

Classification vs. Prediction
• Classification:
  – predicts categorical class labels
  – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
• Prediction:
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical Applications
  – credit approval
  – target marketing
  – medical diagnosis
  – treatment effectiveness analysis

Classification Process
1. Model construction
2. Model usage

Classification Process
1. Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction: training set
• The model is represented as classification rules, decision trees, or mathematical formulae

1. Model Construction
Training Data (input to the Classification Algorithms):

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process
2.
Model usage: for classifying future or unknown objects
• Estimate the accuracy of the model
  – The known label of each test sample is compared with the classified result from the model
  – The accuracy rate is the percentage of test set samples that are correctly classified by the model
  – The test set is independent of the training set

2. Use the Model in Prediction
Testing Data (input to the Classifier):

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4). Tenured?

What Is Prediction?
• Prediction is similar to classification
  – 1. Construct a model
  – 2. Use the model to predict an unknown value
• The major method for prediction is regression
  – Linear and multiple regression
  – Non-linear regression
• Prediction is different from classification
  – Classification predicts categorical class labels
  – Prediction models continuous-valued functions

Issues regarding classification and prediction
1. Data Preparation
2. Evaluating Classification Methods

1. Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove the irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data

2. Evaluating Classification Methods
• Predictive accuracy
• Speed
  – time to construct the model
  – time to use the model
• Robustness
  – handling noise and missing values
• Scalability
  – efficiency in disk-resident databases
• Interpretability
  – understanding and insight provided by the model
• Goodness of rules
  – decision tree size
  – compactness of classification rules

Supervised vs.
Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

[Figure: Supervised Learning]

[Figure: Unsupervised Learning]

Classification by Decision Tree Induction
• Decision tree
  – A flow-chart-like tree structure
  – Each internal node denotes a test on an attribute
  – Each branch represents an outcome of the test
  – Leaf nodes represent class labels or class distributions
• Use of decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree
• Decision tree generation consists of two phases
  1. Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  2. Tree pruning
    • Identify and remove branches that reflect noise or outliers

Training Dataset
This follows an example from Quinlan’s ID3.

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

Output: A Decision Tree for “buys_computer”
[Figure: decision tree. The root tests age?; the <=30 branch leads to the test student?, the 31…40 branch leads to a leaf, and the >40 branch leads to the test credit rating?]
[Leaves of the tree: age 31…40 → yes; student? no → no, yes → yes; credit rating? excellent → no, fair → yes]

[Figure: Decision Tree]

What Is Association Mining?
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

[Figure: Presentation of Classification Results]

Instance-Based Methods
• Instance-based learning:
  – Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
• Typical approaches
  – k-nearest neighbor approach
    • Instances are represented as points in a Euclidean space.
  – Case-based reasoning
    • Uses symbolic representations and knowledge-based inference

The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-D space.
• The nearest neighbors are defined in terms of Euclidean distance.
• The target function could be discrete- or real-valued.
• For a discrete-valued target function, k-NN returns the most common value among the k training examples nearest to xq.
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: labeled training examples (+ and -) surrounding the query point xq]

Case-Based Reasoning
• Also uses: lazy evaluation + analysis of similar instances
• Difference: instances
are not “points in a Euclidean space”
• Methodology
  – Instances are represented by rich symbolic descriptions (e.g., function graphs)
  – Multiple retrieved cases may be combined

Genetic Algorithms
• GA: based on an analogy to biological evolution
• Each rule is represented by a string of bits
• An initial population is created consisting of randomly generated rules
  – e.g., IF A1 and Not A2 then C2 can be encoded as 100
• Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of training examples
• Offspring are generated by crossover and mutation

[Figure: Supervised genetic learning]

Rough Set Approach
• Rough sets are used to approximately or “roughly” define equivalence classes
• A rough set for a given class C is approximated by two sets:
  1. a lower approximation (certain to be in C) and
  2.
an upper approximation (cannot be described as not belonging to C)
• Finding the minimal subsets of attributes (for feature reduction) is NP-hard

Fuzzy Set Approaches
• Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (for example, as drawn in a fuzzy membership graph)
[Figure: fuzzy membership graph. Vertical axis: fuzzy membership; horizontal axis: Income; membership curves: Low, Medium, High; annotations: somewhat low, baseline high]
• Attribute values are converted to fuzzy values
  – e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
• For a given new sample, more than one fuzzy value may apply
• Each applicable rule contributes a vote for membership in the categories
• Typically, the truth values for each predicted category are summed

Reference
Data Mining: Concepts and Techniques (Chapter 7 of the textbook), Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada
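
Worked example: model usage on the tenure data
The model-usage step of the classification process in this chapter (compare each known test label with the classified result, then report the percentage classified correctly) can be sketched in Python with the tenure data from the slides. This is only an illustrative sketch, not part of the original slides: the names `classify` and `accuracy` are assumptions, and the rule is written by hand rather than produced by a classification algorithm.

```python
# Tenure data copied from the slides: (name, rank, years, tenured).
TRAINING = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
TEST = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def classify(rank, years):
    """Hand-coded version of the slide's rule (hypothetical helper):
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


def accuracy(rows):
    """Fraction of rows whose known label matches the classified result."""
    correct = sum(
        1 for _, rank, years, label in rows if classify(rank, years) == label
    )
    return correct / len(rows)


if __name__ == "__main__":
    print("training accuracy:", accuracy(TRAINING))
    print("test accuracy:", accuracy(TEST))
    # Unseen sample from the slide: (Jeff, Professor, 4)
    print("Jeff tenured?", classify("Professor", 4))
```

Under these assumptions the rule fits all six training samples but misclassifies Merlisa (Associate Prof, 7 years, not tenured) in the test set, so the test accuracy is 3 of 4, or 75%. The unseen sample (Jeff, Professor, 4) is classified as tenured because the rank condition holds.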