Data Mining Classification Algorithms
Credits: Padhraic Smyth

Outline
• What is supervised classification
• Classification algorithms
  – Decision trees
  – k-NN
  – Linear discriminant analysis
  – Naïve Bayes
  – Logistic regression
  – SVM
• Model evaluation and model selection
  – Performance measures (accuracy, ROC, precision, recall, …)
  – Validation techniques (k-fold CV, LOO)
• Ensemble learning
  – Random Forests

Notation
• Variables X, Y with values x, y (lower case)
• Vectors indicated by boldface X
• Y is called the target variable
• Components of X indicated by Xj with values xj
• "Matrix" data set D with n rows and p columns
  – jth column contains values for variable Xj
  – ith row contains a vector of measurements on object i, indicated by x(i)
  – The jth measurement value for the ith object is xj(i)
• Unknown parameter for a model = θ
  – Vector of parameters = θ

Decision Region Terminology
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), showing Decision Region 1, Decision Region 2, and the Decision Boundary between them.]

Classes of classifiers
• Discriminative models, which focus on locating optimal decision boundaries
  – Decision trees: "swiss army knife", often effective in high dimensions
  – Nearest neighbor: simple, can scale poorly in high dimensions
  – Linear discriminants: simple, sometimes effective
  – Support vector machines: generalization of linear discriminants, can be quite effective; computational complexity is an issue
• Class-conditional/probabilistic, based on p(x | ck)
  – Naïve Bayes (simple, but often effective in high dimensions)
• Regression-based, model p(ck | x) directly
  – Logistic regression: simple, linear in "odds" space
  – Neural network: non-linear extension of logistic regression, can be difficult to work with

Decision Tree Classifiers
• Widely used in practice
  – Can handle both real-valued and nominal inputs (unusual)
  – Good with high-dimensional data
• Historically, developed both in statistics and computer science

Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree; a skeleton of this recursion is sketched below

Learning decision trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Attribute-based representations
• Examples are described by attribute values (Boolean, discrete, continuous)
  – E.g., situations where I will/won't wait for a table
• Classification of examples is positive (T) or negative (F)
• The set of examples used for learning is called the training set

Decision trees
• One possible representation for hypotheses
• E.g., the "true" tree for deciding whether to wait

Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
• Patrons? is a better choice
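To make the recursive idea above concrete, here is a minimal Python sketch (not code from the lecture; the names `learn_tree` and `choose_attribute` are illustrative) of greedy decision-tree learning. Examples are assumed to be (attribute-dict, label) pairs, and any purity criterion, such as the information gain defined in the next subsection, can be plugged in as `choose_attribute`.

```python
def learn_tree(examples, attributes, choose_attribute, default_label=None):
    """Greedy, recursive decision-tree learning (sketch).
    examples: list of (attribute_dict, label); attributes: list of attribute names."""
    labels = [label for _, label in examples]
    if not examples:                          # no examples left: fall back to parent's majority class
        return default_label
    if len(set(labels)) == 1:                 # all examples agree: leaf node
        return labels[0]
    if not attributes:                        # no attributes left to test: majority-class leaf
        return max(set(labels), key=labels.count)

    best = choose_attribute(examples, attributes)   # e.g., the attribute with largest information gain
    majority = max(set(labels), key=labels.count)
    remaining = [a for a in attributes if a != best]
    subtree = {}
    for v in {x[best] for x, _ in examples}:        # one branch per observed value of the chosen attribute
        subset = [(x, y) for x, y in examples if x[best] == v]
        subtree[v] = learn_tree(subset, remaining, choose_attribute, majority)
    return {best: subtree}
```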
Example contd.
• Decision tree learned from the 12 examples
• Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by such a small amount of data

Using information theory
• To implement Choose-Attribute in the DTL algorithm
• Information Content (Entropy):
  $I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i)\,\log_2 P(v_i)$
• For a training set containing p positive examples and n negative examples:
  $I\!\left(\tfrac{p}{p+n}, \tfrac{n}{p+n}\right) = -\tfrac{p}{p+n}\log_2\tfrac{p}{p+n} - \tfrac{n}{p+n}\log_2\tfrac{n}{p+n}$

Information gain
• A chosen attribute A divides the training set E into subsets E1, …, Ev according to their values for A, where A has v distinct values:
  $\mathrm{remainder}(A) = \sum_{i=1}^{v} \tfrac{p_i+n_i}{p+n}\, I\!\left(\tfrac{p_i}{p_i+n_i}, \tfrac{n_i}{p_i+n_i}\right)$
• Information Gain (IG), or reduction in entropy from the attribute test:
  $IG(A) = I\!\left(\tfrac{p}{p+n}, \tfrac{n}{p+n}\right) - \mathrm{remainder}(A)$
• Choose the attribute with the largest IG

Information gain (contd.)
For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
Consider the attributes Patrons and Type (and others too):
  $IG(\mathrm{Patrons}) = 1 - \left[\tfrac{2}{12} I(0,1) + \tfrac{4}{12} I(1,0) + \tfrac{6}{12} I\!\left(\tfrac{2}{6},\tfrac{4}{6}\right)\right] \approx 0.541$ bits
  $IG(\mathrm{Type}) = 1 - \left[\tfrac{2}{12} I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{2}{12} I\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4},\tfrac{2}{4}\right)\right] = 0$ bits
Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root. (These computations are verified in the short sketch further below.)

Next step
Given Patrons as the root node, the next attribute chosen is Hungry?, with
  IG(Hungry?) = I(1/3, 2/3) − (2/3 · 1 + 1/3 · 0) ≈ 0.252
Final decision tree induced by the 12-example training set.

Broadening the applicability of Decision Trees
• Continuous or integer-valued attributes
• Missing data
• Continuous output attributes (regression!)

Splitting on a nominal attribute
• Nominal attribute with m values
  – e.g., the name of a state or a city in marketing data
• $2^{m-1}$ possible subsets => exhaustive search is $O(2^{m-1})$
  – For small m, a simple approach is to branch on specific values
  – But for large m this may not work well
• Neat trick for the 2-class problem:
  – For each predictor value, calculate the proportion of class 1's
  – Order the m values according to these proportions
  – Now treat the attribute as an ordinal variable and select the best split (linear in m)
  – This gives the optimal split for the Gini index, among all $2^{m-1}$ possible splits

Splitting on integer or continuous attributes
• A numeric attribute A is tested for two outcomes: A ≤ split_point and A > split_point
• split_point is a value returned by the choose-attribute function
• Example: Income?, with branches ≤ 42,000 and > 42,000

Treating Missing Data in Trees
• Missing values are common in practice
• Approaches to handling missing values
  – During training
    • Ignore rows with missing values (inefficient)
  – During testing
    • Send the example being classified down both branches and average the predictions
  – Replace missing values with an "imputed value" (can be suboptimal)
• Other approaches
  – Treat "missing" as a unique value (useful if missing values are correlated with the class)
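As a numerical check of the entropy and information-gain formulas above, the following small Python sketch (not from the lecture; function names are illustrative) reproduces IG(Patrons) ≈ 0.541 bits and IG(Type) = 0 bits for the 12 restaurant examples.

```python
import math

def entropy(*probs):
    """I(P(v1), ..., P(vn)) = sum_i -P(vi) * log2 P(vi), with the convention 0*log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_pos, parent_neg, subsets):
    """subsets = [(p_i, n_i), ...], the positive/negative counts in each branch of the test."""
    total = parent_pos + parent_neg
    before = entropy(parent_pos / total, parent_neg / total)
    remainder = sum((p + n) / total * entropy(p / (p + n), n / (p + n))
                    for p, n in subsets)
    return before - remainder

# Restaurant data: 6 positive and 6 negative examples overall.
# Patrons: None -> (0, 2), Some -> (4, 0), Full -> (2, 4)
print(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))          # ~0.541 bits
# Type: French -> (1, 1), Italian -> (1, 1), Thai -> (2, 2), Burger -> (2, 2)
print(information_gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))  # 0.0 bits
```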
How to Choose the Right-Sized Tree?
[Figure: predictive error vs. size of decision tree, showing error on training data, error on test data, and the ideal range for tree size.]

Choosing a Good Tree for Prediction
• General idea
  – grow a large tree
  – prune it back to create a family of subtrees ("weakest link" pruning)
  – score the subtrees and pick the best one
• Massive data sizes (e.g., n ~ 100k data points)
  – use the training data set to fit a set of trees
  – use a validation data set to score the subtrees
• Smaller data sizes (e.g., n ~ 1k or less)
  – use cross-validation

Example: Spam Email Classification
• Data set (from the UCI Machine Learning Archive)
  – 4601 email messages from 1999
  – Manually labeled as spam (60%), non-spam (40%)
  – 54 features: percentage of words matching a specific word/character
    • business, address, internet, free, george, !, $, etc.
  – Average/longest/sum lengths of uninterrupted sequences of CAPS
• Error rates
  – Training: 3056 emails, Testing: 1536 emails
  – Decision tree = 8.7%
  – Logistic regression = 7.6%
  – Naïve Bayes = 10% (typically)

Why Trees are Widely Used in Practice
• Can handle high-dimensional data
  – builds a model using one dimension at a time
• Can handle any type of input variables
  – categorical, real-valued, etc.
  – most other methods require data of a single type (e.g., only real-valued)
• Trees are (somewhat) interpretable
  – a domain expert can "read off" the tree's logic
• Tree algorithms are relatively easy to code and test

Limitations of Classification Trees
• Representational bias
  – piecewise linear boundaries, parallel to the axes
• High variance
  – trees can be "unstable" as a function of the sample
    • e.g., a small change in the data -> a completely different tree
  – this causes two problems
    • 1. High variance contributes to prediction error
    • 2. High variance reduces interpretability
  – Trees are good candidates for model combining
    • Often used with boosting and bagging
• Trees do not scale well to massive data sets (e.g., N in the millions)
  – repeated random access of subsets of the data

Nearest Neighbor Classifiers
• kNN: select the k nearest neighbors to x from the training data and predict the majority class among these neighbors
• Training method:
  – Save the training examples
• At prediction time:
  – Use a distance measure (for example Euclidean) and calculate the distance of a test example x to all training examples
  – Find the k training examples (x1,y1), …, (xk,yk) that are closest to the test example x
  – Predict the most frequent class among those yi's

Nearest Neighbor Classifiers (contd.)
• k is a parameter:
  – Small k: "noisier" estimates; large k: "smoother" estimates
  – The best value of k is often chosen by cross-validation
  – Important special case: k = 1 (1NN)
• Comments
  – Virtually assumption free
  – Interesting theoretical properties (e.g., asymptotically the 1NN error rate is at most twice the optimal Bayes error rate)
• Disadvantages
  – Can scale poorly with dimensionality: sensitive to the distance metric
  – Requires fast lookup at run-time to do classification with large n
  – Does not provide any interpretable "model"
• Improvements:
  – Weighting examples from the neighborhood
  – Finding "close" examples in a large training set quickly
  – Assign a score for class prediction as the fraction of the predicted class among the k nearest examples (a "probabilistic" version)

The effect of the value of k on class boundaries
[Figure: the effect of the value of k on class boundaries.]
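The kNN procedure described above is short enough to sketch directly. The following minimal Python example (illustrative, not the lecture's code) stores the training examples, computes Euclidean distances at prediction time, and takes a majority vote over the k nearest neighbors.

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """train: list of (feature_vector, label) pairs; x: feature vector to classify.
    Uses Euclidean distance and a simple majority vote over the k nearest neighbors."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustration with made-up 2-D points
train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((4.0, 4.2), 'B'), ((3.8, 4.0), 'B')]
print(knn_predict(train, (1.1, 0.9), k=3))   # 'A'
```

The same vote counts also give the "probabilistic" score mentioned above: the fraction of the k neighbors belonging to the predicted class.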
Linear Discriminant Classifiers
• Linear Discriminant Analysis (LDA)
  – Earliest known classifier (1936, R.A. Fisher)
  – Find a projection onto a vector such that the means for each class (2 classes) are separated as much as possible (with the variances taken into account appropriately)
  – Reduces to a special case of a parametric Gaussian classifier in certain situations
  – Many subsequent variations on this basic theme (e.g., regularized LDA)
• Other linear discriminants
  – Decision boundary = (p−1)-dimensional hyperplane in p dimensions
  – Perceptron learning algorithms (pre-dated neural networks)
    • Simple "error correction" based learning algorithms
  – SVMs: use a sophisticated "margin" idea for selecting the hyperplane

Fisher Linear Discriminant
• For each projection vector w, compute w·xi for each example xi
• Find the w that minimizes the within-class variance and maximizes the between-class variance

Fisher Linear Discriminant (contd.)
• Solution:
  $w \propto (\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2)$
  where μ1 and μ2 are the means of the class 1 and class 2 vectors and Σ1 and Σ2 are the covariances of class 1 and class 2, respectively.
• At test time, just use the simple decision rule:
  if w·xi > threshold then class 1, else class 2
  The threshold may be chosen such that accuracy on the training set is maximal.

Support Vector Machines (will be discussed again later)
• Support vector machines
  – Use a different loss function, the "margin"
    • Results in a convex optimization problem, solvable by quadratic programming
  – Decision boundary represented by examples in the training data
  – Linear version: clever placement of the hyperplane
  – Non-linear version: "kernel trick" for high-dimensional problems
  – Computational complexity can be $O(N^3)$ without speedups

Naïve Bayes Classifiers
• Bayes theorem:
  $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
• Theorem of total probability: if the events Ai are mutually exclusive and their probabilities sum to 1, then
  $P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$

Naïve Bayes Classifiers (contd.)
• Probabilistic model based on Bayes rule:
  $P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$
  with a conditional independence assumption on p(x | ck), i.e.
  $P(X \mid C) = P(x_1, x_2, \ldots, x_n \mid C) = \prod_{i=1..n} P(x_i \mid C)$
• Decision rule:
  $c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, \ldots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, \ldots, x_n)} = \arg\max_{c_j \in C} P(x_1, \ldots, x_n \mid c_j)\,P(c_j)$

Parameter estimation
• P(cj)
  – Can be estimated from the frequency of classes in the training examples
• P(x1, x2, …, xn | cj)
  – $O(|X|^n \cdot |C|)$ parameters
  – Could only be estimated if a very, very large number of training examples were available
• Independence assumption: attribute values are conditionally independent given the target value (hence "naïve" Bayes):
  $P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)$
  $c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)$

Properties
• Estimating P(xi | cj) instead of P(x1, x2, …, xn | cj) greatly reduces the number of parameters (and the data sparseness)
• The learning step in Naïve Bayes consists of estimating P(cj) and P(xi | cj) based on the frequencies in the training data
• An unseen instance is classified by computing the class that maximizes the posterior
• When conditional independence is satisfied, Naïve Bayes corresponds to MAP classification
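Before the worked "Play Tennis" example below, here is a minimal Python sketch (illustrative names, not the lecture's code) of the learning step just described: estimating P(cj) and P(xi | cj) from training-set frequencies (with no smoothing) and classifying by the argmax of P(c) ∏ P(xi | c).

```python
from collections import Counter, defaultdict
from math import prod

def train_nb(examples):
    """examples: list of (attribute_dict, class_label) pairs.
    Estimates class priors P(c) and conditional tables P(attribute=value | c)
    from raw frequencies, mirroring the estimation described above."""
    class_counts = Counter(c for _, c in examples)
    value_counts = defaultdict(Counter)        # (class, attribute) -> counts of attribute values
    for x, c in examples:
        for a, v in x.items():
            value_counts[(c, a)][v] += 1
    n = len(examples)
    priors = {c: k / n for c, k in class_counts.items()}
    def likelihood(a, v, c):
        return value_counts[(c, a)][v] / class_counts[c]
    return priors, likelihood

def classify_nb(priors, likelihood, x):
    """Return argmax_c P(c) * prod_i P(x_i | c), together with the un-normalized scores."""
    scores = {c: p * prod(likelihood(a, v, c) for a, v in x.items())
              for c, p in priors.items()}
    return max(scores, key=scores.get), scores
```

Fed the 14 rows of the "Play Tennis" table below, this sketch reproduces the hand-computed scores of about 0.0053 for yes and 0.0206 for no.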
Example: 'Play Tennis' data

Day    Outlook   Temperature  Humidity  Wind    Play Tennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No

Question: for the day <sunny, cool, high, strong>, what is the play prediction?

Based on the examples in the table, classify the following case x:
  x = (Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong)
• That is: c = play tennis or not?
  $c_{NB} = \arg\max_{c \in \{yes,\,no\}} P(c) \prod_t P(a_t \mid c)$
  $\phantom{c_{NB}} = \arg\max_{c \in \{yes,\,no\}} P(c)\, P(\mathrm{Outlook{=}sunny} \mid c)\, P(\mathrm{Temp{=}cool} \mid c)\, P(\mathrm{Humidity{=}high} \mid c)\, P(\mathrm{Wind{=}strong} \mid c)$
• Working:
  P(PlayTennis = yes) = 9/14 = 0.64,  P(PlayTennis = no) = 5/14 = 0.36
  P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33,  P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
  etc.
  P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
  P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
  Answer: PlayTennis(x) = no

Underflow Prevention
• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• The class with the highest final un-normalized log probability score is still the most probable:
  $c_{NB} = \arg\max_{c_j \in C} \left( \log P(c_j) + \sum_{i \in \mathrm{positions}} \log P(x_i \mid c_j) \right)$

Properties of the Naïve Bayes Classifier
• Pros
  – Incrementality: with each training example, the prior and the likelihood can be updated dynamically
  – Robustness: flexible and robust to errors
  – Probabilistic hypothesis: outputs not only a classification, but a probability distribution over all classes
• Cons
  – Performance: the naïve Bayes assumption is, in many cases, not realistic, which reduces performance
  – Probability estimates for classes are often unrealistic (due to the naïve Bayes assumption)

Evaluating Classification Results (in general)
• Summary statistics:
  – empirical estimate of the score function on test data, e.g., error rate
• More detailed breakdown
  – E.g., "confusion matrices"
  – Can be quite useful in detecting systematic errors
• Detection vs. false-alarm plots (2 classes)
  – Binary classifier with a real-valued output for each example, where higher means more likely to be class 1
  – For each possible threshold, calculate
    • Detection rate = fraction of class 1 detected
    • False-alarm rate = fraction of class 2 detected
  – Plot y (detection rate) versus x (false-alarm rate)
  – Also known as ROC, precision-recall, specificity/sensitivity
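As a small illustration of the detection-rate / false-alarm-rate plot described above, the sketch below (illustrative, not from the lecture) sweeps every observed score as a threshold for a two-class problem with real-valued classifier outputs.

```python
def detection_false_alarm_curve(scores, labels):
    """scores: real-valued classifier outputs (higher = more likely class 1);
    labels: true classes (1 or 2). For each candidate threshold, compute the
    detection rate (fraction of class 1 detected) and the false-alarm rate
    (fraction of class 2 detected). Returns a list of (false_alarm, detection) points."""
    n1 = sum(1 for y in labels if y == 1)
    n2 = sum(1 for y in labels if y == 2)
    points = []
    for t in sorted(set(scores), reverse=True):
        detected = [y for s, y in zip(scores, labels) if s >= t]
        detection = sum(1 for y in detected if y == 1) / n1
        false_alarm = sum(1 for y in detected if y == 2) / n2
        points.append((false_alarm, detection))
    return points

# Toy example: four scored test examples
print(detection_false_alarm_curve([0.9, 0.7, 0.4, 0.2], [1, 2, 1, 2]))
# [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```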
Bagging for Combining Classifiers
• Training data set of size N
• Generate B "bootstrap" sampled data sets of size N
  – Bootstrap sample = sample with replacement
  – e.g., B = 100
• Build B models (e.g., trees), one for each bootstrap sample
  – Intuition: the bootstrapping "perturbs" the data enough to make the models more resistant to true variability
• For prediction, combine the predictions from the B models
  – E.g., for classification, p(c | x) = fraction of the B models that predict c
  – Plus: generally improves accuracy on models such as trees
  – Minus: lose interpretability
• (A small code sketch of bagging appears at the end of these notes.)

[Figure: green = majority vote, purple = averaging the probabilities. From Hastie, Tibshirani, and Friedman, 2001.]

Illustration of Boosting
[Figure: color of points = class label; diameter of points = weight at each iteration. Dashed line: single-stage classifier. Green line: combined, boosted classifier. Dotted blue in the last two panels: bagging. From G. Rätsch, PhD thesis, 2001.]

Summary on Classifiers
• Simple models (but can be effective)
  – Logistic regression
  – Naïve Bayes
  – k nearest-neighbors
• Decision trees
  – Good for high-dimensional problems with different data types
• State of the art:
  – Support vector machines
  – Boosted trees
• Many tradeoffs in interpretability, score functions, etc.

Decision Tree Classifiers
  – Task: Classification
  – Representation: Decision boundaries = hierarchy of axis-parallel boundaries
  – Score Function: Cross-validated error
  – Search/Optimization: Greedy search in tree space
  – Data Management: None specified
  – Models, Parameters: Tree

Naïve Bayes Classifier
  – Task: Classification
  – Representation: Conditional independence probability model
  – Score Function: Likelihood
  – Search/Optimization: Closed-form probability estimates
  – Data Management: None specified
  – Models, Parameters: Conditional probability tables

Logistic Regression
  – Task: Classification
  – Representation: Log-odds(C) = linear function of the X's
  – Score Function: Log-likelihood
  – Search/Optimization: Iterative (Newton) method
  – Data Management: None specified
  – Models, Parameters: Logistic weights

Nearest Neighbor Classifier
  – Task: Classification
  – Representation: Memory-based
  – Score Function: Cross-validated error (for selecting k)
  – Search/Optimization: None
  – Data Management: None specified
  – Models, Parameters: None

Support Vector Machines
  – Task: Classification
  – Representation: Hyperplanes
  – Score Function: "Margin"
  – Search/Optimization: Convex optimization (quadratic programming)
  – Data Management: None specified
  – Models, Parameters: None

Software (same as for Regression)
• MATLAB
  – Many free "toolboxes" on the Web for regression and prediction
  – e.g., see http://lib.stat.cmu.edu/matlab/ and in particular the CompStats toolbox
• R
  – General-purpose statistical computing environment (successor to S)
  – Free (!)
  – Widely used by statisticians; has a huge library of functions and visualization tools
• Commercial tools
  – SAS, other statistical packages
  – Data mining packages
  – Often are not programmable: offer a fixed menu of items
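Finally, returning to the bagging procedure described earlier, here is a minimal Python sketch (illustrative, not the lecture's code) of the bootstrap-and-vote idea: draw B bootstrap samples, build one model per sample, and combine predictions by majority vote, with p(c | x) taken as the fraction of models voting for c. The `build_model` argument is an assumed placeholder for any base learner (e.g., a tree learner) that returns a callable model.

```python
import random
from collections import Counter

def bagging_predict(train, x, build_model, B=100, seed=0):
    """train: list of training examples; x: example to classify;
    build_model(sample) -> model, where model(x) returns a predicted class."""
    rng = random.Random(seed)
    n = len(train)
    votes = Counter()
    for _ in range(B):
        sample = [train[rng.randrange(n)] for _ in range(n)]   # bootstrap sample (with replacement)
        model = build_model(sample)                            # one model per bootstrap sample
        votes[model(x)] += 1
    probs = {c: v / B for c, v in votes.items()}               # p(c | x) = fraction of models predicting c
    return votes.most_common(1)[0][0], probs
```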