Supervised Learning
Fall 2004

Introduction
- Key idea
  - Known target concept (predict a certain attribute)
  - Find out how the other attributes can be used
- Algorithms
  - Rudimentary rules (e.g., 1R)
  - Statistical modeling (e.g., Naïve Bayes)
  - Divide and conquer: decision trees
  - Instance-based learning
  - Neural networks
  - Support vector machines

1-Rule
- Generate a one-level decision tree
  - One attribute
  - Performs quite well!
- Basic idea:
  - Rules testing a single attribute
  - Classify according to frequency in training data
  - Evaluate error rate for each attribute
  - Choose the best attribute
- That's all folks!

The Weather Data (again)

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No

Apply 1R

Attribute      Rule               Errors  Total errors
1 outlook      sunny -> no        2/5     4/14
               overcast -> yes    0/4
               rainy -> yes       2/5
2 temperature  hot -> no (tie)    2/4     5/14
               mild -> yes        2/6
               cool -> yes        1/4
3 humidity     high -> no         3/7     4/14
               normal -> yes      1/7
4 windy        false -> yes       2/8     5/14
               true -> no (tie)   3/6

Outlook and humidity tie for the lowest total error (4/14), so either can be chosen as the single rule.

Other Features
- Numeric values
  - Discretization: sort the training data, then split the range into categories

    Value  64 65 68 69 70 71 72 72 75 75 80 81 83 85
    Class  Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N

- Missing values
  - "Dummy" attribute: treat "missing" as one more attribute value

Naïve Bayes Classifier
- Allow all attributes to contribute equally
- Assumes
  - All attributes equally important
  - All attributes independent
- Realistic?
- Selection of attributes

Bayes Theorem

$$P[H \mid E] = \frac{P[E \mid H]\,P[H]}{P[E]}$$

- $P[H \mid E]$: posterior probability, the conditional probability of hypothesis $H$ given evidence $E$
- $P[H]$: prior probability of the hypothesis
- $P[E]$: probability of the evidence

Maximum a Posteriori (MAP)

$$H_{MAP} = \arg\max_{H} P[H \mid E] = \arg\max_{H} \frac{P[E \mid H]\,P[H]}{P[E]} = \arg\max_{H} P[E \mid H]\,P[H]$$

Maximum Likelihood (ML)

$$H_{ML} = \arg\max_{H} P[E \mid H]$$

Classification
- Want to classify a new instance $(a_1, a_2, \ldots, a_n)$ into one of a finite number of categories from the set $V$.
- Bayesian approach: assign the most probable category $v_{MAP}$ given $(a_1, a_2, \ldots, a_n)$.

$$v_{MAP} = \arg\max_{v \in V} P[v \mid a_1, \ldots, a_n] = \arg\max_{v \in V} \frac{P[a_1, \ldots, a_n \mid v]\,P[v]}{P[a_1, \ldots, a_n]} = \arg\max_{v \in V} P[a_1, \ldots, a_n \mid v]\,P[v]$$

- Can we estimate the probabilities from the training data?

Naïve Bayes Classifier
- The second probability, $P[v]$, is easy to estimate. How?
- The first probability, $P[a_1, \ldots, a_n \mid v]$, is difficult to estimate. Why?
- Assume independence (this is the naïve bit):

$$v_{MAP} = \arg\max_{v \in V} P[v] \prod_{i} P[a_i \mid v]$$

The Weather Data (yet again)

Counts of each attribute value by class (Play = Yes / No):

Outlook    Yes No   Temperature  Yes No   Humidity  Yes No   Windy  Yes No   Play  Yes No
Sunny       2   3   Hot           2   2   High       3   4   FALSE   6   2          9   5
Overcast    4   0   Mild          4   2   Normal     6   1   TRUE    3   3
Rainy       3   2   Cool          3   1

$$\hat{P}[\text{Play}=\text{yes}] = \tfrac{9}{14}$$
$$\hat{P}[\text{Outlook}=\text{sunny} \mid \text{Play}=\text{yes}] = \tfrac{2}{9}$$
$$\hat{P}[\text{Temperature}=\text{cool} \mid \text{Play}=\text{yes}] = \tfrac{3}{9}$$
$$\hat{P}[\text{Humidity}=\text{high} \mid \text{Play}=\text{yes}] = \tfrac{3}{9}$$
$$\hat{P}[\text{Windy}=\text{true} \mid \text{Play}=\text{yes}] = \tfrac{3}{9}$$

Estimation
- Given a new instance with outlook=sunny, temperature=cool, humidity=high, windy=true:

$$P[\text{Play}=\text{yes}] \prod_{i} P[a_i \mid \text{Play}=\text{yes}] = \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} = 0.0053$$

Calculations continued ...
- Similarly

$$P[\text{Play}=\text{no}] \prod_{i} P[a_i \mid \text{Play}=\text{no}] = \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} = 0.0206$$

- Thus

$$v_{MAP} = \arg\max_{v \in \{\text{Play}=\text{yes},\ \text{Play}=\text{no}\}} P[v] \prod_{i} P[a_i \mid v] = \{\text{Play}=\text{no}\}$$

Normalization
- Note that we can normalize to get the probabilities:

$$P[v \mid a_1, \ldots, a_n] = \frac{P[a_1, \ldots, a_n \mid v]\,P[v]}{P[a_1, \ldots, a_n]} =
\begin{cases}
\dfrac{0.0053}{0.0053 + 0.0206} = 0.205, & v = \{\text{Play}=\text{yes}\} \\[2ex]
\dfrac{0.0206}{0.0053 + 0.0206} = 0.795, & v = \{\text{Play}=\text{no}\}
\end{cases}$$

Problems ...
- Suppose we had the following training data:

Outlook    Yes No   Temperature  Yes No   Humidity  Yes No   Windy  Yes No   Play  Yes No
Sunny       0   5   Hot           2   2   High       3   4   FALSE   6   2          9   5
Overcast    4   0   Mild          4   2   Normal     6   1   TRUE    3   3
Rainy       3   2   Cool          3   1

$$\hat{P}[\text{Play}=\text{yes}] = \tfrac{9}{14}$$
$$\hat{P}[\text{Outlook}=\text{sunny} \mid \text{Play}=\text{yes}] = \tfrac{0}{9}$$
$$\hat{P}[\text{Temperature}=\text{cool} \mid \text{Play}=\text{yes}] = \tfrac{3}{9}$$
$$\hat{P}[\text{Humidity}=\text{high} \mid \text{Play}=\text{yes}] = \tfrac{3}{9}$$
$$\hat{P}[\text{Windy}=\text{true} \mid \text{Play}=\text{yes}] = \tfrac{3}{9}$$

- Now what? A single zero estimate forces the whole product to zero.

Laplace Estimator
- Replace the estimates

$$\hat{P}[\text{Outlook}=\text{sunny} \mid \text{play}=\text{yes}] = \tfrac{2}{9}, \quad
\hat{P}[\text{Outlook}=\text{overcast} \mid \text{play}=\text{yes}] = \tfrac{4}{9}, \quad
\hat{P}[\text{Outlook}=\text{rainy} \mid \text{play}=\text{yes}] = \tfrac{3}{9}$$

- with

$$\hat{P}[\text{Outlook}=\text{sunny} \mid \text{play}=\text{yes}] = \frac{2 + \mu p_1}{9 + \mu}, \quad
\hat{P}[\text{Outlook}=\text{overcast} \mid \text{play}=\text{yes}] = \frac{4 + \mu p_2}{9 + \mu}, \quad
\hat{P}[\text{Outlook}=\text{rainy} \mid \text{play}=\text{yes}] = \frac{3 + \mu p_3}{9 + \mu}$$

Numeric Values
- Assume a probability distribution for the numeric attributes, with density $f(x)$
  - Normal:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

  - Fit a distribution (better)
- Similarly as before

$$v_{MAP} = \arg\max_{v \in V} P[v] \prod_{i} f(a_i \mid v)$$

Discussion
- Simple methodology
- Powerful: good results in practice
- Missing values no problem
- Not so good if the independence assumption is severely violated
  - Extreme case: multiple attributes with the same values
  - Solutions:
    - Preselect which attributes to use
    - Non-naïve Bayesian methods: networks

Decision Tree Learning
- Basic algorithm:
  - Select an attribute to be tested
  - If classification achieved, return classification
  - Otherwise, branch by setting the attribute to each of its possible values
  - Repeat with each branch as your new tree
- Main issue: how to select attributes

Deciding on Branching
- What do we want to accomplish?
  - Make good predictions
  - Obtain simple-to-interpret rules
- No diversity (impurity) is best
  - Least impure: all same class
  - Most impure: all classes equally likely
- Goal: select attributes to reduce impurity

Measuring Impurity/Diversity
- Let's say we only have two classes:
  - Minimum: $\min(p_1, p_2)$
  - Gini index / Simpson diversity index: $2 p_1 p_2 = 2 p_1 (1 - p_1)$
  - Entropy: $-p_1 \log_2(p_1) - p_2 \log_2(p_2)$

Impurity Functions
[Figure: entropy, Gini index, and minimum plotted against $p_1$ from 0 to 1]

Entropy

$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

- $S$: training data (instances)
- $c$: number of classes
- $p_i$: proportion of $S$ classified as $i$
- Entropy is a measure of impurity in the training data $S$
- Measured in bits of information needed to encode a member of $S$
- Extreme cases
  - All members have the same classification (note: $0 \cdot \log 0 = 0$)
  - All classifications equally frequent

Expected Information Gain

$$\text{Gain}(S, a) = \text{Entropy}(S) - \sum_{v \in \text{Values}(a)} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v), \qquad S_v = \{ s \in S : a(s) = v \}$$

- $\text{Values}(a)$: all possible values for attribute $a$
- Gain(S, a) is the expected information provided about the classification from knowing the value of attribute $a$ (reduction in the number of bits needed)

The Weather Data (yet again)

Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No

Decision Tree: Root Node

Outlook
  Sunny:    Yes Yes No No No
  Overcast: Yes Yes Yes Yes
  Rainy:    Yes Yes Yes No No

Calculating the Entropy

$$\text{Entropy}(S) = -\sum_{i=1}^{2} p_i \log_2 p_i = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$
$$\text{Entropy}(S_{sunny}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.97$$
$$\text{Entropy}(S_{overcast}) = -\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4} = 0.00$$
$$\text{Entropy}(S_{rainy}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.97$$

Calculating the Gain

$$\text{Gain}(S, \text{outlook}) = \text{Entropy}(S) - \sum_{v \in \text{Values}(a)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v) = 0.94 - \left( \frac{5}{14} \cdot 0.97 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.97 \right) = 0.94 - 0.69 = 0.25$$
$$\text{Gain}(S, \text{temp}) = 0.03$$
$$\text{Gain}(S, \text{humidity}) = 0.15$$
$$\text{Gain}(S, \text{windy}) = 0.05$$

- Select outlook: it has the largest gain.

Next Level

Outlook
  Sunny -> Temperature
    Hot:  No No
    Mild: Yes No
    Cool: Yes
  Overcast
  Rainy

Calculating the Entropy
- $S$ is now the set of five sunny instances.

$$\text{Entropy}(S) = 0.97$$
$$\text{Entropy}(S_{hot}) = -\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2} = 0$$
$$\text{Entropy}(S_{mild}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$$
$$\text{Entropy}(S_{cool}) = -\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1} = 0$$

Calculating the Gain

$$\text{Gain}(S, \text{temp}) = \text{Entropy}(S) - \sum_{v \in \text{Values}(a)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v) = 0.97 - \left( \frac{2}{5} \cdot 0 + \frac{2}{5} \cdot 1 + \frac{1}{5} \cdot 0 \right) = 0.97 - 0.40 = 0.57$$
$$\text{Gain}(S, \text{humidity}) = 0.97$$
$$\text{Gain}(S, \text{windy}) = 0.02$$

- Select humidity: it has the largest gain on the sunny branch.

Final Tree

Outlook
  Sunny -> Humidity
    High:   No
    Normal: Yes
  Overcast: Yes
  Rainy -> Windy
    True:  No
    False: Yes

What's in a Tree?
Our final decision tree correctly classifies every instance. Is this good?
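The entropy and information-gain calculations on the weather data can be checked with a short script. This is a minimal sketch rather than code from the course: the names `WEATHER`, `entropy`, and `gain` are illustrative, and attributes are referred to by column position (0 = outlook, 1 = temperature, 2 = humidity, 3 = windy).

```python
import math
from collections import Counter

# The 14-instance weather data: (outlook, temperature, humidity, windy, play).
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    # Classes with zero count never appear in the Counter, so 0*log(0) is skipped.
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, attr):
    """Expected information gain from splitting rows on column index attr."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [row[-1] for row in rows if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


def best_attribute(rows, attrs):
    """Column index with the largest gain among the candidate indices."""
    return max(attrs, key=lambda a: gain(rows, a))
```

Calling `best_attribute(WEATHER, [0, 1, 2, 3])` picks the root attribute, and applying `gain` to only the sunny rows reproduces the second-level comparison between temperature, humidity, and windy.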
Two important concepts:
- Overfitting
- Pruning

Overfitting
- Two sources of abnormalities
  - Noise (randomness)
  - Outliers (measurement errors)
- Chasing every abnormality causes overfitting
  - Tree too large and complex
  - Does not generalize to new data
- Solution: prune the tree

Pruning
- Prepruning
  - Halt construction of the decision tree early
  - Use the same measure as in determining attributes, e.g., halt if InfoGain < K
  - Most frequent class becomes the leaf node
- Postpruning
  - Construct the complete decision tree
  - Prune it back
  - Prune to minimize expected error rates
  - Prune to minimize bits of encoding (Minimum Description Length principle)

Scalability
- Need to design for large amounts of data
- Two things to worry about
  - Large number of attributes
    - Leads to a large tree (prepruning?)
    - Takes a long time
  - Large amounts of data
    - Can the data be kept in memory?
    - Some new algorithms do not require all the data to be memory resident

Discussion: Decision Trees
- The most popular methods
  - Quite effective
  - Relatively simple
- Have discussed in detail the ID3 algorithm:
  - Information gain to select attributes
  - No pruning
  - Only handles nominal attributes

Selecting Split Attributes
- Other univariate splits
  - Gain ratio: C4.5 algorithm (J48 in Weka)
  - CART (not in Weka)
- Multivariate splits
  - May be possible to obtain better splits by considering two or more attributes simultaneously

Instance-Based Learning
- Classification
  - Do not construct an explicit description of how to classify
  - Store all training data (learning)
  - New example: find the most similar instance
  - Computing is done at the time of classification
- k-nearest neighbor

K-Nearest Neighbor
- Each instance lives in n-dimensional space

$$\langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$$

- Distance between instances

$$d(x_1, x_2) = \sqrt{\sum_{r=1}^{n} \big( a_r(x_1) - a_r(x_2) \big)^2}$$

Example: nearest neighbor
[Figure: a query point $x_q$ surrounded by training instances labeled + and -. What is the 1-nearest-neighbor classification? The 6-nearest-neighbor classification?]
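The distance formula and the 1-nearest versus k-nearest question can be made concrete with a small sketch. The function names and the toy data below are assumptions made for illustration; they are not the points from the figure, and ties in the vote are not handled specially.

```python
import math
from collections import Counter


def euclidean(x1, x2):
    """Euclidean distance between two attribute vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def knn_classify(train, query, k=1):
    """Majority class among the k training instances closest to query.

    train is a list of (attribute_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda inst: euclidean(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two '+' points near the query, three '-' farther out.
TRAIN = [
    ((1.0, 1.0), "+"),
    ((0.0, 0.0), "+"),
    ((2.0, 2.0), "-"),
    ((3.0, 3.0), "-"),
    ((5.0, 5.0), "-"),
]
QUERY = (1.2, 1.2)
```

With this data, `knn_classify(TRAIN, QUERY, k=1)` and `k=3` both follow the nearby '+' points, while `k=5` is outvoted by the more distant '-' points, which is the kind of disagreement the figure asks about.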
Normalizing
- Some attributes may take large values and others small
- Normalize
  - All attributes on equal footing

$$a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}$$

Other Methods for Supervised Learning
- Neural networks
- Support vector machines
- Optimization
- Rough set approach
- Fuzzy set approach

Evaluating the Learning
- Measure of performance
  - Classification: error rate
- Resubstitution error
  - Performance on the training set
  - Poor predictor of future performance
  - Overfitting
  - Useless for evaluation

Test Set
- Need a set of test instances
  - Independent of training set instances
  - Representative of underlying structure
- Sometimes: validation data
  - Fine-tune parameters
  - Independent of training and test data
- Plentiful data: no problem!

Holdout Procedures
- Common case: data set large but limited
- Usual procedure:
  - Reserve some data for testing
  - Use remaining data for training
- Problems:
  - Want both sets as large as possible
  - Want both sets to be representative

"Smart" Holdout
- Simple check: are the proportions of classes about the same in each data set?
- Stratified holdout
  - Guarantee that classes are (approximately) proportionally represented
- Repeated holdout
  - Randomly select the holdout set several times and average the error rate estimates

Holdout w/ Cross-Validation
- Cross-validation
  - Fixed number of partitions of the data (folds)
  - In turn: each partition is used for testing and the remaining instances for training
  - May use stratification and randomization
- Standard practice:
  - Stratified tenfold cross-validation
  - Instances divided randomly into the ten partitions

Cross Validation
- Fold 1: train on 90% of the data -> model -> test on 10% of the data -> error rate $e_1$
- Fold 2: train on 90% of the data -> model -> test on 10% of the data -> error rate $e_2$

Cross-Validation
- Final estimate of error

$$\bar{e} = \frac{1}{k} \sum_{i=1}^{k} e_i$$

- Quality of estimate, using a Student-t quantile with $k - 1$ degrees of freedom

$$\bar{e} \pm t_{1-\alpha,\,k-1}\, s, \qquad s = \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (e_i - \bar{e})^2}$$

Leave-One-Out Holdout
- n-fold cross-validation (n-instance set)
  - Use all but one instance for training
- Maximum use of the data
- Deterministic
- High computational cost
- Non-stratified sample

Bootstrap
- Sample with replacement n times
  - Use as training data
  - Use instances not in the training data for testing
- How many test instances are there?

$$\lim_{n \to \infty} \left( 1 - \frac{1}{n} \right)^{n} = e^{-1}$$

0.632 Bootstrap
- On average, $e^{-1} n \approx 0.368\,n$ instances will be in the test set
- Thus, on average, 63.2% of the instances are in the training set
- Estimate the error rate as

$$e = 0.632\, e_{test} + 0.368\, e_{train}$$

Accuracy of our Estimate?
- Suppose we observe $s$ successes in a testing set of $n_{test}$ instances.
- We then estimate the success rate $R_{success} = s / n_{test}$.
- Each instance is either a success or a failure (Bernoulli trial with success probability $p$)
  - Mean $p$
  - Variance $p(1 - p)$

Properties of Estimate
- We have
  - $E[R_{success}] = p$
  - $\mathrm{Var}[R_{success}] = p(1 - p) / n_{test}$
- If $n_{test}$ is large enough, the Central Limit Theorem (CLT) states that, approximately,

$$R_{success} \sim \text{Normal}\big(p,\ p(1 - p)/n_{test}\big)$$

Confidence Interval
- CI for normal: look up $z$ in a table for the confidence level $c$

$$P\left[ -z \le \frac{R_{success} - p}{\sqrt{p(1 - p)/n_{test}}} \le z \right] = c$$

- CI for $p$

$$p = \frac{R_{success} + \dfrac{z^2}{2 n_{test}} \pm z \sqrt{\dfrac{R_{success}}{n_{test}} - \dfrac{R_{success}^2}{n_{test}} + \dfrac{z^2}{4 n_{test}^2}}}{1 + \dfrac{z^2}{n_{test}}}$$

Comparing Algorithms
- Know how to evaluate the results of our data mining algorithms (classification)
- How should we compare different algorithms?
  - Evaluate each algorithm
  - Rank
  - Select the best one
- Don't know if this ranking is reliable

Assessing Other Learning
- Developed procedures for classification
- Association rules
  - Evaluated based on accuracy
  - Same methods as for classification
- Numerical prediction
  - Error rate no longer applies
  - Same principles:
    - Use an independent test set and hold-out procedures
    - Cross-validation or bootstrap

Measures of Effectiveness
- Need to compare:
  - Predicted values $p_1, p_2, \ldots, p_n$
  - Actual values $a_1, a_2, \ldots, a_n$
- Most common measure: mean-squared error

$$\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2$$

Other Measures
- Mean absolute error

$$\frac{1}{n} \sum_{i=1}^{n} |p_i - a_i|$$

- Relative squared error

$$\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}$$

- Relative absolute error

$$\frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}$$

- Correlation

What to Do?
- "Large" amounts of data
  - Hold out 1/3 of the data for testing
  - Train a model on 2/3 of the data
  - Estimate the error (or success) rate and calculate a CI
- "Moderate" amounts of data
  - Estimate the error rate: use 10-fold cross-validation with stratification, or use the bootstrap
  - Train the model on the entire data set

Predicting Probabilities
- Classification into k classes
  - Predict probabilities $p_1, p_2, \ldots, p_k$ for each class
  - Actual values $a_1, a_2, \ldots, a_k$
- No longer 0-1 error
- Quadratic loss function, where $i$ is the correct class

$$\sum_{j=1}^{k} (p_j - a_j)^2 = (1 - p_i)^2 + \sum_{j \ne i} p_j^2 = 1 - 2 p_i + \sum_{j} p_j^2$$

Information Loss Function
- Instead of the quadratic function:

$$-\log_2 p_j$$

  where the j-th prediction is correct.
- Information required to communicate which class is correct
  - in bits
  - with respect to the probability distribution

Occam's Razor
- Given a choice of theories that are equally good, the simplest theory should be chosen
- Physical sciences: any theory should be consistent with all empirical observations
- Data mining:
  - theory = predictive model
  - good theory = good prediction
  - What is good? Do we minimize the error rate?

Minimum Description Length
- MDL principle:
  - Minimize the size of the theory + the information needed to specify exceptions
- Suppose training set $E$ is mined, resulting in a theory $T$
- Want to minimize

$$L[T] + L[E \mid T]$$

Most Likely Theory
- Suppose we want to maximize $P[T \mid E]$
- Bayes' rule

$$P[T \mid E] = \frac{P[E \mid T]\,P[T]}{P[E]}$$

- Take logarithms

$$\log P[T \mid E] = \log P[E \mid T] + \log P[T] - \log P[E]$$

Information Function
- Maximizing $P[T \mid E]$ is equivalent to minimizing

$$-\log P[T \mid E] = -\log P[E \mid T] - \log P[T] + \log P[E]$$

  - $-\log P[E \mid T]$: number of bits it takes to submit the exceptions
  - $-\log P[T]$: number of bits it takes to submit the theory
- That is, the MDL principle!

Applications to Learning
- Classification, association, numeric prediction
  - Several predictive models with 'similar' error rate (usually as small as possible)
  - Select between them using Occam's razor
  - Simplicity is subjective
  - Use the MDL principle
- Clustering
  - Important learning that is difficult to evaluate
  - Can use the MDL principle

Comparing Mining Algorithms
- Know how to evaluate the results
- Suppose we have two algorithms
  - Obtain two different models
  - Estimate the error rates $\hat{e}^{(1)}$ and $\hat{e}^{(2)}$
  - Compare estimates: $\hat{e}^{(1)} < \hat{e}^{(2)}$?
  - Select the better one
- Problem?
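The quadratic and information loss functions introduced under Predicting Probabilities can be sketched in a few lines. This is only an illustration: the function names are hypothetical, classes are identified by list index, and `correct` is the index of the true class.

```python
import math


def quadratic_loss(probs, correct):
    """Sum over classes of (p_j - a_j)^2, with a_j = 1 for the correct class and 0 otherwise."""
    return sum(
        (p - (1.0 if j == correct else 0.0)) ** 2 for j, p in enumerate(probs)
    )


def quadratic_loss_expanded(probs, correct):
    """Equivalent expanded form: 1 - 2*p_i + sum_j p_j^2 for correct class i."""
    return 1.0 - 2.0 * probs[correct] + sum(p * p for p in probs)


def information_loss(probs, correct):
    """Bits needed to communicate the correct class: -log2 of its predicted probability.

    Raises ValueError if the predicted probability is 0, where the loss is unbounded.
    """
    return -math.log2(probs[correct])
```

For a three-class prediction such as `[0.5, 0.25, 0.25]` with the first class correct, both quadratic forms agree, and the information loss is one bit because the correct class was given probability one half.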
Weather Data Example
- Suppose we learn the rule
  - If outlook=rainy then play=yes
  - Otherwise play=no
- Test it on the following test set:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Hot    High      FALSE  No
Sunny    Hot    High      TRUE   No
Rainy    Mild   High      FALSE  Yes
Rainy    Cool   Normal    FALSE  Yes
Sunny    Mild   High      FALSE  No

- Have zero error rate

Different Test Set 2
- Again, suppose we learn the rule
  - If outlook=rainy then play=yes
  - Otherwise play=no
- Test it on a different test set:

Outlook   Temp.  Humidity  Windy  Play
Overcast  Hot    High      FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No

- Have 100% error rate!

Comparing Random Estimates
- Estimated error rate is just an estimate (random)
- Need variance as well as point estimates
- Construct a t-test statistic

$$t = \frac{\bar{d}}{\sqrt{s^2 / k}}$$

  - $\bar{d}$: average of differences in error rates
  - $s$: estimated standard deviation
  - $H_0$: difference = 0

Discussion
- Now know how to compare two learning algorithms and select the one with the better error rate
- We also know to select the simplest model that has 'comparable' error rate
- Is it really better?
- Minimising error rate can be misleading

Examples of 'Good Models'
- Application: loan approval
  - Model: no applicants default on loans
  - Evaluation: simple, low error rate
- Application: cancer diagnosis
  - Model: all tumors are benign
  - Evaluation: simple, low error rate
- Application: information assurance
  - Model: all visitors to network are well intentioned
  - Evaluation: simple, low error rate

What's Going On?
- Many (most) data mining applications can be thought about as detecting exceptions
- Ignoring the exceptions does not significantly increase the error rate!
- Ignoring the exceptions often leads to a simple model!
- Thus, we can find a model that we evaluate as good but that completely misses the point
- Need to account for the cost of error types

Accounting for Cost of Errors
- Explicit modeling of the cost of each error
  - Costs may not be known
  - Often not practical
- Look at trade-offs
  - Visual inspection
  - Semi-automated learning
- Cost-sensitive learning
  - Assign costs to classes a priori

Explicit Modeling of Cost
- Confusion matrix (displayed in Weka)

                   Predicted class
                   Yes              No
Actual class Yes   True positive    False negative
             No    False positive   True negative

Cost Sensitive Learning
- Have used cost information to evaluate learning
- Better: use cost information to learn
- Simple idea:
  - Increase instances that demonstrate important behavior (e.g., classified as exceptions)
  - Applies for any learning algorithm

Discussion
- Evaluate learning
  - Estimate error rate
  - Minimum description length principle / Occam's razor
- Comparison of algorithms
  - Based on evaluation
  - Make sure the difference is significant
- Cost of making errors may differ
  - Use evaluation procedures with caution
  - Incorporate into learning

Engineering the Output
- Prediction based on one model
  - Model performs well on one training set, but poorly on others
  - New data becomes available -> new model
- Combine models
  - Bagging
  - Boosting
  - Stacking
- Improve prediction but complicate structure

Bagging
- Bias: error despite all the data in the world!
- Variance: error due to limited data
- Intuitive idea of bagging:
  - Assume we have several data sets
  - Apply the learning algorithm to each set
  - Vote on the prediction (classification/numeric)
- What type of error does this reduce?
- When is this beneficial?

Bootstrap Aggregating
- In practice: only one training data set
  - Create many sets from one
  - Sample with replacement (remember the bootstrap)
- Does this work?
  - Often gives improvements in predictive performance
  - Never degrades performance

Boosting
- Assume a stable learning procedure
  - Low variance
  - Bagging does very little
- Combine structurally different models
- Intuitive motivation:
  - Any given model may be good for a subset of the training data
  - Encourage models to explain part of the data

AdaBoost.M1
- Generate models:
  - Assign equal weight to each training instance
  - Iterate:
    - Apply learning algorithm and store model
    - e <- error
    - If e = 0 or e > 0.5, terminate
    - For every instance: if classified correctly, multiply weight by e/(1-e)
    - Normalize weights
  - Until STOP

AdaBoost.M1
- Classification:
  - Assign zero weight to each class
  - For every model: add $\log \frac{1 - e}{e}$ to the class predicted by the model
  - Return the class with the highest weight

Performance Analysis
- Error of the combined classifier converges to zero at an exponential rate (very fast)
  - Questionable value due to possible overfitting
  - Must use independent test data
- Fails on test data if
  - Classifier is more complex than the training data justifies
  - Training error becomes too large too quickly
- Must achieve balance between model complexity and the fit to the data

Fitting versus Overfitting
- Overfitting very difficult to assess here
  - Assume we have reached zero error
  - May be beneficial to continue boosting!
  - Occam's razor?
- Build complex models from simple ones
- Boosting offers very significant improvement
  - Can hope for more improvement than bagging
  - Can degrade performance
    - Never happens with bagging

Stacking
- Models of different types
- Meta learner:
  - Learn which learning algorithms are good
  - Combine learning algorithms intelligently
- Level-0 models: decision tree, Naïve Bayes, instance-based
- Level-1 model: meta learner

Meta Learning
- Hold out part of the training set
- Use remaining data for training level-0 methods
- Use holdout data to train level-1 learning
- Retrain level-0 algorithms with all the data
- Comments:
  - Level-1 learning: use a very simple algorithm (e.g., linear model)
  - Can use cross-validation to allow level-1 algorithms to train on all the data

Supervised Learning
- Two types of learning
  - Classification
  - Numerical prediction
- Classification learning algorithms
  - Decision trees
  - Naïve Bayes
  - Instance-based learning
  - Many others are part of Weka, browse!

Other Issues in Supervised Learning
- Evaluation
  - Accuracy: hold-out, bootstrap, cross-validation
  - Simplicity: MDL principle
  - Usefulness: cost-sensitive learning
- Metalearning
  - Bagging, Boosting, Stacking
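As a closing illustration of the boosting material, the AdaBoost.M1 steps described earlier (reweight correctly classified instances by e/(1-e), normalize, then vote with weight log((1-e)/e)) can be sketched as follows. This is a hedged sketch, not the course's or Weka's implementation: `learn` is a hypothetical interface that takes instances, labels, and weights and returns a callable model, and a model with zero error is discarded here because its vote weight would be infinite.

```python
import math
from collections import defaultdict


def reweight(weights, correct, e):
    """Multiply weights of correctly classified instances by e/(1-e), then normalize."""
    factor = e / (1.0 - e)
    scaled = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled]


def boost(learn, X, y, rounds=10):
    """Generate up to `rounds` models; returns a list of (model, weighted_error) pairs."""
    n = len(X)
    weights = [1.0 / n] * n  # equal weight for every training instance
    ensemble = []
    for _ in range(rounds):
        model = learn(X, y, weights)
        correct = [model(x) == label for x, label in zip(X, y)]
        # Weighted error: total weight of the misclassified instances.
        e = sum(w for w, ok in zip(weights, correct) if not ok)
        if e == 0 or e > 0.5:
            break
        ensemble.append((model, e))
        weights = reweight(weights, correct, e)
    return ensemble


def classify(ensemble, x):
    """Each model adds log((1-e)/e) to the class it predicts; highest total wins."""
    votes = defaultdict(float)
    for model, e in ensemble:
        votes[model(x)] += math.log((1.0 - e) / e)
    return max(votes, key=votes.get)
```

One consequence worth noticing in `reweight`: after normalization the misclassified instances carry half of the total weight, so the next model is pushed toward the examples the previous one got wrong.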