CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk

Classification

Objectives
– To understand the concept of classifiers and their construction
– See how decision tree classifiers are constructed
– Statistical classifiers: naive Bayes (again)
– Linear separation via SVM classifiers
– Construct and compare classifiers using Weka

Classification
Regression concerns creating a model for numeric values; classification is about creating models for categoric values
Many different approaches to classification
– “The” core problem in machine learning (see CS342, CS909)
Many different uses for classification:
– Use the model to understand the data better
– Use classification to fill in missing values
– Identify how one variable depends on others
– Predict future values based on partial information

Applications of Classification
Many applications of classification in practice:
– Determine whether a loan is likely to be repaid (credit scoring)
– Fraud detection: identify when a transaction is fraudulent
– Medical diagnosis: initial classification of test results
– Identify text from writing (optical character recognition)
– Determine whether an email is spam (spam filter)
– Predict whether someone will like a product (collaborative filtering)
– Decide which advert to place on a webpage
– Determine the likelihood of suffering a disease based on history
– and many more…

The Model in Classification
Many different choices in how to model data
Central concept: data = model + error
– Attempt to extract a model from the data (the error is unknown)
– Error comes from noise, measurement error, random variation
– We try to avoid modeling the error: doing so leads to overfitting
– Overfitting is possible when the model has too many parameters: the model only describes the training data and does not generalize
– The converse problem is underfitting: the model cannot capture the behaviour, e.g. modeling a curve with a line

Classification Process
Classification phase one: Training
– Obtain a data set with examples and a target attribute (the class)
– Use training data to build a model of the data
Classification phase two: Evaluation
– Apply the model to test data with a hidden class value
– Evaluate the quality of the model by comparing predictions to the hidden values
– Accept the model, or reject it if not sufficiently accurate
Classification phase three: Application
– Apply the model to new data with unknown class value
– Monitor accuracy: look for concept drift (the properties of the target change over time)

Approaches to Evaluation
Many approaches to the evaluation of a classifier
Apply the classifier to the training data (test data = training data)
– Risk of overfitting here: the classifier works well on training data but does not generalize
– Some classifiers achieve zero error when applied to their training data
Hold back a subset of the training data for testing
– Holdout method: typically use 2/3 for training, 1/3 for testing
– Balance: enough data to train, but enough to test also
– Have seen this with the UCI adult data set: adult.data has 32K training examples, adult.test has 16K test examples
– To evaluate a new classifier, may repeat with different random splits
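As a concrete illustration (not from the original slides), a minimal Python sketch of the holdout method, assuming a hypothetical classifier object that exposes fit(X, y) and predict(X):

```python
import random

def holdout_evaluate(examples, labels, classifier, train_frac=2/3, seed=0):
    """Split data into train/test, train the classifier, return test accuracy.

    `classifier` is assumed to expose fit(X, y) and predict(X) methods
    (a hypothetical interface, used here only for illustration).
    """
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_frac)
    train_idx, test_idx = idx[:cut], idx[cut:]

    classifier.fit([examples[i] for i in train_idx], [labels[i] for i in train_idx])
    predictions = classifier.predict([examples[i] for i in test_idx])

    correct = sum(1 for i, p in zip(test_idx, predictions) if p == labels[i])
    return correct / len(test_idx)
```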
Approaches to Evaluation
k-fold cross-validation: robust checking of a model
– Divide the data into k equal pieces (folds); a typical choice is k = 10
– Use (k-1) folds to form the training set
– Evaluate model accuracy on the 1 remaining fold
– Repeat for each of the k choices of held-out fold
Stratified cross-validation:
– Additionally, ensure that each fold has the same class distribution
– Likely to hold approximately if the folds are chosen randomly
[Figure: the data divided into Fold 1, Fold 2, Fold 3, Fold 4]

Measures of Accuracy
Typically, a classifier outputs a single class label for an example
– It may instead output a distribution of beliefs over the labels
– To predict from a distribution, either sample from it or take the most likely label (majority vote)
For each test example, the result is either right or wrong
– Typically, a “near miss” is not counted
– Accuracy counts the fraction of examples correctly classified
– There are several other commonly used measures of quality: true positive rate, false positive rate, Area Under Curve (AUC)…

Measures of Accuracy
Prediction \ Truth | True                | False
Positive           | True positive (TP)  | False positive (FP)
Negative           | False negative (FN) | True negative (TN)

False positive rate: FP/(FP + TN)
– What fraction of False values are reported as positive?
True positive rate (aka sensitivity, recall): TP/(TP + FN)
– What fraction of True values are reported as positive?
Precision: TP/(TP + FP)
– What fraction of those reported positive are correct?
Accuracy: (TP + TN)/(TP + TN + FN + FP)
F1 measure: 2TP/(2TP + FP + FN)
– Harmonic mean of precision and recall

ROC and AUC
Assume each prediction has a confidence p between 0 and 1
– Pick a threshold r: round all confidences below r to 0, and those above r to 1
– Each choice of r gives a true positive rate and a false positive rate
– Different choices of r give different tradeoffs
Receiver Operating Characteristic (ROC) curve: plot of true positive rate against false positive rate
– Closer to the top-left is better
– Area Under Curve (AUC) measures overall quality
– Compute AUC by stepping through the sorted confidence values
– Random guessing gives the diagonal line, with AUC 0.5

Class-imbalance
Accuracy is not always the best choice
Class-imbalance problem: what if one outcome is rare?
– E.g. the diagnosis is (rare disease) in 0.1% of individuals
– If we always predict “no disease”, we are correct with 99.9% accuracy…
– True positive rate may be a better measure here
Some classifiers do not handle class imbalance well
– They expect a roughly even mix of class values
– Can “oversample” the positives and “undersample” the negatives
– Look at the confusion matrix: how each class is labeled
– The matrix counts how many examples in class A are labeled B, for all pairs of classes A and B

Kappa statistic
The Kappa statistic compares accuracy to random guessing
Measure p_o as the observed agreement between two labelers
– Here, between a classifier and the true (withheld) labels
p_e is the agreement expected if both were labeling at random
– Based on the empirical probabilities of giving each label
κ = (p_o – p_e)/(1 – p_e)
Computing κ for a binary classifier:
– p_o = (TP + TN)/N  [fraction that are correct]
– p_e = (TP + FP)/N × (TP + FN)/N + (TN + FN)/N × (FP + TN)/N
  [fraction labeled positive × fraction truly positive + fraction labeled negative × fraction truly negative]
Higher values of κ are better – there is no agreement on what value counts as "good"
– Will see examples as we go on
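A minimal Python sketch (added for illustration, not part of the slides) of computing these measures, including κ, from binary 0/1 predictions and true labels:

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived measures for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n = tp + fp + fn + tn

    accuracy  = (tp + tn) / n
    tpr       = tp / (tp + fn) if tp + fn else 0.0   # recall / sensitivity
    fpr       = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1        = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0

    # Kappa: observed agreement vs. agreement expected under random labeling
    p_o = accuracy
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((fp + tn) / n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr,
            "precision": precision, "f1": f1, "kappa": kappa}
```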
Considerations for classifiers
Which classification model should you use out of the many choices?
– The most accurate may not be the most desirable
Speed is a consideration also
– Some classifiers take a prohibitively long time to train
– Some take a very long time to apply to a new example
– Scalability: some do not work well when the data exceeds memory size
Robustness to noise / missing values
Interpretability of results
– Can a human make sense of the model that is produced?
– Does the classifier help to identify which attributes are most important?

Majority class classifier
A trivial baseline: always predict the majority class
– More generally, predict the distribution of the class values
Suppose the majority class occurs a p fraction of the time
– Then this classifier should achieve (approximately) p accuracy
Not a useful or meaningful classifier
– But an important benchmark: any good model should clearly outperform the majority class classifier
Implemented as the “ZeroR” model in Weka: classifiers.rules.ZeroR
– Generalizes to numeric data: predicts the mean

Weka results for adult.data: majority class
Relation: adult
Instances: 48842
Attributes: 15
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
ZeroR predicts class value: <=50K
Time taken to build model: 0.01 seconds
=== Evaluation on test split ===
Correctly Classified Instances    12728    76.647 %
Incorrectly Classified Instances   3878    23.353 %
Kappa statistic                   0
Mean absolute error               0.3626
Root mean squared error           0.4232
Relative absolute error           100 %
Root relative squared error       100 %
Total Number of Instances         16606
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0        0        0          0       0          0.5       >50K
               1        1        0.766      1       0.868      0.5       <=50K
Weighted Avg.  0.766    0.766    0.587      0.766   0.665      0.5
=== Confusion Matrix ===
    a      b   <-- classified as
    0   3878 |  a = >50K
    0  12728 |  b = <=50K

Nearest neighbour classifier
Assume that examples that are “close” should behave alike
– To classify an example, find the most similar training example
k-nearest neighbour classifier: find the k closest examples
– Predict the majority vote / class distribution among them
– The N-nearest neighbour classifier (using all N examples) is equivalent to the majority class classifier
How to determine which example is the closest?
– Use an appropriate distance, e.g. L1 / L2 distance
– See the lectures on “data basics” for building a distance function
Finding the nearest neighbour can be slow: scan through all points
– Some specialized data structures help, but are still not fast
– “Curse of dimensionality”: no fast solution for nearest neighbour search above roughly d = 5 dimensions
Overfitting: error on the training set is zero, but higher on the test set
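A minimal k-nearest-neighbour sketch in Python (an illustration of the idea, not Weka's IBk implementation), assuming numeric feature vectors and Euclidean distance:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Classify `query` by majority vote among its k nearest training examples.

    Uses a linear scan and Euclidean (L2) distance; train_X is a list of
    numeric feature vectors and train_y the corresponding class labels.
    """
    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Sort training examples by distance to the query and keep the k closest
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda xy: euclidean(xy[0], query))[:k]

    # Majority vote over the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```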
Nearest neighbour classifier in Weka
Listed as an “instance-based classifier” (classifiers.lazy.IBk)
– A class of classifiers that use the training data itself as the model
– IB1: very simple 1-nearest neighbour classifier
– IBk: more flexible k-nearest neighbour classifier
  The user provides k (default: 1), or cross-validation can be used to pick the best value of k
  Pick the search algorithm to use (default: linear scan)
  Pick the distance to use (default: Euclidean; L1 and L∞ also available)
– Fast to build (just store the data), but slow to evaluate/apply

Weka results for adult.data: kNN
Relation: adult
Instances: 48842
Attributes: 15
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
IB1 instance-based classifier using 10 nearest neighbour(s) for classification
Time taken to build model: 0.11 seconds
=== Evaluation on test split ===
Correctly Classified Instances    13851    83.4096 %
Incorrectly Classified Instances   2755    16.5904 %
Kappa statistic                   0.5359
Mean absolute error               0.2122
Root mean squared error           0.3392
Relative absolute error           58.5032 %
Root relative squared error       80.1583 %
Total Number of Instances         16606
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.643    0.108    0.645      0.643   0.644      0.874     >50K
               0.892    0.357    0.891      0.892   0.892      0.874     <=50K
Weighted Avg.  0.834    0.299    0.834      0.834   0.834      0.874
=== Confusion Matrix ===
     a      b   <-- classified as
  2492   1386 |  a = >50K
  1369  11359 |  b = <=50K

From points to regions
The principle behind the nearest neighbour classifier is sound
– Examples with similar class are “close” in the data space
The partition of space induced by nearest neighbours is complex
– Difficult to search/index
– Very detailed description
– Technically a “Voronoi diagram”
Other classifiers aim for a simpler description of the space
– Divide the space into “rectangles”
– Divide hierarchically for easy indexing

[Figure: a partition of the plane over Variable X and Variable Y into rectangles labelled Class A and Class B, alongside the equivalent tree of tests such as X < 4?, Y < 2?, X < 2?]
Hierarchical divisions of space are easy to navigate (for classification)
– The partition of space can be represented as a tree: a “decision tree”
– Convention: True branch to the left, False branch to the right
– Many ways to construct such a tree

Building a decision tree
Many choices in building a decision tree
– Start with all examples at the ‘root’ of the tree
– Choose an attribute to partition on, and a split value
– Recursively apply the algorithm on both halves
– Continue splitting until some stopping condition is met:
  All examples in a partition belong to the same class
  All examples have the same attribute values
  There is only one example in the partition
– Label each ‘leaf’ of the tree with the majority class in the leaf, or with the class distribution
The main difference between algorithms is how to choose splits (see the sketch below)
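Below is a rough recursive tree-building skeleton in Python (a sketch only, not any particular algorithm). The choose_split function is deliberately left abstract, since the split rule is exactly what distinguishes ID3, C4.5 and CART, discussed next.

```python
from collections import Counter

def build_tree(examples, labels, choose_split, min_size=2):
    """Recursively grow a decision tree.

    `choose_split` is a user-supplied function (e.g. based on information gain)
    that returns (attribute_index, split_value) or None if no useful split exists.
    A leaf is represented by the majority class label; an internal node by a dict.
    """
    # Stopping conditions: pure node, too few examples, or no split found
    if len(set(labels)) == 1 or len(examples) < min_size:
        return Counter(labels).most_common(1)[0][0]
    split = choose_split(examples, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]

    attr, value = split
    left  = [i for i, x in enumerate(examples) if x[attr] < value]
    right = [i for i, x in enumerate(examples) if x[attr] >= value]
    if not left or not right:                      # split failed to separate anything
        return Counter(labels).most_common(1)[0][0]

    return {
        "attr": attr, "value": value,              # test: is x[attr] < value?
        "true":  build_tree([examples[i] for i in left],  [labels[i] for i in left],
                            choose_split, min_size),
        "false": build_tree([examples[i] for i in right], [labels[i] for i in right],
                            choose_split, min_size),
    }
```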
Entropy of a Distribution
“Informativeness” is measured by the entropy of the distribution
– For a discrete distribution X with probabilities p_i, the entropy is H(X) = Σ_i p_i log(1/p_i)
– Shorthand for the binary case: H(p) = p log(1/p) + (1−p) log(1/(1−p))
Entropy measures the amount of “surprise” in seeing each draw
– Also a measure of the “purity” of the distribution
Low entropy (approaching zero): low information/surprise
– One item predominates
High entropy: more uncertainty about what comes next
– Many items are possible
– Maximum entropy is log n, if there are n possible choices

Information Gain
General principle: pick the split that does most to separate the classes
– Information theoretic view: pick the split with the highest information gain
The information gain from a proposed split is determined by the weighted sum of the entropies of the partitions
– Split into v partitions with class distributions X_1, X_2, …, X_v, with N_j items in split j
– Measure the weighted entropy H_split(X) = Σ_{j=1..v} (N_j/N) H(X_j)
– Pick the split with the smallest value of H_split(X)
– The difference H(X) − H_split(X) is the information gained by splitting
This is a locally greedy choice of attribute to split on
– It does not guarantee the “best” tree, but works well in practice

Choice of splits
Binary attributes (e.g. Male/Female): only one way to split!
Categoric attributes, e.g. {Single, Married, Widowed, Divorced}:
– Could pick one attribute value versus the rest to split on: n choices if the attribute has n possible values
– Could split n ways, one branch for each possible value
– Could pick any subset of attribute values to split into two: about 2^(n−1) choices if the attribute has n possible values
  e.g. {S}, {S, M}, {S, W}, {S, D}, {S, M, W}, {S, M, D}, {S, W, D}
Ordered attributes, e.g. Grade = (A, B, C, D, E):
– Pick some point to make a split into two
– n choices if the attribute has n discrete values
– Up to N choices if the attribute is real-valued with N values in the data; may pick the median or quantiles to limit the options

Example of Attribute Selection
Class Yes: buys_computer = “yes” (9 examples); Class No: buys_computer = “no” (5 examples)
H(X) = H(9/14) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

age    | Yes | No | H(Yes, No)
<=30   |  2  |  3 | 0.971
31…40  |  4  |  0 | 0
>40    |  3  |  2 | 0.971

H_age(X) = (5/14) H(2/5) + (4/14) H(4/4) + (5/14) H(3/5) = 0.694
– (5/14) H(2/5) means “age <=30” covers 5 of the 14 samples, with 2 yes and 3 no
Hence Gain(age) = H(X) − H_age(X) = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no

Slide adapted from Han, Kamber, Pei
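A small Python sketch (for illustration, not from the slides) of entropy and information gain for a categoric split; a function like this could provide the scoring inside the choose_split step assumed in the earlier skeleton. The final lines check the numbers against the worked example above.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = sum_i p_i log2(1/p_i) over the class distribution of `labels`."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain of splitting `labels` by the parallel list `attribute_values`.

    Gain = H(X) - H_split(X), where H_split is the size-weighted entropy
    of the partitions induced by each attribute value.
    """
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    h_split = sum((len(part) / n) * entropy(part) for part in partitions.values())
    return entropy(labels) - h_split

# Check against the slide example: splitting buys_computer by age
labels = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
ages   = ["<=30","<=30","31-40",">40",">40",">40","31-40","<=30","<=30",">40","<=30","31-40","31-40",">40"]
print(round(information_gain(labels, ages), 3))  # 0.247, matching Gain(age) = 0.246 up to rounding
```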
Decision tree algorithms and split rules
Different decision tree algorithms use different split rules:
– ID3 uses information gain as described
– C4.5 observes that information gain favours many-valued attributes
  Such splits are not informative if there are too many values: overfitting!
  Scale by the “split information”, the entropy of the split distribution:
  SplitInfo(split) = H(split) = Σ_{j=1..v} (N_j/N) log(N/N_j)
  Choose based on Gain Ratio = (H(X) − H_split(X)) / H(split)
– CART (classification and regression tree) uses the Gini index
  Gini(X) = 1 − Σ_{i=1..C} p_i(X)², where p_i(X) is the proportion of examples of class i in X, from C classes
  Gini_split(X) = Σ_{j=1..v} (N_j/N) Gini(X_j)
  This replaces H with Gini in the expression for information gain

Split rules
The three measures, in general, return good results, but:
– Information gain: biased towards multivalued attributes
– Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
– Gini index: biased towards multivalued attributes, struggles when C is large
Many other split rules and other variations have been suggested

Pruning Trees
[Figure: a full decision tree and a smaller, pruned version of the same tree]
Building a tree down to leaves that are pure is likely to overfit
– Pure: leaves only contain members of a single class
Better results can be obtained by pruning the tree
– Removing some branches from the tree to reduce its size/complexity
Prepruning: add additional stopping conditions when building
– E.g. limit the depth of the tree (length of path)
– E.g. stop when the number of examples in a subtree is small
Postpruning: build the full tree, then decide which branches to prune
– Easier to measure the difference before/after pruning

Postpruning rules
Prune based on a tradeoff between complexity and error
– Complexity: a measure of the complexity of the subtree
– Error: measured on a pruning set (separate from the training and test sets)
Reduced error pruning is simple and fast
– Start at the leaves, replace a node with its majority class
– Accept the change if it does not affect accuracy on the pruning set
Cost complexity pruning (CART) uses a more complex function
– Given the current tree T (with |T| leaves) and T’ with a node removed
– Measure E = error of T on the pruning set, E’ = error of T’ on the pruning set
– Select the T’ that minimizes (E’ − E)/(|T| − |T’|)
– Can evaluate all generated trees on a test set and pick the best

Classifying with a decision tree
Classifying a new example with a decision tree is straightforward:
– Start at the root
– Apply the first test, and determine which branch to follow
– Iterate until a leaf is reached
– Apply the label at the leaf
[Figure: the example decision tree with tests X < 4?, Y < 2?, X < 2?, etc., and the corresponding partition of the plane over Variable X and Variable Y into Class A and Class B regions]
What is the class predicted for ((3+4)/2, (1+2)/2)? For (0, 0)?
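A minimal sketch of this traversal in Python, reusing the hypothetical nested-dictionary tree representation from the earlier building skeleton (an illustrative encoding, not anything prescribed by the slides):

```python
def tree_predict(tree, example):
    """Walk the tree from the root; dict nodes hold a test, anything else is a leaf label."""
    node = tree
    while isinstance(node, dict):
        # Follow the "true" branch if the test x[attr] < value holds, else the "false" branch
        if example[node["attr"]] < node["value"]:
            node = node["true"]
        else:
            node = node["false"]
    return node
```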
Decision trees in Weka
trees.Id3 implements ID3 (information gain)
– Not robust: does not handle missing attribute values
trees.J48 implements C4.5 (information gain scaled by split information)

Weka results for adult.data via J48
J48 pruned tree
------------------
age <= 27: <=50K (12012.0/369.0)
age > 27
|   education-num <= 12
|   |   marital-status = Married-civ-spouse
|   |   |   education-num <= 8: <=50K (2323.0/304.0)
|   |   |   education-num > 8
|   |   |   |   hours-per-week <= 35: <=50K (1410.0/315.0)
|   |   |   |   hours-per-week > 35
|   |   |   |   |   age <= 35: <=50K (2760.0/833.0)
|   |   |   |   |   age > 35
|   |   |   |   |   |   education-num <= 9
|   |   |   |   |   |   |   occupation = Tech-support
|   |   |   |   |   |   |   |   workclass = Private: >50K (55.88/23.32)
|   |   |   |   |   |   |   |   workclass = Self-emp-not-inc: <=50K (1.03/0.01)
|   |   |   |   |   |   |   |   workclass = Self-emp-inc: >50K (0.0)
|   |   |   |   |   |   |   |   workclass = Federal-gov: >50K (10.35/3.24)
|   |   |   |   |   |   |   |   workclass = Local-gov: >50K (1.03/0.02)
|   |   |   |   |   |   |   |   workclass = State-gov: <=50K (5.17/1.05)
|   |   |   |   |   |   |   |   workclass = Without-pay: >50K (0.0)
|   |   |   |   |   |   |   |   workclass = Never-worked: >50K (0.0)
…

Weka results for adult.data via J48
Attributes: age, workclass, education, education-num, marital-status, occupation, relationship, race, sex, hours-per-week, native-country, class
Test mode: split 66.0% train, remainder test
Number of Leaves: 549
Size of the tree: 682
Time taken to build model: 8.4 seconds
=== Evaluation on test split ===
Correctly Classified Instances    13938    83.9335 %
Incorrectly Classified Instances   2668    16.0665 %
Kappa statistic                   0.5337
Mean absolute error               0.2263
Root mean squared error           0.3399
Relative absolute error           62.4059 %
Root relative squared error       80.3307 %
Total Number of Instances         16606
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.602    0.088    0.675      0.602   0.636      0.865     >50K
               0.912    0.398    0.883      0.912   0.897      0.865     <=50K
Weighted Avg.  0.839    0.326    0.834      0.839   0.836      0.865
=== Confusion Matrix ===
     a      b   <-- classified as
  2335   1543 |  a = >50K
  1125  11603 |  b = <=50K

Classifying based on attributes
The majority class classifier only looks at the global class distribution
– Could do better by looking at correlation with the attributes
Suppose argmax_C Pr[Class = C] depends on the attribute values
– E.g. one class is most likely for men, a different one for women
Use Pr[Class = C | X = x] to predict the class
– The conditional probability of Class given X: Pr[A | B] = Pr[A ∧ B] / Pr[B]
– The 1R classifier (oneR in Weka) picks the single best attribute for this
– It discretizes numeric attributes to make them categoric
How to incorporate information from multiple attributes?
– Naïve Bayes: use statistical reasoning and independence assumptions
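A rough Python sketch of the single-attribute idea behind 1R (an illustration of the principle, not Weka's OneR implementation): for each attribute, predict the most common class per attribute value, and keep the attribute whose rule makes the fewest training errors.

```python
from collections import Counter, defaultdict

def one_r(examples, labels):
    """Return (best_attribute_index, rule) where rule maps attribute value -> predicted class."""
    best_attr, best_rule, best_errors = None, None, float("inf")
    n_attrs = len(examples[0])
    for attr in range(n_attrs):
        # For this attribute, find the majority class for each of its values
        by_value = defaultdict(Counter)
        for x, y in zip(examples, labels):
            by_value[x[attr]][y] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(1 for x, y in zip(examples, labels) if rule[x[attr]] != y)
        if errors < best_errors:
            best_attr, best_rule, best_errors = attr, rule, errors
    return best_attr, best_rule
```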
Bayes’ theorem
Pr[A|B] Pr[B] = Pr[A ∧ B] = Pr[B ∧ A] = Pr[B|A] Pr[A]
– Hence, Pr[B|A] = Pr[A|B] Pr[B] / Pr[A]
Will use Bayes’ theorem to compute the probability of each class, given the evidence (the features of the example)
– Assume examples X have d attributes, X1 … Xd
– Want Pr[Class = C | X1=x1 ∧ X2=x2 ∧ … ∧ Xd=xd]   (+)
– Want to pick the C that maximizes this probability
Use Bayes’ theorem: Pr[Class | X] = Pr[Class] Pr[X | Class] / Pr[X], so
(+) = Pr[Class=C] Pr[X1=x1 ∧ … ∧ Xd=xd | Class=C] / Pr[X1=x1 ∧ … ∧ Xd=xd]
– Can drop the term Pr[X1=x1 ∧ … ∧ Xd=xd], as it does not depend on C
– Need to simplify Pr[X1=x1 ∧ … ∧ Xd=xd | Class=C]

Derivation of Naïve Bayes Classifier
Use the chain rule of probability: Pr[X ∧ Y | Z] = Pr[X | Z] Pr[Y | X ∧ Z]
– LHS: Pr[X ∧ Y ∧ Z] / Pr[Z]
– RHS: Pr[X ∧ Z] Pr[Y ∧ X ∧ Z] / (Pr[Z] Pr[X ∧ Z])
Applying the chain rule repeatedly:
Pr[X1=x1 ∧ … ∧ Xd=xd | Class=C]
  = Pr[X1=x1 | Class=C] Pr[X2=x2 ∧ … ∧ Xd=xd | Class=C ∧ X1=x1]
  = Pr[X1=x1 | Class=C] Pr[X2=x2 | Class=C ∧ X1=x1] Pr[X3=x3 ∧ … ∧ Xd=xd | Class=C ∧ X1=x1 ∧ X2=x2]
  = Π_{j=1..d} Pr[Xj=xj | Class=C ∧ X1=x1 ∧ … ∧ X(j-1)=x(j-1)]
Make an independence assumption among the variables:
– Pr[Xj=xj | Class=C ∧ X1=x1 ∧ X2=x2 ∧ …] = Pr[Xj=xj | Class=C]

Naïve Bayes Classifier
The Naïve Bayes classifier makes the (possibly naïve) assumption that each attribute is conditionally independent of the others, given the class
– This gives Pr[X1=x1 ∧ … ∧ Xd=xd | Class=C] = Π_{j=1..d} Pr[Xj=xj | Class=C]
Using (+), find the class C that maximizes Pr[Class=C] · Π_{j=1..d} Pr[Xj=xj | Class=C]
Can compute the probabilities Pr[Xj=xj | Class=C] directly from the data
– Categorical data: count the number of cases with Xj=xj and Class=C
– Numeric data: fit a distribution (conditioned on the class value)
Laplacian correction for unseen cases: add 1 to each count
– Avoids predicting probability zero for an unseen case
– Does not significantly alter the other probabilities
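A compact Python sketch of this scheme for categorical attributes, with the Laplacian correction (an illustration only; Weka's NaiveBayes additionally fits Gaussians to numeric attributes, as shown below). The worked example on the next slides applies exactly this computation by hand.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Return class counts and per-(attribute, class) value counts from categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)          # (attr_index, class) -> Counter of values
    for x, c in zip(examples, labels):
        for j, v in enumerate(x):
            value_counts[(j, c)][v] += 1
    return class_counts, value_counts

def predict_naive_bayes(class_counts, value_counts, x):
    """Pick argmax_C Pr[C] * prod_j Pr[X_j = x_j | C], with add-one (Laplace) smoothing."""
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                           # class prior Pr[Class = C]
        for j, v in enumerate(x):
            counts = value_counts[(j, c)]
            num_values = len(counts) + 1         # +1 leaves room for an unseen value
            score *= (counts[v] + 1) / (cc + num_values)   # Laplacian correction
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```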
Naïve Bayes Classifier: Training Dataset
Classes:
– C1: buys_computer = ‘yes’
– C2: buys_computer = ‘no’
Example to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no

Slide adapted from Han, Kamber, Pei

Naïve Bayes Classifier: An Example
To classify X = (age <= 30, income = medium, student = yes, credit_rating = fair), using the training data above
Compute the global class distribution:
– Pr[buys_computer = “yes”] = 9/14 = 0.643
– Pr[buys_computer = “no”] = 5/14 = 0.357
Compute Pr[Xj = xj | Class = C] for each class C:
– Pr[age = “<=30” | buys_computer = “yes”] = 2/9
– Pr[age = “<=30” | buys_computer = “no”] = 3/5
– Pr[income = “medium” | buys_computer = “yes”] = 4/9
– Pr[income = “medium” | buys_computer = “no”] = 2/5
– Pr[student = “yes” | buys_computer = “yes”] = 6/9
– Pr[student = “yes” | buys_computer = “no”] = 1/5
– Pr[credit_rating = “fair” | buys_computer = “yes”] = 6/9
– Pr[credit_rating = “fair” | buys_computer = “no”] = 2/5
Score for “Yes” for X: (9/14) × (2/9) × (4/9) × (6/9) × (6/9) = 0.028
Score for “No” for X: (5/14) × (3/5) × (2/5) × (1/5) × (2/5) = 0.007
Naïve Bayes predicts Class “Yes”
Slide adapted from Han, Kamber, Pei

Naïve Bayes in Weka
Attribute              >50K      <=50K
                       (0.24)    (0.76)
===============================================
age
  mean                 44.2752   36.8722
  std. dev.            10.5585   14.1039
  weight sum           11687     37155
  precision            1         1
workclass
  Private              7388.0    26520.0
  Self-emp-not-inc     1078.0    2786.0
  Self-emp-inc         939.0     758.0
  Federal-gov          562.0     872.0
  Local-gov            928.0     2210.0
  State-gov            531.0     1452.0
  Without-pay          3.0       20.0
  Never-worked         1.0       11.0
  [total]              11430.0   34629.0
…
Time taken to build model: 0.51 seconds
Weka fits a Gaussian distribution for each class value on numeric data
Weka applies the Laplacian correction to categorical attributes

Naïve Bayes in Weka
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: adult-weka.filters.unsupervised.attribute.Remove-R3,11-12
Instances: 48842
Test mode: split 66.0% train, remainder test
=== Evaluation on test split ===
Correctly Classified Instances    13519    81.4103 %
Incorrectly Classified Instances   3087    18.5897 %
Kappa statistic                   0.5295
Mean absolute error               0.2043
Root mean squared error           0.3681
Relative absolute error           56.3263 %
Root relative squared error       86.9948 %
Total Number of Instances         16606
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.751    0.167    0.579      0.751   0.654      0.885     >50K
               0.833    0.249    0.917      0.833   0.873      0.885     <=50K
Weighted Avg.  0.814    0.23     0.838      0.814   0.822      0.885
=== Confusion Matrix ===
     a      b   <-- classified as
  2913    965 |  a = >50K
  2122  10606 |  b = <=50K

Linearly separable data
The models seen so far have tried to partition up the data space
– Decision trees: partition into (hyper)rectangles
Sometimes the data is linearly separable
– Examples of one class fall on one side of a line / hyperplane
– The other class is on the other side of the plane
Linear classifiers aim to find the ‘best’ such plane
– Allows a simple description of the model, defined by the separating hyperplane
– Early examples: the winnow algorithm, the perceptron (used in neural networks)
[Figure: points of two classes (x and o) separated by a line]

Support Vector Machines
The model of “support vector machines” (SVM) seeks the best plane
– When data is linearly separable, many hyperplanes can separate it
– SVM seeks the plane that maximizes the margin between the classes
[Figure: a separating plane with a small margin versus one with a large margin; the points on the margin are the support vectors]

SVM as an optimization problem
In vector notation, a hyperplane is written as W·X − b = 0
– W: (d-dimensional) vector orthogonal to the plane
– X: a point in (d-dimensional) space
– b: a scalar offset
Any X “above” the hyperplane has (W·X − b) > 0
– (W·X − b) < 0: X is “below” the hyperplane
Choose the scale of the vector W so that:
– (W·X − b) ≥ 1 for points in class “+1”
– (W·X − b) ≤ −1 for points in class “-1”
Then seek weights W so that for all examples (y, x), y(W·x − b) ≥ 1
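A tiny numpy sketch (illustrative only) of the resulting decision rule and of the geometric distance of a point from the hyperplane:

```python
import numpy as np

def svm_decision(w, b, x):
    """Predict +1 or -1 from the sign of the linear score w.x - b."""
    return 1 if np.dot(w, x) - b > 0 else -1

def distance_to_plane(w, b, x):
    """Geometric distance of x from the hyperplane w.x - b = 0: |w.x - b| / ||w||_2."""
    return abs(np.dot(w, x) - b) / np.linalg.norm(w)

# With the scaling above, the support vectors satisfy |w.x - b| = 1,
# so the margin between the two classes is 2 / ||w||_2.
```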
SVM as an optimization problem
Want to maximize the margin
– The distance of a point x from the plane is (W·x − b)/ǁWǁ2
– For the support vectors, (W·x − b) = 1
– Hence, the margin is 2/ǁWǁ2 (where ǁWǁ2 is the Euclidean norm of W)
This becomes an optimization problem:
– Maximize 2/ǁWǁ2, equivalently minimize ǁWǁ2²
– Subject to: yi(W·xi − b) ≥ 1 for all i
– A constrained convex quadratic optimization problem
The solution of this optimization identifies the support vectors
– Those points that achieve equality, yi(W·xi − b) = 1

Applying SVM
Given W and b, SVM is very fast to apply to new examples
– If (W·X − b) > 0, output “+1”, else output “-1”
However, it can be very slow to train
– Not fast to solve the convex optimization with many constraints
Accuracy is generally good, and SVM avoids overfitting
– The margin is determined by only the few support vectors
– Adding/removing points away from the support vectors doesn’t affect the SVM
SVM has a few limitations
– Defined only for two classes: need e.g. one-vs-all for more classes
– Assumes the data is linearly separable – not true in general

Kernel Functions
Try separating the data by “lifting” it to a higher dimensional space
– Points not separated by a line may be separated by a curve
– We want to learn a separating curve in the new space
– E.g. add products of pairs of dimensions: (x1²), (x1·x2), (x2·x3)…
Many kernel functions have been proposed:
– Degree k polynomial kernel: K(Xi, Xj) = (Xi·Xj + 1)^k
– Gaussian radial basis kernel: K(Xi, Xj) = exp(−ǁXi − Xjǁ2² / 2s²)
– Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi·Xj − c)
[Figure: data remapped using the radial basis kernel]
More details on kernels in CS909 Data Mining

Soft Margin Classifiers
Some data is not conveniently separated, even with kernels
– So relax the requirement of an exact separation of the data
Allow some “slack” for each data point (as a new variable zi)
– The constraints with slack are now yi(W·xi − b) ≥ 1 − zi
– Slack allows the point to fall on the wrong side of the hyperplane
– Then minimize the weighted combination ǁWǁ2² + C Σi zi
– This allows SVM to generalize to more cases
SVM has been highly popular in practice: it earned a Kanellakis award as “one of the most frequently used algorithms in ML … used in medical diagnosis, weather forecasting, and intrusion detection”
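A rough numpy sketch of training a soft-margin linear SVM by stochastic subgradient descent on this objective (an illustration of the idea only; production solvers such as Weka's SMO use much more sophisticated optimization):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.001, epochs=100, seed=0):
    """Soft-margin linear SVM trained by stochastic subgradient descent.

    Minimizes  0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i - b))
    X: (n, d) array of features; y: array of labels in {-1, +1}.
    Returns (w, b).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (np.dot(w, X[i]) - b)
            if margin < 1:
                # Subgradient of the regularizer plus the active hinge term
                w -= lr * (w - C * y[i] * X[i])
                b -= lr * (C * y[i])
            else:
                # Only the regularizer contributes
                w -= lr * w
    return w, b

# Prediction is then sign(w.x - b), as on the "Applying SVM" slide.
```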
SVM evaluated on adult.data in Weka
Scheme: weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Relation: adult-weka.filters.unsupervised.attribute.Remove-R3,11-12
Instances: 48842
Attributes: 12
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
Kernel used: Linear Kernel: K(x,y) = <x,y>
Classifier for classes: >50K, <=50K
Time taken to build model: 3472.87 seconds
=== Evaluation on test split ===
Correctly Classified Instances    13912    83.7769 %
Incorrectly Classified Instances   2694    16.2231 %
Total Number of Instances         16606
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.556    0.076    0.689      0.556   0.615      0.74      >50K
               0.924    0.444    0.872      0.924   0.897      0.74      <=50K
Weighted Avg.  0.838    0.358    0.829      0.838   0.831      0.74
=== Confusion Matrix ===
     a      b   <-- classified as
  2155   1723 |  a = >50K
   971  11757 |  b = <=50K

Comparison of classifiers on adult.data
Name                 | Accuracy | AUC   | FP Rate | Time to build | Kappa
Majority class       | 0.766    | 0.5   | 0.766   | 0.1s          | 0
K Nearest Neighbour  | 0.834    | 0.874 | 0.299   | 0.1s          | 0.5359
Decision tree (J48)  | 0.839    | 0.865 | 0.326   | 8.4s          | 0.5337
Naïve Bayes          | 0.814    | 0.885 | 0.23    | 0.5s          | 0.5295
SVM                  | 0.837    | 0.74  | 0.358   | 3472s         | 0.5141
Logistic Regression  | 0.837    | 0.888 | 0.35    | 31s           | 0.5175

There are tradeoffs between accuracy, FP rate, AUC, time to build, time to apply, and interpretability of the resulting classifier
Different data sets (or different objectives) will show different behaviour

Summary of Classifiers
Many different ways to construct classifiers to predict values
– General framework: build on training data, evaluate on test data
– Decision tree classifiers are constructed by recursively splitting the data
– Naive Bayes makes an independence assumption between variables
– SVM classifiers find a linear separation – possibly via kernels
– Can construct and compare classifiers using Weka
Background reading: Chapter 8 (Classification: Basic Concepts), Section 9.3 (Support Vector Machines) and Section 9.5 (Lazy Learners) in "Data Mining: Concepts and Techniques" (Han, Kamber, Pei).