CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk

Classification
Objectives

• To understand the concept of classifiers and their construction
• See how decision tree classifiers are constructed
• Statistical classifiers: naive Bayes (again)
• Linear separation via SVM classifiers
• Construct and compare classifiers using Weka
Classification

• Regression concerns creating a model for numeric values
• Classification is about creating models for categoric values
  – Many different approaches to classification
  – "The" core problem in machine learning (see CS342, CS909)
• Many different uses for classification:
  – Use the model to understand data better
  – Use classification to fill in missing values
  – Identify how one variable depends on others
  – Predict future values based on partial information
Applications of Classification

• Many applications of classification in practice:
  – Determine whether a loan is likely to be repaid (credit scoring)
  – Fraud detection: identify when a transaction is fraudulent
  – Medical diagnosis: initial classification of test results
  – Identify text from writing (optical character recognition)
  – Determine whether an email is spam (spam filter)
  – Predict whether someone will like a product (collaborative filtering)
  – Decide which advert to place on a webpage
  – Determine likelihood of suffering a disease based on history
  – and many more…
The Model in Classification

• Many different choices in how to model data
• Central concept: data = model + error
  – Attempt to extract a model from the data (the error is unknown)
  – Error comes from noise, measurement error, random variation
  – We try to avoid modelling the error: doing so leads to overfitting
  – Overfitting is possible when the model has too many parameters
  – An overfitted model only describes the training data and does not generalize
  – The converse problem is underfitting: the model cannot capture the behaviour
     · E.g. modelling a curve with a line
Classification Process

• Classification phase one: Training
  – Obtain a data set with examples and a target attribute (class)
  – Use the training data to build a model of the data
• Classification phase two: Evaluation
  – Apply the model to test data with hidden class value
  – Evaluate quality of model by comparing prediction to hidden value
  – Accept model, or reject if not sufficiently accurate
• Classification phase three: Application
  – Apply the model to new data with unknown class value
  – Monitor accuracy: look for concept drift
     · Concept drift: properties of the target change over time
Approaches to Evaluation
Many approaches to evaluation of a classifier
• Apply the classifier to the training data (test data = training data)
  – Risk of overfitting here: the classifier works well on the training data but does not generalize
  – Some classifiers achieve zero error when applied to training data
• Hold back a subset of the training data for testing
  – Holdout method: typically use 2/3 for training, 1/3 for testing
     · Balance: enough data to train, but enough to test also
  – Have seen this with the UCI adult data set:
     · adult.data: 32K training examples, adult.test: 16K test examples
  – To evaluate a new classifier, may repeat with different random splits
Approaches to Evaluation

• k-fold cross-validation: robust checking of a model
  – Divide data into k equal pieces (folds). Typical choice: k=10
  – Use (k-1) folds to form the training set
  – Evaluate model accuracy on the 1 remaining fold
  – Repeat for each of the k choices of training set
• Stratified cross-validation:
  – Additionally, ensure that each fold has the same class distribution
  – Likely to hold if folds are chosen randomly

[Figure: the data divided into Fold 1, Fold 2, Fold 3, Fold 4]
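As a concrete sketch of the procedure above, the following plain-Python code estimates accuracy by stratified k-fold cross-validation. The `train` and `predict` arguments are assumed placeholders for whatever learning algorithm is being evaluated; this illustrates the protocol, not Weka's implementation.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split example indices into k folds, preserving the class distribution."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)          # deal indices round-robin into the folds
    return folds

def cross_validate(examples, labels, k, train, predict):
    """Average accuracy over k train/test splits (k-fold cross-validation)."""
    folds = stratified_folds(labels, k)
    accuracies = []
    for f in range(k):
        test_idx = set(folds[f])
        train_idx = [i for i in range(len(examples)) if i not in test_idx]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```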
Measures of Accuracy

• Typically, a classifier outputs a single class label for an example
  – May output a distribution of beliefs over the labels
  – To predict from a distribution, either sample or take the majority vote
• For each test example, the result is either right or wrong
  – Typically, a "near miss" is not counted…
  – Accuracy counts the fraction of examples correctly classified
• There are several other commonly used measures of quality
  – True positive rate, false positive rate, Area Under Curve (AUC)…
Measures of Accuracy
Prediction \ Truth   True                 False
Positive             True positive (TP)   False positive (FP)
Negative             False negative (FN)  True negative (TN)

• False positive rate: FP/(FP + TN)
  – What fraction of False values are reported as positive?
• True positive rate (aka sensitivity, recall): TP/(TP + FN)
  – What fraction of True values are reported as positive?
• Precision: TP/(TP + FP)
  – What fraction of those reported positive are correct?
• Accuracy: (TP + TN)/(TP + TN + FN + FP)
• F1 Measure: 2TP/(2TP + FP + FN)
  – Harmonic mean of precision and recall
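These measures all follow directly from the four confusion-matrix counts. A small illustrative helper in plain Python (not part of Weka):

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard quality measures computed from binary confusion-matrix counts."""
    return {
        "accuracy":  (tp + tn) / (tp + fp + fn + tn),
        "tp_rate":   tp / (tp + fn),          # sensitivity / recall
        "fp_rate":   fp / (fp + tn),
        "precision": tp / (tp + fp),
        "f1":        2 * tp / (2 * tp + fp + fn),
    }

# Example: the kNN confusion matrix shown later, with >50K as the positive class
print(binary_metrics(tp=2492, fp=1369, fn=1386, tn=11359))
# accuracy ≈ 0.834, tp_rate ≈ 0.643, fp_rate ≈ 0.108, precision ≈ 0.645, f1 ≈ 0.644
```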
ROC and AUC

• Assume each prediction has a confidence p between 0 and 1
  – Can pick a threshold r, and round all confidences < r to 0, ≥ r to 1
  – For each choice of r, we get a true positive and false positive rate
  – Different choices of r give a different tradeoff
• Receiver Operating Characteristic (ROC) curve:
  – Closer to the top-left is better
  – Area Under Curve (AUC) measures overall quality
  – Compute AUC by stepping through the sorted confidence values
  – Random guessing gives the diagonal line
     · AUC is 0.5
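A minimal sketch of the "step through the sorted confidence values" computation, assuming binary labels coded as 0/1 and higher confidence meaning "more likely positive" (illustrative code, ignoring tied confidences):

```python
def roc_auc(confidences, labels):
    """Trace ROC points by sweeping the threshold down through the sorted
    confidences, and accumulate the Area Under the Curve (trapezoid rule)."""
    pairs = sorted(zip(confidences, labels), reverse=True)   # most confident first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]            # (false positive rate, true positive rate)
    for conf, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return points, auc

# Perfectly ordered confidences give AUC = 1; random ordering tends towards 0.5
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])[1])   # -> 1.0
```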
Class-imbalance

• Accuracy is not always the best choice
  – Class-imbalance problem: what if one outcome is rare?
  – E.g. the diagnosis is (rare disease) in 0.1% of individuals
  – If we always predict "no disease", we are correct with 99.9% accuracy…
  – True positive rate may be better here
• Some classifiers do not handle class imbalance well
  – They expect a roughly even mix of class values
  – Can "oversample" the positives, and "undersample" the negatives
• Look at the confusion matrix: how each class is labeled
  – The matrix counts how many examples in class A are labeled B, for all pairs of classes A and B
Kappa statistic

• The Kappa statistic compares accuracy to random guessing
  – Measure po as the observed agreement between two labelers
     · Here, between a classifier and the true (withheld) labels
  – pe is the agreement expected if both were labeling at random
     · Based on the empirical probabilities of giving each label
  – κ = (po – pe)/(1 – pe)
• Computing κ for a binary classifier:
  – po = (TP + TN)/N  [fraction that are correct]
  – pe = (TP + FP)/N × (TP + FN)/N + (TN + FN)/N × (FP + TN)/N
    [fraction labeled positive × fraction that are true, plus fraction labeled negative × fraction that are false]
• Higher values of κ are better – there is no agreement on what counts as "good"
  – Will see examples as we go on
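Plugging the binary-case formulas into a few lines of Python (an illustrative sketch, reusing the kNN confusion-matrix counts that appear later in these slides):

```python
def kappa(tp, fp, fn, tn):
    """Kappa statistic for a binary confusion matrix."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((fp + tn) / n)
    return (p_observed - p_expected) / (1 - p_expected)

# kNN on adult.data, with >50K as the positive class:
print(round(kappa(tp=2492, fp=1369, fn=1386, tn=11359), 4))   # ≈ 0.5359, matching Weka
```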
Considerations for classifiers

• Which classification model should you use out of the many choices?
  – The most accurate may not be the most desirable
• Speed is a consideration also
  – Some classifiers take a prohibitively long time to train
  – Some take a very long time to apply to a new example
  – Scalability: some do not work well when data exceeds memory size
• Robustness to noise / missing values
• Interpretability of results
  – Can a human make sense of the model that is produced?
  – Does the classifier help to identify which are the most important attributes?
Majority class classifier

• A trivial baseline: always predict the majority class
  – More generally, predict the distribution of the class values
• Suppose the majority class occurs a p fraction of the time
  – Then this should achieve (approximately) p accuracy
• Not a useful or meaningful classifier
  – But an important benchmark: any good model should clearly outperform the majority class classifier
• Implemented as the "ZeroR" model in Weka: classifiers.rules.ZeroR
  – Generalizes to numeric data: predicts the mean
Weka Results for adult.data majority class
Relation:     adult
Instances:    48842
Attributes:   15
Test mode:    split 66.0% train, remainder test

=== Classifier model (full training set) ===
ZeroR predicts class value: <=50K
Time taken to build model: 0.01 seconds

=== Evaluation on test split ===
Correctly Classified Instances       12728               76.647  %
Incorrectly Classified Instances      3878               23.353  %
Kappa statistic                          0
Mean absolute error                      0.3626
Root mean squared error                  0.4232
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances            16606

=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0         0         0           0        0           0.5        >50K
               1         1         0.766       1        0.868       0.5        <=50K
Weighted Avg.  0.766     0.766     0.587       0.766    0.665       0.5

=== Confusion Matrix ===
     a     b   <-- classified as
     0  3878 |     a = >50K
     0 12728 |     b = <=50K
Nearest neighbour classifier

• Assume that examples that are "close" should behave alike
  – To classify an example, find the most similar training example
• k-nearest neighbour classifier: find the k closest examples
  – And predict the majority vote / class distribution
  – N-nearest neighbour classifier: equivalent to Majority-class
• How to determine which is the closest?
  – Use an appropriate distance, e.g. L1 / L2 distance
     · See lectures on "data basics" for building a distance function
  – Finding the nearest neighbour can be slow: scan through all points
  – Some specialized data structures help, but still not fast
     · "Curse of dimensionality": no fast solution for NN above ~ d=5
  – Overfitting: error on the training set is zero, but higher on the test set
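A minimal sketch of a k-nearest neighbour classifier over numeric attributes (plain Python, linear scan through all training points, with a hypothetical toy data set):

```python
import math
from collections import Counter

def euclidean(x, y):
    """L2 distance between two numeric feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_points, train_labels, query, k=1, dist=euclidean):
    """Classify `query` by majority vote over its k nearest training examples."""
    neighbours = sorted(zip(train_points, train_labels),
                        key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Tiny usage example with made-up 2-D points
train_x = [(1, 1), (2, 1), (8, 9), (9, 8)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_x, train_y, (1.5, 1.2), k=3))   # -> "A"
```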
Nearest neighbour classifier in Weka

• Listed as an "instance based classifier" (classifiers.lazy.IBk)
  – Class of classifiers that use the training data as the model
  – IB1: very simple 1-nearest neighbour classifier
  – IBk: more flexible k-nearest neighbour classifier
     · User provides k (default: 1)
     · Or can use cross-validation to pick the best value of k
     · Pick the search algorithm to use (default: linear scan)
     · Pick the distance to use (default: Euclidean; L1, L∞ also available)
  – Fast to build (just look at the data), but slow to evaluate/apply
Weka Results for adult.data kNN
Relation:     adult
Instances:    48842
Attributes:   15
Test mode:    split 66.0% train, remainder test

=== Classifier model (full training set) ===
IB1 instance-based classifier using 10 nearest neighbour(s) for classification
Time taken to build model: 0.11 seconds

=== Evaluation on test split ===
Correctly Classified Instances       13851               83.4096 %
Incorrectly Classified Instances      2755               16.5904 %
Kappa statistic                          0.5359
Mean absolute error                      0.2122
Root mean squared error                  0.3392
Relative absolute error                 58.5032 %
Root relative squared error             80.1583 %
Total Number of Instances            16606

=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.643     0.108     0.645       0.643    0.644       0.874      >50K
               0.892     0.357     0.891       0.892    0.892       0.874      <=50K
Weighted Avg.  0.834     0.299     0.834       0.834    0.834       0.874

=== Confusion Matrix ===
     a     b   <-- classified as
  2492  1386 |     a = >50K
  1369 11359 |     b = <=50K
From points to regions

• The principle behind the nearest neighbour classifier is sound
  – Examples with a similar class are "close" in the data space
• The partition of space induced by nearest neighbours is complex
  – Difficult to search/index
  – Very detailed description
  – Technically a "Voronoi diagram"
• Other classifiers aim for a simpler description of the space
  – Divide space into "rectangles"
  – Divide hierarchically for easy indexing
1
X <  4?
Y < 2?
A
Y < 1?
X <  2?
A21
B
B
X <  3?
A
Class A
Class B
Class A
Y < 1?
X <  2?
X <  1?
2
B
A
B
1 2 3
X <  5?
B
A
Class B
Class A
Can be represented as a tree
Class A
Class A
–
Class B
Variable Y
 Hierarchical divisions of space are
easy to navigate (for classification)
Class B
Partition of space
Class B
4
Variable X
5
 A “decision tree”
 Convention:
True to left, False to right
 Many ways to construct
CS910 Foundations of Data Analytics
Building a decision tree

• Many choices in building a decision tree
  – Start with all examples at the 'root' of the tree
  – Choose an attribute to partition on, and a split value
  – Recursively apply the algorithm on both halves
  – Continue splitting until some stopping condition is met:
     · All examples in a partition belong to the same class
     · All examples have the same attribute values
     · There is only one example in the partition
  – Label each 'leaf' of the tree with the majority class in the leaf
     · Or with the class distribution
• Main difference between algorithms is how to choose splits

[Figure: the decision tree from the previous slide, built up split by split]
Entropy of a Distribution

• "Informativeness" is measured by the entropy of the distribution
  – For a discrete PDF X, the entropy is H(X) = Σi pi log(1/pi)
     · Shorthand for the binary case: Hp(p) = p log(1/p) + (1-p) log(1/(1-p))
  – Measures the amount of "surprise" in seeing each draw
     · Also a measure of the "purity" of the distribution
  – Low entropy (approaching zero): low information/surprise
     · One item predominates
  – High entropy: more uncertainty about what comes next
     · Many items are possible
     · Maximum entropy is log n, if there are n possible choices
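A one-function sketch of this definition in Python (using log base 2, so the entropy is measured in bits):

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as counts: H(X) = sum_i p_i log2(1/p_i)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))    # Hp(9/14) ≈ 0.940, as in the buys_computer example later
print(entropy([7, 7]))              # maximum for two equally likely classes: 1 bit
print(entropy([14, 0]))             # a pure distribution: 0 bits
```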
Information Gain

• General principle: pick the split that does most to separate the classes
  – Information theoretic: pick the split with the highest information gain
• The information gain from a proposed split is determined by the weighted sum of the entropies of the partitions
  – Split into v partitions with class distributions X1, X2, … Xv
     · With Nj items in split j
  – Measure the weighted entropy Hsplit(X) = Σj=1..v (Nj/N) H(Xj)
• Pick the split with the smallest value of Hsplit(X)
  – The difference (H(X) – Hsplit(X)) is the information gained by splitting
  – This is a locally greedy choice of attribute to split on
     · Does not guarantee the "best" tree, but works well in practice
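Building on the entropy() sketch above, the information gain of a candidate split can be computed as follows (the numbers anticipate the buys_computer example a couple of slides on):

```python
from collections import Counter

def information_gain(labels, partition_labels):
    """Information gain H(X) - Hsplit(X) for splitting `labels` into the
    partitions in `partition_labels` (a list of label lists); reuses entropy()."""
    n = len(labels)
    h_before = entropy(list(Counter(labels).values()))
    h_split = sum(len(part) / n * entropy(list(Counter(part).values()))
                  for part in partition_labels)
    return h_before - h_split

# Splitting the buys_computer data on "age":
labels = ["yes"] * 9 + ["no"] * 5
by_age = [["yes"] * 2 + ["no"] * 3,   # age <= 30
          ["yes"] * 4,                # age 31...40
          ["yes"] * 3 + ["no"] * 2]   # age > 40
print(information_gain(labels, by_age))   # ≈ 0.2467; the worked example quotes 0.940 - 0.694 = 0.246
```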
Choice of splits

• Binary attributes (e.g. Male/Female): only one way to split!
• Categoric attributes, e.g. {Single, Married, Widowed, Divorced}
  – Could pick one attribute value versus the rest to split on
     · n choices if the attribute has n possibilities
  – Could split n ways, one branch for each possibility
  – Could pick any subset of attribute values to split into two
     · ~2^(n-1) choices if the attribute has n possibilities,
       e.g. {S}, {S, M}, {S, W}, {S, D}, {S, M, W}, {S, M, D}, {S, W, D}
• Ordered attributes, e.g. Grade = (A, B, C, D, E)
  – Pick some point to make a split into two
     · n choices if the attribute has n discrete values
     · Up to N choices if the attribute is real-valued and has N values in the data
     · May pick the median or quantiles to limit the options
Example of Attribute Selection

• Class Yes: buys_computer = "yes"
• Class No: buys_computer = "no"

H(X) = Hp(9/14) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age     Yes  No  Hp(Y/N)
<=30     2    3   0.971
31…40    4    0   0
>40      3    2   0.971

Hage(X) = (5/14) Hp(2/5) + (4/14) Hp(4/4) + (5/14) Hp(3/5) = 0.694

(5/14) Hp(2/5) means "age <=30" has 5 out of 14 samples, with 2 yes and 3 no. Hence

Gain(age) = H(X) – Hage(X) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Slide adapted from Han, Kamber, Pei
Decision tree algorithms and split rules
Different decision tree algorithms use different split rules:
• ID3 uses information gain as described
• C4.5 observes that information gain favours many-valued attributes
  – Not informative if too many values: overfitting!
  – Scale by "split information", the entropy of the split distribution
  – SplitInfo(split) = H(split) = Σj=1..v (Nj/N) log(N/Nj)
  – Choose based on Gain Ratio = (H(X) – Hsplit(X))/H(split)
• CART (classification and regression tree) uses the Gini index
  – Gini(X) = 1 – Σi=1..C pi(X)²
     · pi(X) = proportion of examples of class i in X, from C classes
  – Ginisplit(X) = Σj=1..v (Nj/N) Gini(Xj)
     · Replaces H with Gini in the expression for information gain
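Following the same pattern as before, these alternative split measures are small variations on the earlier sketches (illustrative code; `labels` and `by_age` are the hypothetical buys_computer variables from the information-gain snippet, and entropy() / information_gain() are reused):

```python
def gini(counts):
    """Gini index of a class distribution given as counts: 1 - sum_i p_i^2."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gain_ratio(labels, partition_labels):
    """C4.5-style gain ratio: information gain scaled by the split information,
    i.e. the entropy of the split sizes themselves."""
    split_info = entropy([len(part) for part in partition_labels])
    return information_gain(labels, partition_labels) / split_info

print(round(gini([9, 5]), 3))                  # Gini of the whole buys_computer data ≈ 0.459
print(round(gain_ratio(labels, by_age), 3))    # gain ratio for the 3-way split on age ≈ 0.156
```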
Split rules

• The three measures, in general, return good results, but
  – Information gain: biased towards multivalued attributes
  – Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  – Gini index: biased towards multivalued attributes, struggles when C is large
• Many other split rules and other variations have been suggested
Pruning Trees

[Figure: a full decision tree, and a smaller version of the same tree with some branches pruned away]

• Building a tree down to leaves that are pure is likely to overfit
  – Pure: only contain members of a single class
• Better results can be obtained by pruning the tree
  – Removing some branches from the tree to reduce size/complexity
• Prepruning: add additional stopping conditions when building
  – E.g. limit the depth of the tree (length of path)
  – E.g. stop when the number of examples in a subtree is small
• Postpruning: build the full tree, then decide which branches to prune
  – Easier to measure the difference before/after pruning
Postpruning rules

• Prune based on a tradeoff between complexity and error
  – Complexity: a measure of the complexity of the subtree
  – Error: measured on a pruning set (separate from the training and test sets)
• Reduced error pruning is simple and fast
  – Start at the leaves, replace each node with the majority class
  – Accept if this does not affect accuracy on the pruning set
• Cost complexity pruning (CART) uses a more complex function
  – Given the current tree T (with |T| leaves) and T' with a node removed
  – Measure E = error of T on the pruning set, E' = error of T' on the pruning set
  – Select the T' that minimizes (E' – E)/(|T| - |T'|)
  – Can evaluate all generated trees on a test set, and pick the best
Classifying with a decision tree

• Classifying a new example with a decision tree is straightforward
  – Start at the root
  – Apply the first test, and determine which branch to follow
  – Iterate until a leaf is reached
  – Apply the label at the leaf

[Figure: the example decision tree and the corresponding partition of the (Variable X, Variable Y) space into Class A / Class B regions]

What is the class predicted for ((3+4)/2, (1+2)/2)? For (0, 0)?
Decision trees in Weka

• trees.id3 implements ID3 (information gain)
  – not robust: does not allow missing attributes
• trees.J48 implements C4.5 (information gain scaled by split info)
Weka results for adult.data via J48
J48 pruned tree
------------------
age <= 27: <=50K (12012.0/369.0)
age > 27
|   education-num <= 12
|   |   marital-status = Married-civ-spouse
|   |   |   education-num <= 8: <=50K (2323.0/304.0)
|   |   |   education-num > 8
|   |   |   |   hours-per-week <= 35: <=50K (1410.0/315.0)
|   |   |   |   hours-per-week > 35
|   |   |   |   |   age <= 35: <=50K (2760.0/833.0)
|   |   |   |   |   age > 35
|   |   |   |   |   |   education-num <= 9
|   |   |   |   |   |   |   occupation = Tech-support
|   |   |   |   |   |   |   |   workclass = Private: >50K (55.88/23.32)
|   |   |   |   |   |   |   |   workclass = Self-emp-not-inc: <=50K (1.03/0.01)
|   |   |   |   |   |   |   |   workclass = Self-emp-inc: >50K (0.0)
|   |   |   |   |   |   |   |   workclass = Federal-gov: >50K (10.35/3.24)
|   |   |   |   |   |   |   |   workclass = Local-gov: >50K (1.03/0.02)
|   |   |   |   |   |   |   |   workclass = State-gov: <=50K (5.17/1.05)
|   |   |   |   |   |   |   |   workclass = Without-pay: >50K (0.0)
|   |   |   |   |   |   |   |   workclass = Never-worked: >50K (0.0)
Weka results for adult.data via J48
Attributes:   age, workclass, education, education-num, marital-status, occupation,
              relationship, race, sex, hours-per-week, native-country, class
Test mode:    split 66.0% train, remainder test

Number of Leaves  : 549
Size of the tree  : 682
Time taken to build model: 8.4 seconds

=== Evaluation on test split ===
Correctly Classified Instances       13938               83.9335 %
Incorrectly Classified Instances      2668               16.0665 %
Kappa statistic                          0.5337
Mean absolute error                      0.2263
Root mean squared error                  0.3399
Relative absolute error                 62.4059 %
Root relative squared error             80.3307 %
Total Number of Instances            16606

=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.602     0.088     0.675       0.602    0.636       0.865      >50K
               0.912     0.398     0.883       0.912    0.897       0.865      <=50K
Weighted Avg.  0.839     0.326     0.834       0.839    0.836       0.865

=== Confusion Matrix ===
     a     b   <-- classified as
  2335  1543 |     a = >50K
  1125 11603 |     b = <=50K
Classifying based on attributes

• The majority class classifier only looks at the global class distribution
• Could do better by looking at correlation with attributes
  – Suppose argmaxC Pr[Class = C] depends on attribute values
  – E.g. one class is most likely for men, a different one for women
  – Use Pr[Class = C | X = x] to predict the class
     · Conditional probability of Class given X
     · Pr[A | B] = Pr[A ∧ B] / Pr[B]
• The 1R Classifier (oneR in Weka) picks the single best attribute for this
  – Discretizes numeric attributes to make them categoric
• How to incorporate information from multiple attributes?
  – Naïve Bayes: use statistical reasoning & independence assumptions
Bayes' theorem

• Pr[A|B] Pr[B] = Pr[A ∧ B] = Pr[B ∧ A] = Pr[B|A] Pr[A]
  – Hence, Pr[B|A] = Pr[A|B] Pr[B] / Pr[A]
• Will use Bayes' theorem to compute the probability of each class, given the evidence (the features of the example)
  – Assume examples X have d attributes, X1 … Xd
• Want Pr[Class = C | (X1=x1 ∧ X2=x2 ∧ … ∧ Xd=xd)]   (+)
  – Want to pick the C that maximizes this probability
  – Use Bayes' theorem: Pr[Class|X] = Pr[Class] Pr[X|Class] / Pr[X]
    (+) = Pr[Class=C] Pr[X1=x1 ∧ … ∧ Xd=xd | Class=C] / Pr[X1=x1 ∧ … ∧ Xd=xd]
  – Can drop the term Pr[X1=x1 ∧ … ∧ Xd=xd] as it is independent of C
  – Need to simplify Pr[X1=x1 ∧ … ∧ Xd=xd | Class=C]
Derivation of Naïve Bayes Classifier

• Use the chain rule of probability: Pr[X ∧ Y | Z] = Pr[X | Z] Pr[Y | X ∧ Z]
  – LHS: Pr[X ∧ Y ∧ Z] / Pr[Z]
  – RHS: Pr[X ∧ Z] Pr[Y ∧ X ∧ Z] / (Pr[Z] Pr[X ∧ Z])
• Pr[X1=x1 ∧ … ∧ Xd=xd | Class=C]
    = Pr[X1=x1 | Class=C] Pr[X2=x2 ∧ … ∧ Xd=xd | Class=C ∧ X1=x1]
    = Pr[X1=x1 | Class=C] Pr[X2=x2 | Class=C ∧ X1=x1] × Pr[X3=x3 ∧ … ∧ Xd=xd | Class=C ∧ X1=x1 ∧ X2=x2]
    = ∏j=1..d Pr[Xj=xj | Class=C ∧ (∧i=1..j-1 Xi=xi)]
• Make an independence assumption among the variables:
  – Pr[Xj=xj | Class=C ∧ X1=x1 ∧ X2=x2 ∧ …] = Pr[Xj=xj | Class=C]
Naïve Bayes Classifier

• The Naïve Bayes classifier makes the (possibly naïve) assumption that each attribute is conditionally independent of the others
  – Gives Pr[X1=x1 ∧ … ∧ Xd=xd | Class=C] = ∏j=1..d Pr[Xj=xj | Class=C]
• Using (+), find the class C that maximizes Pr[Class=C] × ∏j=1..d Pr[Xj=xj | Class=C]
• Can compute the probabilities Pr[Xj=xj | Class=C] directly from the data
  – Categorical data: count the number of cases with Xj=xj and Class=C
  – Numeric data: fit a distribution (conditioned on the class value)
• Laplacian correction for unseen cases: add 1 to each count
  – Avoids predicting probability zero for an unseen case
  – Does not significantly alter the other probabilities
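A minimal sketch of a categorical naïve Bayes classifier in plain Python, using counts plus the Laplacian correction (an illustrative implementation, not Weka's):

```python
from collections import Counter, defaultdict

def train_nb(examples, labels):
    """Gather the counts needed for categorical naive Bayes.
    `examples` is a list of attribute-value tuples, `labels` the class values."""
    class_counts = Counter(labels)
    value_counts = Counter()            # (attribute index, value, class) -> count
    values_per_attr = defaultdict(set)  # attribute index -> set of seen values
    for x, c in zip(examples, labels):
        for j, v in enumerate(x):
            value_counts[(j, v, c)] += 1
            values_per_attr[j].add(v)
    return class_counts, value_counts, values_per_attr

def predict_nb(model, x):
    """Return the class C maximizing Pr[Class=C] * prod_j Pr[Xj=xj | Class=C]."""
    class_counts, value_counts, values_per_attr = model
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, class_count in class_counts.items():
        score = class_count / n
        for j, v in enumerate(x):
            # Laplacian correction: add 1 to the count, and the number of
            # possible attribute values to the denominator
            score *= (value_counts[(j, v, c)] + 1) / (class_count + len(values_per_attr[j]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```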
Naïve Bayes Classifier: Training Dataset
Class:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Example to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age    income  student  credit_rating  class
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Slide adapted from Han, Kamber, Pei
Naïve Bayes Classifier: An Example
To classify X = (age <= 30, income = medium, student = yes, credit_rating = fair)

• Compute the global class distribution
  – Pr[buys_computer = "yes"] = 9/14 = 0.643
  – Pr[buys_computer = "no"] = 5/14 = 0.357
• Compute Pr[X=x | Class=C] for each class C
  – Pr[age = "<=30" | buys_computer = "yes"] = 2/9
  – Pr[age = "<=30" | buys_computer = "no"] = 3/5
  – Pr[income = "medium" | buys_computer = "yes"] = 4/9
  – Pr[income = "medium" | buys_computer = "no"] = 2/5
  – Pr[student = "yes" | buys_computer = "yes"] = 6/9
  – Pr[student = "yes" | buys_computer = "no"] = 1/5
  – Pr[credit_rating = "fair" | buys_computer = "yes"] = 6/9
  – Pr[credit_rating = "fair" | buys_computer = "no"] = 2/5

Score for "Yes" for X: (9/14) × (2/9) × (4/9) × (6/9) × (6/9) = 0.028
Score for "No" for X: (5/14) × (3/5) × (2/5) × (1/5) × (2/5) = 0.007
• Naïve Bayes predicts Class "Yes"

Slide adapted from Han, Kamber, Pei
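For comparison, this hand computation can be reproduced with the naïve Bayes sketch given a few slides back (hypothetical code; the sketch applies the Laplacian correction, so its scores differ slightly from the uncorrected products above, but the predicted class for X is the same):

```python
rows = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
examples = [r[:4] for r in rows]
labels = [r[4] for r in rows]
model = train_nb(examples, labels)
print(predict_nb(model, ("<=30", "medium", "yes", "fair")))   # -> "yes"
```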
Naïve Bayes in Weka
Weka fits a Gaussian distribution for each class value on numeric data, and applies the Laplacian correction to categorical attributes:

Attribute              >50K      <=50K
                      (0.24)     (0.76)
===============================================
age
  mean                44.2752    36.8722
  std. dev.           10.5585    14.1039
  weight sum            11687      37155
  precision                 1          1

workclass
  Private              7388.0    26520.0
  Self-emp-not-inc     1078.0     2786.0
  Self-emp-inc          939.0      758.0
  Federal-gov           562.0      872.0
  Local-gov             928.0     2210.0
  State-gov             531.0     1452.0
  Without-pay             3.0       20.0
  Never-worked            1.0       11.0
  [total]             11430.0    34629.0
…

Time taken to build model: 0.51 seconds
Naïve Bayes in Weka
Scheme:       weka.classifiers.bayes.NaiveBayes
Relation:     adult-weka.filters.unsupervised.attribute.Remove-R3,11-12
Instances:    48842
Test mode:    split 66.0% train, remainder test

=== Evaluation on test split ===
Correctly Classified Instances       13519               81.4103 %
Incorrectly Classified Instances      3087               18.5897 %
Kappa statistic                          0.5295
Mean absolute error                      0.2043
Root mean squared error                  0.3681
Relative absolute error                 56.3263 %
Root relative squared error             86.9948 %
Total Number of Instances            16606

=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.751     0.167     0.579       0.751    0.654       0.885      >50K
               0.833     0.249     0.917       0.833    0.873       0.885      <=50K
Weighted Avg.  0.814     0.23      0.838       0.814    0.822       0.885

=== Confusion Matrix ===
     a     b   <-- classified as
  2913   965 |     a = >50K
  2122 10606 |     b = <=50K
Linearly separable data

• Models seen so far have tried to partition up the data space
  – Decision trees: partition into (hyper)rectangles
• Sometimes the data is linearly separable
  – Examples of one class fall on one side of a line / hyperplane
  – The other class is on the other side of the plane
• Linear classifiers aim to find the 'best' plane
  – Allows a simple description of the model
     · Defined by the separating hyperplane
  – Early examples: winnow algorithm, perceptron (used in neural networks)

[Figure: two classes of points, marked x and o, separated by a straight line]
Support Vector Machines

• The model of "support vector machines" (SVM) seeks the best plane
  – When data is linearly separable, many hyperplanes can separate it
  – SVM seeks the plane that maximizes the margin between the classes

[Figure: two candidate separating hyperplanes, one with a small margin and one with a large margin; the points lying on the margin are the support vectors]
SVM as an optimization problem

• In vector notation, a hyperplane is written as W · X - b = 0
  – W: (d-dimensional) vector orthogonal to the plane
  – X: point in (d-dimensional) space
  – b: scalar offset
• Any X "above" the hyperplane has (W · X – b) > 0
  – (W · X – b) < 0: X is "below" the hyperplane
• Choose the scale of the vector W so that
  – (W · X – b) ≥ 1 for points in class "+1"
  – (W · X – b) ≤ -1 for points in class "-1"
• Then seek weights W so that for all examples (y, x), y(W · x - b) ≥ 1
SVM as an optimization problem

• Want to maximize the margin
  – The distance of a point x from the plane is (W · x – b)/‖W‖2
  – For the support vectors, (W · x – b) = ±1
  – Hence, the margin is 2/‖W‖2 (Euclidean norm of W)
• Becomes an optimization problem:
  – Maximize 2/‖W‖2
     · Equivalent to: minimize ‖W‖2²
  – Subject to: yi(W · xi – b) ≥ 1 for all i
  – A constrained convex quadratic optimization problem
• The solution of this optimization identifies the support vectors
  – Those points that achieve equality, yi(W · xi – b) = 1
Applying SVM

• Given W and b, the SVM is very fast to apply to new examples
  – If (W · X – b) > 0, output "+1", else output "-1"
• However, it can be very slow to train
  – Not fast to solve the convex optimization with many constraints
• Accuracy is generally good, and the SVM avoids overfitting
  – The margin is determined by only the few support vectors
  – Adding/removing points away from the support vectors doesn't affect the SVM
• SVM has a few limitations
  – Defined only for two classes: need e.g. one-vs-all for more classes
  – Assumes data is linearly separable – not true in general
Kernel Functions

• Try separating the data by "lifting" it to a higher dimensional space
  – Points not separated by a line may be separated by a curve
• We want to learn a separating curve in the new space
  – E.g. add products of pairs of dimensions: (x1²), (x1·x2), (x2·x3)…
• Many kernel functions have been proposed
  – Degree k polynomial kernel: K(Xi, Xj) = (Xi · Xj + 1)^k
  – Gaussian radial basis kernel: K(Xi, Xj) = exp(-‖Xi – Xj‖2²/2s²)
  – Sigmoid kernel: K(Xi, Xj) = tanh(k Xi · Xj – c)

[Figure: data remapped using the radial basis kernel]
More details on kernels in CS909 Data Mining
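The three kernels are simple functions of the dot product or the distance between the two points; a direct transcription in Python (illustrative, with hypothetical default parameter values):

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, k=2):
    """Degree-k polynomial kernel: (x . y + 1)^k."""
    return (dot(x, y) + 1) ** k

def rbf_kernel(x, y, s=1.0):
    """Gaussian radial basis kernel: exp(-||x - y||^2 / (2 s^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * s ** 2))

def sigmoid_kernel(x, y, k=1.0, c=0.0):
    """Sigmoid kernel: tanh(k (x . y) - c)."""
    return math.tanh(k * dot(x, y) - c)

print(polynomial_kernel((1, 2), (3, 4)))        # (11 + 1)^2 = 144
print(round(rbf_kernel((1, 2), (3, 4)), 4))     # exp(-8/2) ≈ 0.0183
```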
Soft Margin Classifiers

• Some data is not conveniently separated, even with kernels
  – So relax the requirement of a complete separation of the data
• Allow some "slack" for each data point (as a new variable zi)
  – The constraints with slack are now yi(W · xi – b) ≥ 1 - zi
  – Slack allows the point to fall on the wrong side of the hyperplane
  – Then minimize the weighted combination (‖W‖2² + C Σi zi)
• Allows SVM to generalize to more cases
  – Has been highly popular in practice: Kanellakis award for "one of the most frequently used algorithms in ML … used in medical diagnosis, weather forecasting, and intrusion detection"
SVM evaluated on adult.data in Weka
Scheme:       weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K
              "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Relation:     adult-weka.filters.unsupervised.attribute.Remove-R3,11-12
Instances:    48842
Attributes:   12
Test mode:    split 66.0% train, remainder test

=== Classifier model (full training set) ===
Kernel used: Linear Kernel: K(x,y) = <x,y>
Classifier for classes: >50K, <=50K
Time taken to build model: 3472.87 seconds

=== Evaluation on test split ===
Correctly Classified Instances       13912               83.7769 %
Incorrectly Classified Instances      2694               16.2231 %
Total Number of Instances            16606

=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.556     0.076     0.689       0.556    0.615       0.74       >50K
               0.924     0.444     0.872       0.924    0.897       0.74       <=50K
Weighted Avg.  0.838     0.358     0.829       0.838    0.831       0.74

=== Confusion Matrix ===
     a     b   <-- classified as
  2155  1723 |     a = >50K
   971 11757 |     b = <=50K
Comparison of classifiers on adult.data
Name                  Accuracy   AUC     FP Rate   Time to build   Kappa
Majority class        0.766      0.5     0.766     0.1s            0
K Nearest Neighbour   0.834      0.874   0.299     0.1s            0.5359
Decision tree (J48)   0.839      0.865   0.326     8.4s            0.5337
Naïve Bayes           0.814      0.885   0.23      0.5s            0.5295
SVM                   0.837      0.74    0.358     3472s           0.5141
Logistic Regression   0.837      0.888   0.35      31s             0.5175

• Tradeoffs between accuracy, FP rate, AUC, time to build, time to apply, and interpretability of the resulting classifier
• Different data sets (or different objectives) will show different behaviour
Summary of Classifiers

• Many different ways to construct classifiers to predict values
• General framework: build on training data, evaluate on test data
• Decision tree classifiers constructed by recursively splitting data
• Naive Bayes makes independence assumption between variables
• SVM classifiers find linear separation – possibly via kernels
• Can construct and compare classifiers using Weka
• Background reading:
  – Chapter 8 (Classification: Basic Concepts), 9.3 (Support Vector Machines) and 9.5 (Lazy Learners) in "Data Mining: Concepts and Techniques" (Han, Kamber, Pei).