Data Mining Classification Algorithms

Credits: Padhraic Smyth
Outline
• What is supervised classification
• Classification algorithms
– Decision trees
– k-NN
– Linear discriminant analysis
– Naïve Bayes
– Logistic regression
– SVM
• Model evaluation and model selection
– Performance measures (accuracy, ROC, precision, recall…)
– Validation techniques (k-fold CV, LOO)
• Ensemble learning
– Random Forests
Notation
• Variables X, Y with values x, y (lower case)
• Vectors indicated by X
• Y is called the target variable
• Components of X indicated by Xj with values xj
• “Matrix” data set D with n rows and p columns
– jth column contains values for variable Xj
– ith row contains a vector of measurements on object i, indicated by
x(i)
– The jth measurement value for the ith object is xj(i)
• Unknown parameter for a model = θ
– Vector of parameters = θ
Decision Region Terminology
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), showing Decision Region 1, Decision Region 2, and the Decision Boundary separating them]
Classes of classifiers
• Discriminative models: focus on locating optimal decision boundaries
– Decision trees: “swiss army knife”, often effective in high dimensions
– Nearest neighbor: simple, can scale poorly in high dimensions
– Linear discriminants: simple, sometimes effective
– Support vector machines: generalization of linear discriminants, can be quite effective, computational complexity is an issue
• Class-conditional/probabilistic: based on p( x | ck )
– Naïve Bayes: simple, but often effective in high dimensions
• Regression-based: model p( ck | x ) directly
– Logistic regression: simple, linear in “odds” space
– Neural networks: non-linear extension of logistic regression, can be difficult to work with
Decision Tree Classifiers
• Widely used in practice
– Can handle both real-valued and nominal inputs (unusual)
– Good with high-dimensional data
• Historically, developed both in statistics and computer science
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of (sub)tree
Learning decision trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table:
• Classification of examples is positive (T) or negative (F)
• The set of examples used for learning is called the training set.
Decision trees
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to
wait:
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are
(ideally) "all positive" or "all negative"
• Patrons? is a better choice
Example contd.
• Decision tree learned from the 12 examples:
• Substantially simpler than the “true” tree: a more complex hypothesis isn’t justified by a small amount of data
Using information theory
• To implement Choose-Attribute in the DTL
algorithm
• Information Content (Entropy):
I(P(v1), … , P(vn)) = Σi=1..n −P(vi) log2 P(vi)
• For a training set containing p positive examples and n negative examples:
I( p/(p+n), n/(p+n) ) = − p/(p+n) · log2( p/(p+n) ) − n/(p+n) · log2( n/(p+n) )
Information gain
• A chosen attribute A divides the training set E into subsets E1,
… , Ev according to their values for A, where A has v distinct
values.
remainder(A) = Σi=1..v (pi + ni)/(p + n) · I( pi/(pi+ni), ni/(pi+ni) )
• Information Gain (IG) or reduction in entropy from the attribute test:
IG(A) = I( p/(p+n), n/(p+n) ) − remainder(A)
• Choose the attribute with the largest IG (a small computational sketch follows below)
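As a concrete illustration, here is a minimal Python sketch of these formulas (not from the original slides; the function names, the dict-per-example data layout, and the Boolean "WillWait" target name are illustrative assumptions):

```python
import math

def entropy(probs):
    """I(P(v1), ..., P(vn)) = sum over i of -P(vi) * log2 P(vi)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(examples, attribute, target="WillWait"):
    """IG(A) = I(p/(p+n), n/(p+n)) - remainder(A) for a list of dict-like examples."""
    def class_entropy(rows):
        pos = sum(1 for r in rows if r[target])
        n = len(rows)
        return entropy([pos / n, (n - pos) / n]) if n else 0.0

    total = len(examples)
    base = class_entropy(examples)
    # remainder(A): weighted entropy of each subset E_i induced by a value of A
    values = set(r[attribute] for r in examples)
    remainder = sum(
        len(subset) / total * class_entropy(subset)
        for v in values
        for subset in [[r for r in examples if r[attribute] == v]]
    )
    return base - remainder
```

On the 12 restaurant examples this should reproduce the values worked out on the next slide: roughly 0.541 bits for Patrons and 0 bits for Type.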
Information gain
For the training set, p = n = 6, I(6/12, 6/12) = 1 bit
Consider the attributes Patrons and Type (and others too):
2
4
6 2 4
IG ( Patrons)  1  [ I (0,1)  I (1,0)  I ( , )]  .541 bits
12
12
12 6 6
2 1 1
2 1 1
4 2 2
4 2 2
IG (Type)  1  [ I ( , )  I ( , )  I ( , )  I ( , )]  0 bits
12 2 2 12 2 2 12 4 4 12 4 4
Patrons has the highest IG of all attributes and so is chosen by the DTL
algorithm as the root
Next step
Given Patrons as root node, the next attribute chosen is Hungry?, with IG(Hungry?) = I(1/3, 2/3) − (2/3·1 + 1/3·0) = 0.252
Final decision tree induced by the 12-example training set:
Broadening the applicability of Decision Trees
• Continuous or integer-valued attributes
• Missing data
• Continuous output attributes (regression!!!)
Splitting on a nominal attribute
• Nominal attribute with m values
– e.g., the name of a state or a city in marketing data
• 2^(m−1) possible subsets => exhaustive search is O(2^(m−1))
– For small m, a simple approach is to branch on specific values
– But for large m this may not work well
• Neat trick for the 2-class problem (a sketch follows below):
– For each predictor value calculate the proportion of class 1’s
– Order the m values according to these proportions
– Now treat as an ordinal variable and select the best split (linear in m)
– This gives the optimal split for the Gini index, among all possible 2^(m−1) splits
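A sketch of this ordering trick, assuming two classes encoded as 0/1 labels and a list of nominal values (the function name and the exact Gini bookkeeping are illustrative, not from the slides):

```python
from collections import defaultdict

def best_binary_split_gini(values, labels):
    """Order the m nominal values by their proportion of class 1, then scan the
    m-1 'ordinal' split points and return the one with the lowest weighted Gini
    impurity. labels are assumed to be 0/1."""
    counts = defaultdict(lambda: [0, 0])           # value -> [n_class0, n_class1]
    for v, y in zip(values, labels):
        counts[v][y] += 1

    def gini(c0, c1):
        n = c0 + c1
        if n == 0:
            return 0.0
        p = c1 / n
        return 2 * p * (1 - p)                     # two-class Gini impurity

    ordered = sorted(counts, key=lambda v: counts[v][1] / sum(counts[v]))
    total = len(labels)
    left = [0, 0]
    right = [sum(c[0] for c in counts.values()), sum(c[1] for c in counts.values())]
    best = None
    for i, v in enumerate(ordered[:-1]):           # m-1 candidate splits
        left[0] += counts[v][0]
        left[1] += counts[v][1]
        right[0] -= counts[v][0]
        right[1] -= counts[v][1]
        score = (sum(left) * gini(*left) + sum(right) * gini(*right)) / total
        if best is None or score < best[0]:
            best = (score, set(ordered[:i + 1]))   # (weighted Gini, left subset of values)
    return best
```

For example, `best_binary_split_gini(["CA", "NY", "TX", "NY"], [1, 0, 1, 0])` finds the zero-impurity split that isolates "NY" from {"CA", "TX"}.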
Splitting on integer or continuous attributes
• A numeric attribute A is tested for two outcomes:
A ≤ split_point and A > split_point
• split_point is a value returned by the choose-attribute function (a sketch of the threshold search follows below).
Example: Income? → Income ≤ 42,000 vs. Income > 42,000
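A sketch of the threshold search for a numeric attribute, scoring candidate split points by information gain on a 0/1 target (the names and the choice of score are illustrative; Gini could be used instead):

```python
import math

def entropy2(pos, n):
    """Binary entropy of a node with pos positives out of n examples."""
    if n == 0 or pos in (0, n):
        return 0.0
    p = pos / n
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split_point(values, labels):
    """Scan thresholds midway between consecutive distinct sorted values and
    return (best_threshold, information_gain) for the test A <= threshold."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(y for _, y in pairs)
    base = entropy2(total_pos, n)
    best = (None, -1.0)
    left_pos = 0
    for i in range(n - 1):
        left_pos += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue                               # no threshold between equal values
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        n_left = i + 1
        right_pos = total_pos - left_pos
        gain = (base
                - (n_left / n) * entropy2(left_pos, n_left)
                - ((n - n_left) / n) * entropy2(right_pos, n - n_left))
        if gain > best[1]:
            best = (threshold, gain)
    return best
```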
Treating Missing Data in Trees
• Missing values are common in practice
• Approaches to handling missing values
– During training
• Ignore rows with missing values (inefficient)
– During testing
• Send the example being classified down both branches and average the predictions
– Replace missing values with an “imputed value” (can be suboptimal)
• Other approaches
– Treat “missing” as a unique value (useful if missing values are correlated with the class)
(a short imputation sketch follows below)
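To make the imputation and “missing as its own value” options concrete, a hedged scikit-learn sketch (the tiny array and the median strategy are illustrative; add_indicator appends missingness flags as extra columns):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# Option 1: replace missing values with an imputed value (can be suboptimal)
clf_imputed = make_pipeline(SimpleImputer(strategy="median"),
                            DecisionTreeClassifier())
clf_imputed.fit(X, y)

# Option 2: keep the fact that a value was missing, by adding indicator columns;
# useful when missingness is correlated with the class
clf_flagged = make_pipeline(SimpleImputer(strategy="median", add_indicator=True),
                            DecisionTreeClassifier())
clf_flagged.fit(X, y)
```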
How to Choose the Right-Sized Tree?
[Figure: predictive error vs. size of decision tree, showing error on training data, error on test data, and the ideal range for tree size]
Choosing a Good Tree for Prediction
• General idea
– grow a large tree
– prune it back to create a family of subtrees
• “weakest link” pruning
– score the subtrees and pick the best one
• Massive data sizes (e.g., n ~ 100k data points)
– use training data set to fit a set of trees
– use a validation data set to score the subtrees
• Smaller data sizes (e.g., n ~ 1k or less)
– use cross-validation (see the pruning sketch below)
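scikit-learn’s minimal cost-complexity (“weakest link”) pruning follows this grow-then-prune recipe; a minimal sketch that scores the family of subtrees by cross-validation (the dataset and cv=5 are illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. Grow a large tree and get the "weakest link" pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# 2. Each alpha corresponds to one subtree in the pruned family;
#    score each candidate subtree with cross-validation
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]

# 3. Pick the best-scoring subtree and refit it on all the data
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
best_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```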
Example: Spam Email Classification
• Data Set: (from the UCI Machine Learning Archive)
– 4601 email messages from 1999
– Manually labeled as spam (60%), non-spam (40%)
– 54 features: percentage of words matching a specific word/character
• Business, address, internet, free, george, !, $, etc
– Average/longest/sum lengths of uninterrupted sequences of CAPS
• Error Rates
– Training: 3056 emails, Testing: 1536 emails
– Decision tree: error = 8.7%
– Logistic regression: error = 7.6%
– Naïve Bayes: error = 10% (typically)
Why Trees are widely used in Practice
• Can handle high dimensional data
– builds a model using one dimension at a time
• Can handle any type of input variables
– categorical, real-valued, etc
– most other methods require data of a single type (e.g.,
only real-valued)
• Trees are (somewhat) interpretable
– domain expert can “read off” the tree’s logic
• Tree algorithms are relatively easy to code and test
Limitations of Classification Trees
• Representational Bias
– piecewise linear boundaries, parallel to axes
• High Variance
– trees can be “unstable” as a function of the sample
• e.g., small change in the data -> completely different tree
– causes two problems
• 1. High variance contributes to prediction error
• 2. High variance reduces interpretability
– Trees are good candidates for model combining
• Often used with boosting and bagging
• Trees do not scale well to massive data sets (e.g., N in
millions)
– repeated random access of subsets of the data
Nearest Neighbor Classifiers
• kNN: select the k nearest neighbors to x from the training data
and select the majority class from these neighbors
• Training method:
– Save the training examples
• At prediction time:
– Use a distance measure (for example Euclidean) and calculate the
distance of a test example x to all training examples
– Find the k training examples (x1,y1),…(xk,yk) that are closest to the test
example x
– Predict the most frequent class among those yi’s (a minimal sketch follows below).
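A minimal numpy sketch of this procedure, using Euclidean distance and an unweighted majority vote (the toy data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test vector by majority vote among its k nearest
    training examples under Euclidean distance."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every training example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Example usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 1.2]])
y_train = np.array(["A", "B", "A", "B"])
print(knn_predict(X_train, y_train, np.array([0.1, 0.2]), k=3))  # -> "A"
```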
Nearest Neighbor Classifiers
• k is a parameter:
– Small k: “noisier” estimates, Large k: “smoother” estimates
– Best value of k often chosen by cross-validation
– Important case k = 1 (1NN)
• Comments
– Virtually assumption free
– Interesting theoretical properties: e.g., asymptotically the 1NN error rate is at most twice the Bayes (optimal) error rate
• Disadvantages
– Can scale poorly with dimensionality: sensitive to distance metric
– Requires fast lookup at run-time to do classification with large n
– Does not provide any interpretable “model”
• Improvements:
– Weighting examples from the neighborhood
– Finding “close” examples in a large training set quickly
– Assign a score for class prediction as the fraction of the predicted class among the k nearest examples (‘probabilistic version’)
[Figure: the effect of the value of k on class boundaries]
Linear Discriminant Classifiers
• Linear Discriminant Analysis (LDA)
– Earliest known classifier (1936, R.A. Fisher)
– Find a projection onto a vector such that the means for each class (2 classes) are separated as much as possible (with variances taken into account appropriately)
– Reduces to a special case of a parametric Gaussian classifier in certain situations
– Many subsequent variations on this basic theme (e.g., regularized LDA)
• Other linear discriminants
– Decision boundary = (p−1)-dimensional hyperplane in p dimensions
– Perceptron learning algorithms (pre-dated neural networks)
• Simple “error correction” based learning algorithms
– SVMs: use a sophisticated “margin” idea for selecting the hyperplane
Fisher Linear Discriminant
• For each projection vector w compute for
each example w·xi
• Find w that minimizes within class variance
and maximizes between class variance
Fisher Linear Discriminant
• Solution: w ∝ (Σ1 + Σ2)^(−1) (μ1 − μ2)
where μ1 and μ2 are the means of the class 1 and class 2 vectors, and Σ1 and Σ2 are the covariances of class 1 and class 2 respectively.
• At test time just use the simple decision rule:
if w·xi > threshold then class 1, else class 2
The threshold may be chosen such that the accuracy on the training set is maximal (a numerical sketch follows below).
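A numerical sketch of the solution and threshold rule above, on illustrative two-Gaussian toy data (the midpoint threshold is a simpler stand-in for maximizing training accuracy):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))   # class 1 samples
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(100, 2))   # class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False)
S2 = np.cov(X2, rowvar=False)

# Fisher direction: w proportional to (Sigma1 + Sigma2)^(-1) (mu1 - mu2)
w = np.linalg.solve(S1 + S2, mu1 - mu2)

# Simple threshold choice: midpoint of the projected class means
# (the slide's suggestion is to pick the threshold maximizing training accuracy)
threshold = 0.5 * (w @ mu1 + w @ mu2)

def classify(x):
    return 1 if w @ x > threshold else 2

print(classify(np.array([-0.5, 0.2])), classify(np.array([3.2, 1.8])))  # -> 1 2
```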
Support Vector Machines
(will be discussed again later)
• Support vector machines
– Use a different loss function, the “margin”
• Results in convex optimization problem, solvable by
quadratic programming
– Decision boundary represented by examples in
training data
– Linear version: clever placement of the hyperplane
– Non-linear version: “kernel trick” for high-dimensional
problems
– Computational complexity can be O(N^3) without speedups
Naïve Bayes Classifiers
• Bayes theorem:
P(A | B) = P(B | A) P(A) / P(B)
• Theorem of total probability: if the events Ai are mutually exclusive and their probabilities sum to 1, then
P(B) = Σi=1..n P(B | Ai) P(Ai)
Naïve Bayes Classifiers
• Probabilistic model based on Bayes rule:
P(C | X) = P(X | C) P(C) / P(X)
• with a conditional independence assumption on p( x | ck ), i.e.
P(X | C) = P(x1, x2, …, xn | C) = Πi=1..n P(xi | C)
• Decision rule:
cMAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
     = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
     = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)
Parameters estimation
• P(cj)
– Can be estimated from the frequency of classes in the training
examples.
• P(x1,x2,…,xn|cj)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very, very large number of training examples was available.
• Independence Assumption: attribute values are conditionally independent given the target value: naïve Bayes.
P(x1, x2, …, xn | cj) = Πi P(xi | cj)
cNB = argmax_{cj ∈ C} P(cj) Πi P(xi | cj)
(a counting-based estimation sketch follows below)
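The estimation step is just counting; a minimal Python sketch that builds the frequency tables and applies the argmax rule (no smoothing; the dict-per-example layout and function names are illustrative assumptions):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, target):
    """examples: list of dicts. Returns class priors P(c), conditional count
    tables for P(xi | c), and raw class counts, all estimated by counting."""
    class_counts = Counter(ex[target] for ex in examples)
    cond = defaultdict(Counter)                     # (attribute, class) -> value counts
    for ex in examples:
        c = ex[target]
        for attr, val in ex.items():
            if attr != target:
                cond[(attr, c)][val] += 1
    n = len(examples)
    priors = {c: k / n for c, k in class_counts.items()}
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts):
    """argmax over classes of P(c) * prod_i P(xi | c), using the counted tables."""
    best_c, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, val in x.items():
            score *= cond[(attr, c)][val] / class_counts[c]   # MLE estimate of P(xi | c)
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Training this on the Play Tennis table on the next slide (with target="PlayTennis") should reproduce the worked example that follows.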
Properties
• Estimating P(xi | cj) instead of P(x1, x2, …, xn | cj) greatly reduces the number of parameters (and the data sparseness).
• The learning step in Naïve Bayes consists of estimating P(cj) and P(xi | cj) based on the frequencies in the training data
• An unseen instance is classified by computing the class that maximizes the posterior
• When conditional independence is satisfied, Naïve Bayes corresponds to MAP classification.
Example. ‘Play Tennis’ data
Day    Outlook    Temperature  Humidity  Wind    PlayTennis
Day1   Sunny      Hot          High      Weak    No
Day2   Sunny      Hot          High      Strong  No
Day3   Overcast   Hot          High      Weak    Yes
Day4   Rain       Mild         High      Weak    Yes
Day5   Rain       Cool         Normal    Weak    Yes
Day6   Rain       Cool         Normal    Strong  No
Day7   Overcast   Cool         Normal    Strong  Yes
Day8   Sunny      Mild         High      Weak    No
Day9   Sunny      Cool         Normal    Weak    Yes
Day10  Rain       Mild         Normal    Weak    Yes
Day11  Sunny      Mild         Normal    Strong  Yes
Day12  Overcast   Mild         High      Strong  Yes
Day13  Overcast   Hot          Normal    Weak    Yes
Day14  Rain       Mild         High      Strong  No
Question: For the day <sunny, cool, high, strong>, what’s the
play prediction?
Based on the examples in the table, classify the following case x:
x=(Outlook=Sunny, Temp=Cool, Humidity=High, Wind=strong)
• That means: c = play tennis or not?
cNB = argmax_{c ∈ {yes, no}} P(c) P(x | c) = argmax_{c ∈ {yes, no}} P(c) Πt P(at | c)
    = argmax_{c ∈ {yes, no}} P(c) · P(Outlook = sunny | c) · P(Temp = cool | c) · P(Humidity = high | c) · P(Wind = strong | c)
• Working:
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.
P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206
⇒ answer: PlayTennis(x) = no   (a quick numerical check follows below)
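The two products above can be checked with a few lines of Python (the counts come straight from the table: 9 “yes” days and 5 “no” days):

```python
# P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes)
p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)
# P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no)
p_no = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)

print(round(p_yes, 4), round(p_no, 4))                    # 0.0053 0.0206
print("PlayTennis =", "yes" if p_yes > p_no else "no")    # PlayTennis = no
```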
Underflow Prevention
• Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying
probabilities.
• Class with highest final un-normalized log
probability score is still the most probable.
cNB = argmax_{cj ∈ C} ( log P(cj) + Σ_{i ∈ positions} log P(xi | cj) )
(a log-space sketch follows below)
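In code, the log-space rule looks like this (a small sketch; `priors` and `cond_prob` are assumed to be the estimated probability tables from earlier, with non-zero entries, e.g. after smoothing):

```python
import math

def predict_log(x, priors, cond_prob):
    """argmax over classes of log P(c) + sum_i log P(xi | c);
    avoids floating-point underflow when there are many attributes."""
    best_c, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for attr, val in x.items():
            # assumes cond_prob[(attr, c)][val] > 0 (e.g. estimates with smoothing)
            score += math.log(cond_prob[(attr, c)][val])
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```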
Properties of Naïve Bayes Classifier
• Pros
– Incrementality: with each training example, the prior
and the likelihood can be updated dynamically
– Robustness: flexible and robust to errors.
– Probabilistic hypothesis: outputs not only a
classification, but a probability distribution over all
classes
• Cons
– Performance: the naïve Bayes assumption is, in many cases, not realistic, which reduces performance
– Probability estimates: the estimated class probabilities are often unrealistic (due to the naïve Bayes assumption)
Evaluating Classification Results (in general)
• Summary statistics:
– empirical estimate of the score function on test data, e.g., error rate
• More detailed breakdown
– E.g., “confusion matrices”
– Can be quite useful in detecting systematic errors
• Detection vs. false-alarm plots (2 classes)
– Binary classifier with real-valued output for each example, where higher means more likely to be class 1
– For each possible threshold, calculate
• Detection rate = fraction of class 1 detected
• False alarm rate = fraction of class 2 detected
– Plot y (detection rate) versus x (false alarm rate)
– Also known as the ROC curve; closely related plots: precision-recall, specificity/sensitivity (a sketch follows below)
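A minimal sketch of computing the detection/false-alarm pairs over all thresholds (the scores and labels are illustrative; class 2 of the slides is encoded as label 0 here):

```python
import numpy as np

def detection_false_alarm_curve(scores, labels):
    """For every possible threshold return (false-alarm rate, detection rate):
    detection = fraction of class 1 with score above the threshold,
    false alarm = fraction of class 0 (class 2 in the slides) above it."""
    order = np.argsort(-np.asarray(scores))   # sort examples by decreasing score
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tp = np.cumsum(labels)                    # class-1 examples captured so far
    fp = np.cumsum(1 - labels)                # class-0 examples captured so far
    return fp / n_neg, tp / n_pos             # x = false-alarm rate, y = detection rate

scores = [0.9, 0.8, 0.55, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
far, dr = detection_false_alarm_curve(scores, labels)
print(list(zip(far, dr)))                     # points on the ROC curve
```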
Bagging for Combining Classifiers
• Training data set of size N
• Generate B “bootstrap” sampled data sets of size N
– Bootstrap sample = sample with replacement
– e.g., B = 100
• Build B models (e.g., trees), one for each bootstrap sample
– Intuition is that the bootstrapping “perturbs” the data enough to make the models more resistant to true variability
• For prediction, combine the predictions from the B models
– E.g., for classification, p(c | x) = fraction of the B models that predict c
– Plus: generally improves accuracy on models such as trees
– Negative: lose interpretability
(a minimal bagging sketch follows below)
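A minimal bagging sketch along these lines, with B bootstrap samples and unpruned trees (the dataset and B = 100 are illustrative; scikit-learn’s BaggingClassifier packages the same idea):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
B, N = 100, len(X)

# Build B trees, one per bootstrap sample (sample N rows with replacement)
models = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Combine: p(c = 1 | x) = fraction of the B models that predict class 1
votes = np.mean([m.predict(X) for m in models], axis=0)
bagged_pred = (votes >= 0.5).astype(int)
print("training accuracy of bagged trees:", (bagged_pred == y).mean())
```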
[Figure: bagged decision boundaries; green = majority vote, purple = averaging the probabilities (from Hastie, Tibshirani, and Friedman, 2001)]
Illustration of Boosting:
Color of points = class label; diameter of points = weight at each iteration
Dashed line: single-stage classifier. Green line: combined, boosted classifier
Dotted blue in the last two panels: bagging
(from G. Rätsch, PhD thesis, 2001)
Summary on Classifiers
• Simple models (but can be effective)
– Logistic regression
– Naïve Bayes
– K nearest-neighbors
• Decision trees
– Good for high-dimensional problems with different data types
• State of the art:
– Support vector machines
– Boosted trees
• Many tradeoffs in interpretability, score functions, etc
Decision Tree Classifiers
Task: Classification
Representation: Decision boundaries = hierarchy of axis-parallel region boundaries
Score Function: Cross-validated error
Search/Optimization: Greedy search in tree space
Data Management: None specified
Models, Parameters: Tree
Naïve Bayes Classifier
Task: Classification
Representation: Conditional independence probability model
Score Function: Likelihood
Search/Optimization: Closed-form probability estimates
Data Management: None specified
Models, Parameters: Conditional probability tables
Logistic Regression
Task: Classification
Representation: Log-odds(C) = linear function of X’s
Score Function: Log-likelihood
Search/Optimization: Iterative (Newton) method
Data Management: None specified
Models, Parameters: Logistic weights
Nearest Neighbor Classifier
Task: Classification
Representation: Memory-based
Score Function: Cross-validated error (for selecting k)
Search/Optimization: None
Data Management: None specified
Models, Parameters: None
Support Vector Machines
Task: Classification
Representation: Hyperplanes
Score Function: “Margin”
Search/Optimization: Convex optimization (quadratic programming)
Data Management: None specified
Models, Parameters: None
Software (same as for Regression)
MATLAB
• Many free “toolboxes” on the Web for regression and prediction
– e.g., see http://lib.stat.cmu.edu/matlab/, and in particular the CompStats toolbox
R
• General-purpose statistical computing environment (successor to S)
– Free (!), available for download
– Widely used by statisticians, has a huge library of functions and visualization tools
Commercial tools
• SAS, other statistical packages
• Data mining packages
• Often are not programmable: offer a fixed menu of items