SVM and Decision Tree
Le Song
Machine Learning I
CSE 6740, Fall 2013
Which decision boundary is better?
Suppose the training samples are linearly separable. Then we can find a decision boundary which gives zero training error.
But there are many such decision boundaries. Which one is better?
(Figure: linearly separable points from Class 1 and Class 2, with several zero-error decision boundaries drawn between them.)
Compare two decision boundaries
Suppose we perturb the data. Which boundary is more susceptible to error?
Constraints on data points
For all $x$ in class 2, $y = 1$ and $w^\top x + b \ge c$.
For all $x$ in class 1, $y = -1$ and $w^\top x + b \le -c$.
Or, more compactly, $(w^\top x + b)\,y \ge c$.
(Figure: the hyperplane $w^\top x + b = 0$ with normal vector $w$; dashed lines at $w^\top x + b = c$ on the Class 2 side and $w^\top x + b = -c$ on the Class 1 side.)
Classifier margin
Pick two data points $x^1$ and $x^2$ which lie on the two dashed lines, respectively. The margin is
$$\gamma = \frac{w^\top}{\|w\|}\,(x^1 - x^2) = \frac{2c}{\|w\|}.$$
(Figure: the hyperplane $w^\top x + b = 0$, with $x^1$ on the dashed line $w^\top x + b = c$ on the Class 2 side and $x^2$ on the dashed line $w^\top x + b = -c$ on the Class 1 side.)
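A quick numeric check of the margin formula (not part of the original slides), using an arbitrary $w$, $b$ and $c$ chosen for illustration:

```python
import numpy as np

w, b, c = np.array([3.0, 4.0]), 0.0, 1.0
x1 = c * w / (w @ w)         # a point on w^T x + b = c   (Class 2 side)
x2 = -c * w / (w @ w)        # a point on w^T x + b = -c  (Class 1 side)

gamma = (w / np.linalg.norm(w)) @ (x1 - x2)
print(gamma, 2 * c / np.linalg.norm(w))   # both print 0.4
```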
Maximum margin classifier
Find the decision boundary $w$ as far from the data points as possible:
$$\max_{w,b}\ \frac{2c}{\|w\|} \quad \text{s.t.}\quad y^i\,(w^\top x^i + b) \ge c,\ \forall i.$$
(Figure: the same picture as before, with the margin between the two dashed lines maximized.)
Support vector machines with hard margin
Since rescaling $w$ and $b$ leaves the boundary unchanged, we can fix $c = 1$, and maximizing $2/\|w\|$ becomes
$$\min_{w,b}\ \|w\|^2 \quad \text{s.t.}\quad y^i\,(w^\top x^i + b) \ge 1,\ \forall i.$$
Convert to standard form:
$$\min_{w,b}\ \frac{1}{2} w^\top w \quad \text{s.t.}\quad 1 - y^i\,(w^\top x^i + b) \le 0,\ \forall i.$$
The Lagrangian function:
$$L(w, b, \alpha) = \frac{1}{2} w^\top w + \sum_{i=1}^{m} \alpha_i\,\bigl(1 - y^i\,(w^\top x^i + b)\bigr).$$
Deriving the dual problem
$$L(w, b, \alpha) = \frac{1}{2} w^\top w + \sum_{i=1}^{m} \alpha_i\,\bigl(1 - y^i\,(w^\top x^i + b)\bigr)$$
Taking derivatives and setting them to zero:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \alpha_i\, y^i x^i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i\, y^i = 0.$$
Plug back the relation of $w$ and $b$
$$L(w, b, \alpha) = \frac{1}{2}\Bigl(\sum_{i=1}^{m} \alpha_i\, y^i x^i\Bigr)^{\!\top}\Bigl(\sum_{j=1}^{m} \alpha_j\, y^j x^j\Bigr) + \sum_{i=1}^{m} \alpha_i\,\Bigl(1 - y^i\,\Bigl(\bigl(\textstyle\sum_{j=1}^{m} \alpha_j\, y^j x^j\bigr)^{\!\top} x^i + b\Bigr)\Bigr)$$
After simplification:
$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j\, y^i y^j\, (x^i)^\top x^j$$
The dual problem
$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j\, y^i y^j\, (x^i)^\top x^j$$
$$\text{s.t.}\quad \alpha_i \ge 0,\ i = 1, \dots, m, \qquad \sum_{i=1}^{m} \alpha_i\, y^i = 0$$
This is a constrained quadratic program. It is nice and convex, and the global maximum can be found.
$w$ can be recovered as $w = \sum_{i=1}^{m} \alpha_i\, y^i x^i$. How about $b$?
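As a concrete illustration (not part of the original slides), here is a minimal sketch that solves this dual numerically with SciPy's general-purpose solver on a made-up toy dataset; the data, variable names, and solver choice are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -4.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# K[i, j] = y_i y_j x_i^T x_j, the matrix appearing in the dual objective.
K = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # Negate the dual objective so that minimizing it maximizes the dual.
    return 0.5 * alpha @ K @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(m),
               bounds=[(0.0, None)] * m,                            # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x

# Recover w from the stationarity condition; b comes from any support vector.
w = (alpha * y) @ X
sv = np.argmax(alpha)                    # a point with alpha_i > 0
b = y[sv] - w @ X[sv]
print(np.round(alpha, 4), np.round(w, 4), round(b, 4))   # w roughly (0.25, 0.25), b roughly 0
```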
Support vectors
Note the KKT (complementary slackness) condition:
$$\alpha_i\,\bigl(1 - y^i\,(w^\top x^i + b)\bigr) = 0$$
For data points with $1 - y^i\,(w^\top x^i + b) < 0$, we must have $\alpha_i = 0$.
For data points with $1 - y^i\,(w^\top x^i + b) = 0$, we can have $\alpha_i > 0$.
The training data points whose $\alpha_i$'s are nonzero are called the support vectors (SV).
(Figure: most points have $\alpha_i = 0$; only the points lying on the dashed margin lines have nonzero $\alpha_i$, e.g. $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$.)
Computing b and obtaining the classifier
Pick any data point with $\alpha_i > 0$ and solve for $b$ using
$$1 - y^i\,(w^\top x^i + b) = 0$$
For a new test point $z$, compute
$$w^\top z + b = \sum_{i \in \text{support vectors}} \alpha_i\, y^i\, (x^i)^\top z + b$$
Classify $z$ as class 2 if the result is positive, and class 1 otherwise (recall class 2 has label $y = 1$).
Interpretation of support vector machines
The optimal $w$ is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression.
To compute the weights $\alpha_i$, and to use the support vector machine, we only need to specify the inner products (or kernel) between the examples, $(x^i)^\top x^j$.
We make decisions by comparing each new example $z$ with only the support vectors:
$$y^* = \mathrm{sign}\Bigl(\sum_{i \in \text{support vectors}} \alpha_i\, y^i\, (x^i)^\top z + b\Bigr)$$
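A minimal sketch of this decision rule (assuming arrays alpha, y, X and scalar b as in the earlier dual sketch; the helper itself is illustrative, not from the slides):

```python
import numpy as np

def predict(z, X, y, alpha, b, tol=1e-8):
    sv = alpha > tol                                      # support vectors: nonzero alpha_i
    score = np.sum(alpha[sv] * y[sv] * (X[sv] @ z)) + b   # sum over SVs of alpha_i y_i x_i^T z + b
    return int(np.sign(score))                            # +1 corresponds to class 2 (y = 1) here
```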
Soft margin constraints
What if the data is not linearly separable?
We will allow points to violate the hard margin constraint by a slack $\xi \ge 0$:
$$(w^\top x + b)\,y \ge 1 - \xi$$
(Figure: the hyperplane $w^\top x + b = 0$ with three points on the wrong side of their margin line, with slacks $\xi_1$, $\xi_2$, $\xi_3$.)
Soft margin SVM
$$\min_{w,b,\xi}\ \|w\|^2 + C\sum_{i=1}^{m}\xi^i \quad \text{s.t.}\quad y^i\,(w^\top x^i + b) \ge 1 - \xi^i,\ \xi^i \ge 0,\ \forall i$$
Convert to standard form:
$$\min_{w,b,\xi}\ \frac{1}{2} w^\top w + C\sum_{i=1}^{m}\xi^i \quad \text{s.t.}\quad 1 - y^i\,(w^\top x^i + b) - \xi^i \le 0,\ \xi^i \ge 0,\ \forall i$$
The Lagrangian function:
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^\top w + \sum_{i=1}^{m}\Bigl( C\xi^i + \alpha_i\,\bigl(1 - y^i\,(w^\top x^i + b) - \xi^i\bigr) - \beta_i\,\xi^i \Bigr)$$
Deriving the dual problem
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^\top w + \sum_{i=1}^{m}\Bigl( C\xi^i + \alpha_i\,\bigl(1 - y^i\,(w^\top x^i + b) - \xi^i\bigr) - \beta_i\,\xi^i \Bigr)$$
Taking derivatives and setting them to zero:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \alpha_i\, y^i x^i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i\, y^i = 0, \qquad \frac{\partial L}{\partial \xi^i} = C - \alpha_i - \beta_i = 0.$$
Plug back the relation of $w$, $b$ and $\xi$
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\Bigl(\sum_{i=1}^{m} \alpha_i\, y^i x^i\Bigr)^{\!\top}\Bigl(\sum_{j=1}^{m} \alpha_j\, y^j x^j\Bigr) + \sum_{i=1}^{m} \alpha_i\,\Bigl(1 - y^i\,\Bigl(\bigl(\textstyle\sum_{j=1}^{m} \alpha_j\, y^j x^j\bigr)^{\!\top} x^i + b\Bigr)\Bigr)$$
After simplification (the $\xi$ terms cancel because $C - \alpha_i - \beta_i = 0$):
$$L(w, b, \xi, \alpha, \beta) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j\, y^i y^j\, (x^i)^\top x^j$$
The dual problem
$$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j\, y^i y^j\, (x^i)^\top x^j$$
$$\text{s.t.}\quad C - \alpha_i - \beta_i = 0,\ \alpha_i \ge 0,\ \beta_i \ge 0,\ i = 1, \dots, m, \qquad \sum_{i=1}^{m} \alpha_i\, y^i = 0$$
The constraints $C - \alpha_i - \beta_i = 0$, $\alpha_i \ge 0$, $\beta_i \ge 0$ can be simplified to $0 \le \alpha_i \le C$.
This is a constrained quadratic program. It is nice and convex, and the global maximum can be found.
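Relative to the hard-margin sketch earlier, only the bounds on $\alpha$ change. A minimal sketch under the same assumptions (made-up data and variable names, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

def soft_margin_dual(X, y, C):
    m = len(y)
    K = (y[:, None] * y[None, :]) * (X @ X.T)        # y_i y_j x_i^T x_j
    neg_dual = lambda a: 0.5 * a @ K @ a - a.sum()
    res = minimize(neg_dual, np.zeros(m),
                   bounds=[(0.0, C)] * m,                              # box constraint 0 <= alpha_i <= C
                   constraints={"type": "eq", "fun": lambda a: a @ y}) # sum_i alpha_i y_i = 0
    return res.x

# Overlapping toy data: the last point sits inside the other class's region.
# By the KKT conditions, points that end up violating the margin get alpha_i = C.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [0.5, 0.8]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(np.round(soft_margin_dual(X, y, C=1.0), 3))
```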
Learning nonlinear decision boundary
Some problems are linearly separable, but many are not, e.g. the XOR gate or speech recognition.
(Figure: a linearly separable dataset next to nonlinearly separable ones, including the XOR pattern.)
A decision tree for Tax Fraud
Input: a vector of attributes $X$ = [Refund, MarSt, TaxInc]
Output: $Y$ = Cheating or Not
The hypothesis $H$ as a procedure:
Each internal node: tests one attribute $X_i$
Each branch from a node: selects one value for $X_i$
Each leaf node: predicts $Y$
(Tree: Refund? Yes -> NO. No -> MarSt? Married -> NO. Single, Divorced -> TaxInc? < 80K -> NO. > 80K -> YES.)
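As an illustrative sketch (not from the slides), the tree above can be written directly as a small Python function over a record with the three attributes; the dictionary keys are assumptions:

```python
def cheat(record):
    """Predict 'Yes'/'No' for a dict with keys 'Refund', 'MarSt', 'TaxInc'
    (taxable income in thousands), following the tree above."""
    if record["Refund"] == "Yes":
        return "No"                                   # Refund = Yes -> NO
    if record["MarSt"] == "Married":
        return "No"                                   # Married -> NO
    return "Yes" if record["TaxInc"] > 80 else "No"   # Single/Divorced: split on income at 80K

print(cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))   # -> 'No'
```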
Apply model to test data
Query record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branch that matches each attribute value:
The root tests Refund; the record has Refund = No, so take the "No" branch to the MarSt node.
MarSt = Married, so take the "Married" branch, which leads to a leaf.
The leaf predicts NO, so assign Cheat to "No".
Expressiveness of decision trees
Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row becomes a path to a leaf.
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example.
We prefer to find more compact decision trees.
Hypothesis spaces (model space)
How many distinct decision trees with $n$ Boolean attributes?
= the number of Boolean functions
= the number of distinct truth tables with $2^n$ rows = $2^{2^n}$
E.g., with 6 Boolean attributes, there are $2^{64}$ = 18,446,744,073,709,551,616 trees.
How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out, so there are $3^n$ distinct conjunctive hypotheses.
A more expressive hypothesis space:
increases the chance that the target function can be expressed, but
increases the number of hypotheses consistent with the training set, so it may give worse predictions.
Decision tree learning
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Induction: a tree induction algorithm learns a model (a decision tree) from the Training Set.
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Deduction: apply the learned model to the Test Set to predict the Class.
Example of a decision tree
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: decision tree with splitting attributes Refund, MarSt, TaxInc:
Refund? Yes -> NO. No -> MarSt? Married -> NO. Single, Divorced -> TaxInc? < 80K -> NO. > 80K -> YES.
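An illustrative sketch (not from the slides): fitting a tree to these ten records with scikit-learn, using entropy as the impurity measure; the numeric encodings and the library choice are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

refund  = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]                   # Yes = 1, No = 0
married = [0, 1, 0, 1, 0, 1, 0, 0, 1, 0]                   # Married = 1, Single/Divorced = 0
income  = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]     # taxable income in thousands
cheat   = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]                   # Cheat: Yes = 1, No = 0

X = np.column_stack([refund, married, income])
clf = DecisionTreeClassifier(criterion="entropy").fit(X, cheat)
print(export_text(clf, feature_names=["Refund", "Married", "TaxInc"]))
```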
Another example of a decision tree
Using the same training data as on the previous slide, a different tree also fits:
MarSt? Married -> NO. Single, Divorced -> Refund? Yes -> NO. No -> TaxInc? < 80K -> NO. > 80K -> YES.
There could be more than one tree that fits the same data!
Top-down induction of a decision tree
Main loop:
A <- the "best" decision attribute for the next node
Assign A as the decision attribute for the node
For each value of A, create a new descendant of the node
Sort the training examples to the leaf nodes
If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
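A rough, self-contained sketch of this main loop (ID3-style, illustrative only, not the slides' code); it uses information gain, defined a few slides later, to pick the "best" attribute, and the toy examples at the end are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, A):
    # Reduction in entropy from splitting (attribute-dict, label) pairs on attribute A.
    labels = [y for _, y in examples]
    counts = Counter(x[A] for x, _ in examples)
    remainder = sum(cnt / len(examples) *
                    entropy([y for x, y in examples if x[A] == v])
                    for v, cnt in counts.items())
    return entropy(labels) - remainder

def build_tree(examples, attributes):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attributes:       # perfectly classified, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    A = max(attributes, key=lambda a: information_gain(examples, a))   # the "best" attribute
    node = {"attribute": A, "children": {}}
    for v in {x[A] for x, _ in examples}:             # one new descendant per value of A
        subset = [(x, y) for x, y in examples if x[A] == v]            # sort examples to the leaves
        node["children"][v] = build_tree(subset, [a for a in attributes if a != A])
    return node

examples = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
            ({"Refund": "No",  "MarSt": "Married"}, "No"),
            ({"Refund": "No",  "MarSt": "Single"}, "Yes")]
print(build_tree(examples, ["Refund", "MarSt"]))
```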
Tree induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records: how to specify the attribute test condition, and how to determine the best split?
Determine when to stop splitting
Splitting based on nominal attributes
Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
Binary split: divides the values into two subsets; we need to find the optimal partitioning, e.g. CarType -> {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting based on ordinal attributes
Multi-way split: use as many partitions as distinct values, e.g. Size -> {Small}, {Medium}, {Large}.
Binary split: divides the values into two subsets; we need to find the optimal partitioning, e.g. Size -> {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}.
Splitting based on continuous attributes
Different ways of handling:
Discretization to form an ordinal categorical attribute: static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering).
Binary decision: ($A < t$) or ($A \ge t$); consider all possible splits and find the best cut. This can be more compute intensive.
(Examples: (i) binary split "Taxable Income > 80K?" with branches Yes/No; (ii) multi-way split "Taxable Income?" into ranges < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.)
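A small illustrative sketch of the binary-cut search (not from the slides): scan the midpoints between consecutive sorted values and score each split. Gini impurity is used here as a simple stand-in score; the entropy measure defined on the following slides works the same way, and the data values are made up.

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum_c p_c^2 over the class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    candidates = []
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2                # midpoint between consecutive values
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        candidates.append((score, t))
    return min(candidates)                                     # (weighted impurity, threshold)

income = [60, 70, 75, 85, 90, 95]                 # taxable income in thousands
cheat = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_cut(income, cheat))                    # -> (0.0, 80.0): the cut at 80K is pure
```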
How to determine the best split
Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
A homogeneous split has a low degree of impurity; a non-homogeneous split has a high degree of impurity.
Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity.
How to compare attributes? Entropy
The entropy $H(X)$ of a random variable $X$ is the expected number of bits needed to encode a randomly drawn value of $X$ (under the most efficient code).
Information theory: the most efficient code assigns $-\log_2 P(X = i)$ bits to encode the message $X = i$, so the expected number of bits to code one random $X$ is
$$H(X) = -\sum_i P(X = i)\,\log_2 P(X = i)$$
Sample entropy
$S$ is a sample of training examples; $p_+$ is the proportion of positive examples in $S$ and $p_-$ is the proportion of negative examples in $S$.
The entropy of $S$ measures its impurity:
$$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$
Examples for computing entropy
C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6; Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6; Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
How to compare attributes? Conditional entropy
Conditional entropy of variable $X$ given variable $Y$:
Given a specific value $Y = v$, the entropy of $X$ is $H(X \mid Y = v) = -\sum_i P(X = i \mid Y = v)\,\log_2 P(X = i \mid Y = v)$.
The conditional entropy $H(X \mid Y)$ of $X$ is the average of $H(X \mid Y = v)$: $H(X \mid Y) = \sum_v P(Y = v)\, H(X \mid Y = v)$.
The mutual information (aka information gain) of $X$ given $Y$ is $I(X; Y) = H(X) - H(X \mid Y)$.
Information gain
Information gain (after splitting a node):
$$\mathrm{GAIN}_{\mathrm{split}} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)$$
where the $n$ samples in parent node $p$ are split into $k$ partitions and $n_i$ is the number of records in partition $i$.
It measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
Problem of splitting using information gain
Disadvantage: it tends to prefer splits that result in a large number of partitions, each being small but pure.
Gain ratio:
$$\mathrm{GainRATIO}_{\mathrm{split}} = \frac{\mathrm{GAIN}_{\mathrm{split}}}{\mathrm{SplitINFO}}, \qquad \mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n}\,\log\frac{n_i}{n}$$
This adjusts the information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitionings (a large number of small partitions) are penalized!
Used in C4.5; designed to overcome the disadvantage of information gain.
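An illustrative sketch (not from the slides) computing GAIN and GainRATIO for one candidate split, here the Marital Status split of the ten tax records above; the function names are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_and_gain_ratio(parent, partitions):
    n = len(parent)
    gain = entropy(parent) - sum(len(p) / n * entropy(p) for p in partitions)   # GAIN_split
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in partitions)        # SplitINFO
    return gain, gain / split_info                                              # (GAIN, GainRATIO)

# Cheat labels of the ten records, split by Marital Status:
parent = ["No"] * 7 + ["Yes"] * 3
partitions = [["No", "No", "No", "No"],            # Married  (Tids 2, 4, 6, 9)
              ["No", "No", "Yes", "Yes"],          # Single   (Tids 1, 3, 8, 10)
              ["Yes", "No"]]                       # Divorced (Tids 5, 7)
print(gain_and_gain_ratio(parent, partitions))     # roughly (0.28, 0.18)
```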
Stopping criteria for tree induction
Stop expanding a node when all the records belong to the same class
Stop expanding a node when all the records have similar attribute values
Early termination (to be discussed later)
Decision tree based classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5
Simple depth-first construction
Uses information gain
Sorts continuous attributes at each node
Needs the entire data to fit in memory, so it is unsuitable for large datasets (would need out-of-core sorting)
You can download the software from:
http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz