
4. Decision Tree Learning

Introduction
●
Decision tree learning is one of the most widely used and practical methods for
inductive classification tasks.
●
The trained model (i.e., the approximate target function) is a decision tree.
“Play tennis” example
- Each internal node tests an attribute
- Each branch corresponds to attribute value
- Each leaf node assigns a class
●
A decision tree takes <attribute values> as input and outputs the predicted class.
- Predict the class (Yes or No) of a new example <Outlook=Sunny, Temperature=Hot,
Humidity=High, Wind=Strong>
●
A training example consists of <attribute values, class>.
●
A decision tree can also be represented as a set of “if-then rules” to improve human
readability. For the “play tennis” example, create the if-then rules.
●
Create your own decision tree for the following training example:
- A training example is composed of <x, y, c> where x, y: attributes and c: class
- Training set = { <x1, y1, c1>, <x2, y1,c1>, <x3, y1, c2>, <x1, y2, c2>, <x2, y2, c2>, <x3,
y2, c2> }
- From this example, we can see that decision tree learning splits the attribute space vertically
and horizontally in a recursive manner, so that each segmented region contains as many
examples of the same class label as possible.
●
Alternative description of decision tree
- For predicting C-section risk
When to Consider Decision Trees
●
Training examples (also called instances) are represented by attribute-class pairs.
- Attribute values can be categorical, discrete, or continuous (continuous values require
discretization).
- The class should be categorical.
●
Training data may contain errors – decision tree learning is robust to errors, both in
class values and in attribute values.
●
Training data may contain missing attribute values.
●
Application examples
- Equipment or medical diagnosis
- Credit risk analysis
How Do Machines Learn Decision Trees
●
Decision tree learning is an optimization problem – find the best tree in the set of all
possible trees (hypothesis space) that maximizes the classification accuracy of training
examples.
●
Widely used decision tree learning algorithms are ID3 and its revisions (C4.5 and C5.0).
We will review ID3 in this chapter.
●
The ID3 algorithm employs a top-down, greedy search through the hypothesis space.
Main loop:
1. A ← the “best” decision attribute for next node
2. Assign A as decision attribute for the node
3. For each value of A, create new descendant of the node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified, then stop. Otherwise, iterate over new
leaf nodes
End loop
●
The central choice in the ID3 algorithm is selecting which attribute to test at each
node in the tree.
- which attribute is best?
Entropy & Information Gain
●
We would like to select the attribute that is most useful for classifying the training
examples. For this purpose, we define a statistical property, referred to as information
gain, that measures how well a given attribute separates the training examples
according to their target classification. Information gain is based on entropy.
●
Entropy measures the impurity (uncertainty) of an arbitrary collection of training
examples. Given a collection 𝑆, containing positive and negative examples, the entropy
of 𝑆 relative to this Boolean classification is
𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑆) = −𝑝+ 𝑙𝑜𝑔2 𝑝+ − 𝑝− 𝑙𝑜𝑔2 𝑝−
where 𝑝+ is the proportion of positive examples in 𝑆, 𝑝− is the proportion of negative
examples in 𝑆.
- The entropy is 0 when all examples in the collection 𝑆 belong to the same class (all positive or all
negative). Note that we define 0 log 0 to be 0.
- The entropy is 1 when the collection 𝑆 contains an equal number of positive and
negative examples
- The entropy is between 0 and 1 when the collection 𝑆 contains unequal numbers of
positive and negative examples
- Example: calculate the entropy of a collection 𝑆 that contains 9 positive and 5
negative examples ([9+, 5-]); a worked sketch follows below.
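A minimal Python sketch of the worked example (the function name is illustrative, not from the text):

```python
import math

def binary_entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                        # by convention, 0 * log2(0) = 0
            result -= p * math.log2(p)
    return result

# [9+, 5-]: -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.940
print(round(binary_entropy(9, 5), 3))    # 0.94
```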
●
(Generalization) If class can take 𝑐 different class values, then the entropy of 𝑆 is
defined as
Entropy(S) = Σ_{i=1}^{c} −p_i log2 p_i
●
Given an attribute, information gain is the expected reduction in entropy caused by
partitioning the training examples according to this attribute. More precisely, the
information gain, 𝐺𝑎𝑖𝑛(𝑆, 𝐴) of an attribute 𝐴, relative to a collection of examples 𝑆, is
defined as
Gain(S, A) = Entropy(S) − Σ_{v∈values(A)} (|S_v| / |S|) Entropy(S_v)
where 𝑣𝑎𝑙𝑢𝑒𝑠(𝐴) is the set of all possible values for attribute 𝐴 and 𝑆𝑣 is the subset
of 𝑆 for which attribute 𝐴 has value 𝑣.
- Calculate the information gain for attributes A1 and A2. Which one is the better classifier? (A sketch follows below.)
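A minimal Python sketch of the information-gain computation; the toy data set and function names are illustrative, not taken from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (any number of classes)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(S, A): Entropy(S) minus the weighted entropy of the subsets S_v.
    Each example is a dict of attribute values plus a 'class' entry."""
    gain = entropy([e['class'] for e in examples])
    for v in set(e[attribute] for e in examples):
        subset = [e['class'] for e in examples if e[attribute] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# Hypothetical toy set in the spirit of the A1/A2 exercise:
S = [{'A1': 'T', 'A2': 'T', 'class': '+'},
     {'A1': 'T', 'A2': 'F', 'class': '+'},
     {'A1': 'F', 'A2': 'T', 'class': '-'},
     {'A1': 'F', 'A2': 'F', 'class': '-'}]
print(information_gain(S, 'A1'))   # 1.0 -> A1 separates the classes perfectly
print(information_gain(S, 'A2'))   # 0.0 -> A2 gives no information
```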
●
Example
ID3 Algorithm
●
Complete ID3 algorithm (recursive form)
ID3 (Examples, Attribute Set)
1. Create a Root node for the tree.
2. If all Examples have a single class value, return the single-node tree Root, with
label=the class value.
3. If Attribute Set is empty, return the single-node tree Root, with label=most common
class value in Examples.
4. Otherwise begin
4.1 A ← the attribute from Attribute Set that best classifies the examples.
4.2 Decision attribute for Root ← A.
4.3 For each value vi of A,
4.3.1 Add a new tree branch below Root with A=vi.
4.3.2 Let Examples(vi) be the subset of Examples with value vi for A.
If Examples(vi) is empty,
Then
below this new branch add a leaf node with label=most common class
value in Examples.
Else
below this new branch add the subtree ID3(Examples(vi), Attribute Set –
{A})
End begin
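A minimal recursive Python sketch of the algorithm above, reusing the information_gain helper sketched earlier; the tuple-based tree representation is an illustrative choice, not part of the slides:

```python
from collections import Counter

def most_common_class(examples):
    """Most common class value in a set of examples."""
    return Counter(e['class'] for e in examples).most_common(1)[0][0]

def id3(examples, attributes):
    """Return a leaf (a class label) or an internal node (attribute, {value: subtree})."""
    classes = set(e['class'] for e in examples)
    if len(classes) == 1:                  # step 2: all examples share one class
        return classes.pop()
    if not attributes:                     # step 3: no attributes left to test
        return most_common_class(examples)
    # step 4.1: pick the attribute that best classifies the examples
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    for v in set(e[best] for e in examples):   # step 4.3: one branch per observed value
        subset = [e for e in examples if e[best] == v]
        # step 4.3.2: recurse on Examples(vi) with the remaining attributes
        # (values of `best` not seen at this node are omitted here; ID3 would
        # instead add a leaf labeled with the most common class in Examples)
        branches[v] = id3(subset, [a for a in attributes if a != best])
    return (best, branches)

# Usage with the hypothetical data above: tree = id3(S, ['A1', 'A2'])
```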
Hypothesis Space Search by ID3
●
The hypothesis space searched by ID3 is the set of all possible decision trees
●
ID3 performs top-down, greedy search (also called hill climbing search) guided by
information gain.
- Only one decision tree is generated.
- No backtracking is allowed (the search can get stuck in a locally optimal tree).
●
ID3 searches for the decision tree in the hypothesis space that classifies the training
examples as correctly as possible.
Property of ID3
●
Given a collection of training examples, there are typically many decision trees
consistent with these examples. Which of these decision trees does ID3 prefer? Shorter
trees are preferred over larger trees.
●
Why prefer shorter decision tree?
- Scientists sometimes appear to follow this inductive bias. For example, physicists prefer
simple explanations for the motions of planets over more complex explanations.
- Why? One argument is that because there are fewer short hypotheses than longer ones,
it is less likely that one will find a short hypothesis that coincidentally fits the training
data. In contrast, there are often many very complex hypotheses that fit the current
training data but fail to generalize correctly to subsequent data.
- Occam’s razor: Prefer the simplest hypothesis that fits the data.
Overfitting in Decision Trees
●
Consider the error of hypothesis h over
- training data: errortrain(h)
- entire distribution D of data: errorD(h)
Hypothesis h∈H overfits training data if there is an alternative hypothesis h’ ∈H such that
errortrain(h) < errortrain(h’)
and
errorD(h) > errorD(h’)
●
Overfitting can lead to difficulties
- When there is noise in the data
Consider adding noisy training example #15:
<Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No>
What effect on the earlier tree?
- When the number of training examples is too small to produce representative samples
●
How can we avoid overfitting?
- First approach: Stop growing the tree early, before it reaches the point where it
perfectly classifies the training data.
- Second approach: Grow full-size tree, then post-prune.
- The second approach is considered more practical because, with the first approach, it is
difficult to decide when to stop growing the tree.
●
Reduced error pruning
- Split data into training and validation set.
- Do until further pruning is harmful:
1. Starting from bottom decision nodes, evaluate impact on validation set of
pruning each possible node (bottom-up pruning).
2. Remove the decision node with its subtree that most improves validation set
accuracy. Replace the subtree with a leaf node.
- Suppose that the decision node A has a set S of examples (see the following
figure), and its two branches (v1 and v2) have example sets Sv1 and Sv2. Then,
if Error_rate(S, A) < Error_rate(Sv1, v1) + Error_rate(Sv2, v2), replace the
subtree with a leaf node labeled with the most common class value in the example
set S of the decision node A (as sketched in the code after this bullet).
(Figure: bottom-up pruning. The subtree rooted at decision node A, whose branches v1 and v2 carry example sets Sv1 and Sv2 out of A's example set S, is replaced by a single leaf node labeled with the most common class value in the example set S.)
- The following figure shows the effect of pruning on accuracy. As learning
continues, the tree's accuracy on the training set keeps improving, but its accuracy on the
test data set decreases (the overfitting effect). If we employ reduced-error
pruning, however, we can find the tree size that gives high accuracy on the test
data set.
- The major drawback is that when data is limited, withholding part of it for the validation
set reduces even further the number of examples available for training.
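A simplified reduced-error-pruning sketch in Python, assuming the tuple-based tree and the most_common_class helper from the ID3 sketch above. It prunes bottom-up and compares a candidate leaf against the subtree on the validation examples routed to that node, a local approximation of the procedure described here rather than a full re-evaluation of the whole tree:

```python
def classify(tree, example, default):
    """Follow branches until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches.get(example[attribute], default)
    return tree

def accuracy(tree, examples, default):
    return sum(classify(tree, e, default) == e['class'] for e in examples) / len(examples)

def reduced_error_prune(tree, train, validation, default):
    """Bottom-up pruning: replace a subtree by a leaf labeled with the most
    common class in its training examples whenever validation accuracy does
    not get worse."""
    if not isinstance(tree, tuple) or not validation:
        return tree
    attribute, branches = tree
    pruned = {}
    for v, sub in branches.items():                 # prune the children first
        tr = [e for e in train if e[attribute] == v]
        va = [e for e in validation if e[attribute] == v]
        pruned[v] = reduced_error_prune(sub, tr, va, default) if tr else sub
    tree = (attribute, pruned)
    leaf = most_common_class(train)
    if accuracy(leaf, validation, default) >= accuracy(tree, validation, default):
        return leaf                                 # pruning does not hurt: keep the leaf
    return tree
```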
●
How do we measure the performance (error rate or accuracy) of a decision tree algorithm?
Moreover, suppose that you have developed a new classifier. How do you test the
performance of the classifier with a small training set?
(1) k-fold cross validation (leave-one-out in the extreme case); see the sketch below
(2) Bootstrapping
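A plain-Python sketch of k-fold cross validation; train_fn and eval_fn are hypothetical callables standing in for any learner and any accuracy measure:

```python
import random

def k_fold_cross_validation(examples, k, train_fn, eval_fn, seed=0):
    """Average accuracy over k train/test splits.
    train_fn(train_set) -> model; eval_fn(model, test_set) -> accuracy."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        scores.append(eval_fn(train_fn(train), test))
    return sum(scores) / k

# Leave-one-out is the extreme case: k = len(examples)
```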
Alternative Measures for Selecting Attributes
●
Gain ratio
- Gain ratio is used by the C4.5 algorithm, an updated version of ID3 (the latest version is
C5.0).
- It was mainly developed to cope with test attributes that have many attribute values.
- For example, suppose that the temperature (a test attribute) in the play tennis game takes
many discrete (or continuous) values (e.g. 24 ℃, 25 ℃, …). Splitting on temperature then
results in many lower nodes, one for each temperature value, each with a small entropy value.
This is not what we want (we prefer rather simple trees: Occam's razor!).
- The gain ratio is defined as
Gain Ratio(S, A) = Gain(S, A) / Split Information(S, A)
where Gain(S, A) is the information gain and Split Information(S, A) is given by
Split Information(S, A) = − Σ_{v∈values(A)} (|S_v| / |S|) log2 (|S_v| / |S|)
- The split information is high when the number of attribute values, |values(A)|, is large and the
lower nodes contain similar numbers of examples.
- Therefore, the gain ratio prefers test attributes that have high information gain,
Gain(S, A), and low split information, Split Information(S, A) (a sketch follows below).
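A minimal gain-ratio sketch, reusing the information_gain helper sketched earlier (names are illustrative):

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """Split Information(S, A) = -sum_v (|S_v|/|S|) log2(|S_v|/|S|)."""
    total = len(examples)
    counts = Counter(e[attribute] for e in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    if si == 0:            # every example has the same value for this attribute
        return 0.0
    return information_gain(examples, attribute) / si
```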
●
Gini index
- The gini index is used by CART (classification and regression trees), one of the earliest
decision tree models, developed by Breiman et al. (1984).
- CART can handle both categorical and numeric class values.
- When the class value is categorical, the gini index is given by
g(t) = Σ_{n≠l} f(n|t) f(l|t) = Σ_n f(n|t) {1 − f(n|t)} = 1 − Σ_n f(n|t)²
where 𝑡 is a node and 𝑓 (𝑛|𝑡) is the fraction of examples labeled with category value 𝑛
at the node 𝑡.
- The gini index measures the impurity of the node 𝑡. It has a low value when Σ_n f(n|t)²
is high, which happens when most examples in the node fall into a single category value.
- Given a parent node t_p, the change of impurity from splitting with attribute A, Δg(t_p, A), is
Δg(t_p, A) = g(t_p) − E_A[g(t_c)] = g(t_p) − Σ_{v∈values(A)} (|S_v| / |S|) g(t_v)
where 𝑆 is the set of examples in the node 𝑡𝑝 and 𝑆𝑣 is the set of examples in a child
node 𝑡𝑣 where all examples have attribute 𝐴 value of 𝑣.
- CART selects the best attribute A* with the maximum Δg and splits the tree at the
parent node using attribute A* (a sketch appears after this bullet).
- When the class value is numeric, a variance (error) criterion is used for splitting.
- At a node 𝑡, variance (sum of squared errors) is defined as
Σ_{i=1}^{n} (y_i − ȳ)²
where 𝑛 is the number of examples in the node 𝑡; 𝑦𝑖 is the class value of the i-th example;
𝑦̅ is the average class value of the examples in the node 𝑡.
- At the children nodes of the node 𝑡 split by attribute 𝐴, the variance (average sum of
squared errors) is defined as
Σ_{v∈values(A)} (|S_v| / |S|) Σ_{y_i∈t_v} (y_i − ȳ_{t_v})²
where 𝑦̅𝑡𝑣 is the average class value of the examples in a child node 𝑡𝑣 .
- CART selects the best attribute A* with the maximum variance difference, defined as
Variance difference = Σ_{i=1}^{n} (y_i − ȳ)² − Σ_{v∈values(A)} (|S_v| / |S|) Σ_{y_i∈t_v} (y_i − ȳ_{t_v})²
and splits the tree at the parent node using attribute A*.
- When the class value is numeric, the decision tree is called a regression tree.
- In a regression tree, the representative class value of a leaf node is the average class
value of the examples in that node. Therefore, unlike popular statistical regression models
such as linear regression and neural networks, a regression tree predicts discontinuous
(piecewise constant) class values.
- In fact, the variance plays the same role for numeric class values that the gini index plays
for categorical class values (a sketch of both splitting criteria follows below).
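A minimal sketch of both CART splitting criteria above, using the same dict-of-attributes example format as before (function names are illustrative):

```python
from collections import Counter

def gini(examples):
    """g(t) = 1 - sum_n f(n|t)^2, the impurity of a node's class fractions."""
    total = len(examples)
    return 1.0 - sum((c / total) ** 2
                     for c in Counter(e['class'] for e in examples).values())

def gini_decrease(examples, attribute):
    """Delta g(t_p, A) = g(t_p) - sum_v (|S_v|/|S|) g(t_v)."""
    delta = gini(examples)
    for v in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == v]
        delta -= (len(subset) / len(examples)) * gini(subset)
    return delta

def sse(values):
    """Sum of squared errors around the mean class value at a node."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)

def variance_difference(examples, attribute):
    """Parent SSE minus the |S_v|/|S|-weighted SSE of the children (numeric class)."""
    diff = sse([e['class'] for e in examples])
    for v in set(e[attribute] for e in examples):
        child = [e['class'] for e in examples if e[attribute] == v]
        diff -= (len(child) / len(examples)) * sse(child)
    return diff

# CART-style choice: best = max(attributes, key=lambda a: gini_decrease(S, a))
# (or variance_difference for a regression tree)
```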
Miscellaneous
●
Incorporating continuous-valued attributes
- For a continuous-valued attribute, define a new discrete-valued attribute that partitions
the continuous-valued attribute into a discrete set of intervals.
- Thresholds {𝑐} should be determined for the discretization. It is usual to choose the
midpoint between two consecutive values (i.e. 𝑐 = (𝑣𝑖 + 𝑣𝑖+1 )/2) as a threshold.
(Figure: discretizing the temperature attribute T. Single-threshold case: one split at T > 54 = (48+60)/2. Two-threshold case: three intervals, T ≤ 54, 54 < T ≤ 85, and T > 85 = (80+90)/2, with the training examples distributed across the intervals.)
- For the single-threshold case, pick the threshold c that produces the greatest information
gain (see the sketch below).
- The multiple-threshold case is more complex.
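A sketch of the single-threshold case, reusing the information_gain helper from earlier; the temporary boolean attribute name is an illustrative convention:

```python
def best_threshold(examples, attribute):
    """Try the midpoint between each pair of consecutive sorted values and
    keep the threshold c with the highest information gain."""
    values = sorted(set(e[attribute] for e in examples))
    best_c, best_gain = None, -1.0
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2                          # e.g. 54 = (48 + 60) / 2
        flag = attribute + '>' + str(c)            # temporary boolean attribute
        split = [dict(e, **{flag: e[attribute] > c}) for e in examples]
        g = information_gain(split, flag)
        if g > best_gain:
            best_c, best_gain = c, g
    return best_c, best_gain
```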
●
Handling training examples with unknown attribute values
- Consider a situation in which 𝐺𝑎𝑖𝑛 (𝑆, 𝐴𝑖 ) is to be calculated at node 𝑛 to evaluate
whether the attribute 𝐴𝑖 is the best attribute to test at this decision node. Suppose that
< 𝑎1 , … , 𝑎𝑛 , 𝑐 > is a training example and that the value 𝑎𝑖 is unknown.
(Figure: decision node n testing attribute Ai on example set S:
example 1 <a1, …, ai=T, …, an, c2>
example 2 <a1, …, ai=?, …, an, c1>
example 3 <a1, …, ai=T, …, an, c2>
example 4 <a1, …, ai=F, …, an, c1>)
- Three strategies for this case:
1. Assign the most common value of 𝐴𝑖 among the other examples sorted to node 𝑛.
2. Assign the most common value of 𝐴𝑖 among the other examples sorted to node 𝑛 that
have the same class value 𝑐.
3. Assign a probability 𝑝𝑖 to each possible value 𝑣𝑖 of 𝐴𝑖 and estimate 𝑎𝑖 with
a probabilistic choice.
- Classify new examples in the same fashion (a sketch of the first strategy follows below).
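A minimal sketch of the first strategy (most-common-value imputation); the helper name and the '?' missing-value marker are illustrative assumptions:

```python
from collections import Counter

def fill_missing(examples, attribute, missing='?'):
    """Replace a missing value of the attribute with its most common value
    among the other examples sorted to this node (strategy 1 above)."""
    known = [e[attribute] for e in examples if e[attribute] != missing]
    most_common = Counter(known).most_common(1)[0][0]
    return [dict(e, **{attribute: most_common}) if e[attribute] == missing else e
            for e in examples]
```

Strategy 2 would restrict `known` to examples with the same class value, and strategy 3 would distribute the example fractionally across the values of 𝐴𝑖 in proportion to their observed frequencies.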