Advanced Analytical Theory and
Methods: Classification
IT 532: Advanced Topics in Data Mining
Isra Al-Turaiki, PhD
What is Classification?
• A fundamental learning method that appears in applications related to data mining.
• In classification learning:
• a classifier is presented with a set of examples (already classified).
• The classifier learns to assign labels to unseen examples.
• Basically, it learns how the attributes of these observations contribute to the classification.
Supervised vs. Unsupervised Learning
Classification vs. Numeric Prediction
Classification: A Two-Step Process
Classification: Learning Step
Classification: Classification Step
Example Step (1): Learning
Example Step (2): Classification
This Lecture
• In this lecture we focus on two fundamental classification methods:
• Decision trees
• Naïve Bayes
Decision Tree Induction
Decision Tree: Example
Decision Tree: Example
Root
Branch
Internal node
Leaf node
Decision Tree
• The depth of a node is the minimum number of steps required to reach the node from the root.
• Leaf nodes are at the end of the last branches of the tree. They represent class labels, the outcome of all the prior decisions.
The General Algorithm
A greedy algorithm: a tree T is constructed from a training set S.
• First, all the training records S are at the root.
• Split S recursively based on a selected attribute.
• IF all the records in S belong to some class C, or if S is sufficiently pure, then that node is considered a leaf node and assigned the label C.
• ELSE continue splitting.
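The greedy IF/ELSE recursion above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the slides: the helper names (`is_pure`, `majority_label`, `build_tree`), the purity threshold, and the stand-in attribute choice (a real algorithm would pick the most informative attribute, as later slides explain) are all assumptions.

```python
from collections import Counter

def is_pure(records, threshold=1.0):
    # "Sufficiently pure": the majority class fraction meets the
    # threshold (the threshold value here is illustrative).
    labels = [label for _, label in records]
    most_common = Counter(labels).most_common(1)[0][1]
    return most_common / len(labels) >= threshold

def majority_label(records):
    labels = [label for _, label in records]
    return Counter(labels).most_common(1)[0][0]

def build_tree(records, attributes):
    # IF all records belong to one class (or the node is pure enough)
    # or no attributes remain: return a leaf with the majority label C.
    if is_pure(records) or not attributes:
        return majority_label(records)
    # ELSE: greedily pick an attribute and split S recursively.
    attr = attributes[0]  # stand-in for "most informative attribute"
    partitions = {}
    for features, label in records:
        partitions.setdefault(features[attr], []).append((features, label))
    remaining = [a for a in attributes if a != attr]
    return {attr: {value: build_tree(part, remaining)
                   for value, part in partitions.items()}}
```

The returned tree is a nested dict: internal nodes map an attribute to its branches, and leaves are class labels.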
When to Stop?
• Conditions for stopping splitting:
• All the records in partition D belong to the same class.
• All the leaf nodes in T satisfy a minimum purity threshold.
• There are no remaining attributes for further partitioning.
• Any other stopping criterion is satisfied (such as the maximum depth of the tree).
Purity
• The purity of a node is defined as the probability of the corresponding class at that node.
Attribute Selection
Attribute Selection
• The first step in constructing a decision tree is to choose the most informative attribute.
• Entropy measures the impurity of an attribute (or disorder in a dataset).
• Information gain measures the purity of an attribute.
Entropy
• Given a class X and a label x ∈ X, let P(x) be the probability of x.
• H_X, the entropy of X, is defined as

H_X = − Σ_{x∈X} P(x) log₂ P(x)
Entropy
• H_X becomes 0 when every P(x) is 0 or 1.
• Maximum entropy when all the class labels are equally probable.
Entropy Example
• The data set has 50% positive records and 50% negative records.

H_X = −0.5 log₂(0.5) − 0.5 log₂(0.5) = 1
Entropy Example
• The data set has 20% positive records and 80% negative records.

H_X = −0.2 log₂(0.2) − 0.8 log₂(0.8) = 0.722
Entropy Example
• The data set has 100% positive records and 0% negative records.

H_X = −1 log₂(1) − 0 log₂(0) = 0
Entropy
• As the data become purer, the entropy value becomes smaller.
• The base entropy is defined as the entropy of the output variable.
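The entropy formula and the three worked examples above can be checked with a short Python function (the function name and the list-of-labels input format are illustrative):

```python
import math

def entropy(labels):
    """Shannon entropy: H = -sum over classes of P(x) * log2 P(x)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    # Terms with P(x) = 0 are conventionally treated as 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 50/50 split -> entropy 1; 20/80 -> about 0.722; pure -> 0.
print(entropy(["+"] * 5 + ["-"] * 5))
print(entropy(["+"] * 2 + ["-"] * 8))
print(entropy(["+"] * 10))
```

As the examples show, entropy is maximal (1 bit for two classes) when the labels are equally probable and shrinks to 0 as the data become pure.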
Conditional Entropy
• The conditional entropy is computed for each attribute.
• Given an attribute X with values x, and an outcome Y with values y,
• the conditional entropy H_{Y|X} is the remaining entropy of Y given X:

H_{Y|X} = − Σ_{x∈X} P(x) Σ_{y∈Y} P(y|x) log₂ P(y|x)
Information Gain
Information Gain
Information gain is defined as the difference between the base entropy (the entropy of the output variable Y) and the conditional entropy of the attribute:

InfoGain = H_Y − H_{Y|X}
Information Gain: Example
Information Gain: Example
Decision Tree Algorithms
Information Gain: Example
ID3 Algorithm
T = training set, P = output variable, A = attribute
C4.5
• Introduces many improvements over ID3.
• Can handle missing data
• by considering only the records where the attribute is defined.
• Both categorical and continuous attributes are supported by C4.5.
CART
• Classification And Regression Trees
• Can handle continuous attributes.
• Uses the Gini diversity index as its impurity measure: Gini_X = 1 − Σ_{x∈X} P(x)²
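The Gini diversity index can be sketched in a few lines of Python (the function name and list-of-labels input are illustrative, mirroring the entropy examples earlier):

```python
def gini(labels):
    """Gini diversity index: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Like entropy, Gini is 0 for a pure node and maximal for an even split
# (0.5 for two equally probable classes).
print(gini(["+"] * 5 + ["-"] * 5))
print(gini(["+"] * 10))
```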
Evaluating a Decision Tree
• Decision trees use greedy algorithms.
• Attribute selection may not be the best overall, but it is guaranteed to be the best at that step.
• A bad split is propagated through the rest of the tree.
Overfitting
• In overfitting, the model fits the training set well but performs poorly on new samples in the testing set.
• Too many branches; some may reflect anomalies due to noise or outliers.
Poor accuracy for unseen samples.
• Overfitting can be caused by either a lack of training data or biased data in the training set.
Overfitting
• Decision trees are greedy algorithms
– Best option at each step, maybe not best overall
– Addressed by ensemble methods: random forest
• Model might overfit the data
[Figure: error vs. tree size; blue = training set, red = test set]
Overcoming overfitting:
• Stop growing the tree early
• Grow the full tree, then prune
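Both remedies can be demonstrated with scikit-learn (an assumption on my part; the library, dataset, and parameter values are not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (1) Stop growing the tree early: cap its maximum depth.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow.fit(X_train, y_train)

# (2) Grow the full tree, then prune: cost-complexity pruning removes
# branches whose effective alpha falls below ccp_alpha.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0)
pruned.fit(X_train, y_train)

print(shallow.score(X_test, y_test), pruned.score(X_test, y_test))
```

Comparing training and test accuracy for these models against an unconstrained tree is a quick way to see the overfitting gap shrink.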
Tree Pruning
Decision Trees
• Advantages of decision trees
– Computationally inexpensive
– Outputs are easy to interpret: a sequence of tests
– Show importance of each input variable
– Decision trees handle
• Both numerical and categorical attributes
• Categorical attributes with many distinct values
• Variables with a nonlinear effect on the outcome
• Variable interactions
Decision Trees
• Disadvantages of decision trees
– Sensitive to small variations in the training data
– Overfitting can occur because each split reduces the training data available for subsequent splits
– Perform poorly if the dataset contains many irrelevant variables
References
• Chapter 7: Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services (Editor). ISBN: 978-1-118-87613-8.
• Charles Tappert's slides.