
4 Classification - Part 1


Advanced Analytical Theory and Methods: Classification

IT 532: Advanced Topics in Data Mining

Isra Al-Turaiki, PhD

What is Classification?

•   A fundamental learning method that appears in applications related to data mining.

•   In classification learning:

•   a classifier is presented with a set of examples (already classified);

•   the classifier learns to assign labels to unseen examples.

•   Basically, it learns how the attributes of these observations contribute to the classification.

Supervised vs. Unsupervised Learning

Classification vs. Numeric Prediction

Classification: A Two-Step Process

Classification: Learning Step

Classification: Classification Step

Example Step (1): Learning

Example Step (2): Classification

This Lecture

•   In this lecture we focus on two fundamental classification methods:

•   Decision trees

•   Naïve Bayes

Decision Tree Induction

Decision Tree: Example

(Figure: an example decision tree, with its root, branches, internal nodes, and leaf nodes labeled.)

Decision Tree

•   The depth of a node is the minimum number of steps required to reach the node from the root.

•   Leaf nodes are at the end of the last branches of the tree. They represent class labels, the outcome of all the prior decisions.

The General Algorithm

A greedy algorithm: a tree T is constructed from a training set S (a minimal code sketch follows the steps below).

•   First, all the training records in S are at the root.

•   Split S recursively based on a selected attribute.

•   IF all the records in S belong to some class C, or if S is sufficiently pure, then that node is considered a leaf node and assigned the label C.

•   ELSE continue splitting.
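A minimal Python sketch of this greedy procedure, assuming records are dictionaries of attribute values; the function names, the dictionary-based tree representation, and the purity threshold are illustrative choices, not part of the original algorithm statement.

```python
from collections import Counter

def choose_best_attribute(S, attributes):
    # Placeholder for the attribute-selection step discussed later
    # (e.g. by information gain); here it naively returns the first attribute.
    return attributes[0]

def build_tree(S, attributes, purity_threshold=0.95):
    """Greedily build a tree from a training set S, a list of (record, label) pairs."""
    labels = [label for _, label in S]
    majority_class, count = Counter(labels).most_common(1)[0]

    # IF all records belong to one class, S is sufficiently pure,
    # or no attributes remain: this node becomes a leaf labeled with the majority class.
    if count == len(S) or count / len(S) >= purity_threshold or not attributes:
        return {"label": majority_class}

    # ELSE continue splitting on a selected attribute (the greedy step).
    best = choose_best_attribute(S, attributes)
    children = {}
    for value in {record[best] for record, _ in S}:
        subset = [(r, y) for r, y in S if r[best] == value]
        remaining = [a for a in attributes if a != best]
        children[value] = build_tree(subset, remaining, purity_threshold)
    return {"attribute": best, "children": children}
```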

When to Stop?

•   Conditions for stopping the splitting:

•   All the records in partition D belong to the same class.

•   All the leaf nodes in T satisfy a minimum purity threshold.

•   There are no remaining attributes for further partitioning.

•   Any other stopping criterion is satisfied (such as reaching the maximum depth of the tree).

Purity

•   The purity of a node is defined as the probability of the corresponding class at that node. For example, a node containing 9 records of class A and 1 record of class B has purity 0.9 with respect to class A.

Attribute Selection

•   The first step in constructing a decision tree is to choose the most informative attribute.

•   Entropy measures the impurity of an attribute (or the disorder in a dataset).

•   Information gain measures the purity of an attribute.

Entropy

•   Given a class X and its labels x ∈ X, let P(x) be the probability of x.

•   $H_X$, the entropy of X, is defined as

$H_X = -\sum_{x \in X} P(x) \log_2 P(x)$
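As a concrete illustration, here is a small Python helper (an assumption of these notes, not code from the textbook) that estimates this entropy from a list of class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """H_X = -sum over x of P(x) * log2 P(x), estimated from a list of class labels."""
    n = len(labels)
    probabilities = [count / n for count in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probabilities)  # every p > 0 by construction
```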

Entropy

•   $H_X$ becomes 0 when every P(x) is 0 or 1.

•   Entropy is at its maximum when all the class labels are equally probable.

Entropy Example

•   The data set has 50% positive records and 50% negative records:

$H_X = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1$

Entropy Example

•   The data set has 20% positive records and 80% negative records:

$H_X = -0.2 \log_2(0.2) - 0.8 \log_2(0.8) = 0.722$

Entropy Example

•   The data set has 100% positive records and 0% negative records:

$H_X = -1 \log_2(1) - 0 \log_2(0) = 0$
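Using the entropy helper sketched earlier, the three worked examples above can be checked directly:

```python
print(entropy(["pos"] * 5 + ["neg"] * 5))   # 50% / 50%  -> 1.0
print(entropy(["pos"] * 2 + ["neg"] * 8))   # 20% / 80%  -> approx. 0.722
print(entropy(["pos"] * 10))                # 100% / 0%  -> 0.0
```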

Entropy

•   As the data become purer, the entropy value becomes smaller.

•   The base entropy is defined as the entropy of the output variable.

Conditional Entropy

•   We compute the conditional entropy of each attribute.

•   Given an attribute X with values x, and an outcome Y with values y,

•   the conditional entropy $H_{Y|X}$ is the entropy of Y remaining given X:

$H_{Y|X} = \sum_{x \in X} P(x) \left( -\sum_{y \in Y} P(y|x) \log_2 P(y|x) \right)$
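A minimal Python sketch of this computation, reusing the entropy helper from above; the function name and list-based interface are illustrative assumptions:

```python
from collections import defaultdict

def conditional_entropy(attribute_values, labels):
    """H(Y|X): the entropy of the labels within each group of records that
    shares the same attribute value, weighted by the probability P(x) of that value."""
    groups = defaultdict(list)
    for x, y in zip(attribute_values, labels):
        groups[x].append(y)
    n = len(labels)
    return sum(len(group) / n * entropy(group) for group in groups.values())
```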

Information Gain

Information gain is defined as the difference between the base entropy and the conditional entropy of the attribute:

$InfoGain = H_Y - H_{Y|X}$
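Putting the two helpers together gives the information gain of an attribute; the tiny subscription dataset below is purely hypothetical:

```python
# Hypothetical toy data: does a customer subscribe, given income level?
income     = ["high", "high", "low", "low", "low", "high"]
subscribed = ["yes",  "yes",  "no",  "no",  "yes", "yes"]

base_entropy = entropy(subscribed)                        # entropy of the output variable
cond_entropy = conditional_entropy(income, subscribed)    # remaining entropy given income
info_gain = base_entropy - cond_entropy
print(round(info_gain, 3))                                # approx. 0.459
```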

Information Gain: Example

Decision Tree Algorithms

Information Gain: Example

ID3 Algorithm

T = training set, P = output variable, A = attribute
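A sketch of the attribute-selection step that ID3 repeats at every node, again reusing the helpers above; the function name and the dictionary record format are assumptions for illustration:

```python
def id3_select_attribute(records, labels, attributes):
    """Return the attribute A of training set T with the highest
    information gain with respect to the output variable P."""
    base = entropy(labels)
    def info_gain(attribute):
        values = [record[attribute] for record in records]
        return base - conditional_entropy(values, labels)
    return max(attributes, key=info_gain)
```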

C4.5

•   Introduces many improvements over ID3

•   Can handle missing data

•   by considering only the records where the attribute is defined.

•   Both categorical and continuous attributes are supported by C4.5.

CART

•   Classification And Regression Trees

•   Can handle continuous attributes.

•   Uses the Gini diversity index, $Gini = 1 - \sum_i p_i^2$, as its impurity measure, where $p_i$ is the probability of class i at the node.
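For a quick hands-on illustration, scikit-learn's DecisionTreeClassifier builds a CART-style tree using the Gini index; the feature values below are invented for the example:

```python
from sklearn.tree import DecisionTreeClassifier

# Two continuous attributes (age, income); the values are made up for illustration.
X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [50, 90000], [30, 30000]]
y = ["no", "yes", "yes", "no", "yes", "no"]

cart = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
cart.fit(X, y)
print(cart.predict([[40, 70000]]))   # predicted class for a new record
```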

Evaluating a Decision Tree

•   Decision trees use greedy algorithms:

•   the attribute selected may not be the best overall, but it is guaranteed to be the best at that step;

•   a bad split is propagated through the rest of the tree.

Overfitting

•   In overfitting, the model fits the training set well, but it performs poorly on new samples in the testing set.

•   Too many branches: some may reflect anomalies due to noise or outliers, leading to poor accuracy for unseen samples.

•   Overfitting can be caused by either a lack of training data or bias in the training data.

Overfitting

•   Decision trees are greedy algorithms

–   Best option at each step, maybe not best overall

–   Addressed by ensemble methods such as random forests

•   Model might overfit the data

(Figure: blue = training set, red = test set.)

To overcome overfitting:

–   Stop growing the tree early (pre-pruning)

–   Grow the full tree, then prune it (post-pruning)

Tree Pruning
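Both strategies can be sketched with scikit-learn's CART-style trees (the parameter values below are arbitrary examples): pre-pruning stops tree growth early through limits such as max_depth or min_samples_leaf, while post-pruning grows the full tree and then removes branches via cost-complexity pruning (ccp_alpha).

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing the tree early by limiting its size.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Post-pruning: grow the full tree, then remove weak branches;
# a larger ccp_alpha prunes more of the tree.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)

# Either classifier is then fit and evaluated like any other, e.g.:
# pre_pruned.fit(X_train, y_train); pre_pruned.score(X_test, y_test)
```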

Decision Trees

•   Advantages of decision trees

–   Computationally inexpensive

–   Outputs are easy to interpret – sequence of tests

–   Show importance of each input variable

–   Decision trees handle:

•   Both numerical and categorical attributes

•   Categorical attributes with many distinct values

•   Variables with nonlinear effect on outcome

•   Variable interactions

Decision Trees

•   Disadvantages of decision trees

–   Sensitive to small variations in the training data

–   Overfitting can occur because each split reduces the training data available for subsequent splits

–   Perform poorly if the dataset contains many irrelevant variables

References

•   Chapter 7 of: Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services (Editor). ISBN: 978-1-118-87613-8

•   Charles Tappert slides
