Dr. Jaume Bacardit jqb@cs.nott.ac.uk
Topic 2: Data Preprocessing
Lecture 5: Discretisation methods
• Definition and taxonomy
• Static discretisation techniques
• The ADI representation: a dynamic, local, discretisation method
• Resources
• Discretisation:
– Process of converting a continuous variable into a discrete one. That is, partitioning it into a finite
(discrete) set of elements (intervals)
[Figure: a continuous domain partitioned into intervals I1, I2 and I3 by two cut-points]
– Every data point inside one interval will be treated equally
• How?
– A discretisation algorithm proposes a series of cut-points. These, plus the domain bounds, define the intervals (see the sketch after this list)
• Why?
– Many methods in mathematics and computer science cannot deal with continuous variables
– A discretisation process is required to be able to use them, despite the loss of information
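As a small illustration of the idea above (this Python sketch and its names are my own, not part of the lecture material), two cut-points plus the domain bounds define three intervals, and every value falling in the same interval is treated identically:

# Two cut-points plus the domain bounds define three intervals I1, I2, I3
from bisect import bisect_right

cut_points = [2.0, 5.0]  # illustrative values

def interval_index(x, cuts=cut_points):
    # 0-based index of the interval containing x; every x in the same
    # interval gets the same index, i.e. is treated identically afterwards
    return bisect_right(cuts, x)

print(interval_index(1.0), interval_index(3.5), interval_index(7.2))  # -> 0 1 2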
• (Liu et al, 2002)
• Supervised vs unsupervised
– Supervised methods use the class label of each instance to decide where to place the cut-points
• Dynamic vs Static
– Dynamic discretisation is performed at the same time as the learning process. Static discretisation is performed before learning
– Rule-based systems that generate arbitrary intervals can be considered a case of dynamic discretisation
• Global vs local
– Global methods apply the same discretisation criteria (cut-points) to all instances
– Local methods use different cut points for different groups of instances
– Again, rule-learning methods that generate arbitrary intervals can be considered a case of local discretisation
• Splitting vs Merging methods
– Splitting methods: Start with a single interval (no cut-point) and divide it
– Merging methods: Start with every possible interval (n-1 cut points) and merge some of them
[Figure: splitting vs merging of intervals]
• Unsupervised discretisation (equal-width binning)
• Given a domain with bounds d_l and d_u and a number of bins b
• It will generate b intervals of equal size (d_u - d_l)/b
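A minimal Python sketch of this equal-width scheme, assuming the attribute values come as a plain list (the function name is illustrative):

def equal_width_cutpoints(values, b):
    """Return the b-1 cut-points splitting [d_l, d_u] into b intervals of size (d_u - d_l)/b."""
    d_l, d_u = min(values), max(values)
    width = (d_u - d_l) / b
    return [d_l + i * width for i in range(1, b)]

print(equal_width_cutpoints([0.0, 2.0, 7.5, 10.0], 4))  # -> [2.5, 5.0, 7.5]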
• Unsupervised discretisation (equal-frequency binning)
• Given a domain with bounds d_l and d_u and a number of bins b
• It will generate b intervals, each of them containing the same number of values
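A minimal Python sketch of equal-frequency binning under the same assumptions as the previous snippet (not the KEEL implementation; tie handling at interval borders is simplified):

def equal_frequency_cutpoints(values, b):
    """Return b-1 cut-points such that each of the b intervals holds (roughly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, b):
        idx = i * n // b
        # place each cut-point half-way between the two neighbouring sorted values
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts

print(equal_frequency_cutpoints([5, 1, 9, 3, 7, 2, 8, 4], 4))  # -> [2.5, 4.5, 7.5]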
• Supervised, splitting method
• Inspired by the ID3 decision tree induction algorithm
• It chooses the cut-points that create intervals with minimal entropy, that is, maximising the information gain
1. Start with a single interval
2. Identify the cut-point that creates two intervals with minimal entropy (S = original interval; S1, S2 = candidate splits)
3. Split the interval using the best cut-point
4. Recursively apply the method to S1 and S2
5. Stop criterion: all instances in an interval have the same class
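The sketch below shows step 2 in Python: evaluating every candidate cut-point and keeping the one whose partition has minimal class entropy. The recursion and the stop criterion are omitted, and the function names are my own:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cutpoint(values, labels):
    """Return (cut-point, weighted entropy) of the best binary split of the interval."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        s1 = [l for v, l in pairs if v <= cut]   # candidate sub-interval S1
        s2 = [l for v, l in pairs if v > cut]    # candidate sub-interval S2
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / n
        if best is None or e < best[1]:
            best = (cut, e)
    return best

print(best_cutpoint([1.0, 2.0, 3.0, 4.0], ['a', 'a', 'b', 'b']))  # best cut at 2.5, partition entropy 0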
• Refinement of ID3 to make it more conservative
– ID3 generates lots of intervals because the stop criterion is very loose
• In this method, in order to split an interval, the difference between Entropy(S) and EntropyPartition(S, S1, S2) needs to be large enough
• Stop criterion based on the Minimum Description Length (MDL) principle (Rissanen, 78)
• MDL is a modern reformulation of the classic Occam's Razor principle: "If you have two equally good explanations, always choose the simplest one"
• MDL also comes from the information theory field, dealing with information transmission
[Diagram: a Sender, who holds the instances and their classes, transmits to a Receiver, who holds only the instances]
How do we send the class of each instance?
1) Sending the classes
2) Generating a theory and sending it plus its exceptions (theory description + exceptions)
• The theory that minimises the sum of its own size and the size of its exceptions is the best one
• Stop partitioning if this inequality is true (gain of discretising < cost of discretising):
Entropy(S) - EntropyPartition(S, S1, S2) < log2(N - 1)/N + Δ/N
where Δ = log2(3^k - 2) - [k·Entropy(S) - k1·Entropy(S1) - k2·Entropy(S2)]
– N = number of instances in the original interval, k = number of classes, k1, k2 = number of classes represented in sub-partitions 1 and 2
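A sketch of this stop test in Python, following the usual statement of the MDL criterion given above; intervals are represented simply as lists of class labels, and the helper names are my own:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mdl_stop(S, S1, S2):
    """Return True if splitting interval S into S1 and S2 should be rejected (stop partitioning)."""
    N = len(S)
    k, k1, k2 = len(set(S)), len(set(S1)), len(set(S2))
    gain = entropy(S) - (len(S1) * entropy(S1) + len(S2) * entropy(S2)) / N
    delta = log2(3 ** k - 2) - (k * entropy(S) - k1 * entropy(S1) - k2 * entropy(S2))
    return gain < log2(N - 1) / N + delta / N   # gain of discretising < cost of discretising

# A clean split into pure sub-intervals easily pays for itself -> keep partitioning
print(mdl_stop(['a'] * 5 + ['b'] * 5, ['a'] * 5, ['b'] * 5))  # -> False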
• Supervised merging algorithm
• It defines the quality of an interval by a measure called goodness
Goodness(I) = maxC(I) / (1 + errors(I))
– maxC(I) = number of examples belonging to the majority class in I
– errors(I) = number of examples not belonging to the majority class in I
• Discretisation process:
1. Starts with every possible cut-point in the domain
2. Identifies candidate pairs of intervals to join (I_i, I_i+1) if both conditions are true:
   1. I_i and I_i+1 have the same majority class, or there is a tie in I_i or I_i+1
   2. goodness(I_i + I_i+1) > [goodness(I_i) + goodness(I_i+1)]/2
3. Merges the candidate pair with highest goodness
4. Repeats steps 2-4 until no more intervals are merged
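A sketch in Python of the goodness measure and of the two merge conditions listed above; each interval is represented as a list of class labels, and the function names are mine:

from collections import Counter

def goodness(interval):
    max_c = Counter(interval).most_common(1)[0][1]   # examples of the majority class
    errors = len(interval) - max_c                   # examples of the other classes
    return max_c / (1 + errors)

def majority_and_tie(interval):
    counts = Counter(interval).most_common()
    tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], tie

def should_merge(left, right):
    """Check the two conditions for (I_i, I_i+1) to be a candidate pair."""
    maj_l, tie_l = majority_and_tie(left)
    maj_r, tie_r = majority_and_tie(right)
    cond1 = (maj_l == maj_r) or tie_l or tie_r
    cond2 = goodness(left + right) > (goodness(left) + goodness(right)) / 2
    return cond1 and cond2

print(should_merge(['a', 'a', 'b'], ['a', 'b']))  # -> True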
• Supervised, merging method
• Uses the χ² (chi-square) statistical test to decide whether or not to merge intervals
• This test checks whether a discrete random variable follows a certain distribution
– E.g. testing whether a die is fair (all 6 outcomes are equally likely to occur)
• Two intervals are merged if the null hypothesis (both have the same distribution of classes) cannot be rejected
• χ² formula:
χ² = Σ(i=1..2) Σ(j=1..p) (A_ij - E_ij)² / E_ij, with expected frequency E_ij = R_i·C_j / N
– A p-value can be computed from χ² (with p - 1 degrees of freedom). A predefined confidence level is used to decide whether to reject the null hypothesis
– A_ij = number of examples in interval i from class j
– p = number of classes
– R_i = number of examples in interval i
– C_j = number of examples from class j
– N = total number of examples
• It iteratively merges intervals until the statistical test rejects the merge for every possible pair of consecutive intervals
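A sketch of the χ² computation for one pair of adjacent intervals in Python, using the definitions above; the data layout (one per-class count vector per interval) and the function name are my own:

def chi2_pair(counts_a, counts_b):
    """counts_a, counts_b: per-class example counts (A_ij) of two adjacent intervals, classes in the same order."""
    R = [sum(counts_a), sum(counts_b)]                 # R_i: examples per interval
    C = [x + y for x, y in zip(counts_a, counts_b)]    # C_j: examples per class
    N = sum(R)                                         # total number of examples
    chi2 = 0.0
    for i, row in enumerate((counts_a, counts_b)):
        for j, a_ij in enumerate(row):
            e_ij = R[i] * C[j] / N                     # expected frequency E_ij
            if e_ij > 0:
                chi2 += (a_ij - e_ij) ** 2 / e_ij
    return chi2

# Very different class distributions give a large statistic, so the pair would not be merged
print(chi2_pair([4, 0], [1, 3]))  # -> about 4.8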
(Bacardit, 2004)
• Good survey on discretisation methods with empirical validation using C4.5
• An implementation of the methods described in this lecture is available in the KEEL package
– List of the 27 discretisation algorithms (with references) in KEEL