
G54DMT – Data Mining Techniques and Applications

http://www.cs.nott.ac.uk/~jqb/G54DMT

Dr. Jaume Bacardit jqb@cs.nott.ac.uk

Topic 2: Data Preprocessing

Lecture 5: Discretisation methods

Outline of the lecture

• Definition and taxonomy

• Static discretisation techniques

• The ADI representation: a dynamic, local, discretisation method

• Resources

Definition

• Discretisation:

– Process of converting a continuous variable into a discrete one. That is, partitioning it into a finite (discrete) set of elements (intervals)

[Figure: a continuous domain partitioned into intervals I1, I2, I3]

– Every data point inside one interval will be treated equally

Definition

• How?

– A discretisation algorithm proposes a series of cut-points. These, plus the domain bounds, define the intervals

• Why?

– Many methods in mathematics and computer science cannot deal with continuous variables

– A discretisation process is required to be able to use them, despite the loss of information

Taxonomy

• (Liu et al, 2002)

• Supervised vs Non supervised

– Supervised methods use the class label of each instance to decide

• Dynamic vs Static

– Dynamic discretisation is performed at the same time as the learning process. Static discretisation is performed before learning

– Rule-based systems that generate arbitrary intervals can be considered a case of dynamic discretisation

Taxonomy

• Global vs local

– Global methods apply the same discretisation criteria (cut-points) to all instances

– Local methods use different cut points for different groups of instances

– Again, rule-learning methods that generate arbitrary intervals can be considered a case of local discretisation

Taxonomy

• Splitting vs Merging methods

– Splitting methods: Start with a single interval (no cut-point) and divide it

– Merging methods: Start with every possible interval (n-1 cut points) and merge some of them

[Figure: splitting vs merging of intervals]

Equal-length discretisation

• Unsupervised discretisation method

• Given a domain with bounds d_l and d_u and a number of bins b

• It will generate b intervals of equal size (d_u - d_l)/b
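
As a hedged illustration (not part of the original slides), a minimal NumPy sketch of equal-length binning; the function and variable names are made up for the example.

import numpy as np

def equal_width_cutpoints(values, b):
    # Equal-length discretisation: split [d_l, d_u] into b bins of equal width
    d_l, d_u = np.min(values), np.max(values)
    width = (d_u - d_l) / b
    # b - 1 interior cut-points; the domain bounds close the first and last interval
    return [d_l + i * width for i in range(1, b)]

values = np.array([0.3, 1.2, 2.8, 3.1, 4.9, 5.0])
print(equal_width_cutpoints(values, 3))  # two cut-points splitting [0.3, 5.0] into three equal bins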

Equal-frequency discretisation

• Unsupervised discretisation method

• Given a domain with bounds d_l and d_u and a number of bins b

• It will generate b intervals, each of them containing the same number of values
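
A similar hedged sketch (illustrative names only): the interior cut-points are the 1/b, 2/b, ..., (b-1)/b quantiles of the observed values, so each interval holds roughly the same number of values (exact equality is impossible when there are ties).

import numpy as np

def equal_frequency_cutpoints(values, b):
    # Equal-frequency discretisation: b intervals holding (roughly) the same number of values
    return np.quantile(values, [i / b for i in range(1, b)])

values = np.array([1, 2, 2, 3, 7, 8, 9, 20, 21, 40, 41, 42])
print(equal_frequency_cutpoints(values, 3))  # each of the 3 intervals contains ~4 of the 12 values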

ID3 discretisation (Quinlan, 86)

• Supervised, splitting method

• Inspired by the ID3 decision tree induction algorithm

• It chooses the cut-points that create intervals with minimal entropy, that is, maximising the Information Gain

ID3 splitting procedure

1. Start with a single interval

2. Identify the cut-point that creates two intervals with minimal entropy (S = original interval; S1, S2 = candidate splits)

3. Split the interval using the best cut-point

4. Recursively apply the method to S1 and S2

5. Stop criterion: all instances in an interval have the same class

(A sketch of this procedure is given below.)
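
A hedged Python sketch of the recursive splitting, not the original ID3 code; names are illustrative. Each call picks the cut-point whose two subintervals have minimal weighted entropy, then recurses until every interval is pure.

import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy of the class distribution in an interval
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def id3_cutpoints(x, y):
    # x: attribute values, y: class labels; returns the list of chosen cut-points
    order = np.argsort(x)
    x, y = np.asarray(x, dtype=float)[order], np.asarray(y)[order]
    cuts = []

    def split(lo, hi):
        if len(set(y[lo:hi])) <= 1:      # stop criterion: all instances share a class
            return
        best = None
        for i in range(lo + 1, hi):      # candidate cut between positions i-1 and i
            if x[i] == x[i - 1]:
                continue
            n1, n2 = i - lo, hi - i
            e = (n1 * entropy(y[lo:i]) + n2 * entropy(y[i:hi])) / (hi - lo)
            if best is None or e < best[0]:
                best = (e, i)
        if best is None:
            return
        i = best[1]
        cuts.append((x[i - 1] + x[i]) / 2.0)
        split(lo, i)                     # recursively apply the method to S1 and S2
        split(i, hi)

    split(0, len(x))
    return sorted(cuts)

print(id3_cutpoints([0.1, 0.4, 0.9, 1.3, 1.8, 2.4], ['a', 'a', 'b', 'b', 'a', 'a']))  # e.g. [0.65, 1.55]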

(Fayyad & Irani, 93) discretisation

• Refinement of ID3 to make it more conservative

– ID3 generates lots of intervals because the stop criterion is very loose

• In this method, in order to split an interval, the difference between Entropy(S) and EntropyPartition(S, S1, S2) needs to be large enough

• Stop criterion based on the Minimum Description Length (MDL) principle (Rissanen, 78)

• MDL is a modern reformulation of the classic Occam's Razor principle: "If you have two equally good explanations, always choose the simplest one"

Minimum Description Length

• MDL also comes from the information theory field, dealing with information transmission

[Figure: a sender must transmit the classes of a set of instances to a receiver that already holds the instances]

How do we send the class of each instance?

1) Sending the classes directly

2) Generating a theory and sending its description plus its exceptions

• The theory that minimises its own size plus the size of its exceptions will be the best

MDL for discretisation

• Stop partitioning if the gain of discretising is smaller than the cost of discretising (the inequality is reconstructed below)

– N = number of instances in the original interval

– k = number of classes

– k1, k2 = number of classes represented in subpartitions 1 and 2
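
A hedged reconstruction of the inequality, following the formulation in Fayyad & Irani (1993). Splitting of interval S (into S1 and S2 by cut point T on attribute A) stops when

$$\mathrm{Gain}(A,T;S) < \frac{\log_2(N-1)}{N} + \frac{\Delta(A,T;S)}{N}$$

where

$$\mathrm{Gain}(A,T;S) = \mathrm{Ent}(S) - \frac{|S_1|}{N}\,\mathrm{Ent}(S_1) - \frac{|S_2|}{N}\,\mathrm{Ent}(S_2)$$

$$\Delta(A,T;S) = \log_2(3^k - 2) - \left[\,k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)\,\right]$$

The left-hand side is the gain of discretising; the right-hand side is the cost of discretising.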

Unparametrized Supervised Discretizer (Giraldez et al., 02)

• Supervised merging algorithm

• It defines the quality of an interval by a measure called goodness

Goodness(I) = maxC(I) / (1 + errors(I))

– maxC(I) = number of examples belonging to the majority class in I

– Errors(I) = number of examples not belonging to the majority class in I

Unparametrized Supervised Discretizer (Giraldez et al., 02)

• Discretisation process

1. Starts with every possible cut-point in the domain

2. Identifies candidate pairs of intervals to join (Ii, Ii+1) if both conditions are true:

1. Ii and Ii+1 have the same majority class, or there is a tie in Ii or Ii+1

2. goodness(Ii + Ii+1) > [goodness(Ii) + goodness(Ii+1)] / 2

3. Merges the candidate pair with highest goodness

4. Repeats steps 2-3 until no more intervals are merged

(A sketch of this merging loop is given below.)
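
A hedged Python sketch of the merging loop with the goodness measure above; helper names are illustrative, not the authors' implementation. Each interval is represented simply by the list of class labels of the examples it contains.

from collections import Counter

def goodness(labels):
    # goodness(I) = maxC(I) / (1 + errors(I))
    counts = Counter(labels)
    max_c = max(counts.values())
    errors = sum(counts.values()) - max_c
    return max_c / (1.0 + errors)

def has_tie(labels):
    # True if two or more classes tie for the majority in the interval
    ranked = Counter(labels).most_common()
    return len(ranked) > 1 and ranked[0][1] == ranked[1][1]

def usd_merge(intervals):
    # intervals: list of label lists, one per initial (finest) interval
    while True:
        best = None
        for i in range(len(intervals) - 1):
            a, b = intervals[i], intervals[i + 1]
            same_majority = Counter(a).most_common(1)[0][0] == Counter(b).most_common(1)[0][0]
            joint = goodness(a + b)
            # condition 1: same majority class, or a tie in either interval
            # condition 2: joint goodness beats the average of the separate goodnesses
            if (same_majority or has_tie(a) or has_tie(b)) and joint > (goodness(a) + goodness(b)) / 2.0:
                if best is None or joint > best[0]:
                    best = (joint, i)
        if best is None:
            return intervals
        i = best[1]
        intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]  # merge the best candidate pair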

ChiMerge (Kerber, 92)

• Supervised merging discretisation method

• Uses the χ² statistical test to decide whether to merge intervals or not

• This test checks whether a discrete random variable follows a certain distribution

– E.g. testing whether a die is fair (all 6 outcomes are equally likely to occur)

• Two intervals are merged if the null hypothesis (same distribution of classes in both intervals) cannot be rejected

ChiMerge

• χ² formula (reconstructed below)

– A p-value can be computed from the χ² statistic. A predefined confidence level is used as the threshold for rejecting the null hypothesis

– Aij = number of examples in interval i from class j

– p = number of classes

– Ri = number of examples in interval i

– Cj = number of examples from class j

– N = total number of examples

• It iteratively merges intervals until the null hypothesis is rejected for every possible pair of consecutive intervals
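
A hedged reconstruction of the statistic as given in Kerber (1992), computed over a pair of adjacent intervals:

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{p} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{N}$$

The outer sum runs over the two adjacent intervals and the inner sum over the p classes; E_ij is the expected count of class j in interval i under the null hypothesis that class membership is independent of the interval.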

Differences between discretisers (Bacardit, 2004)

Resources

• Good survey on discretisation methods with empirical validation using C4.5

• Implementation of the methods described in this lecture is available in the KEEL package

– List of the 27 discretisation algorithms (with references) in KEEL

Questions?
