Dr. Jaume Bacardit jqb@cs.nott.ac.uk
Topic 2: Data Preprocessing
Lecture 5: Discretisation methods
• Definition and taxonomy
• Static discretisation techniques
• The ADI representation: a dynamic, local, discretisation method
• Resources
• Discretisation:
– Process of converting a continuous variable into a discrete one. That is, partitioning it into a finite
(discrete) set of elements (intervals)
[Figure: a continuous domain partitioned into intervals I1, I2 and I3 by two cut-points]
– Every data point inside one interval will be treated equally
• How?
– A discretisation algorithm proposes a series of cut-points. These, plus the domain bounds, define the intervals (see the sketch after this list)
• Why?
– Many methods in mathematics and computer science cannot deal with continuous variables
– A discretisation process is required to be able to use them, despite the loss of information
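As a small illustration of the idea above (this Python sketch and its names are my own, not part of the lecture material), two cut-points plus the domain bounds define three intervals, and every value falling in the same interval is treated identically:

# Two cut-points plus the domain bounds define three intervals I1, I2, I3
from bisect import bisect_right

cut_points = [2.0, 5.0]  # illustrative values

def interval_index(x, cuts=cut_points):
    # 0-based index of the interval containing x; every x in the same
    # interval gets the same index, i.e. is treated identically afterwards
    return bisect_right(cuts, x)

print(interval_index(1.0), interval_index(3.5), interval_index(7.2))  # -> 0 1 2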
• (Liu et al, 2002)
• Supervised vs unsupervised
– Supervised methods use the class label of each instance to decide where to place the cut-points
• Dynamic vs Static
– Dynamic discretisation is performed at the same time as the learning process. Static discretisation is performed before learning
– Rule-based systems that generate arbitrary intervals can be considered a case of dynamic discretisation
• Global vs local
– Global methods apply the same discretisation criteria (cut-points) to all instances
– Local methods use different cut points for different groups of instances
– Again, rule-learning methods that generate arbitrary intervals can be considered a case of local discretisation
• Splitting vs Merging methods
– Splitting methods: Start with a single interval (no cut-point) and divide it
– Merging methods: Start with every possible interval (n-1 cut points) and merge some of them
[Figure: splitting vs merging of intervals]
• Unsupervised discretisation (equal-width binning)
• Given a domain with bounds d_l and d_u and a number of bins b
• It will generate b intervals of equal size (d_u - d_l)/b
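A minimal Python sketch of this equal-width scheme, assuming the attribute values come as a plain list (the function name is illustrative):

def equal_width_cutpoints(values, b):
    """Return the b-1 cut-points splitting [d_l, d_u] into b intervals of size (d_u - d_l)/b."""
    d_l, d_u = min(values), max(values)
    width = (d_u - d_l) / b
    return [d_l + i * width for i in range(1, b)]

print(equal_width_cutpoints([0.0, 2.0, 7.5, 10.0], 4))  # -> [2.5, 5.0, 7.5]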
• Unsupervised discretisation (equal-frequency binning)
• Given a domain with bounds d_l and d_u and a number of bins b
• It will generate b intervals, each of them containing the same number of values
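A minimal Python sketch of equal-frequency binning under the same assumptions as the previous snippet (not the KEEL implementation; tie handling at interval borders is simplified):

def equal_frequency_cutpoints(values, b):
    """Return b-1 cut-points such that each of the b intervals holds (roughly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, b):
        idx = i * n // b
        # place each cut-point half-way between the two neighbouring sorted values
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts

print(equal_frequency_cutpoints([5, 1, 9, 3, 7, 2, 8, 4], 4))  # -> [2.5, 4.5, 7.5]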
• Supervised, splitting method
• Inspired by the ID3 decision tree induction algorithm
• It chooses the cut-points that create intervals with minimal entropy, that is, maximising the information gain
1. Start with a single interval
2. Identify the cut-point that creates two intervals with minimal entropy (S = original interval; S1, S2 = candidate splits)
3. Split the interval using the best cut-point
4. Recursively apply the method to S1 and S2
5. Stop criterion: all instances in an interval have the same class
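The sketch below shows step 2 in Python: evaluating every candidate cut-point and keeping the one whose partition has minimal class entropy. The recursion and the stop criterion are omitted, and the function names are my own:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cutpoint(values, labels):
    """Return (cut-point, weighted entropy) of the best binary split of the interval."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        s1 = [l for v, l in pairs if v <= cut]   # candidate sub-interval S1
        s2 = [l for v, l in pairs if v > cut]    # candidate sub-interval S2
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / n
        if best is None or e < best[1]:
            best = (cut, e)
    return best

print(best_cutpoint([1.0, 2.0, 3.0, 4.0], ['a', 'a', 'b', 'b']))  # best cut at 2.5, partition entropy 0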
• Refinement of ID3 to make it more conservative
– ID3 generates lots of intervals because the stop criterion is very loose
• In this method, in order to split an interval, the difference between Entropy(S) and EntropyPartition(S, S1, S2) needs to be large enough
• Stop criterion based on the Minimum Description Length (MDL) principle (Rissanen, 78)
• MDL is a modern reformulation of the classic Occam's Razor principle: "If you have two equally good explanations, always choose the simplest one"
• MDL also comes from the information theory field, dealing with information transmission
[Diagram: a Sender, who holds the instances and their classes, transmits to a Receiver, who holds only the instances]
How do we send the class of each instance?
1) Sending the classes
2) Generating a theory and sending it plus its exceptions (theory description + exceptions)
• The theory that minimises the sum of its own size and the size of its exceptions is the best one
• Stop partitioning if this inequality is true (gain of discretising < cost of discretising):
Entropy(S) - EntropyPartition(S, S1, S2) < log2(N - 1)/N + Δ/N
where Δ = log2(3^k - 2) - [k·Entropy(S) - k1·Entropy(S1) - k2·Entropy(S2)]
– N = number of instances in the original interval, k = number of classes, k1, k2 = number of classes represented in sub-partitions 1 and 2
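A sketch of this stop test in Python, following the usual statement of the MDL criterion given above; intervals are represented simply as lists of class labels, and the helper names are my own:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def mdl_stop(S, S1, S2):
    """Return True if splitting interval S into S1 and S2 should be rejected (stop partitioning)."""
    N = len(S)
    k, k1, k2 = len(set(S)), len(set(S1)), len(set(S2))
    gain = entropy(S) - (len(S1) * entropy(S1) + len(S2) * entropy(S2)) / N
    delta = log2(3 ** k - 2) - (k * entropy(S) - k1 * entropy(S1) - k2 * entropy(S2))
    return gain < log2(N - 1) / N + delta / N   # gain of discretising < cost of discretising

# A clean split into pure sub-intervals easily pays for itself -> keep partitioning
print(mdl_stop(['a'] * 5 + ['b'] * 5, ['a'] * 5, ['b'] * 5))  # -> False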
• Supervised merging algorithm
• It defines the quality of an interval by a measure called goodness
Goodness(I) = maxC(I) / (1 + errors(I))
– maxC(I) = number of examples belonging to the majority class in I
– errors(I) = number of examples not belonging to the majority class in I
• Discretisation process:
1. Starts with every possible cut-point in the domain
2. Identifies candidate pairs of intervals to join (I_i, I_i+1) if both conditions are true:
   1. I_i and I_i+1 have the same majority class, or there is a tie in I_i or I_i+1
   2. goodness(I_i + I_i+1) > [goodness(I_i) + goodness(I_i+1)]/2
3. Merges the candidate pair with highest goodness
4. Repeats steps 2-4 until no more intervals are merged
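A sketch in Python of the goodness measure and of the two merge conditions listed above; each interval is represented as a list of class labels, and the function names are mine:

from collections import Counter

def goodness(interval):
    max_c = Counter(interval).most_common(1)[0][1]   # examples of the majority class
    errors = len(interval) - max_c                   # examples of the other classes
    return max_c / (1 + errors)

def majority_and_tie(interval):
    counts = Counter(interval).most_common()
    tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], tie

def should_merge(left, right):
    """Check the two conditions for (I_i, I_i+1) to be a candidate pair."""
    maj_l, tie_l = majority_and_tie(left)
    maj_r, tie_r = majority_and_tie(right)
    cond1 = (maj_l == maj_r) or tie_l or tie_r
    cond2 = goodness(left + right) > (goodness(left) + goodness(right)) / 2
    return cond1 and cond2

print(should_merge(['a', 'a', 'b'], ['a', 'b']))  # -> True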
• Supervised, merging method
• Uses the χ² (chi-square) statistical test to decide whether or not to merge intervals
• This test checks whether a discrete random variable follows a certain distribution
– E.g. testing whether a die is fair (all 6 outcomes are equally likely to occur)
• Two intervals are merged if the null hypothesis (both have the same distribution of classes) cannot be rejected
• χ² formula:
χ² = Σ(i=1..2) Σ(j=1..p) (A_ij - E_ij)² / E_ij, with expected frequency E_ij = R_i·C_j / N
– A p-value can be computed from χ² (with p - 1 degrees of freedom). A predefined confidence level is used to decide whether to reject the null hypothesis
– A_ij = number of examples in interval i from class j
– p = number of classes
– R_i = number of examples in interval i
– C_j = number of examples from class j
– N = total number of examples
• It iteratively merges intervals until the statistical test rejects the merge for every possible pair of consecutive intervals
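A sketch of the χ² computation for one pair of adjacent intervals in Python, using the definitions above; the data layout (one per-class count vector per interval) and the function name are my own:

def chi2_pair(counts_a, counts_b):
    """counts_a, counts_b: per-class example counts (A_ij) of two adjacent intervals, classes in the same order."""
    R = [sum(counts_a), sum(counts_b)]                 # R_i: examples per interval
    C = [x + y for x, y in zip(counts_a, counts_b)]    # C_j: examples per class
    N = sum(R)                                         # total number of examples
    chi2 = 0.0
    for i, row in enumerate((counts_a, counts_b)):
        for j, a_ij in enumerate(row):
            e_ij = R[i] * C[j] / N                     # expected frequency E_ij
            if e_ij > 0:
                chi2 += (a_ij - e_ij) ** 2 / e_ij
    return chi2

# Very different class distributions give a large statistic, so the pair would not be merged
print(chi2_pair([4, 0], [1, 3]))  # -> about 4.8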
(Bacardit, 2004)
• Good survey on discretisation methods with empirical validation using C4.5
• An implementation of the methods described in this lecture is available in the KEEL package
– List of the 27 discretisation algorithms (with references) in KEEL