AN EMPIRICAL INVESTIGATION OF THE IMPACT OF
DISCRETIZATION ON COMMON DATA DISTRIBUTIONS
Michael Kumaran Ismail
Master of Technology
(Information Technology)
2003
RMIT University
An Empirical Investigation of the Impact of Discretization on Common
Data Distributions
by
Michael K. Ismail
A dissertation submitted in partial fulfillment of the requirements for the
degree of Master of Technology (Information Technology)
Department of Computer Science
RMIT University
Melbourne, VIC 3001, AUSTRALIA
Abstract
This study attempts to identify the merits of six of the most popular discretization
methods when confronted with a randomly generated dataset consisting of attributes
that conform to one of eight common statistical distributions. It is hoped that the
analysis will point towards a heuristic that identifies the most appropriate
discretization method to apply, given some preliminary analysis or visualization to
determine the type of statistical distribution of the attribute to be discretized. Further,
the comparative effectiveness of discretization given each data distribution is a primary
focus. Analysis of the data was accomplished by inducing a decision tree classifier
(C4.5) on the discretized data and an error measure was used to determine the relative
value of discretization. The experiments showed that the method of discretization and
the level of inherent error placed in the class attribute have a major impact on
classification errors generated post-discretization. More importantly, the general
effectiveness of discretization varies significantly depending on the shape of data
distribution considered. Distributions that are highly skewed or have high peaks tend to
result in higher classification errors, and the relative superiority of supervised
discretization over unsupervised discretization is diminished significantly when applied
to these data distributions.
Declaration
I certify that all work on this dissertation was carried out between March 2003 and June 2003 and it was not
submitted for any academic award at any other college, institute or university. The work presented was carried
out under the supervision of Dr Vic Ciesielski who proposed the idea of creating random data and applying an
inherent error in the construction of class labels. All work in the dissertation is my own except where
acknowledged in the text.
Signed,
Michael K. Ismail,
25th of June, 2003
TABLE OF CONTENTS
1. Introduction
2. Data Generation Methodology
3. Data Distributions
   3.1 Normal Distribution
   3.2 Uniform Distribution
   3.3 Leptokurtic Distribution
   3.4 Platykurtic Distribution
   3.5 Bimodal Distribution
   3.6 Skewed Distribution
   3.7 Exponential Distribution
   3.8 Zipf Distribution
4. Discretization Techniques
   4.1 Unsupervised Discretization
       4.1.1 Equal Interval Width/Uniform Binning
       4.1.2 Equal Frequency/Histogram Binning
   4.2 Supervised Discretization
       4.2.1 Holte's 1R Discretizer
       4.2.2 C4.5 Discretizer
       4.2.3 Fayyad and Irani's Entropy Based MDL Method
       4.2.4 Kononenko's Entropy Based MDL Method
5. Running Algorithms in WEKA
6. Analysis of Errors
   6.1 Normal Distribution Errors
   6.2 Uniform Distribution Errors
   6.3 Leptokurtic Distribution Errors
   6.4 Platykurtic Distribution Errors
   6.5 Bimodal Distribution Errors
   6.6 Skewed Distribution Errors
   6.7 Exponential Distribution Errors
   6.8 Zipf Distribution Errors
7. Results
8. Conclusion
9. References
10. Appendix
TABLE OF FIGURES
Figure 1. 0% error overlap dataset with 10 instances.
Figure 2. 20% error overlap dataset with 10 instances.
Figure 3. 5% class error overlap (class 1 / class 2) with 10,000 instances.
Figure 4. Built-in classification error rates.
Figure 5. Histogram of normally distributed attribute generated (normal).
Figure 6. Histogram of uniformly distributed attribute generated (uniform).
Figure 7. Histogram of a leptokurtically distributed attribute generated.
Figure 8. Histogram of a platykurtically distributed attribute generated.
Figure 9. Histogram of a bimodally distributed attribute generated with moving average trendline.
Figure 10. Histogram of a positively skewed attribute generated with moving average trendline.
Figure 11. Histogram of an exponential distribution generated with moving average trendline.
Figure 12. Histogram of an approximated Zipf distribution.
Figure 13. Equal width binning with number of bins set to 4.
Figure 14. Equal frequency binning with number of bins set to 4.
Figure 15. Holte's 1R bin partitioning with minimum bucket size of 6.
Figure 16. C4.5 decision tree growth of continuous variable - attribute bimodal. Class labels and number of errors contained in leaves and thresholds listed beside child pointers.
Figure 17. Entropy based multiway splitting of intervals for attribute bimodal. Class labels and number of errors contained in leaves and thresholds listed beside child pointers.
Figure 18. Layout of dataset generated containing 8 attributes and 15 classes.
Figure 19. Fayyad & Irani MDL method with C4.5 classifier run information for uniform distribution 3 class problem (small, medium, large) and 10% error overlap.
Figure 20. Classification error rate for 2-class label experiment on normally distributed data.
Figure 21. 3 class label errors (normal).
Figure 22. 5 class label errors (normal).
Figure 23. Classification error rate for 2-class label experiment on uniformly distributed data.
Figure 24. 3 class label errors (uniform).
Figure 25. 5 class label errors (uniform).
Figure 26. Classification error rate for 2-class label experiment on leptokurtically distributed data.
Figure 27. 3 class label errors (leptokurtic).
Figure 28. 5 class label errors (leptokurtic).
Figure 29. Classification error rate for 2-class label experiment on platykurtically distributed data.
Figure 30. 3 class label errors (platykurtic).
Figure 31. 5 class label errors (platykurtic).
Figure 32. Classification error rate for a 2-class label experiment on bimodally distributed data.
Figure 33. 3 class label errors (bimodal).
Figure 34. 5 class label errors (bimodal).
Figure 35. Classification error rate for 2-class label experiment on positively skewed data.
Figure 36. 3 class label errors (skewed).
Figure 37. 5 class label errors (skewed).
Figure 38. Classification error rate for 2-class label experiment on exponentially distributed data.
Figure 39. 3 class label errors (exponential).
Figure 40. 5 class label errors (exponential).
Figure 41. Classification error rate for 2-class label experiment on approximated Zipf distribution.
Figure 42. 3 class label errors (Zipf).
Figure 43. 5 class label errors (Zipf).
Figure 44. Unsupervised (Equal-width) vs. Supervised (Fayyad & Irani) average errors (Y-axis) for Exponential and Uniform distributions plotted against number of class labels (X-axis) and error overlap (Z-axis).
Figure 45. Average root relative squared error for all distributions and discretization methods.
Figure 46. Error rate for number of bins manually chosen for equal-width binning strategy for normal distribution of 2-class problem with 35% error overlap.
Figure 47. Kurtosis vs. average error rate: r2 = 0.45.
Figure 48. Skewness vs. average error rate: r2 = 0.22.
1. Introduction
Discretization of continuous attributes not only broadens the scope of data mining algorithms able
to analyze data in discrete form, but also dramatically increases the speed at which these tasks can be performed
[1]. There have been many studies that have evaluated the relative effectiveness of the many discretization
techniques along with suggested optimizations [4,6,7]. However, these experiments have usually been carried
out with application to real-world datasets where no attempt has been made to identify the type of probability
density function of the continuous attribute.
The effectiveness of various discretization methods as applied to a variety of data will be evaluated. All data has
been artificially constructed and conform to common statistical distributions. Discretization and data mining
operations will be performed using the WEKA data mining software [11]. The comparative responsiveness of
each distribution to discretization is the primary focus of the experiments and will be achieved by the analysis of
classification errors resulting from discretization. Thus, it is hoped that by data visualization alone we can make
an inference, a priori, as to the effectiveness of any subsequent discretization and, if applicable, isolate the most
suitable discretization method to adopt. Eight statistical distributions have been chosen for analysis,
along with six discretization methods.
2. Data Generation Methodology
Data was generated using the random number generation facility in Microsoft Excel with the data analysis tool.
Each attribute was formed with 10,000 instances as this provided accurate measures in terms of achieving the
desired symmetry in distributions. Each class attribute was formed with an inbuilt error or class overlap (1%,
5%, 10%, 20%, 35%).
For instance, if we have a numeric attribute (personal assets) that contains 10,000 instances ranging from
$50,000,000 to $50, for a 2-class label problem (sad, happy) we find the median value (the 5,000th value, e.g. $50,000). If
we say that everyone below the median value (poor) is sad and everyone above the median value (rich) is happy,
then anyone who is poor but happy or rich but sad contravenes our hypothesis. To illustrate this point further,
consider the dataset of 10 instances in Figure 1.
Value    Class
2        small
4        small
6        small
8        small
10       small
12       large
14       large
16       large
18       large
20       large
Figure 1. 0% error overlap dataset with 10 instances.
Intuitively, we can see that any sensible discretization method should recognize that a partition should be placed
between values 10 and 12 as the class label changes from small to large. If all datasets were this simple our work
would be done. However if we swap the two middle class labels (Figure 2) there is some confusion as to where
we should place partitions.
Value    Class
2        small
4        small
6        small
8        small
10       large
12       small
14       large
16       large
18       large
20       large
Figure 2. 20% error overlap dataset with 10 instances.
Given that the middle 2 out of 10 values in the dataset are in dispute, we have used the phrase ‘error overlap’ to
represent the level of uncertainty deliberately placed into the class labels when forming each class. Hence, the
error overlap for Figure 2 is 20% or 2/10. Figure 3 demonstrates that as personal assets increase, at some point there
is a change of mood from sad to happy, with the error overlap represented by the intersection of class 1 and class
2.
[Figure: two overlapping frequency curves, class 1 (sad) and class 2 (happy), plotted against personal assets; the intersection of the two curves represents the error overlap.]
Figure 3. 5% class error overlap (class 1 / class 2) with 10,000 instances.
Figure 4 shows the number of errors built into the data (the mismatched Poor & happy and Rich & sad counts) as the error overlap increases for a dataset
of 10,000 instances.
2 class problem    Poor & sad    Poor & happy    Rich & sad    Rich & happy
1%                 4950          50              50            4950
5%                 4750          250             250           4750
10%                4500          500             500           4500
20%                4000          1000            1000          4000
35%                3250          1750            1750          3250
Figure 4. Built-in classification error rates.
where Poor = personal assets < $50,000 and Rich = personal assets > $50,000.
For 3 and 5 class label attributes a similar methodology was followed to build an error rate into the data. For
instance, 3 classes in the above case would result in class labels of: sad, content, happy and for a 5 class
problem: very sad, sad, content, happy, very happy.
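To make the construction concrete, the sketch below generates a 2-class label column with a chosen error overlap in the manner described above. It is an illustrative Python/NumPy reconstruction, not the Excel procedure actually used in this study; the function name and the exact placement of the mislabeled band around the median are assumptions consistent with Figures 2 and 4.

import numpy as np

def make_two_class_labels(values, overlap):
    """Label a numeric attribute 'small'/'large' by a median split, then
    deliberately mislabel a fraction `overlap` of the instances around the
    boundary (half on each side), mimicking the built-in error overlap of
    Figure 4.  Illustrative sketch only."""
    n = len(values)
    order = np.argsort(values)                  # indices in ascending order of value
    labels = np.empty(n, dtype=object)
    labels[order[: n // 2]] = "small"           # below the median
    labels[order[n // 2:]] = "large"            # above the median
    k = int(round(n * overlap / 2))             # mislabeled instances per side
    labels[order[n // 2 - k: n // 2]] = "large" # 'poor but happy' style errors
    labels[order[n // 2: n // 2 + k]] = "small" # 'rich but sad' style errors
    return labels

# Example: 10,000 normally distributed values with a 10% error overlap,
# i.e. 500 mislabeled instances on each side of the median (cf. Figure 4).
rng = np.random.default_rng(0)
attribute = rng.normal(loc=176, scale=8, size=10_000)
classes = make_two_class_labels(attribute, overlap=0.10)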
3. Data Distributions
3.1 Normal Distribution
Many datasets conform to a normal distribution as depicted in Figure 5. They are characterized by a symmetrical
bell-shaped curve with a single peak at the median value, which is also approximately equal to the mean value. 68%
of values are contained within one standard deviation of the mean, 95% of values within 2 standard deviations,
and 99% of values within 3 standard deviations. The normal probability distribution sets a benchmark for center,
spread, skewness, and kurtosis against which other shapes/distributions are measured. Data was generated with a mean of
176cm and a standard deviation of 8cm as an attempt to represent the height distribution of adult males.
The probability density function for a normal distribution with mean $\mu$ and variance $\sigma^2$ is given by:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where $\sigma > 0$.
[Histogram: frequency against height (cm), ranging from 146 to 206, with a single central peak.]
Figure 5. Histogram of normally distributed attribute generated (normal).
3.2 Uniform Distribution
Uniform distributions are characterized by a roughly equal frequency count across the data range. Such a
distribution is expected to occur in the frequency of numbers spun on a roulette wheel where every value has the
same probability of occurrence. Figure 6 shows a randomly generated uniform distribution with the greater
variability in the data being reflected in a higher standard deviation.
[Histogram: roughly constant frequency across the data range.]
Figure 6. Histogram of uniformly distributed attribute generated (uniform).
3.3 Leptokurtic Distribution
A leptokurtic distribution is similar to a normal distribution; however, the central area of the curve is more
pronounced, with a higher peak. Thus, more data values are concentrated close to the mean and median value.
Formally, a distribution with a coefficient of kurtosis greater than three is recognized as leptokurtic.
The coefficient of kurtosis, $g_4$, is given by
$$g_4 = \frac{m_4}{m_2^2}$$
where $m_4$ refers to how much data is massed at the center
and $m_2$ refers to the spread about the center.
The $r$th sample moment about the sample mean for a dataset $a_1, a_2, a_3, \ldots, a_n$ is given by
$$m_r = \frac{1}{n}\sum_{i=1}^{n}\left(a_i - \bar{a}\right)^r$$
where i = 1, ..,n.
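As a concrete reference for these definitions, the short sketch below computes the sample moments and the kurtosis coefficient directly from the formulas above, together with the skewness coefficient $g_3$ defined later in Section 3.6. The function names are illustrative assumptions; note that the descriptive statistics in the Appendix appear to report excess kurtosis (roughly $g_4 - 3$) rather than $g_4$ itself.

import numpy as np

def sample_moment(a, r):
    """r-th sample moment about the sample mean: m_r = (1/n) * sum((a_i - abar)^r)."""
    a = np.asarray(a, dtype=float)
    return np.mean((a - a.mean()) ** r)

def kurtosis_g4(a):
    """Coefficient of kurtosis g4 = m4 / m2^2 (leptokurtic when well above 3)."""
    return sample_moment(a, 4) / sample_moment(a, 2) ** 2

def skewness_g3(a):
    """Coefficient of skewness g3 = m3 / m2^(3/2), as defined in Section 3.6."""
    return sample_moment(a, 3) / sample_moment(a, 2) ** 1.5

# A normal sample should give g4 close to 3 and g3 close to 0.
rng = np.random.default_rng(1)
normal_sample = rng.normal(176, 8, 10_000)
print(round(kurtosis_g4(normal_sample), 2), round(skewness_g3(normal_sample), 2))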
[Histogram: frequency with a pronounced, high central peak.]
Figure 7. Histogram of a leptokurtically distributed attribute generated.
3.4 Platykurtic Distribution
The platykurtic distribution is the opposite of a leptokurtic distribution. This distribution is characterized by a
flatter than normal central peak and as such a more evenly symmetrical distribution of values across the data
range. Formally, if the coefficient of kurtosis is significantly less than three, it is regarded as platykurtic. Note, a
uniform distribution is a special case of a platykurtic distribution.
[Histogram: frequency with a flatter-than-normal central peak spread across the data range.]
Figure 8. Histogram of a platykurtically distributed attribute generated.
3.5 Bimodal Distribution
A bimodal distribution contains two distinct peaks or modes and may be asymmetric. For this example, we have
combined two normally distributed subsets to form a bimodal attribute: 5000 instances have been modeled around
an estimate of the bodyweight distribution of the female population and 5000 of the male population. We have
used a mean of 65 kilograms for females and 90 kilograms for males. Figure 9 shows the effect of combining the
sub-data into a single attribute; two distinct peaks are prevalent, either of which can be regarded as the mode.
[Histogram: frequency against weight (kg), 25 to 130, with two distinct peaks.]
Figure 9. Histogram of a bimodally distributed attribute generated with moving average trendline.
3.6 Skewed Distribution
In the real world, many datasets will not exhibit the symmetry of a normal distribution. Very often there will be a
greater proportion of values on one side of the mean, or the distances between the mean and the extreme values
on either side of the mean will differ. When this occurs, we are said to have skewed data. To mimic this distribution, let
us once again look at the bodyweight distribution of male adults. We will assume that the average weight of an
adult male is approximately 80 kilograms. Modeling this data as a normal distribution would result in problems.
We have all seen adult men greater than 120 kilograms but we do not see adult men less than 40 kilograms as
this weight is impossible to achieve for a person not suffering a growth defect. With this in mind, we have to
shift a proportion of our population to represent weights greater than 120 kilograms but hold the probability of
males weighing less than 40 kilograms at zero. Such a presumption leads to a skewed distribution as in Figure
10. Formally, if the coefficient of skewness is significantly greater/less than zero, we have a skewed distribution.
The sample coefficient of skewness, $g_3$, is given by
$$g_3 = \frac{m_3}{m_2^{3/2}}$$
where $m_3$ refers to the skewness about the center
and $m_2$ refers to the spread about the center.
The $r$th sample moment about the sample mean for a dataset $a_1, a_2, a_3, \ldots, a_n$ is given by
$$m_r = \frac{1}{n}\sum_{i=1}^{n}\left(a_i - \bar{a}\right)^r$$
where i = 1, ..,n.
Figure 10 depicts a positively skewed distribution characterized by a long right tail. In general, $m_3 > 0$ if large
values are present far above the sample mean $\bar{a}$; long tails then develop to the right, producing a right-skewed distribution as below.
[Histogram: frequency against weight (kg), 50 to 155, with a long right tail.]
Figure 10. Histogram of a positively skewed attribute generated with moving average trendline.
3.7 Exponential Distribution
Exponential distributions are often used to model expectancy and lifetimes, such as the life expectancy of
humans, car engines, or light bulbs. As x increases, y decreases at a slowing rate known as the failure rate. Thus,
as we get older, our survival rate declines.
[Histogram: frequency declining steadily across the data range.]
Figure 11. Histogram of an exponential distribution generated with moving average trendline.
3.8 Zipf Distribution
A Zipf Distribution follows a straight line when frequency and rank are plotted on a double-logarithmic scale.
This inverse correlation between ordered frequencies in a dataset exists frequently in real-life situations. An
example of data that exhibits this pattern would be the frequency of words in a text document. Some words such
as ‘the’, ’and’, ’to’, and ‘a’ have very high frequency counts in most long text files, whilst other words such as
‘alleviate’ occur rarely. As a Zipf distribution, all the words would be sorted into frequency order where each
spike in the histogram represents a word. Figure 12 below depicts an approximated Zipf distribution. If we apply
our scenario, we can see that there are many words with few occurrences and few words with many occurrences.
[Histogram: a few very frequent values followed by many rare ones.]
Figure 12. Histogram of an approximated Zipf distribution.
4. Discretization Techniques
Many classifiers used in data mining compel the data to be in non-continuous feature form. As such, suitable
techniques are required to allot numerical data values into discrete and meaningful bins or partitions whilst
avoiding dilution of potential knowledge within this data due to classification information loss. Furthermore, the
categorization of numerical features allows faster computational methods to be used as opposed to algorithms
running on continuous features such as neural networks and X-means clustering. The selection of discretization
technique as we will see has a great impact on classification accuracy.
4.1 Unsupervised Discretization
The simplest means of discretizing continuous features involves a class blind approach where only knowledge of
the feature to be discretized is required. Here the user specifies the number of intervals or bins.
4.1.1 Equal Interval Width/Uniform Binning
This method relies on sorting the data and dividing the data values into equally spaced bin ranges. A seed k
supplied by the user determines how many bins are required. With this seed k, it is just a matter of finding the
maximum and minimum values to derive the range and then partition the data into k bins. The bin width is
computed by
$$\delta = \frac{x_{\max} - x_{\min}}{k}$$
and bin thresholds are constructed at $x_{\min} + i\delta$, where $i = 1, \ldots, k-1$.
Figure 13 below shows an example of this binning strategy with number of bins set to 4 bins. According to the
above formula, breakpoints occur at 71.75, 103.5, and 135.25.
bin      instance    value
Bin1     1           40
         2           45
         3           55
         4           60
         5           64
         6           67
Bin2     7           78
         8           89
Bin3     9           110
Bin4     10          140
         11          154
         12          167
Figure 13. Equal width binning with number of bins set to 4.
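A minimal sketch of this binning strategy reproduces the breakpoints of Figure 13. Python/NumPy is used purely for illustration; the experiments themselves used WEKA's DiscretizeFilter, not this code.

import numpy as np

def equal_width_bins(values, k):
    """Return (thresholds, 0-based bin index per value) for equal-interval binning.
    Bin width = (max - min) / k; thresholds at min + i*width, i = 1..k-1."""
    values = np.asarray(values, dtype=float)
    width = (values.max() - values.min()) / k
    thresholds = values.min() + width * np.arange(1, k)
    # np.digitize assigns each value the index of the interval it falls into
    return thresholds, np.digitize(values, thresholds)

# The 12-instance example of Figure 13: breakpoints 71.75, 103.5 and 135.25.
data = [40, 45, 55, 60, 64, 67, 78, 89, 110, 140, 154, 167]
thresholds, bins = equal_width_bins(data, k=4)
print(thresholds)    # 71.75, 103.5, 135.25
print(bins + 1)      # Bin1..Bin4 assignments matching Figure 13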
4.1.2 Equal Frequency/Histogram Binning
Partitioning of data is based on allocating the same number of instances to each bin. This is achieved by dividing
the total number of instances n by the number of bins k supplied by the user. Thus, if we had 1000 instances and
specified 10 bins, after sorting the data into numerical order we would allocate the first 100 instances into the
first bin, and continue this process until 10 bins of frequency 100 are created. Figure 14 below shows the
contrasting breakpoints of these unsupervised methods. Equal frequency binning leads to each bucket being the
same size whereas equal width binning can lead to different sized buckets as depicted in Figures 13 and 14.
Bin      Instance    Value
Bin1     1           40
         2           45
         3           55
Bin2     4           60
         5           64
         6           67
Bin3     7           78
         8           89
         9           110
Bin4     10          140
         11          154
         12          167
Figure 14. Equal frequency binning with number of bins set to 4.
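The same 12-instance example can be partitioned by equal-frequency binning in a few lines. This is again an illustrative NumPy sketch rather than the WEKA FilteredClassifier used in the experiments.

import numpy as np

def equal_frequency_bins(values, k):
    """Sort the values and allocate (approximately) n/k instances to each bin.
    Returns a list of bins, each a list of values."""
    ordered = np.sort(np.asarray(values, dtype=float))
    return [list(chunk) for chunk in np.array_split(ordered, k)]

# The 12-instance example of Figure 14: three values per bin.
data = [40, 45, 55, 60, 64, 67, 78, 89, 110, 140, 154, 167]
for i, b in enumerate(equal_frequency_bins(data, k=4), start=1):
    print(f"Bin{i}: {b}")
# Bin1: 40, 45, 55   Bin2: 60, 64, 67   Bin3: 78, 89, 110   Bin4: 140, 154, 167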
4.2 Supervised Discretization
Supervised discretization on the other hand makes use of the instance/class labels during the discretization
process to aid in the partitioning process. Prior knowledge of the class label of each instance is incorporated at
each iteration to refine partition breakpoint estimation of each bin.
4.2.1 Holte’s 1R Discretizer
This method uses an error-based approach, with error counts determining when to split intervals [4]. The attribute
is sorted into ascending order and a greedy algorithm is used that divides the feature into bins, each containing
instances of only one class label. The danger inherent in such a technique is that each instance may end up
belonging to a separate bin. To combat this problem, a minimum number of instances of a particular class for
each bin, as specified by the user (except for the uppermost bin), was imposed. Hence any given bin can
now contain a mixture of class labels, and boundaries will not be continually divided, which would lead to overfitting.
Each bucket grows, i.e. the partition shifts to the right, until it has at least 6 instances of one class label, and
continues growing until the next instance to be considered is not part of that majority class label. Empirical analysis [4]
suggests a minimum bin size of 6 performs the best.
Bin      Instance    Value    Class
Bin1     1           64       small
         2           65       large
         3           68       small
         4           69       small
         5           70       small
         6           71       large
         7           72       large
         8           72       small
         9           75       small
         10          75       small
Bin2     11          80       large
         12          81       small
         13          83       small
         14          85       large
Figure 15. Holte's 1R bin partitioning with minimum bucket size of 6.
Figure 15 shows an example of how the 1R method of discretization proceeds. Data is arranged in ascending
order and then a bin is formed by considering each class label instance from the start. Once we reach instance 9
we have a possible partition, as there are at least 6 instances of one class label, i.e. 6 small instances. We then
consider the next instance, 10; as this is also a small instance, we add it to Bin1. This procedure iterates until the
class label changes, i.e. instance 11 is a large instance. Thus, our partition is formed and excludes instance 11.
The midpoint between the values of instances 10 and 11 (75 and 80), i.e. 77.5, is the breakpoint between these two bins.
The rule set produced for this dataset is:
Attribute:
    <= 77.5  :  small (bin 1)
    >  77.5  :  large (bin 2)
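The walk-through above can be expressed as a short greedy routine. The sketch below is an illustrative Python re-implementation of the behaviour described, not WEKA's OneR code; the guard against cutting between equal values is an added assumption for robustness.

from collections import Counter

def one_r_breakpoints(values, labels, min_size=6):
    """Sketch of Holte's 1R-style discretization as described above: grow a bin
    until some class occurs at least `min_size` times, keep extending it while
    the next instance still has that (majority) class, then cut at the midpoint
    to the following value.  The last bin keeps the remainder."""
    pairs = sorted(zip(values, labels))                 # ascending by value
    breakpoints, counts = [], Counter()
    for i, (value, label) in enumerate(pairs):
        counts[label] += 1
        majority, majority_count = counts.most_common(1)[0]
        if majority_count >= min_size and i + 1 < len(pairs):
            next_value, next_label = pairs[i + 1]
            if next_label != majority and next_value != value:
                breakpoints.append((value + next_value) / 2)  # e.g. (75 + 80) / 2 = 77.5
                counts = Counter()                            # start a new bin
    return breakpoints

# The 14-instance example of Figure 15 yields a single breakpoint at 77.5.
values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["small", "large", "small", "small", "small", "large", "large",
          "small", "small", "small", "large", "small", "small", "large"]
print(one_r_breakpoints(values, labels))   # [77.5]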
4.2.2 C4.5 Discretizer
The C4.5 algorithm [8] when run on continuous variables applies thresholds in order to discretize the variable to
form a decision tree via binary splits at each node. The instances are sorted on the attribute being discretized and
the number of discrete values is calculated. Thus, if we have 10,000 instances and 250 discrete values, the
number of possible threshold cutoff points that can be considered is 249. Information gain theory is then used to
determine at which threshold value our gain ratio is greatest in order to partition the data. A divide and conquer
algorithm is then successively applied to determine whether to split each partition into smaller subsets at each
iteration. The gain ratio is defined as:
$$\text{gain ratio}(X) = \frac{\text{gain}(X)}{\text{split info}(X)}$$
let
$$\text{gain}(X) = \text{info}(T) - \text{info}_X(T), \qquad
\text{info}(T) = -\sum_{j=1}^{k}\frac{\text{freq}(C_j, T)}{|T|}\log_2\!\left(\frac{\text{freq}(C_j, T)}{|T|}\right), \qquad
\text{info}_X(T) = \sum_{i=1}^{n}\frac{|T_i|}{|T|}\,\text{info}(T_i)$$
where info(T) measures the average amount of information needed to identify the class of a case in dataset T,
and info_X(T) measures the information still needed after partitioning T in accordance with test X, so that gain(X) is the information gained by the partition.
let
$$\text{split info}(X) = -\sum_{i=1}^{n}\frac{|T_i|}{|T|}\log_2\!\left(\frac{|T_i|}{|T|}\right)$$
where split info(X) represents the potential information generated by dividing T into n subsets.
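As a concrete illustration of these quantities, the short Python sketch below computes info(T), info_X(T) and the gain ratio for one candidate split. The helper names are ours rather than C4.5's, and the example simply reuses the 10-instance dataset of Figure 2 split at a threshold of 11.

import math
from collections import Counter

def info(labels):
    """info(T): average bits needed to identify the class of a case in T."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Gain ratio of a candidate split of T (labels) into the given subsets."""
    n = len(labels)
    info_x = sum(len(s) / n * info(s) for s in subsets)           # info_X(T)
    gain = info(labels) - info_x                                   # gain(X)
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets)
    return gain / split_info if split_info > 0 else 0.0

# The dataset of Figure 2, split between values 10 and 12.
labels = ["small"] * 4 + ["large", "small"] + ["large"] * 4
print(round(gain_ratio(labels, [labels[:5], labels[5:]]), 3))   # about 0.278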
C4.5 uses a bottom-up approach that builds a complete tree and then prunes the number of intervals. Each non-leaf subtree is examined from the bottom and, if the predicted error is lower when a leaf replaces the subtree,
then the subtree is replaced by a leaf. This may be a computationally expensive procedure if the data is not in
sorted order, however, the data generated was sorted before discretization was performed. The gain ratio
penalizes larger numbers of splits to select the actual number of bins. Figure 16 depicts how C4.5 recursively
partitions attribute values.
[Decision tree diagram: binary splits on the bimodal attribute at thresholds 65, 74, 84 and 87, with leaves small (3556/431), small (1716/216), large (1333/25), large (866/350) and large (2529).]
Figure 16. C4.5 decision tree growth of continuous variable - attribute bimodal. Class labels, along with the
number of correct classifications and errors, are contained in leaves, and thresholds are listed beside child pointers.
4.2.3 Fayyad and Irani’s Entropy Based MDL Method
This method uses a top down approach whereby multiple ranges rather than binary ranges are created to form a
tree via multi-way splits of the numeric attribute at the same node to produce discrete bins [3]. Determining the
exact cut-off points for intervals is based on the Minimum Description Length Principle (MDLP), which is biased
towards a simpler theory that can explain the same body of data and favours a hypothesis that minimizes the probability
of making a wrong decision, assuming a uniform error cost. No pruning is applied to the grown tree.
An information entropy/uncertainty minimization heuristic is used to select threshold boundaries by finding a
single threshold that minimizes the entropy function over all possible thresholds. This entropy function is then
recursively applied to both of the partitions induced. Thresholds are placed half way between the two delimiting
instances. The entropy function is given by:
$$\text{Ent}(S) = -\sum_{i=1}^{k} P(C_i, S)\,\log_2\!\big(P(C_i, S)\big)$$
let
C1, ..., Ck = class labels; k = number of classes; P(Ci, S) = proportion of instances in S that have class Ci.
At this point the MDL stopping criterion is applied to determine when to stop subdividing discrete intervals. If
the following condition holds then a partition is induced:
$$\text{Gain}(A, T; S) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, T; S)}{N}$$
where
$$\text{Gain}(A, T; S) = \text{Ent}(S) - \frac{|S_1|}{|S|}\,\text{Ent}(S_1) - \frac{|S_2|}{|S|}\,\text{Ent}(S_2)$$
$$\Delta(A, T; S) = \log_2\!\left(3^{k} - 2\right) - \big[k\,\text{Ent}(S) - k_1\,\text{Ent}(S_1) - k_2\,\text{Ent}(S_2)\big]$$
let
S = set; A = attribute; S1, S2 = subsets of S induced by the threshold; T = threshold value; N = number of instances in S;
k, k1, k2 = number of classes represented in S, S1, S2; Ent = entropy of instances in each subinterval.
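A hedged Python sketch of the criterion just described follows; it uses the standard formulation of the MDLP test rather than WEKA's implementation, and the helper names are ours.

import math
from collections import Counter

def ent(labels):
    """Class entropy Ent(S) = -sum_i P(C_i, S) log2 P(C_i, S)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_accepts_split(s, s1, s2):
    """Fayyad & Irani MDL stopping criterion (standard formulation): accept the
    binary split of S into S1, S2 only if the information gain exceeds
    (log2(N - 1) + delta) / N."""
    n = len(s)
    weighted = (len(s1) / n) * ent(s1) + (len(s2) / n) * ent(s2)
    gain = ent(s) - weighted
    k, k1, k2 = (len(set(x)) for x in (s, s1, s2))    # classes present in S, S1, S2
    delta = math.log2(3 ** k - 2) - (k * ent(s) - k1 * ent(s1) - k2 * ent(s2))
    return gain > (math.log2(n - 1) + delta) / n

# The clean split of Figure 1 is accepted; a noisier split may not be.
s = ["small"] * 5 + ["large"] * 5
print(mdl_accepts_split(s, s[:5], s[5:]))   # True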
Figure 17 shows a typical discretization of a continuous variable named bimodal using the Fayyad & Irani
method.
[Tree diagram: the bimodal attribute split into multiple intervals with boundaries at 60, 62, 65, 68, 74, 77, 81, 84 and 87; leaf labels include small (1800), small (890/415), small (866/16), small (784), small (647/105), small (285/111), large (638/23), large (695), large (866/350) and large (2529).]
Figure 17. Entropy based multiway splitting of intervals for attribute bimodal. Class labels, along with the
number of correct classifications and errors, are contained in leaves, and thresholds are listed beside child pointers.
4.2.4 Kononenko’s Entropy Based MDL Method
This method is virtually identical to that of Fayyad and Irani except that it includes an adjustment for when
multiple attributes are to be discretized. This algorithm provides a correction for the bias the entropy measure
has towards an attribute with many values.
5. Running Algorithms in WEKA
The dataset generated contained 8 attributes, one representing each data distribution. There are 15 class columns: 5
columns for the simplest 2-class problem, with either a 1%, 5%, 10%, 20% or 35% error rate built into the class
labels; and similarly 5 columns each for the 3-class label and 5-class label problems. Each discretization method was tested
with 1 attribute and 1 class column run together via the C4.5 classifier using 10-fold cross-validation. For
instance, to analyze the normal distribution, column 1 and column 9 would firstly be run using each of the 6
discretization methods. Then, the effect of changing the error rate and number of class labels would be tested by
using each of the class columns from 10-24 against column 1. Each distribution was then analyzed by filtering
out the appropriate columns, with all 720 combinations run (i.e. 6 discretization methods × 8 attributes × 15
classes).
[Table: sample rows of the generated dataset. Attribute columns 1-8 hold the normal, uniform, lepto, platy, bimod, skew, expo and Zipf values; class columns 1-15 hold the 2-, 3- and 5-class labels (e.g. small/large, small/average/large, v.small...v.large) for each of the 1%, 5%, 10%, 20% and 35% error overlaps.]
Figure 18. Layout of dataset generated containing 8 attributes and 15 classes.
Equal width binning was run using weka.filters.DiscretizeFilter in weka 3-2-3 with the number of bins used set
at the default of 10 bins. No attempt was made to optimize this value, although a discussion of the reasons
behind this is given in the results section. The C4.5 algorithm was then applied to the discretized data to
produce a decision tree. This classifier was chosen over a Bayesian classifier because it imposes no normality
constraints on the data, which is especially relevant given the focus of the experiments.
Equal frequency binning was run using the FilteredClassifier in weka 3-3-4 with the number of bins used set at
the default of 10 bins. The C4.5 algorithm was then applied to the discretized data to produce a decision tree.
Holte’s 1R Discretizer was run using weka.classifiers.OneR classifier in weka 3-2-3 with default minimum
bucket size run at the default setting of 6.
C4.5 Discretizer was run using weka.classifiers.j48.J48 in weka 3-2-3 with confidence factor set to 0.25 as per
the default value.
Fayyad and Irani’s Entropy Based MDL Method was achieved using weka.classifiers.FilteredClassifier in weka
3-2-3 with the use MDL parameter set to true. The C4.5 algorithm was used as the wrapper function.
Kononenko’s Entropy Based MDL Method was achieved using weka.classifiers.FilteredClassifier in weka 3-2-3
with the use MDL parameter set to false. The C4.5 algorithm was used as the wrapper function.
6. Analysis of Errors
The method of error estimation used can influence findings in many instances; however, in the experiments
undertaken, the different error estimates were in overwhelming agreement as to which method was the most accurate.
In most practical situations, the best numerical prediction method is still the best no matter which error measure
is used [11]. We chose to compare the root relative squared error for each method. This measure gives extra
emphasis to outliers, and larger discrepancies are weighted more heavily than smaller ones. Further, it was able to
detect subtle differences in the various error rates of each method, especially when the total error count or
confusion matrix was the same. This measure is given by:
$$\text{RRSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(p_i - a_i\right)^2}{\sum_{i=1}^{n}\left(a_i - \bar{a}\right)^2}}$$
where
p = predicted value; a = actual value;
$\bar{a} = \frac{1}{n}\sum_i a_i$ = average value of the $a_i$.
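For reference, the measure as written above can be computed as follows. This is a direct numeric sketch of the formula; in the experiments the value was simply read from WEKA's evaluation output (see Figure 19), where it is reported as a percentage.

import numpy as np

def root_relative_squared_error(predicted, actual):
    """RRSE = sqrt( sum (p_i - a_i)^2 / sum (a_i - mean(a))^2 ), often quoted as a %."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.sum((predicted - actual) ** 2) /
                   np.sum((actual - actual.mean()) ** 2))

# A predictor that always returns the mean of the actual values scores 1.0 (100%).
actual = np.array([1.0, 2.0, 3.0, 4.0])
print(root_relative_squared_error(np.full(4, actual.mean()), actual))   # 1.0
print(root_relative_squared_error(actual, actual))                      # 0.0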
Figure 19 shows sample WEKA run information including decision tree construction and error output.
J48 pruned tree
------------------
uniform <= 162.93
|   uniform <= 158.86: small (2150.0)
|   uniform > 158.86
|   |   uniform <= 160.23: medium (250.0)
|   |   uniform > 160.23
|   |   |   uniform <= 161.4
|   |   |   |   uniform <= 160.81: small (100.0)
|   |   |   |   uniform > 160.81: medium (101.0/1.0)
|   |   |   uniform > 161.4: small (249.0)
uniform > 162.93
|   uniform <= 189.16: medium (4300.0)
|   uniform > 189.16
|   |   uniform <= 192.96
|   |   |   uniform <= 190.58: large (248.0)
|   |   |   uniform > 190.58
|   |   |   |   uniform <= 191.65
|   |   |   |   |   uniform <= 191.14: medium (103.0/3.0)
|   |   |   |   |   uniform > 191.14: large (100.0/1.0)
|   |   |   |   uniform > 191.65: medium (251.0/2.0)
|   |   uniform > 192.96: large (2148.0)
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        9992        99.92 %
Incorrectly Classified Instances         8         0.08 %
Kappa statistic                          0.9987
Mean absolute error                      0.0009
Root mean squared error                  0.0231
Relative absolute error                  0.2226 %
Root relative squared error              5.0702 %
Total Number of Instances            10000

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.998     0         0.999       0.998    0.999       large
1         0.001     0.999       1        0.999       medium
0.999     0         1           0.999    1           small

=== Confusion Matrix ===
    a     b     c   <-- classified as
 2496     4     0 |  a = large
    2  4998     0 |  b = medium
    0     2  2498 |  c = small
Figure 19. Fayyad & Irani MDL method with C4.5 classifier run information for uniform distribution 3 class
problem (small, medium, large) and 10% error overlap.
6.1 Normal Distribution Errors
[Bar chart: root relative squared error (0-100) for each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) at each error overlap (1%, 5%, 10%, 20%, 35%).]
Figure 20. Classification error rate for 2-class label experiment on normally distributed data.
As can be seen from Figure 20, there is a great amount of variability in the performance of the various
discretization techniques. The first two unsupervised methods progressively deteriorate in predictive
performance as the inherent error overlap is increased. On the other hand, the other methods have a stable
performance as the error overlap is increased. Surprisingly, the MDL methods perform worse with an error
overlap of 1% compared to when the error overlap is 35%.
[Two bar charts in the same format as Figure 20, for the 3-class and 5-class experiments.]
Figure 21. 3 class label errors (normal).
Figure 22. 5 class label errors (normal).
When the number of class labels is increased from 2 to 3 and 5, the performance difference between methods
becomes more pronounced. Once again, the unsupervised methods progressively deteriorate in performance
especially equal-width binning, while the supervised methods remain stable.
6.2 Uniform Distribution Errors
[Bar chart: root relative squared error (0-100) for each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) at each error overlap (1%-35%).]
Figure 23. Classification error rate for 2-class label experiment on uniformly distributed data.
Once again the supervised methods are clearly superior to the unsupervised methods, with the unsupervised
methods progressively deteriorating in performance as the error overlap increases.
[Two bar charts in the same format as Figure 23, for the 3-class and 5-class experiments.]
Figure 24. 3 class label errors (uniform).
Figure 25. 5 class label errors (uniform).
Figures 24 and 25 reinforce the findings of the 2-class experiment, with no clear winner.
6.3 Leptokurtic Distribution Errors
[Bar chart: root relative squared error (0-100) for each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) at each error overlap (1%-35%).]
Figure 26. Classification error rate for 2-class label experiment on leptokurtically distributed data.
Figure 26 shows some surprising results with all methods having almost identical error rates.
[Two bar charts in the same format as Figure 26, for the 3-class and 5-class experiments.]
Figure 27. 3 class label errors (leptokurtic).
Figure 28. 5 class label errors (leptokurtic).
Figures 27 and 28 show that there is a decline in relative performance for the equal-width binning method when
additional class labels are added.
6.4 Platykurtic Distribution Errors
[Bar chart: root relative squared error (0-100) for each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) at each error overlap (1%-35%).]
Figure 29. Classification error rate for 2-class label experiment on platykurtically distributed data.
Figure 29 shows that the entropy based methods outperform the unsupervised and error based method, especially
when the error overlap is increased.
[Two bar charts in the same format as Figure 29, for the 3-class and 5-class experiments.]
Figure 30. 3 class label errors (platykurtic).
Figure 31. 5 class label errors (platykurtic).
Figures 30 and 31 show that as the number of class labels grows, so too does the disparity between the entropy-based methods and the two unsupervised and error-based methods. Interestingly, the errors of the entropy-based methods
actually level out even as the number of class labels increases. The sensitivity of the error rate to the number of
bins selected is particularly apparent in the equal-frequency method.
6.5 Bimodal Distribution Errors
[Bar chart: root relative squared error (0-100) for each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) at each error overlap (1%-35%).]
Figure 32. Classification error rate for a 2-class label experiment on bimodally distributed data.
The superiority of the supervised discretization methods across all error overlaps is marginal when data is
bimodal and there are 2 class labels.
[Two bar charts in the same format as Figure 32, for the 3-class and 5-class experiments.]
Figure 33. 3 class label errors (bimodal).
Figure 34. 5 class label errors (bimodal).
Interestingly, error rates across the board jump significantly when there are 5 class labels in the dataset,
especially at lower error overlap levels as seen in Figure 34.
6.6 Skewed Distribution Errors
[Bar chart: root relative squared error (0-100) for each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) at each error overlap (1%-35%).]
Figure 35. Classification error rate for 2-class label experiment on positively skewed data.
Apart from the OneR discretizer, all methods seem to provide similar error rates. Interestingly, the equal-frequency
discretizer has virtually identical error rates to the supervised methods.
[Two bar charts in the same format as Figure 35, for the 3-class and 5-class experiments.]
Figure 36. 3 class label errors (skewed).
Figure 37. 5 class label errors (skewed).
Once again with 3 class labels, the equal-frequency method exhibits identical error rates to the supervised
methods.
6.7 Exponential Distribution Errors
[Bar chart: root relative squared error (0-100) for each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) at each error overlap (1%-35%).]
Figure 38. Classification error rate for 2-class label experiment on exponentially distributed data.
Figure 38 shows that at lower error overlap levels the error rate seems to be almost identical for all methods bar
the equal-interval binning method for the exponential distribution. At the high end of the error overlap, the
equal-frequency binning and the supervised methods exhibit a slightly smaller error rate.
[Two bar charts in the same format as Figure 38, for the 3-class and 5-class experiments.]
Figure 39. 3 class label errors (exponential).
Figure 40. 5 class label errors (exponential).
Figures 39 and 40 show that as the number of class labels increased, the superiority of the entropy-based methods became more pronounced.
6.8 Zipf Distribution Errors
[Bar chart: root relative squared error (0-100) for each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) at each error overlap (1%-35%).]
Figure 41. Classification error rate for 2-class label experiment on approximated Zipf distribution.
At lower error overlaps, all methods exhibit similar error rates. However, when the error overlap increases, the
relative increase in the error rates for the two MDL supervised methods is slightly smaller.
[Two bar charts in the same format as Figure 41, for the 3-class and 5-class experiments.]
Figure 42. 3 class label errors (Zipf).
Figure 43. 5 class label errors (Zipf).
Interestingly, only the error rates at the lower overlap levels increase across the board as the number of class
labels increases. Once again the two MDL methods exhibit lower error rates than the other methods.
7. Results
7.1 Varying the Error Overlap
From the section 6.1 and 6.2 errors we saw that the first two unsupervised methods progressively deteriorate in
predictive performance as the inherent error overlap is increased. On the other hand, the other methods have a
stable performance as the error overlap is increased. Surprisingly, the MDL methods perform worse with an
error overlap of 1% compared to when the error overlap is 35%. In contrast, the section 6.3 to 6.8 errors showed
a degenerative performance for the skewed distribution when the error overlap was increased, regardless of the
discretization method used.
7.2 Varying the Number of Class Labels
[3-D bar chart: average error (0-80) for four panels, exponential (EQ-INT), exponential (FI), uniform (EQ-INT) and uniform (FI), each plotted against the number of class labels (2, 3, 5) and the error overlap (1%, 5%, 10%, 20%, 35%).]
Figure 44. Unsupervised (Equal-interval) vs. Supervised (Fayyad & Irani) average errors (Y-axis) for Exponential
and Uniform distributions plotted against number of class labels (X-axis) and error overlap (Z-axis).
The effect of increasing the number of class labels can be seen in Figure 44 where an example of the exponential
and uniform distribution errors are highlighted when run with both an unsupervised (equal-interval) and
supervised method (Fayyad&Irani). When the number of class labels is increased from 2 to 3 and 5, the
degenerative performance of the unsupervised methods becomes more pronounced. From Figure 44, the error
rate rises sharply for both the exponential and uniform distributions. This pattern was replicated across the board
for all distributions when unsupervised discretization was applied. The supervised methods provided mixed
outcomes when the number of class labels increased. In Section 6.2 we saw that the uniform distribution errors
remained static for supervised discretization using the Fayyad&Irani method even when the number of class
labels increased. Only the normal distribution also followed this pattern (Section 6.1). All other distributions
followed a growth in the error rate similar to that exhibited by the supervised exponential errors in Figure 44.
7.3 Comparison of Overall Error Rates
[Bar chart: average root relative squared error (0-100) for each distribution (NORM, UNI, LEP, PLAT, BI, SKEW, EXP, ZIPF) under each discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON).]
Figure 45. Average root relative squared error for all distributions and discretization methods.
Figure 45 shows a comparison of the performance of each discretization method as applied to each data
distribution. Each bar represents the average root relative squared error by summing across all overlaps and
number of class labels and dividing by 15. The equal-interval binning technique appears to be more effective
when the data distribution is flatter, as depicted by a uniform or platykurtic distribution. This concurs with the
findings in [2] that this method of discretization is vulnerable to skewed data. Interestingly, the error rate of
equal-interval binning converges to that of equal-frequency binning with flatter distributions as depicted in
Figures 23 to 25. Equal interval binning often distributes instances unevenly across bins with some bins
containing many instances and others possibly containing none. Figure 45 demonstrates the degenerative
performance of equal-interval binning when the distribution being discretized is less even. It is thus no surprise
that the leptokurtic and the Zipf distributions are the worst performers under this method, and the best
performers are those with a more even distribution, i.e. the uniform and platykurtic distributions.
Similarly, equal frequency binning can suffer from serious problems. Consider the event that we have 1000
instances and 10 intervals specified by the user. If there are 105 instances with a value between 10 and 20
nominally corresponding to bin 2, then the 5 instances that do not fit are grouped into the next bin. This sort of occurrence
undermines knowledge discovery in the sense that partitions then lose all semantic value. On the whole, the
equal-frequency binning method shows relative indifference to the distribution of the data.
The OneR discretizer works well with both a normal and uniform distribution. The OneR error rate is roughly
20% higher for all other distributions relative to the MDL methods. The problem with an error based method like
OneR is that it may result in splitting intervals unnecessarily to minimize the error count and consequently blurs
the understanding of classification results. Furthermore, multiple attributes that require discretization cannot be
used with this algorithm.
The C4.5 discretizer was the equal best performer for the normal and uniform distribution. This concurs with [5]
who showed that discretization causes a small increase in error rate when the continuous feature is normally
distributed. However, it was nearly 10% worse than the MDL entropy methods for the other six distributions.
One factor that impacts on the overall tree generated is the confidence level. The default confidence parameter of
0.25 was used. Varying this value changes the level of pruning applied to the tree. The smaller this parameter the
greater the level of pruning that occurs. It is a balancing act to prevent overfitting whilst trying to optimize this
parameter in the context of knowledge discovery [5].
The entropy-based MDL methods combine both local and global information during learning, i.e. they use a
wrapper function (C4.5 in this case, local) together with the MDL criterion (global). The MDL criterion pre-selects cut-points and the
classifier decides which cut-point is the most appropriate in the current context. Using a global discretization as
opposed to a local method (C4.5) makes the MDL methods less sensitive to variation from small fragmented
data [2]. These methods were consistently high performers across all data distributions, as depicted in Figure 45.
7.4 Selection of the Number of Bins
In the experiments, the number of bins k was set to the default value of ten. The choice of k has a major effect on error
rates [6]. Identifying the best value of k to use is a trial and error process and cannot be applied universally to
each attribute. In some cases, error rates may even compare favourably to supervised discretization, however,
finding the optimum cut off thresholds by supervised discretization may involve 2an calculations where a is the
number of attributes and n is the number of instances [7]. The unpredictability of unsupervised methods as
shown in the overall summary error rates for all distributions seems to make them a poor choice for
discretization.
[Line chart: root relative squared error (0-100) against the number of bins chosen (0-20), with the default value of 10 bins marked.]
Figure 46. Error rate for number of bins manually chosen for equal-width binning strategy for normal
distribution of 2-class problem with 35% error overlap.
Figure 46 depicts the effect on the classification error rate when the seed for an unsupervised method is
varied. Here the seed is a number manually chosen by the user that specifies the number of bins requested from
the data. Figure 46 shows that the default value of 10 bins (used in all runs) is not the selection that produces the smallest
error rate in this instance, though it may be in others.
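To make this sensitivity to k concrete, the sketch below sweeps the bin count for an equal-width discretization of a normal attribute with a 35% error overlap. It is a deliberately simplified stand-in (the misclassification rate of a majority-class-per-bin rule, rather than C4.5 with the root relative squared error), so the numbers will not reproduce Figure 46, only the qualitative dependence on k; all names are illustrative.

import numpy as np

def majority_rule_error(values, labels, k):
    """Simplified stand-in for the discretize-then-classify pipeline: equal-width
    bin the attribute into k bins, predict the majority class of each bin, and
    return the resulting misclassification rate."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    edges = np.linspace(values.min(), values.max(), k + 1)[1:-1]   # k-1 interior thresholds
    bins = np.digitize(values, edges)
    errors = 0
    for b in range(k):
        in_bin = labels[bins == b]
        if in_bin.size:
            _, counts = np.unique(in_bin, return_counts=True)
            errors += in_bin.size - counts.max()   # everything not in the bin's majority class
    return errors / values.size

# Normal attribute, 2-class labels with a 35% error overlap (1750 flipped per side,
# as in Figure 4), built by a median split over the sorted values.
rng = np.random.default_rng(2)
attr = np.sort(rng.normal(176, 8, 10_000))
lab = np.array(["small"] * 5_000 + ["large"] * 5_000)
lab[3_250:5_000] = "large"
lab[5_000:6_750] = "small"
for k in (2, 5, 10, 15, 20):
    print(k, round(majority_rule_error(attr, lab, k), 3))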
7.5 Identifying a Heuristic
Initial thoughts were that error rates were correlated positively with the level of kurtosis or skewness. The
correlation coefficient for kurtosis against the root relative squared error was 0.45, indicating that there
appears to be a significant relationship as shown in Figure 47. The relationship between skewness and the
error rate, as depicted in Figure 48, was less pronounced, with the correlation coefficient at 0.22. Prudence is
required when drawing inferences from the size of the coefficients, given the error measure used and the assumption
of a linear relationship via a linear regression model. In general terms, we can deduce a positive relationship
between the elevation (kurtosis) and asymmetry (skewness) of the data and the classification errors resulting
from discretization. However, a word of caution is that discretization methods do a poor job of separating
the two underlying populations in a variable that has a bimodal distribution [9].
[Two scatter plots with fitted lines: coefficient of kurtosis (0-7) and coefficient of skewness (0-1.6) plotted against the average root relative squared error (0-60) for each of the eight distributions (norm, uni, lep, plat, bimod, skew, expo, zipf).]
Figure 47. Kurtosis vs. average error rate: r2 = 0.45.
Figure 48. Skewness vs. average error rate: r2 = 0.22.
Via closer inspection of the shape of each distribution, it seems that distributions that have either:
• a long thin tail (skewed, Zipf and exponential distributions), or
• a high peak/concentration of data (leptokurtic, bimodal distributions)
benefit the least from discretization. Similarly, these distributions gain the least from using a supervised method
compared to an unsupervised method of discretization. For instance, the root relative squared error rate can be
improved by 86% for a normal or uniform distribution, compared to 19% for a skewed distribution and 25% for
a bimodal distribution (Fig.45).
8. Conclusion
From analysis of the 720 runs it was found the distributions that benefit the most from discretization in ranked
order are: (1) uniform, (2) normal, (3) platykurtic, (4) bimodal, (5) exponential, (6) Zipf, (7) leptokurtic, (8)
skewed.
The higher ranked distributions, namely the uniform and normal distributions exhibit escalating error rates for
the unsupervised discretization methods when the error overlap and/or the number of class labels increases.
These unsupervised methods were shown to be highly susceptible to an arbitrarily selected number of bins. In
contrast, the error rates for the supervised discretization methods remained constant as both the error overlap
and/or the number of class labels introduced increased.
At the lower end of the ranked distributions, the classification error rates grew as the error overlap grew as
expected, regardless of which discretization method was adopted. Unexpectedly, as more class labels were
introduced, error rates increased at lower error overlap levels whilst remaining relatively static at the higher end
of the error overlap level.
In summary, we have identified a heuristic that is able to determine the relative effectiveness of discretization
a priori by data visualization, by establishing that a positive correlation exists between the level of kurtosis and
skewness of a data distribution and the resulting classification error rate, with a caveat for bimodal
distributions.
In this work we examined classification errors based on a single discretized attribute. In future work we will look
at combinations of discretized attributes.
9. References
[1] Catlett, J. (1991b) “On Changing Continuous Attributes into Ordered Discrete Attributes”, In
Y. Kodratoff, ed., EWSL-91. Lecture Notes in Artificial Intelligence 482, pp.164-178. Springer-Verlag,
Berlin, Germany.
[2] Dougherty, J., Kohavi, R., Sahami, M. (1995) “Supervised and Unsupervised Discretization of
Continuous Features” Machine Learning: Proceedings of the 12th International Conference, Morgan
Kaufmann, Los Altos, CA.
[3] Fayyad, U.M., Irani, K.B. (1993) “Multi-interval discretization of continuous valued attributes for
classification learning” In Proceedings of the 13th International Joint Conference on Artificial
Intelligence, pp.1022-1027, Los Altos, CA: Morgan Kaufmann.
[4] Holte, R.C. (1993) “Very simple classification rules perform well on most commonly used datasets”
Machine Learning 11, pp.63-90.
[5] Kohavi, R., Sahami, M. (1996) “Error-based and Entropy-based Discretization of Continuous Features”
In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp.114-119. Menlo Park: AAAI Press.
[6] Kononenko, I., Sikonja, M.R, (1995) “Discretization of continuous attributes using ReliefF”
Proceedings of ERK'95 , Portoroz, Slovenia, 1995.
[7] Pazzani, M. (1995) “An iterative improvement approach for the discretization of numeric attributes in
Bayesian classifiers” KDD-95 pp.228-233
[8] Quinlan, J.R. (1993) “C4.5 : Programs for Machine Learning”, Morgan Kaufmann, Los Altos, CA
[9] Scott, P.D., Williams, R.J., Ho, K.M. (1997) “Forming Categories in Exploratory Data Analysis and
Data Mining”, IDA 1997, pp.235-246.
[10] Smith, P.J. (1997) "Into Statistics" Springer-Verlag, pp.205-206, 304-306, 323-329.
[11] Witten, I., Frank, E. (2000) “Data Mining: Practical Machine Learning Tools and Techniques” Morgan
Kaufmann, pp.80-82, 147-150, 238-246.
10. Appendix
statistics of random generated normal distribution data with mean=176, standard deviation = 8.

normal distribution (176, 8)
Mean                        176.135
Standard Error                0.079
Median                      176.059
Mode                        180.732
Standard Deviation            7.865
Sample Variance              61.851
Kurtosis                     -0.001
Skewness                      0.055
Range                        59.279
Minimum                     146.654
Maximum                     205.933
Sum                     1761350.518
Count                     10000.000
Largest(1)                  205.933
Smallest(1)                 146.654
Confidence Level(95.0%)       0.154
statistics for uniform distribution generated.

uniform (146-206)
Mean                        176.143
Standard Error                0.174
Median                      176.405
Mode                        194.984
Standard Deviation           17.352
Sample Variance             301.097
Kurtosis                     -1.209
Skewness                     -0.017
Range                        59.995
Minimum                     146.000
Maximum                     205.995
Sum                     1761431.341
Count                     10000.000
Largest(1)                  205.995
Smallest(1)                 146.000
Confidence Level(95.0%)       0.340
statistics for leptokurtic distribution generated.

leptokurtic
Mean                         88.215
Standard Error                0.099
Median                       87.000
Mode                         87.000
Standard Deviation            9.881
Sample Variance              97.629
Kurtosis                      3.321
Skewness                     -0.167
Range                        85.000
Minimum                      45.000
Maximum                     130.000
Sum                      882145.000
Count                     10000.000
Largest(1)                  130.000
Smallest(1)                  45.000
Confidence Level(95.0%)       0.194
statistics for platykurtic distribution generated.

platykurtic
Mean                         88.698
Standard Error                0.269
Median                       90.000
Mode                         90.000
Standard Deviation           26.893
Sample Variance             723.210
Kurtosis                     -0.676
Skewness                     -0.012
Range                       117.000
Minimum                      30.000
Maximum                     147.000
Sum                      886977.000
Count                     10000.000
Largest(1)                  147.000
Smallest(1)                  30.000
Confidence Level(95.0%)       0.527
statistics for bimodal distribution generated.

bimodal (male-90, female-65)
Mean                         75.610
Standard Error                0.158
Median                       75.000
Mode                         65.000
Standard Deviation           15.840
Sample Variance             250.894
Kurtosis                     -0.428
Skewness                     -0.004
Range                       105.000
Minimum                      25.000
Maximum                     130.000
Sum                      756104.000
Count                     10000.000
Largest(1)                  130.000
Smallest(1)                  25.000
Confidence Level(95.0%)       0.310
statistics for skewed distribution generated.

skewed distribution - 1.396
Mean                         78.063
Standard Error                0.179
Median                       75.000
Mode                         75.000
Standard Deviation           17.913
Sample Variance             320.871
Kurtosis                      2.633
Skewness                      1.396
Range                       105.000
Minimum                      50.000
Maximum                     155.000
Sum                      780625.000
Count                     10000.000
Largest(1)                  155.000
Smallest(1)                  50.000
Confidence Level(95.0%)       0.351
statistics for exponential distribution generated.

exponential distribution
Mean                         15.668
Standard Error                0.135
Median                       12.500
Mode                          2.500
Standard Deviation           13.526
Sample Variance             182.948
Kurtosis                      2.165
Skewness                      1.492
Range                        72.5
Minimum                       2.5
Maximum                      75
Sum                      156680
Count                     10000
Largest(1)                   75
Smallest(1)                   2.5
Confidence Level(95.0%)       0.265
statistics for approximated Zipf distribution generated.

Zipf distribution
Mean                         16.312
Standard Error                0.170
Median                        8
Mode                          2
Standard Deviation           17.077
Sample Variance             291.645
Kurtosis                      0.522
Skewness                      1.261
Range                        66
Minimum                       0
Maximum                      66
Sum                      163124
Count                     10000
Largest(1)                   66
Smallest(1)                   0
Confidence Level(95.0%)       0.334