AN EMPIRICAL INVESTIGATION OF THE IMPACT OF DISCRETIZATION ON COMMON DATA DISTRIBUTIONS

Michael Kumaran Ismail
Master of Technology (Information Technology)
2003
RMIT University

An Empirical Investigation of the Impact of Discretization on Common Data Distributions

by Michael K. Ismail

A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology (Information Technology)

Department of Computer Science
RMIT University
Melbourne, VIC 3001, AUSTRALIA

Abstract

This study investigates the merits of six of the most popular discretization methods when confronted with a randomly generated dataset whose attributes conform to one of eight common statistical distributions. The aim is to arrive at a heuristic that identifies the most appropriate discretization method to apply, given some preliminary analysis or visualization to determine the type of statistical distribution of the attribute to be discretized. Further, the comparative effectiveness of discretization under each data distribution is a primary focus. Analysis was carried out by inducing a decision tree classifier (C4.5) on the discretized data, with an error measure used to determine the relative value of discretization. The experiments showed that the method of discretization and the level of error deliberately built into the class attribute have a major impact on the classification errors generated after discretization. More importantly, the general effectiveness of discretization varies significantly with the shape of the data distribution. Distributions that are highly skewed or strongly peaked tend to produce higher classification errors, and the relative superiority of supervised over unsupervised discretization diminishes significantly when applied to these distributions.
Declaration

I certify that all work on this dissertation was carried out between March 2003 and June 2003, and that it has not been submitted for any academic award at any other college, institute or university. The work presented was carried out under the supervision of Dr Vic Ciesielski, who proposed the idea of creating random data and applying an inherent error in the construction of class labels. All work in the dissertation is my own except where acknowledged in the text.

Signed, Michael K. Ismail, 25th of June, 2003

TABLE OF CONTENTS

1. Introduction ... 6
2. Data Generation Methodology ... 6
3. Data Distributions ... 8
   3.1 Normal Distribution ... 8
   3.2 Uniform Distribution ... 8
   3.3 Leptokurtic Distribution ... 9
   3.4 Platykurtic Distribution ... 10
   3.5 Bimodal Distribution ... 10
   3.6 Skewed Distribution ... 11
   3.7 Exponential Distribution ... 12
   3.8 Zipf Distribution ... 12
4. Discretization Techniques ... 13
   4.1 Unsupervised Discretization ... 13
       4.1.1 Equal Interval Width/Uniform Binning ... 13
       4.1.2 Equal Frequency/Histogram Binning ... 13
   4.2 Supervised Discretization ... 14
       4.2.1 Holte's 1R Discretizer ... 14
       4.2.2 C4.5 Discretizer ... 15
       4.2.3 Fayyad and Irani's Entropy Based MDL Method ... 16
       4.2.4 Kononenko's Entropy Based MDL Method ... 17
5. Running Algorithms in WEKA ... 17
6. Analysis of Errors ... 18
   6.1 Normal Distribution Errors ... 19
   6.2 Uniform Distribution Errors ... 20
   6.3 Leptokurtic Distribution Errors ... 21
   6.4 Platykurtic Distribution Errors ... 22
   6.5 Bimodal Distribution Errors ... 23
   6.6 Skewed Distribution Errors ... 24
   6.7 Exponential Distribution Errors ... 25
   6.8 Zipf Distribution Errors ... 26
7. Results ... 27
8. Conclusion ... 30
9. References ... 31
10. Appendix ... 32

TABLE OF FIGURES

Figure 1. 0% error overlap dataset with 10 instances. ... 6
Figure 2. 20% error overlap dataset with 10 instances. ... 6
Figure 3. 5% class error overlap (class 1 vs. class 2) with 10,000 instances. ... 7
Figure 4. Built-in classification error rates. ... 7
Figure 5. Histogram of normally distributed attribute generated (normal). ... 8
Figure 6. Histogram of uniformly distributed attribute generated (uniform). ... 8
Figure 7. Histogram of a leptokurtically distributed attribute generated. ... 9
Figure 8. Histogram of a platykurtically distributed attribute generated. ... 10
Figure 9. Histogram of a bimodally distributed attribute generated with moving average trendline. ... 10
Figure 10. Histogram of a positively skewed attribute generated with moving average trendline. ... 11
Figure 11. Histogram of an exponential distribution generated with moving average trendline. ... 12
Figure 12. Histogram of an approximated Zipf distribution. ... 12
Figure 13. Equal width binning with number of bins set to 4. ... 13
Figure 14. Equal frequency binning with number of bins set to 4. ... 13
Figure 15. Holte's 1R bin partitioning with minimum bucket size of 6. ... 14
Figure 16. C4.5 decision tree growth of continuous variable, attribute bimodal. ... 15
Figure 17. Entropy based multiway splitting of intervals for attribute bimodal. ... 16
Figure 18. Layout of dataset generated containing 8 attributes and 15 classes. ... 17
Figure 19. Fayyad & Irani MDL method with C4.5 classifier run information for uniform distribution 3-class problem (small, medium, large) and 10% error overlap. ... 18
Figure 20. Classification error rate for 2-class label experiment on normally distributed data. ... 19
Figure 21. 3 class label errors (normal). ... 19
Figure 22. 5 class label errors (normal). ... 19
Figure 23. Classification error rate for 2-class label experiment on uniformly distributed data. ... 20
Figure 24. 3 class label errors (uniform). ... 20
Figure 25. 5 class label errors (uniform). ... 20
Figure 26. Classification error rate for 2-class label experiment on leptokurtically distributed data. ... 21
Figure 27. 3 class label errors (leptokurtic). ... 21
Figure 28. 5 class label errors (leptokurtic). ... 21
Figure 29. Classification error rate for 2-class label experiment on platykurtically distributed data. ... 22
Figure 30. 3 class label errors (platykurtic). ... 22
Figure 31. 5 class label errors (platykurtic). ... 22
Figure 32. Classification error rate for a 2-class label experiment on bimodally distributed data. ... 23
Figure 33. 3 class label errors (bimodal). ... 23
Figure 34. 5 class label errors (bimodal). ... 23
Figure 35. Classification error rate for 2-class label experiment on positively skewed data. ... 24
Figure 36. 3 class label errors (skewed). ... 24
Figure 37. 5 class label errors (skewed). ... 24
Figure 38. Classification error rate for 2-class label experiment on exponentially distributed data. ... 25
Figure 39. 3 class label errors (exponential). ... 25
Figure 40. 5 class label errors (exponential). ... 25
Figure 41. Classification error rate for 2-class label experiment on approximated Zipf distribution. ... 26
Figure 42. 3 class label errors (Zipf). ... 26
Figure 43. 5 class label errors (Zipf). ... 26
Figure 44. Unsupervised (equal-width) vs. supervised (Fayyad & Irani) average errors (Y-axis) for exponential and uniform distributions plotted against number of class labels (X-axis) and error overlap (Z-axis). ... 27
Figure 45. Average root relative squared error for all distributions and discretization methods. ... 28
Figure 46. Error rate for number of bins manually chosen for equal-width binning strategy for normal distribution of 2-class problem with 35% error overlap. ... 29
Figure 47. Kurtosis vs. average error rate: r2 = 0.45. ... 30
Figure 48. Skewness vs. average error rate: r2 = 0.22. ... 30

1.
Introduction

Discretization of continuous attributes not only broadens the range of data mining algorithms able to analyze data, since many require discrete input, but can also dramatically increase the speed at which these tasks are performed [1]. Many studies have evaluated the relative effectiveness of the numerous discretization techniques and suggested optimizations [4,6,7]. However, these experiments have usually been carried out on real-world datasets where no attempt was made to identify the type of probability density function of the continuous attribute. Here, the effectiveness of various discretization methods is evaluated on a variety of artificially constructed data conforming to common statistical distributions. Discretization and data mining operations are performed using the WEKA data mining software [11]. The comparative responsiveness of each distribution to discretization is the primary focus of the experiments and is assessed through analysis of the classification errors resulting from discretization. It is hoped that by data visualization alone we can infer the likely effectiveness of any subsequent discretization a priori and, where applicable, isolate the most suitable discretization method to adopt. Eight statistical distributions and six discretization methods were chosen for analysis.

2. Data Generation Methodology

Data was generated using the random number generation facility of the Microsoft Excel data analysis tool. Each attribute was formed with 10,000 instances, as this provided accurate measures in terms of achieving the desired symmetry in distributions. Each class attribute was formed with an inbuilt error, or class overlap (1%, 5%, 10%, 20%, 35%).
For instance, suppose we have a numeric attribute (personal assets) containing 10,000 instances ranging from $50 to $50,000,000. For a 2-class label problem (sad, happy) we find the median value, i.e. the 5,000th instance (say $50,000). If we say that everyone below the median value (poor) is sad and everyone above it (rich) is happy, then anyone who is poor but happy, or rich but sad, contravenes our hypothesis. To illustrate this point further, consider the dataset of 10 instances in Figure 1.

Value  Class
2      small
4      small
6      small
8      small
10     small
12     large
14     large
16     large
18     large
20     large

Figure 1. 0% error overlap dataset with 10 instances.

Intuitively, any sensible discretization method should recognize that a partition should be placed between the values 10 and 12, where the class label changes from small to large. If all datasets were this simple our work would be done. However, if we swap the two middle class labels (Figure 2) there is some confusion as to where partitions should be placed.

Value  Class
2      small
4      small
6      small
8      small
10     large
12     small
14     large
16     large
18     large
20     large

Figure 2. 20% error overlap dataset with 10 instances.

Given that the middle 2 out of 10 values in the dataset are in dispute, we use the phrase 'error overlap' to represent the level of uncertainty deliberately placed into the class labels when forming each class. Hence, the error overlap for Figure 2 is 2/10, or 20%. Figure 3 (two overlapping frequency curves for class 1, sad, and class 2, happy, plotted against personal assets) demonstrates that as personal assets increase, at some point there is a change of mood from sad to happy, with the error overlap represented by the intersection of class 1 and class 2.

Figure 3. 5% class error overlap (class 1 vs. class 2) with 10,000 instances.

Figure 4 shows the number of errors built into the data as the error overlap increases for a dataset of 10,000 instances.
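This labeling scheme can be sketched in code. The following is a hypothetical Python reimplementation (the study itself used Microsoft Excel), which builds a 2-class attribute and then swaps labels in a band centred on the median so that a chosen fraction of instances contradicts the value/class hypothesis:

```python
def label_two_class(values, overlap):
    """Label sorted values 'small'/'large' around the median, then swap the
    labels in a band centred on the median so that an 'overlap' fraction of
    instances contradicts the hypothesis (the built-in error overlap)."""
    ordered = sorted(values)
    n = len(ordered)
    labels = ["small"] * (n // 2) + ["large"] * (n - n // 2)
    band = int(n * overlap / 2)  # instances flipped on each side of the median
    for i in range(n // 2 - band, n // 2 + band):
        labels[i] = "large" if labels[i] == "small" else "small"
    return list(zip(ordered, labels))

# Reproduces Figure 2: 20% error overlap on the 10-instance dataset;
# value 10 becomes 'large' and value 12 becomes 'small'.
print(label_two_class([2, 4, 6, 8, 10, 12, 14, 16, 18, 20], 0.20))
```

With 10,000 instances and a 5% overlap the same scheme flips 250 labels on each side of the median, matching the 250 poor-and-happy and 250 rich-and-sad instances of Figure 4.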
2-class problem   Poor & sad   Poor & happy   Rich & sad   Rich & happy
1%                4950         50             50           4950
5%                4750         250            250          4750
10%               4500         500            500          4500
20%               4000         1000           1000         4000
35%               3250         1750           1750         3250

Figure 4. Built-in classification error rates, where poor means personal assets < $50,000 and rich means personal assets > $50,000.

For 3 and 5 class label attributes, a similar methodology was followed to build an error rate into the data. For instance, 3 classes in the above case would give the class labels sad, content, happy; a 5-class problem would give very sad, sad, content, happy, very happy.

3. Data Distributions

3.1 Normal Distribution

Many datasets conform to a normal distribution, as depicted in Figure 5. They are characterized by a symmetrical bell-shaped curve with a single peak at the median value, which is also approximately equal to the mean. About 68% of values lie within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. The normal probability distribution sets the benchmark for center, spread, skewness, and kurtosis against which other distributions are measured. Data was generated with a mean of 176 cm and a standard deviation of 8 cm to represent the height distribution of adult males. The probability density function for a normal distribution with mean μ and variance σ² is given by:

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)), where σ > 0.

Figure 5. Histogram of normally distributed attribute generated (normal).

3.2 Uniform Distribution

Uniform distributions are characterized by a roughly equal frequency count across the data range. Such a distribution is expected in the frequency of numbers spun on a roulette wheel, where every value has the same probability of occurrence. Figure 6 shows a randomly generated uniform distribution; the greater variability in the data is reflected in a higher standard deviation.
Figure 6. Histogram of uniformly distributed attribute generated (uniform).

3.3 Leptokurtic Distribution

A leptokurtic distribution is similar to a normal distribution; however, the central area of the curve is more pronounced, with a higher peak. Thus, more data values are concentrated close to the mean and median. A coefficient of kurtosis greater than three formally identifies a distribution as leptokurtic. The coefficient of kurtosis, g4, is given by:

g4 = m4 / m2²

where m4 measures how much data is massed at the center and m2 measures the spread about the center. The rth sample moment about the sample mean ā for a dataset a1, a2, a3, ..., an is given by:

mr = (1/n) Σ (ai − ā)^r, summed over i = 1, ..., n.

Figure 7. Histogram of a leptokurtically distributed attribute generated.

3.4 Platykurtic Distribution

The platykurtic distribution is the opposite of a leptokurtic distribution. It is characterized by a flatter-than-normal central peak and thus a more evenly spread distribution of values across the data range. Formally, if the coefficient of kurtosis is significantly less than three, the distribution is regarded as platykurtic. Note that a uniform distribution is a special case of a platykurtic distribution.

Figure 8. Histogram of a platykurtically distributed attribute generated.

3.5 Bimodal Distribution

A bimodal distribution contains two distinct peaks, or modes, and may be asymmetric. For this example, we have combined two subsets of normal distributions to form a bimodal attribute: 5,000 instances modeled on an estimate of the bodyweight distribution of the female population (mean 65 kilograms) and 5,000 on that of the male population (mean 90 kilograms). Figure 9 shows the effect of combining the sub-data into a single attribute.
Two distinct peaks are prevalent in Figure 9; either can be regarded as a mode.

Figure 9. Histogram of a bimodally distributed attribute generated with moving average trendline.

3.6 Skewed Distribution

In the real world, many datasets do not exhibit the symmetry of a normal distribution. Very often a greater proportion of values falls on one side of the mean, or the distances between the mean and the extreme values on either side differ. When this occurs, the data is said to be skewed. To mimic this distribution, consider again the bodyweight distribution of adult males, assuming an average weight of approximately 80 kilograms. Modeling this data as a normal distribution would be problematic: we have all seen adult men weighing more than 120 kilograms, but we do not see adult men under 40 kilograms, as this weight is unattainable for a person not suffering a growth defect. With this in mind, we shift a proportion of the population to weights greater than 120 kilograms while holding the probability of males weighing less than 40 kilograms at zero. Such an assumption leads to a skewed distribution, as in Figure 10. Formally, if the coefficient of skewness is significantly greater or less than zero, the distribution is skewed. The sample coefficient of skewness, g3, is given by:

g3 = m3 / m2^(3/2)

where m3 measures the skewness about the center, m2 measures the spread about the center, and mr is the rth sample moment about the sample mean as defined in Section 3.3. Figure 10 depicts a positively skewed distribution characterized by a long right tail. In general, m3 > 0 when values far above the sample mean are present; long tails then develop to the right, producing a right-skewed distribution.
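The moment-based coefficients defined in Sections 3.3 and 3.6 translate directly into code. The following is an illustrative Python sketch (the function names are my own, not part of the study):

```python
def sample_moment(data, r):
    """r-th sample moment about the sample mean: m_r = (1/n) * sum((a_i - mean)^r)."""
    n = len(data)
    mean = sum(data) / n
    return sum((a - mean) ** r for a in data) / n

def skewness(data):
    """Sample coefficient of skewness g3 = m3 / m2^(3/2); > 0 for a long right tail."""
    return sample_moment(data, 3) / sample_moment(data, 2) ** 1.5

def kurtosis(data):
    """Sample coefficient of kurtosis g4 = m4 / m2^2; > 3 indicates leptokurtic,
    significantly less than 3 indicates platykurtic."""
    return sample_moment(data, 4) / sample_moment(data, 2) ** 2
```

A symmetric sample such as [1, 2, 3, 4, 5] yields a skewness of zero, while appending a few very large values drives g3 positive, mirroring the right-skewed bodyweight example above.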
Figure 10. Histogram of a positively skewed attribute generated with moving average trendline.

3.7 Exponential Distribution

Exponential distributions are often used to model lifetimes, such as the life expectancy of humans, car engines, or light bulbs. As x increases, the frequency decreases at a slowing rate, governed by what is known as the failure rate; thus, as we get older, our survival rate declines.

Figure 11. Histogram of an exponential distribution generated with moving average trendline.

3.8 Zipf Distribution

A Zipf distribution follows a straight line when frequency and rank are plotted on a double-logarithmic scale. This inverse correlation between ordered frequencies in a dataset arises frequently in real life. An example is the frequency of words in a text document: words such as 'the', 'and', 'to', and 'a' have very high frequency counts in most long text files, whilst words such as 'alleviate' occur rarely. In a Zipf distribution, the words are sorted into frequency order, and each spike in the histogram represents a word. Figure 12 depicts an approximated Zipf distribution. Applying our scenario, we can see that there are many words with few occurrences and few words with many occurrences.

Figure 12. Histogram of an approximated Zipf distribution.

4. Discretization Techniques

Many classifiers used in data mining require the data to be in discrete (non-continuous) feature form. As such, suitable techniques are required to allocate numerical data values into discrete and meaningful bins or partitions, whilst avoiding dilution of the potential knowledge within the data through loss of classification information.
Furthermore, the categorization of numerical features allows faster computational methods to be used, as opposed to algorithms that operate directly on continuous features, such as neural networks and X-means clustering. As we will see, the choice of discretization technique has a great impact on classification accuracy.

4.1 Unsupervised Discretization

The simplest means of discretizing continuous features is a class-blind approach, where only knowledge of the feature to be discretized is required. Here the user specifies the number of intervals, or bins.

4.1.1 Equal Interval Width/Uniform Binning

This method relies on sorting the data and dividing the data values into equally spaced bin ranges. A parameter k supplied by the user determines how many bins are required. Given k, it is just a matter of finding the maximum and minimum values to derive the range and then partitioning the data into k bins. The bin width is computed by:

δ = (xmax − xmin) / k

and bin thresholds are constructed at xmin + iδ, where i = 1, ..., k−1. Figure 13 shows an example of this binning strategy with the number of bins set to 4. According to the above formula, breakpoints occur at 71.75, 103.5, and 135.25.

Instance:  1   2   3   4   5   6   7   8   9    10   11   12
Value:     40  45  55  60  64  67  78  89  110  140  154  167
Bin:       1   1   1   1   1   1   2   2   3    4    4    4

Figure 13. Equal width binning with number of bins set to 4.

4.1.2 Equal Frequency/Histogram Binning

Partitioning of the data is based on allocating the same number of instances to each bin, achieved by dividing the total number of instances n by the user-supplied number of bins k. Thus, if we had 1,000 instances and specified 10 bins, after sorting the data into numerical order we would allocate the first 100 instances to the first bin, and continue this process until 10 bins of frequency 100 are created. Figure 14 shows the contrasting breakpoints of these unsupervised methods.
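The two unsupervised strategies can be sketched as follows. This is an illustrative Python version (not the WEKA implementation); the equal-frequency variant places each threshold midway between the last value of one bucket and the first of the next, one reasonable convention among several:

```python
def equal_width_breakpoints(values, k):
    """Thresholds at x_min + i * (x_max - x_min) / k for i = 1..k-1 (Section 4.1.1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_breakpoints(values, k):
    """Thresholds giving each bin n/k sorted instances (Section 4.1.2), placed
    midway between the last value of one bin and the first of the next."""
    ordered = sorted(values)
    size = len(ordered) // k
    return [(ordered[i * size - 1] + ordered[i * size]) / 2 for i in range(1, k)]

vals = [40, 45, 55, 60, 64, 67, 78, 89, 110, 140, 154, 167]
print(equal_width_breakpoints(vals, 4))   # the Figure 13 breakpoints: 71.75, 103.5, 135.25
print(equal_frequency_breakpoints(vals, 4))
```

On the 12-value example of Figures 13 and 14 the two methods cut the range in visibly different places, which is exactly the contrast those figures illustrate.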
Equal frequency binning gives each bucket the same number of instances, whereas equal width binning can produce buckets of differing frequency, as depicted in Figures 13 and 14.

Instance:  1   2   3   4   5   6   7   8   9    10   11   12
Value:     40  45  55  60  64  67  78  89  110  140  154  167
Bin:       1   1   1   2   2   2   3   3   3    4    4    4

Figure 14. Equal frequency binning with number of bins set to 4.

4.2 Supervised Discretization

Supervised discretization, on the other hand, makes use of the class labels of the instances to aid the partitioning process. Prior knowledge of the class label of each instance is incorporated at each iteration to refine the breakpoint estimation of each bin.

4.2.1 Holte's 1R Discretizer

This method uses an error-based approach, with error counts determining when to split intervals [4]. The attribute is sorted into ascending order and a greedy algorithm divides the feature into bins, each ideally containing instances of only one class label. The danger inherent in such a technique is that each instance may end up belonging to a separate bin. To combat this problem, a user-specified minimum number of instances of a particular class per bin (except for the uppermost bin) is imposed. Hence any given bin may contain a mixture of class labels, boundaries are not continually subdivided, and overfitting is avoided. Each bucket grows, i.e. the partition shifts to the right, until it has at least 6 instances of a class label, and continues while the next instance belongs to the majority class. Empirical analysis [4] suggests a minimum bin size of 6 performs best.

Instance  Value  Class   Bin
1         64     small   1
2         65     large   1
3         68     small   1
4         69     small   1
5         70     small   1
6         71     large   1
7         72     large   1
8         72     small   1
9         75     small   1
10        75     small   1
11        80     large   2
12        81     small   2
13        83     small   2
14        85     large   2

Figure 15. Holte's 1R bin partitioning with minimum bucket size of 6.

Figure 15 shows an example of how the 1R method of discretization proceeds.
Data is arranged in ascending order and a bin is formed by considering each instance from the start. Once we reach instance 9 we have a possible partition, as there are at least 6 instances of one class label (6 small instances). We then consider the next instance, 10; as this is also a small instance we add it to bin 1. The procedure iterates until the class label changes: instance 11 is a large instance, so the partition is formed excluding instance 11. The midpoint between the values of instances 10 and 11 (75 and 80, i.e. 77.5) becomes the breakpoint between the two bins. The rule set produced for this dataset is:

Attribute: <= 77.5 -> small (bin 1)
           >  77.5 -> large (bin 2)

4.2.2 C4.5 Discretizer

The C4.5 algorithm [8], when run on continuous variables, applies thresholds in order to discretize the variable, forming a decision tree via binary splits at each node. The instances are sorted on the attribute being discretized and the number of distinct values is calculated. Thus, if we have 10,000 instances and 250 distinct values, 249 possible threshold cutoff points can be considered. Information gain theory is then used to determine the threshold value at which the gain ratio is greatest in order to partition the data. A divide and conquer algorithm is then applied successively to determine whether to split each partition into smaller subsets at each iteration. The gain ratio is defined as:

gain ratio(X) = gain(X) / split info(X)

where

gain(X) = info(T) − info_X(T)

info(T) measures the average amount of information needed to identify the class of a case in dataset T, and info_X(T) = Σ (|Ti| / |T|) · info(Ti), summed over the subsets T1, ..., Tn, measures the information still required after partitioning T in accordance with test X. The denominator,

split info(X) = − Σ (|Ti| / |T|) · log2(|Ti| / |T|)

represents the potential information generated by dividing T into n subsets. C4.5 uses a bottom-up approach that builds a complete tree and then prunes the number of intervals.
Each non-leaf subtree is examined from the bottom, and if the predicted error is lower when a leaf replaces the subtree, the replacement is made. This can be a computationally expensive procedure if the data is not in sorted order; however, the data generated here was sorted before discretization was performed. The gain ratio penalizes larger numbers of splits, which determines the actual number of bins selected. Figure 16 depicts how C4.5 recursively partitions the attribute values: for the bimodal attribute, the tree splits at the thresholds 87, 74, 65, and 84, yielding five leaves (for example, a small leaf covering 3556 instances with 431 errors, and a large leaf covering 2529 instances).

Figure 16. C4.5 decision tree growth of continuous variable, attribute bimodal. Class labels, with the number of instances covered and the number of errors, are contained in the leaves; thresholds are listed beside the child pointers.

4.2.3 Fayyad and Irani's Entropy Based MDL Method

This method uses a top-down approach whereby multiple ranges, rather than binary ranges, are created via multi-way splits of the numeric attribute at the same node to produce discrete bins [3]. Determination of the exact cut-off points for the intervals is based on the Minimum Description Length Principle (MDLP), which biases towards a simpler theory that can explain the same body of data and favours a hypothesis that minimizes the probability of making a wrong decision, assuming a uniform error cost. No pruning is applied to the grown tree. An information entropy/uncertainty minimization heuristic selects threshold boundaries by finding the single threshold that minimizes the entropy function over all possible thresholds; the function is then applied recursively to both of the induced partitions. Thresholds are placed halfway between the two delimiting instances. The entropy function is given by:

Ent(S) = − Σ P(Ci, S) · log2 P(Ci, S), summed over i = 1, ..., k
where C_1..C_k are the class labels, k is the number of classes, and P(C_i, S) is the proportion of instances in S that have class C_i. At this point the MDL stopping criterion is applied to determine when to stop subdividing discrete intervals. A partition is induced only if the following condition holds:

    Gain(A, T; S) > log2(N - 1) / N + delta(A, T; S) / N

where

    delta(A, T; S) = log2(3^k - 2) - [k * Ent(S) - k1 * Ent(S1) - k2 * Ent(S2)]

and S is the instance set, A the attribute, S1 and S2 the subsets of S induced by threshold value T, N the number of instances, k, k1 and k2 the number of classes represented in S, S1 and S2, and Ent the entropy of the instances in each subinterval. Figure 17 shows a typical discretization of a continuous variable named bimodal using the Fayyad & Irani method.

Figure 17. Entropy-based multiway splitting of intervals for attribute bimodal. Class labels, along with the number of correct classifications and errors, are contained in the leaves; interval boundaries are listed beside the child pointers.

4.2.4 Kononenko's Entropy Based MDL Method

This method is virtually identical to that of Fayyad and Irani, except that it includes an adjustment for when multiple attributes are to be discretized. The algorithm corrects for the bias the entropy measure has towards attributes with many values.

5. Running Algorithms in WEKA

The dataset generated contained 8 attributes, one for each data distribution, and 15 class columns: 5 classes for the simplest 2-class problem, with either a 1%, 5%, 10%, 20% or 35% error rate built into the class labels, and similarly 5 classes each for the 3-class-label and 5-class-label problems. Each discretization method was tested on 1 attribute and 1 class column together via the C4.5 classifier using 10-fold cross-validation. For instance, to analyze the normal distribution, column 1 and column 9 would first be run using each of the 6 discretization methods.
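The recursive cut-point search of Section 4.2.3, with the MDLP stopping criterion above, can be sketched as follows. This is a simplified illustration rather than the WEKA implementation; the function names are ours.

```python
import math
from collections import Counter

def ent(labels):
    """Ent(S): class entropy of a set of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdlp_cuts(xs, ys):
    """Recursive entropy minimization with the MDLP stopping rule.
    xs must be sorted; returns the accepted cut points in order."""
    N = len(ys)
    if N < 2:
        return []
    base = ent(ys)
    best = None
    for i in range(1, N):                 # candidate binary partitions
        if xs[i] == xs[i - 1]:
            continue
        e = (i / N) * ent(ys[:i]) + ((N - i) / N) * ent(ys[i:])
        if best is None or e < best[1]:
            best = (i, e)
    if best is None:
        return []
    i, e = best
    gain = base - e
    k, k1, k2 = len(set(ys)), len(set(ys[:i])), len(set(ys[i:]))
    delta = math.log2(3**k - 2) - (k * base - k1 * ent(ys[:i]) - k2 * ent(ys[i:]))
    if gain <= math.log2(N - 1) / N + delta / N:   # MDL says stop
        return []
    cut = (xs[i - 1] + xs[i]) / 2
    return mdlp_cuts(xs[:i], ys[:i]) + [cut] + mdlp_cuts(xs[i:], ys[i:])
```

On cleanly separated data the criterion accepts a single cut at the class boundary and the recursion then stops on both pure halves.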
Then, the effect of changing the error rate and the number of class labels was tested by running each of the class columns 10-24 against column 1. Each distribution was then analyzed by filtering out the appropriate columns, with all 720 combinations run (6 discretization methods * 8 attributes * 15 classes).

Figure 18. Layout of the dataset generated, containing 8 attributes (normal, uniform, leptokurtic, platykurtic, bimodal, skewed, exponential, Zipf) and 15 class columns: 2-, 3- and 5-class-label problems, each with a 1%, 5%, 10%, 20% or 35% error rate built into the labels (e.g. "1% 2-class", "1% 3-class", "5% 2-class", ...).

Equal-width binning was run using weka.filters.DiscretizeFilter in WEKA 3-2-3 with the number of bins set at the default of 10. No attempt was made to optimize this value, although the reasons for this are discussed in the results section. The C4.5 algorithm was then applied to the discretized data to produce a decision tree. This classifier was chosen over a Bayesian classifier because it imposes no normality constraints on the data, which is especially relevant given the focus of the experiments. Equal-frequency binning was run using the FilteredClassifier in WEKA 3-3-4 with the number of bins set at the default of 10. The C4.5 algorithm was then applied to the discretized data to produce a decision tree. Holte's 1R discretizer was run using the weka.classifiers.OneR classifier in WEKA 3-2-3 with the minimum bucket size at its default setting of 6. The C4.5 discretizer was run using weka.classifiers.j48.J48 in WEKA 3-2-3 with the confidence factor at its default value of 0.25.
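For reference, the two unsupervised strategies can be sketched in a few lines. This is a minimal illustration of the idea only, not WEKA's DiscretizeFilter; the helper names are ours.

```python
def equal_width_bins(values, k=10):
    """Split the attribute's range into k intervals of identical width;
    returns the k-1 cut points between consecutive intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k)]

def equal_frequency_bins(values, k=10):
    """Choose cut points so each bin receives roughly len(values)/k
    instances; each cut is the midpoint between neighbouring values."""
    xs = sorted(values)
    n = len(xs)
    cuts = []
    for i in range(1, k):
        j = (n * i) // k
        cuts.append((xs[j - 1] + xs[j]) / 2)
    return cuts
```

The contrast is visible immediately: equal-width cuts depend only on the range, while equal-frequency cuts track where the instances actually lie.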
Fayyad and Irani's entropy-based MDL method was run using weka.classifiers.FilteredClassifier in WEKA 3-2-3 with the use-MDL parameter set to true; the C4.5 algorithm was used as the wrapper function. Kononenko's entropy-based MDL method was run in the same way, but with the use-MDL parameter set to false; again the C4.5 algorithm was used as the wrapper function.

6. Analysis of Errors

The method of error estimation used can influence findings in many instances; however, in the experiments undertaken, the error estimates were in overwhelming consensus as to which method was most accurate. In most practical situations, the best numerical prediction method is still the best no matter which error measure is used [11]. We chose to compare the root relative squared error for each method. This measure gives extra emphasis to outliers: larger discrepancies are weighted more heavily than smaller ones. Further, it was able to detect subtle differences in the error rates of each method, especially when the total error count or confusion matrix was the same. The measure is defined as:

    RRSE = sqrt( SUM_i (p_i - a_i)^2 / SUM_i (a_i - abar)^2 )

where p_i is the predicted value, a_i the actual value, and abar = (1/n) SUM_i a_i is the average of the actual values. Figure 19 shows sample WEKA run information, including decision tree construction and error output.
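The root relative squared error is straightforward to compute. A minimal sketch (our own helper, not WEKA's implementation):

```python
import math

def rrse(predicted, actual):
    """Root relative squared error: squared error of the predictor
    relative to simply predicting the mean of the actual values."""
    mean_a = sum(actual) / len(actual)
    num = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    den = sum((a - mean_a) ** 2 for a in actual)
    return math.sqrt(num / den)

# A perfect predictor scores 0; always predicting the mean scores 1,
# which is why the measure is reported as a percentage of that baseline.
```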
J48 pruned tree
------------------
uniform <= 162.93
|   uniform <= 158.86: small (2150.0)
|   uniform > 158.86
|   |   uniform <= 160.23: medium (250.0)
|   |   uniform > 160.23
|   |   |   uniform <= 161.4
|   |   |   |   uniform <= 160.81: small (100.0)
|   |   |   |   uniform > 160.81: medium (101.0/1.0)
|   |   |   uniform > 161.4: small (249.0)
uniform > 162.93
|   uniform <= 189.16: medium (4300.0)
|   uniform > 189.16
|   |   uniform <= 192.96
|   |   |   uniform <= 190.58: large (248.0)
|   |   |   uniform > 190.58
|   |   |   |   uniform <= 191.65
|   |   |   |   |   uniform <= 191.14: medium (103.0/3.0)
|   |   |   |   |   uniform > 191.14: large (100.0/1.0)
|   |   |   |   uniform > 191.65: medium (251.0/2.0)
|   |   uniform > 192.96: large (2148.0)

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        9992      99.92 %
Incorrectly Classified Instances         8       0.08 %
Kappa statistic                     0.9987
Mean absolute error                 0.0009
Root mean squared error             0.0231
Relative absolute error             0.2226 %
Root relative squared error         5.0702 %
Total Number of Instances            10000

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.998     0         0.999       0.998    0.999       large
1         0.001     0.999       1        0.999       medium
0.999     0         1           0.999    1           small

=== Confusion Matrix ===
    a     b     c   <-- classified as
 2496     4     0 |  a = large
    2  4998     0 |  b = medium
    0     2  2498 |  c = small

Figure 19. Fayyad & Irani MDL method with C4.5 classifier: run information for the uniform distribution 3-class problem (small, medium, large) with 10% error overlap.

6.1 Normal Distribution Errors

Figure 20. Classification error rate (root relative squared error) for the 2-class label experiment on normally distributed data, by discretization method (EQ-INT, EQ-FR, 1R, C4.5, FI, KON) and error overlap (1%-35%).

As can be seen from Figure 20, there is a great amount of variability in the performance of the various discretization techniques. The first two (unsupervised) methods progressively deteriorate in predictive performance as the inherent error overlap is increased.
On the other hand, the other methods maintain stable performance as the error overlap is increased. Surprisingly, the MDL methods perform worse with an error overlap of 1% than when the error overlap is 35%.

Figure 21. 3-class label errors (normal). Figure 22. 5-class label errors (normal).

When the number of class labels is increased from 2 to 3 and 5, the performance difference between methods becomes more pronounced. Once again the unsupervised methods progressively deteriorate in performance, especially equal-width binning, while the supervised methods remain stable.

6.2 Uniform Distribution Errors

Figure 23. Classification error rate for the 2-class label experiment on uniformly distributed data.

Once again the supervised methods are clearly superior to the unsupervised methods, with the unsupervised methods progressively deteriorating in performance as the error overlap increases.

Figure 24. 3-class label errors (uniform). Figure 25. 5-class label errors (uniform).

Figures 24 and 25 reinforce the findings of the 2-class experiment, with no clear winner.

6.3 Leptokurtic Distribution Errors

Figure 26.
Classification error rate for the 2-class label experiment on leptokurtically distributed data.

Figure 26 shows some surprising results, with all methods having almost identical error rates.

Figure 27. 3-class label errors (leptokurtic). Figure 28. 5-class label errors (leptokurtic).

Figures 27 and 28 show a decline in relative performance for the equal-width binning method as additional class labels are added.

6.4 Platykurtic Distribution Errors

Figure 29. Classification error rate for the 2-class label experiment on platykurtically distributed data.

Figure 29 shows that the entropy-based methods outperform the unsupervised and error-based methods, especially as the error overlap is increased.

Figure 30. 3-class label errors (platykurtic). Figure 31. 5-class label errors (platykurtic).

Figures 30 and 31 show that as the number of class labels grows, so too does the disparity between the entropy-based methods and the two unsupervised methods and the error-based method. Interestingly, the entropy-based methods' errors actually level out even as the number of class labels increases. The sensitivity of the error rate to the number of bins selected is particularly apparent in the equal-frequency method.
6.5 Bimodal Distribution Errors

Figure 32. Classification error rate for the 2-class label experiment on bimodally distributed data.

The superiority of the supervised discretization methods across all error overlaps is marginal when the data is bimodal and there are 2 class labels.

Figure 33. 3-class label errors (bimodal). Figure 34. 5-class label errors (bimodal).

Interestingly, error rates across the board jump significantly when there are 5 class labels in the dataset, especially at lower error overlap levels, as seen in Figure 34.

6.6 Skewed Distribution Errors

Figure 35. Classification error rate for the 2-class label experiment on positively skewed data.

Apart from the OneR discretizer, all methods seem to produce similar error rates. Interestingly, the equal-frequency discretizer has virtually identical error rates to the supervised methods.

Figure 36. 3-class label errors (skewed). Figure 37. 5-class label errors (skewed).

Once again, with 3 class labels the equal-frequency method exhibits error rates identical to the supervised methods.
6.7 Exponential Distribution Errors

Figure 38. Classification error rate for the 2-class label experiment on exponentially distributed data.

Figure 38 shows that at lower error overlap levels the error rate is almost identical for all methods bar the equal-interval binning method. At the high end of the error overlap, equal-frequency binning and the supervised methods exhibit a slightly smaller error rate.

Figure 39. 3-class label errors (exponential). Figure 40. 5-class label errors (exponential).

Figures 39 and 40 show that as the number of class labels increased, the superiority of the entropy-based methods became more pronounced.

6.8 Zipf Distribution Errors

Figure 41. Classification error rate for the 2-class label experiment on the approximated Zipf distribution.

At lower error overlaps, all methods exhibit similar error rates. However, when the error overlap increases, the relative increase in the error rates for the two MDL supervised methods is slightly smaller.

Figure 42. 3-class label errors (Zipf). Figure 43. 5-class label errors (Zipf).
Interestingly, only the error rates at the lower overlap levels increase across the board as the number of class labels increases. Once again, the two MDL methods exhibit lower error rates than the other methods.

7. Results

7.1 Varying the Error Overlap

From the errors in Sections 6.1 and 6.2 we saw that the first two (unsupervised) methods progressively deteriorate in predictive performance as the inherent error overlap is increased. On the other hand, the other methods maintain stable performance as the error overlap is increased. Surprisingly, the MDL methods perform worse with an error overlap of 1% than when the error overlap is 35%. In contrast, the errors in Sections 6.3 to 6.8 showed degenerative performance for the skewed distribution when the error overlap was increased, regardless of the discretization method used.

7.2 Varying the Number of Class Labels

Figure 44. Unsupervised (equal-interval) vs. supervised (Fayyad & Irani) average errors (Y-axis) for the exponential and uniform distributions, plotted against the number of class labels (X-axis) and error overlap (Z-axis).

The effect of increasing the number of class labels can be seen in Figure 44, where the exponential and uniform distribution errors are shown for both an unsupervised method (equal-interval) and a supervised method (Fayyad & Irani). When the number of class labels is increased from 2 to 3 and 5, the degenerative performance of the unsupervised methods becomes more pronounced: the error rate rises sharply for both the exponential and uniform distributions. This pattern was replicated across the board for all distributions when unsupervised discretization was applied. The supervised methods provided mixed outcomes when the number of class labels increased.
In Section 6.2 we saw that the uniform distribution errors remained static for supervised discretization using the Fayyad & Irani method even when the number of class labels increased. Only the normal distribution also followed this pattern (Section 6.1). All other distributions showed growth in the error rate similar to that exhibited by the supervised exponential errors in Figure 44.

7.3 Comparison of Overall Error Rates

Figure 45. Average root relative squared error for all distributions (NORM, UNI, LEP, PLAT, BI, SKEW, EXP, ZIPF) and discretization methods (EQ-INT, EQ-FR, 1R, C4.5, FI, KON).

Figure 45 shows a comparison of the performance of each discretization method as applied to each data distribution. Each bar represents the average root relative squared error, obtained by summing across all overlaps and numbers of class labels and dividing by 15. The equal-interval binning technique appears to be more effective when the data distribution is flatter, as with the uniform and platykurtic distributions. This concurs with the finding in [2] that this method of discretization is vulnerable to skewed data. Interestingly, the error rate of equal-interval binning converges to that of equal-frequency binning for flatter distributions, as depicted in Figures 23 to 25. Equal-interval binning often distributes instances unevenly across bins, with some bins containing many instances and others possibly containing none. Figure 45 demonstrates the degenerative performance of equal-interval binning when the distribution being discretized is less even. It is thus no surprise that the leptokurtic and Zipf distributions are the worst performers with this method, and the best performers are those with a more even distribution, i.e. the uniform and platykurtic distributions. Equal-frequency binning can likewise suffer from serious problems.
Suppose we have 1,000 instances and 10 intervals specified by the user, so each bin should receive 100 instances. If 105 instances have a value between 10 and 20, nominally corresponding to bin 2, then the 5 excess instances are grouped into the next bin. This sort of occurrence undermines knowledge discovery in the sense that the partitions lose their semantic value. On the whole, the equal-frequency binning method shows relative indifference to the distribution of the data. The OneR discretizer works well with both a normal and a uniform distribution. For all other distributions, the OneR error rate is roughly 20% higher relative to the MDL methods. The problem with an error-based method like OneR is that it may split intervals unnecessarily to minimize the error count, and consequently blurs the understanding of classification results. Furthermore, multiple attributes that require discretization cannot be used with this algorithm. The C4.5 discretizer was the equal best performer for the normal and uniform distributions. This concurs with [5], who showed that discretization causes only a small increase in error rate when the continuous feature is normally distributed. However, it was nearly 10% worse than the MDL entropy methods for the other six distributions. One factor that affects the overall tree generated is the confidence level; the default confidence parameter of 0.25 was used. Varying this value changes the level of pruning applied to the tree: the smaller the parameter, the greater the level of pruning. It is a balancing act to prevent overfitting while trying to optimize this parameter in the context of knowledge discovery [5]. The entropy-based MDL methods combine both local and global information during learning: they use a wrapper function (C4.5 in this case, providing local information) together with the MDL criterion (global). The MDL step pre-selects cut points and the classifier decides which cut point is most appropriate in the current context.
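The spill-over effect described above is easy to reproduce on hypothetical data. The values below are invented purely for illustration:

```python
# 1,000 instances, 10 requested bins of 100 each, but one value occurs
# 105 times, so instances sharing that value straddle a bin boundary.
values = sorted([15] * 105 + list(range(200, 1095)))  # 105 ties + 895 distinct
n, k = len(values), 10
cut_index = n // k   # naive equal-frequency boundary after 100 instances
# The instance at the boundary and its neighbour share the same value,
# so a naive equal-frequency split puts identical values in different
# bins -- the semantic loss described in the text.
assert values[cut_index - 1] == values[cut_index] == 15
```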
Using global discretization, as opposed to a purely local method (C4.5), makes the MDL methods less sensitive to variation from small fragmented data [2]. These methods were consistently high performers across all data distributions, as depicted in Figure 45.

7.4 Selection of the Number of Bins

In the experiments, k (the number of bins) was set to the default value of ten. The choice of k has a major effect on error rates [6]. Identifying the best value of k is a trial-and-error process and cannot be applied universally to each attribute. In some cases, error rates may even compare favourably to supervised discretization; however, finding the optimum cut-off thresholds by supervised discretization may involve on the order of a*2^n calculations, where a is the number of attributes and n is the number of instances [7]. The unpredictability of unsupervised methods, as shown in the overall summary error rates for all distributions, seems to make them a poor choice for discretization.

Figure 46. Error rate against the number of bins manually chosen for the equal-width binning strategy, for the normal distribution 2-class problem with 35% error overlap.

Figure 46 depicts the effect on the classification error rate when the number of bins for an unsupervised method is varied manually by the user. It shows that the default value of 10 bins (used in all runs) is not the selection that produces the smallest error rate in this instance, though it may be in others.

7.5 Identifying a Heuristic

Initial thoughts were that error rates were positively correlated with the level of kurtosis or skewness. The correlation coefficient for kurtosis against the root relative squared error was 0.45, indicating that there appears to be a significant relationship, as shown in Figure 47.
The relationship between skewness and the error rate, as depicted in Figure 48, was less pronounced, with a correlation coefficient of 0.22. Prudence is required in interpreting the size of these coefficients given the error measure used and the assumption of a linear relationship via a linear regression model. In general terms, we can deduce a positive relationship between the elevation (kurtosis) and asymmetry (skewness) of the data and the classification errors resulting from discretization. However, a word of caution: discretization methods do a poor job of separating the two underlying populations in a variable that has a bimodal distribution [9].

Figure 47. Kurtosis vs. average error rate: r2 = 0.45. Figure 48. Skewness vs. average error rate: r2 = 0.22.

On closer inspection of the shape of each distribution, it seems that distributions that have either:

- a long thin tail (the skewed, Zipf and exponential distributions), or
- a high peak/concentration of data (the leptokurtic and bimodal distributions)

benefit the least from discretization. Similarly, these distributions gain the least from using a supervised method compared to an unsupervised method of discretization. For instance, the root relative squared error rate can be improved by 86% for a normal or uniform distribution, compared to 19% for a skewed distribution and 25% for a bimodal distribution (Fig. 45).

8. Conclusion

From analysis of the 720 runs it was found that the distributions benefiting most from discretization, in ranked order, are: (1) uniform, (2) normal, (3) platykurtic, (4) bimodal, (5) exponential, (6) Zipf, (7) leptokurtic, (8) skewed.
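The coefficients quoted above are ordinary Pearson correlation coefficients. A sketch of how such a figure is computed (the sample pairs below are invented for illustration, not the thesis data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (kurtosis, average error) pairs: a positive coefficient
# here would support the heuristic described in the text.
kurtosis = [-1.2, -0.7, 0.0, 0.5, 2.2, 2.6, 3.3]
avg_error = [20.0, 25.0, 30.0, 38.0, 45.0, 55.0, 52.0]
r = pearson_r(kurtosis, avg_error)
```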
The higher-ranked distributions, namely the uniform and normal distributions, exhibit escalating error rates for the unsupervised discretization methods when the error overlap and/or the number of class labels increases. These unsupervised methods were shown to be highly susceptible to an arbitrarily selected number of bins. In contrast, the error rates for the supervised discretization methods remained constant as the error overlap and/or the number of class labels increased. At the lower end of the ranked distributions, the classification error rates grew as the error overlap grew, as expected, regardless of which discretization method was adopted. Unexpectedly, as more class labels were introduced, error rates increased at the lower error overlap levels whilst remaining relatively static at the higher end.

In summary, we have identified a heuristic that can determine the relative effectiveness of discretization a priori via data visualization, by establishing that a positive correlation exists between the level of kurtosis and skewness of a data distribution and the resulting classification error, with a caveat for bimodal distributions. In this work we examined classification errors based on a single discretized attribute; in future work we will look at combinations of discretized attributes.

9. References

[1] Catlett, J. (1991) "On Changing Continuous Attributes into Ordered Discrete Attributes", in Y. Kodratoff, ed., EWSL-91, Lecture Notes in Artificial Intelligence 482, pp. 164-178. Springer-Verlag, Berlin, Germany.
[2] Dougherty, J., Kohavi, R., Sahami, M. (1995) "Supervised and Unsupervised Discretization of Continuous Features", Machine Learning: Proceedings of the 12th International Conference. Morgan Kaufmann, Los Altos, CA.
[3] Fayyad, U.M., Irani, K.B.
(1993) "Multi-interval discretization of continuous-valued attributes for classification learning", in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022-1027. Morgan Kaufmann, Los Altos, CA.
[4] Holte, R.C. (1993) "Very simple classification rules perform well on most commonly used datasets", Machine Learning 11, pp. 63-90.
[5] Kohavi, R., Sahami, M. (1996) "Error-based and Entropy-based Discretization of Continuous Features", in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 114-119. AAAI Press, Menlo Park.
[6] Kononenko, I., Sikonja, M.R. (1995) "Discretization of continuous attributes using ReliefF", Proceedings of ERK'95, Portoroz, Slovenia.
[7] Pazzani, M. (1995) "An iterative improvement approach for the discretization of numeric attributes in Bayesian classifiers", KDD-95, pp. 228-233.
[8] Quinlan, J.R. (1993) "C4.5: Programs for Machine Learning", Morgan Kaufmann, Los Altos, CA.
[9] Scott, P.D., Williams, R.J., Ho, K.M. (1997) "Forming Categories in Exploratory Data Analysis and Data Mining", IDA 1997, pp. 235-246.
[10] Smith, P.J. (1997) "Into Statistics", Springer-Verlag, pp. 205-206, 304-306, 323-329.
[11] Witten, I., Frank, E. (2000) "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, pp. 80-82, 147-150, 238-246.

10. Appendix

Statistics of randomly generated normal distribution data with mean = 176, standard deviation = 8:

normal distribution (176, 8)
  Mean                     176.135      Range                       59.279
  Standard Error             0.079      Minimum                    146.654
  Median                   176.059      Maximum                    205.933
  Mode                     180.732      Sum                    1761350.518
  Standard Deviation         7.865      Count                    10000.000
  Sample Variance           61.851      Largest(1)                 205.933
  Kurtosis                  -0.001      Smallest(1)                146.654
  Skewness                   0.055      Confidence Level(95.0%)      0.154

Statistics for the uniform distribution generated:
uniform (146-206)
  Mean                     176.143      Range                       59.995
  Standard Error             0.174      Minimum                    146.000
  Median                   176.405      Maximum                    205.995
  Mode                     194.984      Sum                    1761431.341
  Standard Deviation        17.352      Count                    10000.000
  Sample Variance          301.097      Largest(1)                 205.995
  Kurtosis                  -1.209      Smallest(1)                146.000
  Skewness                  -0.017      Confidence Level(95.0%)      0.340

Statistics for the leptokurtic distribution generated:

leptokurtic
  Mean                      88.215      Range                       85.000
  Standard Error             0.099      Minimum                     45.000
  Median                    87.000      Maximum                    130.000
  Mode                      87.000      Sum                     882145.000
  Standard Deviation         9.881      Count                    10000.000
  Sample Variance           97.629      Largest(1)                 130.000
  Kurtosis                   3.321      Smallest(1)                 45.000
  Skewness                  -0.167      Confidence Level(95.0%)      0.194

Statistics for the platykurtic distribution generated:

platykurtic
  Mean                      88.698      Range                      117.000
  Standard Error             0.269      Minimum                     30.000
  Median                    90.000      Maximum                    147.000
  Mode                      90.000      Sum                     886977.000
  Standard Deviation        26.893      Count                    10000.000
  Sample Variance          723.210      Largest(1)                 147.000
  Kurtosis                  -0.676      Smallest(1)                 30.000
  Skewness                  -0.012      Confidence Level(95.0%)      0.527

Statistics for the bimodal distribution generated:

bimodal (male 90, female 65)
  Mean                      75.610      Range                      105.000
  Standard Error             0.158      Minimum                     25.000
  Median                    75.000      Maximum                    130.000
  Mode                      65.000      Sum                     756104.000
  Standard Deviation        15.840      Count                    10000.000
  Sample Variance          250.894      Largest(1)                 130.000
  Kurtosis                  -0.428      Smallest(1)                 25.000
  Skewness                  -0.004      Confidence Level(95.0%)      0.310

Statistics for the skewed distribution generated:

skewed distribution (skewness 1.396)
  Mean                      78.063      Range                      105.000
  Standard Error             0.179      Minimum                     50.000
  Median                    75.000      Maximum                    155.000
  Mode                      75.000      Sum                     780625.000
  Standard Deviation        17.913      Count                    10000.000
  Sample Variance          320.871      Largest(1)                 155.000
  Kurtosis                   2.633      Smallest(1)                 50.000
  Skewness                   1.396      Confidence Level(95.0%)      0.351

Statistics for the exponential distribution generated:
exponential distribution
  Mean                      15.668      Range                       72.5
  Standard Error             0.135      Minimum                      2.5
  Median                    12.500      Maximum                     75
  Mode                       2.500      Sum                     156680
  Standard Deviation        13.526      Count                    10000
  Sample Variance          182.948      Largest(1)                  75
  Kurtosis                   2.165      Smallest(1)                  2.5
  Skewness                   1.492      Confidence Level(95.0%)      0.265

Statistics for the approximated Zipf distribution generated:

Zipf distribution
  Mean                      16.312      Range                       66
  Standard Error             0.170      Minimum                      0
  Median                     8          Maximum                     66
  Mode                       2          Sum                     163124
  Standard Deviation        17.077      Count                    10000
  Sample Variance          291.645      Largest(1)                  66
  Kurtosis                   0.522      Smallest(1)                  0
  Skewness                   1.261      Confidence Level(95.0%)      0.334