Observations

Interval classifiers are a relatively new area of database research; they started to be investigated in the early nineties. But they were used before that, and they originate from another computer science domain – artificial intelligence. In AI, interval classifiers are better known as machine learning algorithms, studied in one of the fundamental areas of AI research – machine learning. Since research in this area has been very intensive over the last two decades, many machine learning algorithms have been developed and many papers and books written about the topic. So, although new to database research, interval classifiers have a strong background, and the results collected in AI research should be used, taking the specifics of database applications into account, for the further development of this area of database research. The specifics of interval classifiers in the database domain are:
- Very large data sets are used, as in any data mining application. This means a wealth of data for training sets, hopefully helping to reduce classification error.
- The algorithms have to be very efficient in order to be embedded in interactive systems and to answer ad hoc queries about attributes with missing values. This was not an important issue for machine learning in AI, since a decision tree was generated only once and then used over and over again.

One of the first machine learning algorithms made specifically for data mining applications, in 1992, was IC.

IC Algorithm

We have two data sets: E – the training set, where each tuple has a group label, and D – the test data set, where there are no group labels. Except for the group labels, these two sets have identical attributes, which can be categorical or numerical. The goal is to use only the training data set to produce classification functions for each group label, and then use these functions to determine the missing group labels in the test data set. IC is a tree classifier, which means that it generates a decision tree in which each path to one of the leaves represents one classification function. In this decision tree the leaves are group labels. IC is not the first tree classifier; the best-known tree classifiers are ID3 and CART, both developed for machine learning purposes. The main difference between these two classifiers and IC is that they develop a binary tree for numeric attributes, while IC develops a k-ary tree. This should result in a considerable efficiency improvement, since k-ary trees are smaller, traversal is simpler and faster, and there are no multiple tests of the same attribute. The second big difference is that while ID3 and CART first develop a complete binary tree and then prune it in order to avoid overfitting the data, IC does dynamic pruning.

Implementation

As planned, I implemented the basic IC algorithm using GNU C, but for now my program works only for categorical and discrete numerical attributes. At first I wanted to test different smoothing methods for continuous numeric data, but since I am not an expert in numerical analysis I decided not to do this. Because of this I did not use attributes with continuous values at all, and instead concentrated only on attributes with discrete values, in order to have more time to investigate the influence of various parameters on the performance of the algorithm. I mostly followed the results from the paper [1] and implemented only those methods which were shown to be algorithmically inexpensive but still relatively effective. The main procedure in the IC algorithm is make_tree, a recursive procedure that produces the decision tree.
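To make the setup concrete, here is a minimal sketch, in the spirit of my GNU C implementation, of data structures that could represent the tuples of the training set E and test set D and the k-ary decision tree. All names and sizes here are illustrative assumptions, not the exact code:

    #define MAX_ATTRS 20              /* assumed upper bound on attributes */

    /* One tuple: discrete attribute values coded as small integers.
     * In the training set E the group label is known; in the test
     * set D it is initially unknown (-1). */
    typedef struct {
        int values[MAX_ATTRS];
        int group;                    /* group label index, or -1 */
    } Tuple;

    /* A k-ary decision tree node: the winning attribute is tested,
     * and each interval of its values leads either to a subtree
     * (weak interval) or directly to a group label (strong interval). */
    typedef struct Node {
        int attr;                     /* winning attribute at this node */
        int num_intervals;            /* k – the number of branches */
        int *lo, *hi;                 /* value range of each interval */
        int *group;                   /* winning group per interval */
        struct Node **child;          /* subtree per weak interval, else NULL */
    } Node;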
The make_tree procedure basically consists of the following functions:

    make_tree() {
        if (stopping criteria are satisfied)
            return NULL;
        make_histograms();
        winner_attr = next_attr();
        make_intervals();
        for (each weak interval)
            make_tree();   /* recursive call for the tuples in that interval */
        return Node;
    }

make_histograms() – for every value of every categorical attribute, this function counts the number of tuples with each group label that have that value of the corresponding attribute. The output of this function is a three-dimensional array histograms[i][j][k], where i is the ordinal number of the group label, j is the ordinal number of the attribute, and k is the ordinal number of the attribute's value.

next_attr() – for every attribute, a goodness function is computed using the histograms from the previous procedure. Many different functions can be chosen as the goodness function, but I chose the resubstitution error rate because it is computationally cheap and performs only slightly more poorly than other, more complicated functions. The resubstitution error rate for an attribute is computed as

    1 - (sum over values v of winner_freq(v)) / total_freq

where winner_freq(v) is the frequency of the most frequent group label for the attribute value v, and total_freq is the total frequency of all groups over all values of this attribute. The attribute with the best goodness value – the smallest resubstitution error rate – is chosen as the winning attribute (a small sketch of this computation is shown after the function descriptions). The winning attribute, along with its intervals calculated in the next function, will be the next node in the further expansion of the decision tree.

make_intervals() – for every value of the winning attribute, determine the frequency of each group label (these data already exist in the array histograms[i][j][k]). The group with the largest frequency is selected as the winning group for that value. The next step is to determine whether this winner is strong or weak for each value v of the attribute. A winner is strong if the ratio of the frequency of the winning group to the total frequency for the value v of the winning attribute is above a certain precision threshold. This threshold can be fixed, or it can be an adaptive function of the current depth of the decision tree. The results presented in the paper [1] show that an adaptive threshold has only a very small advantage in classification accuracy, and it is obviously more complicated to compute. That is why I chose a fixed, user-specified precision threshold. Once we have the strength of each value of the winning attribute, we can join adjacent values which have the same winning group with the same strength, and so form intervals. Of course, this makes sense only if the values can be ordered. So, as output we have intervals of the winning attribute, and each of these intervals has a winning group with a corresponding strength. Each of these intervals forms a new branch of the classification tree.

make_tree() – there are three stopping conditions for this recursive procedure:
- if an interval has a strong winning group, then this branch is not expanded any further;
- if the number of tuples belonging to a weak interval is less than a user-specified number, then further expansion stops for that interval;
- if the maximum depth of the tree has been reached.

Since the results show that the classification error does not change much once the depth of the tree exceeds 5, I decided to make the maximum depth of the tree one of the input parameters, in order to test the performance of the algorithm with different tree depths and find the optimal one.
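To illustrate the goodness computation described above, here is a minimal sketch of how next_attr() might evaluate the resubstitution error rate over the histograms array. The dimension constants are assumptions for illustration; this is not the report's actual code:

    #define NUM_GROUPS 3    /* number of group labels (assumed) */
    #define NUM_ATTRS  5    /* number of attributes (assumed) */
    #define NUM_VALUES 10   /* max distinct values per attribute (assumed) */

    /* histograms[i][j][k]: count of tuples with group label i whose
     * attribute j has value k, as produced by make_histograms() */
    int histograms[NUM_GROUPS][NUM_ATTRS][NUM_VALUES];

    /* Return the index of the attribute with the smallest resubstitution
     * error rate: 1 - (sum over values v of winner_freq(v)) / total_freq */
    int next_attr(void)
    {
        int best_attr = -1;
        double best_error = 2.0;          /* error rate is always <= 1 */

        for (int j = 0; j < NUM_ATTRS; j++) {
            long winner_sum = 0, total = 0;
            for (int k = 0; k < NUM_VALUES; k++) {
                int winner_freq = 0;      /* most frequent group for value k */
                for (int i = 0; i < NUM_GROUPS; i++) {
                    total += histograms[i][j][k];
                    if (histograms[i][j][k] > winner_freq)
                        winner_freq = histograms[i][j][k];
                }
                winner_sum += winner_freq;
            }
            if (total > 0) {
                double error = 1.0 - (double)winner_sum / (double)total;
                if (error < best_error) {
                    best_error = error;
                    best_attr = j;
                }
            }
        }
        return best_attr;
    }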
Besides the functions described in the paper [1], I also wrote print_tree(), for printing the decision tree, and test(), which classifies the test data set in order to calculate the classification error – the fraction of instances in the test data set that are incorrectly classified. Function print_tree() prints the tree depth-first, but the tree is easy to read since its levels are marked.

Input parameters for IC

Instead of creating training and test data sets by myself, as I first intended, I managed to find various data sets created or collected specifically for machine learning purposes. The only thing I had to do was divide these data sets into a training set and a test set. I allowed the size of the training set to be one of the input parameters of the algorithm. The idea was to test the accuracy of the algorithm depending on the ratio of the data used for the training set to the data used for the test set. Also, since my data sets were relatively small, I thought the choice of tuples for the training set would be important. Two options are offered for extracting the training set from the original machine learning database (a minimal sketch follows the data set descriptions below):
1. choosing the training set randomly from the data set, or
2. choosing equidistant tuples for the training set.

The next input parameter, which I have already mentioned, is the precision threshold. The value of the threshold should be chosen from the range [0,1], where, of course, only values close to 1 will produce accurate decision trees. The minimum number of tuples which must belong to a weak interval in order for it to be expanded further is also one of the parameters. We could avoid this parameter and allow an interval to be expanded further as long as at least one tuple belongs to it, but then the tree would be forced to expand much deeper, with no improvement in accuracy, since the percentage of tuples following such a path is very low. And obviously the computational cost would be much higher. The last input parameter of my implementation of the IC algorithm is the maximum tree depth. When this depth is reached, further expansion of the tree stops and all weak intervals in a leaf node are treated as strong.

Results

For testing my program I used three data sets found in the database of machine learning data sets on the WWW site http://axon.cs.byu.edu/~martinez/470/programs/MLDB/:
- Tumor – a data set with 10 attributes and 264 tuples. The first attribute is called class, and its values are the group labels. There are two group labels: Benign and Malignant. The remaining 9 attributes are discrete numeric attributes, and each of them has 10 different values.
- Zoo – a data set about animal species, with 17 attributes and 90 tuples. The first attribute has 7 values – the group labels. The remaining attributes, except for one discrete numeric attribute, are boolean and take only the values 0 and 1.
- Hayes – has 5 attributes, all discrete numeric, and 93 tuples. There are 3 group labels.

I made a small modification to each of these three data sets, adding for each attribute a new characteristic – whether or not it is orderable. I needed this information in order to make decisions about interval formation.
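Before turning to the experiments, here is a minimal sketch of the two training-set extraction options listed under the input parameters above. The function name and interface are assumptions for illustration, not the report's actual code:

    #include <stdlib.h>

    /* Mark train_size of the n tuples as training tuples; the rest form
     * the test set. in_train[t] is set to 1 for training tuples, 0 for
     * test tuples. Assumes 0 < train_size <= n. */
    void split_training_set(int n, int train_size, int choose_random,
                            char *in_train)
    {
        for (int t = 0; t < n; t++)
            in_train[t] = 0;

        if (choose_random) {
            /* option 1: pick distinct tuples at random */
            int chosen = 0;
            while (chosen < train_size) {
                int t = rand() % n;      /* retry on duplicates */
                if (!in_train[t]) {
                    in_train[t] = 1;
                    chosen++;
                }
            }
        } else {
            /* option 2: pick equidistant tuples, every (n/train_size)-th */
            int step = n / train_size;
            for (int t = 0, chosen = 0; chosen < train_size;
                 t += step, chosen++)
                in_train[t] = 1;
        }
    }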
I did experiments with various input parameters in order to reduce the classification error, but in general my average classification error was much higher than the paper suggests – usually between 20% and 40%. I think the main reason is that I used very small data sets (my training sets usually had 20-50 tuples), while the results described in [1] were obtained on data sets containing 10000 tuples (2500 for the training set) and more. Also, I think that the number of attributes and the number of values per attribute should be considered when choosing the optimal training set size: the more attributes and attribute values are present, the larger the training set required for an optimal classification error.

First, I did some experiments with the size of the training data set and the way the training tuples were selected from the database. The results are shown in Figure 1 and Figure 2.

[Figure 1: Percentage of classification error (Tumor, Zoo, Hayes; equidistant vs. random selection) when the training size is 20% of the data set]

[Figure 2: Percentage of classification error (Tumor, Zoo, Hayes; equidistant vs. random selection) when the training size is 40% of the data set]

It seems that the results are very unstable depending on the choice of tuples, but the reason is that my training set sizes were very small. I am sure that with much larger training sets the results would be more stable. Also, for larger data sets it probably would not matter how the tuples were selected from the database. These results also show what was expected: with a larger percentage of the data set selected as the training set, we get a smaller classification error.

Next, I kept the training size fixed (25% of the data set) and varied the precision threshold. The results are shown in Figure 3.

[Figure 3: Percentage of classification error (Tumor, Zoo, Hayes) for threshold values 0.6, 0.8 and 1]

The results show that the classification error did not change for different values of the threshold, except for a drastic decrease on the Tumor database when the threshold was 1. I repeated the experiment with the same value for Tumor and mostly got similar results. Again, I believe the explanation is that my training sets are small and there are very few or no tuples for some attribute values. The ratio of the frequency of the winning group to the total frequency for one value of the winning attribute is usually high, and decreasing the value of the threshold will not change the strength of the winner.

The results for the minimum number of tuples an interval must contain in order to be expanded showed that, for my data sets, the best accuracy is achieved with values less than or equal to 3. For larger databases, larger values would probably be more efficient. The results are shown in Figure 4.

[Figure 4: Percentage of classification error (Tumor, Zoo, Hayes) for minimum numbers of tuples 0, 3, 5 and 7; the size of the training set is 30% of the database and the threshold is 0.9]

For the last input parameter – the maximum depth of the tree – I got almost the same classification error for tree depths of 2, 4, 6 and 8; the important point is that the error did not decrease significantly with the maximum tree depth. This is a very important result, since it means that we can develop decision trees of relatively shallow depth which perform almost the same as larger trees, and the computational cost will be much lower.

Conclusion

My implementation, tested for classification accuracy, performed more poorly than the IC tested in paper [1]. The results obtained in paper [1] show a classification error below 20%, while my program usually resulted in a classification error in the range of 20% to 40%.
As I said, the main reason must be the drastic difference between the sizes of our data sets. Also, since I borrowed already existing machine learning data sets, I cannot be sure how much noise they contained. Furthermore, I could not test my program for efficiency the way it was tested in paper [1]: I could not obtain a free copy of the ID3 or CART packages in order to compare their performance to that of my IC implementation. The reason is that they are commercial products (especially ID3), heavily advertised and sold all over the world. I recently read that the author of ID3, Ross Quinlan, has upgraded the old version of his algorithm, and his new version, commercially called C5.0, is claimed to be 5-10 times faster than the old one. I think this is interesting news from the perspective of data mining.

My opinion is that there is still a lot of research material in this area. Starting from machine learning algorithms as a basis, they can be modified in various ways according to the specifics of the domain in which they are used. The characteristics of the data sets used for the training set and the test set also play a big role in the performance of the algorithm. I believe that for each specific database the input parameters can be tuned in such a way that the classification error is maximally reduced. If interval classifiers are going to be used in interactive loops, I think a good idea would be to make the generation of the tree dynamic, letting the algorithm "learn from its own mistakes". One way would be to include all misclassified tuples in the next training set.

In this project report I have also included printouts of the three databases used and two program outputs with decision trees and classification errors.

References

[1] Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer and Arun Swami, "An Interval Classifier for Database Mining Applications".
[2] Jiawei Han, Yandong Cai and Nick Cercone, "Knowledge Discovery in Databases: An Attribute-Oriented Approach".