Interval Classifiers for Data Mining

Observations
Interval classifiers are a relatively new area of database research, first investigated
in the early nineties. They were in use before that, however, and they originate in another
computer science domain: artificial intelligence. In AI, interval classifiers are better
known as machine learning algorithms, studied in machine learning, one of the fundamental
areas of AI research. Research in this area has been very intensive over the last two
decades, so many machine learning algorithms have been developed and many papers and
books written on the topic. So, although new in database research, interval classifiers
have a strong background, and the results collected in AI research should be used, with
the specifics of database applications in mind, for the further development of this area
of database research.
The specifics of interval classifiers in the database domain are:
- Very large data sets are used, as in any data mining application. This means a wealth
of data for training sets, which should help reduce classification error.
- The algorithms have to be very efficient in order to be embedded in interactive
systems and to answer ad hoc queries about attributes with missing values. This was not
an important issue in AI machine learning, since decision trees were generated only
once and then used over and over again.
One of the first machine learning algorithms designed specifically for data mining
applications was IC, introduced in 1992.
IC Algorithm
We have two data sets: E, the training set, in which each tuple has a group label, and D,
the test data set, which has no group labels. Except for the group labels, the two sets have
identical attributes, which can be categorical or numerical. The goal is to use only the
training data set to produce a classification function for each group label and then to use
these functions to determine the missing group labels in the test data set.
IC is a tree classifier, which means that it generates a decision tree in which each path
to a leaf represents one classification function. The leaves of this decision tree are
group labels.
IC is not the first tree classifier; the best-known tree classifiers are ID3 and CART, both
developed for machine learning purposes. The main difference between these two
classifiers and IC is that they develop a binary tree for numeric attributes, while IC
develops a k-ary tree. This should result in a considerable efficiency improvement, since
k-ary trees are smaller, traversal is simpler and faster, and there are no multiple tests
for the same attribute.
The second big difference is that ID3 and CART first develop a complete binary tree
and then prune it in order to avoid overfitting the data, while IC prunes dynamically.
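To make the k-ary structure concrete, here is a minimal sketch of how such a tree could be stored and traversed in C. The `Node` layout, the `MAX_BRANCHES` bound and the `classify` function are assumptions made for illustration; they are not taken from paper [1] or from the program described in this report. Each node tests one attribute once and picks the interval that contains the tuple's value, so one path from the root to a leaf corresponds to one classification function.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BRANCHES 8   /* hypothetical upper bound on k, for this sketch only */

/* One node of a k-ary decision tree (illustrative layout). */
typedef struct Node {
    int attr;                          /* index of the attribute tested here */
    int n_branches;                    /* k: number of intervals at this node */
    int lo[MAX_BRANCHES];              /* lower bound of each interval, inclusive */
    int hi[MAX_BRANCHES];              /* upper bound of each interval, inclusive */
    int label[MAX_BRANCHES];           /* winning group label of each interval */
    struct Node *child[MAX_BRANCHES];  /* subtree of an expanded interval, NULL for a leaf */
} Node;

/* Follows one path from the root and returns the group label found at the
   leaf, or -1 if no interval covers the tuple's value. */
int classify(const Node *node, const int *values)
{
    assert(values != NULL);
    while (node != NULL) {
        int v = values[node->attr];
        int b = -1;
        for (int i = 0; i < node->n_branches; i++) {
            if (v >= node->lo[i] && v <= node->hi[i]) { b = i; break; }
        }
        if (b < 0) return -1;                               /* value not covered */
        if (node->child[b] == NULL) return node->label[b];  /* leaf reached */
        node = node->child[b];                              /* descend one level */
    }
    return -1;
}
```

Because every node has one branch per interval, an attribute is tested at most once per level, which is the property the comparison with binary trees relies on.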
Implementation
As planned, I implemented the basic IC algorithm in GNU C, but for now my program
works only for categorical and discrete numerical attributes. At first I wanted to test
different smoothing methods for continuous numeric data, but since I am not an expert in
numerical analysis I decided not to do this. I therefore did not use attributes with
continuous values at all and concentrated on attributes with discrete values, in order to
have more time to investigate the influence of various parameters on the performance
of the algorithm.
I mostly followed the results from paper [1] and implemented only those methods that
were shown to be algorithmically inexpensive but still relatively effective.
The main procedure in the IC algorithm is make_tree, a recursive procedure that produces
the decision tree. Basically, it consists of the following functions:
make_tree()
{
    if the stopping criterion is satisfied
        return NULL;
    else {
        make_histograms();
        winner_attr = next_attr();
        make_intervals();
        for each weak interval
            make a recursive call to make_tree() for the tuples in that interval;
        return Node;
    }
}
make_histograms(): for every value of every categorical attribute, this function
counts the number of tuples for every group label that have that value for the corresponding
attribute. The output of this function is a three-dimensional array histograms[i][j][k],
where i is the ordinal number of the group label, j is the ordinal number of the attribute,
and k is the ordinal number of the value of the attribute.
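The counting step can be sketched in a few lines of C. The `Tuple` struct, the `MAX_*` bounds and the function signature below are assumptions chosen for this illustration (the report does not show its data layout); only the index order histograms[i][j][k] follows the description above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical fixed bounds, chosen only for this sketch. */
#define MAX_GROUPS 8
#define MAX_ATTRS  16
#define MAX_VALUES 16

/* A tuple with integer-coded attribute values and a group label. */
typedef struct {
    int group;             /* ordinal number of the group label */
    int attr[MAX_ATTRS];   /* ordinal number of the value of each attribute */
} Tuple;

/* h[i][j][k]: number of tuples of group i whose attribute j has value k. */
typedef int Histograms[MAX_GROUPS][MAX_ATTRS][MAX_VALUES];

void make_histograms(const Tuple *tuples, int n_tuples, int n_attrs, Histograms h)
{
    assert(n_attrs <= MAX_ATTRS);
    memset(h, 0, sizeof(Histograms));      /* start from all-zero counts */
    for (int t = 0; t < n_tuples; t++) {
        int i = tuples[t].group;
        assert(i >= 0 && i < MAX_GROUPS);
        for (int j = 0; j < n_attrs; j++) {
            int k = tuples[t].attr[j];
            assert(k >= 0 && k < MAX_VALUES);
            h[i][j][k]++;                  /* one pass over the data fills every count */
        }
    }
}
```

A single pass over the tuples is enough, which is one reason histogram-based goodness functions are cheap to evaluate.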
next_attr(): for every attribute, a goodness function is computed using the histograms
from the previous procedure. Many different functions can be chosen as the goodness
function, but I chose the resubstitution error rate because it is computationally cheap
and performs only slightly worse than other, more complicated functions.
The resubstitution error rate for an attribute is computed as
$$1 - \sum_{v} \frac{\mathrm{winner\_freq}(v)}{\mathrm{total\_freq}}$$
where winner_freq(v) is the frequency of the most frequent group label for the attribute
value v, and total_freq is the total frequency of all groups over all values of this attribute.
The attribute with the best value of the goodness function (for an error rate, the smallest)
is chosen as the winning attribute. The winning attribute, along with its intervals
calculated in the next function, will be the next node in the further expansion of the
decision tree.
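As an illustration, the resubstitution error rate of a single attribute can be computed from its slice of the histogram as follows. The function name, the `counts[i][k]` layout (group i, value k) and the `MAX_*` bounds are assumptions made for this sketch, not the report's actual code.

```c
#include <assert.h>

/* Hypothetical fixed bounds, chosen only for this sketch. */
#define MAX_GROUPS 8
#define MAX_VALUES 16

/* counts[i][k]: number of tuples of group i that have value k of one attribute.
   Returns 1 - (sum over v of winner_freq(v)) / total_freq, or 0 if there are no tuples. */
double resubstitution_error(int counts[][MAX_VALUES], int n_groups, int n_values)
{
    int total = 0;     /* total_freq: all groups over all values */
    int winners = 0;   /* sum of winner_freq(v) over all values v */
    assert(n_groups <= MAX_GROUPS && n_values <= MAX_VALUES);
    for (int k = 0; k < n_values; k++) {
        int best = 0;  /* frequency of the most frequent group for value k */
        for (int i = 0; i < n_groups; i++) {
            total += counts[i][k];
            if (counts[i][k] > best) best = counts[i][k];
        }
        winners += best;
    }
    return total > 0 ? 1.0 - (double)winners / total : 0.0;
}
```

For example, with two groups and two values, counts of (3, 1) for the first value and (0, 4) for the second give winner frequencies 3 and 4 out of 8 tuples, so the error rate is 1 - 7/8 = 0.125.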
make_intervals(): for every value of the winning attribute, determine the frequency of
each group label (these data already exist in the array histograms[i][j][k]). The group
with the largest frequency is selected as the winning group for that value. The next step
is to determine, for each value v of the attribute, whether this winner is strong or weak.
A winner is strong if the ratio of the frequency of the winning group to the total
frequency for the value v of the winning attribute is above a certain precision threshold.
This threshold can be fixed, or it can be an adaptive function of the current depth of the
decision tree. The results presented in paper [1] show that an adaptive threshold has a
very small advantage in classification accuracy, but it is obviously more complicated to
compute. That is why I chose a fixed, user-specified precision threshold.
Once we have the strength for each value of the winning attribute, we can join adjacent
values that have the same winning group with the same strength and form intervals. Of
course, this makes sense only if the values can be ordered.
So, as output we have the intervals of the winning attribute, and each of these intervals
has a winning group with a corresponding strength. Each of these intervals forms a new
branch of the classification tree.
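The interval formation described above can be sketched as follows. The `Interval` struct, the signature and the `MAX_*` bounds are assumptions for illustration. Two further choices are assumptions as well: the comparison with the threshold is written as `>=` so that a threshold of 1 can still yield strong winners, and a value with no tuples is treated as weak with winner -1.

```c
#include <assert.h>

/* Hypothetical fixed bounds, chosen only for this sketch. */
#define MAX_GROUPS 8
#define MAX_VALUES 16

typedef struct {
    int lo, hi;   /* first and last value (ordinal numbers) covered, inclusive */
    int winner;   /* winning group, or -1 if no tuple has these values */
    int strong;   /* 1 if the winner is strong, 0 if it is weak */
} Interval;

/* counts[i][k]: tuples of group i with value k of the winning attribute.
   Writes the intervals to out (room for n_values entries) and returns their number. */
int make_intervals(int counts[][MAX_VALUES], int n_groups, int n_values,
                   double threshold, int orderable, Interval *out)
{
    int n = 0;
    assert(n_groups <= MAX_GROUPS && n_values <= MAX_VALUES);
    for (int k = 0; k < n_values; k++) {
        int total = 0, best = 0, winner = -1;
        for (int i = 0; i < n_groups; i++) {
            total += counts[i][k];
            if (counts[i][k] > best) { best = counts[i][k]; winner = i; }
        }
        /* Strong when winner frequency / total frequency reaches the threshold
           (">=" is an assumption of this sketch). */
        int strong = total > 0 && (double)best / total >= threshold;
        if (orderable && n > 0 &&
            out[n - 1].winner == winner && out[n - 1].strong == strong) {
            out[n - 1].hi = k;   /* same winner and strength: extend the interval */
        } else {
            out[n].lo = out[n].hi = k;
            out[n].winner = winner;
            out[n].strong = strong;
            n++;
        }
    }
    return n;
}
```

For an attribute that is not orderable, the `orderable` flag is 0 and every value stays in its own interval, which matches the remark that joining values only makes sense when they can be ordered.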
make_tree(): there are three stopping conditions for this recursive procedure:
- if an interval has a strong winning group, this branch is not expanded any further;
- if the number of tuples belonging to a weak interval is less than a user-specified
number, further expansion stops for that interval;
- if the maximum depth of the tree has been reached. Since the results show that the
classification error does not change much once the depth of the tree exceeds 5, I
made the maximum depth of the tree one of the input parameters, in order to test
the performance of the algorithm with different tree depths and to find the optimum one.
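The three stopping conditions can be combined into one small predicate. The `StopParams` struct and the `should_stop` name are hypothetical, and counting the root as depth 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* User-specified parameters that control tree expansion (illustrative names). */
typedef struct {
    int min_tuples;   /* minimum number of tuples needed to expand a weak interval */
    int max_depth;    /* maximum depth of the tree */
} StopParams;

/* Returns 1 if the interval must not be expanded further, 0 otherwise. */
int should_stop(int strong, int n_tuples, int depth, const StopParams *p)
{
    assert(p != NULL);
    if (strong) return 1;                     /* strong winner: branch ends here */
    if (n_tuples < p->min_tuples) return 1;   /* too few tuples in the weak interval */
    if (depth >= p->max_depth) return 1;      /* maximum depth reached */
    return 0;
}
```

With `min_tuples` set to 3 and `max_depth` set to 5, for example, a weak interval holding 10 tuples at depth 2 would be expanded, while the same interval at depth 5 would not.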
Besides the functions described in paper [1], I also wrote print_tree(), which prints
the decision tree, and test(), which runs the test data set through the tree in order to
calculate the classification error: the fraction of instances in the test data set that are
incorrectly classified. print_tree() prints the tree depth-first, but the output is easy
to read since the levels of the tree are marked.
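The classification error itself is a simple ratio. A minimal sketch, assuming the predicted and the actual group labels of the test tuples are available as two integer arrays (the function name and signature are hypothetical, not those of the test() function mentioned above):

```c
#include <assert.h>

/* Fraction of test tuples whose predicted group label differs from the actual one. */
double classification_error(const int *predicted, const int *actual, int n)
{
    int wrong = 0;
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        if (predicted[i] != actual[i]) wrong++;
    }
    return (double)wrong / n;
}
```

Multiplying the result by 100 gives the percentage used when reporting the results.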
Input parameters for IC
Instead of creating the training and test data sets myself, as I had planned at first, I
managed to find various data sets created or collected specifically for machine learning
purposes. The only thing I had to do was to divide these data sets into a training set and
a test set. I made the size of the training set one of the input parameters of the
algorithm. The idea was to test the accuracy of the algorithm as a function of the ratio of
the data used for the training set to the data used for the test set.
Also, since my data sets were relatively small, I thought that the choice of tuples for the
training set would be important. Two options are offered for extracting the training set
from the original machine learning database:
1. choosing the training set randomly from the data set, or
2. choosing equidistant tuples for the training set.
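The two selection options can be sketched as index selection over a data set of n tuples. Both function names and the use of the standard `rand()` generator are assumptions made for illustration; the report does not say how its random choice was implemented.

```c
#include <assert.h>
#include <stdlib.h>

/* Option 2: writes m equally spaced tuple indices from 0..n-1 into out. */
void select_equidistant(int n, int m, int *out)
{
    assert(m > 0 && m <= n);
    for (int s = 0; s < m; s++) {
        out[s] = (int)((long)s * n / m);
    }
}

/* Option 1: writes m distinct random indices from 0..n-1 into out, using a
   partial Fisher-Yates shuffle. Returns 0 on success, -1 if allocation fails. */
int select_random(int n, int m, int *out)
{
    assert(m > 0 && m <= n);
    int *idx = malloc((size_t)n * sizeof *idx);
    if (idx == NULL) return -1;
    for (int i = 0; i < n; i++) idx[i] = i;
    for (int s = 0; s < m; s++) {
        /* Pick one of the not yet chosen indices; the slight modulo bias
           of rand() % range is ignored in this sketch. */
        int r = s + rand() % (n - s);
        int tmp = idx[s];
        idx[s] = idx[r];
        idx[r] = tmp;
        out[s] = idx[s];
    }
    free(idx);
    return 0;
}
```

The tuples whose indices are not selected would form the test set. Calling `srand()` with a different seed before `select_random` gives a different training set on each run, which is one way to obtain an average over several random selections.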
The next input parameter, which I have already mentioned, is the precision threshold.
The value of the threshold should be chosen from the range [0,1], where, of course, only
values close to 1 will produce accurate decision trees.
The minimum number of tuples that must belong to a weak interval for it to be expanded
further is also one of the parameters. We could avoid this parameter and allow an
interval to be expanded further if at least one tuple belongs to it, but this would force
the tree to expand much deeper with no improvement in accuracy, since the percentage of
tuples following that path is very low. The computational cost would obviously also be
much higher.
The last input parameter of my implementation of the IC algorithm is the maximum tree
depth. When this depth is reached, further expansion of the tree stops and all weak
intervals in a leaf node are treated as strong.
Results
For testing my program I used three data sets found in the database of machine learning
data sets on the WWW site http://axon.cs.byu.edu/~martinez/470/programs/MLDB/:
Tumor: a data set with 10 attributes and 264 tuples. The first attribute is called class,
and its values are the group labels. There are two group labels: Benign and Malignant.
The remaining 9 attributes are discrete numeric attributes, each with 10 different values.
Zoo: a data set about animal species, with 17 attributes and 90 tuples. The first attribute
has 7 values, which are the group labels. The remaining attributes, except for one discrete
numeric attribute, are boolean and take only the values 0 and 1.
Hayes: a data set with 5 attributes, all discrete numeric, and 93 tuples. There are 3
group labels.
I made a small modification to each of these three data sets, adding for each attribute a
new characteristic: whether it is orderable or not. I needed this information in order to
decide about interval formation.
I experimented with various input parameters in order to reduce the classification error,
but in general my average classification error was much higher than the paper suggested.
It was usually between 20% and 40%. I think the main reason is that I used very small
data sets (my training data sets usually had 20 to 50 tuples), while the results described
in [1] were obtained on data sets containing 10000 tuples (2500 for the training set) and
more. Also, I think that the number of attributes and the number of values of each
attribute should be considered when choosing the optimum training set size: the more
attributes and attribute values are present, the larger the training set needed for
optimal classification error.
First, I experimented with the size of the training data set and the way the training
tuples were selected from the database. The results are shown in Figure 1 and
Figure 2.
[Chart not recovered. Vertical axis: 0 to 50; series: Tumor, Zoo, Hayes; horizontal axis: equidistant, random average.]
Figure 1: Percentage of classification error where training size is 20% of data set
[Chart not recovered. Vertical axis: 0 to 50; series: Tumor, Zoo, Hayes; horizontal axis labels as extracted: equidistant, random equidistant.]
Figure 2: Percentage of classification error where training size is 40% of data set
The results seem very unstable with respect to the choice of tuples, but the reason is
that my training sets were very small. I am sure that with much larger training sets the
results would be more stable, and that for larger data sets it probably would not matter
how the tuples were selected from the database.
These results also show what was expected: with a larger percentage of the data set
selected as the training set, the classification error is lower.
Next, I kept the training size fixed (25% of data set) and varied the precision threshold.
The results are shown in Figure 3:
[Chart not recovered. Vertical axis: 0 to 50; series: Tumor, Zoo, Hayes; horizontal axis: threshold values 0.6, 0.8, 1.]
Figure 3: Percentage of classification error for different values of threshold
The results show that the classification error did not change for different values of the
threshold, except for a drastic decrease for the Tumor database when the threshold was 1.
I repeated the experiment with the same value for Tumor and mostly got similar
results. Again, I believe the explanation is that my training sets are small and there are
very few or no tuples for some attribute values. The ratio of the frequency of the
winning group to the total frequency for one value of the winning attribute is usually
high, so decreasing the value of the threshold does not change the strength of the winner.
The results for the minimum number of tuples an interval must have in order to be expanded
showed that, for my data sets, the best accuracy is achieved for values less than or equal
to 3. For larger databases, larger values would probably be more efficient. The results are
shown in Figure 4.
[Chart not recovered. Vertical axis: 0 to 60; series: Tumor, Zoo, Hayes; horizontal axis: minimum number of tuples 0, 3, 5, 7.]
Figure 4: Values for minimum number of tuples. Size of the training set is 30% of
database and the threshold is 0.9
For the last input parameter, the maximum depth of the tree, I got almost the same
classification error for tree depths of 2, 4, 6 and 8; the important point is that the
error did not decrease significantly as the maximum depth increased. This is a very
important result, since it means that we can develop relatively shallow decision trees
that perform almost the same as deeper trees, at a much lower computational cost.
Conclusion
My implementation, tested for classification accuracy, performed worse than the IC tested
in paper [1]. The results obtained in paper [1] show a classification error below 20%,
while my program usually produced a classification error in the range of 20% to 40%. As I
said, the main reason must be the drastic difference between the sizes of our data sets.
Also, since I borrowed existing machine learning data sets, I cannot be sure how much
noise they contained.
Also, I could not test my program for efficiency in the way it was tested in paper [1]. I
could not obtain a free copy of the ID3 or CART packages in order to compare their
performance with that of my IC implementation. The reason is that they are
commercial products (especially ID3), heavily advertised and sold all over the world.
I recently read that the author of ID3, Ross Quinlan, has upgraded the old version of his
algorithm, and that his new version, sold commercially as C5.0, is claimed to be 5 to 10
times faster than the old one. I think this is interesting information for the future of
data mining.
In my opinion there is still a lot of research material in this area. With machine
learning algorithms as a basis, classifiers can be modified in various ways to suit the
specifics of the domain in which they are used. The characteristics of the data sets used
for the training set and the test set also play a big role in the performance of the
algorithm. I believe that for each specific database the input parameters can be tuned so
that the classification error is reduced as far as possible.
If interval classifiers are going to be used in interactive loops, I think a good idea
would be to make the generation of the tree dynamic, letting the algorithm “learn from its
own mistakes”. One way would be to include all misclassified tuples in the next training set.
In this project report I have also included printouts of the three databases used and two
program outputs with decision trees and classification errors.
References
[1] Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer and Arun Swami, “An
Interval Classifier for Database Mining Applications”
[2] Jiawei Han, Yandong Cai and Nick Cercone, “Knowledge Discovery in Databases:
An Attribute-Oriented Approach”