Random Forest Pruning for Model Compression

Knowledge Distillation from Random Forest
(Pruning Method)
Abstract— Which classification algorithm is best in machine
learning or deep learning is much debated, but the answer
depends on the problem. A Support Vector Machine (SVM), for
example, gives better results when the data is sparse. Random
forest compares well with most strong classification algorithms
because it resists overfitting. It also gives very good results
when trained on much less data for the purpose of model
compression, but this becomes a major problem when the model is
applied to a large amount of other data or to a new problem
statement. To avoid this problem we use a pruning technique that
gives better results without taking the whole dataset: it works
on a small chunk of the data after cutting out the useless or
repeated data. We tested our method on small, medium, and large
datasets: the Iris dataset, the breast cancer dataset, and the
MNIST digits classification dataset. Our compression method
maintains accuracy while shrinking the model.
We use three datasets to get better insight into our approach,
deliberately chosen to differ in size: small, medium, and large.
The first is the Iris dataset, which contains 4 features and 3
target classes and describes the classification of flowers. The
second is the breast cancer dataset, which contains more
features and a total of 500+ entries. The third is the
well-known MNIST digits classification dataset, which has 64
features and more than 1,300 entries.
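As a minimal sketch, the three datasets can be loaded as follows. It assumes the copies bundled with scikit-learn are the ones meant; the paper does not name its data source. The 64-feature digits data matches scikit-learn's 8x8 `load_digits` set rather than the full 28x28 MNIST. The helper name `load_all` is hypothetical.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_iris


def load_all():
    """Return {name: (X, y)} for the three datasets (assumed scikit-learn copies)."""
    loaders = {
        "iris": load_iris,
        "breast_cancer": load_breast_cancer,
        "digits": load_digits,
    }
    return {name: loader(return_X_y=True) for name, loader in loaders.items()}


datasets = load_all()
for name, (X, y) in datasets.items():
    # Print the number of rows, features, and distinct classes.
    print(name, X.shape, len(set(y)))
```

The row counts printed are for the full bundled datasets, before any train/test split.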
Deep learning algorithms today give enormously better results
on image classification, image segmentation, object detection,
and information extraction problems. Despite their remarkable
accuracy, most state-of-the-art deep learning methods, e.g.,
CNN and ANN, are computationally expensive and demanding in
time and memory. Such problems create hurdles in achieving the
best accuracy and results with lower complexity.
Similarly, most of these methods require expensive hardware,
such as a GPU with a large amount of memory and parallel
computation abilities, to carry the burden.
Various methods have been proposed to reduce computational
complexity for CNNs and to achieve model compression with the
best accuracy, for example a deeper teacher network that
distils some of its features to a smaller student network.
In this paper, we propose a new technique that differs from the
traditional pruning method.
Similar efforts have been made by other machine learning
engineers using traditional knowledge distillation, truncation,
and quantization techniques. Other efforts used pruning, Neural
Architecture Search (NAS), and low-rank approximation
techniques. Our approach differs from the older pruning methods
by following the rules of K-folds.
Our method has two main aspects:
1) Performing the simple rule of k-folds on training data
2) Then checking the accuracy on test data one by one
In the first step, we pre-process the dataset using a scaler or
other methods, then distil the data with the random forest
algorithm: we pick chunks of the data, run random forest on
them, and measure accuracy, precision, and other metrics such
as f1-score. We judge these after performing GridSearchCV over
several folds with pruning parameters such as max depth and min
samples leaf. We also search further parameters such as the
split criterion: 'gini' or 'entropy'.
Then we perform multi-class classification, which may include
binary classification, make a dataset out of the outputs, and
visualize it through a binning technique using histogram bins,
which shows which predictor or estimator has had more effect,
as shown in Figure 1.
After binning the probabilities of the output variable, we fit
decision trees on our data and visualize the tree, which shows
the max depth for our data; at that depth the tree generally,
and most probably, overfits the data. In fact, it gives the
best accuracy without generalizing the data and better ...
Then we tune the model using hyperparameter tuning methods.
More specifically, in our pruning technique the data is cut out
and pruned from the bottom of the trees, and the best result is
selected. The best accuracy for all the datasets is shown in
Table 1, together with the other metrics: accuracy, precision,
and f1-score.
The pruned model is visualized as a tree in Figures 4 and 5 for
the Iris and breast cancer datasets.
This is the whole pipeline of our proposed approach, which we
applied to our datasets to get optimal results while avoiding
the time complexity and the memory cost of a large model.
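The search step of the pipeline can be sketched as below. This is an illustration under stated assumptions, not the authors' exact code: the scaler (`StandardScaler`), the number of folds, the grid values, the forest size, and the 70/30 split are all assumed. Only the use of GridSearchCV over max depth, min samples leaf, and the 'gini'/'entropy' criterion comes from the text.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 0: pre-processing and the forest live in one pipeline so that the
# scaler is re-fitted inside every fold (no leakage from validation data).
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=25, random_state=0)),
])

# Pruning-style parameters named in the text; the values are assumptions.
param_grid = {
    "forest__max_depth": [2, 3, None],
    "forest__min_samples_leaf": [1, 3],
    "forest__criterion": ["gini", "entropy"],
}

# Aspect 1: k-fold search on the training data only.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=folds, scoring="accuracy")
search.fit(X_train, y_train)

# Aspect 2: check the selected model on the held-out test data.
best_params = search.best_params_
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(best_params, round(test_accuracy, 3))
```

Smaller `max_depth` and larger `min_samples_leaf` values give shallower trees, which is where the compression comes from in this sketch.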
The second bar graph, for the breast cancer dataset, shows
balanced outputs, as follows:
Figure 2. Probability binning for the breast cancer dataset,
showing the effects of all predictors on the output of the two
target classes.
The bars in Figure 2 show that there are two target classes,
which are accurately classified, as can be seen on the y axis
(probability bins).
Similarly, a sample from the third dataset is shown as:
Table 1 shows the numerical performance metrics for the
datasets: accuracy, precision, and f1-score.
From this table, we see that the distilled, compressed model
achieves the best accuracy in some cases and lower performance
in others.
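The three metrics in the table can be computed with a short helper such as the one below. It is a sketch: the helper name `summarize` and the toy labels are hypothetical, and macro averaging is an assumption, since the paper does not say how precision and f1-score are averaged across classes.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score


def summarize(y_true, y_pred):
    """Return the three table metrics for one model on one dataset."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Macro averaging (assumed): every class counts equally.
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }


# Toy labels for illustration only; these are not results from the paper.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
row = summarize(y_true, y_pred)
print({name: round(value, 3) for name, value in row.items()})
```

Calling `summarize` once for the full forest and once for the compressed model gives the two rows needed for a side-by-side comparison.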
We now check the results of binning the probabilities for the
Iris, breast cancer, and digits classification datasets
(Figures 1 to 3).
Figure 3. Probability binning for the digits dataset, showing
the effects of our four predictors in the form of various
colours and binning distances.
Next, we look at the distilled classifier outputs of the
compressed model in the form of trees (Figures 4 to 6), so that
we can understand the advantage of the pruning technique.
Figure 1. Probability binning for the Iris dataset, showing the
effects of our four predictors in different colors.
The bars in Figure 1 show that the data has two outputs, which
means it is essentially binary-classified data in a very good,
optimal form.
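One plausible reading of the probability-binning step is a histogram of the forest's predicted class probabilities, one histogram per class. The sketch below follows that reading. The ten equal-width bins and the forest settings are assumptions, and the paper's figures may have been built differently.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One probability per class and per sample, shape (n_samples, n_classes).
proba = forest.predict_proba(X)

# Ten equal-width bins over [0, 1]; the bin count is an assumption.
bin_edges = np.linspace(0.0, 1.0, 11)

# counts[k, b] = number of samples whose probability for class k is in bin b.
counts = np.stack([
    np.histogram(proba[:, k], bins=bin_edges)[0]
    for k in range(proba.shape[1])
])
print(counts)
```

Each row of `counts` can be drawn as one coloured bar series. Mass concentrated in the first and last bins indicates confident predictions.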
Figure 4. Tree of distilled classifiers for the Iris dataset.
Figure 4 shows how the model is compressed using our pruning
technique.
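A compressed tree of this kind can be illustrated by distilling a forest into a single shallow decision tree. The sketch below trains the small tree on the forest's predictions, which is a common form of distillation and an assumption about the procedure, not a confirmed detail of the paper. The depth and leaf-size limits are also assumed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, stratify=data.target, random_state=0
)

# Teacher: the full random forest.
teacher = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Student: one shallow tree fitted to the teacher's labels (assumed procedure).
student = DecisionTreeClassifier(max_depth=3, min_samples_leaf=3, random_state=0)
student.fit(X_train, teacher.predict(X_train))

# Compare model size and how often the student agrees with the teacher.
teacher_nodes = sum(tree.tree_.node_count for tree in teacher.estimators_)
student_nodes = student.tree_.node_count
fidelity = (student.predict(X_test) == teacher.predict(X_test)).mean()

print(f"teacher nodes: {teacher_nodes}, student nodes: {student_nodes}")
print(f"agreement with teacher on test data: {fidelity:.3f}")
print(export_text(student, feature_names=list(data.feature_names)))
```

The node counts give a simple measure of compression, and `export_text` prints the student as a readable set of rules, a text counterpart to the tree figure.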
We now go through the tree for the second dataset.
Figure 5. Tree for the breast cancer dataset produced by the
classification method.
Although this dataset contains about 30 features, it is still
compressed very effectively.
We now look at the third dataset, digits classification.
Figure 6. Tree visualization for the digits classification
dataset.
This tree is complex and difficult to interpret, which shows
the limitations of the method: for data that is too bulky and
non-linear, it still produces a classification, but with much
overfitting.
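The overfitting described here can be checked by comparing training and test accuracy for an unpruned tree and a depth-limited one. The sketch below does this on scikit-learn's digits data. The split, the pruning limits, and the helper name `fit_report` are assumptions, and the numbers it prints are not the paper's results.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def fit_report(**tree_params):
    """Fit one tree and report its size plus train and test accuracy."""
    tree = DecisionTreeClassifier(random_state=0, **tree_params).fit(X_train, y_train)
    return {
        "nodes": tree.tree_.node_count,
        "train": tree.score(X_train, y_train),
        "test": tree.score(X_test, y_test),
    }


full = fit_report()  # grown until the leaves are pure
pruned = fit_report(max_depth=5, min_samples_leaf=5)  # assumed limits

print("full  ", full)
print("pruned", pruned)
```

A large gap between the `train` and `test` values of the full tree is the overfitting signal, and the `nodes` values show how much smaller the pruned tree is.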
From the results on the small and medium datasets, we can see
that our technique performs much better and compresses the
model to a greater extent.
We compared our model with related work that applies weight
pruning and knowledge distillation from CNNs to an image
classification dataset. That work reports an accuracy of 90
percent or less for the distilled classifier, while the model
without distillation gives better results, 95 percent at most.
For large and bulky datasets, however, it produces bigger
results that are difficult to understand. It still classifies
to a large extent, but in this case it overfits.
One of the reasons is that such a dataset contains a large
amount of nonlinear data, which creates the issue.
Our remarks on model compression techniques conclude that the
future of AI depends on whether we are able to compress our
models without using much larger data or more computation, with
all the complexities and ambiguities those bring.
The world is heading towards deep learning algorithms, or
neural networks, to train models and get better results.
The top priority is to achieve appropriate and substantial
results with models that are easy to deploy, for example on
handheld devices.