Knowledge Distillation from Random Forest (Pruning Method)

Abstract— There is much debate about which classification algorithm is best in machine learning and deep learning, but the answer depends largely on the problem. Support Vector Machines (SVM), for instance, give better results when the data are sparse, while random forests outperform most strong classifiers because they avoid overfitting the data. Random forests also give very good results when only a small amount of data is used for model compression, but this becomes a major problem when we face a much larger dataset or a new problem statement. To avoid this problem we use a pruning technique that gives good results without consuming the whole dataset: it takes small chunks of data after cutting out the useless or redundant samples. We tested our method on small, medium and large datasets, namely the Iris dataset, the breast cancer dataset and the MNIST digits classification dataset. Our compression method maintains accuracy while shrinking the model.

I. INTRODUCTION

Deep learning algorithms today give enormously better results for image classification, image segmentation, object detection and information-extraction tasks. Despite their remarkable accuracy, most state-of-the-art deep learning methods such as CNNs and ANNs are computationally expensive and demanding in both time and memory. These costs are a hurdle to achieving the best accuracy with low complexity. Most of these methods also require expensive hardware, such as GPUs with large amounts of memory and parallel computation, to carry the load. Various methods have been proposed to reduce the computational complexity of CNNs and to achieve model compression with high accuracy, for example by letting a deeper teacher network distil some of its features into a smaller student network. In this paper we propose a new technique that departs from the traditional pruning method.

II. LITERATURE REVIEW

Similar efforts have been made by other machine learning engineers using traditional knowledge distillation, truncation and quantization techniques. Further efforts have gone into pruning, Neural Architecture Search (NAS) and low-rank approximation. Our approach differs from the old pruning method by following the rules of k-fold cross-validation, which we apply to our datasets to get optimal results while avoiding the time complexity and the memory cost of a large model.

III. METHODOLOGY

We use three datasets to get better insight into our approach, deliberately choosing data of different sizes: small, medium and large. The first is the Iris dataset, which contains 4 features and 3 target classes and describes flower classification. The second is the breast cancer dataset, which contains more features and 500+ entries. The third is the well-known MNIST digits classification dataset, which has 64 features and more than 1300 entries.

Our method has two main aspects:
1) performing k-fold cross-validation on the training data;
2) then checking the accuracy on the test data, fold by fold.

In the first step we pre-process the dataset with a scaler or other methods and then distil the data with the random forest algorithm: we pick chunks of data, run the random forest on them, and measure accuracy, precision and other metrics such as F1-score. The model is judged after running GridSearchCV over several folds and over pruning parameters such as maximum depth and minimum samples per leaf; we also tune the split criterion, 'gini' or 'entropy'. A minimal sketch of this step is given below.
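The exact parameter grid, fold count and train/test split are not stated above, so the following is only a minimal scikit-learn sketch of this step under assumed settings: a StandardScaler for pre-processing, a RandomForestClassifier, and GridSearchCV with 5-fold cross-validation over the pruning-related parameters named in the text (max_depth, min_samples_leaf and the 'gini'/'entropy' criterion). The grid values are illustrative assumptions, not the configuration used in this work.

```python
# Minimal sketch of the k-fold + GridSearchCV step described above.
# Grid values, the 5-fold setting and the split are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)   # the breast cancer and digits datasets load the same way
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

pipe = Pipeline([
    ("scale", StandardScaler()),                    # pre-processing with a scaler
    ("rf", RandomForestClassifier(random_state=42)),
])

param_grid = {                                      # pruning-related parameters from the text
    "rf__max_depth": [2, 3, 5, None],
    "rf__min_samples_leaf": [1, 3, 5],
    "rf__criterion": ["gini", "entropy"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")  # k-fold CV
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
# accuracy, precision and F1-score on the held-out test data
print(classification_report(y_test, search.predict(X_test)))
```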
Next we perform multi-class (or, where applicable, binary) classification with this model, build a dataset of its predicted probabilities, and visualize them with a binning technique using histogram bins; this shows which predictor or estimator has the greatest effect, as shown in Figure 1. After binning the probabilities of the output variable, we fit decision trees to the data and visualize the resulting tree, whose maximum depth indicates where the model would most probably overfit the data. In fact, this gives the best accuracy without over-generalizing the data, and better training.

Then we tune the model with hyper-parameter tuning methods. More specifically, in our pruning technique data is cut away and pruned from the bottom of the trees, and the best configuration is selected. The best accuracy for each dataset is shown in Table I, together with the other metrics: precision and F1-score. The pruned model is visualized as a tree in Figures 4 and 5 for the Iris and breast cancer datasets. This is the whole pipeline of our proposed approach.

IV. RESULTS

TABLE I
ACCURACIES AND OTHER PERFORMANCE METRICS

Set                              Accuracy    Precision    F1-score
Iris
Breast Cancer
MNIST Digits Classification

Table I reports the performance on the three datasets in terms of accuracy, precision and F1-score. Looking at the table, the distilled, compressed model reaches the best accuracy in some cases and shows somewhat lower performance in others.

V. GRAPHS

We now examine the binned probabilities of the Iris, breast cancer and digits classification datasets.

Figure 1. Probability binning of the Iris dataset, showing the effects of our four predictors in different colors.

The bars in Figure 1 show that the data has two outputs, i.e. it is essentially binary-classified data in a very good and optimal form. The second bar graph, for breast cancer, shows balanced outputs:

Figure 2. Probability binning of the breast cancer dataset, showing the effects of all predictors on the two target classes.

The bars in Figure 2 show that the two target classes are accurately separated, as can be seen on the y-axis (probability bins). Similarly, a sample from the third dataset is shown in:

Figure 3. Probability binning of the digits dataset, showing the effects of our four predictors in various colours and binning distances.

VI. VISUALIZATION TREES

We now look at the distilled classifier outputs of the compressed model in tree form, so that we can appreciate the advantage of the pruning technique.

Figure 4. Tree of the distilled classifier for the Iris dataset.

Figure 4 shows how the model is compressed by our pruning technique. A sketch of how such a binned histogram and distilled tree can be produced is given below.
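The plotting and distillation code is not included above, so the sketch below is a minimal illustration, under assumed settings, of the two visualization steps: binning the forest's predicted probabilities into histogram bins (as in Figures 1-3) and condensing the forest into a single small decision tree that can be drawn (as in Figures 4-6). Fitting the student tree on the teacher's predicted labels is one common way to realize this kind of distillation; the max_depth and min_samples_leaf values are assumptions, not the tuned values from Table I.

```python
# Sketch of probability binning and tree distillation/visualization.
# The max_depth / min_samples_leaf values are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

teacher = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Bin the teacher's predicted probabilities per class (cf. Figures 1-3).
proba = teacher.predict_proba(X_train)
plt.hist([proba[:, k] for k in range(proba.shape[1])],
         bins=10, label=[f"class {k}" for k in range(proba.shape[1])])
plt.xlabel("predicted probability")
plt.ylabel("count")
plt.legend()
plt.show()

# Distil the forest into one small, pruned tree by fitting the student
# on the teacher's predicted labels, then draw it (cf. Figures 4-6).
student = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=42)
student.fit(X_train, teacher.predict(X_train))

print("student accuracy on test data:", student.score(X_test, y_test))
plt.figure(figsize=(12, 6))
plot_tree(student, filled=True)
plt.show()
```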
Next we go through the tree for the second dataset:

Figure 5. Tree of the distilled classifier for the breast cancer dataset.

Although this dataset contains almost 30 features, it is still compressed very effectively. Finally, we look at the third dataset, the digits classification model:

Figure 6. Tree visualization of the digits classification dataset.

Here the tree is complex and difficult to interpret, so we can see the limitations of the method: for very bulky, non-linear data it still produces a classification, but with considerable overfitting.

VII. DISCUSSION

Looking at the results on the small and medium datasets, our technique performs well and the model is compressed to a great extent. We compared our model with related work that applies weight pruning and knowledge distillation from CNNs to an image classification dataset; there the distilled classifier reaches an accuracy of 90 percent or less, while the model without distillation gives better results of at most 95 percent. For the large, bulky dataset, however, our method produces larger trees that are difficult to understand. It still classifies the data to a large extent, but in this case it overfits. One of the reasons is that this dataset is large and highly non-linear, which creates problems for the method.

VIII. CONCLUSION

Our remarks on model compression techniques lead us to conclude that the future of AI lies in whether we can compress our models without using much larger data and without heavy computation, complexity and ambiguity. The world is heading towards deep learning algorithms and neural networks for training models and getting better results, but the top priority is to achieve appropriate, reliable results that are easy to deploy, for example on handheld devices.