Knowledge distillation from random forests

1 Description

Random forests are one of the most commonly used machine learning methods: an ensemble of trees is combined into a single classifier/regressor that yields superior performance. The aim of this project is to "compress" the knowledge encoded within a collection of trees into a single, interpretable tree, following reasoning similar to [1]. You will train a random forest classifier with scikit-learn and use the predicted probabilities to train a short tree classifier/regressor.

2 Do's and don'ts

• This is an individual project and you must work on it by yourself.
• DO NOT copy and paste from other sources (with or without referencing) — this is plagiarism.
• DO NOT copy text from other sources and replace random words — this is also plagiarism (with or without referencing). You need to paraphrase the text.
• DO NOT include screenshots of your code or code outputs in the report.
• DO save figures properly from Python (e.g., using plt.savefig()) and include them in the report.
• DO NOT write your name on the report: use your registration number.
• DO NOT zip together the pdf and txt files when submitting your work to FASER.

3 Tasks

1. Choose three appropriate datasets from the UCI Machine Learning Repository [2], load and inspect them using data exploration techniques.
2. Train a random forest on each dataset, with a reasonably high number of trees (e.g., 100 upwards).
3. Get the probabilities for each class and create a new dataset that includes the probabilities as calculated by the random forest. So if your original dataset involved dogs vs. cats (i.e., a binary classification task), you should now create a new dataset for multi-class classification that includes multiple new classes, encoded similarly to:
   0.1-dog-0.9-cat
   0.2-dog-0.8-cat
   ...
   Use any binning method you want (e.g., numpy.histogram will create the probability bins for you).
4. Learn decision tree classifiers on the new dataset (in which you are using the binned probabilities above as the labels to predict), plot them, and measure their accuracy on the original datasets. Use hyperparameter tuning to find appropriate parameters for the decision trees. (A code sketch of steps 2-4 follows this list.)
5. Using the same hyperparameters as in (4), train a decision tree with the original data (i.e., without the distillation procedure).
6. Compare the performances of the three methods: the distilled classifiers from (4), the short decision trees from (5) and the original random forests from (2), in terms of accuracy and any other metrics appropriate to your specific datasets. (A second sketch covering steps 4-6 also follows the list.)
7. Compare your performance with other state-of-the-art methods on the selected datasets.
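For illustration, here is a minimal Python sketch of one way to carry out steps 2-4 with scikit-learn. It uses the built-in breast-cancer data as a stand-in for a UCI dataset, and the bin count, tree depth, and bin-to-class mapping are illustrative assumptions, not prescribed choices.

    import numpy as np
    from sklearn.datasets import load_breast_cancer  # stand-in for a UCI dataset
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Step 2: train a reasonably large random forest (100+ trees).
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X_train, y_train)

    # Step 3: replace the hard labels with binned class probabilities.
    proba = forest.predict_proba(X_train)[:, 1]        # P(class 1) for a binary task
    _, bin_edges = np.histogram(proba, bins=10)        # numpy builds the bins
    # Each sample's probability bin becomes its new class,
    # playing the role of labels such as "0.1-dog-0.9-cat".
    soft_labels = np.digitize(proba, bin_edges[1:-1])

    # Step 4: distil into a single shallow tree trained on the soft labels.
    student = DecisionTreeClassifier(max_depth=4, random_state=0)
    student.fit(X_train, soft_labels)

    # To score the student on the ORIGINAL task, map each predicted bin back
    # to the majority hard class observed in that bin during training
    # (one simple choice, not the only one).
    bin_to_class = {b: np.round(y_train[soft_labels == b].mean()).astype(int)
                    for b in np.unique(soft_labels)}
    y_pred = np.array([bin_to_class[b] for b in student.predict(X_test)])
    print("distilled tree accuracy:", accuracy_score(y_test, y_pred))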
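A second sketch covers the tuning, baseline, and comparison steps (4-6), reusing the variables from the sketch above. The parameter grid is a hypothetical example; choose ranges suited to your datasets. It also shows how to save a tree plot from Python with savefig(), as the do's and don'ts require.

    import matplotlib.pyplot as plt
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier, plot_tree

    # Hypothetical grid for illustration only.
    param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 10]}

    # Step 4 (tuning): tune the distilled tree on the soft (binned) labels.
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    search.fit(X_train, soft_labels)

    # Step 5: a plain tree on the ORIGINAL labels with the same hyperparameters,
    # so the only difference between the two trees is the distillation step.
    baseline = DecisionTreeClassifier(random_state=0, **search.best_params_)
    baseline.fit(X_train, y_train)

    # Step 6: compare the models on held-out data (the distilled tree is
    # scored via the bin-to-class mapping from the previous sketch).
    print("random forest:", forest.score(X_test, y_test))
    print("plain tree   :", baseline.score(X_test, y_test))

    # Save tree plots as proper figures rather than screenshots.
    fig, ax = plt.subplots(figsize=(12, 6))
    plot_tree(search.best_estimator_, filled=True, ax=ax)
    fig.savefig("distilled_tree.png", dpi=200)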
4 References

1. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
2. UCI Machine Learning Repository, https://archive.ics.uci.edu/

5 Dataset examples

1. https://archive.ics.uci.edu/ml/datasets/Forest+Fires
2. https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data
3. https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring

6 Deliverables

You will be assessed on both the paper you write and the quality of your code and documentation. All the code should be on GitHub (ideally using .py files for the main code, with some illustrative examples in notebooks). The notebooks should not be used to store all the code, just for examples, properly importing the functions from the Python files. The report should include the following sections:

1. Title: Make sure the title of your paper is descriptive of your work. Do not use "Project 1/2/3/Reassessment" as your title!
2. Abstract: Provide a short description of your work to convince the reader that your paper is worth reading. A good abstract should include a statement of problem significance, a summary of the methods and results, and a short conclusion. The abstract should not be longer than 250 words.
3. Introduction: Explain the purpose of your work and motivate it: why is what you are doing important? This section should include references to show that what you are doing is important and relevant.
4. Background/Literature review: Describe similar efforts from the past (e.g., by others who used the same datasets, if applicable, mentioning their results) and introduce any necessary background knowledge. Bear in mind that the paper is for an audience with a background in machine learning, so do not explain what random forests/decision trees are; we also do not need a formula for accuracy.
5. Methodology: Describe the dataset(s) you used and your whole pipeline for obtaining the results.
6. Results: Use graphs and tables (actual tables, not screenshots of them imported as figures) to show your results, including intermediate results you may have obtained throughout the process. Check examples of papers from others to see how results are typically described.
7. Discussion: What insights can you extract from your results? Compare your results with others' where applicable: what is better and worse about your approach?
8. Conclusion: A couple of short paragraphs describing any concluding remarks you might have, including potential avenues for future work/improvements.

The report should be a maximum of 6 pages long, including references. Note that this is a hard limit: you should not go above it. If your report is longer than this, only the first 6 pages will be marked. The format of your report should adhere to IEEE standards.

7 Your FASER submission

You need to submit 2 files to FASER:

1. A report in PDF format following the IEEE template. The maximum length allowed for the report is 6 pages.
2. A txt file with a link to a GitHub project that contains the code and the data you used. Make sure your README.md describes how to run the code and provides links to download the dataset(s) if they are too large to upload to GitHub. Your code should be a mixture of (mostly) Python code and (one or two) Jupyter notebooks illustrating the methods you used. Make sure your code is well documented.