Decision Tree and Random Forest Classification of the Type of Wheat
Alejandra T., Felix S., Rafi A., Laxmi K.
C&PE Department
February 14th, 2024

1 Introduction
Project goal: develop a model to predict the type of wheat.
Problem: because of the large variety of wheat grown in North America, confusion over names is frequent.
Wheat classification helps to:
- standardize the grain,
- facilitate adequate industrial pretreatment,
- assist in varietal experiments.
[Image: the wheat dataset]

2 Model Parameters
This is a classification model: the output variable is discrete.
Classification algorithms: Decision Tree and Random Forest.
Features:
1. Area A
2. Perimeter P
3. Compactness C = 4πA/P²
4. Length of kernel
5. Width of kernel
6. Asymmetry coefficient
7. Length of kernel groove
Labels:
1. Kama
2. Rosa
3. Canadian

3 Decision Tree: Random State of 22
[Figure: decision tree trained with random_state=22]

4 Decision Tree: Random State of 25
[Figure: decision tree trained with random_state=25]

5 Decision Tree: Random State of 100
[Figure: decision tree trained with random_state=100]

6 Confusion Matrix
[Figures: confusion matrices on the testing and training data sets (Aleja, Felix, Laxmi, Rafi)]

7 Model Metrics
[Figures: model metrics on the testing data set (Aleja, Laxmi, Felix, Rafi)]

8 Maximum Tree Depth
Aleja: best max_depth is 4 (random_state=22).
Felix: best max_depth is 6 (random_state=13).

9 Maximum Tree Depth
Laxmi: best max_depth is 6 (random_state=25).
Rafi: best max_depth is 2 (random_state=100).

10 Conclusions
Random_state | Max_depth | Accuracy
22           | 4         | 93%
13           | 6         | 92%
25           | 6         | 91%
100          | 2         | 92%

- Decision trees are relatively simple yet very powerful classification algorithms. However, they are highly prone to overfitting if they are not properly restricted.
- A testing accuracy higher than the training accuracy suggests underfitting of the model.
- The point after which the testing accuracy starts decreasing may indicate overfitting of the model.

11 Random Forest: Confusion Matrix
[Figures: confusion matrices on the testing and training data sets (Aleja, Felix, Laxmi, Rafi)]

12 Random Forest: Model Metrics
[Figures: model metrics on the testing data set (Aleja, Laxmi, Felix, Rafi)]

13 Hyperparameter Analyses
Aleja: best n_estimators is 5, best max_features is 2 (random_state=22).

14 Hyperparameter Analyses
Laxmi: best n_estimators is 15, best max_features is 3 (random_state=22).

15 Hyperparameter Analyses
Rafi: best n_estimators is 3, best max_features is 3 (random_state=22).

16 Feature Importance: Correct Fit (Aleja)
- random_state=22: testing score = 92.9%
- n_estimators=5, max_features=2, random_state=22: testing score = 95.2%
- n_estimators=5, max_features=2, random_state=1: testing score = 95.2%
After dropping the least important features:
- n_estimators=5, random_state=22: testing score = 92.9%
- n_estimators=5, max_features=4, random_state=22: testing score = 92.9%
No change in the testing score!

Feature Importance (Rafi)
random_state=31: testing score = 98.2%

Feature Importance: a Case of Overfit Data, and a Hypothesis (Felix's lucky 13 set)
Growing a big forest with a large number of estimators eventually saturates at a good score at high n_estimators values; we found that choosing the first peak at which the accuracy is maximum gives good results and is less computationally expensive. In either case, the length of the kernel groove is the feature that most reliably distinguishes the wheat samples in this dataset.
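Sketch: Decision Tree max_depth Sweep
A minimal Python sketch of the workflow behind slides 3 to 10, using scikit-learn. The file name seeds_dataset.txt, the column names, and the 80/20 split are assumptions for illustration, not taken from the slides.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Assumed local copy of the wheat (seeds) data: whitespace-separated,
    # seven features followed by the variety label (1=Kama, 2=Rosa, 3=Canadian).
    cols = ["area", "perimeter", "compactness", "kernel_length",
            "kernel_width", "asymmetry", "groove_length", "variety"]
    df = pd.read_csv("seeds_dataset.txt", sep=r"\s+", names=cols)

    X, y = df[cols[:-1]], df["variety"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=22)  # assumed 80/20 split

    # Sweep max_depth; the gap between training and testing accuracy
    # signals underfitting (test > train) or overfitting (test falls off).
    for depth in range(1, 11):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=22)
        tree.fit(X_train, y_train)
        print(depth,
              round(tree.score(X_train, y_train), 3),   # training accuracy
              round(tree.score(X_test, y_test), 3))     # testing accuracy

The best max_depth is read off as the smallest depth at which the testing accuracy peaks, matching how slides 8 and 9 report one value per random_state.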
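Sketch: Random Forest Hyperparameter Sweep and Feature Importances
A hedged sketch of the analyses on slides 13 to 16, continuing from the decision-tree sketch above (it reuses cols, X_train, X_test, y_train, y_test). The candidate grids below are assumptions chosen to span the values reported on the slides.

    from sklearn.ensemble import RandomForestClassifier

    # Grid over n_estimators and max_features; keep the best testing score.
    best_score, best_params = 0.0, None
    for n in (3, 5, 10, 15, 80):
        for m in (2, 3, 4):
            rf = RandomForestClassifier(n_estimators=n, max_features=m,
                                        random_state=22)
            rf.fit(X_train, y_train)
            score = rf.score(X_test, y_test)
            if score > best_score:
                best_score, best_params = score, (n, m)
    print("best testing score:", best_score,
          "at (n_estimators, max_features) =", best_params)

    # Feature importances for the best setting; on the slides, the length
    # of the kernel groove dominates.
    rf = RandomForestClassifier(n_estimators=best_params[0],
                                max_features=best_params[1],
                                random_state=22)
    rf.fit(X_train, y_train)
    for name, importance in zip(cols[:-1], rf.feature_importances_):
        print(f"{name:15s} {importance:.3f}")

Picking the first accuracy peak rather than the largest grid value mirrors the "first peak" rule from Felix's overfit case: a small forest is cheaper and generalizes comparably.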
Random Forest: Conclusions
Random_state | n_estimators | Max_features | Testing Score
22           | 5            | 2            | 95.2%
13           | 80           | 4            | 92.8%
25           | 15           | 3            | 90.4%
31           | 3            | 3            | 98.2%

- The feature importances are highly sensitive to changes in the model hyperparameters (random_state, max_depth, n_estimators, max_features).
- The experiment with a high n_estimators produced an overfit model (Felix's model here), so we settled on fewer estimators for growing the random forest.
- Hyperparameter optimization for this type of model can be time-consuming.
- The balance between model complexity and generalization capability guides the selection of an optimal model, one that maximizes predictive accuracy while minimizing the risk of overfitting.

21 Conclusion
• Comparing the Decision Tree and Random Forest models gives insight into how the two differ in construction and highlights the trade-off between simplicity and accuracy.
• Decision Trees offer an intuitive, easy-to-interpret model but may suffer from overfitting.
• In contrast, Random Forests, by aggregating multiple trees, provide a more robust and accurate model at the expense of increased complexity and computational demand, demonstrating their effectiveness in managing the diverse characteristics of the wheat data.
• Random Forest models can indeed overfit, although they are generally less prone to overfitting than individual decision trees, unless a large forest of complex, diverse trees is grown, as Felix's model did here.

LOOKING FORWARD!
• Applying regression to the wheat dataset could provide insights beyond classification, such as predicting quantitative attributes of wheat from the other features.
• Example: one could use regression to predict a kernel's physical dimensions or weight from its shape descriptors, aiding quality assessments (see the backup sketch after the closing slide).

22 Cute
To compensate for his overfit model, here is a cute picture of an American Kestrel sitting on top of Snow Hall, taken by Felix from the Chi Omega fountain.
[Photo: American Kestrel on Snow Hall]

23 Questions?
Thank you!
Icons designed by Madebyoliver from Flaticon.
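Backup: Regression Sketch for the Looking Forward Slide
A minimal illustration of the regression idea above. Predicting kernel width from the remaining shape descriptors is an illustrative choice, not part of the classification study; the sketch reuses the df loaded in the decision-tree sketch.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Illustrative target: kernel width, predicted from the other descriptors.
    features = ["area", "perimeter", "compactness", "kernel_length",
                "asymmetry", "groove_length"]
    Xr, yr = df[features], df["kernel_width"]
    Xr_train, Xr_test, yr_train, yr_test = train_test_split(
        Xr, yr, test_size=0.2, random_state=22)

    reg = RandomForestRegressor(n_estimators=100, random_state=22)
    reg.fit(Xr_train, yr_train)
    print("test R^2:", round(reg.score(Xr_test, yr_test), 3))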