Decision Tree and Random Forest
Classification of the Type of Wheat
Alejandra T.
Felix S.
Rafi A.
Laxmi K.
C&PE Department
February 14th, 2024
Introduction
Project Goal:
Develop a model to predict the type of wheat.
Problem:
Due to the large variety of wheat in North America, confusion over names is frequent.
Wheat classification helps to:
 Standardize the grain
 Facilitate adequate industrial pretreatment
 Assist in varietal experiments
Model Parameters
This is a classification model: the output variable is discrete.
Classification Algorithms:
 Decision Tree
 Random Forest
Features:
1. Area A
2. Perimeter P
3. Compactness C = 4πA/P²
4. Length of kernel
5. Width of kernel
6. Asymmetry coefficient
7. Length of kernel groove
Labels:
1. Kama
2. Rosa
3. Canadian
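These seven features and three varieties match the UCI "seeds" dataset, so a loading sketch might look like the following; the download URL, the column names, and the use of pandas are assumptions, since the slides do not show the loading step.

```python
# Minimal sketch of loading the wheat (seeds) dataset; URL and column
# names are assumptions, not taken from the slides.
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00236/seeds_dataset.txt")
COLUMNS = ["area", "perimeter", "compactness", "kernel_length",
           "kernel_width", "asymmetry", "groove_length", "variety"]

# The UCI file is whitespace-delimited; variety is 1=Kama, 2=Rosa, 3=Canadian.
seeds = pd.read_csv(URL, sep=r"\s+", header=None, names=COLUMNS)
X = seeds[COLUMNS[:-1]]   # the seven geometric features
y = seeds["variety"]      # discrete label -> a classification problem
print(seeds.head())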
Decision Tree: Random State of 22
Decision Tree: Random State of 25
Decision Tree: Random State of 100
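Trees like the three shown above can be reproduced with scikit-learn. This sketch assumes random_state seeds both the train/test split and the tree, and the 70/30 split is an assumption; X and y come from the loading sketch.

```python
# Sketch: one decision tree per random_state, as on the slides above.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

for seed in (22, 25, 100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    plt.figure(figsize=(12, 6))
    plot_tree(tree, feature_names=list(X.columns),
              class_names=["Kama", "Rosa", "Canadian"], filled=True)
    plt.title(f"Decision tree, random_state={seed}")
    plt.show()
```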
Confusion Matrix
[Confusion matrices for the testing and training data sets; one panel per team member: Aleja, Felix, Laxmi, Rafi]
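Panels like these are typically produced with scikit-learn's ConfusionMatrixDisplay; a sketch, with the split and tree settings carried over from the previous sketch as assumptions:

```python
# Confusion matrices for the training and testing sets (sketch).
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=22).fit(X_tr, y_tr)
for name, X_set, y_set in [("Training", X_tr, y_tr),
                           ("Testing", X_te, y_te)]:
    ConfusionMatrixDisplay.from_estimator(
        clf, X_set, y_set, display_labels=["Kama", "Rosa", "Canadian"])
    plt.title(f"{name} data set")
    plt.show()
```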
Model Metrics
[Model metrics on the testing data set; one panel per team member: Aleja, Laxmi, Felix, Rafi]
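The standard per-class metrics (precision, recall, F1) can be read off scikit-learn's classification_report; a sketch continuing from the tree fitted above:

```python
# Per-class metrics on the testing data set (sketch).
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_te)
print("testing accuracy:", accuracy_score(y_te, y_pred))
print(classification_report(y_te, y_pred,
                            target_names=["Kama", "Rosa", "Canadian"]))
```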
Maximum Tree Depth
 Aleja: best max_depth = 4 (random_state = 22)
 Felix: best max_depth = 6 (random_state = 13)
 Laxmi: best max_depth = 6 (random_state = 25)
 Rafi: best max_depth = 2 (random_state = 100)
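A sketch of the sweep behind these "best max_depth" numbers: fit one tree per depth and compare training and testing accuracy. The depth range of 1-10 is an assumption.

```python
# max_depth sweep: the gap between the curves shows over/underfitting.
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

depths = range(1, 11)
train_acc, test_acc = [], []
for d in depths:
    t = DecisionTreeClassifier(max_depth=d, random_state=22).fit(X_tr, y_tr)
    train_acc.append(t.score(X_tr, y_tr))
    test_acc.append(t.score(X_te, y_te))

plt.plot(depths, train_acc, marker="o", label="training accuracy")
plt.plot(depths, test_acc, marker="o", label="testing accuracy")
plt.xlabel("max_depth"); plt.ylabel("accuracy"); plt.legend(); plt.show()

# Smallest depth reaching the best testing accuracy.
best_depth = depths[test_acc.index(max(test_acc))]
print("best max_depth:", best_depth)
```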
Conclusions
Random_state   Max_depth   Accuracy
22             4           93%
13             6           92%
25             6           91%
100            2           92%
 Decision trees are relatively simple yet powerful classification algorithms. However, they are highly prone to overfitting if they are not properly restricted.
 A testing accuracy higher than the training accuracy suggests underfitting of the model.
 The point after which the testing accuracy starts decreasing may indicate overfitting of the model.
Random Forest: Confusion Matrix
[Confusion matrices for the testing and training data sets; one panel per team member: Aleja, Felix, Laxmi, Rafi]
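The random-forest analogue of the earlier matrices; a sketch using default forest settings, which is an assumption:

```python
# Random forest confusion matrices on both sets (sketch).
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay

rf = RandomForestClassifier(random_state=22).fit(X_tr, y_tr)
for name, X_set, y_set in [("Training", X_tr, y_tr),
                           ("Testing", X_te, y_te)]:
    ConfusionMatrixDisplay.from_estimator(rf, X_set, y_set)
    plt.title(f"Random forest, {name.lower()} data set")
    plt.show()
```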
Random Forest: Model Metrics
[Random forest model metrics on the testing data set; one panel per team member: Aleja, Laxmi, Felix, Rafi]
Hyperparameter Analyses
 Aleja: best n_estimators = 5, best max_features = 2 (random_state = 22)
 Laxmi: best n_estimators = 15, best max_features = 3 (random_state = 22)
 Rafi: best n_estimators = 3, best max_features = 3 (random_state = 22)
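A sketch of the search behind these numbers: try every (n_estimators, max_features) pair and keep the best testing score. The search ranges are assumptions; note that selecting on the testing set, as the slides suggest, risks mild leakage, and scikit-learn's GridSearchCV with cross-validation is the more cautious alternative.

```python
# Exhaustive search over the two random-forest hyperparameters (sketch).
from sklearn.ensemble import RandomForestClassifier

best_score, best_params = -1.0, None
for n in range(1, 31):                 # n_estimators range: assumption
    for m in range(1, 8):              # at most 7 features in this dataset
        rf = RandomForestClassifier(n_estimators=n, max_features=m,
                                     random_state=22).fit(X_tr, y_tr)
        score = rf.score(X_te, y_te)
        if score > best_score:
            best_score, best_params = score, (n, m)

print(f"best n_estimators={best_params[0]}, max_features={best_params[1]}, "
      f"testing score={best_score:.1%}")
```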
Feature Importance: Correct Fit - Aleja
 random_state = 22: testing score = 92.9%
 n_estimators = 5, max_features = 2, random_state = 22: testing score = 95.2%
 n_estimators = 5, max_features = 2, random_state = 1: testing score = 95.2%
After dropping the least important features:
 n_estimators = 5, random_state = 22: testing score = 92.9%
 n_estimators = 5, max_features = 4, random_state = 22: testing score = 92.9%
No change in the testing score!
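A sketch of the feature-importance step: rank the features with feature_importances_, drop the least important ones, and refit. How many features were dropped is not stated on the slide; dropping the three least important (leaving four, to match max_features = 4) is an assumption.

```python
# Rank features, drop the least important, refit, and compare (sketch).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=5, max_features=2,
                            random_state=22).fit(X_tr, y_tr)
importances = pd.Series(rf.feature_importances_,
                        index=X_tr.columns).sort_values()
print(importances)                   # least important first

kept = list(importances.index[3:])   # keep the 4 most important (assumption)
rf2 = RandomForestClassifier(n_estimators=5, max_features=4,
                             random_state=22).fit(X_tr[kept], y_tr)
print("testing score after dropping:", rf2.score(X_te[kept], y_te))
```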
Feature Importance – Rafi
 random_state = 31: testing score = 98.2%
Case of Overfit Data and Hypothesis - Felix's Lucky 13 Set
Growing a big forest (a very large n_estimators) eventually saturates at a good score at higher values; choosing the first peak at which the accuracy is maximum gives good results and is less computationally expensive. In either case, the length of the kernel groove is a consistent feature for distinguishing the wheat samples in this classification.
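The "first peak" heuristic above can be made concrete: grow forests of increasing size and take the smallest n_estimators that reaches the maximum testing accuracy. The range of 1-100 is an assumption.

```python
# Find the first peak of testing accuracy vs. forest size (sketch).
from sklearn.ensemble import RandomForestClassifier

sizes = range(1, 101)
scores = [RandomForestClassifier(n_estimators=n, random_state=22)
          .fit(X_tr, y_tr).score(X_te, y_te) for n in sizes]

# index() returns the first occurrence, i.e. the smallest adequate forest.
first_peak = sizes[scores.index(max(scores))]
print(f"first peak: n_estimators={first_peak}, accuracy={max(scores):.1%}")
```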
Random Forest: Conclusions
Random_state   n_estimators   Max_features   Testing Score
22             5              2              95.2%
13             80             4              92.8%
25             15             3              90.4%
31             3              3              98.2%
 Feature importance is highly sensitive to changes in model hyperparameters (random_state, max_depth, n_estimators, max_features).
 An experimentally high n_estimators produced an overfit model (here, Felix's), so we settled on a smaller n_estimators for growing the random forest.
 Hyperparameter optimization for this type of model can be time-consuming.
 The balance between model complexity and generalization capability guides the selection of an optimal model that maximizes predictive accuracy while minimizing the risk of overfitting.
Conclusion
• Comparing Decision Trees with a Random Forest model gives insight into how the models differ in construction and highlights the trade-off between simplicity and accuracy.
• Decision Trees offer an intuitive, easy-to-interpret model but may suffer from overfitting.
• In contrast, Random Forests, by aggregating multiple trees, provide a more robust and accurate model at the expense of increased complexity and computational demands, demonstrating their effectiveness in managing the diverse characteristics of wheat data.
• Random Forest models can indeed overfit, although they are generally less prone to it than individual decision trees, unless we grow a big forest of complex, diverse individual trees, as Felix did here.
LOOKING FORWARD!
• Applying regression to the wheat dataset could provide insights beyond classification, such as predicting quantitative attributes of wheat from the other features.
• Example: one could use regression to predict a kernel's physical dimensions or weight from its shape descriptors, aiding in quality assessments; see the sketch below.
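A hedged sketch of that regression idea, predicting kernel length from the other six shape descriptors; the choice of target and the use of RandomForestRegressor are assumptions, not something the slides specify.

```python
# Regression sketch: predict a kernel dimension from the other features.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_reg = seeds.drop(columns=["kernel_length", "variety"])   # six predictors
y_reg = seeds["kernel_length"]                             # assumed target
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=22)

reg = RandomForestRegressor(random_state=22).fit(Xr_tr, yr_tr)
print("R^2 on the testing set:", round(reg.score(Xr_te, yr_te), 3))
```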
Cute
To compensate for his overfit model, here is a cute picture of an American Kestrel sitting on top of Snow Hall, taken by Felix from the Chi Omega fountain.
Questions?
Thank you!
Icons designed by Madebyoliver from Flaticon.