Uploaded by Phương Huyền Hà

Group5 IS-CS 468 AIS

Ha Phuong Huyen
MINISTRY OF EDUCATION AND TRAINING
DUY TAN UNIVERSITY
GROUP PROJECT
SUBJECT: Artificial Intelligence (for Business)
NAME: Ha Phuong Huyen
Nguyen Minh Phu
Vo Duy Nhat
Huynh Le Quynh Nhu
Contents
1. Use the KNN approach to classify cases with k = 1, 3, 5, 7, 9, 11, using species as the outcome variable. Based on the test set, find the best k. Justify your answer.
KNN Classification
2) Run the KNN algorithm with the best k, and then evaluate your KNN model by using a confusion matrix (ex. accuracy, recall, precision and F1 score)
3) Run the Decision Tree approach to classify cases with max depth = 3, 5, 7 using species as the outcome variable. Based on the test set, find the best max depth. Justify your answer. And then evaluate your Decision Tree model by using a confusion matrix (ex. accuracy, recall, precision and F1 score)
Decision Tree Classification
+ Run the Decision Tree approach to classify cases with max depth = 3, 5, 7 using species as the outcome variable. Based on the test set, find the best max depth.
- Evaluate the decision tree model using the confusion matrix
4) Run the Logistic Regression approach. After you create your logistic regression model, you should evaluate your logistic regression model by using a confusion matrix (ex. accuracy, recall, precision and F1 score).
5) We want to predict the species of a case with the following data. Please use the 'Create Table' function to input the dataset and apply three different classification approaches for prediction. Explain what three models predicted for new data, including the predicted outcome (probability values) for each species.
6) We need to report on the 3 classification based on the analysis result to your boss. Create your report. In your report you should include all processes your team conducted including one page executive summary.
1. Use the KNN approach to classify cases with k= 1, 3, 5, 7, 9, 11, using
species as the outcome variable. Based on the test set, find the best k. Justify
your answer. (by Phuong Huyen)
Brightics Process
Data Load
Data: “sample_iris.csv”
Column:
- sepal_length: Double
- sepal_width: Double
- petal_length: Double
- petal_width: Double
- species: string
Split Data
- Train Ratio: 8
- Test Ratio: 2
- Seed: 12
- Group by: species
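To illustrate what an 8:2 split grouped by species does, here is a minimal Python sketch. It is not the Brightics implementation: the function name `stratified_split`, the toy rows, and the use of Python's `random` module are assumptions, so the rows it picks for seed 12 will not match the Brightics output.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, train_ratio=8, test_ratio=2, seed=12):
    """Split rows into train/test so that each label keeps the same ratio."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_ratio / (train_ratio + test_ratio))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Hypothetical toy data: 10 rows per species (not the project data).
rows = [{"id": i, "species": s}
        for s in ("setosa", "versicolor", "virginica")
        for i in range(10)]
train_table, test_table = stratified_split(rows, "species")
print(len(train_table), len(test_table))
```

Grouping by the label keeps the three species equally represented in both tables, which matters when the test set is small.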
KNN Classification
- Input: Train_Table -> Split Data -> train_table, Test_Table -> Split Data -> test_table
- Feature Columns: select all (sepal_length, sepal_width, petal_length, petal_width)
- Label Columns: species
- Number of neighbors: 1, 3, 5, 7, 9, 11 (default)
- Suffix type: Label
Evaluate classification
- Inputs: default
- Label Column: ‘species’
- Prediction Column: ‘prediction’
With k = 1
With k = 3
With k = 5
With k = 7
With k = 9
With k = 11
Justification
The best value for k here is k = 7 because it has the highest accuracy on the test set (Accuracy = 1.0).
Accuracy measures the share of correct predictions, but on its own it may not be sufficient for imbalanced data. Precision and recall give a more nuanced view: precision is the share of predicted positives that are correct, and recall is the share of actual positives that the model captures.
In KNN, the choice of K (number of neighbors) is critical. For K = 1, 3, or 5, accuracy is lower than for K > 5. K = 7 balances accuracy against the risk of overfitting and gives the best values on all four evaluation metrics.
Thus, the optimal K for this dataset is 7.
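The comparison across values of k can be sketched in plain Python as follows. This is only an illustration of the technique, not the Brightics KNN function: the toy points, the two features used, and the helper names `knn_predict` and `accuracy` are assumptions, and the toy training set is too small to try every k used in the project.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Return the majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Share of test cases whose predicted label equals the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Hypothetical toy points (petal_length, petal_width), not the project data.
train = [((1.4, 0.2), "setosa"), ((1.5, 0.2), "setosa"), ((1.3, 0.3), "setosa"),
         ((4.5, 1.5), "versicolor"), ((4.7, 1.4), "versicolor"), ((4.0, 1.3), "versicolor"),
         ((5.8, 2.2), "virginica"), ((6.0, 2.5), "virginica"), ((5.5, 2.1), "virginica")]
test = [((1.6, 0.2), "setosa"), ((4.4, 1.4), "versicolor"), ((5.9, 2.3), "virginica")]

# Evaluate each candidate k on the same test set and compare the accuracies.
for k in (1, 3, 5):
    print(k, accuracy(train, test, k))
```

When several values of k tie on test accuracy, the choice between them has to rest on another criterion, such as the other metrics or stability across different seeds.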
2) Run the KNN algorithm with the best k, and then evaluate your KNN
model by using a confusion matrix(ex. accuracy, recall, precision and F1
score) (by Phuong Huyen)
Run with K=7
Evaluate Classification Result
Accuracy : 1.0
Metrics
Confusion matrix
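The metrics named in this question can all be derived from the confusion matrix. The sketch below shows one way to do it in Python; the function name `per_class_metrics` and the example matrix are hypothetical, and the matrix is not copied from the Brightics output.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of cases with true label i predicted as label j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column sum: predicted as this label
        actual = sum(matrix[i])                     # row sum: truly this label
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

labels = ["setosa", "versicolor", "virginica"]
# Hypothetical matrix with every test case on the diagonal (accuracy 1.0).
matrix = [[10, 0, 0],
          [0, 10, 0],
          [0, 0, 10]]
report = per_class_metrics(matrix, labels)
print(report["accuracy"], report["setosa"])
```

An accuracy of 1.0 implies that every off-diagonal cell is zero, so precision, recall and F1 are also 1.0 for each species.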
3) Run the Decision Tree approach to classify cases with max depth = 3, 5, 7
using species as the outcome variable. Based on the test set, find the best max
depth. Justify your answer. And then evaluate your Decision Tree model by
using a confusion matrix(ex. accuracy, recall, precision and F1 score)( by Duy
Nhut)
Decision Tree Classification
Data load:
Data: “sample_iris.csv”
Column:
- sepal_length: Double
- sepal_width: Double
- petal_length: Double
- petal_width: Double
- species: string
Split data:
- Train Ratio: 8
- Test Ratio: 2
- Seed: 12
- Group by: species
Decision Tree Classification Train
Parameters:
- Table: train_table
- Label Column: species
- Feature Columns: select all (sepal_length, sepal_width, petal_length, petal_width)
- Criterion: Gini
- Splitter: Best
- Max Depth: 3, 5, 7
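For readers unfamiliar with the Gini criterion selected above, the following Python sketch shows how Gini impurity scores a candidate split. The helper names and the example node are hypothetical; this shows the general formula, not the Brightics internals.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Weighted Gini impurity of a two-way split; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical node with 40 cases of each species.
node = ["setosa"] * 40 + ["versicolor"] * 40 + ["virginica"] * 40
left = ["setosa"] * 40
right = ["versicolor"] * 40 + ["virginica"] * 40
print(gini(node), split_gini(left, right))
```

A tree with the "Best" splitter picks, at each node, the split with the lowest weighted impurity. Max depth limits how many such splits can be chained from the root to a leaf.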
Decision Tree Classification Predict
Parameters:
- Inputs: test_table, Decision Tree Classification Train model
Evaluate classification
- Inputs: default
- Label Column: ‘species’
- Prediction Column: ‘prediction’
+ Run the Decision Tree approach to classify cases with max depth = 3, 5, 7
using species as the outcome variable. Based on the test set, find the best max
depth.
With Max depth = 3
With Max depth = 5
With Max depth = 7
- After training decision trees with a maximum depth of 3, 5, and 7, I tested their accuracy on the test set. Accuracy is the ratio of correct predictions to the number of test cases. My results are:
+ Maximum depth = 3: Accuracy = 0.9333. This is a good result, showing that the model classifies most of the test data correctly.
+ Maximum depth = 5: Accuracy = 0.93333. This is effectively the same as with depth 3, so the extra depth does not improve the test result.
+ Maximum depth = 7: Accuracy = 0.96667. This is the best of the three results, showing that the model classifies almost all of the test data correctly. However, a deeper tree is more prone to overfitting, as it may learn too much detail from the training data and fail to adapt to new data.
- Based on these results, I conclude that the best maximum depth for this problem is 7, as it has the highest accuracy on the test set.
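The reported accuracies can be read as counts of correct predictions. The short Python check below assumes that sample_iris.csv has the usual 150 rows, so that the 8:2 split leaves 30 test cases. That row count is an assumption and is not stated in this report.

```python
def accuracy(correct, total):
    """Accuracy = correct predictions / number of test cases."""
    return correct / total

# Assumption: 30 test cases (20% of 150 rows).
test_size = 30
for correct in (28, 29, 30):
    print(correct, round(accuracy(correct, test_size), 5))
```

Under that assumption, 0.9333 corresponds to 28 of 30 correct and 0.96667 to 29 of 30, so the depth-7 tree differs from the shallower trees by a single test case.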
- Evaluate the decision tree model using the confusion matrix:
Run with Max depth = 7
Evaluate Classification Result
Accuracy : 0.966666667
Confusion matrix
4) Run the Logistic Regression approach. After you create your logistic
regression model, you should evaluate your logistic regression model by using
a confusion matrix(ex. accuracy, recall, precision and F1 score). ( by Quynh
Nhu)
Data: “sample_iris.csv”
Column:
- sepal_length: Double
- sepal_width: Double
- petal_length: Double
- petal_width: Double
- species: string
Pre-processing
- Query Executor
- Convert the dependent variable into binary (numeric) format according to the input conditions, producing 1s and 0s.
SQL
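The SQL statement itself is not reproduced in this text. As a rough Python equivalent of the conversion it describes, the sketch below creates the binary column SPEC, with 1 for setosa and 0 for any other species, as stated later in the report. The function name `to_binary_label` and the sample rows are hypothetical.

```python
def to_binary_label(species, positive="setosa"):
    """Return 1 for the positive species and 0 for any other species."""
    return 1 if species.strip().lower() == positive else 0

# Hypothetical rows; only the species column matters for this step.
rows = [{"species": "setosa"}, {"species": "versicolor"}, {"species": "virginica"}]
for row in rows:
    row["SPEC"] = to_binary_label(row["species"])
print([row["SPEC"] for row in rows])
```

With this encoding the logistic regression model answers a single yes/no question (setosa or not) and does not distinguish versicolor from virginica.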
Modeling
Split Data
Parameter
- Ratio: 8,2
- Seed: 12
Logistic Regression Train
- Select the dependent variable (SPEC) and the explanatory variables (Sepal_length, Sepal_width, Petal_length, Petal_width), then proceed with the analysis.
Parameter
- Inputs: Split Data-train_table
- Feature Columns: Sepal_length, Sepal_width, Petal_length, Petal_width
- Label Column: SPEC
- The Model Summary shows the Logistic Regression Result, including the intercept and the regression coefficients, among other information.
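The intercept and regression coefficients from the model summary define the regression equation used for prediction. The Python sketch below shows the general form of that equation. The intercept and coefficient values are hypothetical placeholders, not the fitted values from this project, and `predict_probability` is an illustrative name.

```python
import math

def predict_probability(features, intercept, coefficients):
    """P(label = 1) = sigmoid(intercept + sum(coefficient * feature))."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for illustration only, not the fitted model.
intercept = 0.5
coefficients = [0.1, 0.8, -1.2, -0.6]   # sepal_length, sepal_width, petal_length, petal_width
case = [5.8, 3.2, 2.2, 1.5]

p = predict_probability(case, intercept, coefficients)
print(p, int(p >= 0.5))   # probability of label 1, and the class at a 0.5 threshold
```

A probability at or above 0.5 is usually mapped to label 1. The 0.5 threshold is a common default and is assumed here.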
Logistic Regression Predict
- Perform predictions by applying the regression equation generated by Logistic Regression Train.
Parameter
- Inputs: Split Data-test_table, Logistic Regression Train-model
Evaluate Classification
- Evaluate the predictions from Logistic Regression Predict.
Parameter
- Inputs: Logistic Regression Predict
- Label Column: SPEC
- Prediction Column: prediction
5) We want to predict the species of a case with the following data. Please use
the 'Create Table' function to input the dataset and apply three different
classification approaches for prediction. Explain what three models predicted
for new data, including the predicted outcome (probability values) for each
species. ( by Minh Phu)
Sepal_length  Sepal_width  Petal_length  Petal_width
5.8           3.2          2.2           1.5
- Use the KNN approach:
- Use the Decision Tree approach:
- Use the Logistic Regression approach:
Convert the data to binary form, where 1 is setosa and 0 is not setosa; this gives the result above.
6) We need to report on the 3 classification based on the analysis result to your
boss. Create your report. In your report you should include all processes your
team conducted including one page executive summary. (by Minh Phu)
First, when using the KNN approach, we can see that k = 7 gives the highest accuracy in the evaluation table (1.0). Although overfitting is possible, this is the most stable result compared with the other values of k.
Therefore, for the following questions that use the KNN method, we use k = 7 to predict the results.
Based on what we learned, we went on to create a prediction model using the decision tree approach.
We changed the value of max depth and relied on the Evaluate Classification function to determine which max depth best suits the data set. We find that the model with max depth = 7 is the most accurate, and its confusion matrix is extremely stable.
Next, we use the Logistic Regression approach.
For this method, we must first convert the data into binary form. The model above is trained on data where 1 means the setosa species and 0 means any other species.
The Evaluate Classification table shows that the model's accuracy is 1.
Finally, using the models built in the previous questions, we replace the test data with a table made with the Create Table function. All 3 models predict that the new case belongs to the species setosa.
The KNN approach with k = 11 gives probability_setosa = 0.72727272, which is quite good compared with the Decision Tree model, with probability_setosa = 1 (a sign of possible overfitting), and the Logistic Regression approach, with probability_1 = 0.6528 (a fairly low predicted probability).
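The KNN probability reported above matches a simple vote share: 8 of 11 neighbors is 0.7272... The Python sketch below makes that reading explicit. It assumes the probability is the unweighted fraction of neighbors per species, which this report does not confirm, and the labels of the three non-setosa neighbors are hypothetical.

```python
from collections import Counter

def vote_probabilities(neighbor_labels):
    """Share of the k nearest neighbors that carry each label."""
    k = len(neighbor_labels)
    return {label: count / k for label, count in Counter(neighbor_labels).items()}

# Hypothetical neighbor labels: 8 setosa votes out of k = 11.
# How the other 3 votes are split is not given in the report.
neighbors = ["setosa"] * 8 + ["versicolor"] * 3
print(vote_probabilities(neighbors))
```

Read this way, the KNN probability says how many nearby training cases agree with the prediction. The logistic regression probability comes from a fitted equation, so the two numbers are not directly comparable.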