Ministry of Education and Training
Duy Tan University

GROUP PROJECT
SUBJECT: Artificial Intelligence (for Business)
NAME: Ha Phuong Huyen, Nguyen Minh Phu, Vo Duy Nhat, Huynh Le Quynh Nhu

Contents
1. Use the KNN approach to classify cases with k = 1, 3, 5, 7, 9, 11, using species as the outcome variable. Based on the test set, find the best k. Justify your answer.
   KNN Classification
2) Run the KNN algorithm with the best k, and then evaluate your KNN model by using a confusion matrix (e.g. accuracy, recall, precision and F1 score)
3) Run the Decision Tree approach to classify cases with max depth = 3, 5, 7 using species as the outcome variable. Based on the test set, find the best max depth. Justify your answer. And then evaluate your Decision Tree model by using a confusion matrix (e.g. accuracy, recall, precision and F1 score)
   Decision Tree Classification
   + Run the Decision Tree approach to classify cases with max depth = 3, 5, 7 using species as the outcome variable. Based on the test set, find the best max depth.
   - Evaluate the decision tree model using the confusion matrix
4) Run the Logistic Regression approach. After you create your logistic regression model, you should evaluate your logistic regression model by using a confusion matrix (e.g. accuracy, recall, precision and F1 score).
5) We want to predict the species of a case with the following data. Please use the 'Create Table' function to input the dataset and apply three different classification approaches for prediction. Explain what three models predicted for new data, including the predicted outcome (probability values) for each species.
6) We need to report on the three classification approaches based on the analysis result to your boss. Create your report. In your report you should include all processes your team conducted including one page executive summary.

1. Use the KNN approach to classify cases with k = 1, 3, 5, 7, 9, 11, using species as the outcome variable. Based on the test set, find the best k. Justify your answer. (by Phuong Huyen)

Brightics Process

Data Load
- Data: "sample_iris.csv"
- Column:
  sepal_length: Double
  sepal_width: Double
  petal_length: Double
  petal_width: Double
  species: String

Split Data
- Train Ratio: 8
- Test Ratio: 2
- Seed: 12
- Group by: species

KNN Classification
- Input: Train_Table -> Split Data -> train_table, Test_Table -> Split Data -> test_table
- Feature Columns: select all (sepal_length, sepal_width, petal_length, petal_width)
- Label Column: species
- Number of neighbors: 1, 3, 5, 7, 9, 11 (default)
- Suffix type: Label

Evaluate Classification
- Inputs: default
- Label Column: 'species'
- Prediction Column: 'prediction'

[Figures: Evaluate Classification results with k = 1, k = 3, k = 5, k = 7, k = 9, k = 11]

Justification
The best value for k here is k = 7 because it has the highest accuracy on the test set (Accuracy = 1.0). Accuracy measures the share of predictions the model gets right, but on its own it may not be sufficient for imbalanced data. Precision and recall give a more nuanced view: precision shows how many positive predictions are correct, and recall shows how many true positives are captured. In KNN, the choice of K (number of neighbors) is critical.
For K = 1, 3, or 5, accuracy is lower than for K > 5. K = 7 strikes a balance between accuracy and overfitting and gives the best values on all four evaluation metrics. Thus, the optimal K for this dataset is 7.

2) Run the KNN algorithm with the best k, and then evaluate your KNN model by using a confusion matrix (e.g. accuracy, recall, precision and F1 score) (by Phuong Huyen)

Run with K = 7
Evaluate Classification Result
- Accuracy: 1.0
[Figures: metrics table and confusion matrix for K = 7]

3) Run the Decision Tree approach to classify cases with max depth = 3, 5, 7 using species as the outcome variable. Based on the test set, find the best max depth. Justify your answer. And then evaluate your Decision Tree model by using a confusion matrix (e.g. accuracy, recall, precision and F1 score) (by Duy Nhut)

Decision Tree Classification

Data load:
- Data: "sample_iris.csv"
- Column:
  sepal_length: Double
  sepal_width: Double
  petal_length: Double
  petal_width: Double
  species: String

Split data:
- Train Ratio: 8
- Test Ratio: 2
- Seed: 12
- Group by: species

Decision Tree Classification Train
- Parameters:
  Table: train_table
  Label Column: species
  Feature Columns: select all (sepal_length, sepal_width, petal_length, petal_width)
  Criterion: Gini
  Splitter: Best
  Max Depth: 3, 5, 7

Decision Tree Classification Predict
- Parameters:
  Inputs: test_table, Decision Tree Classification Train model

Evaluate Classification
- Inputs: default
- Label Column: 'species'
- Prediction Column: 'prediction'

+ Run the Decision Tree approach to classify cases with max depth = 3, 5, 7 using species as the outcome variable. Based on the test set, find the best max depth.
[Figures: Evaluate Classification results with Max depth = 3, Max depth = 5, Max depth = 7]
- After training decision trees with maximum depths of 3, 5 and 7, I tested their accuracy on the test set. Accuracy is the ratio of correct predictions to the number of test cases. My results are:
+ Maximum depth = 3: Accuracy = 0.9333.
This is a good result, showing that the model classifies most of the test data correctly.
+ Maximum depth = 5: Accuracy = 0.9333. This matches the result for depth 3, so the extra depth brings no gain on the test set.
+ Maximum depth = 7: Accuracy = 0.96667. This is the best of the three results, showing that the model classifies almost all of the test data correctly. However, a deeper tree is also more prone to overfitting: it may have learned too much detail from the training data and may adapt poorly to new data.
- Based on this result, I conclude that the best maximum depth for this problem is 7, as it has the highest accuracy on the test set.

- Evaluate the decision tree model using the confusion matrix:
Run with Max depth = 7
Evaluate Classification Result
- Accuracy: 0.966666667
[Figure: confusion matrix for Max depth = 7]

4) Run the Logistic Regression approach. After you create your logistic regression model, you should evaluate your logistic regression model by using a confusion matrix (e.g. accuracy, recall, precision and F1 score). (by Quynh Nhu)

Data: "sample_iris.csv"
Column:
  sepal_length: Double
  sepal_width: Double
  petal_length: Double
  petal_width: Double
  species: String

Pre-processing
- Query Executor
- Convert the dependent variable into binary format (numeric type) according to the input conditions, resulting in 1s and 0s.
[Figure: SQL statement]

Modeling

Split Data
Parameter
- Ratio: 8, 2
- Seed: 12

Logistic Regression Train
- Select the dependent variable (SPEC) and explanatory variables (Sepal_length, Sepal_width, Petal_length, Petal_width), then proceed with the analysis.
Parameter
- Inputs: Split Data-train_table
- Feature Columns: Sepal_length, Sepal_width, Petal_length, Petal_width
- Label Column: SPEC
- In the MODEL Summary, you can find the following information: Logistic Regression Result: Intercept, Regression Coefficients, and more.
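The intercept and regression coefficients listed in the model summary define the regression equation that the prediction step applies. The following is a minimal sketch of that calculation, not the Brightics implementation: the intercept, the coefficients, the sample flower, and the 0.5 decision threshold are all hypothetical values chosen for illustration, and the real numbers would come from the model summary.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Return P(label = 1) from a logistic regression equation."""
    # Linear part: intercept + sum of (coefficient * feature value).
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    # The sigmoid maps the linear score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for illustration only (not taken from the trained model).
intercept = 6.0
# Order: sepal_length, sepal_width, petal_length, petal_width
coefficients = [-0.4, 0.9, -2.2, -1.0]
# An illustrative flower, not a row quoted in this report.
case = [5.1, 3.5, 1.4, 0.2]

p = predict_probability(intercept, coefficients, case)
# Assumed decision rule: predict 1 when the probability reaches 0.5.
label = 1 if p >= 0.5 else 0
print(round(p, 4), label)
```

With a binary label, the probability of class 0 is simply one minus this value, which is why a single equation is enough for the SPEC column.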

Logistic Regression Predict
- Perform predictions by applying the regression equation generated by Logistic Regression Train.
Parameter
- Inputs: Split Data-test_table, Logistic Regression Train-model

Evaluate Classification
- Evaluate the predictions from Logistic Regression Predict.
Parameter
- Inputs: Logistic Regression Predict
- Label Column: SPEC
- Prediction Column: prediction

5) We want to predict the species of a case with the following data. Please use the 'Create Table' function to input the dataset and apply three different classification approaches for prediction. Explain what three models predicted for new data, including the predicted outcome (probability values) for each species. (by Minh Phu)

Sepal_length: 5.8
Sepal_width: 3.2
Petal_length: 2.2
Petal_width: 1.5

- Use the KNN approach:
[Figure: KNN prediction result]
- Use the Decision Tree approach:
[Figure: Decision Tree prediction result]
- Use the Logistic Regression approach:
[Figure: Logistic Regression prediction result]
Convert the data to a binary form where 1 is setosa and 0 is not setosa; the result is shown in the figure above.

6) We need to report on the three classification approaches based on the analysis result to your boss. Create your report. In your report you should include all processes your team conducted including one page executive summary. (by Minh Phu)

First, with the KNN approach, the evaluation table shows that k = 7 gives the highest accuracy, 1. Although overfitting is possible, this is the most stable result compared with the other values of k. Therefore, we use k = 7 to predict results wherever the KNN method is needed in the following questions.

Based on what we learned, we then built a prediction model using the decision tree approach. We varied the max depth and relied on the Evaluate Classification function to judge which max depth best suits the data set. We found the model with max depth = 7 to be the most accurate, and its confusion matrix is extremely stable.

Continuing, we use the Logistic Regression approach. For this method, we must first convert the data into binary form: 1 means the setosa species and 0 means not setosa. The evaluation table shows that this model's accuracy is 1.

Taking the models built in the previous questions, we replaced the test data with a table made by the Create Table function. All 3 models predict that the new case belongs to the species setosa. The KNN approach with k = 11 gives probability_setosa = 0.72727272, which is quite good. The Decision Tree model gives probability_setosa = 1 (a possible sign of overfitting), and the Logistic Regression approach gives probability_1 = 0.6528 (a fairly low prediction confidence).
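
The KNN probability reported above, 0.72727272, equals 8/11, which is what an unweighted vote among k = 11 neighbors gives when 8 of them are setosa. A minimal sketch of that arithmetic follows, assuming uniform (unweighted) voting. The neighbor list is hypothetical, since the actual neighbors are not shown in this report, and `vote_probabilities` is an illustrative helper rather than a Brightics function.

```python
from collections import Counter

def vote_probabilities(neighbor_labels):
    """Class probabilities as the share of neighbors voting for each class."""
    counts = Counter(neighbor_labels)
    k = len(neighbor_labels)
    return {label: count / k for label, count in counts.items()}

# Hypothetical neighbor labels for k = 11: 8 setosa and 3 of another species.
neighbors = ["setosa"] * 8 + ["versicolor"] * 3

probs = vote_probabilities(neighbors)
# The predicted species is the one with the largest share of votes.
prediction = max(probs, key=probs.get)
print(prediction, probs["setosa"])
```

Under this voting rule the probability can only take multiples of 1/k, so with k = 11 a value such as 0.7273 reflects three dissenting neighbors rather than a finely graded confidence.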