ISYE6501: HW1

Taichi Nakatani (tnakatani3@gatech.edu)

0.1 Question 2.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.

0.1.1 ANSWER

I work in the field of Machine Translation, particularly as applied to product search. We evaluate our models in terms of machine translation quality (compared to a reference corpus), but we also want to measure how a better translation model positively affects product search quality. For instance, if a user types in “cadeau d’anniversaire pour maman” (“birthday present for mom” in French), we expect that a better translation will yield more relevant product matches than a poor translation would.

This can be framed as a classification task in which several predictors are used to determine whether a search result is relevant to the query. The classification can be a simple binary label (“relevant”/“irrelevant”), based on whether the search result yields a product that is relevant to the user’s query. Some predictors we could use are the following:

1. Input string (prior to translation)
2. Machine translation output
3. Product title
4. Product description

0.2 Question 2.2

Using the support vector machine function ksvm contained in the R package kernlab, find a good classifier for this data. Show the equation of your classifier, and how well it classifies the data points in the full data set. (Don’t worry about test/validation data yet; we’ll cover that topic soon.) You are welcome, but not required, to try other (nonlinear) kernels as well; we’re not covering them in this course, but they can sometimes be useful and might provide better predictions than vanilladot.

0.2.1 ANSWER

To get a better understanding of how the error penalty of the support vector machine model (the C parameter in ksvm) affects the model’s accuracy, I set up a loop that searches across different orders of magnitude of C.
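For reference, C enters the standard soft-margin SVM objective as the weight on the classification errors (this is the usual textbook formulation with labels $y_i \in \{-1, +1\}$; ksvm's internal parameterization may differ in constant factors):

$$\min_{a,\,a_0}\;\; \frac{1}{2}\sum_{j=1}^{m} a_j^2 \;+\; C\sum_{i=1}^{n} \xi_i \qquad \text{s.t.}\;\; y_i\Big(\sum_{j=1}^{m} a_j x_{ij} + a_0\Big) \ge 1 - \xi_i,\;\; \xi_i \ge 0,$$

where $\xi_i$ measures how far point $i$ falls on the wrong side of its margin. A very small C lets the optimizer ignore errors almost entirely, while a very large C forces it to chase individual misclassified points at the expense of the margin.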
Below is the code to run it; the results are printed below in table and graph form:

```r
library(kernlab)

# Load the data
data <- read.table(
  'credit_card_data.txt',
  sep = "\t",
  header = FALSE
)

# Separate features (X) from labels (y)
X <- as.matrix(data[, 1:10])
y <- as.factor(data[, 11])

# Store C iteration results in a data frame
results_df <- data.frame()

# Search across C values and evaluate performance
for (i in 1:25) {
  c <- 1e-11 * 10^i
  model <- ksvm(
    X, y,
    type = "C-svc",
    kernel = "vanilladot",
    C = c,
    scaled = TRUE
  )

  # calculate a1...am
  a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
  # calculate a0
  a0 <- -model@b

  # see what the model predicts
  pred <- predict(model, data[, 1:10])

  # Get frequency count of binary predictions
  pred_0 <- sum(pred == 0)
  pred_1 <- sum(pred == 1)

  # see what fraction of the model's predictions match the actual classification
  accuracy <- sum(pred == data[, 11]) / nrow(data)

  # Combine data points into a row, then append to the data frame
  result_row <- c(c, pred_0, pred_1, accuracy)
  results_df <- rbind(results_df, result_row)
}

## Setting default kernel parameters
## (printed once per loop iteration; repeated output truncated)
```

```r
colnames(results_df) <- c('C', 'pred_0', 'pred_1', 'accuracy')
results_df

##        C pred_0 pred_1  accuracy
## 1  1e-10    654      0 0.5474006
## 2  1e-09    654      0 0.5474006
## 3  1e-08    654      0 0.5474006
## 4  1e-07    654      0 0.5474006
## 5  1e-06    654      0 0.5474006
## 6  1e-05    654      0 0.5474006
## 7  1e-04    654      0 0.5474006
## 8  1e-03    408    246 0.8379205
## 9  1e-02    303    351 0.8639144
## 10 1e-01    303    351 0.8639144
## 11 1e+00    303    351 0.8639144
## 12 1e+01    303    351 0.8639144
## 13 1e+02    303    351 0.8639144
## 14 1e+03    304    350 0.8623853
## 15 1e+04    304    350 0.8623853
## 16 1e+05    303    351 0.8639144
## 17 1e+06    345    309 0.6253823
## 18 1e+07    403    251 0.5458716
## 19 1e+08    360    294 0.6636086
## 20 1e+09    353    301 0.8027523
## 21 1e+10    288    366 0.4923547
## 22 1e+11    344    310 0.6299694
## 23 1e+12    364    290 0.6880734
## 24 1e+13    289    365 0.6529052
## 25 1e+14    383    271 0.8027523
```

```r
# Plot the results
results_df$C <- as.factor(results_df$C)
plot(subset(results_df, select = c(C, accuracy)))
```

[Plot: model accuracy (y-axis, approximately 0.5 to 0.86) versus C (x-axis, 1e-10 to 1e+14, log-spaced)]

The table columns show the C value, the count of predictions for each class {0, 1}, and the accuracy of the model against the ground-truth labels, respectively. The plot renders the C and accuracy columns in visual form.

The table shows that very small C values (1e-10 to 1e-04) predict all data points to be zero, resulting in a poor accuracy score (0.55). As C increases, the distribution of predictions shows the model beginning to set its separating plane to discriminate between the two labels.
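As a sanity check (a one-line sketch, not part of the original run), the 0.547 accuracy for tiny C is exactly the majority-class baseline, i.e. the fraction of points labeled 0 in the data:

```r
# Accuracy of always predicting the most common label
max(table(y)) / length(y)  # should match 0.5474006
```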
C between 1e-04 and 1e-02 appears to be where the model begins to set a more accurate separating plane, reaching a maximum accuracy of 0.86 by 1e-02. Accuracy plateaus around 0.86 until 1e+06, where it begins to fluctuate, ranging from 0.49 (C at 1e+10) to 0.80 (C at 1e+14).

The increase in accuracy as the penalty applied to misclassified points grows (in the form of a higher C value) makes intuitive sense, since the margin becomes increasingly smaller as the cost of making mistakes gets larger. What is interesting, though, is that beyond a certain value of C (i.e. 1e+06) the accuracy begins to destabilize. One hypothesis is that as the misclassification cost becomes very high, the coefficients set by the model become noisier, because the optimization grows increasingly sensitive to individual errors.

Choosing the ksvm model with C set to 1e-01 as an optimal parameter, we can present the equation as follows:

```r
# Get the coefficients and intercept of a good classifier
good_model <- ksvm(
  X, y,
  type = "C-svc",
  kernel = "vanilladot",
  C = 1e-01,
  scaled = TRUE
)

## Setting default kernel parameters

# calculate a1...am
a <- colSums(good_model@xmatrix[[1]] * good_model@coef[[1]])
# calculate a0
a0 <- -good_model@b

print("Coefficients a1..am:")
## [1] "Coefficients a1..am:"
print(a)
##            V1            V2            V3            V4            V5
## -0.0011608980 -0.0006366002 -0.0015209679  0.0032020638  1.0041338724
##            V6            V7            V8            V9           V10
## -0.0033773669  0.0002428616 -0.0004747021 -0.0011931900  0.1064450527
print(paste("Intercept a0:", a0))
## [1] "Intercept a0: 0.0815522639500845"
```

0.2.1.1 Equation of optimal ksvm classifier

$$
\begin{aligned}
0 = {}& -0.0011608980\,x_1 - 0.0006366002\,x_2 - 0.0015209679\,x_3 + 0.0032020638\,x_4 \\
&+ 1.0041338724\,x_5 - 0.0033773669\,x_6 + 0.0002428616\,x_7 - 0.0004747021\,x_8 \\
&- 0.0011931900\,x_9 + 0.1064450527\,x_{10} + 0.0815522639500845
\end{aligned}
\tag{1}
$$

Here $x_1, \dots, x_{10}$ are the feature values, the numeric coefficients are $a_1, \dots, a_{10}$ from the code above, and the constant term is $a_0$. It is interesting to note that most features appear to have weak coefficients. The strongest feature appears to be a5, followed by a10.
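Because the model was fit with scaled=TRUE, these coefficients live in scaled feature space. As a quick sketch to verify equation (1), we can apply it manually and compare against the model's own predictions; this assumes kernlab's internal scaling matches R's default scale() (zero mean, unit variance):

```r
# Scale the features the same way ksvm presumably does internally
X_scaled <- scale(X)

# Apply the linear classifier: predict 1 when a . x + a0 > 0
manual_pred <- as.integer(X_scaled %*% a + a0 > 0)

# Fraction of points where the manual equation agrees with predict().
# If this is near 0 rather than 1, the sign convention for the two
# classes is simply flipped; check 1 minus this value instead.
mean(manual_pred == as.integer(as.character(predict(good_model, X))))
```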
0.3 Question 2.3

Using the k-nearest-neighbors classification function kknn contained in the R kknn package, suggest a good value of k, and show how well it classifies the data points in the full data set. Don’t forget to scale the data (scale=TRUE in kknn).

0.3.1 ANSWER

For this question, I used the kknn model as the predictor. To evaluate the effects of adjusting the k parameter, I did the following:

- Loop through a range of k parameters.
- For each k, determine the classifier’s accuracy across the entire dataset. To do this, I implement a “leave-one-out” cross-validation method: given a dataset, we train the model on all but one data point, calculate its accuracy on that held-out point, and repeat the process for each data point in the dataset.

```r
library(kknn)

# Create a data frame to save experiment results
knn_results_df <- data.frame()

# Search the k parameter to understand its effect on accuracy.
for (i in 1:100) {
  # Initialize an empty vector to store per-point results
  data_obs <- dim(data)[1]
  accuracies <- numeric(data_obs)

  # Loop through each row, leaving one data point out as the validation set.
  for (j in 1:data_obs) {
    X_train <- data[-j, ]
    X_valid <- data[j, ]
    knn_model <- kknn(
      V11 ~ .,
      X_train,
      X_valid,
      distance = 1,
      kernel = "rectangular",
      scale = TRUE,
      k = i
    )

    # Save predictions. Since the kknn response is continuous,
    # round to discretize results into binary form.
    pred <- round(fitted(knn_model))

    # Calculate the accuracy on held-out point j, then store it.
    accuracy <- sum(pred == X_valid[, 11]) / nrow(X_valid)
    accuracies[j] <- accuracy
  }
  avg_accuracy <- mean(accuracies)

  # Combine data points into a row, then append to the data frame
  result_row <- c(i, avg_accuracy)
  knn_results_df <- rbind(knn_results_df, result_row)
}

# Label the results from the above code chunk
colnames(knn_results_df) <- c('k', 'accuracy')

# Plot the results
plot(subset(knn_results_df, select = c(k, accuracy)))
```

[Plot: leave-one-out accuracy (y-axis, approximately 0.79 to 0.86) versus k (x-axis, 1 to 100)]

```r
# Get the optimal k value, as determined by accuracy.
knn_results_df[which.max(knn_results_df$accuracy), ]

##     k  accuracy
## 11 11 0.8608563
```

We see from the results that smaller k values have the lowest accuracy. This suggests that with a small k, the model does not have enough examples near the input to make an accurate prediction of the class. We see an increase in accuracy as k approaches 11, reaching a maximum accuracy of 0.86. From there the accuracy plateaus between 0.83 and 0.85, which suggests that past a certain level of k, the model sees an adequate number of examples to maintain a consistent level of classification accuracy. It would be interesting to see the behavior of the model once k reaches the actual size of the dataset, but due to computational constraints I’ve limited the search to 100.
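As a cross-check (a sketch, not part of the run above), the kknn package also ships train.kknn, which performs this same leave-one-out search internally and runs much faster than the manual loop. Note that with a continuous response it selects k by minimizing leave-one-out squared error rather than rounded-prediction accuracy, so the chosen k may differ slightly from 11:

```r
# Built-in leave-one-out CV over k = 1..100 with the same distance/kernel settings
loocv <- train.kknn(
  V11 ~ ., data,
  kmax = 100,
  distance = 1,
  kernel = "rectangular",
  scale = TRUE
)
loocv$best.parameters  # best k found by leave-one-out cross-validation
```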