ISYE6501: HW1
Taichi Nakatani (tnakatani3@gatech.edu)
0.1 Question 2.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a
classification model would be appropriate. List some (up to 5) predictors that you might use.
0.1.1 ANSWER
I work in the field of Machine Translation, particularly as applied to product search. We evaluate the quality of our models in terms of machine translation quality (compared to a reference corpus), but we also want to measure how a better translation model positively affects product search quality. For instance, if a user types in “cadeau d’anniversaire pour maman” (“birthday present for mom” in French), we expect that a better translation will yield more relevant product matches than a poor translation.
This can be framed as a classification task in which several predictors are used to determine whether a search result is relevant to the query. The classification can simply be a binary label (“relevant”/“irrelevant”), based on whether the search result yields a product that is relevant to the user’s query. Some predictors we could use are the following (an illustrative sketch of how they might feed a classifier follows the list):
1. Input string (prior to translation)
2. Machine translation output
3. Product title
4. Product description
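As an illustrative sketch only (the feature names, similarity scores, and data below are hypothetical, not taken from any real system), the raw text predictors would first be converted into numeric features, for example query-to-title and query-to-description similarity scores, before fitting a classifier such as ksvm:
library(kernlab)
# Hypothetical feature table: each row is one (query, product) pair.
# sim_title / sim_desc are made-up similarity scores between the
# translated query and the product title / description.
search_data <- data.frame(
  sim_title = c(0.91, 0.12, 0.75, 0.05, 0.66, 0.30),
  sim_desc  = c(0.80, 0.20, 0.55, 0.10, 0.70, 0.25),
  query_len = c(5, 5, 3, 7, 4, 6),            # tokens in the source query
  relevant  = factor(c(1, 0, 1, 0, 1, 0))     # binary label: relevant or not
)
# Fit a simple linear SVM on the numeric predictors (sketch only)
rel_model <- ksvm(
  relevant ~ sim_title + sim_desc + query_len,
  data = search_data,
  type = "C-svc",
  kernel = "vanilladot",
  C = 1
)
predict(rel_model, search_data)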
0.2 Question 2.2
Using the support vector machine function ksvm contained in the R package kernlab, find a good
classifier for this data. Show the equation of your classifier, and how well it classifies the data
points in the full data set. (Don’t worry about test/validation data yet; we’ll cover that topic
soon.)
You are welcome, but not required, to try other (nonlinear) kernels as well; we’re not covering
them in this course, but they can sometimes be useful and might provide better predictions than
vanilladot.
0.2.1 ANSWER
To better understand how the misclassification penalty of the support vector machine (the C parameter of the ksvm model) affects the accuracy of the model, I set up a loop that searches across different magnitudes of C. Below is the code to run it; the results are printed below in table and graph form:
# Load the kernlab package and the data
library(kernlab)
data <- read.table(
  'credit_card_data.txt',
  sep="\t",
  header=FALSE
)
# Separate features (X) from labels (y)
X = as.matrix(data[,1:10])
y = as.factor(data[,11])
# Store C iteration results in a data frame
results_df = data.frame()
# Search across C values and evaluate each model's performance
for (i in 1:25) {
  c <- 1e-11 * 10^i
  model <- ksvm(
    X,
    y,
    type="C-svc",
    kernel="vanilladot",
    C=c,
    scaled=TRUE
  )
  # calculate a1...am
  a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
  # calculate a0
  a0 <- -model@b
  # see what the model predicts
  pred <- predict(model, data[,1:10])
  # Get frequency count of binary predictions
  pred_0 = sum(pred == 0)
  pred_1 = sum(pred == 1)
  # see what fraction of the model's predictions match the actual classification
  accuracy <- sum(pred == data[,11]) / nrow(data)
  # Combine data points into a row, then append to the data frame
  result_row <- c(c, pred_0, pred_1, accuracy)
  results_df = rbind(results_df, result_row)
}
##  Setting default kernel parameters
(The message above is printed by ksvm once for each of the 25 fitted models; repeats omitted.)
colnames(results_df) <- c('C','pred_0','pred_1', 'accuracy')
results_df
##        C pred_0 pred_1  accuracy
## 1  1e-10    654      0 0.5474006
## 2  1e-09    654      0 0.5474006
## 3  1e-08    654      0 0.5474006
## 4  1e-07    654      0 0.5474006
## 5  1e-06    654      0 0.5474006
## 6  1e-05    654      0 0.5474006
## 7  1e-04    654      0 0.5474006
## 8  1e-03    408    246 0.8379205
## 9  1e-02    303    351 0.8639144
## 10 1e-01    303    351 0.8639144
## 11 1e+00    303    351 0.8639144
## 12 1e+01    303    351 0.8639144
## 13 1e+02    303    351 0.8639144
## 14 1e+03    304    350 0.8623853
## 15 1e+04    304    350 0.8623853
## 16 1e+05    303    351 0.8639144
## 17 1e+06    345    309 0.6253823
## 18 1e+07    403    251 0.5458716
## 19 1e+08    360    294 0.6636086
## 20 1e+09    353    301 0.8027523
## 21 1e+10    288    366 0.4923547
## 22 1e+11    344    310 0.6299694
## 23 1e+12    364    290 0.6880734
## 24 1e+13    289    365 0.6529052
## 25 1e+14    383    271 0.8027523
# Plot the results
results_df$C <- as.factor(results_df$C)
plot(subset(results_df, select=c(C, accuracy)))
(Figure: model accuracy plotted against C; C ranges from 1e-10 to 1e+13 on the x-axis and accuracy from roughly 0.5 to 0.86 on the y-axis.)
Table columns show the C value, the counts of predictions of {0, 1}, and the accuracy of the model against the ground-truth labels, respectively. The plot renders the C and accuracy columns in visual form.
The table shows that very small C values (e.g. 1e-10 to 1e-04) predict every data point to be zero, resulting in a poor accuracy score (0.55). As C increases, the distribution of predictions shifts and the model begins to place its separating plane so that it discriminates between the two labels. Between C = 1e-04 and 1e-02 the model starts finding a more accurate separating plane, reaching a maximum accuracy of 0.86 by C = 1e-02. Accuracy plateaus around 0.86 until 1e+06, after which it begins to fluctuate, ranging from 0.49 (C = 1e+10) to 0.80 (C = 1e+14).
The increase in accuracy as the penalty applied to misclassified points grows (in the form of a higher C value) makes intuitive sense, since the margin becomes smaller as the cost of making mistakes gets larger. What is interesting, though, is that beyond a certain value of C (i.e. 1e+06) the accuracy begins to destabilize. One hypothesis is that as the misclassification cost becomes very high, the coefficients set by the model become noisier because the model is increasingly sensitive to minimizing individual errors.
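To make the tradeoff concrete, the standard soft-margin objective for a linear C-svc SVM (stated here from the general formulation for intuition, not taken from the ksvm documentation) is:
minimize    ‖a‖² + C Σᵢ ξᵢ
subject to  yᵢ (a · xᵢ + a0) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0  for all i
A larger C makes each slack term ξᵢ more expensive, so the optimizer tolerates fewer margin violations at the cost of a narrower margin, which matches the pattern in the table above.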
Choosing the ksvm model with C set to 1e-01 as the optimal parameter, we can present its classifier equation as follows:
# Get the coefficients and intercept of a good classifier
good_model <- ksvm(
X,
y,
type="C-svc",
kernel="vanilladot",
C=1e-01,
scaled=TRUE
)
##  Setting default kernel parameters
# calculate a1...am
a <- colSums(good_model@xmatrix[[1]] * good_model@coef[[1]])
# calculate a0
a0 <- -good_model@b
print("Coefficients a1..am:")
## [1] "Coefficients a1..am:"
print(a)
##            V1            V2            V3            V4            V5
## -0.0011608980 -0.0006366002 -0.0015209679  0.0032020638  1.0041338724
##            V6            V7            V8            V9           V10
## -0.0033773669  0.0002428616 -0.0004747021 -0.0011931900  0.1064450527
print(paste("Intercept a0:", a0))
## [1] "Intercept a0: 0.0815522639500845"
0.2.1.1 Equation of optimal ksvm classifier
0 = a1 x1 + a2 x2 + · · · + a10 x10 + a0
  = (−0.0011608980) x1 + (−0.0006366002) x2 + (−0.0015209679) x3 + (0.0032020638) x4 +
    (1.0041338724) x5 + (−0.0033773669) x6 + (0.0002428616) x7 + (−0.0004747021) x8 +
    (−0.0011931900) x9 + (0.1064450527) x10 + 0.0815522639500845        (1)
It is interesting to note that most features have weak coefficients. The strongest coefficient appears to be a5 (on V5), followed by a10 (on V10).
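As a rough check (a sketch, not part of the graded output): since the model was fit with scaled=TRUE, the coefficients above live in the scaled feature space, so applying the equation to scale(X) should approximately reproduce the model's predictions. This assumes ksvm's internal scaling matches scale()'s default centering and standardization.
# Sketch: classify points directly with the linear equation above.
# Assumes `a`, `a0`, `X`, and `data` from the previous chunks, and that
# ksvm's internal scaling (scaled=TRUE) matches scale()'s defaults.
X_scaled <- scale(X)
manual_pred <- as.integer(X_scaled %*% a + a0 > 0)
# Fraction of points the equation classifies correctly
sum(manual_pred == data[,11]) / nrow(data)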
0.3 Question 2.3
Using the k-nearest-neighbors classification function kknn contained in the R kknn package,
suggest a good value of k, and show how well it classifies the data points in the full data set.
Don’t forget to scale the data (scale=TRUE in kknn).
0.3.1 ANSWER
For this question, I used the kknn model as the predictor. To evaluate the effect of adjusting the k parameter, I did the following:
- Loop through a range of k parameters.
- For each k, determine the classifier’s accuracy across the entire dataset. To do this, I implement a “leave-one-out” cross-validation method: given a dataset, we train the model on all but one data point, calculate its accuracy on that held-out point, and repeat the process for each data point in the dataset.
# Load the kknn package and create a data frame to save experiment results
library(kknn)
knn_results_df = data.frame()
# Search across k values to understand their effect on accuracy.
for (i in 1:100) {
  # Initialize an empty vector to store per-point results
  data_obs <- dim(data)[1]
  accuracies <- numeric(data_obs)
  # Loop through each row, leaving one data point out as the validation set.
  for (j in 1:data_obs) {
    X_train = data[-j,]
    X_valid = data[j,]
    knn_model <- kknn(
      V11~.,
      X_train,
      X_valid,
      distance = 1,
      kernel = "rectangular",
      scale = TRUE,
      k = i
    )
    # Save predictions. Since the kknn response is continuous,
    # do a simple rounding to discretize results into binary form.
    pred <- round(fitted(knn_model))
    # Calculate accuracy of the prediction on data point j, then
    # append it to the list.
    accuracy <- sum(pred == X_valid[,11]) / nrow(X_valid)
    accuracies[j] <- accuracy
  }
  avg_accuracy <- mean(accuracies)
  # Combine data points into a row, then append to the data frame
  result_row <- c(i, avg_accuracy)
  knn_results_df = rbind(knn_results_df, result_row)
}
# Get results from the above code chunk
colnames(knn_results_df) <- c('k', 'accuracy')
# Plot the results
plot(subset(knn_results_df, select=c(k, accuracy)))
(Figure: leave-one-out accuracy plotted against k; k ranges from 0 to 100 on the x-axis and accuracy from roughly 0.79 to 0.85 on the y-axis.)
# Get the most optimal K value, determined by accuracy.
knn_results_df[which.max(knn_results_df$accuracy),]
##     k  accuracy
## 11 11 0.8608563
We see from the results that smaller k values have the lowest accuracy. This suggests that with a small k the model does not have enough examples near the input to make an accurate prediction of the class. Accuracy increases as k approaches 11, reaching a maximum of 0.86. From there the accuracy plateaus between 0.83 and 0.85. This suggests that beyond a certain level of k, the model sees an adequate number of examples to maintain a consistent level of classification accuracy. It would be interesting to see the behavior of the model once k reaches the actual size of the dataset, but due to computational constraints I’ve limited the search to 100.
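As a possible cross-check (a sketch; I have not verified it against the manual loop above), the kknn package also ships train.kknn(), which performs leave-one-out cross-validation over a range of k values internally and could confirm the best k without the explicit double loop:
# Sketch: let train.kknn() do the leave-one-out search over k.
# Convert the label to a factor so it is treated as classification.
knn_data <- data
knn_data$V11 <- as.factor(knn_data$V11)
loo_fit <- train.kknn(
  V11 ~ .,
  data = knn_data,
  kmax = 100,
  distance = 1,
  kernel = "rectangular",
  scale = TRUE
)
loo_fit$best.parameters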