Classification Methods
The Stock Market Data
We will begin by examining some numerical and graphical summaries of the Smarket data,
which is part of the ISLR2 library. This data set consists of percentage returns for the S&P
500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each
date, we have recorded the percentage returns for each of the five previous trading days,
Lag1 through Lag5. We have also recorded Volume (the number of shares traded on
the previous day, in billions), Today (the percentage return on the date in question), and
Direction (whether the market was Up or Down on this date). Our goal is to predict
Direction (a qualitative response) using the other features.
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.3
names(Smarket)
## [1] "Year"
## [7] "Volume"
"Lag1"
"Today"
"Lag2"
"Lag3"
"Direction"
"Lag4"
"Lag5"
dim(Smarket)
## [1] 1250    9
summary(Smarket)
##       Year           Lag1                Lag2          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500  
##  Median :2003   Median : 0.039000   Median : 0.039000  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000  
##       Lag3                Lag4                Lag5         
##  Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200  
##  1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000  
##  Median : 0.038500   Median : 0.038500   Median : 0.03850  
##  Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561  
##  3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700  
##  Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300  
##      Volume           Today           Direction 
##  Min.   :0.3561   Min.   :-4.922000   Down:602  
##  1st Qu.:1.2574   1st Qu.:-0.639500   Up  :648  
##  Median :1.4229   Median : 0.038500             
##  Mean   :1.4783   Mean   : 0.003138             
##  3rd Qu.:1.6417   3rd Qu.: 0.596750             
##  Max.   :3.1525   Max.   : 5.733000             
pairs(Smarket)
The cor() function produces a matrix that contains all of the pairwise correlations among
the predictors in a data set. The first command below gives an error message because the
Direction variable is qualitative.
cor(Smarket)
## Error in cor(Smarket): 'x' must be numeric
cor(Smarket[, -9])
##              Year         Lag1         Lag2         Lag3         Lag4
## Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
## Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
## Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
## Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
## Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
## Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
## Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
## Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
##                Lag5      Volume        Today
## Year    0.029787995  0.53900647  0.030095229
## Lag1   -0.005674606  0.04090991 -0.026155045
## Lag2   -0.003557949 -0.04338321 -0.010250033
## Lag3   -0.018808338 -0.04182369 -0.002447647
## Lag4   -0.027083641 -0.04841425 -0.006899527
## Lag5    1.000000000 -0.02200231 -0.034860083
## Volume -0.022002315  1.00000000  0.014591823
## Today  -0.034860083  0.01459182  1.000000000
As one would expect, the correlations between the lag variables and today’s returns are
close to zero. In other words, there appears to be little correlation between today’s returns
and previous days’ returns. The only substantial correlation is between Year and Volume.
By plotting the data, which is ordered chronologically, we see that Volume is increasing
over time. In other words, the average number of shares traded daily increased from 2001
to 2005.
attach(Smarket)
plot(Volume)
Logistic Regression
Next, we will fit a logistic regression model in order to predict Direction using Lag1
through Lag5 and Volume. The glm() function can be used to fit many types of
generalized linear models, including logistic regression. The syntax of the glm() function is
similar to that of lm(), except that we must pass in the argument family = binomial in
order to tell R to run a logistic regression rather than some other type of generalized linear
model.
glm.fits <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket, family = binomial
)
summary(glm.fits)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3
The smallest p-value here is associated with Lag1. The negative coefficient for this
predictor suggests that if the market had a positive return yesterday, then it is less likely to
go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is
no clear evidence of a real association between Lag1 and Direction.
We use the coef() function in order to access just the coefficients for this fitted model. We
can also use the summary() function to access particular aspects of the fitted model, such
as the p-values for the coefficients.
coef(glm.fits)
##  (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5 
## -0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938  0.010313068 
##       Volume 
##  0.135440659
summary(glm.fits)$coef
##                 Estimate Std. Error    z value  Pr(>|z|)
## (Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
## Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
## Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
## Lag3         0.011085108 0.04993854  0.2219750 0.8243333
## Lag4         0.009358938 0.04997413  0.1872757 0.8514445
## Lag5         0.010313068 0.04951146  0.2082966 0.8349974
## Volume       0.135440659 0.15835970  0.8552723 0.3924004
summary(glm.fits)$coef[, 4]
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
##   0.6006983   0.1452272   0.3983491   0.8243333   0.8514445   0.8349974 
##      Volume 
##   0.3924004
The predict() function can be used to predict the probability that the market will go up,
given values of the predictors. The type = "response" option tells R to output probabilities
of the form P(Y = 1|X), as opposed to other information such as the logit. If no data set is
supplied to the predict() function, then the probabilities are computed for the training
data that was used to fit the logistic regression model. Here we have printed only the first
ten probabilities. We know that these values correspond to the probability of the market
going up, rather than down, because the contrasts() function indicates that R has created
a dummy variable with a 1 for Up.
glm.probs <- predict(glm.fits, type = "response")
glm.probs[1:10]
##         1         2         3         4         5         6         7         8 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 
##         9        10 
## 0.5176135 0.4888378
contrasts(Direction)
##      Up
## Down  0
## Up    1
In order to make a prediction as to whether the market will go up or down on a particular
day, we must convert these predicted probabilities into class labels, Up or Down. The
following two commands create a vector of class predictions based on whether the
predicted probability of a market increase is greater than or less than 0.5.
glm.pred <- rep("Down", 1250)
glm.pred[glm.probs > .5] = "Up"
The first command creates a vector of 1,250 Down elements. The second line transforms to
Up all of the elements for which the predicted probability of a market increase exceeds 0.5.
Given these predictions, the table() function can be used to produce a confusion matrix
in order to determine how many observations were correctly or incorrectly classified.
table(glm.pred, Direction)
##         Direction
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507
(507 + 145) / 1250
## [1] 0.5216
mean(glm.pred == Direction)
## [1] 0.5216
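As an aside, the same vector of class labels can be built in a single step; a minimal equivalent sketch (not part of the original lab):
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")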
The diagonal elements of the confusion matrix indicate correct predictions, while the
off-diagonals represent incorrect predictions. Hence our model correctly predicted that the
market would go up on 507 days and that it would go down on 145 days, for a total of 507 +
145 = 652 correct predictions. The mean() function can be used to compute the fraction of
days for which the prediction was correct. In this case, logistic regression correctly
predicted the movement of the market 52.2 % of the time.
At first glance, it appears that the logistic regression model is working a little better than
random guessing. However, this result is misleading because we trained and tested the
model on the same set of 1,250 observations. In other words, 100% − 52.2% = 47.8% is
the training error rate. As we have seen previously, the training error rate is often overly
optimistic—it tends to underestimate the test error rate. In order to better assess the
accuracy of the logistic regression model in this setting, we can fit the model using part of
the data, and then examine how well it predicts the held out data. This will yield a more
realistic error rate, in the sense that in practice we will be interested in our model’s
performance not on the data that we used to fit the model, but rather on days in the future
for which the market’s movements are unknown.
To implement this strategy, we will first create a vector corresponding to the observations
from 2001 through 2004. We will then use this vector to create a held out data set of
observations from 2005.
train <- (Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
## [1] 252   9
Direction.2005 <- Direction[!train]
The object train is a vector of 1,250 elements, corresponding to the observations in our
data set. The elements of the vector that correspond to observations that occurred before
2005 are set to TRUE, whereas those that correspond to observations in 2005 are set to
FALSE. The object train is a Boolean vector, since its elements are TRUE and FALSE.
Boolean vectors can be used to obtain a subset of the rows or columns of a matrix. For
instance, the command Smarket[train, ] would pick out a submatrix of the stock market
data set, corresponding only to the dates before 2005, since those are the ones for which
the elements of train are TRUE. The ! symbol can be used to reverse all of the elements of
a Boolean vector. That is, !train is a vector similar to train, except that the elements that
are TRUE in train get swapped to FALSE in !train, and the elements that are FALSE in train
get swapped to TRUE in !train. Therefore, Smarket[!train, ] yields a submatrix of the
stock market data containing only the observations for which train is FALSE—that is, the
observations with dates in 2005. The output above indicates that there are 252 such
observations.
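As a quick sanity check (a small sketch using the objects just defined), the counts of TRUE and FALSE elements match the two subsets:
sum(train)   # 998 observations before 2005
sum(!train)  # 252 observations in 2005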
We now fit a logistic regression model using only the subset of the observations that
correspond to dates before 2005, using the subset argument. We then obtain predicted
probabilities of the stock market going up for each of the days in our test set—that is, for
the days in 2005.
glm.fits <- glm(
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Smarket, family = binomial, subset = train
)
glm.probs <- predict(glm.fits, Smarket.2005,
type = "response")
Notice that we have trained and tested our model on two completely separate data sets:
training was performed using only the dates before 2005, and testing was performed using
only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to
the actual movements of the market over that time period.
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
mean(glm.pred == Direction.2005)
## [1] 0.4801587
mean(glm.pred != Direction.2005)
## [1] 0.5198413
The != notation means not equal to, and so the last command computes the test set error
rate. The results are rather disappointing: the test error rate is 52 %, which is worse than
random guessing! Of course this result is not all that surprising, given that one would not
generally expect to be able to use previous days’ returns to predict future market
performance. (After all, if it were possible to do so, then the authors of this book would be
out striking it rich rather than writing a statistics textbook.)
We recall that the logistic regression model had very underwhelming p-values associated
with all of the predictors, and that the smallest p-value, though not very small,
corresponded to Lag1. Perhaps by removing the variables that appear not to be helpful in
predicting direction, we can obtain a more effective model. After all, using predictors that
have no relationship with the response tends to cause a deterioration in the test error rate
(since such predictors cause an increase in variance without a corresponding decrease in
bias), and so removing such predictors may in turn yield an improvement. Below we have
refit the logistic regression using just Lag1 and Lag2, which seemed to have the
highest predictive power in the original logistic regression model.
glm.fits <- glm(Direction ~ Lag1 + Lag2, data = Smarket,
family = binomial, subset = train)
glm.probs <- predict(glm.fits, Smarket.2005,
type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2005)
##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106
mean(glm.pred == Direction.2005)
## [1] 0.5595238
106 / (106 + 76)
## [1] 0.5824176
Now the results appear to be a little better: 56% of the daily movements have been
correctly predicted. It is worth noting that in this case, a much simpler strategy of
predicting that the market will increase every day will also be correct 56% of the time!
Hence, in terms of overall error rate, the logistic regression method is no better than the
naive approach. However, the confusion matrix shows that on days when logistic
regression predicts an increase in the market, it has a 58% accuracy rate. This suggests a
possible trading strategy of buying on days when the model predicts an increasing market,
and avoiding trades on days when a decrease is predicted. Of course one would need to
investigate more carefully whether this small improvement was real or just due to random
chance.
Suppose that we want to predict the returns associated with particular values of Lag1
and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2
equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and −0.8. We do this
using the predict() function.
predict(glm.fits,
    newdata =
      data.frame(Lag1 = c(1.2, 1.5),
                 Lag2 = c(1.1, -0.8)),
    type = "response"
)
##         1         2 
## 0.4791462 0.4960939
Linear Discriminant Analysis
Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda()
function, which is part of the MASS library. Notice that the syntax for the lda() function is
identical to that of lm(), and to that of glm() except for the absence of the family option.
We fit the model using only the observations before 2005.
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
##     Boston
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket,
subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
plot(lda.fit)
The LDA output indicates that π̂1 = 0.492 and π̂2 = 0.508; in other words, 49.2 % of the
training observations correspond to days during which the market went down. It also
provides the group means; these are the average of each predictor within each class, and
are used by LDA as estimates of μk. These suggest that there is a tendency for the previous
2 days’ returns to be negative on days when the market increases, and a tendency for the
previous days’ returns to be positive on days when the market declines. The coefficients of
linear discriminants output provides the linear combination of Lag1 and Lag2 that are
used to form the LDA decision rule. In other words, these are the multipliers of the
elements of X = x in (4.24). If −0.642 × Lag1 − 0.514 × Lag2 is large, then the LDA
classifier will predict a market increase, and if it is small, then the LDA classifier will
predict a market decline.
The plot() function produces plots of the linear discriminants, obtained by computing
−0.642 × Lag1 − 0.514 × Lag2 for each of the training observations. The Up and Down
observations are displayed separately.
The predict() function returns a list with three elements. The first element, class,
contains LDA’s predictions about the movement of the market. The second element,
posterior, is a matrix whose kth column contains the posterior probability that the
corresponding observation belongs to the kth class, computed from (4.15). Finally, x
contains the linear discriminants, described earlier.
lda.pred <- predict(lda.fit, Smarket.2005)
names(lda.pred)
## [1] "class"
"posterior" "x"
As we observed in Section 4.5, the LDA and logistic regression predictions are almost
identical.
lda.class <- lda.pred$class
table(lda.class, Direction.2005)
##          Direction.2005
## lda.class Down  Up
##      Down   35  35
##      Up     76 106
mean(lda.class == Direction.2005)
## [1] 0.5595238
Applying a 50 % threshold to the posterior probabilities allows us to recreate the
predictions contained in lda.pred$class.
sum(lda.pred$posterior[, 1] >= .5)
## [1] 70
sum(lda.pred$posterior[, 1] < .5)
## [1] 182
Notice that the posterior probability output by the model corresponds to the probability
that the market will decrease:
lda.pred$posterior[1:20, 1]
##       999      1000      1001      1002      1003      1004      1005      1006 
## 0.4901792 0.4792185 0.4668185 0.4740011 0.4927877 0.4938562 0.4951016 0.4872861 
##      1007      1008      1009      1010      1011      1012      1013      1014 
## 0.4907013 0.4844026 0.4906963 0.5119988 0.4895152 0.4706761 0.4744593 0.4799583 
##      1015      1016      1017      1018 
## 0.4935775 0.5030894 0.4978806 0.4886331
lda.class[1:20]
##  [1] Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Up   Down Up   Up   Up  
## [16] Up   Up   Down Up   Up  
## Levels: Down Up
If we wanted to use a posterior probability threshold other than 50 % in order to make
predictions, then we could easily do so. For instance, suppose that we wish to predict a
market decrease only if we are very certain that the market will indeed decrease on that
day—say, if the posterior probability is at least 90 %.
sum(lda.pred$posterior[, 1] > .9)
## [1] 0
No days in 2005 meet that threshold! In fact, the greatest posterior probability of decrease
in all of 2005 was 52.02 %.
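This can be checked directly (a one-line sketch, not part of the original output):
max(lda.pred$posterior[, 1])  # ~0.5202, the largest predicted probability of decrease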
Quadratic Discriminant Analysis
We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda()
function, which is also part of the MASS library. The syntax is identical to that of lda().
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket,
subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
The output contains the group means. But it does not contain the coefficients of the linear
discriminants, because the QDA classifier involves a quadratic, rather than a linear,
function of the predictors. The predict() function works in exactly the same fashion as for
LDA.
qda.class <- predict(qda.fit, Smarket.2005)$class
table(qda.class, Direction.2005)
##          Direction.2005
## qda.class Down  Up
##      Down   30  20
##      Up     81 121
mean(qda.class == Direction.2005)
## [1] 0.5992063
Interestingly, the QDA predictions are accurate almost 60 % of the time, even though the
2005 data was not used to fit the model. This level of accuracy is quite impressive for stock
market data, which is known to be quite hard to model accurately. This suggests that the
quadratic form assumed by QDA may capture the true relationship more accurately than
the linear forms assumed by LDA and logistic regression. However, we recommend
evaluating this method’s performance on a larger test set before betting that this approach
will consistently beat the market!
Naive Bayes
Next, we fit a naive Bayes model to the Smarket data. Naive Bayes is implemented in R
using the naiveBayes() function, which is part of the e1071 library. The syntax is identical
to that of lda() and qda(). By default, this implementation of the naive Bayes classifier
models each quantitative feature using a Gaussian distribution. However, a kernel density
method can also be used to estimate the distributions.
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = Smarket,
subset = train)
nb.fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     Down       Up 
## 0.491984 0.508016 
## 
## Conditional probabilities:
##       Lag1
## Y             [,1]     [,2]
##   Down  0.04279022 1.227446
##   Up   -0.03954635 1.231668
## 
##       Lag2
## Y             [,1]     [,2]
##   Down  0.03389409 1.239191
##   Up   -0.03132544 1.220765
The output contains the estimated mean and standard deviation for each variable in each
class. For example, the mean for Lag1 is 0.0428 for Direction=Down, and the standard
deviation is 1.23. We can easily verify this:
mean(Lag1[train][Direction[train] == "Down"])
## [1] 0.04279022
sd(Lag1[train][Direction[train] == "Down"])
## [1] 1.227446
The predict() function is straightforward.
nb.class <- predict(nb.fit, Smarket.2005)
table(nb.class, Direction.2005)
##         Direction.2005
## nb.class Down  Up
##     Down   28  20
##     Up     83 121
mean(nb.class == Direction.2005)
## [1] 0.5912698
Naive Bayes performs very well on this data, with accurate predictions over 59% of the
time. This is slightly worse than QDA, but much better than LDA.
The predict() function can also generate estimates of the probability that each
observation belongs to a particular class.
nb.preds <- predict(nb.fit, Smarket.2005, type = "raw")
nb.preds[1:5, ]
##           Down        Up
## [1,] 0.4873164 0.5126836
## [2,] 0.4762492 0.5237508
## [3,] 0.4653377 0.5346623
## [4,] 0.4748652 0.5251348
## [5,] 0.4901890 0.5098110
K-Nearest Neighbors
We will now perform KNN using the knn() function, which is part of the class library. This
function works rather differently from the other model-fitting functions that we have
encountered thus far. Rather than a two-step approach in which we first fit the model and
then we use the model to make predictions, knn() forms predictions using a single
command. The function requires four inputs.
• A matrix containing the predictors associated with the training data, labeled
  train.X below.
• A matrix containing the predictors associated with the data for which we wish to
  make predictions, labeled test.X below.
• A vector containing the class labels for the training observations, labeled
  train.Direction below.
• A value for K, the number of nearest neighbors to be used by the classifier.
We use the cbind() function, short for column bind, to bind the Lag1 and Lag2
variables together into two matrices, one for the training set and the other for the test set.
library(class)
## Warning: package 'class' was built under R version 4.3.3
train.X <- cbind(Lag1, Lag2)[train, ]
dim(train.X)
## [1] 998   2
test.X <- cbind(Lag1, Lag2)[!train, ]
train.Direction <- Direction[train]
length(train.X)
## [1] 1996
length(train.Direction)
## [1] 998
Now the knn() function can be used to predict the market’s movement for the dates in
2005. We set a random seed before we apply knn() because if several observations are
tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set
in order to ensure reproducibility of results.
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2005)
##         Direction.2005
## knn.pred Down Up
##     Down   43 58
##     Up     68 83
mean(knn.pred == Direction.2005)
## [1] 0.5
(83 + 43) / 252
## [1] 0.5
The results using K = 1 are not very good, since only 50 % of the observations are correctly
predicted. Of course, it may be that K = 1 results in an overly flexible fit to the data. Below,
we repeat the analysis using K = 3.
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
##         Direction.2005
## knn.pred Down Up
##     Down   48 54
##     Up     63 87
mean(knn.pred == Direction.2005)
## [1] 0.5357143
The results have improved slightly. But increasing K further turns out to provide no further
improvements. It appears that for this data, QDA provides the best results of the methods
that we have examined so far.
KNN does not perform well on the Smarket data, but it does often provide impressive
results. As an example we will apply the KNN approach to the Caravan data set, which is
part of the ISLR2 library. This data set includes 85 predictors that measure demographic
characteristics for 5,822 individuals. The response variable is Purchase, which indicates
whether or not a given individual purchases a caravan insurance policy. In this data set,
only 6 % of people purchased caravan insurance.
dim(Caravan)
## [1] 5822   86
attach(Caravan)
summary(Purchase)
##   No  Yes 
## 5474  348
348 / 5822
## [1] 0.05977327
Because the KNN classifier predicts the class of a given test observation by identifying the
observations that are nearest to it, the scale of the variables matters. Variables that are on
a large scale will have a much larger effect on the distance between the observations, and
hence on the KNN classifier, than variables that are on a small scale. For instance, imagine
a data set that contains two variables, salary and age (measured in dollars and years,
respectively). As far as KNN is concerned, a difference of $1,000 in salary is enormous
compared to a difference of 50 years in age. Consequently, salary will drive the KNN
classification results, and age will have almost no effect. This is contrary to our intuition
that a salary difference of $1,000 is quite small compared to an age difference of 50 years.
Furthermore, the importance of scale to the KNN classifier leads to another issue: if we
measured salary in Japanese yen, or if we measured age in minutes, then we’d get quite
different classification results from what we get if these two variables are measured in
dollars and years.
A good way to handle this problem is to standardize the data so that all variables are given
a mean of zero and a standard deviation of one. Then all variables will be on a comparable
scale. The scale() function does just this. In standardizing the data, we exclude column
86, because that is the qualitative Purchase variable.
standardized.X <- scale(Caravan[, -86])
var(Caravan[, 1])
## [1] 165.0378
var(Caravan[, 2])
## [1] 0.1647078
var(standardized.X[, 1])
## [1] 1
var(standardized.X[, 2])
## [1] 1
Now every column of standardized.X has a standard deviation of one and a mean of zero.
We now split the observations into a test set, containing the first 1,000 observations, and a
training set, containing the remaining observations. We fit a KNN model on the training
data using K = 1, and evaluate its performance on the test data.
test <- 1:1000
train.X <- standardized.X[-test, ]
test.X <- standardized.X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Y, k = 1)
mean(test.Y != knn.pred)
## [1] 0.118
mean(test.Y != "No")
## [1] 0.059
The vector test is numeric, with values from 1 through 1,000. Typing
standardized.X[test, ] yields the submatrix of the data containing the observations
whose indices range from 1 to 1,000, whereas typing
standardized.X[-test, ] yields the submatrix containing the observations whose indices
do not range from 1 to 1,000. The KNN error rate on the 1,000 test observations is just
under 12 %. At first glance, this may appear to be fairly good. However, since only 6 % of
customers purchased insurance, we could get the error rate down to 6 % by always
predicting No regardless of the values of the predictors!
Suppose that there is some non-trivial cost to trying to sell insurance to a given individual.
For instance, perhaps a salesperson must visit each potential customer. If the company
tries to sell insurance to a random selection of customers, then the success rate will be
only 6 %, which may be far too low given the costs involved. Instead, the company would
like to try to sell insurance only to customers who are likely to buy it. So the overall error
rate is not of interest. Instead, the fraction of individuals that are correctly predicted to buy
insurance is of interest.
It turns out that KNN with K = 1 does far better than random guessing among the
customers that are predicted to buy insurance. Among 77 such customers, 9, or 11.7 %,
actually do purchase insurance. This is double the rate that one would obtain from random
guessing.
table(knn.pred, test.Y)
##         test.Y
## knn.pred  No Yes
##      No  873  50
##      Yes  68   9
9 / (68 + 9)
## [1] 0.1168831
Using K = 3, the success rate increases to 19 %, and with K = 5 the rate is 26.7 %. This is
over four times the rate that results from random guessing. It appears that KNN is finding
some real patterns in a difficult data set!
knn.pred <- knn(train.X, test.X, train.Y, k = 3)
table(knn.pred, test.Y)
##         test.Y
## knn.pred  No Yes
##      No  920  54
##      Yes  21   5
5 / 26
## [1] 0.1923077
knn.pred <- knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, test.Y)
##         test.Y
## knn.pred  No Yes
##      No  930  55
##      Yes  11   4
4 / 15
## [1] 0.2666667
However, while this strategy is cost-effective, it is worth noting that only 15 customers are
predicted to purchase insurance using KNN with K = 5. In practice, the insurance
company may wish to expend resources on convincing more than just 15 potential
customers to buy insurance.
As a comparison, we can also fit a logistic regression model to the data. If we use 0.5 as
the predicted probability cut-off for the classifier, then we have a problem: only seven of
the test observations are predicted to purchase insurance. Even worse, we are wrong
about all of these! However, we are not required to use a cut-off of 0.5. If we instead
predict a purchase any time the predicted probability of purchase exceeds 0.25, we get
much better results: we predict that 33 people will purchase insurance, and we are correct
for about 33 % of these people. This is over five times better than random guessing!
glm.fits <- glm(Purchase ~ ., data = Caravan,
family = binomial, subset = -test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.probs <- predict(glm.fits, Caravan[test, ],
type = "response")
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .5] <- "Yes"
table(glm.pred, test.Y)
##         test.Y
## glm.pred  No Yes
##      No  934  59
##      Yes   7   0
glm.pred <- rep("No", 1000)
glm.pred[glm.probs > .25] <- "Yes"
table(glm.pred, test.Y)
##         test.Y
## glm.pred  No Yes
##      No  919  48
##      Yes  22  11
11 / (22 + 11)
## [1] 0.3333333
Poisson Regression
Finally, we fit a Poisson regression model to the Bikeshare data set, which measures the
number of bike rentals (bikers) per hour in Washington, DC. The data can be found in the
ISLR2 library.
attach(Bikeshare)
dim(Bikeshare)
## [1] 8645   15
names(Bikeshare)
## [1] "season"
## [6] "weekday"
## [11] "hum"
"mnth"
"day"
"hr"
"holiday"
"workingday" "weathersit" "temp"
"atemp"
"windspeed" "casual"
"registered" "bikers"
We begin by fitting a least squares linear regression model to the data.
mod.lm <- lm(
bikers ~ mnth + hr + workingday + temp + weathersit,
data = Bikeshare
)
summary(mod.lm)
## 
## Call:
## lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     data = Bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.00  -45.70   -6.23   41.08  425.29 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -68.632      5.307 -12.932  < 2e-16 ***
## mnthFeb                       6.845      4.287   1.597 0.110398    
## mnthMarch                    16.551      4.301   3.848 0.000120 ***
## mnthApril                    41.425      4.972   8.331  < 2e-16 ***
## mnthMay                      72.557      5.641  12.862  < 2e-16 ***
## mnthJune                     67.819      6.544  10.364  < 2e-16 ***
## mnthJuly                     45.324      7.081   6.401 1.63e-10 ***
## mnthAug                      53.243      6.640   8.019 1.21e-15 ***
## mnthSept                     66.678      5.925  11.254  < 2e-16 ***
## mnthOct                      75.834      4.950  15.319  < 2e-16 ***
## mnthNov                      60.310      4.610  13.083  < 2e-16 ***
## mnthDec                      46.458      4.271  10.878  < 2e-16 ***
## hr1                         -14.579      5.699  -2.558 0.010536 *  
## hr2                         -21.579      5.733  -3.764 0.000168 ***
## hr3                         -31.141      5.778  -5.389 7.26e-08 ***
## hr4                         -36.908      5.802  -6.361 2.11e-10 ***
## hr5                         -24.135      5.737  -4.207 2.61e-05 ***
## hr6                          20.600      5.704   3.612 0.000306 ***
## hr7                         120.093      5.693  21.095  < 2e-16 ***
## hr8                         223.662      5.690  39.310  < 2e-16 ***
## hr9                         120.582      5.693  21.182  < 2e-16 ***
## hr10                         83.801      5.705  14.689  < 2e-16 ***
## hr11                        105.423      5.722  18.424  < 2e-16 ***
## hr12                        137.284      5.740  23.916  < 2e-16 ***
## hr13                        136.036      5.760  23.617  < 2e-16 ***
## hr14                        126.636      5.776  21.923  < 2e-16 ***
## hr15                        132.087      5.780  22.852  < 2e-16 ***
## hr16                        178.521      5.772  30.927  < 2e-16 ***
## hr17                        296.267      5.749  51.537  < 2e-16 ***
## hr18                        269.441      5.736  46.976  < 2e-16 ***
## hr19                        186.256      5.714  32.596  < 2e-16 ***
## hr20                        125.549      5.704  22.012  < 2e-16 ***
## hr21                         87.554      5.693  15.378  < 2e-16 ***
## hr22                         59.123      5.689  10.392  < 2e-16 ***
## hr23                         26.838      5.688   4.719 2.41e-06 ***
## workingday                    1.270      1.784   0.711 0.476810    
## temp                        157.209     10.261  15.321  < 2e-16 ***
## weathersitcloudy/misty      -12.890      1.964  -6.562 5.60e-11 ***
## weathersitlight rain/snow   -66.494      2.965 -22.425  < 2e-16 ***
## weathersitheavy rain/snow  -109.745     76.667  -1.431 0.152341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 8605 degrees of freedom
## Multiple R-squared:  0.6745, Adjusted R-squared:  0.6731 
## F-statistic: 457.3 on 39 and 8605 DF,  p-value: < 2.2e-16
Due to space constraints, we truncate the output of summary(mod.lm). In mod.lm, the first
level of hr (0) and mnth (Jan) are treated as the baseline values, and so no coefficient
estimates are provided for them: implicitly, their coefficient estimates are zero, and all
other levels are measured relative to these baselines. For example, the Feb coefficient of
6.845 signifies that, holding all other variables constant, there are on average about 7 more
riders in February than in January. Similarly there are about 16.5 more riders in March than
in January.
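For instance, the February effect can be pulled out of the fitted model directly (a one-line illustration using the coefficient name shown above):
coef(mod.lm)["mnthFeb"]  # ~6.845 more riders than the January baseline, other things equal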
The results seen in Section 4.6.1 used a slightly different coding of the variables hr and
mnth, as follows:
contrasts(Bikeshare$hr) = contr.sum(24)
contrasts(Bikeshare$mnth) = contr.sum(12)
mod.lm2 <- lm(
bikers ~ mnth + hr + workingday + temp + weathersit,
data = Bikeshare
)
summary(mod.lm2)
## 
## Call:
## lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     data = Bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.00  -45.70   -6.23   41.08  425.29 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 73.5974     5.1322  14.340  < 2e-16 ***
## mnth1                      -46.0871     4.0855 -11.281  < 2e-16 ***
## mnth2                      -39.2419     3.5391 -11.088  < 2e-16 ***
## mnth3                      -29.5357     3.1552  -9.361  < 2e-16 ***
## mnth4                       -4.6622     2.7406  -1.701  0.08895 .  
## mnth5                       26.4700     2.8508   9.285  < 2e-16 ***
## mnth6                       21.7317     3.4651   6.272 3.75e-10 ***
## mnth7                       -0.7626     3.9084  -0.195  0.84530    
## mnth8                        7.1560     3.5347   2.024  0.04295 *  
## mnth9                       20.5912     3.0456   6.761 1.46e-11 ***
## mnth10                      29.7472     2.6995  11.019  < 2e-16 ***
## mnth11                      14.2229     2.8604   4.972 6.74e-07 ***
## hr1                        -96.1420     3.9554 -24.307  < 2e-16 ***
## hr2                       -110.7213     3.9662 -27.916  < 2e-16 ***
## hr3                       -117.7212     4.0165 -29.310  < 2e-16 ***
## hr4                       -127.2828     4.0808 -31.191  < 2e-16 ***
## hr5                       -133.0495     4.1168 -32.319  < 2e-16 ***
## hr6                       -120.2775     4.0370 -29.794  < 2e-16 ***
## hr7                        -75.5424     3.9916 -18.925  < 2e-16 ***
## hr8                         23.9511     3.9686   6.035 1.65e-09 ***
## hr9                        127.5199     3.9500  32.284  < 2e-16 ***
## hr10                        24.4399     3.9360   6.209 5.57e-10 ***
## hr11                       -12.3407     3.9361  -3.135  0.00172 ** 
## hr12                         9.2814     3.9447   2.353  0.01865 *  
## hr13                        41.1417     3.9571  10.397  < 2e-16 ***
## hr14                        39.8939     3.9750  10.036  < 2e-16 ***
## hr15                        30.4940     3.9910   7.641 2.39e-14 ***
## hr16                        35.9445     3.9949   8.998  < 2e-16 ***
## hr17                        82.3786     3.9883  20.655  < 2e-16 ***
## hr18                       200.1249     3.9638  50.488  < 2e-16 ***
## hr19                       173.2989     3.9561  43.806  < 2e-16 ***
## hr20                        90.1138     3.9400  22.872  < 2e-16 ***
## hr21                        29.4071     3.9362   7.471 8.74e-14 ***
## hr22                        -8.5883     3.9332  -2.184  0.02902 *  
## hr23                       -37.0194     3.9344  -9.409  < 2e-16 ***
## workingday                   1.2696     1.7845   0.711  0.47681    
## temp                       157.2094    10.2612  15.321  < 2e-16 ***
## weathersitcloudy/misty     -12.8903     1.9643  -6.562 5.60e-11 ***
## weathersitlight rain/snow  -66.4944     2.9652 -22.425  < 2e-16 ***
## weathersitheavy rain/snow -109.7446    76.6674  -1.431  0.15234    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 8605 degrees of freedom
## Multiple R-squared:  0.6745, Adjusted R-squared:  0.6731 
## F-statistic: 457.3 on 39 and 8605 DF,  p-value: < 2.2e-16
What is the difference between the two codings? In mod.lm2, a coefficient estimate is
reported for all but the last level of hr and mnth. Importantly, in mod.lm2, the coefficient
estimate for the last level of mnth is not zero: instead, it equals the negative of the sum of
the coefficient estimates for all of the other levels. Similarly, in mod.lm2, the coefficient
estimate for the last level of hr is the negative of the sum of the coefficient estimates for all
of the other levels. This means that the coefficients of hr and mnth in mod.lm2 will always
sum to zero, and can be interpreted as the difference from the mean level. For example,
the coefficient for January of −46.087 indicates that, holding all other variables constant,
there are typically 46 fewer riders in January relative to the yearly average.
It is important to realize that the choice of coding really does not matter, provided that we
interpret the model output correctly in light of the coding used. For example, we see that
the predictions from the linear model are the same regardless of coding:
sum((predict(mod.lm) - predict(mod.lm2))^2)
## [1] 1.586608e-18
The sum of squared differences is zero. We can also see this using the all.equal()
function:
all.equal(predict(mod.lm), predict(mod.lm2))
## [1] TRUE
To reproduce the left-hand side of Figure 4.13, we must first obtain the coefficient
estimates associated with mnth. The coefficients for January through November can be
obtained directly from the mod.lm2 object. The coefficient for December must be explicitly
computed as the negative sum of all the other months.
coef.months <- c(coef(mod.lm2)[2:12],
-sum(coef(mod.lm2)[2:12]))
To make the plot, we manually label the x-axis with the names of the months.
plot(coef.months, xlab = "Month", ylab = "Coefficient",
xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A",
"M", "J", "J", "A", "S", "O", "N", "D"))
Reproducing the right-hand side of Figure 4.13 follows a similar process.
coef.hours <- c(coef(mod.lm2)[13:35],
-sum(coef(mod.lm2)[13:35]))
plot(coef.hours, xlab = "Hour", ylab = "Coefficient",
col = "blue", pch = 19, type = "o")
Now, we consider instead fitting a Poisson regression model to the Bikeshare data. Very
little changes, except that we now use the function glm() with the argument family =
poisson to specify that we wish to fit a Poisson regression model:
mod.pois <- glm(
bikers ~ mnth + hr + workingday + temp + weathersit,
data = Bikeshare, family = poisson
)
summary(mod.pois)
## 
## Call:
## glm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     family = poisson, data = Bikeshare)
## 
## Coefficients:
##                            Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                4.118245   0.006021  683.964  < 2e-16 ***
## mnth1                     -0.670170   0.005907 -113.445  < 2e-16 ***
## mnth2                     -0.444124   0.004860  -91.379  < 2e-16 ***
## mnth3                     -0.293733   0.004144  -70.886  < 2e-16 ***
## mnth4                      0.021523   0.003125    6.888 5.66e-12 ***
## mnth5                      0.240471   0.002916   82.462  < 2e-16 ***
## mnth6                      0.223235   0.003554   62.818  < 2e-16 ***
## mnth7                      0.103617   0.004125   25.121  < 2e-16 ***
## mnth8                      0.151171   0.003662   41.281  < 2e-16 ***
## mnth9                      0.233493   0.003102   75.281  < 2e-16 ***
## mnth10                     0.267573   0.002785   96.091  < 2e-16 ***
## mnth11                     0.150264   0.003180   47.248  < 2e-16 ***
## hr1                       -0.754386   0.007879  -95.744  < 2e-16 ***
## hr2                       -1.225979   0.009953 -123.173  < 2e-16 ***
## hr3                       -1.563147   0.011869 -131.702  < 2e-16 ***
## hr4                       -2.198304   0.016424 -133.846  < 2e-16 ***
## hr5                       -2.830484   0.022538 -125.586  < 2e-16 ***
## hr6                       -1.814657   0.013464 -134.775  < 2e-16 ***
## hr7                       -0.429888   0.006896  -62.341  < 2e-16 ***
## hr8                        0.575181   0.004406  130.544  < 2e-16 ***
## hr9                        1.076927   0.003563  302.220  < 2e-16 ***
## hr10                       0.581769   0.004286  135.727  < 2e-16 ***
## hr11                       0.336852   0.004720   71.372  < 2e-16 ***
## hr12                       0.494121   0.004392  112.494  < 2e-16 ***
## hr13                       0.679642   0.004069  167.040  < 2e-16 ***
## hr14                       0.673565   0.004089  164.722  < 2e-16 ***
## hr15                       0.624910   0.004178  149.570  < 2e-16 ***
## hr16                       0.653763   0.004132  158.205  < 2e-16 ***
## hr17                       0.874301   0.003784  231.040  < 2e-16 ***
## hr18                       1.294635   0.003254  397.848  < 2e-16 ***
## hr19                       1.212281   0.003321  365.084  < 2e-16 ***
## hr20                       0.914022   0.003700  247.065  < 2e-16 ***
## hr21                       0.616201   0.004191  147.045  < 2e-16 ***
## hr22                       0.364181   0.004659   78.173  < 2e-16 ***
## hr23                       0.117493   0.005225   22.488  < 2e-16 ***
## workingday                 0.014665   0.001955    7.502 6.27e-14 ***
## temp                       0.785292   0.011475   68.434  < 2e-16 ***
## weathersitcloudy/misty    -0.075231   0.002179  -34.528  < 2e-16 ***
## weathersitlight rain/snow -0.575800   0.004058 -141.905  < 2e-16 ***
## weathersitheavy rain/snow -0.926287   0.166782   -5.554 2.79e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1052921  on 8644  degrees of freedom
## Residual deviance:  228041  on 8605  degrees of freedom
## AIC: 281159
## 
## Number of Fisher Scoring iterations: 5
We can plot the coefficients associated with mnth and hr, in order to reproduce Figure
4.15:
coef.mnth <- c(coef(mod.pois)[2:12],
-sum(coef(mod.pois)[2:12]))
plot(coef.mnth, xlab = "Month", ylab = "Coefficient",
xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A", "M", "J", "J", "A",
"S", "O", "N", "D"))
coef.hours <- c(coef(mod.pois)[13:35],
-sum(coef(mod.pois)[13:35]))
plot(coef.hours, xlab = "Hour", ylab = "Coefficient",
col = "blue", pch = 19, type = "o")
We can once again use the predict() function to obtain the fitted values (predictions)
from this Poisson regression model. However, we must use the argument type =
"response" to specify that we want R to output exp(π½Μ0 + π½Μ1 π1 + β― + π½Μπ ππ ) rather than
π½Μ0 + π½Μ1 π1 + β― + π½Μπ ππ , which it will output by default.
plot(predict(mod.lm2), predict(mod.pois, type = "response"))
abline(0, 1, col = 2, lwd = 3)
The predictions from the Poisson regression model are correlated with those from the
linear model; however, the former are non-negative. As a result the Poisson regression
predictions tend to be larger than those from the linear model for either very low or very
high levels of ridership.
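One way to see this concretely (a small check, not in the original output): the linear model can produce negative fitted values, while the Poisson fitted values are exponentials and hence always positive.
sum(predict(mod.lm2) < 0)                      # count of negative linear fitted values
sum(predict(mod.pois, type = "response") < 0)  # always 0 for the Poisson model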
In this section, we used the glm() function with the argument family = poisson in order
to perform Poisson regression. Earlier in this lab we used the glm() function with family =
binomial to perform logistic regression. Other choices for the family argument can be
used to fit other types of GLMs. For instance, family = Gamma fits a gamma regression
model.
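For instance, a minimal sketch of a gamma fit on the same predictors (illustrative only; it assumes the response is strictly positive, which min(Bikeshare$bikers) can confirm):
mod.gamma <- glm(
  bikers ~ mnth + hr + workingday + temp + weathersit,
  data = Bikeshare, family = Gamma(link = "log")
)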
Exercises
13. This question should be answered using the Weekly data set, which is part
of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s
lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to
the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there
appear to be any patterns?
library(ISLR2)
dim(Weekly)
## [1] 1089    9
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
pairs(Weekly)
(d) Now fit the logistic regression model using a training data period from 1990 to 2008,
with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of
correct predictions for the held out data (that is, the data from 2009 and 2010).
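For reference, a direct logistic-regression answer to (d) might look like the following sketch (the answer below instead uses KNN on Lag2):
wk.train <- Weekly$Year <= 2008
wk.glm <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = wk.train)
wk.probs <- predict(wk.glm, Weekly[!wk.train, ], type = "response")
wk.pred <- ifelse(wk.probs > 0.5, "Up", "Down")
table(wk.pred, Weekly$Direction[!wk.train])
mean(wk.pred == Weekly$Direction[!wk.train])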
#find unique year in Weekly
unique(Weekly$Year)
##  [1] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## [16] 2005 2006 2007 2008 2009 2010
#change year to numeric
Weekly$Year <- as.numeric(Weekly$Year)
dim(Weekly)  # 1089 x 9
## [1] 1089    9
#split the data
set.seed(1)
train <- (Weekly$Year <= 2008)
test <- (Weekly$Year > 2008)
X_train <- matrix(Weekly$Lag2[train])
X_test <- matrix(Weekly$Lag2[test])
#fit the knn model
knn.pred <- knn(X_train,X_test,Weekly$Direction[train], k = 1)
#confusion matrix
table(knn.pred, Weekly$Direction[test])
##         
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred == Weekly$Direction[test])
## [1] 0.5
(j) Experiment with different combinations of predictors, including possible
transformations and interactions, for each of the methods. Report the variables,
method, and associated confusion matrix that appears to provide the best results
on the held out data. Note that you should also experiment with values for K in the
KNN classifier.
names(Weekly)
## [1] "Year"
## [7] "Volume"
"Lag1"
"Today"
"Lag2"
"Lag3"
"Direction"
"Lag4"
#using all predictor
X_train <- Weekly[,-9][train,]
x_test <- Weekly[,-9][test,]
knn_1.pred <- knn(X_train, x_test, Weekly$Direction[train], k = 1)
mean(knn_1.pred == Weekly$Direction[test])
"Lag5"
## [1] 0.7980769
# add an interaction and polynomial transformations of Lag2
Weekly["Lag2*Volume"] <- Weekly$Lag2 * Weekly$Volume
Weekly["Lag2^2"] <- Weekly$Lag2^2
Weekly["Lag2^3"] <- Weekly$Lag2^3
Weekly["Lag2^4"] <- Weekly$Lag2^4
Weekly["Lag2^5"] <- Weekly$Lag2^5
X_train <- Weekly[,-9][train,]
x_test <- Weekly[,-9][test,]
knn_2.pred <- knn(X_train, x_test, Weekly$Direction[train], k = 1)
mean(knn_2.pred == Weekly$Direction[test])
## [1] 0.5673077
#loop to find the best k + interaction
k_values <- 1:20
error_rate <- rep(0,20)
for (i in 1:20){
knn.pred <- knn(X_train, x_test, Weekly$Direction[train], k = i)
error_rate[i] <- mean(knn.pred != Weekly$Direction[test])
}
plot(k_values, error_rate, type = "l", xlab = "K", ylab = "Error Rate")
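The best value of K can also be read off programmatically (a small addition to the loop above):
k_values[which.min(error_rate)]  # K with the lowest held-out error rate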
#loop to find the best k without interaction
X_train <- Weekly[,1:8][train,]
x_test <- Weekly[,1:8][test,]
for (i in 1:20){
knn.pred <- knn(X_train, x_test, Weekly$Direction[train], k = i)
error_rate[i] <- mean(knn.pred != Weekly$Direction[test])
}
plot(k_values, error_rate, type = "l", xlab = "K", ylab = "Error Rate")
14. In this problem, you will develop a model to predict whether a given car gets high or
low gas mileage based on the Auto data set.
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its
median, and a 0 if mpg contains a value below its median. You can compute the
median using the median() function. Note you may find it helpful to use the
data.frame() function to create a single data set containing both mpg01 and the
other Auto variables.
data(Auto)
median(Auto$mpg)
## [1] 22.75
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Auto <- data.frame(Auto)
(b) Explore the data graphically in order to investigate the association between mpg01
and the other features. Which of the other features seem most likely to be useful in
predicting mpg01? Scatterplots and boxplots may be useful tools to answer this
question. Describe your findings.
pairs(Auto)
# keep the numeric columns
Auto_num <- Auto[,sapply(Auto, is.numeric)]
#install corrplot
install.packages("corrplot")
## Installing package into 'C:/Users/ASUS/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## Error in contrib.url(repos, "source"): trying to use CRAN without setting
a mirror
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
# plot the correlation heatmap
corrplot(cor(Auto_num), method = "color")
# based on the heatmap, the features most strongly correlated with mpg01 are
# cylinders, displacement, horsepower, weight, and year
(c) Split the data into a training set and a test set
set.seed(1)
train <- sample(392, 196)
train_data <- Auto[train,]
test_data <- Auto[-train,]
(h) Perform KNN on the training data, with several values of K, in order to predict
mpg01. Use only the variables that seemed most associated with mpg01 in (b).
What test errors do you obtain? Which value of K seems to perform the best on this
data set?
library(class)
train_data_num <- train_data[,c("cylinders", "displacement", "horsepower",
"weight", "year")]
test_data_num <- test_data[,c("cylinders", "displacement", "horsepower",
"weight", "year")]
k_values <- 1:20
error_rate <- rep(0,20)
for (i in 1:20){
knn.pred <- knn(train_data_num, test_data_num, train_data$mpg01, k = i)
error_rate[i] <- mean(knn.pred != test_data$mpg01)
}
plot(k_values, error_rate, type = "l", xlab = "K", ylab = "Error Rate")
16. Using the Boston data set, fit a KNN model in order to predict whether a given census
tract has a crime rate above or below the median, using various subsets of the predictors.
Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are
contained in the Boston data set.
#load data
data(Boston)
#find median
median(Boston$crim)
## [1] 0.25651
#add response variable
Boston$crim01 <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
#split data
set.seed(1)
train <- sample(506, 253)
train_data <- Boston[train,]
test_data <- Boston[-train,]
train.crim01 <- train_data$crim01  # class labels for the training rows
# keep the numeric columns
Boston_num <- Boston[,sapply(Boston, is.numeric)]
train_data_num <- train_data[,sapply(train_data, is.numeric)]
test_data_num <- test_data[,sapply(test_data, is.numeric)]
train_data_num[,-14]
## (lengthy printout of the 253 training rows omitted: the predictor columns of
## the Boston training data)
32.3
3.9454
4 277
18.6
0 0.5840 6.162
97.4
2.2060
24 666
20.2
0 0.5040 8.266
78.3
2.8944
8 307
17.4
0 0.5850 5.794
70.6
2.8927
6 391
19.2
0 0.7400 6.406
97.2
2.0651
24 666
20.2
1 0.6050 7.802
98.2
2.0407
5 403
14.7
0 0.4480 6.169
6.6
5.7209
3 233
17.9
0 0.5380 6.575
65.2
4.0900
1 296
15.3
0 0.7180 6.824
76.5
1.7940
24 666
20.2
0 0.4370 6.140
45.8
4.0905
5 398
18.7
0 0.5840 6.425
74.8
2.2004
24 666
20.2
1 0.4010 7.923
24.8
5.8850
1 198
13.6
0 0.5470 5.928
88.2
2.4631
6 432
17.8
0 0.5070 8.337
73.3
3.8384
8 307
17.4
0 0.4110 6.630
23.4
5.1167
4 245
19.2
0 0.4530 5.741
66.2
7.2254
8 284
19.7
0 0.4490 6.630
56.1
4.4377
3 247
18.5
0 0.4930 6.312
28.9
5.4159
5 287
19.6
0 0.6140 5.648
87.6
1.9512
24 666
20.2
## 355 0.04301 80.0 1.91
382.80
## 496 0.17899 0.0 9.69
393.29
## 300 0.05561 70.0 2.24
371.58
## 49
0.25387 0.0 6.91
396.90
## 396 8.71675 0.0 18.10
391.98
## 242 0.10612 30.0 4.93
394.62
## 246 0.19133 22.0 5.86
389.13
## 305 0.05515 33.0 2.18
393.68
## 306 0.05479 33.0 2.18
393.36
## 247 0.33983 22.0 5.86
390.18
## 239 0.08244 30.0 4.93
379.41
## 219 0.11069 0.0 13.89
396.90
## 135 0.97617 0.0 21.89
262.76
## 467 3.77498 0.0 18.10
22.01
## 464 5.82115 0.0 18.10
393.82
## 395 13.35980 0.0 18.10
396.90
## 53
0.05360 21.0 5.64
396.90
## 444 9.96654 0.0 18.10
386.73
## 401 25.04610 0.0 18.10
396.90
## 65
0.01951 17.5 1.38
393.24
## 421 11.08740 0.0 18.10
318.75
## 484 2.81838 0.0 18.10
392.92
## 124 0.15038 0.0 25.65
370.31
## 77
0.10153 0.0 12.83
373.66
## 218 0.07013 0.0 13.89
392.78
0 0.4130 5.663
21.9 10.5857
4 334
22.0
0 0.5850 5.670
28.8
2.7986
6 391
19.2
0 0.4000 7.041
10.0
7.8278
5 358
14.8
0 0.4480 5.399
95.3
5.8700
3 233
17.9
0 0.6930 6.471
98.8
1.7257
24 666
20.2
0 0.4280 6.095
65.1
6.3361
6 300
16.6
0 0.4310 5.605
70.2
7.9549
7 330
19.1
0 0.4720 7.236
41.1
4.0220
7 222
18.4
0 0.4720 6.616
58.1
3.3700
7 222
18.4
0 0.4310 6.108
34.9
8.0555
7 330
19.1
0 0.4280 6.481
18.5
6.1899
6 300
16.6
1 0.5500 5.951
93.8
2.8893
5 276
16.4
0 0.6240 5.757
98.4
2.3460
4 437
21.2
0 0.6550 5.952
84.7
2.8715
24 666
20.2
0 0.7130 6.513
89.9
2.8016
24 666
20.2
0 0.6930 5.887
94.7
1.7821
24 666
20.2
0 0.4390 6.511
21.1
6.8147
4 243
16.8
0 0.7400 6.485 100.0
1.9784
24 666
20.2
0 0.6930 5.987 100.0
1.5888
24 666
20.2
0 0.4161 7.104
59.5
9.2229
3 216
18.6
0 0.7180 6.411 100.0
1.8589
24 666
20.2
0 0.5320 5.762
40.3
4.0983
24 666
20.2
0 0.5810 5.856
97.0
1.9444
2 188
19.1
0 0.4370 6.279
74.5
4.0522
5 398
18.7
0 0.5500 6.642
85.1
3.4211
5 276
16.4
## 98
0.12083 0.0 2.89
396.90
## 194 0.02187 60.0 2.93
393.37
## 19
0.80271 0.0 8.14
288.99
## 273 0.11460 20.0 6.96
394.96
## 31
1.13081 0.0 8.14
360.17
## 174 0.09178 0.0 4.05
395.50
## 237 0.52058 0.0 6.20
388.45
## 75
0.07896 0.0 12.83
394.92
## 16
0.62739 0.0 8.14
395.62
## 458 8.20058 0.0 18.10
3.50
## 265 0.55007 20.0 3.97
387.89
## 353 0.07244 60.0 1.69
392.33
## 92
0.03932 0.0 3.41
393.55
## 122 0.07165 0.0 25.65
377.67
## 152 1.49632 0.0 19.58
341.60
## 392 5.29305 0.0 18.10
378.38
## 207 0.22969 0.0 10.59
394.87
## 249 0.16439 22.0 5.86
374.71
## 446 10.67180 0.0 18.10
43.06
## 229 0.29819 0.0 6.20
377.51
## 140 0.54452 0.0 21.89
396.90
## 126 0.16902 0.0 25.65
385.02
## 445 12.80230 0.0 18.10
240.52
## 368 13.52220 0.0 18.10
131.42
## 328 0.24103 0.0 7.38
396.90
0 0.4450 8.069
76.0
3.4952
2 276
18.0
0 0.4010 6.800
9.9
6.2196
1 265
15.6
0 0.5380 5.456
36.6
3.7965
4 307
21.0
0 0.4640 6.538
58.7
3.9175
3 223
18.6
0 0.5380 5.713
94.1
4.2330
4 307
21.0
0 0.5100 6.416
84.1
2.6463
5 296
16.6
1 0.5070 6.631
76.5
4.1480
8 307
17.4
0 0.4370 6.273
6.0
4.2515
5 398
18.7
0 0.5380 5.834
56.5
4.4986
4 307
21.0
0 0.7130 5.936
80.3
2.7792
24 666
20.2
0 0.6470 7.206
91.6
1.9301
5 264
13.0
0 0.4110 5.884
18.5 10.7103
4 411
18.3
0 0.4890 6.405
73.9
3.0921
2 270
17.8
0 0.5810 6.004
84.1
2.1974
2 188
19.1
0 0.8710 5.404 100.0
1.5916
5 403
14.7
0 0.7000 6.051
82.5
2.1678
24 666
20.2
0 0.4890 6.326
52.5
4.3549
4 277
18.6
0 0.4310 6.433
49.1
7.8265
7 330
19.1
0 0.7400 6.459
94.8
1.9879
24 666
20.2
0 0.5040 7.686
17.0
3.3751
8 307
17.4
0 0.6240 6.151
97.9
1.6687
4 437
21.2
0 0.5810 5.986
88.4
1.9929
2 188
19.1
0 0.7400 5.854
96.6
1.8956
24 666
20.2
0 0.6310 3.863 100.0
1.5106
24 666
20.2
0 0.4930 6.083
5.4159
5 287
19.6
43.7
## 271 0.29916 20.0 6.96
388.65
## 344 0.02543 55.0 3.78
396.90
## 342 0.01301 35.0 1.52
394.74
## 333 0.03466 35.0 6.06
362.25
## 212 0.37578 0.0 10.59
395.24
## 127 0.38735 0.0 25.65
359.29
## 133 0.59005 0.0 21.89
385.76
## 41
0.03359 75.0 2.95
395.62
## 36
0.06417 0.0 5.96
396.90
## 297 0.05372 0.0 13.92
392.85
## 399 38.35180 0.0 18.10
396.90
## 391 6.96215 0.0 18.10
394.43
## 360 4.26131 0.0 18.10
390.74
## 117 0.13158 0.0 10.01
393.30
## 408 11.95110 0.0 18.10
332.09
## 50
0.21977 0.0 6.91
396.90
## 286 0.01096 55.0 2.25
394.72
## 254 0.36894 22.0 5.86
396.90
## 72
0.15876 0.0 10.81
376.94
## 437 14.42080 0.0 18.10
27.49
## 168 1.80028 0.0 19.58
227.61
## 313 0.26169 0.0 9.90
396.30
## 113 0.12329 0.0 10.01
394.95
## 234 0.33147 0.0 6.20
378.95
## 459 7.75223 0.0 18.10
272.21
0 0.4640 5.856
42.1
4.4290
3 223
18.6
0 0.4840 6.696
56.4
5.7321
5 370
17.6
0 0.4420 7.241
49.3
7.0379
1 284
15.5
0 0.4379 6.031
23.3
6.6407
1 304
16.9
1 0.4890 5.404
88.6
3.6650
4 277
18.6
0 0.5810 5.613
95.6
1.7572
2 188
19.1
0 0.6240 6.372
97.9
2.3274
4 437
21.2
0 0.4280 7.024
15.8
5.4011
3 252
18.3
0 0.4990 5.933
68.2
3.3603
5 279
19.2
0 0.4370 6.549
51.0
5.9604
4 289
16.0
0 0.6930 5.453 100.0
1.4896
24 666
20.2
0 0.7000 5.713
97.0
1.9265
24 666
20.2
0 0.7700 6.112
81.3
2.5091
24 666
20.2
0 0.5470 6.176
72.5
2.7301
6 432
17.8
0 0.6590 5.608 100.0
1.2852
24 666
20.2
0 0.4480 5.602
62.0
6.0877
3 233
17.9
0 0.3890 6.453
31.9
7.3073
1 300
15.3
0 0.4310 8.259
8.4
8.9067
7 330
19.1
0 0.4130 5.961
17.5
5.2873
4 305
19.2
0 0.7400 6.461
93.3
2.0026
24 666
20.2
0 0.6050 5.877
79.2
2.4259
5 403
14.7
0 0.5440 6.023
90.4
2.8340
4 304
18.4
0 0.5470 5.913
92.9
2.3534
6 432
17.8
0 0.5070 8.247
70.4
3.6519
8 307
17.4
0 0.7130 6.301
83.7
2.7831
24 666
20.2
## 73
0.09164 0.0 10.81
390.91
## 27
0.67191 0.0 8.14
376.88
## 405 41.52920 0.0 18.10
329.46
## 15
0.63796 0.0 8.14
380.02
## 62
0.17171 25.0 5.13
378.08
## 132 1.19294 0.0 21.89
396.90
## 35
1.61282 0.0 8.14
248.31
## 338 0.03041 0.0 5.19
394.81
## 473 3.56868 0.0 18.10
393.37
## 185 0.08308 0.0 2.46
391.00
## 153 1.12658 0.0 19.58
343.28
## 332 0.05023 35.0 6.06
394.02
## 485 2.37857 0.0 18.10
370.73
## 434 5.58107 0.0 18.10
100.19
## 231 0.53700 0.0 6.20
378.35
## 28
0.95577 0.0 8.14
306.38
## 389 14.33370 0.0 18.10
372.92
## 148 2.36862 0.0 19.58
391.71
## 325 0.34109 0.0 7.38
396.90
## 60
0.10328 25.0 5.13
396.90
## 275 0.05644 40.0 6.41
396.90
## 93
0.04203 28.0 15.04
395.01
## 202 0.03445 82.5 2.03
393.77
## 336 0.03961 0.0 5.19
396.90
## 241 0.11329 30.0 4.93
391.25
0 0.4130 6.065
7.8
5.2873
4 305
19.2
0 0.5380 5.813
90.3
4.6820
4 307
21.0
0 0.6930 5.531
85.4
1.6074
24 666
20.2
0 0.5380 6.096
84.5
4.4619
4 307
21.0
0 0.4530 5.966
93.4
6.8185
8 284
19.7
0 0.6240 6.326
97.7
2.2710
4 437
21.2
0 0.5380 6.096
96.9
3.7598
4 307
21.0
0 0.5150 5.895
59.6
5.6150
5 224
20.2
0 0.5800 6.437
75.0
2.8965
24 666
20.2
0 0.4880 5.604
89.8
2.9879
3 193
17.8
1 0.8710 5.012
88.0
1.6102
5 403
14.7
0 0.4379 5.706
28.4
6.6407
1 304
16.9
0 0.5830 5.871
41.9
3.7240
24 666
20.2
0 0.7130 6.436
87.9
2.3158
24 666
20.2
0 0.5040 5.981
68.1
3.6715
8 307
17.4
0 0.5380 6.047
88.8
4.4534
4 307
21.0
0 0.7000 4.880 100.0
1.5895
24 666
20.2
0 0.8710 4.926
95.7
1.4608
5 403
14.7
0 0.4930 6.415
40.1
4.7211
5 287
19.6
0 0.4530 5.927
47.2
6.9320
8 284
19.7
1 0.4470 6.758
32.9
4.0776
4 254
17.6
0 0.4640 6.442
53.6
3.6659
4 270
18.2
0 0.4150 6.162
38.4
6.2700
2 348
14.7
0 0.5150 6.037
34.5
5.9853
5 224
20.2
0 0.4280 6.897
54.3
6.3361
6 300
16.6
## 415 45.74610 0.0 18.10
88.27
## 281 0.03578 20.0 3.33
387.31
## 449 9.32909 0.0 18.10
396.90
## 364 4.22239 0.0 18.10
353.04
## 427 12.24720 0.0 18.10
24.65
## 478 15.02340 0.0 18.10
349.48
## 274 0.22188 20.0 6.96
390.77
## 414 28.65580 0.0 18.10
210.97
##
lstat crim01
## 505 6.48
0
## 324 11.74
1
## 167 3.70
1
## 129 15.39
1
## 418 26.64
1
## 471 16.29
1
## 299 4.97
0
## 270 13.65
0
## 466 14.13
1
## 187 4.45
0
## 307 6.47
0
## 481 10.74
1
## 85
9.62
0
## 277 6.05
0
## 362 14.19
1
## 438 26.45
1
## 330 7.34
0
## 263 5.91
1
## 329 9.97
0
## 79 12.34
0
## 213 16.03
0
## 37 11.41
0
## 105 12.33
0
## 217 13.51
0
## 366 7.12
1
## 165 11.64
1
## 290 9.51
0
## 492 18.07
0
## 382 21.08
1
## 89
5.50
0
## 428 14.52
1
## 463 13.99
1
## 289 7.60
0
0 0.6930 4.519 100.0
1.6582
24 666
20.2
0 0.4429 7.820
64.5
4.6947
5 216
14.9
0 0.7130 6.185
98.7
2.2616
24 666
20.2
1 0.7700 5.803
89.0
1.9047
24 666
20.2
0 0.5840 5.837
59.7
1.9976
24 666
20.2
0 0.6140 5.304
97.3
2.1007
24 666
20.2
1 0.4640 7.691
51.8
4.3665
3 223
18.6
0 0.5970 5.155 100.0
1.5894
24 666
20.2
## 340 9.74
## 419 20.62
## 326 5.08
## 490 23.97
## 42
4.84
## 422 15.70
## 111 13.00
## 404 19.77
## 412 21.22
## 20 11.28
## 44
7.44
## 377 23.24
## 343 8.65
## 70
8.79
## 121 14.37
## 40
4.32
## 172 12.03
## 25 16.30
## 375 37.97
## 248 10.15
## 198 8.61
## 378 21.24
## 39 10.13
## 435 15.17
## 298 15.84
## 390 20.85
## 280 4.85
## 160 7.39
## 14
8.26
## 130 18.34
## 45
9.55
## 402 20.32
## 22 13.83
## 206 10.87
## 230 3.76
## 193 2.87
## 371 2.96
## 104 13.44
## 501 14.33
## 255 6.57
## 450 19.31
## 436 23.27
## 103 10.63
## 331 9.09
## 13 15.71
## 296 6.27
## 483 7.01
## 176 5.33
## 345 4.61
## 279 7.19
0
1
0
0
0
1
0
1
1
1
0
1
0
0
0
0
1
1
1
0
0
1
0
1
0
1
0
1
1
1
0
1
1
0
1
0
1
0
0
0
1
1
0
0
0
0
1
0
0
0
## 110 15.55
## 84
7.51
## 359 11.48
## 29 12.80
## 141 24.16
## 252 3.59
## 406 22.98
## 221 9.71
## 465 13.22
## 108 14.09
## 304 4.86
## 33 27.71
## 443 16.59
## 149 28.32
## 287 12.93
## 102 7.67
## 145 29.29
## 488 11.45
## 461 16.42
## 339 8.51
## 118 10.30
## 346 10.53
## 413 34.37
## 107 18.66
## 64
9.50
## 224 7.60
## 431 17.64
## 316 11.50
## 51 13.45
## 416 29.05
## 480 13.11
## 138 14.59
## 503 9.08
## 500 15.10
## 282 4.59
## 143 26.82
## 285 7.85
## 170 11.32
## 48 18.80
## 204 3.81
## 295 10.40
## 24 19.88
## 181 7.56
## 214 9.38
## 476 24.10
## 225 4.14
## 498 14.10
## 442 19.52
## 163 1.92
## 43
5.81
1
0
1
1
1
0
1
1
1
0
0
1
1
1
0
0
1
1
1
0
0
0
1
0
0
1
1
0
0
1
1
1
0
0
0
1
0
1
0
0
0
1
0
0
1
1
1
1
1
0
## 1
4.98
## 420 22.74
## 78 10.27
## 433 12.03
## 284 3.16
## 116 15.76
## 233 2.47
## 293 4.70
## 61 13.15
## 86
6.53
## 327 6.15
## 423 14.10
## 355 8.05
## 496 17.60
## 300 4.74
## 49 30.81
## 396 17.12
## 242 12.40
## 246 18.46
## 305 6.93
## 306 8.93
## 247 9.16
## 239 6.36
## 219 17.92
## 135 17.31
## 467 17.15
## 464 10.29
## 395 16.35
## 53
5.28
## 444 18.85
## 401 26.77
## 65
8.05
## 421 15.02
## 484 10.42
## 124 25.41
## 77 11.97
## 218 9.69
## 98
4.21
## 194 5.03
## 19 11.69
## 273 7.73
## 31 22.60
## 174 9.04
## 237 9.54
## 75
6.78
## 16
8.47
## 458 16.94
## 265 8.10
## 353 7.79
## 92
8.20
0
1
0
1
0
0
1
0
0
0
1
1
0
0
0
0
1
0
0
0
0
1
0
0
1
1
1
1
0
1
1
0
1
1
0
0
0
0
0
1
0
1
0
1
0
1
1
1
0
0
## 122 14.27
## 152 13.28
## 392 18.76
## 207 10.97
## 249 9.52
## 446 23.98
## 229 3.92
## 140 18.46
## 126 14.81
## 445 23.79
## 368 13.33
## 328 12.79
## 271 13.00
## 344 7.18
## 342 5.49
## 333 7.83
## 212 23.98
## 127 27.26
## 133 11.12
## 41
1.98
## 36
9.68
## 297 7.39
## 399 30.59
## 391 17.11
## 360 12.67
## 117 12.04
## 408 12.13
## 50 16.20
## 286 8.23
## 254 3.54
## 72
9.88
## 437 18.05
## 168 12.14
## 313 11.72
## 113 16.21
## 234 3.95
## 459 16.23
## 73
5.52
## 27 14.81
## 405 27.38
## 15 10.26
## 62 14.44
## 132 12.26
## 35 20.34
## 338 10.56
## 473 14.36
## 185 13.98
## 153 12.12
## 332 12.43
## 485 13.34
0
1
1
0
0
1
1
1
0
1
1
0
1
0
0
0
1
1
1
0
0
0
1
1
1
0
1
0
0
1
0
1
1
1
0
1
1
0
1
1
1
0
1
1
0
1
0
1
0
1
## 434 16.22
## 231 11.65
## 28 17.28
## 389 30.62
## 148 29.53
## 325 6.12
## 60
9.22
## 275 3.53
## 93
8.16
## 202 7.43
## 336 8.01
## 241 11.38
## 415 36.98
## 281 3.76
## 449 18.13
## 364 14.64
## 427 15.69
## 478 24.91
## 274 6.58
## 414 20.08
1
1
1
1
1
1
0
0
0
0
0
0
1
0
1
1
1
1
0
1
# fit a KNN model for each K from 1 to 20 and record its test error rate
library(class) # provides knn()
set.seed(1)    # knn() breaks distance ties at random
k_values <- 1:20
error_rate <- rep(0, 20)
for (i in k_values) {
  knn.pred <- knn(train_data_num, test_data_num, train_data$crim01, k = i)
  error_rate[i] <- mean(knn.pred != test_data$crim01)
}
plot(k_values, error_rate, type = "l", xlab = "K", ylab = "Error Rate")
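Once the loop has run, the value of K with the lowest estimated test error can be read off the error_rate vector directly; a short follow-up, assuming the k_values and error_rate objects from the loop above:
# K with the lowest test error rate, and that error rate
best_k <- k_values[which.min(error_rate)]
best_k
min(error_rate)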