Adapted from slides by Roger Grosse, Alex Ihler
Nearest neighbor
• Keep the whole training dataset: {(x, y)}
• A query example (vector) q arrives
• Find the closest training example(s) x*
• Predict the corresponding y*
• Works both for regression and classification
• Collaborative filtering is an example of a k-NN method
• Find the k people most similar to user x who have rated movie y
• Predict x's rating of movie y as the average of those k neighbors' ratings
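As a rough illustration (not from the original slides; assumes NumPy and a stored training set X_train, y_train), a 1-nearest-neighbor prediction might look like:

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query):
    """Copy the label/value of the training example closest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to each stored example
    return y_train[np.argmin(dists)]                   # y* of the nearest x*

# Toy usage
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([0, 1])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 0.8])))  # predicts 1
```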
Nearest neighbor regression
• Find the training example $x^{(i)}$ closest to $x^{(new)}$; predict $y^{(i)}$
• Defines an (implicit) function 𝑓(𝑥)
• “Form” is piecewise constant
Nearest neighbor classification
• Find the nearest input vector to x in the training set and copy its label
• Can formalize “nearest” in terms of Euclidean distance: $d(x, x') = \sqrt{\sum_i (x_i - x'_i)^2}$
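• A quick worked example (for illustration): for $x = (1, 2)$ and $x' = (4, 6)$, $d(x, x') = \sqrt{(1-4)^2 + (2-6)^2} = \sqrt{9 + 16} = 5$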
Nearest neighbor classification
• We can visualize the behavior in the classification setting using a
Voronoi diagram.
Nearest neighbor classification
[Figure: two-class training data (Class 0 and Class 1) in 2D; the nearest-neighbor decision boundary is piecewise linear.]
Nearest neighbor classification
• Decision boundary: the boundary between regions of input space assigned to different categories.
Nearest neighbor classification
• 3D decision boundary
• In general: Nearest-neighbor classifier produces piecewise linear decision boundaries
𝑘-nearest neighbors (𝑘-NN)
• 1-nearest neighbor is sensitive to noise or mislabeled data
• Solution: smooth the prediction by using the 𝑘 nearest neighbors
𝑘-nearest neighbors
• Find the 𝑘 nearest neighbors to $x^{(new)}$ in the data
• i.e., rank the feature vectors according to Euclidean distance
• select the 𝑘 vectors which have the smallest distance to $x^{(new)}$
• Regression
• Usually just average the 𝑦-values of the 𝑘 closest training examples
• Classification
• ranking yields 𝑘 feature vectors and a set of 𝑘 class labels
• pick the class label which is most common in this set (“vote”)
• Note: for two-class problems, if 𝑘 is odd (𝑘 = 1, 3, 5, …) there will never be any “ties”
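A minimal sketch of both variants (illustrative code, not from the slides; assumes NumPy):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, classify=True):
    """k-NN prediction: majority vote over the k nearest labels (classification)
    or the average of their y-values (regression)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances to all n examples
    nearest = np.argsort(dists)[:k]                    # indices of the k closest examples
    if classify:
        return Counter(y_train[nearest]).most_common(1)[0][0]  # "vote"
    return float(np.mean(y_train[nearest]))                    # average of the y-values
```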
Distance-weighted 𝑘-NN
• Might want to weight nearer neighbors more heavily:
$$f(x^{(new)}) = \frac{\sum_{i=1}^{n} \omega^{(i)} y^{(i)}}{\sum_{i=1}^{n} \omega^{(i)}}, \qquad \omega^{(i)} = \frac{1}{d(x^{(new)}, x^{(i)})^{2}}$$
where $d(x^{(new)}, x^{(i)})$ is the distance between $x^{(new)}$ and $x^{(i)}$
• An alternative weighting uses a Gaussian kernel: $\omega^{(i)} = \exp\left(-\,d(x^{(i)}, x^{(new)})^{2} / K_w\right)$
• Note: with distance weighting it can make sense to use all 𝑛 training examples instead of just 𝑘
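A sketch of the inverse-squared-distance version (illustrative; the small eps term is my addition to avoid dividing by zero when the query coincides with a training point):

```python
import numpy as np

def weighted_nn_predict(X_train, y_train, x_query, eps=1e-12):
    """Distance-weighted prediction over all n training examples,
    with weights w_i = 1 / d(x_query, x_i)^2."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    w = 1.0 / (d**2 + eps)                  # eps is a guard, not part of the slide formula
    # Gaussian-kernel alternative: w = np.exp(-d**2 / K_w) for some bandwidth K_w
    return float(np.sum(w * y_train) / np.sum(w))
```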
𝑘-nearest neighbors
[Figure: 𝑘-NN decision boundaries for 𝑘 = 1 and 𝑘 = 15.]
𝑘-nearest neighbors
• Tradeoffs in choosing 𝑘
• Small 𝑘
• Good at capturing fine-grained patterns
• May overfit, i.e., be sensitive to random idiosyncrasies in the training data
• Large 𝑘
• Makes stable predictions by averaging over lots of examples
• May underfit, i.e., fail to capture important regularities
• Rule of thumb: 𝑘 < √𝑛, where 𝑛 is the number of training examples
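• e.g., with 𝑛 = 10,000 training examples, this rule suggests keeping 𝑘 below about $\sqrt{10000} = 100$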
𝑘-nearest neighbors
• We can measure the generalization error using a test set.
𝑘-nearest neighbors
• 𝑘 is an example of a hyperparameter, something we can’t fit as part of the learning algorithm itself
• We can tune hyperparameters using a validation set: try several values of 𝑘 and keep the one that predicts best on the validation data (as sketched below)
• The test set is used only at the very end, to measure the generalization performance of the final configuration.
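One possible way to pick 𝑘 on a validation set (a sketch assuming scikit-learn is available; the candidate values are arbitrary):

```python
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X_train, y_train, X_val, y_val, candidate_ks=(1, 3, 5, 7, 15)):
    """Return the k with the highest validation accuracy."""
    scores = {}
    for k in candidate_ks:
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores[k] = clf.score(X_val, y_val)  # accuracy on the held-out validation set
    return max(scores, key=scores.get)
```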
Pitfalls: the curse of dimensionality
• Low-dimensional visualizations are misleading! In high dimensions, “most” points are far apart.
• In high dimensions, “most” pairs of points are at approximately the same distance from each other.
• As the dimensions grow, the amount of data we need to generalize accurately grows exponentially.
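A small numerical illustration of the concentration effect (a sketch, assuming NumPy; 500 random points in the unit cube):

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))                  # uniform points in the unit hypercube
    d = np.linalg.norm(X[1:] - X[0], axis=1)    # distances from one point to all the others
    print(dim, round(d.min() / d.max(), 3))     # ratio creeps toward 1 as dim grows
```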
Pitfalls: normalization
• Nearest neighbors can be sensitive to the ranges of different features.
• Often, the units are arbitrary: e.g., measuring a feature in grams instead of kilograms scales that coordinate, and its contribution to the distance, by a factor of 1000
• Simple fix: normalize each dimension to zero mean and unit variance, i.e., compute the mean $\mu_j$ and standard deviation $\sigma_j$ of each feature, and take $\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}$
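A minimal standardization sketch (illustrative; statistics are computed on the training set and reused for queries):

```python
import numpy as np

def standardize(X_train, X_query):
    """Rescale every feature to zero mean and unit variance using training-set statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                 # guard for constant features (my addition)
    return (X_train - mu) / sigma, (X_query - mu) / sigma
```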
Pitfalls: computational cost
• Number of computations at training time: 0
• Number of computations at test time, per query (naïve algorithm)
• Calculate 𝑚 -dimensional Euclidean distances with 𝑛 data points: 𝑂(𝑚𝑛)
• Sort the distances: 𝑂(𝑛 log 𝑛)
• This must be done for each query, which is very expensive by the standards of a learning algorithm!
• Need to store the entire dataset in memory!
• Tons of work has gone into algorithms and data structures for efficient nearest neighbors with high dimensions and/or large datasets.
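For example, a k-d tree is one such data structure; a sketch assuming scikit-learn:

```python
import numpy as np
from sklearn.neighbors import KDTree

X_train = np.random.rand(10000, 3)                   # n = 10000 points, m = 3 dimensions
tree = KDTree(X_train)                               # built once, up front
dist, idx = tree.query(np.random.rand(5, 3), k=3)    # 3 nearest neighbors for 5 queries
```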
Example: digit classification
• Decent performance when lots of data is available
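As a rough stand-in for the slide's digit example (a sketch assuming scikit-learn's small digits dataset, not necessarily the dataset the slide used):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)                       # 8x8 handwritten digits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))                              # test accuracy
```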
Conclusions
• Simple algorithm that does all its work at test time — in a sense, no learning!
• Can control the complexity by varying 𝑘
• Suffers from the Curse of Dimensionality