
6-knn


Nearest Neighbors

Adapted from slides by Roger Grosse and Alex Ihler

Nearest neighbor

• Keep the whole training dataset: {(x, y)}

• A query example (vector) q arrives

• Find the closest training example(s) x*

• Predict the corresponding y*

• Works both for regression and classification
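For concreteness, here is a minimal sketch of 1-nearest-neighbor prediction, assuming NumPy; the names X_train, y_train, and predict_1nn are illustrative and not from the slides. The same lookup serves for regression (y is a number) and classification (y is a label).

import numpy as np

def predict_1nn(X_train, y_train, x_query):
    # Compute the Euclidean distance from the query to every stored training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Copy the target (label or value) of the single closest example
    return y_train[np.argmin(dists)]

# Toy usage: three stored examples with two features each
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 0, 1])
print(predict_1nn(X_train, y_train, np.array([1.8, 1.9])))  # -> 1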

Collaborative filtering is an example of a k-NN classifier

• Find k most similar people to user x that have rated movie y

• Predict x's rating of y as the average of those k users' ratings
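The collaborative-filtering idea above can be sketched the same way; the following is only illustrative (the matrix R, the similarity choice, and predict_rating are assumptions, with unrated entries stored as NaN):

import numpy as np

def predict_rating(R, user, movie, k=2):
    # R: user-by-movie rating matrix with np.nan for unrated entries
    rated = ~np.isnan(R[:, movie])      # users who have rated this movie
    rated[user] = False                 # exclude the user we are predicting for
    candidates = np.where(rated)[0]

    def dist(u):
        # Compare only on movies both users have rated (a very simplified similarity)
        both = ~np.isnan(R[user]) & ~np.isnan(R[u])
        return np.linalg.norm(R[user, both] - R[u, both]) if both.any() else np.inf

    nearest = sorted(candidates, key=dist)[:k]
    # Predict x's rating as the average rating of the k most similar users
    return R[nearest, movie].mean()

R = np.array([[5, 4, np.nan],
              [4, 5, 2],
              [1, 2, 5],
              [5, np.nan, 1]], dtype=float)
print(predict_rating(R, user=0, movie=2, k=2))  # -> 1.5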

Nearest neighbor regression

• Find the training example $x^{(i)}$ closest to $x^{(new)}$; predict $y^{(i)}$

• Defines an (implicit) function 𝑓(𝑥)

• The resulting function (“form”) is piecewise constant

Nearest neighbor classification

• Find the nearest input vector to x in the training set and copy its label

• Can formalize “nearest” in terms of Euclidean distance: $d(x, x') = \sqrt{\sum_i (x_i - x_i')^2}$
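As a quick illustrative check (assuming NumPy), the formula matches the built-in norm:

import numpy as np

x  = np.array([1.0, 2.0, 3.0])
xp = np.array([2.0, 0.0, 3.0])

d_manual = np.sqrt(np.sum((x - xp) ** 2))  # sqrt of the summed squared differences
d_numpy  = np.linalg.norm(x - xp)          # same quantity via the norm helper
print(d_manual, d_numpy)                   # both ~2.236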

Nearest neighbor classification

• We can visualize the behavior in the classification setting using a Voronoi diagram.

Nearest neighbor classification

[Figure: 1-NN on a two-class dataset (Class 0 vs. Class 1); the nearest-neighbor decision boundary is piecewise linear]

Nearest neighbor classification

• Decision boundary: the boundary between regions of input space assigned to different categories.

Nearest neighbor classification

• 3D decision boundary

• In general: Nearest-neighbor classifier produces piecewise linear decision boundaries

k-nearest neighbors (k-NN)

• 1-nearest neighbor is sensitive to noise or mislabeled data

• Solution: smooth by having 𝑘 nearest neighbors

k-nearest neighbors

• Find the $k$ nearest neighbors to $x^{(new)}$ in the data

• i.e., rank the feature vectors according to Euclidean distance

• select the $k$ vectors that have the smallest distance to $x^{(new)}$

• Regression

• Usually just average the 𝑦 -values of the 𝑘 closest training examples

• Classification

• ranking yields 𝑘 feature vectors and a set of 𝑘 class labels

• pick the class label which is most common in this set (“vote”)

• Note: for two-class problems, if $k$ is odd ($k = 1, 3, 5, \dots$) there will never be any “ties”
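A minimal sketch of the procedure above for both tasks, assuming NumPy (the function name knn_predict and the toy data are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    # Rank stored examples by Euclidean distance to the query
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest examples
    if task == "regression":
        return y_train[nearest].mean()     # average the k target values
    # Classification: majority vote among the k labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> 1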

Distance-weighted k-NN

• Might want to weight nearer neighbors more heavily:

$f(x^{(new)}) = \frac{\sum_{i=1}^{n} \omega^{(i)} y^{(i)}}{\sum_{i=1}^{n} \omega^{(i)}}$

where $\omega^{(i)} = \frac{1}{d(x^{(new)}, x^{(i)})^2}$ or $\omega^{(i)} = \exp\!\left(-\frac{d(x^{(i)}, x^{(new)})^2}{K_w}\right)$, and $d(x^{(new)}, x^{(i)})$ is the distance between $x^{(new)}$ and $x^{(i)}$

• Note: now it makes sense to use all training examples instead of just $k$ (see the sketch below)
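A minimal sketch of the exponential weighting above in the regression setting, assuming NumPy; K_w plays the role of the bandwidth in the formula, and, as noted, all training points are used rather than only k:

import numpy as np

def weighted_knn_regress(X_train, y_train, x_query, K_w=1.0):
    # Squared distances d(x_new, x_i)^2 to every training point
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    # Closer points get exponentially larger weights: w_i = exp(-d^2 / K_w)
    w = np.exp(-d2 / K_w)
    # Weighted average of the targets
    return np.sum(w * y_train) / np.sum(w)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(weighted_knn_regress(X_train, y_train, np.array([1.5]), K_w=0.5))  # ~2.54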

k-nearest neighbors

[Figure: k-NN decision boundaries for k = 1 vs. k = 15]

k-nearest neighbors

• Tradeoffs in choosing 𝑘

• Small 𝑘

• Good at capturing fine-grained patterns

• May overfit, i.e., be sensitive to random idiosyncrasies in the training data

• Large 𝑘

• Makes stable predictions by averaging over lots of examples

• May underfit, i.e., fail to capture important regularities

• Rule of thumb: $k < \sqrt{n}$, where $n$ is the number of training examples

k-nearest neighbors

• We can measure the generalization error using a test set.

k-nearest neighbors

• 𝑘 is an example of a hyperparameter, something we can’t fit as part of the learning algorithm itself

• We can tune hyperparameters using a validation set:

• The test set is used only at the very end, to measure the generalization performance of the final configuration.
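A minimal sketch of tuning k on a validation set (the names choose_k, knn_classify, and the synthetic data are illustrative, not from the slides); the test set is never consulted while choosing k:

import numpy as np

def knn_classify(X_train, y_train, x, k):
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def choose_k(X_train, y_train, X_val, y_val, candidate_ks=(1, 3, 5, 7, 15)):
    # Pick the k with the highest accuracy on the held-out validation set
    def val_acc(k):
        preds = [knn_classify(X_train, y_train, x, k) for x in X_val]
        return np.mean(np.array(preds) == y_val)
    return max(candidate_ks, key=val_acc)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2)); y_train = (X_train[:, 0] > 0).astype(int)
X_val   = rng.normal(size=(30, 2));  y_val   = (X_val[:, 0] > 0).astype(int)
print(choose_k(X_train, y_train, X_val, y_val))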

Pitfalls: the curse of dimensionality

• Low-dimensional visualizations are misleading! In high dimensions, “most” points are far apart.

• In high dimensions, “most” points are approximately the same distance from each other.

• As the number of dimensions grows, the amount of data we need to generalize accurately grows exponentially.
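A quick simulation of this concentration effect (illustrative only; uniform random points in the unit cube): as the dimension m grows, the nearest and farthest distances from a query become nearly equal.

import numpy as np

rng = np.random.default_rng(0)
for m in (2, 10, 100, 1000):
    X = rng.random((1000, m))          # 1000 random points in the m-dimensional unit cube
    q = rng.random(m)                  # a random query point
    d = np.linalg.norm(X - q, axis=1)
    # Ratio of nearest to farthest distance approaches 1 as m grows
    print(m, round(d.min() / d.max(), 3))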

Pitfalls: normalization

• Nearest neighbors can be sensitive to the ranges of different features.

• Often, the units are arbitrary (e.g., the same length feature could be recorded in centimeters or in meters, which rescales all the distances)

• Simple fix: normalize each dimension to be zero mean and unit variance, i.e., compute the mean $\mu_j$ and standard deviation $\sigma_j$ of each feature $j$, and take $\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}$ (see the sketch below)
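A minimal sketch of this standardization, assuming NumPy; computing the statistics on the training split only (and reusing them for queries) is standard practice rather than something stated on the slide:

import numpy as np

def standardize(X_train, X_test):
    # Per-feature mean and standard deviation, estimated from the training data
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    # Apply the same shift and scale to both splits: x_j <- (x_j - mu_j) / sigma_j
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Toy usage with features on very different scales (e.g. height in cm, income in dollars)
X_train = np.array([[180.0, 70000.0], [160.0, 30000.0], [170.0, 50000.0]])
X_test  = np.array([[175.0, 40000.0]])
X_train_n, X_test_n = standardize(X_train, X_test)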

Pitfalls: computational cost

• Number of computations at training time: 0

• Number of computations at test time, per query (naïve algorithm)

• Calculate $m$-dimensional Euclidean distances to the $n$ data points: $O(mn)$

• Sort the distances: 𝑂(𝑛 log 𝑛)

• This must be done for each query, which is very expensive by the standards of a learning algorithm!

• Need to store the entire dataset in memory!

• Tons of work has gone into algorithms and data structures for efficient nearest neighbors with high dimensions and/or large datasets.
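One small, common optimization (not from the slides): the full O(n log n) sort is unnecessary if only the k nearest are needed; np.argpartition selects them in O(n), and tree structures such as k-d trees are among the data structures alluded to above.

import numpy as np

def knn_indices(X_train, x_query, k):
    # O(mn) to compute all distances, then O(n) selection of the k smallest
    d = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argpartition(d, k)[:k]     # requires k < n; avoids a full sort
    return idx[np.argsort(d[idx])]      # order only the k selected, O(k log k)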

Example: digit classification

• Decent performance when there is lots of data

Conclusions

• Simple algorithm that does all its work at test time — in a sense, no learning!

• Can control the complexity by varying 𝑘

• Suffers from the Curse of Dimensionality
