Instance-Based Learning
– p.1

Outline
- k-Nearest Neighbor
- Locally weighted regression
- Lazy and eager learning
– p.2

Instance-Based Learning
- Learning methods so far construct a general, explicit hypothesis of the target function over the entire instance space when training examples are provided.
- Key idea: just store all training examples ⟨x_i, f(x_i)⟩.
- When a new query instance is encountered, a set of similar related instances is retrieved and used to classify the new query instance.
– p.3

Nearest Neighbor Learning
- Assume all instances correspond to points in the d-dimensional space ℜ^d.
- The standard Euclidean distance is used.
- For function approximation: given query instance x_q, first locate the nearest training example x_n, then estimate f̂(x_q) ← f(x_n).
- For classification: given query instance x_q, first locate the nearest training example x_n, then assign x_q to the category associated with x_n.
– p.4

Voronoi diagram
[Figure] FIGURE 4.13. In two dimensions, the nearest-neighbor algorithm leads to a partitioning of the input space into Voronoi cells, each labeled by the category of the training point it contains. In three dimensions, the cells are three-dimensional, and the decision boundary resembles the surface of a crystal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
– p.5

k-Nearest Neighbor Learning
- For classification: given x_q, take a vote among its k nearest neighbors, i.e., assign it the category most frequently represented among the k nearest samples.
- For real-valued function approximation: take the mean of the f values of the k nearest neighbors:

  \hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}
– p.6

[Figure] FIGURE 4.15. The k-nearest-neighbor query starts at the test point x and grows a spherical region until it encloses k training samples, and it labels the test point by a majority vote of these samples. In this k = 5 case, the test point x would be labeled the category of the black points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
– p.7

Relation to Bayes Classifier
- Place a cell of volume V around x and capture k samples. Assume k_i samples among the k turn out to be labeled ω_i. Then P(ω_i | x) can be estimated as k_i / k.
- For minimum error rate, we select the category most frequently represented within the cell.
- If there are enough samples (→ ∞) and k is large, the performance will approach the Bayes optimum.
- If the number of samples is large (unlimited), the error rate of the 1-nearest-neighbor classifier is never worse than twice the Bayes rate.
– p.8

Distance-Weighted kNN
- We might want to weight the contribution of each of the k neighbors according to their distance to the query point, weighting nearer neighbors more heavily:

  \hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}, \qquad w_i \equiv \frac{1}{d(x_q, x_i)^2}

  where d(x_q, x_i) is the distance between x_q and x_i.
– p.9

Distance-Weighted kNN
- For classification: assign x_q to the category with the largest total weight.
- Note that it now makes sense to use all training examples instead of just k, but this is slower.
- (A code sketch of plain and distance-weighted kNN follows the Metrics slide below.)
– p.10

Metrics
- The nearest-neighbor classifier relies on a metric or distance function between patterns. So far we have assumed the Euclidean metric.
- The notion of a metric is far more general, e.g., the Manhattan/city block distance:

  d(a, b) = \sum_{i=1}^{d} |a_i - b_i|

- If there is a large disparity in the ranges of the data in each dimension, a common procedure is to rescale the data to equalize those ranges.
– p.11
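The following is a minimal Python sketch (not part of the original slides) of plain and distance-weighted k-NN classification. It implements the majority vote of slide p.6 and the w_i = 1/d(x_q, x_i)^2 weighting of slide p.9; the function name, the NumPy-based implementation, and the exact-match handling are illustrative assumptions, and the features are assumed to be rescaled to comparable ranges as discussed on the Metrics slide.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=5, weighted=False):
    """Classify query point x_q by an (optionally distance-weighted) k-NN vote.

    X_train: (n, d) array of training instances; y_train: length-n array of labels.
    Assumes features are already rescaled to comparable ranges (Metrics slide).
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)

    # Euclidean distance from the query to every training instance
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors

    if not weighted:
        # Plain k-NN: category most frequently represented among the k nearest
        return Counter(y_train[nearest]).most_common(1)[0][0]

    # Distance-weighted k-NN: w_i = 1 / d(x_q, x_i)^2, nearer neighbors count more
    votes = {}
    for i in nearest:
        if dists[i] == 0.0:                    # query coincides with a training point
            return y_train[i]
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / dists[i] ** 2
    return max(votes, key=votes.get)
```

Setting weighted=True with k equal to the number of training examples uses all of them, as noted on slide p.10, at the cost of a slower query.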
[Figure] FIGURE 4.18. Scaling the coordinates of a feature space can change the distance relationships computed by the Euclidean metric. Here we see how such scaling can change the behavior of a nearest-neighbor classifier. Consider the test point x and its nearest neighbor. In the original space (left), the black prototype is closest. In the figure at the right, the x1 axis has been rescaled by a factor 1/3; now the nearest prototype is the red one. If there is a large disparity in the ranges of the full data in each dimension, a common procedure is to rescale all the data to equalize such ranges, and this is equivalent to changing the metric in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
– p.12

Curse of Dimensionality
- Imagine instances described by 20 attributes, but only 2 are relevant to the target function.
- Instances that have identical values for the 2 relevant attributes may be distant from one another in the 20-dimensional instance space.
- Nearest-neighbor approaches are easily misled by many irrelevant attributes when x is high-dimensional.
- Remedy: select relevant features.
– p.13

Remarks
When to consider nearest neighbor:
- Instances map to points in ℜ^d
- Lots of training data
Advantages:
- Training is very fast
- Can learn complex target functions
- Don't lose information
Disadvantages:
- Slow at query time
- Easily fooled by irrelevant attributes
– p.14

Locally Weighted Regression
- Note that kNN forms a local approximation to f for each query point x_q.
- Why not form an explicit approximation f̂(x) for the region surrounding x_q?
- Fit a linear function to the k nearest neighbors, or a quadratic, ...
- Produces a "piecewise approximation" to f.
– p.15

Locally Weighted Regression
Several choices of error to minimize:
- Squared error over the k nearest neighbors:

  E_1(x_q) \equiv \frac{1}{2} \sum_{x \in\, k\ \text{nearest nbrs of}\ x_q} (f(x) - \hat{f}(x))^2

- Distance-weighted squared error over all data:

  E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))

- ...
(A code sketch minimizing E_2 appears at the end of these notes.)
– p.16

Locally Weighted Regression
- Typically we fit simple functions such as a constant, linear, or quadratic model.
- The cost of fitting more complex functions for each query instance is prohibitively high.
- These simple approximations model the target function quite well over a sufficiently small subregion of the instance space.
– p.17

Case-Based Reasoning (CBR)
- Instance-based learning can be applied even when X ≠ ℜ^d → need a different "distance" metric.
- Case-Based Reasoning is instance-based learning applied to instances with symbolic descriptions.
- Requires a similarity metric.
- Retrieval and combination of cases to solve the query may rely on knowledge-based reasoning.
- Applications: help desks, reasoning about legal cases, ...
– p.18

Lazy and Eager Learning
- Lazy: wait for the query before generalizing (k-NEAREST NEIGHBOR, case-based reasoning).
- Eager: generalize before seeing the query (neural networks, decision trees, Naive Bayes, ...).
- Obvious differences in computation time.
– p.19

Lazy and Eager Learning
- An eager learner must create a global approximation.
- A lazy learner can construct a different local approximation to the target function for each distinct query instance.
- If they use the same model space H, the lazy learner can represent more complex functions (e.g., consider H = linear functions).
– p.20
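The sketch referenced on the Locally Weighted Regression slide (p.16) follows. It is a minimal Python illustration, not part of the original slides: it fits a local linear model for a single query point by minimizing the distance-weighted squared error E_2 over all of D; the Gaussian kernel K and its bandwidth parameter tau are assumptions chosen for illustration, not prescribed by the slides.

```python
import numpy as np

def locally_weighted_regression(X, y, x_q, tau=1.0):
    """Predict f(x_q) with a linear model fit locally around the query point.

    X: (n, d) array of training inputs; y: length-n array of targets f(x).
    Minimizes E_2(x_q) = 1/2 * sum_{x in D} (f(x) - fhat(x))^2 K(d(x_q, x)),
    where the kernel K down-weights training points far from x_q.  A Gaussian
    kernel with bandwidth tau is assumed here for illustration.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_q = np.atleast_1d(np.asarray(x_q, dtype=float))

    # Augment with a constant column so the local model is affine: fhat(x) = w . [1, x]
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    xq_aug = np.concatenate([[1.0], x_q])

    # Kernel weights K(d(x_q, x)) for every training point
    sq_dists = np.sum((X - x_q) ** 2, axis=1)
    K = np.exp(-sq_dists / (2.0 * tau ** 2))

    # Weighted least squares: solve (X^T K X) w = X^T K y for the local weights w
    W = np.diag(K)
    w, *_ = np.linalg.lstsq(X_aug.T @ W @ X_aug, X_aug.T @ W @ y, rcond=None)
    return float(xq_aug @ w)
```

Each query point x_q gets its own fit, which is the "piecewise approximation" to f described on slide p.15.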