Instance-Based Learning
Outline
k-Nearest Neighbor
Locally weighted regression
Lazy and eager learning
Instance-Based Learning
Learning methods so far construct a general, explicit
hypothesis of the target function over the entire
instance space from the training examples
Key idea: just store all training examples ⟨xi, f(xi)⟩
When a new query instance is encountered, a set of
similar related instances is retrieved and used to
classify the new query instance
Nearest Neighbor Learning
Assume all instances correspond to points in the
d-dimensional space ℜ^d
Similarity is measured with the standard Euclidean distance
For function approximation:
Given query instance xq, first locate the nearest
training example xn, then estimate f̂(xq) ← f(xn)
For classification:
Given query instance xq, first locate the nearest
training example xn, then assign xq to the
category associated with xn
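As a concrete reading of this slide, here is a minimal sketch of 1-nearest-neighbor prediction with the Euclidean distance; the NumPy implementation and the names (nearest_neighbor_predict, X_train, ...) are illustrative, not from the slides.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query):
    """1-NN: copy the target value of the closest training example."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the query to every training instance
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    n = int(np.argmin(dists))   # index of the nearest training example x_n
    return y_train[n]           # f_hat(x_q) <- f(x_n)
```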
Voronoi diagram
FIGURE 4.13. In two dimensions, the nearest-neighbor algorithm leads to a partitioning of the input space into Voronoi cells, each labeled by the category of the training point it contains. In three dimensions, the cells are three-dimensional, and the decision boundary resembles the surface of a crystal. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
k-Nearest Neighbor Learning
For classification:
Given xq, take a vote among its k nearest neighbors, i.e.,
assign it the category most frequently represented
among the k nearest samples
For real-valued function approximation:
take the mean of the f values of the k nearest neighbors:
\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}
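A small sketch covering both cases above (majority vote for classification, mean of the k nearest neighbors' f values for regression); the function name and the NumPy/Counter-based implementation are illustrative choices.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5, classify=True):
    """k-NN: majority vote over the k nearest neighbors (classification)
    or the mean of their f values (real-valued approximation)."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k closest examples
    neighbor_values = [y_train[i] for i in nearest]
    if classify:
        # category most frequently represented among the k nearest samples
        return Counter(neighbor_values).most_common(1)[0][0]
    # f_hat(x_q) <- (1/k) * sum of f(x_i) over the k nearest neighbors
    return float(np.mean(neighbor_values))
```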
FIGURE 4.15. The k-nearest-neighbor query starts at the test point x and grows a spherical region until it encloses k training samples, and it labels the test point by a majority vote of these samples. In this k = 5 case, the test point x would be labeled the category of the black points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Relation to Bayes Classifier
Let’s place a cell of volume V around x and capture k
samples. Assume ki of the k samples turn out to be
labeled ωi.
Then P(ωi|x) can be estimated as ki/k.
For minimum error rate, we select the category most
frequently represented within the cell
If there are enough samples (n → ∞) and k is large, the
performance will approach the Bayes optimal classifier
If the number of samples is unlimited, the error rate
of the 1-nearest-neighbor classifier is never worse than
twice the Bayes error rate
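The twice-the-Bayes-rate claim is, more precisely, the asymptotic bound of Cover and Hart (not spelled out on the slide); with P* the Bayes error, P the asymptotic 1-NN error rate, and c the number of classes:

```latex
% Asymptotic 1-NN error bound (Cover & Hart); the upper bound never exceeds 2 P^*.
P^{*} \;\le\; P \;\le\; P^{*}\left(2 - \frac{c}{c-1}\,P^{*}\right) \;\le\; 2\,P^{*}
```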
Distance-Weighted kNN
Might want to weight the contribution of each of the k
neighbors according to their distance to the query point,
weighting nearer neighbors more heavily...
\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i\, f(x_i)}{\sum_{i=1}^{k} w_i}
where
w_i \equiv \frac{1}{d(x_q, x_i)^2}
and d(xq, xi) is the distance between xq and xi
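A minimal sketch of the weighted average above for the real-valued case; the eps guard against a zero distance is an implementation choice (an exact match would normally just return the stored f value), and the names are illustrative.

```python
import numpy as np

def distance_weighted_knn(X_train, y_train, x_query, k=5, eps=1e-12):
    """Distance-weighted k-NN for a real-valued target: w_i = 1 / d(x_q, x_i)^2."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    # eps keeps the weight finite if the query coincides with a training point
    w = 1.0 / (dists[nearest] ** 2 + eps)
    return float(np.sum(w * y_train[nearest]) / np.sum(w))
```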
Distance-Weighted kNN
For classification: assign the query to the category with the
largest total weight
Note that it now makes sense to use all training
examples instead of just the k nearest, but querying
becomes slower
Metrics
The nearest-neighbor classifier relies on a metric or
distance function between patterns.
We have assumed the Euclidean metric
The notion of a metric is far more general!
e.g., Manhattan/city block distance
d(a, b) = \sum_{i=1}^{d} |a_i - b_i|
If there is a large disparity in the ranges of the data in each
dimension, a common procedure is to rescale the data to
equalize such ranges
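A short sketch of the two points above, assuming simple min-max rescaling as the "equalize such ranges" step; both helper functions are illustrative.

```python
import numpy as np

def manhattan(a, b):
    """City-block distance: d(a, b) = sum_i |a_i - b_i|."""
    return float(np.sum(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))

def rescale_to_unit_ranges(X_train):
    """Min-max rescale every dimension to [0, 1] so no attribute's range
    dominates the distance; returns the rescaled data plus (lo, span)
    so queries can be rescaled the same way."""
    X_train = np.asarray(X_train, dtype=float)
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span = np.where(span > 0, span, 1.0)   # guard against constant attributes
    return (X_train - lo) / span, (lo, span)
```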
FIGURE 4.18. Scaling the coordinates of a feature space can change the distance relationships computed by the Euclidean metric. Here we see how such scaling can change the behavior of a nearest-neighbor classifier. Consider the test point x and its nearest neighbor. In the original space (left), the black prototype is closest. In the figure at the right, the x1 axis has been rescaled by a factor 1/3; now the nearest prototype is the red one. If there is a large disparity in the ranges of the full data in each dimension, a common procedure is to rescale all the data to equalize such ranges, and this is equivalent to changing the metric in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Curse of Dimensionality
Imagine instances described by 20 attributes, but only 2 are
relevant to target function
Instances that have identical values for the 2 relevant
attributes may be distant from one another in the
20-dimensional instance space.
Nearest-neighbor approaches are easily misled by many
irrelevant attributes when x is high-dimensional
One remedy: select (or weight) the relevant attributes
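One way to sketch that remedy, under the assumption that attribute relevance is expressed as per-axis weights (a weight of zero discards an attribute); how the weights are chosen, e.g., by cross-validation, is left open here.

```python
import numpy as np

def weighted_euclidean(a, b, weights):
    """Euclidean distance after stretching each axis by its weight;
    a weight of 0 drops an irrelevant attribute entirely."""
    diff = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) * np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

# e.g., 20 attributes of which only the first 2 matter to the target function
weights = np.zeros(20)
weights[:2] = 1.0
```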
Remarks
When To Consider Nearest Neighbor
Instances map to points in ℜ^d
Lots of training data
Advantages:
Training is very fast
Learn complex target functions
Don’t lose information
Disadvantages:
Slow at query time
Easily fooled by irrelevant attributes
Locally Weighted Regression
Note that kNN forms a local approximation to f for each
query point xq
Why not form an explicit approximation f̂(x) for the region
surrounding xq?
Fit a linear function to the k nearest neighbors
Fit a quadratic, ...
Produces a “piecewise approximation” to f
Locally Weighted Regression
Several choices of error to minimize:
Squared error over k nearest neighbors
E_1(x_q) \equiv \frac{1}{2} \sum_{x \,\in\, k \text{ nearest nbrs of } x_q} \bigl(f(x) - \hat{f}(x)\bigr)^2
Distance-weighted squared error over all data
E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} \bigl(f(x) - \hat{f}(x)\bigr)^2 \, K\bigl(d(x_q, x)\bigr)
where K(d(xq, x)) is a kernel function that decreases with the distance d(xq, x)
...
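A sketch of minimizing an E2-style criterion with a local linear model, assuming a Gaussian kernel for K and solving the weighted least-squares fit directly; the kernel choice, bandwidth, and names are illustrative assumptions.

```python
import numpy as np

def locally_weighted_linear(X_train, y_train, x_query, bandwidth=1.0):
    """Fit a linear model around x_query by kernel-weighted least squares,
    i.e., minimize sum_x K(d(x_q, x)) * (f(x) - f_hat(x))^2 over all data."""
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    xq = np.asarray(x_query, dtype=float)
    d = np.linalg.norm(X - xq, axis=1)
    k = np.exp(-d ** 2 / (2.0 * bandwidth ** 2))   # Gaussian kernel as K(d(x_q, x))
    A = np.hstack([np.ones((len(X), 1)), X])       # design matrix with intercept column
    sw = np.sqrt(k)                                # sqrt weights turn WLS into OLS
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return float(np.concatenate(([1.0], xq)) @ beta)   # evaluate the local fit at x_query
```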
Locally Weighted Regression
Typically fit simple functions such as constant,
linear, or quadratic
The cost of fitting more complex functions for
each query instance is prohibitively high
These simple approximations model the target
function quite well over a sufficiently small
subregion of the instance space
Case-Based Reasoning (CBR)
Can apply instance-based learning even when X ≠ ℜ^d
→ need a different “distance” metric
Case-Based Reasoning is instance-based learning
applied to instances with symbolic descriptions
Require a similarity metric
Retrieval and combination of cases to solve the
query may rely on knowledge-based reasoning
Help desk, reasoning about legal cases, ...
Lazy and Eager Learning
Lazy: wait for query before generalizing
k-NEAREST NEIGHBOR, case-based reasoning
Eager: generalize before seeing query
neural networks, decision trees, Naive Bayes, ...
Obvious differences in computation time
Lazy and Eager Learning
Eager learner must commit to a single global approximation
Lazy learner can construct a different local
approximation to the target function for each
distinct query instance
If they use the same model space H, the lazy learner can
represent more complex functions (e.g., with H = linear
functions: many local linear fits can jointly approximate
a nonlinear target)