CS 6243 Machine Learning

Instance-based learning
Lazy vs. Eager Learning
• Lazy vs. eager learning
– Eager learning (e.g., decision tree learning): Given a training set, constructs a classification model before receiving new (e.g., test) data to classify
– Lazy learning (e.g., instance-based learning): Simply stores
training data (or only minor processing) and waits until it is given
a test tuple
• Lazy: less time in training but more time in predicting
Nearest neighbor classifier - Basic idea
• For each test case h:
– Find the k training instances that are closest to h
– Return the most frequent class label among them (see the sketch below)
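A minimal sketch of this procedure in Python/NumPy; the function name knn_predict and the toy data are illustrative, not from the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training instances."""
    # Euclidean distance from x_test to every training instance
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # Most frequent class label among those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> A
```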
Practical issues
• Similarity / distance function
• Number of neighbors
• Instance weighting
• Attribute weighting
• Algorithms / data structure to improve efficiency
• Explicit concept generalization
Similarity / distance measure
• Euclidean distance
\( \sqrt{(a_1^{(1)} - a_1^{(2)})^2 + (a_2^{(1)} - a_2^{(2)})^2 + \cdots + (a_k^{(1)} - a_k^{(2)})^2} \)
• City-block (Manhattan) distance
\( \sum_{i=1}^{k} \lvert a_i^{(1)} - a_i^{(2)} \rvert \)
• Dot product / cosine function
\( \dfrac{\sum_{i=1}^{k} a_i^{(1)} a_i^{(2)}}{\sqrt{\sum_{i=1}^{k} \big(a_i^{(1)}\big)^2}\;\sqrt{\sum_{i=1}^{k} \big(a_i^{(2)}\big)^2}} \)
– Good for high-dimensional sparse feature vectors
– Popular in document classification / information retrieval
• Pearson correlation coefficient
– Measures linear dependency
– Popular in biology
• Nominal attributes: distance is set to 1 if values are
different, 0 if they are equal
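A sketch of these measures in Python/NumPy (function names are mine; the nominal rule is the 0/1 mismatch distance from the last bullet):

```python
import numpy as np

def euclidean(a1, a2):
    return np.sqrt(np.sum((a1 - a2) ** 2))

def manhattan(a1, a2):
    return np.sum(np.abs(a1 - a2))

def cosine(a1, a2):
    # Dot product normalized by vector lengths; larger = more similar
    return np.dot(a1, a2) / (np.linalg.norm(a1) * np.linalg.norm(a2))

def nominal_distance(v1, v2):
    # 1 if the values differ, 0 if they are equal
    return 0 if v1 == v2 else 1

a1 = np.array([1.0, 0.0, 2.0])
a2 = np.array([0.0, 1.0, 2.0])
print(euclidean(a1, a2), manhattan(a1, a2), cosine(a1, a2))
print(nominal_distance("red", "blue"))  # -> 1
```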
Normalization and other issues
• Different attributes are measured on different scales → need to be normalized:
\( a_i = \dfrac{v_i - \min v_i}{\max v_i - \min v_i} \)   or   \( a_i = \dfrac{v_i - \mathrm{Avg}(v_i)}{\mathrm{StDev}(v_i)} \)
where \( v_i \) is the actual value of attribute i
• Row normalization / column normalization
• Common policy for missing values: assumed to
be maximally distant (given normalized
attributes)
(slide adapted from Witten & Eibe)
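A minimal NumPy sketch of the two column-normalization formulas above (missing-value handling not shown):

```python
import numpy as np

def min_max_normalize(X):
    # Rescale each attribute (column) to [0, 1]; assumes max > min for every column
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

def z_score_normalize(X):
    # Center each attribute on its mean, divide by its standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 250.0]])
print(min_max_normalize(X))
print(z_score_normalize(X))
```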
Number of neighbors
• 1-NN is sensitive to noisy instances
• In general, the larger the number of training instances, the larger k can be
• k can be determined by minimizing the estimated classification error (using cross-validation); see the sketch below:
– Search over K = 1, 2, 3, …, Kmax; choose the search size Kmax based on compute constraints
– Estimate the average classification error for each K
– Pick the K that minimizes the estimated error
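One way to carry out this search, sketched with scikit-learn (assumed available); the dataset, Kmax = 25, and 5-fold cross-validation are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Kmax = 25  # search size chosen to fit the compute budget

# Estimate average classification error for each K by 5-fold cross-validation
errors = []
for k in range(1, Kmax + 1):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    errors.append(1.0 - scores.mean())

best_k = int(np.argmin(errors)) + 1  # +1 because K starts at 1
print("best K:", best_k)
```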
Instance weighting
• We might want to weight nearer neighbors more heavily
\( w_i = \dfrac{1}{D\big(a^{(q)}, a^{(i)}\big)^2} \)
• Each nearest neighbor casts its vote with a weight
• The final prediction is the class with the highest sum of weights
– In this case we may use all instances (no need to choose k)
– Shepard’s method
• Can also do numerical prediction
\( f\big(a^{(q)}\big) := \dfrac{\sum_{i=1}^{k} w_i\, f\big(a^{(i)}\big)}{\sum_{i=1}^{k} w_i} \)
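A sketch of distance-weighted voting and the numerical prediction above, using w_i = 1/D(a^(q), a^(i))^2; the small eps guarding against zero distances and the function names are my additions:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_q, k=None, eps=1e-12):
    """Distance-weighted vote; with k=None every instance votes (Shepard's method)."""
    d = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    idx = np.argsort(d) if k is None else np.argsort(d)[:k]
    w = 1.0 / (d[idx] ** 2 + eps)          # w_i = 1 / D(a_q, a_i)^2
    votes = defaultdict(float)
    for weight, i in zip(w, idx):
        votes[y_train[i]] += weight        # each neighbor votes with its weight
    return max(votes, key=votes.get)       # class with the highest sum of weights

def weighted_knn_predict(X_train, f_train, x_q, k=5, eps=1e-12):
    """Numerical prediction: weighted average of the neighbors' target values."""
    d = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] ** 2 + eps)
    return np.sum(w * f_train[idx]) / np.sum(w)
```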
Attribute weighting
• Simple strategy:
– Calculate correlation between attribute values
and class labels
– More relevant attributes have higher weights
• More advanced strategy:
– Iterative updating (IBk)
– Slides for Ch6
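A sketch of the simple correlation-based strategy (not the IBk scheme from Ch6); coding class labels as integers makes the Pearson correlation computable, which is most meaningful for binary or ordered classes:

```python
import numpy as np

def correlation_attribute_weights(X, y):
    """Weight each attribute by |Pearson correlation| with the (integer-coded) class labels."""
    _, y_codes = np.unique(y, return_inverse=True)   # code labels as 0, 1, 2, ...
    return np.array([abs(np.corrcoef(X[:, j], y_codes)[0, 1])
                     for j in range(X.shape[1])])

def weighted_euclidean(a1, a2, w):
    # More relevant attributes contribute more to the distance
    return np.sqrt(np.sum(w * (a1 - a2) ** 2))
```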
Other issues
• Algorithms / data structure to improve efficiency
– Data structures for efficiently finding nearest neighbors: kD-tree, ball tree (see the sketch after this slide)
• Does not affect classification results
• Ch4 slides
– Algorithms to select prototypes
• May affect classification results
• IBk. Ch6 slides
• Concept generalization
– Should we do it or not?
– Ch6 slides
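A sketch of nearest-neighbor search with scikit-learn's BallTree (assumed available); sklearn.neighbors.KDTree could be used the same way:

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X_train = rng.random((10000, 8))   # 10,000 training instances, 8 attributes

tree = BallTree(X_train)           # built once, before any queries arrive
x_q = rng.random((1, 8))           # one query instance

# Distances and indices of the 5 nearest training instances
dist, ind = tree.query(x_q, k=5)
print(ind[0], dist[0])
```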
Discussion of kNN
• Pros:
– Often very accurate
– Easy to implement
– Fast to train
– Arbitrary decision boundary
• Cons:
– Classification is slow (remedy: ball tree, prototype
selection)
– Assumes all attributes are equally important (remedy: attribute selection or weights; even so, the curse of dimensionality remains)
– No explicit knowledge discovery
(slide adapted from Witten & Eibe)
Decision boundary
[Figure: example kNN decision boundary in the x–y plane separating + and − instances; cf. Sec. 14.6]