CS 6243 Machine Learning
Instance-based learning

Lazy vs. Eager Learning
• Lazy vs. eager learning
  – Eager learning (e.g., decision tree learning): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
  – Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
• Lazy: less time in training but more time in predicting

Nearest neighbor classifier – Basic idea
• For each test case h
  – Find the k training instances that are closest to h
  – Return the most frequent class label among them
  (a minimal Python sketch of this procedure appears after the slides)

Practical issues
• Similarity / distance function
• Number of neighbors
• Instance weighting
• Attribute weighting
• Algorithms / data structures to improve efficiency
• Explicit concept generalization

Similarity / distance measure
• Euclidean distance: $\sqrt{(a_1^{(1)} - a_1^{(2)})^2 + (a_2^{(1)} - a_2^{(2)})^2 + \cdots + (a_k^{(1)} - a_k^{(2)})^2}$
• City-block (Manhattan) distance: $\sum_{i=1}^{k} \left| a_i^{(1)} - a_i^{(2)} \right|$
• Dot product / cosine function: $\dfrac{\sum_{i=1}^{k} a_i^{(1)} a_i^{(2)}}{\sqrt{\sum_{i=1}^{k} \bigl(a_i^{(1)}\bigr)^2}\,\sqrt{\sum_{i=1}^{k} \bigl(a_i^{(2)}\bigr)^2}}$
  – Good for high-dimensional sparse feature vectors
  – Popular in document classification / information retrieval
• Pearson correlation coefficient
  – Measures linear dependency
  – Popular in biology
• Nominal attributes: distance is set to 1 if the values are different, 0 if they are equal

Normalization and other issues
• Different attributes are measured on different scales and need to be normalized:
  $a_i = \dfrac{v_i - \min v_i}{\max v_i - \min v_i}$  or  $a_i = \dfrac{v_i - \mathrm{Avg}(v_i)}{\mathrm{StDev}(v_i)}$, where $v_i$ is the actual value of attribute i
• Row normalization / column normalization
• Common policy for missing values: assumed to be maximally distant (given normalized attributes)

Number of neighbors
• 1-NN is sensitive to noisy instances
• In general, the larger the number of training instances, the larger the value of k
• k can be determined by minimizing the estimated classification error (using cross-validation; see the sketch after the slides):
  – Search over K = 1, 2, 3, …, Kmax; choose the search size Kmax based on compute constraints
  – Estimate the average classification error for each K
  – Pick the K that minimizes the classification error

Instance weighting
• We might want to weight nearer neighbors more heavily: $w_i = \dfrac{1}{D\bigl(a^{(q)}, a^{(i)}\bigr)^2}$
• Each nearest neighbor casts its vote with a weight
• The final prediction is the class with the highest sum of weights
  – In this case all instances may be used (no need to choose k) – Shepard's method
• Can also do numerical prediction: $f\bigl(a^{(q)}\bigr) \leftarrow \dfrac{\sum_{i=1}^{k} w_i \, f\bigl(a^{(i)}\bigr)}{\sum_{i=1}^{k} w_i}$
  (a sketch of both variants appears after the slides)

Attribute weighting
• Simple strategy (see the sketch after the slides):
  – Calculate the correlation between attribute values and class labels
  – More relevant attributes receive higher weights
• More advanced strategy:
  – Iterative updating (IBk)
  – Slides for Ch6

Other issues
• Algorithms / data structures to improve efficiency
  – Data structures that enable efficiently finding nearest neighbors: kD-tree, ball tree
    • Do not affect classification results
    • Ch4 slides
  – Algorithms to select prototypes
    • May affect classification results
    • IBk; Ch6 slides
• Concept generalization
  – Should we do it or not?
  – Ch6 slides

Discussion of kNN
• Pros:
  – Often very accurate
  – Easy to implement
  – Fast to train
  – Arbitrary decision boundaries
• Cons:
  – Classification is slow (remedy: ball tree, prototype selection)
  – Assumes all attributes are equally important (remedy: attribute selection or weighting, but the curse of dimensionality remains)
  – No explicit knowledge discovery

Decision boundary
[Figure: 1-NN decision boundary separating "+" and "−" instances in a two-dimensional (x, y) feature space]
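A minimal sketch of the basic k-NN procedure above, assuming NumPy and a small in-memory training set: attributes are min-max normalized, distances are Euclidean, and the prediction is a majority vote over the k closest instances. The helper names (min_max_normalize, knn_classify) and the toy data are illustrative, not from any particular library.

```python
import numpy as np
from collections import Counter

def min_max_normalize(X):
    """Rescale each attribute (column) to [0, 1]: a_i = (v_i - min) / (max - min)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero for constant columns
    return (X - lo) / span, (lo, span)

def knn_classify(X_train, y_train, x_query, k=3):
    """Return the most frequent class label among the k nearest training instances."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))   # Euclidean distance to the query
    nearest = np.argsort(dists)[:k]                           # indices of the k closest instances
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy usage (hypothetical data)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y = np.array(["A", "A", "B", "B"])
Xn, (lo, span) = min_max_normalize(X)
query = (np.array([1.2, 2.1]) - lo) / span      # normalize the query with the training statistics
print(knn_classify(Xn, y, query, k=3))          # -> "A"
```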
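A sketch of the instance-weighting slide, using the weight $w_i = 1 / D(a^{(q)}, a^{(i)})^2$: with k=None every training instance votes (Shepard's method), otherwise only the k nearest do; the numeric variant returns the weight-averaged target value. The small eps term is an added guard for a query that coincides with a training instance and is not part of the slide's formula.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_query, k=None, eps=1e-12):
    """Distance-weighted vote with w_i = 1 / d_i^2.
    If k is None, all training instances vote (Shepard's method)."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    voters = np.argsort(dists) if k is None else np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in voters:
        votes[y_train[i]] += 1.0 / (dists[i] ** 2 + eps)   # accumulate weighted votes per class
    return max(votes, key=votes.get)                       # class with the highest sum of weights

def weighted_knn_predict(X_train, y_train, x_query, k=5, eps=1e-12):
    """Numeric prediction: f(a_q) = sum_i w_i * f(a_i) / sum_i w_i over the k nearest instances."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)
    return float(np.dot(w, y_train[nearest]) / w.sum())
```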
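A sketch of choosing k by minimizing estimated classification error, as outlined on the "Number of neighbors" slide. The slide leaves the cross-validation scheme open; leave-one-out is used here as one convenient choice for a small training set, and the function name choose_k_by_loocv is illustrative.

```python
import numpy as np

def choose_k_by_loocv(X, y, k_max=25):
    """Pick k in 1..k_max that minimizes leave-one-out classification error."""
    n = len(X)
    # Pairwise Euclidean distances; the diagonal is set to infinity so an
    # instance never counts itself as its own neighbor.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)                  # neighbors of each instance, nearest first
    best_k, best_err = 1, float("inf")
    for k in range(1, min(k_max, n - 1) + 1):
        errors = 0
        for i in range(n):
            neigh = y[order[i, :k]]                # labels of the k nearest neighbors of instance i
            labels, counts = np.unique(neigh, return_counts=True)
            pred = labels[np.argmax(counts)]       # majority vote
            errors += pred != y[i]
        err = errors / n
        if err < best_err:                         # keep the k with the lowest estimated error
            best_k, best_err = k, err
    return best_k, best_err
```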
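One possible reading of the simple attribute-weighting strategy: weight each attribute by the absolute Pearson correlation between its values and a 0/1 encoding of the class, then use those weights inside the distance. This is an illustrative two-class sketch, not the IBk update rule mentioned on the slide; the helper names are assumptions.

```python
import numpy as np

def correlation_attribute_weights(X, y):
    """Weight each attribute by |Pearson correlation| with a 0/1 encoding of the class."""
    classes = np.unique(y)
    y_num = (y == classes[0]).astype(float)        # binary encoding; multi-class needs more care
    weights = np.array([abs(np.corrcoef(X[:, j], y_num)[0, 1]) for j in range(X.shape[1])])
    return np.nan_to_num(weights)                  # constant attributes get weight 0

def weighted_euclidean(x1, x2, w):
    """Attribute-weighted Euclidean distance: sqrt(sum_j w_j * (x1_j - x2_j)^2)."""
    return np.sqrt((w * (x1 - x2) ** 2).sum())
```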