Classification I
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn
Overview
K-Nearest Neighbor Algorithm
Naïve Bayes Classifier
Thomas Bayes
3
Classification
4
Definition
 Classification is one of the fundamental skills for survival.
 Food vs. Predator
 A kind of supervised learning
 Techniques for deducing a function from data
 <Input, Output>
 Input: a vector of features
 Output: a Boolean value (binary classification) or integer (multiclass)
 “Supervised” means:
 A teacher or oracle is needed to label each data sample.
 We will talk about unsupervised learning later.
5
Classifiers
[Figure: students (Sam, Peter, Jack, Jane, Tom, Lisa, Helen, Mary) plotted by Height and Weight; a classifier Z = f(x, y) assigns each point a label from {boy, girl}]
6
Training a Classifier
7
Lazy Learners
[Figure: example images of a truck and a car]
8
Neighborhood
9
K-Nearest Neighbor Algorithm
 The algorithm procedure (a minimal sketch in code follows this slide):
 Given a set of n training data in the form of <x, y>.
 Given an unknown sample x′.
 Calculate the distance d(x′, xi) for i=1 … n.
 Select the K samples with the shortest distances.
 Assign x′ the label that dominates among the K samples.
 It is the simplest classifier you will ever meet (I mean it!).
 No Training (literally)
 A memory of the training data is maintained.
 All computation is deferred until classification.
 Produces satisfactory results in many cases.
 Should give it a go whenever possible.
10
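A minimal Python sketch of the procedure above (the toy data, feature names and K value are made up for illustration):

# Minimal KNN sketch: Euclidean distance, majority vote among the K nearest samples.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Distances from x_new to every training sample.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the K nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # Label that dominates among the K neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features (e.g., height, weight), two labels.
X_train = np.array([[170, 60], [160, 50], [180, 75], [155, 48]])
y_train = np.array(['boy', 'girl', 'boy', 'girl'])
print(knn_predict(X_train, y_train, np.array([165, 55]), k=3))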
Properties of KNN
Instance-Based Learning
No explicit description of the target function
Can handle complicated situations.
11
Properties of KNN
[Figure: a query point "?" that is classified differently by the K=1 and K=7 neighborhoods]
Dependent on the data distribution.
Can make mistakes at decision boundaries.
12
Challenges of KNN

 The Value of K
 Non-monotonous impact on accuracy [Figure: accuracy as a function of K]
 Too big vs. too small
 Rules of thumb
 Weights
 Different features may have different impacts …
 Distance
 There are many different ways to measure the distance.
 Euclidean, Manhattan …
 Complexity
 Need to calculate the distance between x′ and all training data.
 In proportion to the size of the training data.
13
Distance Metrics
L_k(x, y) = ( Σ_{i=1}^{d} |x_i − y_i|^k )^{1/k}

L_2(x, y) = ( Σ_{i=1}^{d} |x_i − y_i|^2 )^{1/2}

L_1(x, y) = Σ_{i=1}^{d} |x_i − y_i|
14
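As a quick numeric illustration of these metrics, a small sketch (the two vectors are arbitrary examples):

# Minkowski (L_k) distance; k=1 gives Manhattan, k=2 gives Euclidean.
import numpy as np

def minkowski(x, y, k):
    return np.sum(np.abs(x - y) ** k) ** (1.0 / k)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 1))   # L1 (Manhattan): 5.0
print(minkowski(x, y, 2))   # L2 (Euclidean): ~3.606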
Distance Metrics
The shortest path between two points …
15
Mahalanobis Distance
Distance from a point to a point set
16
Mahalanobis Distance
D_M(x) = √( (x − μ)^T S^{-1} (x − μ) )

For identity matrix S:

D_M(x) = √( (x − μ)^T (x − μ) )

For diagonal matrix S:

D_M(x) = √( Σ_{i=1}^{n} (x_i − μ_i)^2 / σ_i^2 )
17
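A small NumPy sketch of the definitions above (the point set is made up; the covariance S is estimated from it):

# Mahalanobis distance from a point x to a point set with mean mu and covariance S.
import numpy as np

points = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 4.0]])  # toy point set
mu = points.mean(axis=0)
S = np.cov(points, rowvar=False)       # sample covariance matrix

x = np.array([5.0, 1.0])
diff = x - mu
d_mahal = np.sqrt(diff @ np.linalg.inv(S) @ diff)
d_euclid = np.sqrt(diff @ diff)        # identity S reduces to the Euclidean distance
print(d_mahal, d_euclid)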
Voronoi Diagram
perpendicular bisector
18
Voronoi Diagram
19
Structured Data
20
KD-Tree
Point Set: {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}
21
KD-Tree
function kdtree(list of points pointList, int depth)
{
    if pointList is empty
        return nil;
    else
    {
        // Select axis based on depth so that axis cycles through all valid values
        var int axis := depth mod k;

        // Sort point list and choose median as pivot element
        select median by axis from pointList;

        // Create node and construct subtrees
        var tree_node node;
        node.location := median;
        node.leftChild := kdtree(points in pointList before median, depth+1);
        node.rightChild := kdtree(points in pointList after median, depth+1);
        return node;
    }
}
22
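A runnable Python rendering of the pseudocode above, applied to the point set from the previous slide (the Node layout is an illustrative choice, not part of the slides):

# Build a 2-d kd-tree by splitting on the median along alternating axes.
from collections import namedtuple

Node = namedtuple('Node', ['location', 'left', 'right'])

def kdtree(points, depth=0, k=2):
    if not points:
        return None
    axis = depth % k                            # cycle through the axes
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2                   # choose median as pivot
    return Node(points[median],
                kdtree(points[:median], depth + 1, k),
                kdtree(points[median + 1:], depth + 1, k))

tree = kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.location)   # root splits on x; (7, 2) for this point set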
KD-Tree
23
Evaluation
 Accuracy
 Recall what we have learned in the first lecture …
 Confusion Matrix
 ROC Curve
 Training Set vs. Test Set
 N-fold Cross Validation
[Figure: N-fold cross validation, with a different fold used as the test set in each round]
24
LOOCV
 Leave One Out Cross Validation
 An extreme case of N-fold cross validation
 N=number of available samples
 Usually very time consuming but okay for KNN
 Now, let's try KNN+LOOCV (a minimal sketch follows this slide) …
 All students in this class are given a label, for example:
 Gender: Male vs. Female
 Major: CS vs. EE vs. Automation
25
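A minimal hand-rolled sketch of LOOCV around a 1-NN classifier (the toy data is chosen only for illustration):

# Leave-One-Out Cross Validation for a K-NN classifier.
import numpy as np

def loocv_accuracy(X, y, k=1):
    correct = 0
    n = len(X)
    for i in range(n):                       # hold out sample i
        mask = np.arange(n) != i
        dists = np.linalg.norm(X[mask] - X[i], axis=1)
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(y[mask][nearest], return_counts=True)
        if labels[np.argmax(counts)] == y[i]:
            correct += 1
    return correct / n

X = np.array([[1.0, 1.0], [1.2, 0.9], [3.0, 3.1], [3.2, 2.9]])
y = np.array([0, 0, 1, 1])
print(loocv_accuracy(X, y, k=1))   # 1.0 for this separable toy set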
26
Bayes Theorem
P A  B  P A  PB  P A  B
A
B
P A  B  P A | BPB  PB | AP A
PB | AP A
P A | B  
P B 
Bayes
Theorem
likelihood  prior
posterior 
evidence
27
Fish Example
 Salmon vs. Tuna
 P(ω1)=P(ω2)
 P(ω1)>P(ω2)
 Additional information
Px | i Pi 
Pi | x  
P x 
28
Shooting Example
 Probability of Kill
 P(A): 0.6
 P(B): 0.5
 The target is killed after:
 One shot from A
 One shot from B
 What is the probability that it was hit by A?
 C: The target is killed.
P(A | C) = P(C | A) P(A) / P(C)
         = (1 × 0.6) / (0.6×0.5 + 0.4×0.5 + 0.6×0.5)
         = 3/4
29
Cancer Example
 ω1: Cancer;  ω2: Normal
 P(ω1)=0.008; P(ω2)=0.992
 Lab Test Outcomes: + vs. –
 P(+|ω1)=0.98; P(-|ω1)=0.02
 P(+|ω2)=0.03; P(-|ω2)=0.97
 Now someone has a positive test result…
 Is he/she doomed?
30
Cancer Example
P(ω1 | +) ∝ P(+ | ω1) P(ω1) = 0.98 × 0.008 = 0.0078
P(ω2 | +) ∝ P(+ | ω2) P(ω2) = 0.03 × 0.992 = 0.0298
P(ω1 | +) < P(ω2 | +)
P(ω1 | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21 ≫ P(ω1) = 0.008
31
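The same computation in a few lines of Python, using the numbers from the slide:

# Posterior for the cancer example: P(cancer | +) via Bayes theorem.
p_cancer, p_normal = 0.008, 0.992
p_pos_given_cancer, p_pos_given_normal = 0.98, 0.03

joint_cancer = p_pos_given_cancer * p_cancer     # 0.00784
joint_normal = p_pos_given_normal * p_normal     # 0.02976
posterior = joint_cancer / (joint_cancer + joint_normal)
print(round(posterior, 2))   # ~0.21: probably not doomed, but far above the 0.008 prior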
Headache & Flu Example
 H=“Having a headache”
 F=“Coming down with flu”
 P(H)=1/10; P(F)=1/40;
P(H|F)=1/2
 What does this mean?
 One day you wake up with a headache …
 Since 50% of flu cases are associated with headaches …
 I must have a 50-50 chance of coming down with flu!
32
Headache & Flu Example
The truth is …
P(F | H) = P(H | F) P(F) / P(H) = (1/2 × 1/40) / (1/10) = 1/8
[Figure: Venn diagram of the Headache and Flu events]
33
Naïve Bayes Classifier
ω_MAP = argmax_{ω_i ∈ Ω} P(ω_i | a_1, a_2, ..., a_n)

      = argmax_{ω_i ∈ Ω} P(a_1, a_2, ..., a_n | ω_i) P(ω_i) / P(a_1, a_2, ..., a_n)

      = argmax_{ω_i ∈ Ω} P(a_1, a_2, ..., a_n | ω_i) P(ω_i)

Assuming the attributes a_j are conditionally independent given the class:

ω_MAP = argmax_{ω_i ∈ Ω} P(ω_i) ∏_j P(a_j | ω_i)

MAP: Maximum A Posteriori
34
Independence
P A  B  P APB | A
PB | A  PB
P A  B   P APB 
P( A, B | G)  P( A | G) P( B | G)
P( A | G, B)  P( A | G)
Conditionally Independent
P( A, B | G )  P( A, B,G ) / P(G )  P( A | B,G )  P( B,G ) / P(G )
 P( A | B,G )  P( B | G )
35
Conditional Independence
P(R ∩ B | Y) = P(R | Y) P(B | Y)
36
Independent ≠ Uncorrelated
ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

X ∈ [−1, 1];  Y = X^2

Cov(X, Y) = 0  ⟹  X and Y are uncorrelated.
However, Y is completely determined by X.

X      Y
1      1
0.5    0.25
0.2    0.04
0      0
-0.2   0.04
-0.5   0.25
-1     1

[Figure: plot of Y = X^2 for X in [−1, 1]]
37
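A quick numeric check of this example (sampling X uniformly on [−1, 1] is an illustrative choice):

# Uncorrelated but dependent: cov(X, X^2) -> 0 when X is symmetric around 0.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1_000_000)
Y = X ** 2
print(np.cov(X, Y)[0, 1])       # close to 0: X and Y are uncorrelated
print(np.corrcoef(X, Y)[0, 1])  # correlation coefficient also close to 0
# Yet Y is a deterministic function of X, so they are clearly not independent.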
Estimating P(αj|ωi)
[Table: five training samples with attributes α1, α2, α3 and class ω ∈ {ω1, ω2}: three samples of class ω1 and two of class ω2]

P(ω1) = 3/5;  P(ω2) = 2/5

P(a2 = '+' | ω1) = 2/3
P(a2 = '-' | ω1) = 1/3

Laplace Smoothing:

P(a_j = a_jk | ω_i) = ( |{a_j = a_jk ∧ ω = ω_i}| + 1 ) / ( |{ω = ω_i}| + |a_j| )

where |a_j| is the number of distinct values of attribute a_j.

How about continuous variables?
38
Tennis Example
Day    Outlook   Temperature  Humidity  Wind    PlayTennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
39
Tennis Example
Given: <Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong>

Predict: PlayTennis (yes or no)

Bayes Solution:

P(PlayTennis = yes) = 9/14
P(PlayTennis = no) = 5/14
P(Wind = strong | PlayTennis = yes) = 3/9
P(Wind = strong | PlayTennis = no) = 3/5
...

P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) ≈ 0.0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) ≈ 0.0206

The conclusion is not to play tennis, with probability:

0.0206 / (0.0206 + 0.0053) ≈ 0.795
40
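A minimal Naïve Bayes sketch that reproduces this calculation from the table above (relative frequencies without smoothing, to match the slide's numbers):

# Naive Bayes on the PlayTennis table: pick the class maximizing P(class) * prod_j P(a_j | class).
data = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ('Sunny','Hot','High','Weak','No'),          ('Sunny','Hot','High','Strong','No'),
    ('Overcast','Hot','High','Weak','Yes'),      ('Rain','Mild','High','Weak','Yes'),
    ('Rain','Cool','Normal','Weak','Yes'),       ('Rain','Cool','Normal','Strong','No'),
    ('Overcast','Cool','Normal','Strong','Yes'), ('Sunny','Mild','High','Weak','No'),
    ('Sunny','Cool','Normal','Weak','Yes'),      ('Rain','Mild','Normal','Weak','Yes'),
    ('Sunny','Mild','Normal','Strong','Yes'),    ('Overcast','Mild','High','Strong','Yes'),
    ('Overcast','Hot','Normal','Weak','Yes'),    ('Rain','Mild','High','Strong','No'),
]

def nb_score(x, label):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                       # prior P(label)
    for j, value in enumerate(x):                       # likelihoods P(a_j = value | label)
        score *= sum(1 for r in rows if r[j] == value) / len(rows)
    return score

x = ('Sunny', 'Cool', 'High', 'Strong')
scores = {label: nb_score(x, label) for label in ('Yes', 'No')}
print(scores)                                           # ~{'Yes': 0.0053, 'No': 0.0206}
print(scores['No'] / sum(scores.values()))              # ~0.795: predict "do not play"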
Text Classification Example
Interesting? Boring?
Politics? Entertainment? Sports?
41
Text Representation
α1      α2         α3      α4     …   αn           ω
Long    long       ago     there  …   king         1
New     sanctions  will    be     …   Iran         0
Hidden  Markov     models  are    …   method       0
The     Federal    Court   today  …   investigate  0
We need to estimate probabilities such as P(a_2 = king | ω = 1).
However, there are 2×n×|Vocabulary| terms in total. For n=100 and a
vocabulary of 50,000 distinct words, it adds up to 10 million terms!
42
Text Representation
 By only considering the probability of encountering a specific word
instead of the specific word position, we can reduce the number of
probabilities to be estimated.
 We only count the frequency of each word.
 Now, 2×50,000=100,000 terms need to be estimated.
P(V_k | ω_i) = (n_k + 1) / (n + |Vocabulary|)

 n: the total number of word positions in all training samples whose target value is ω_i.
 n_k: the number of times the word V_k is found among these n positions.
43
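A short sketch of this estimate (the tiny corpus and its labels are made up for illustration):

# Laplace-smoothed word probabilities P(V_k | class) = (n_k + 1) / (n + |Vocabulary|).
from collections import Counter

docs = [("long long ago there lived a king", 1),
        ("new sanctions will be imposed", 0),
        ("the court will investigate the case", 0)]

vocab = {w for text, _ in docs for w in text.split()}

def word_probs(label):
    counts = Counter(w for text, y in docs if y == label for w in text.split())
    n = sum(counts.values())                       # total word positions in this class
    return {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}

probs_interesting = word_probs(1)
print(probs_interesting['king'], probs_interesting['court'])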
Case Study: Newsgroups
 Classification
 Joachims, 1996
 20 newsgroups
 20,000 documents
 Random Guess: 5%
 NB: 89%
 Recommendation
 Lang, 1995
 NewsWeeder
 User rated articles
 Interesting vs. Uninteresting
 Top 10% selected articles
 16% vs. 59%
44
Reading Materials
 C. C. Aggarwal, A. Hinneburg and D. A. Keim, “On the Surprising Behavior of
Distance Metrics in High Dimensional Space,” Proc. the 8th International
Conference on Database Theory, LNCS 1973, pp. 420-434, London, UK, 2001.
 J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An Algorithm for Finding Best
Matches in Logarithmic Expected Time,” ACM Transactions on Mathematical
Software, 3(3):209–226, 1977.
 S. M. Omohundro, “Bumptrees for Efficient Function, Constraint, and Classification
Learning,” Advances in Neural Information Processing Systems 3, pp. 693-699,
Morgan Kaufmann, 1991.
 Tom Mitchell, Machine Learning (Chapter 6), McGraw-Hill.
 Additional reading about Naïve Bayes Classifier

http://www-2.cs.cmu.edu/~tom/NewChapters.html
 Software for text classification using Naïve Bayes Classifier

http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
45
Review
 What is classification?
 What is supervised learning?
 What does KNN stand for?
 What are the major challenges of KNN?
 How to accelerate KNN?
 What is N-fold cross validation?
 What does LOOCV stand for?
 What is Bayes Theorem?
 What is the key assumption in Naïve Bayes Classifiers?
46
Next Week’s Class Talk
 Volunteers are required for next week’s class talk.
 Topic 1: Efficient KNN Implementations
 Hints:
 Ball Trees
 Metric Trees
 R Trees
 Topic 2: Bayesian Belief Networks
 Length: 20 minutes plus question time
47