EE 6885 Statistical Pattern Recognition
Fall 2005
Prof. Shih-Fu Chang
http://www.ee.columbia.edu/~sfchang
Lecture 10 (10/12/05)

Reading
- Distance Metrics: DHS Chap. 4.6
- Linear Discriminant Functions: DHS Chap. 5.1-5.4

Midterm Exam
- Oct. 24th, 2005 (Monday), 1pm-2:30pm (90 mins)
- Open books/notes, no computer

kn-Nearest-Neighbor
- For classification, estimate p_n(x, ωi) for each class ωi and form the posterior:
  p_n(ωi | x) = p_n(x, ωi) / Σ_{j=1}^{c} p_n(x, ωj) = k_i / k
  where k_i of the k nearest neighbors of x belong to class ωi (a small numerical sketch follows below)
- Performance bound of the 1-nearest-neighbor rule (Cover & Hart '67):
  P* ≤ lim_{n→∞} P_n(e) ≤ P* (2 − (c/(c−1)) P*)
  where P* is the Bayes error and c is the number of classes
- Combine K-NN with clustering (K-Means, LVQ, GMM) to reduce complexity
- When K increases: complexity? Smoother decision boundaries
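
A minimal Python sketch of the k_i / k posterior estimate; the toy 2-D samples, labels, query point, and choice of k are made up for illustration and are not from the lecture.

    import numpy as np

    def knn_posterior(X, labels, x, k, num_classes):
        # distances from the query x to every training sample
        d = np.linalg.norm(X - x, axis=1)
        nearest = np.argsort(d)[:k]                  # indices of the k nearest neighbors
        counts = np.bincount(labels[nearest], minlength=num_classes)
        return counts / k                            # posterior estimate k_i / k

    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.1], [0.9, 1.0], [1.2, 0.8]])
    labels = np.array([0, 0, 1, 1, 1])
    print(knn_posterior(X, labels, x=np.array([0.8, 0.9]), k=3, num_classes=2))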

Distance Metrics
- Nearest neighbor rules need a distance metric
- Required properties of a metric:
  1. non-negativity: D(a, b) ≥ 0
  2. reflexivity: D(a, b) = 0 iff a = b
  3. symmetry: D(a, b) = D(b, a)
  4. triangular inequality: D(a, b) + D(b, c) ≥ D(c, a), i.e., D(a, b) ≥ D(c, a) − D(b, c)
- Minkowski metric: L_k(a, b) = (Σ_{i=1}^{d} |a_i − b_i|^k)^{1/k}
  - Euclidean (k = 2)
  - Manhattan (k = 1)
  - L∞ (useful in indexing)
- Tanimoto metric, for sets of elements where point-to-point distance is not useful:
  D_tanimoto(S1, S2) = (n1 + n2 − 2 n12) / (n1 + n2 − n12) = ((n1 − n12) + (n2 − n12)) / (n1 + n2 − n12)
  where n1 and n2 are the sizes of S1 and S2, and n12 is the number of elements they share
  (a small sketch of both metrics follows below)
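
A minimal Python sketch of the two metrics above (not from the lecture; the vectors and sets are made-up examples).

    import numpy as np

    def minkowski(a, b, k):
        # L_k(a, b) = (sum_i |a_i - b_i|^k)^(1/k); k=1 Manhattan, k=2 Euclidean
        return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

    def tanimoto(s1, s2):
        # D = (n1 + n2 - 2*n12) / (n1 + n2 - n12), with n12 = |s1 ∩ s2|
        n1, n2, n12 = len(s1), len(s2), len(s1 & s2)
        return (n1 + n2 - 2 * n12) / (n1 + n2 - n12)

    a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
    print(minkowski(a, b, 1), minkowski(a, b, 2))    # 7.0 (Manhattan), 5.0 (Euclidean)
    print(tanimoto({1, 2, 3}, {2, 3, 4}))            # 0.5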

Discriminant Functions Revisited
- Define a discriminant function g_i(x) for each class ωi; map x to class ωi if g_i(x) ≥ g_j(x) ∀ j ≠ i
- e.g., g_i(x) = ln P(x | ωi) + ln P(ωi)  →  MAP classifier
- Gaussian case: P(x | ωi) = N(μ_i, Σ_i):
  P(x | ωi) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) exp(−(1/2) (x − μ_i)^t Σ_i^{−1} (x − μ_i))
- Case I: Σ_i = Σ
  g_i(x) = w_i^t x + w_i0, a hyperplane with bias w_i0
- Case II: Σ_i arbitrary
  g_i(x) = −(1/2) (x − μ_i)^t Σ_i^{−1} (x − μ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(ωi)
  i.e., g_i(x) = x^t W_i x + w_i^t x + w_i0
  Decision boundaries may be hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids
  (a small sketch of Case II follows below)
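
A minimal Python sketch evaluating the Case II (arbitrary Σ_i) discriminant; the class means, covariances, priors, and query point are made up for illustration.

    import numpy as np

    def gaussian_discriminant(x, mu, Sigma, prior):
        # g_i(x) = -0.5 (x-mu)^t Sigma^{-1} (x-mu) - (d/2) ln 2*pi - 0.5 ln|Sigma| + ln P(w_i)
        d = len(mu)
        diff = x - mu
        return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
                - 0.5 * d * np.log(2 * np.pi)
                - 0.5 * np.log(np.linalg.det(Sigma))
                + np.log(prior))

    x = np.array([1.0, 0.5])
    g1 = gaussian_discriminant(x, np.array([0.0, 0.0]), np.eye(2), prior=0.5)
    g2 = gaussian_discriminant(x, np.array([2.0, 2.0]), 0.5 * np.eye(2), prior=0.5)
    print('assign x to class', 1 if g1 >= g2 else 2)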

Discriminant Functions (Chap. 5)
- Directly define discriminant functions and optimize them
  - Do not assume parametric distribution functions for P(x | ωi)
  - Easy to derive useful classifiers
- Linear functions: g(x) = w^t x + w_0, w: weight vector, w_0: bias
- Two-category case: map x to class ω1 if g(x) > 0, otherwise to class ω2
  - Decision surface H: g(x) = 0
  - Write x = x_p + r · w/||w||, where x_p is the projection of x onto H (so g(x_p) = 0) and r is the signed distance from x to H
  - g(x) = g(x_p + r · w/||w||) = r w^t w / ||w|| = r ||w||  ⇒  r = g(x) / ||w||
  (a small sketch follows below)
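
A minimal Python sketch of the two-category rule and the distance r = g(x)/||w||; the weight vector, bias, and query point are made-up values.

    import numpy as np

    w = np.array([2.0, -1.0])        # weight vector
    w0 = 0.5                         # bias
    x = np.array([1.0, 1.0])

    g = w @ x + w0                   # g(x) = w^t x + w0
    r = g / np.linalg.norm(w)        # signed distance from x to the hyperplane g(x) = 0
    print('class', 1 if g > 0 else 2, 'signed distance', r)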

Multi-category Case
- c categories: ω1, ω2, ..., ωc; how many classifiers are needed?
- Approaches:
  - Use a two-class discriminant per class ("does x belong to class ωi or not?") ⇒ need c discriminant functions
  - Use a two-class discriminant for each pair of classes ⇒ need c(c − 1)/2 discriminant functions
- General approach: one function for each class, g_i(x) = w_i^t x + w_i0
  - Map x to class ωi if g_i(x) ≥ g_j(x) ∀ j ≠ i (see the sketch below)
  - Decision boundary H_ij: g_i(x) = g_j(x), i.e., (w_i − w_j)^t x + (w_i0 − w_j0) = 0
- Each decision region is convex and singly connected; good for unimodal distributions
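
A minimal Python sketch of the general approach (one linear function per class, assign x to the class with the largest g_i); the weights, biases, and query point are made up.

    import numpy as np

    W = np.array([[1.0, 0.0],          # one weight vector w_i per row
                  [0.0, 1.0],
                  [-1.0, -1.0]])
    w0 = np.array([0.0, -0.5, 1.0])    # one bias w_i0 per class

    x = np.array([0.7, 0.3])
    g = W @ x + w0                     # all discriminant values g_i(x)
    print('assign x to class', np.argmax(g) + 1)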

Method for searching decision boundaries
- g_i(x) = w_i^t x + w_i0  ⇒  find the weight w_i and bias w_i0
- Augmented vectors:
  y = [1, x_1, ..., x_d]^t = [1; x]
  a_i = [w_i0, w_i1, ..., w_id]^t = [w_i0; w_i]
  ⇒ g_i(y) = a_i^t y
- Decision boundary:
  H_ij: (w_i − w_j)^t x + (w_i0 − w_j0) = 0  ⇒  H_ij: (a_i − a_j)^t y = 0
- Two-category case:
  H: w^t x + w_0 = 0  ⇒  H: a^t y = 0
  a hyperplane through the origin in the augmented y space, with normal vector a
  (a small sketch follows below)
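
A minimal Python sketch showing that the augmented form a^t y reproduces g(x) = w^t x + w_0; the numbers are made up.

    import numpy as np

    x = np.array([2.0, -1.0])
    w = np.array([0.5, 1.5])
    w0 = -0.25

    y = np.concatenate(([1.0], x))    # augmented feature vector y = [1, x]^t
    a = np.concatenate(([w0], w))     # augmented weight vector a = [w0, w]^t

    print(w @ x + w0, a @ y)          # the two forms of g(x) agree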

Search Method for Linear Discriminant
- All sample points reside in the y_1 = 1 subspace
- Distance from x to the boundary in x space: r = g(x) / ||w||
- Distance from y = [1, x]^t to the boundary in y space: r′ = a^t y / ||a||
  - r′ and r have the same sign, and |r′| ≤ |r| (since ||a|| ≥ ||w||), so r′ gives a bound for r
- Design objective: find a that correctly classifies every sample
  - ∀ y_i in class ω1: a^t y_i > 0;  ∀ y_i in class ω2: a^t y_i < 0
- Normalization: ∀ y_i in class ω2, replace y_i ← −y_i
- New design objective: ∀ y_i (in class ω1 or ω2), a^t y_i > 0 (see the sketch below)
- Solution region: the intersection of the positive sides of all hyperplanes a^t y_i = 0
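
A minimal Python sketch of the normalization step and the resulting single condition a^t y_i > 0; the augmented samples, labels, and candidate a are made up.

    import numpy as np

    Y = np.array([[1.0,  1.0,  2.0],    # augmented samples y_i = [1, x]^t
                  [1.0,  2.0,  1.0],
                  [1.0, -1.0, -1.0],
                  [1.0, -2.0, -0.5]])
    labels = np.array([1, 1, 2, 2])

    Y_norm = Y.copy()
    Y_norm[labels == 2] *= -1           # negate the samples from class omega_2

    a = np.array([0.0, 1.0, 1.0])       # a candidate weight vector
    print(np.all(Y_norm @ a > 0))       # True iff a classifies every sample correctly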

Searching Linear Discriminant Solutions
- Solution region with margin: ∀ y_i in class ω1 or ω2, a^t y_i > b
- Search approaches:
  - Gradient descent methods to find a solution in the solution region
  - Maximize the margin

Gradient Descent (GD)
- Choose a criterion function J(a) that is minimized when a is in the solution region
- Examples:
  - Number of samples misclassified: J(a) = |Y|, the size of the misclassified set Y
  - Sum of distances from the misclassified samples to H  →  perceptron criterion:
    J_p(a) = Σ_{y∈Y} (−a^t y), where Y is the set of misclassified samples
  - Quadratic error: J_q(a) = Σ_{y∈Y} (a^t y)^2
  - Quadratic error with margin: J_q(a) = (1/2) Σ_{y∈Y} (a^t y − b)^2 / ||y||^2, where Y: {y | a^t y ≤ b}
- Update rule: repeat a(k + 1) = a(k) − η(k) ∇J(a(k)), where η(k) is the learning rate
  (a sketch for the margin criterion follows below)
- Stop when GD makes the misclassification set Y empty or another stopping criterion is met
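
A minimal Python sketch of batch gradient descent on the quadratic-error-with-margin criterion; the data (already normalized, i.e., class ω2 samples negated), the margin b, the rate η, and the iteration limit are made-up choices, not values from the lecture.

    import numpy as np

    def margin_gd(Y, b=1.0, eta=1.5, max_iter=100):
        a = np.zeros(Y.shape[1])
        for _ in range(max_iter):
            mis = Y[Y @ a <= b]                        # Y: samples with a^t y <= b
            if mis.size == 0:
                break                                  # margin satisfied for every sample
            # gradient of J_q(a) = 0.5 * sum (a^t y - b)^2 / ||y||^2
            grad = ((mis @ a - b) / np.sum(mis ** 2, axis=1)) @ mis
            a = a - eta * grad                         # a(k+1) = a(k) - eta(k) * grad J
        return a

    Y = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [-1.0, 1.0, 1.0]])
    a = margin_gd(Y)
    print(a, Y @ a)                                    # each a^t y should exceed b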

Different Criterion Functions
[Figure omitted: plots comparing the criterion functions above. The misclassification count is piecewise constant and not differentiable, so GD is not applicable to it; without a margin, solutions may be trapped at the boundaries of the solution region; with a margin, solutions are moved away from the boundaries.]

GD based on perceptron criterion
- J_p(a) = Σ_{y∈Y} (−a^t y), where Y is the set of misclassified samples
  ∇J_p(a) = Σ_{y∈Y} (−y)
- Batch perceptron update (see the sketch below):
  - Initialize a(1); choose the rate η(·) and a stopping criterion θ
  - Loop: a(k + 1) = a(k) + η(k) Σ_{y∈Y} y
  - Until ||η(k) Σ_{y∈Y} y|| < θ
- Example: a(1) = 0, η(k) = 1
  - At each step, add the sum of the misclassified samples
- Theorem: if the samples are linearly separable, a solution can always be found within a finite number of steps
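
A minimal Python sketch of the batch perceptron update with a(1) = 0 and η(k) = 1; the (already normalized) augmented samples and the stopping parameters are made up.

    import numpy as np

    def batch_perceptron(Y, eta=1.0, theta=1e-6, max_iter=1000):
        a = np.zeros(Y.shape[1])                 # a(1) = 0
        for _ in range(max_iter):
            mis = Y[Y @ a <= 0]                  # misclassified samples: a^t y <= 0
            if mis.size == 0:
                break                            # all samples correctly classified
            step = eta * mis.sum(axis=0)         # eta(k) * sum of misclassified y
            a = a + step
            if np.linalg.norm(step) < theta:     # stopping criterion
                break
        return a

    Y = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [-1.0, 1.0, 1.0]])
    a = batch_perceptron(Y)
    print(a, Y @ a)                               # every a^t y should now be positive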