Linear discriminant functions

Introduction to Pattern Recognition for Human ICT
Linear discriminant functions
2014. 11. 7
Hyunki Hong
Contents
• Perceptron learning
• Minimum squared error (MSE) solution
• Least-mean squares (LMS) rule
• Ho-Kashyap procedure
Linear discriminant functions
• The objective of this lecture is to present methods for
learning linear discriminant functions of the form
g(x) = w^T x + w0
where w is the weight vector and w0 is the threshold weight or bias.
– The discriminant functions will be derived in a non-parametric fashion,
that is, no assumptions will be made about the underlying densities.
– For convenience, we will focus on the binary classification problem.
– Extension to the multi-category case can be easily achieved by
1. Using ωi / not-ωi dichotomies (one class versus the rest)
2. Using ωi / ωj dichotomies (one class versus another)
Dichotomy: a division of a whole into exactly two non-overlapping parts.
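For example, with c classes the ωi / not-ωi approach trains c binary discriminants (each class against the rest), while the ωi / ωj approach trains one discriminant per pair of classes, c(c−1)/2 in total.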
Gradient descent
• A general method for function minimization.
– Recall that the minimum of a function J(x) is defined by the
zeros of its gradient: ∇J(x) = 0.
1. Only in special cases does this minimization problem have a
closed-form solution.
2. In some other cases a closed-form solution may exist, but it is
numerically ill-posed or impractical (e.g., memory requirements).
– Gradient descent finds the minimum in an iterative fashion by
moving in the direction of steepest descent:
x(k+1) = x(k) − η ∇J(x(k))
where η is a learning rate.
Perceptron learning
• Let us now consider the problem of learning a binary
classification problem with a linear discriminant function.
– As usual, assume we have a dataset X = {x(1), x(2), …, x(N)}
containing examples from the two classes.
– For convenience, we will absorb the intercept w0 by augmenting
the feature vector x with an additional constant dimension:
y = [1 x^T]^T, a = [w0 w^T]^T, so that g(x) = w^T x + w0 = a^T y
– Keep in mind that our objective is to find a vector a such that
a^T y > 0 for examples from ω1 and a^T y < 0 for examples from ω2.
– To simplify the derivation, we will "normalize" the training set
by replacing all examples from class ω2 by their negatives (y ← −y).
– This allows us to ignore class labels and look for a vector a such that
a^T y > 0 for every example in the training set.
• To find a solution we must first define an objective function J(a).
– A good choice is what is known as the Perceptron criterion function
J_P(a) = Σ_{y∈Y_M} (−a^T y)
where Y_M is the set of examples misclassified by a.
Note that J_P(a) is non-negative, since a^T y < 0 for all misclassified samples.
• To find the minimum of this function, we use gradient descent.
– The gradient is defined by
∇J_P(a) = Σ_{y∈Y_M} (−y)
– and the gradient descent update rule becomes
a(k+1) = a(k) + η Σ_{y∈Y_M} y
– This is known as the perceptron batch update rule.
1. The weight vector may also be updated in an "on-line" fashion,
that is, after the presentation of each individual example:
a(k+1) = a(k) + η y(i)
where y(i) is an example that has been misclassified by a(k).
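A minimal MATLAB sketch of the on-line rule applied to a small augmented, "normalized" training set (the toy samples, learning rate, and stopping test are illustrative assumptions):

% On-line perceptron rule: rows of Y are augmented samples y(i) = [1 x(i)],
% with the class-2 rows negated, so a correct solution satisfies Y*a > 0.
Y   = [ 1  2.0  1.5;           % class 1 examples (augmented)
        1  1.5  2.5;
       -1 -0.2 -0.3;           % class 2 examples, negated
       -1 -0.5  0.1];
a   = zeros(3,1);              % initial weight vector
eta = 1.0;                     % learning rate
for epoch = 1:100
    updated = false;
    for i = 1:size(Y,1)
        if Y(i,:)*a <= 0                 % sample i is misclassified
            a = a + eta*Y(i,:)';         % a(k+1) = a(k) + eta*y(i)
            updated = true;
        end
    end
    if ~updated, break; end              % every sample satisfies a^T y > 0
end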
Perceptron learning
• If classes are linearly separable, the perceptron rule is
guaranteed to converge to a valid solution.
– Some versions of the perceptron rule use a variable learning rate η(k).
– In this case, convergence is guaranteed only under certain
conditions.
• However, if the two classes are not linearly separable, the
perceptron rule will not converge.
– Since no weight vector a can correctly classify every sample
in a non-separable dataset, the corrections in the perceptron
rule will never cease.
– One ad-hoc solution to this problem is to enforce convergence
by using variable learning rates η(k) that approach zero as k → ∞.
Minimum Squared Error (MSE) solution
• The classical MSE criterion provides an alternative to
the perceptron rule.
– The perceptron rule seeks a weight vector a such that
a^T y(i) > 0
The perceptron rule only considers misclassified samples,
since these are the only ones that violate the above inequality.
– Instead, the MSE criterion looks for a solution to the
equality a^T y(i) = b(i), where the b(i) are some pre-specified target
values (e.g., class labels).
As a result, the MSE solution uses ALL of the samples in
the training set.
• The system of equations solved by MSE is
Ya = b
– where a is the weight vector, each row in Y is a training
example, and each element of b is the corresponding target value (class label).
1. For consistency, we will continue assuming that examples
from class ω2 have been replaced by their negative vector,
although this is not a requirement for the MSE solution.
• An exact solution to Ya = b can sometimes be found.
– If the number of (independent) equations (N) is equal to the
number of unknowns (D+1), the exact solution is defined by
a = Y^(−1) b
– In practice, however, Y will often be singular, so its inverse does not exist.
Y will commonly have more rows (examples) than columns
(unknowns), which yields an over-determined system for
which an exact solution cannot be found.
• The solution in this case is to minimize some function of
the error between the model output (Ya) and the desired output (b).
– In particular, MSE seeks to minimize the sum of squared errors
J_MSE(a) = ||Ya − b||^2 = Σ_i (a^T y(i) − b(i))^2
which, as usual, can be minimized by setting its gradient to zero.
The pseudo-inverse solution
• The gradient of the objective function is
∇J_MSE(a) = 2 Y^T (Ya − b)
– with zeros defined by Y^T Y a = Y^T b
– Notice that Y^T Y is now a square matrix!
• If Y^T Y is nonsingular, the MSE solution becomes
a = (Y^T Y)^(−1) Y^T b = Y† b
– where Y† = (Y^T Y)^(−1) Y^T is known as the pseudo-inverse of Y,
since Y†Y = I.
Note that, in general, YY† ≠ I.
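A minimal MATLAB sketch of the pseudo-inverse solution; the toy matrix Y (augmented, "normalized" samples) and the target vector b are illustrative assumptions:

% MSE / pseudo-inverse solution of the over-determined system Y*a = b
Y = [ 1  2.0  1.5;             % augmented samples, class-2 rows negated
      1  1.5  2.5;
     -1 -0.2 -0.3;
     -1 -0.5  0.1];
b = ones(4,1);                 % target values (e.g., all ones)
a = (Y'*Y) \ (Y'*b);           % solves the normal equations Y'*Y*a = Y'*b
% equivalently: a = pinv(Y)*b
res = Y*a - b;                 % residual error of the least-squares fit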
Least-mean-squares solution
• The objective function J_MSE(a) can also be minimized using a
gradient descent procedure.
– This avoids the problems that arise when Y^T Y is singular.
– In addition, it avoids the need for working with large matrices.
• Looking at the expression of the gradient, the obvious update rule is
a(k+1) = a(k) + η(k) Y^T (b − Ya(k))
– It can be shown that if η(k) = η(1)/k, where η(1) is any positive
constant, this rule generates a sequence of vectors that
converges to a solution of Y^T (Ya − b) = 0.
– The storage requirements of this algorithm can be reduced by
considering each sample sequentially:
a(k+1) = a(k) + η(k) (b(i) − y(i)^T a(k)) y(i)
– This is known as the Widrow-Hoff, least-mean-squares (LMS),
or delta rule.
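A minimal MATLAB sketch of the sequential Widrow-Hoff rule with the decaying learning rate η(k) = η(1)/k; the toy samples and targets repeat the illustrative ones used in the pseudo-inverse sketch above:

% Sequential LMS (Widrow-Hoff / delta) rule: one sample per update
Y    = [ 1  2.0  1.5;  1  1.5  2.5; -1 -0.2 -0.3; -1 -0.5  0.1];
b    = ones(4,1);
a    = zeros(3,1);             % initial weight vector
eta1 = 0.1;                    % eta(1)
k    = 1;                      % update counter, eta(k) = eta(1)/k
for epoch = 1:200
    for i = 1:size(Y,1)
        yi = Y(i,:)';          % i-th augmented sample
        a  = a + (eta1/k)*(b(i) - yi'*a)*yi;  % a(k+1) = a(k) + eta(k)(b(i) - y(i)^T a(k)) y(i)
        k  = k + 1;
    end
end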
Numerical example
Summary: perceptron vs. MSE
• Perceptron rule
– The perceptron rule always finds a solution if the classes are
linearly separable, but does not converge if the classes are non-separable.
• MSE criterion
– The MSE solution has guaranteed convergence, but it may not
find a separating hyperplane even if the classes are linearly separable.
Notice that MSE tries to minimize the sum of the squares of the
distances of the training data to the separating hyperplane,
rather than to find this hyperplane itself.
Topics covered
Linear discriminant functions
Discriminant functions and pattern recognition
Forms of the discriminant function
Discriminant functions and decision regions
Learning discriminant functions
Least squares method
Perceptron learning
Linear discriminant function experiments with MATLAB
1) νŒλ³„ν•¨μˆ˜
οƒΌ 이(binary) λΆ„λ₯˜ 문제
ν•˜λ‚˜μ˜ νŒλ³„ν•¨μˆ˜ g(x)
οƒΌ 닀쀑 클래슀 λΆ„λ₯˜ 문제
ν΄λž˜μŠ€λ³„ νŒλ³„ν•¨μˆ˜ gi(x)
2) νŒλ³„ν•¨μˆ˜μ˜ ν˜•νƒœ
1. 직선 ν˜•νƒœμ˜ μ„ ν˜• νŒλ³„ν•¨μˆ˜ (고차원 κ³΅κ°„μ˜ 경우 μ΄ˆν‰λ©΄)
n차원 μž…λ ₯ νŠΉμ§•λ²‘ν„° x = [x1, x2, …, xn]T
클래슀 Ci의 μ„ ν˜• νŒλ³„ν•¨μˆ˜
wi οƒ  μ΄ˆν‰λ©΄μ˜ 기울기λ₯Ό κ²°μ •ν•˜λŠ” 벑터 (μ΄ˆν‰λ©΄μ˜ 법선 벑터)
wi0 οƒ  x = 0인 μ›μ μ—μ„œ μ§μ„ κΉŒμ§€μ˜ 거리λ₯Ό κ²°μ •ν•˜λŠ” λ°”μ΄μ–΄μŠ€
wi , wi0 : κ²°μ •κ²½κ³„μ˜ μœ„μΉ˜λ₯Ό κ²°μ •ν•˜λŠ” νŒŒλΌλ―Έν„° (2차원: 직선 방정식)
3) νŒλ³„ν•¨μˆ˜μ˜ ν˜•νƒœ
2. 닀항식 ν˜•νƒœμ˜ νŒλ³„ν•¨μˆ˜ (보닀 일반적인 ν˜•νƒœ)
클래슀 Ci에 λŒ€ν•œ 이차 닀항식 ν˜•νƒœμ˜ νŒλ³„ν•¨μˆ˜
Wi : 벑터 x의 두 μ›μ†Œλ“€λ‘œ κ΅¬μ„±λœ 2μ°¨ν•­ λ“€μ˜ κ³„μˆ˜μ˜ ν–‰λ ¬.
n차원은 n(n+1)/2개 μ›μ†Œ
n차원 μž…λ ₯ λ°μ΄ν„°μ˜ 경우 μΆ”μ •ν•΄μ•Ό ν•  νŒŒλΌλ―Έν„°μ˜ 개수 : n(n+1)/2+n+1개
pμ°¨ 닀항식  O(np)
κ³„μ‚°λŸ‰μ˜ 증가, μΆ”μ •μ˜ 정확도 μ €ν•˜
4) νŒλ³„ν•¨μˆ˜μ˜ ν˜•νƒœ
→ μ„ ν˜• λΆ„λ₯˜κΈ° μ œμ•½μ  +닀항식 λΆ„λ₯˜κΈ°μ˜ νŒŒλΌλ―Έν„° 수의 문제 ν•΄κ²°
3. λΉ„μ„ ν˜• κΈ°μ €(basis) ν•¨μˆ˜μ˜ μ„ ν˜•ν•©μ„ μ΄μš©ν•œ νŒλ³„ν•¨μˆ˜
νŒλ³„ν•¨μˆ˜μ˜ ν˜•νƒœ κ²°μ •:
μ‚Όκ°ν•¨μˆ˜
κ°€μš°μ‹œμ•ˆ ν•¨μˆ˜
ν•˜μ΄νΌνƒ„μ  νŠΈ ν•¨μˆ˜
μ‹œκ·Έλͺ¨μ΄λ“œ ν•¨μˆ˜
…
5) νŒλ³„ν•¨μˆ˜μ˜ ν˜•νƒœ
οƒΌ νŒλ³„ν•¨μˆ˜μ™€ 그에 λ”°λ₯Έ 결정경계
(b)
(a
)
μ„ ν˜•ν•¨μˆ˜
(c
)
2μ°¨ 닀항식
μ‚Όκ°ν•¨μˆ˜
6) νŒλ³„ν•¨μˆ˜μ™€ κ²°μ •μ˜μ—­
κ²°μ •μ˜μ—­
(a)
(b)
2
1
C1
C2
1
2
C2
C1
7) νŒλ³„ν•¨μˆ˜μ™€ κ²°μ •μ˜μ—­
μž…λ ₯νŠΉμ§•μ΄ 각 ν΄λž˜μŠ€μ— μ†ν•˜λŠ” μ—¬λΆ€λ₯Ό κ²°μ •: μΌλŒ€λ‹€ 방식 (one-to-all method)
g1 ( x) ο€½ 0
g1 ο€Ό 0
3
g1 ο€Ύ 0
?
g1 or g3 ?
?
1
?
g2 ο€Ό 0
g ο€Ύ0
g 2 ( x) ο€½ 0 2
?
2
g3 ο€Ύ 0
g3 ( x) ο€½ 0
g3 ο€Ό 0
λ―Έκ²°μ •μ˜μ—­: λͺ¨λ“  νŒλ³„ν•¨μˆ˜μ—
음수
λͺ¨ν˜Έν•œ μ˜μ—­: 두 νŒλ³„ν•¨μˆ˜μ—
μ–‘μˆ˜
g1 or g2 ?
8) νŒλ³„ν•¨μˆ˜μ™€ κ²°μ •μ˜μ—­
μž„μ˜μ˜ (κ°€λŠ₯ν•œ) 두 클래슀 μŒμ— λŒ€ν•œ νŒλ³„ν•¨μˆ˜
클래슀 μˆ˜κ°€ m개
: νŒλ³„μ‹ m(m -1)/2개 ν•„μš”
g12 ( x) ο€½ 0
C1
C2
μΌλŒ€μΌ 방식
(one-to-one method)
1
C1
?
C3
2
C2
g 23 ( x) ο€½ 0
C3
3
g13 ( x) ο€½ 0
λ―Έκ²°μ •μ˜μ—­: λͺ¨λ“  νŒλ³„ν•¨μˆ˜μ—
음수
9) νŒλ³„ν•¨μˆ˜μ™€ κ²°μ •μ˜μ—­
ν΄λž˜μŠ€λ³„ νŒλ³„ν•¨μˆ˜
κ²°μ •κ·œμΉ™
νŒλ³„ν•¨μˆ˜μ˜ λ™μ‹œ ν•™μŠ΅μ΄ ν•„μš”
g 2 ( x) ο€½ 0
g 2 ο€­ g3 ο€½ 0
3
g1 ο€­ g3 ο€½ 0
g3 ( x) ο€½ 0
1
2
g1 ( x) ο€½ 0
g1 ο€­ g2 ο€½ 0
νŒλ³„ν•¨μˆ˜μ˜ ν•™μŠ΅
1) μ΅œμ†Œμ œκ³±λ²•
νŒλ³„ν•¨μˆ˜
ν•™μŠ΅ 데이터 집합
νŒŒλΌλ―Έν„° 벑터
νŒλ³„ν•¨μˆ˜ gi의 λͺ©ν‘œμΆœλ ₯κ°’
ξjκ°€ 클래슀 Ci에 μ†ν•˜λ©΄, λͺ©ν‘œμΆœλ ₯κ°’ tji = 1
νŒλ³„ν•¨μˆ˜ gi의 λͺ©μ ν•¨μˆ˜(“μ˜€μ°¨ν•¨μˆ˜”)
νŒλ³„ν•¨μˆ˜ 좜λ ₯κ°’κ³Ό λͺ©ν‘œμΆœλ ₯κ°’ 차이의 제곱
“μ΅œμ†Œμ œκ³±λ²•”
2) Least squares method
With high-dimensional inputs and a large amount of data,
the matrix computation takes a long time.
→ "sequential iterative algorithm":
repeatedly adjust the parameters so as to decrease the squared error for one data point at a time,
i.e., follow the gradient of the error function in the direction in which the error decreases.
3) Least squares method
Parameter change amount and parameter update rule (the (τ+1)-th parameter values):
w(τ+1) = w(τ) + η (t − y) x,  where η is the learning rate
→ the "Least-Mean-Square (LMS) algorithm":
find the linear function that best approximates the mapping from the inputs to the desired outputs.
For a discriminant built from a linear combination of basis functions,
use the basis-function mapping φ(x) in place of the input vector x.
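A minimal MATLAB sketch of this sequential update with a Gaussian basis-function mapping φ(x); the centres, width, toy data, and learning rate are illustrative assumptions, not taken from the lecture:

% Sequential LMS update on basis-function features: g(x) = w^T phi(x)
centres = [0 0; 2.5 2.5];                     % (illustrative) Gaussian centres
sigma   = 1.0;                                % (illustrative) Gaussian width
phi = @(x) [1; exp(-sum((x - centres(1,:)).^2)/(2*sigma^2)); ...
               exp(-sum((x - centres(2,:)).^2)/(2*sigma^2))];
X = [0.1 0.2; 0.3 -0.1; 2.4 2.6; 2.7 2.3];    % toy inputs (rows)
t = [0; 0; 1; 1];                             % desired outputs
w = zeros(3,1); eta = 0.2;                    % parameters and learning rate
for epoch = 1:100
    for j = 1:size(X,1)
        p = phi(X(j,:));                      % mapped input phi(x)
        y = w'*p;                             % current output
        w = w + eta*(t(j) - y)*p;             % w <- w + eta*(t - y)*phi(x)
    end
end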
4) νΌμ…‰νŠΈλ‘ (perceptron) ν•™μŠ΅
οƒΌ Rosenblatt
οƒΌ 인간 λ‡Œμ˜ μ •λ³΄μ²˜λ¦¬ 방식을 λͺ¨λ°©ν•˜μ—¬ λ§Œλ“  μΈκ³΅μ‹ κ²½νšŒλ‘œλ§μ˜ 기초
적인 λͺ¨λΈ
μž…λ ₯ x에 λŒ€ν•œ i번째 좜λ ₯ yi κ²°μ •
동일
μ„ ν˜• νŒλ³„ν•¨μˆ˜λ₯Ό μ΄μš©ν•œ κ²°μ •κ·œμΉ™
5) νΌμ…‰νŠΈλ‘  ν•™μŠ΅
οƒΌ M개의 νŒ¨ν„΄ λΆ„λ₯˜ 문제
• 각 ν΄λž˜μŠ€λ³„λ‘œ ν•˜λ‚˜μ˜ 좜λ ₯을 μ •μ˜ν•˜μ—¬ λͺ¨λ‘ M개의 νΌμ…‰νŠΈλ‘ 
νŒλ³„ν•¨μˆ˜λ₯Ό μ •μ˜
• ν•™μŠ΅ 데이터λ₯Ό μ΄μš©ν•˜μ—¬ μ›ν•˜λŠ” 좜λ ₯을 내도둝 νŒŒλΌλ―Έν„° μ‘°μ •
: i 번째 ν΄λž˜μŠ€μ— μ†ν•˜λŠ” μž…λ ₯ λ“€μ–΄μ˜€λ©΄, 이에 ν•΄λ‹Ήν•˜λŠ” 좜λ ₯ yi만 1, λ‚˜λ¨Έμ§€λŠ” λͺ¨λ‘ 0λ˜λ„λ‘
λͺ©ν‘œμΆœλ ₯κ°’ ti μ •μ˜
νŒŒλΌλ―Έν„° μˆ˜μ •μ‹
6) νΌμ…‰νŠΈλ‘  ν•™μŠ΅
νΌμ…‰νŠΈλ‘  ν•™μŠ΅μ— μ˜ν•œ κ²°μ •κ²½κ³„μ˜ λ³€ν™”
μ›λž˜ “·”λŠ” 클래슀 C1에 μ†ν•˜κ³  λͺ©ν‘œκ°’은 1, “o”λŠ” 클래슀 C2 μ†ν•˜κ³  λͺ©ν‘œκ°’은 0
(a)
(b)
x (1)
(c)
w (1)  x (1)
w
w (1)
( 0)
w
( 2)
x
( 2)
w ( 2)
x ( 2)
 1 2
1 2
(a)
x(1)의 λͺ©ν‘œμΆœλ ₯κ°’ tκ°€ 0μ΄μ§€λ§Œ,
ν˜„μž¬ 좜λ ₯값은 1.
→ w(1) = w(0) + η(ti - yi) x(1) = - η x(1)
(b)
이 κ²½μš°λŠ” λ°˜λŒ€.
→ w(2) = w(1) + η(ti - yi) x(2) = η x(2)
1 2
7) νΌμ…‰νŠΈλ‘  ν•™μŠ΅
λ§€νŠΈλž©μ„ μ΄μš©ν•œ νΌμ…‰νŠΈλ‘  λΆ„λ₯˜κΈ° μ‹€ν—˜
1) Data generation and training results
% Generate the training data
NumD=10;
train_c1 = randn(NumD,2);                            % class C1 samples
train_c2 = randn(NumD,2)+repmat([2.5,2.5],NumD,1);   % class C2 samples, shifted by (2.5, 2.5)
train=[train_c1; train_c2];
train_out(1:NumD,1)=zeros(NumD,1);                   % target outputs: 0 for C1
train_out(1+NumD:2*NumD,1)=zeros(NumD,1)+1;          % target outputs: 1 for C2
% Plot the training data
plot(train(1:NumD,1), train(1:NumD,2),'r*'); hold on
plot(train(1+NumD:2*NumD,1), train(1+NumD:NumD*2,2),'o');
2) νΌμ…‰νŠΈλ‘  ν•™μŠ΅ ν”„λ‘œκ·Έλž¨
νΌμ…‰νŠΈλ‘  ν•™μŠ΅
Mstep=5;
% μ΅œλŒ€ 반볡횟수 μ„€μ •
INP=2; OUT=1;
% μž…μΆœλ ₯ 차원 μ„€μ •
w=rand(INP,1)*0.4-0.2; wo=rand(1)*0.4-0.2; % νŒŒλΌλ―Έν„° μ΄ˆκΈ°ν™”
a=[-3:0.1:6];
% κ²°μ •κ²½κ³„μ˜ 좜λ ₯
plot(a, (-w(1,1)*a-wo)/w(2,1));
eta=0.5;
% ν•™μŠ΅λ₯  μ„€μ •
for j = 2:Mstep
% λͺ¨λ“  ν•™μŠ΅ 데이터에 λŒ€ν•΄ 반볡횟수만큼 ν•™μŠ΅ 반볡
for i=1:NumD*2
x=train(i,:); ry=train_out(i,1);
ν•™μŠ΅μ—
λ”°λ₯Έ 좜λ ₯
κ²°μ •κ²½κ³„μ˜
y=stepf(x*w+wo);
% νΌμ…‰νŠΈλ‘ 
계산 λ³€ν™”
e=ry-y;
% 였차 계산
E(i,1)=e;
dw= eta*e*x';
% νŒŒλΌλ―Έν„° μˆ˜μ •λŸ‰ 계산
dwo= eta*e*1;
w=w+dw;
% νŒŒλΌλ―Έν„° μˆ˜μ •
wo=wo+dwo;
end
plot(a, (-w(1,1)*a-wo)/w(2,1),'r');
% λ³€ν™”λœ 결정경계 좜λ ₯
end
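The listing above calls a helper function stepf that is not defined on the slides; a minimal sketch, assuming it is the perceptron's unit step (threshold) activation:

function y = stepf(v)
% stepf: unit step activation (assumed definition)
% returns 1 where v >= 0 and 0 otherwise
y = double(v >= 0);
end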