Ricardo Gutierrez-Osuna Wright State University Statistical Pattern

advertisement
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
1
Ricardo Gutierrez-Osuna
Wright State University
Outline
g Introduction to Pattern Recognition
g Dimensionality reduction
g Classification
g Validation
g Clustering
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
2
Ricardo Gutierrez-Osuna
Wright State University
SECTION I: Introduction
g Features, patterns and classifiers
g Components of a PR system
g An example
g Probability definitions
g Bayes Theorem
g Gaussian densities
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
3
Ricardo Gutierrez-Osuna
Wright State University
Features, patterns and classifiers
g Feature
n Feature is any distinctive aspect, quality or characteristic
g Features may be symbolic (i.e., color) or numeric (i.e., height)
n The combination of d features is represented as a d-dimensional
column vector called a feature vector
g The d-dimensional space defined by the feature vector is called
feature space
g Objects are represented as points in feature space. This
representation is called a scatter plot
Feature 2
 x1 
x 
2
x= 
 
 
x d 
x3
Class 1
Class 3
x
x1
x2
Class 2
Feature 1
Feature vector
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Feature space (3D)
Statistical Pattern Recognition
4
Scatter plot (2D)
Ricardo Gutierrez-Osuna
Wright State University
Features, patterns and classifiers
g Pattern
n Pattern is a composite of traits or features characteristic of an
individual
n In classification, a pattern is a pair of variables {x,ω} where
g x is a collection of observations or features (feature vector)
g ω is the concept behind the observation (label)
g What makes a “good” feature vector?
n The quality of a feature vector is related to its ability to
discriminate examples from different classes
g Examples from the same class should have similar feature values
g Examples from different classes have different feature values
“Good” features
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
“Bad” features
Statistical Pattern Recognition
5
Ricardo Gutierrez-Osuna
Wright State University
Features, patterns and classifiers
g More feature properties
Linear separability
Non-linear separability
Multi-modal
Highly correlated features
g Classifiers
n The goal of a classifier is to partition feature space into classlabeled decision regions
n Borders between decision regions are called decision
boundaries
R1
R1
R2
R3
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
R3
R2
Statistical Pattern Recognition
6
R4
Ricardo Gutierrez-Osuna
Wright State University
Components of a pattern recognition system
g A typical pattern recognition system contains
n
n
n
n
n
A sensor
A preprocessing mechanism
A feature extraction mechanism (manual or automated)
A classification or description algorithm
A set of examples (training set) already classified or described
Feedback / Adaptation
The
“real world”
Sensor /
transducer
Preprocessing
and
enhancement
Classification
algorithm
Class
assignment
Description
algorithm
Description
Feature
extraction
[Duda, Hart and Stork, 2001]
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
7
Ricardo Gutierrez-Osuna
Wright State University
An example
g Consider the following scenario*
n A fish processing plan wants to automate the process of sorting
incoming fish according to species (salmon or sea bass)
n The automation system consists of
g
g
g
g
g
a conveyor belt for incoming products
two conveyor belts for sorted products
a pick-and-place robotic arm
a vision system with an overhead CCD camera
a computer to analyze images and control the robot arm
CCD
camera
Conveyor
belt (salmon)
computer
Conveyor
belt
Robot
arm
Conveyor
belt (bass)
*Adapted from [Duda, Hart and Stork, 2001]
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
8
Ricardo Gutierrez-Osuna
Wright State University
An example
g Sensor
n The camera captures an image as a new fish enters the sorting area
g Preprocessing
n Adjustments for average intensity levels
n Segmentation to separate fish from background
g Feature Extraction
n Suppose we know that, on the average, sea bass is larger than salmon
g Classification
n Collect a set of examples from both species
g Plot a distribution of lengths for both classes
n Determine a decision boundary (threshold)
that minimizes the classification error
g We estimate the system’s probability of
error and obtain a discouraging result of
40%
Decision
boundary
count
Salmon
Sea bass
n What is next?
length
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
9
Ricardo Gutierrez-Osuna
Wright State University
An example
g Improving the performance of our PR system
n Committed to achieve a recognition rate of 95%, we try a number of
features
g Width, Area, Position of the eyes w.r.t. mouth...
g only to find out that these features contain no discriminatory information
n Finally we find a “good” feature: average intensity of the scales
Decision
boundary
Sea bass
Salmon
Avg. scale intensity
n We combine “length” and “average
intensity of the scales” to improve
class separability
n We compute a linear discriminant
function to separate the two classes,
and obtain a classification rate of
95.7%
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
10
length
count
Decision
boundary
Sea bass
Salmon
Avg. scale intensity
Ricardo Gutierrez-Osuna
Wright State University
An example
g Cost Versus Classification rate
n Is classification rate the best objective function for this problem?
g The cost of misclassifying salmon as sea bass is that the end
customer will occasionally find a tasty piece of salmon when he
purchases sea bass
g The cost of misclassifying sea bass as salmon is a customer upset
when he finds a piece of sea bass purchased at the price of salmon
Decision
boundary
Sea bass
New
Decision
boundary
length
length
n We could intuitively shift the decision boundary to minimize an
alternative cost function
Sea bass
Salmon
Salmon
Avg. scale intensity
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
11
Avg. scale intensity
Ricardo Gutierrez-Osuna
Wright State University
An example
g The issue of generalization
g We then design an artificial neural
network with five hidden layers, a
combination of logistic and hyperbolic
tangent activation functions, train it
with the Levenberg-Marquardt algorithm
and obtain an impressive classification
rate of 99.9975% with the following
decision boundary
length
n The recognition rate of our linear classifier (95.7%) met the
design specs, but we still think we can improve the performance
of the system
Sea bass
Salmon
Avg. scale intensity
n Satisfied with our classifier, we integrate the system and deploy it
to the fish processing plant
g A few days later the plant manager calls to complain that the system
is misclassifying an average of 25% of the fish
g What went wrong?
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
12
Ricardo Gutierrez-Osuna
Wright State University
Review of probability theory
g Probability
n Probabilities are numbers assigned to events that indicate “how likely” it
is that the event will occur when a random experiment is performed
A2
A1
A4
Probability
Law
probability
Sample space
A3
A1
A2
A3
A4
event
g Conditional Probability
n If A and B are two events, the probability of event A when we already
know that event B has occurred P[A|B] is defined by the relation
P[A | B] =
P[A I B]
for P[B] > 0
P[B]
g P[A|B] is read as the “conditional probability of A conditioned on B”,
or simply the “probability of A given B”
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
13
Ricardo Gutierrez-Osuna
Wright State University
Review of probability theory
g Conditional probability: graphical interpretation
S
A
A∩B
B
S
“B has
occurred”
A
A∩B
B
g Theorem of Total Probability
n Let B1, B2, …, BN be mutually exclusive events, then
N
P[A] = P[A | B1 ]P[B1 ] + ...P[A | BN ]P[B N ] = ∑ P[A | Bk ]P[B k ]
k =1
B1
B3
BN-1
A
B2
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
BN
Statistical Pattern Recognition
14
Ricardo Gutierrez-Osuna
Wright State University
Review of probability theory
g Bayes Theorem
n Given B1, B2, …, BN, a partition of the sample space S. Suppose
that event A occurs; what is the probability of event Bj?
n Using the definition of conditional probability and the Theorem of
total probability we obtain
P[B j | A] =
P[A I B j ]
P[A]
=
P[A | B j ] ⋅ P[B j ]
N
∑ P[A | B
k
] ⋅ P[Bk ]
k =1
n Bayes Theorem is definitely the
fundamental relationship in
Statistical Pattern Recognition
Rev. Thomas Bayes (1702-1761)
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
15
Ricardo Gutierrez-Osuna
Wright State University
Review of probability theory
g For pattern recognition, Bayes Theorem can be expressed as
P(ω j | x) =
P(x | ω j ) ⋅ P(ω j )
N
∑ P(x | ωk ) ⋅ P(ωk )
=
P(x | ω j ) ⋅ P(ω j )
P(x)
k =1
n where ωj is the ith class and x is the feature vector
g Each term in the Bayes Theorem has a special name, which you
should be familiar with
n
n
n
n
P(ωi)
P(ωi|x)
P(x|ωi)
P(x)
Prior probability (of class ωi)
Posterior Probability (of class ωi given the observation x)
Likelihood (conditional prob. of x given class ωi)
A normalization constant that does not affect the decision
g Two commonly used decision rules are
n Maximum A Posteriori (MAP): choose the class ωi with highest P(ωi|x)
n Maximum Likelihood (ML): choose the class ωi with highest P(x|ωi)
g ML and MAP are equivalent for non-informative priors (P(ωi)=constant)
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
16
Ricardo Gutierrez-Osuna
Wright State University
Review of probability theory
g Characterizing features/vectors
n Complete: Probability mass/density function
1
1
pmf
pdf
5/6
4/6
3/6
2/6
1/6
x(lb)
pdf for a person’s weight
100
200 300 400
500
x
pmf for rolling a (fair) dice
1
2
3
4
5
6
n Partial: Statistics
g Expectation
n The expectation represents the center of mass of a density
g Variance
n The variance represents the spread about the mean
g Covariance (only for random vectors)
n The tendency of each pair of features to vary together, i.e., to co-vary
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
17
Ricardo Gutierrez-Osuna
Wright State University
Review of probability theory
g The covariance matrix (cont.)
COV[X] = ∑ = E[(X −
; − T
]
 E[(x1 − 1 )(x1 − 1 )] ... E[(x1 −
...
= 
E[(xN − N )(x1 − 1 )] ... E[(xN −
O
)]  σ 12 ... c1N 

 =  ... ...

 
2
 c 1N ... σ N 
N )(x N − N )]
1
)(xN −
N
n The covariance terms can be expressed as
2
c ii = σ i and c ik = ρ ikσ iσ k
g where ρik is called the correlation coefficient
g Graphical interpretation
Xk
Xk
Xk
Xi
C ik=-σiσk
ρik=-1
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Xk
Xi
C ik=-½σiσ k
ρik=-½
Xk
Xi
C ik=0
ρik=0
Statistical Pattern Recognition
18
Xi
C ik=+½σiσk
ρik=+½
Xi
C ik=σ iσk
ρik=+1
Ricardo Gutierrez-Osuna
Wright State University
Review of probability theory
g Meet the multivariate Normal or Gaussian
density N(µ,Σ):
fX (x) =
1
(2 π )n/2 ∑
1/2
 1
exp  − (X −
 2
T
∑ −1(X −



n For a single dimension, this formula reduces to
the familiar expression
 1  X - 2 
1
f X (x) =
exp− 
 
2
σ
2π σ

 

g Gaussian distributions are very popular
n The parameters (µ,Σ) are sufficient to uniquely
characterize the normal distribution
n If the xi’s are mutually uncorrelated (cik=0), then
they are also independent
g The covariance matrix becomes diagonal, with
the individual variances in the main diagonal
n Marginal and conditional densities
n Linear transformations
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
19
Ricardo Gutierrez-Osuna
Wright State University
SECTION II: Dimensionality reduction
g The “curse of dimensionality”
g Feature extraction vs. feature selection
g Feature extraction criteria
g Principal Components Analysis (briefly)
g Linear Discriminant Analysis
g Feature Subset Selection
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
20
Ricardo Gutierrez-Osuna
Wright State University
Dimensionality reduction
g The “curse of dimensionality” [Bellman, 1961]
n Refers to the problems associated with multivariate data analysis as the
dimensionality increases
g Consider a 3-class pattern recognition problem
n A simple (Maximum Likelihood) procedure would be to
g Divide the feature space into uniform bins
g Compute the ratio of examples for each class at each bin and,
g For a new example, find its bin and choose the predominant class in that bin
n We decide to start with one feature and divide the real line into 3 bins
x1
g Notice that there exists a large overlap between classes ⇒ to improve
discrimination, we decide to incorporate a second feature
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
21
Ricardo Gutierrez-Osuna
Wright State University
Dimensionality reduction
g Moving to two dimensions increases the number of bins from 3
to 32=9
n QUESTION: Which should we maintain constant?
g The density of examples per bin? This increases the number of examples from
9 to 27
g The total number of examples? This results in a 2D scatter plot that is very
sparse
x2
Constant density
x2
Constant # examples
x2
x1
x1
g Moving to three features …
n The number of bins grows to 33=27
n To maintain the initial density of examples,
the number of required examples grows to 81
n For the same number of examples the
3D scatter plot is almost empty
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
22
x1
x3
Ricardo Gutierrez-Osuna
Wright State University
Dimensionality reduction
g Implications of the curse of dimensionality
n Exponential growth with dimensionality in the number of examples
required to accurately estimate a function
g How do we beat the curse of dimensionality?
n By incorporating prior knowledge
n By providing increasing smoothness of the target function
n By reducing the dimensionality
g In practice, the curse of dimensionality means that
g In most cases, the information
that was lost by discarding some
features is compensated by a
more accurate mapping in lowerdimensional space
performance
n For a given sample size, there is a maximum number of features above
which the performance of our classifier will degrade rather than improve
dimensionality
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
23
Ricardo Gutierrez-Osuna
Wright State University
The curse of dimensionality
g Implications of the curse of dimensionality
n Exponential growth in the number of examples required to maintain a
given sampling density
g For a density of N examples/bin and D dimensions, the total number of
examples grows as ND
n Exponential growth in the complexity of the target function (a density)
with increasing dimensionality
g “A function defined in high-dimensional space is likely to be much more
complex than a function defined in a lower-dimensional space, and those
complications are harder to discern” –Jerome Friedman
n This means that a more complex target function requires denser sample points to
learn it well!
n What to do if it ain’t Gaussian?
g For one dimension, a large number of density functions are available
g For most multivariate problems, only the Gaussian density is employed
g For high-dimensional data, the Gaussian density can only be handled in a
simplified form!
n Human performance in high dimensions
g Humans have an extraordinary capacity to discern patterns in 1-3
dimensions, but these capabilities degrade drastically for 4 or higher
dimensions
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
24
Ricardo Gutierrez-Osuna
Wright State University
Dimensionality reduction
g Two approaches to perform dim. reduction ℜN→ℜM (M<N)
n Feature selection: choosing a subset of all the features
[x1 x 2...x N ] →[xi
feature
selection
1
x i2 ...x iM
]
n Feature extraction: creating new features by combining existing ones
[x1 x 2...x N ] extraction
 →[y1 y 2 ...y M ] = f ([x i
feature
1
x i2 ...x iM
])
g In either case, the goal is to find a low-dimensional representation of the
data that preserves (most of) the information or structure in the data
g Linear feature extraction
n The “optimal” mapping y=f(x) is, in general, a non-linear function whose
form is problem-dependent
g Hence, feature extraction is commonly limited to linear projections y=Wx
 x1 
 y1   w 11
x 
y   w
 2
linear feature extraction
     →  2  =  21
  
 
  
 
 yM   w M1
 xN 
M
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
M
w 12
w 22
M
w M2
Statistical Pattern Recognition
25
L
L
O
 x1 
w 1N   
x2
w 2N   
 
 

w MN   
 xN 
M
M
Ricardo Gutierrez-Osuna
Wright State University
Signal representation versus classification
g Two criteria can be used to find the “optimal” feature extraction
mapping y=f(x)
n Signal representation: The goal of feature extraction is to represent the
samples accurately in a lower-dimensional space
n Classification: The goal of feature extraction is to enhance the classdiscriminatory information in the lower-dimensional space
g Based on signal representation
n Fisher’s Linear Discriminant (LDA)
g Based on classification
a
gn
Si
n
io
at
nt
se
re )
ep A
l r (PC
2
2 2
22 2 2
111
22 2
1
2 22
1 1 11
2 2
1 1
2
2 2
1 111
111 1 2
11
1
C
la
ss
(L ific
D at
A) i o
n
n Principal Components (PCA)
Feature 2
g Within the realm of linear feature extraction, two techniques are
commonly used
Feature 1
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
26
Ricardo Gutierrez-Osuna
Wright State University
Principal Components Analysis
g Summary
n The optimal* approximation of a random vector x∈ℜN by a linear
combination of M (M<N) independent vectors is obtained by projecting
the random vector x onto the eigenvectors ϕi corresponding to the
largest eigenvalues λi of the covariance matrix Σx
g *optimality is defined as the minimum of the sum-square magnitude of the
approximation error
n Since PCA uses the eigenvectors of the covariance matrix Σx, it is able
to find the independent axes of the data under the unimodal Gaussian
assumption
g However, for non-Gaussian or multi-modal Gaussian data, PCA simply decorrelates the axes
n The main limitation of PCA is that it does not consider class separability
since it does not take into account the class label of the feature vector
g PCA simply performs a coordinate rotation that aligns the transformed axes
with the directions of maximum variance
g There is no guarantee that the directions of maximum variance will
contain good features for discrimination!!
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
27
Ricardo Gutierrez-Osuna
Wright State University
PCA example
g 3D Gaussian distribution
12
n Mean and covariance are
= [0 5 2]
T
and
10
7
 25 − 1

=  −1
4 − 4
 7 − 4 10
8
6
4
3
x
g RESULTS
2
0
-2
n First PC has largest variance
n PCA projections are de-correlated
-4
-6
-8
10
5
-5
-10
0
0
x1
x2
12
15
10
10
5
10
10
5
8
6
2
y
0
3
y
5
4
2
0
-5
0
-5
-10
-15
-10
-5
0
y1
5
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
10
15
-2
-15
-10
-5
0
y1
5
Statistical Pattern Recognition
28
10
15
-10
-8
-6
-4
-2
0
y2
2
Ricardo Gutierrez-Osuna
Wright State University
4
6
8
10
Linear Discriminant Analysis, two-classes
g The objective of LDA is to perform dimensionality reduction
while preserving as much of the class discriminatory
information as possible
n Assume a set of D-dimensional samples {x1, x2, …, xN}, N1 of which
belong to class ω1, and N2 to class ω2
n We seek to obtain a scalar y by projecting the samples x onto a line
y=wTx
n Of all possible lines we want the one that maximizes the separability of
the scalars
x2
x2
x1
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
29
x1
Ricardo Gutierrez-Osuna
Wright State University
Linear Discriminant Analysis, two-classes
g In order to find a good projection vector, we need to define a
measure of separation between the projections
n We could choose the distance between the projected means
J( w ) = ~1 − ~2 = w T (
1
−
2
)
g Where the mean values of the x and y examples are
i
=
1
Ni
∑x
x∈
and
i
~ = 1
i
Ni
1
∑y = N ∑w
y∈
i y∈
i
T
x = wT
i
i
n However, the distance between projected means is not a very good
measure since it does not take into account the standard deviation within
the classes
x2
µ1
This axis yields better class separability
µ2
x1
This axis has a larger distance between means
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
30
Ricardo Gutierrez-Osuna
Wright State University
Linear Discriminant Analysis, two-classes
n The solution proposed by Fisher is to normalize the difference
between the means by a measure of the within-class variance
g For each class we define the scatter, an equivalent of the variance, as
~
si2 =
∑ (y − ~ )
2
i
y∈
i
g and the quantity (~
s12 + ~
s22 ) is called the within-class scatter of the
projected examples
n The Fisher linear discriminant is defined as the linear function wTx that
maximizes the criterion function
~ −~ 2
1
2
J(w) = ~ 2 ~ 2
s +s
1
2
n In a nutshell: we look for a projection
where examples from the same class
are projected very close to each other
and, at the same time, the projected
means are as farther apart as possible
x2
µ1
µ2
x1
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
31
Ricardo Gutierrez-Osuna
Wright State University
Linear Discriminant Analysis, two-classes
n Noticing that
(~1 − ~2 )2 = (w T
1
− wT
)
2
2
T
T
)(
1 − 2 ) w = w SB w
144424443
= wT ( 1 −
2
BETWEEN - CLASS
SCATTER SB
~
si2 =
∑ (y − ~ ) = ∑ (w
2
i
y∈
i
x∈
T
x − wT
)
2
i
= w T ∑ (x −
i
14243
~
s12 + ~
s22 = w T (S1 + S 2 ) w = w T S W w
)(x − i )T w = w TSi w
144424443
x∈
i
i
CLASS SCATTER S i
WITHIN -CLASS
SCATTER S W
n We can express J(w) in terms of x and w as
w T SB w
J( w ) = T
w SW w
n It can be shown that the projection vector w* which maximizes J(w) is
 w T SB w 
−1
w* = argmax  T
 = SW ( 1 −
w
 w SW w 
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
32
2
)
Ricardo Gutierrez-Osuna
Wright State University
Linear Discriminant Analysis, C-classes
g Fisher’s LDA generalizes for C-class problems very gracefully
n For the C class problem we will seek (C-1) projection vectors wi, which can be
arranged by columns into a projection matrix W=[w1|w2|…|wC-1]
g The solution uses a slightly different formulation
n The within-class scatter matrix is similar as the two-class problem
C
C
S W = ∑ S i = ∑ ∑ (x −
i=1
i =1 x∈
i
)(x − i )T
i
n The generalization of the between-class scatter matrix is
C
SB = ∑ Ni ( i −
i=1
)(
i
−
)T
g Where µ is the mean of the entire dataset
n The objective function becomes
~
SB
W T SB W
J( W ) = ~ =
W TS W W
SW
g Recall that we are looking for a projection that, in some sense, maximizes
the ratio of between-class to within-class scatter:
n Since the projection is not scalar (it has C-1 dimensions), we have used the
determinant of the scatter matrices to obtain a scalar criterion function
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
33
Ricardo Gutierrez-Osuna
Wright State University
Linear Discriminant Analysis, C-classes
n It can be shown that the optimal projection matrix W* is the one whose
columns are the eigenvectors corresponding to the largest eigenvalues
of the following generalized eigenvalue problem
[
W* = w | w |
*
1
*
2
L| w ]
*
C −1
 W T SB W 
*
= argmax  T
 ⇒ (SB − iS W )w i = 0
 W SW W 
g NOTE
n SB is the sum of C matrices of rank one or less and the mean vectors
are constrained by ΣNiµi=Nµ
g Therefore, SB will be at most of rank (C-1)
g This means that only (C-1) of the eigenvalues λi will be non-zero
n Therefore, LDA produces at most C-1 feature projections
g If the classification error estimates establish that more features are needed,
some other method must be employed to provide those additional features
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
34
Ricardo Gutierrez-Osuna
Wright State University
PCA Versus LDA
1
77
axis 2
44
2
29
2 58
93
7
6
8
46
2
1
5
2
9
10710410 10 4
98
1 511
664
3
2 7
7
4
9
9
8
2 1 58 9
710 4
6
89
10
10
77710
7 6 6 2 26 55255899
3285933
10
7
10
7 7 10
4
6
4
1
9
10
10
9 3
10
6 2
0
4 4
7
10
4 46 6 16 22125 19858 9 8 8
8
6 6 9 1 21 95
538 5 3
7
-2
46 2 61 21 2
9 85
10
4
2
5
6 6 5
10 7 4
15 31 2 8 383
1
-4
9 3 335
10
15 1 3
7
3
-6
83
3
2
4
PCA
10
4
10
6
5
axis 3
4
6
0
-5
-10
-8
-5
2
-10
-10
6
-5
7
7 7 10
74 77
4
44
7
1010
7 27 10
41010
4 2
9
7
7 10 4 6
28
10
46 6 109
7
7
7 10107 7
4
10
2
5
7
41
12
5528 22 85
10
6 692 9118
10
6
44
2564
5
10
10
1
2825 51
9 9 29
74410
10
466
5
664 61566229
124
1939
1
3988 93
589 538
1
6
9
6 5 85313 5 8
1769
10
62122985
9 9
8 89
51 2
134 6
2
1
5
8 2
8
8 1 3313 8 3
5
3
3
53 83 5
5
3
3
3
0
0
8
-5
5
0
axis 1
5
10
10
-10
axis 2
axis 1
-3
x 10
7
7 77
7 7777
7
7777
77
7 7
15
5
0
axis 3
LDA
axis 2
10
-3
x 10
5
95
999 995555 9
9
59555
99
55555
8 9
9899
8
98
8
5555
888
8
8
5
0 88
338
8
38
33338
8
333
333333
3
-5 3
-0.02
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
2 22
22222
22
2222
2
2
11
11
1
11
12 1
2
11111
11
11
-0.01
-5
-10
-15
4 444
4444444 4
44
4 44
4
6 66 6
666666666
66
6
6 66
6
0
0.01
axis 1
0.02
0.03
9
99
9999
99
9
9
55
55
5252
9
5
9
9
8888
8
5
5
2
9
8
5
2
58
9 55
2
52
2522
5
22
8888
81212522
8888822
11111
11
13111
1
111 6
6
333
6666
666
333
666
6
3
3
3
6666
3
6
33
3
6
3 33
3
7 77 7
7
7 777
7777 7 7
4
444 44
44444444
44 4
4
10
10
10
10
10
10
10
10
10
10
10
1010
10
10
10
15
-0.02
10
10
10
10
10
10
10
10
10
10
1010
10
10
10
10
10
10
0.04
0.05
10
0
5
0.02
0.04
axis 1
Statistical Pattern Recognition
35
0
-5
axis 2
Ricardo Gutierrez-Osuna
Wright State University
-3
x 10
Limitations of LDA
g LDA assumes unimodal Gaussian likelihoods
n If the densities are significantly non-Gaussian, LDA may not preserve
any complex structure of the data needed for classification
µ1=µ2=µ
ω1
ω2
ω2
µ1
ω1
µ1=µ2=µ
ω1
ω2
ω1
µ2
g LDA will fail when the discriminatory
information is not in the mean but
rather in the variance of the data
x2
PC
A
LD
A
ω2
x1
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
36
Ricardo Gutierrez-Osuna
Wright State University
Limitations of LDA
g LDA has a tendency to overfit training data
n To illustrate this problem, we generate an artificial dataset
g Three classes, 50 examples per class, with the exact same likelihood: a
multivariate Gaussian with zero mean and identity covariance
g As we arbitrarily increase the number of dimensions, classes appear to
separate better, even though they come from the same distribution
10 dimensions
50 dimensions
3
2 2 2 2 22 1
1
2 12 221 2 3 2
3
1
11 22
3 23
3
2
1 3 313 11 2
31
2
22322 3
1 1 2213 133
1
1
3
3
1
2 1 331
2 23 32
3 213 3 23 13 2
1 1 1311211 1
3
2 22 3
1
2 2
2
31 2
1 2213 1132111 3223323113
2
33 1 3 1
3
3
21 1 1333 2121
3 3
33
1
3
1
1 1
1 1 2 121
1 1
2
1 1
1
1
1
1
3 1 2 1 11
1
1
1
11
2
2
2
1 1 1 2 21
1
3 323
1 31
32
133
2221 2
3 1 11
1
22
3
1 33 3331 1 2 122
22 1 31 2
22 2 2
3 3 3 33 313 2
2
3
3
1
2 222 2 2 2
3 3 3133
122 2 2 2 2
3
3
2
1
2
2 3 33
2
3
3
1
3
3
33 3 32
3 1
3 2 3
2
100 dimensions
150 dimensions
1
1 1 11
1 1
1 111
1
1 1 1 11
3 3
1 111 111 111 1 1
1
3
333 3 33
111 1 1111 1 1
3 3333
1
33 33 3 1
11
3
3
3333 33 33 2
3 33
2
3
2
2
33
12
333 2212 2 2
2 22 2 2
3 3 3 2 2 22
222
3 3
22 22 222222222 2
2 2 2
2 2222 2 22
2
3
3
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
3
22
222
222222222
2
2
2
2
222
22
222222
2222
2
2
2 2
33 33333
333
333333
3
33
3333
3333333
3333333
3
3 33 3
1111111
111111
1
11
1111111
111
11 1111
11
1
1 1 1
Statistical Pattern Recognition
37
Ricardo Gutierrez-Osuna
Wright State University
Feature Subset Selection
g Why Feature Subset Selection?
n Feature Subset Selection is necessary in a number of situations
g Features may be expensive to obtain
n You evaluate a large number of features (sensors) in the test bed and select only
a few for the final implementation
g You may want to extract meaningful rules from your classifier
n When you transform or project, the measurement units (length, weight, etc.) of
your features are lost
g Features may not be numeric
n A typical situation in the machine learning domain
g Although FSS can be thought of as a special case of feature
extraction (think of an identity projection matrix with a few
diagonal zeros), in practice it is a quite different problem
n FSS looks at dimensionality reduction from a different perspective
n FSS has a unique set of methodologies
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
38
Ricardo Gutierrez-Osuna
Wright State University
Search strategy and objective function
g Feature Subset Selection requires
n A search strategy to select candidate subsets
n An objective function to evaluate candidates
g Search Strategy
N
 
 M
n Exhaustive evaluation involves
feature
subsets for a fixed value of M, and 2N subsets
if M must be optimized as well
g This number of combinations is unfeasible, even
for moderate values of M and N
n For example, exhaustive evaluation of 10 out of
20 features involves 184,756 feature subsets;
exhaustive evaluation of 10 out of 20 involves
more than 1013 feature subsets
n A search strategy is needed to explore the
space of all possible feature combinations
g Objective Function
n The objective function evaluates candidate
subsets and returns a measure of their
“goodness”, a feedback that is used by the
search strategy to select new candidates
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
39
Training data
Complete feature set
Feature Subset Selection
Search
Feature
subset
“Goodness”
Objective
function
Final feature subset
PR
algorithm
Ricardo Gutierrez-Osuna
Wright State University
Objective function
g Objective functions are divided in two groups
n Filters: The objective function evaluates feature subsets by their
information content, typically interclass distance, statistical dependence
or information-theoretic measures
n Wrappers: The objective function is a pattern classifier, which evaluates
feature subsets by their predictive accuracy (recognition rate on test
data) by statistical resampling or cross-validation
Training data
Complete feature set
Training data
Complete feature set
Filter FSS
Wrapper FSS
Search
Search
Feature
subset
Information
content
Feature
subset
Objective
function
PR
algorithm
Final feature subset
Final feature subset
ML
algorithm
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Predictive
accuracy
PR
algorithm
Statistical Pattern Recognition
40
Ricardo Gutierrez-Osuna
Wright State University
Filter types
g Distance or separability measures
n These methods use distance metrics to measure class separability:
g Distance between classes: Euclidean, Mahalanobis, etc.
g Determinant of SW-1SB (LDA eigenvalues)
g Correlation and information-theoretic measures
n These methods assume that “good” subsets contain features highly
correlated with (predictive of) the class, yet uncorrelated with each other
g Linear relation measures
M
n Linear relationship between variables can be
measured using the correlation coefficient J(YM)
g where ρ is a correlation coefficient
J(YM ) =
∑
ic
i=1
M M
∑∑
ij
i=1 j=i+1
g Non-Linear relation measures
n Correlation is only capable of measuring linear dependence. A more powerful
method is the mutual information I(Yk;C)
C
J(YM ) = I(YM; C) = H(C ) − H(C | YM ) = ∑ ∫ P(YM,
c =1 YM
c
) lg
P(YM, c )
dx
P(YM ) P( c )
g The mutual information between feature vector and class label I(YM;C)
measures the amount by which the uncertainty in the class H(C) is
decreased by knowledge of the feature vector H(C|YM), where H(·) is the
entropy function
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
41
Ricardo Gutierrez-Osuna
Wright State University
Filters Vs. Wrappers
g Wrappers
n Advantages
g Accuracy: wrappers generally achieve better recognition rates than filters since they
are tuned to the specific interactions between the classifier and the dataset
g Ability to generalize: wrappers have a mechanism to avoid overfitting, since they
typically use cross-validation measures of predictive accuracy
n Disadvantages
g Slow execution: since the wrapper must train a classifier for each feature subset (or
several classifiers if cross-validation is used), the method is unfeasible for
computationally intensive classifiers
g Lack of generality: the solution lacks generality since it is tied to the bias of the
classifier used in the evaluation function
g Filters
n Advantages
g Fast execution: Filters generally involve a non-iterative computation on the dataset,
which can execute much faster than a classifier training session
g Generality: Since filters evaluate the intrinsic properties of the data, rather than their
interactions with a particular classifier, their results exhibit more generality: the solution
will be “good” for a larger family of classifiers
n Disadvantages
g Tendency to select large subsets: Since the filter objective functions are generally
monotonic, the filter tends to select the full feature set as the optimal solution. This
forces the user to select an arbitrary cutoff on the number of features to be selected
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
42
Ricardo Gutierrez-Osuna
Wright State University
Naïve sequential feature selection
g One may be tempted to evaluate each
individual feature separately and select those
M features with the highest scores
x2
ωω45
ω3
n Unfortunately, this strategy will very rarely work
since it does not account for feature dependence
g An example will help illustrate the poor
performance of this naïve approach
n The scatter plots show a 4-dimensional pattern
recognition problem with 5 classes
ω2
ω1
x1
x4
g The objective is to select the best subset of 2
features using the naïve sequential FSS procedure
n A reasonable objective function will generate the
following feature ranking: J(x1)>J(x2)≈J(x3)>J(x4)
g The optimal feature subset turns out to be {x1, x4},
because x4 provides the only information that x1
needs: discrimination between classes ω4 and ω5
g If we were to choose features according to the
individual scores J(xk), we would choose x1 and
either x2 or x3, leaving classes ω4 and ω5 non
separable
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
43
ω5
ω1
ω3
ω2
ω4
x3
Ricardo Gutierrez-Osuna
Wright State University
Search strategies
g FSS search strategies can be grouped in three categories*
n Sequential
g These algorithms add or remove features sequentially, but have a tendency
to become trapped in local minima
n Sequential Forward Selection
n Sequential Backward Selection
n Sequential Floating Selection
n Exponential
g These algorithms evaluate a number of subsets that grows exponentially
with the dimensionality of the search space
n Exhaustive Search (already discussed)
n Branch and Bound
n Beam Search
n Randomized
g These algorithms incorporating randomness into their search procedure to
escape local minima
n Simulated Annealing
n Genetic Algorithms
*The subsequent description of FSS is borrowed from [Doak, 1992]
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
44
Ricardo Gutierrez-Osuna
Wright State University
Sequential Forward Selection (SFS)
g Sequential Forward Selection is a simple greedy search
1.
1. Start
Start with
with the
the empty
empty set
set Y={∅}
Y={∅}
+
2.
2. Select
Select the
the next
next best
best feature
feature x = arg max [J(Yk + x )]
x∈X − Yk
=Y +x; k=k+1
3.
3. Update
Update Y
Yk+1
k+1=Ykk+x; k=k+1
4.
4. Go
Go to
to 22
g Notes
Empty feature set
n SFS performs best when the optimal subset has a
small number of features
g When the search is near the empty set, a large number
of states can be potentially evaluated
g Towards the full set, the region examined by SFS is
narrower since most of the features have already been
selected
g The main disadvantage of SFS is that it is unable to
remove features that become obsolete with the addition
of new features
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
45
Full feature set
Ricardo Gutierrez-Osuna
Wright State University
Sequential Backward Selection (SBS)
g Sequential Backward Selection works in the opposite manner as
SFS
1.
1. Start
Start with
with the
the full
full set
set Y=X
Y=X
−
2.
2. Remove
Remove the
the worst
worst feature
feature x = arg max [J(Yk − x )]
x∈Yk
=Y -x; k=k+1
3.
3. Update
Update Y
Yk+1
k+1=Ykk-x; k=k+1
4.
4. Go
Go to
to 22
Empty feature set
g Notes
n SBS works best when the optimal feature subset
has a large number of features, since SBS
spends most of its time visiting large subsets
n The main limitation of SBS is its inability to
reevaluate the usefulness of a feature after it has
been discarded
Full feature set
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
46
Ricardo Gutierrez-Osuna
Wright State University
Branch and Bound
g The Branch and Bound algorithm is
guaranteed to find the optimal feature subset
under the monotonicity assumption
Empty feature set
n The monotonicity assumption states that the
addition of features can only increase the value of
the objective function, this is
( ) (
) (
)
J x i1 < J x i1 , x i2 < J x i1 , x i2 , x i3 <
L < J(x , x ,L, x
i1
i2
iN
)
n Branch and Bound starts from the full set and
removes features using a depth-first strategy
g Nodes whose objective function are lower than the
current best are not explored since the monotonicity
assumption ensures that their children will not
contain a better solution
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
47
Full feature set
Ricardo Gutierrez-Osuna
Wright State University
Branch and Bound
g Algorithm
n The algorithm is better explained by considering the subsets of M’=N-M
features already discarded, where N is the dimensionality of the state
space and M is the desired number of features
n Since the order of the features is irrelevant, we will only consider an
increasing ordering i1<i2<...iM’ of the feature indices, this will avoid
exploring states that differ only in the ordering of their features
n The Branch and Bound tree for N=6 and M=2 is shown below (numbers
indicate features that are being removed)
g Notice that at the level directly below the root we only consider removing
features 1, 2 or 3, since a higher number would not allow sequences
(i1<i2<i3<i4) with four indices
1
2
2
4
3
4
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
5
3
6
5
5
6
6
4
5
6
4
5
5
6
6
Statistical Pattern Recognition
48
3
4
5
6
3
4
4
5
5
5
6
6
6
Ricardo Gutierrez-Osuna
Wright State University
Branch and Bound
1.
1.
2.
2.
3.
3.
Initialize:
Initialize: αα==-∞
∞,, k=0
k=0
Generate
successors
Generate successors of
of the
the current
current node
node and
and store
store them
them in
in LIST(k)
LIST(k)
Select
Select new
new node
node
ifif LIST(k)
LIST(k) is
is empty
empty
go
go to
to Step
Step 55
else
else
[(
ik = arg max J x i1 , x i2 ,...x ik −1 , j
j ∈LIST ( k )
)]
delete ik from LIST(k )
4.
4. Check
Check bound
bound
ifif J x i1 , xi2 ,...x ik <
go
go to
to 55
else
else ifif k=M’
k=M’ (we
(we have
have the
the desired
desired number
number of
of features)
features)
go
go to
to 66
else
else
k=k+1
k=k+1
go
go to
to 22
5.
5. Backtrack
Backtrack to
to lower
lower level
level
set
set k=k-1
k=k-1
ifif k=0
k=0
terminate
terminate algorithm
algorithm
else
else
go
go to
to 33
6.
6. Last
Last level
level
Set = J x i1 , x i2 ,...x ik −1 , j and YM* ’ = x i1 , x i2 ,...x ik
go
go to
to 55
(
)
(
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
)
{
Statistical Pattern Recognition
49
}
Ricardo Gutierrez-Osuna
Wright State University
Beam Search
g Beam Search is a variation of best-first
search with a bounded queue to limit the
scope of the search
n The queue organizes states from best to worst,
with the best states placed at the head of the
queue
n At every iteration, BS evaluates all possible
states that result from adding a feature to the
feature subset, and the results are inserted into
the queue in their proper locations
Empty feature set
Full feature set
g It is trivial to notice that BS degenerates to
Exhaustive search if there is no limit on the size of
the queue. Similarly, if the queue size is set to one,
BS is equivalent to Sequential Forward Selection
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
50
Ricardo Gutierrez-Osuna
Wright State University
Beam Search
g This example illustrates BS for a 4D space and a queue of size 3
n BS cannot guarantee that the optimal subset is found:
g in the example, the optimal is 2-3-4(9), which is never explored
g however, with the proper queue size, Beam Search can avoid local minimal by
preserving solutions from varying regions in the search space
root
1(5)
2(6)
3(7)
3(5)
4(3)
4(2)
4(5)
2(3)
3(8)
3(4)
4(6)
4(1)
LIST={∅}
4(1)
LIST={1(5), 3(4), 2(3)}
4(1)
LIST={1-2(6), 1-3(5), 3(4)}
4(1)
LIST={1-2-3(7), 1-3(5), 3(4)}
4(5)
4(9)
4(5)
root
1(5)
2(6)
3(7)
3(5)
4(3)
4(2)
4(5)
2(3)
3(8)
3(4)
4(6)
4(5)
4(9)
4(5)
root
1(5)
2(6)
3(7)
3(5)
4(3)
4(2)
4(5)
2(3)
3(8)
3(4)
4(6)
4(5)
4(9)
4(5)
root
1(5)
2(6)
3(7)
3(5)
4(3)
4(5)
4(2)
2(3)
3(8)
3(4)
4(6)
4(5)
4(9)
4(5)
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
51
Ricardo Gutierrez-Osuna
Wright State University
Genetic Algorithms
g Genetic algorithms are optimization techniques that mimic the
evolutionary process of “survival of the fittest”
n Starting with an initial random population of solutions, evolve new
populations by mating (crossover) pairs of solutions and mutating
solutions according to their fitness (an objective function)
n The better solutions are more likely to be selected for the mating and
mutation operators and, therefore, carry their “genetic code” from
generation to generation
Empty feature set
n In FSS, individual solutions are represented with a
binary number (1 if the feature is selected, 0 otherwise)
1.
1.
2.
2.
3.
3.
Create
Create an
an initial
initial random
random population
population
Evaluate
Evaluate initial
initial population
population
Repeat
Repeat until
until convergence
convergence (or
(or aa number
number of
of generations)
generations)
3a.
3a. Select
Select the
the fittest
fittest individuals
individuals in
in the
the population
population
3b.
3b. Perform
Perform crossover
crossover on
on selected
selected individuals
individuals to
to create
create offspring
offspring
3c.
Perform
mutation
on
selected
individuals
3c. Perform mutation on selected individuals
3d.
3d. Create
Create new
new population
population from
from old
old population
population and
and offspring
offspring
3e.
3e. Evaluate
Evaluate the
the new
new population
population
Full feature set
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
52
Ricardo Gutierrez-Osuna
Wright State University
Genetic operators
g Single-point crossover
n Select two individuals (parents) according to their fitness
n Select a crossover point
n With probability Pc (0.95 is reasonable) create two offspring by
combining the parents
Crossover point selected randomly
Parenti
01001010110
Parentj
11010110000
11011010110
Offspringi
01000110000
Offspringj
Crossover
g Binary mutation
n Select an individual according to its fitness
n With probability PM (0.01 is reasonable) mutate each one of its bits
Mutated bits
Individual
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
11010110000
Mutation
Statistical Pattern Recognition
53
11001010111
Offspringi
Ricardo Gutierrez-Osuna
Wright State University
Selection methods
g The selection of individuals is based on their fitness
g We will describe a selection method called Geometric selection
n Several methods are available: Roulette Wheel, Tournament Selection…
g Geometric selection
g q is the probability of selecting the best
individual (0.05 is a reasonable value)
n The geometric distribution assigns
higher probability to individuals ranked
better, but also allows unfit individuals
to be selected
g In addition, it is typical to carry the
best individual of each population to
the next one
0.08
0.07
Selection probability
n The probability of selecting the rth best
individual is given by the geometric
r -1
probability mass function P(r ) = q(1 - q)
0.06
q=0.08
0.05
q=0.04
0.04
q=0.02
0.03
q=0.01
0.02
0.01
0
0
10
20
30
40
50
60
70
Individual rank
n This is called the Elitist Model
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
54
Ricardo Gutierrez-Osuna
Wright State University
80
90
100
GAs, parameter choices for Feature selection
g The choice of crossover rate PC is not critical
n You will want a value close to 1.0 to have a large number of offspring
g The choice of mutation rate PM is very critical
n An optimal choice of PM will allow the GA to explore the more promising
regions while avoiding getting trapped in local minima
g A large value (i.e., PM>0.25) will not allow the search to focus on the better
regions, and the GA will perform like random search
g A small value (i.e., close to 0.0) will not allow the search to escape local
minima
g The choice of ‘q’, the probability of selecting the best individual
is also critical
n An optimal value of ‘q’ will allow the GA to explore the most promising
solution, and at the same time provide sufficient diversity to avoid early
convergence of the algorithm
g In general, poorly selected control parameters will result in suboptimal solutions due to early convergence
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
55
Ricardo Gutierrez-Osuna
Wright State University
Search Strategies, summary
Exhaustive
Sequential
Randomized
Accuracy
Complexity
Advantages
Disadvantages
Always finds the
optimal solution
Exponential
High accuracy
High complexity
Quadratic O(NEX )
Simple and fast
Cannot
backtrack
Generally low
Designed to
escape local
minima
Difficult to
choose good
parameters
Good if no
backtracking
needed
Good with proper
control
parameters
2
g A highly recommended review of this material
Justin Doak
“An evaluation of feature selection methods and their application to Computer Security”
University of California at Davis, Tech Report CSE-92-18
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
56
Ricardo Gutierrez-Osuna
Wright State University
SECTION III: Classification
g Discriminant functions
g The optimal Bayes classifier
g Quadratic classifiers
g Euclidean and Mahalanobis metrics
g K Nearest Neighbor Classifiers
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
57
Ricardo Gutierrez-Osuna
Wright State University
Discriminant functions
g A convenient way to represent a pattern classifier is in terms of
a family of discriminant functions gi(x) with a simple MAX gate
as the classification rule
Class assignment
Select
Select max
max
Costs
Costs
gg1(x)
1(x)
xx1
1
gg2(x)
2(x)
ggC(x)
C(x)
xx3
3
xxd
d
xx2
2
Assign x to class
i
Discriminant functions
Features
if gi (x) > g (x) ∀j ≠ i
j
g How do we choose the discriminant functions gi(x)
n Depends on the objective function to minimize
g Probability of error
g Bayes Risk
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
58
Ricardo Gutierrez-Osuna
Wright State University
Minimizing probability of error
g Probability of error P[error|x] is “the probability of assigning x
to the wrong class”
n For a two-class problem, P[error|x] is simply
P(
P(error | x) = 
P(
1
| x) if we decide
2
2
| x) if we decide
1
g It makes sense that the classification rule be designed to minimize the
average probability of error P[error] across all possible values of x
+∞
+∞
−∞
−∞
P(error) = ∫ P(error, x)dx = ∫ P(error | x)P(x)dx
g To ensure P(error) is minimum we minimize P(error|x) by choosing the
class with maximum posterior P(ωi|x) at each x
n This is called the MAXIMUM A POSTERIORI (MAP) RULE
g And the associated discriminant functions become
gMAP
(x) = P(
i
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
i
| x)
Statistical Pattern Recognition
59
Ricardo Gutierrez-Osuna
Wright State University
g We “prove” the optimality of
the MAP rule graphically
n The right plot shows the
posterior for each of the two
classes
n The bottom plots shows the
P(error) for the MAP rule and
another rule
n Which one has lower P(error)
(color-filled area) ?
P(wi|x)
Minimizing probability of error
x
THE MAP RULE
Choose
RED
Choose
BLUE
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
THE “OTHER” RULE
Choose
RED
Choose
RED
Statistical Pattern Recognition
60
Choose
BLUE
Choose
RED
Ricardo Gutierrez-Osuna
Wright State University
Quadratic classifiers
g Let us assume that the likelihood densities are Gaussian
P(x |
i
)=
1
n/2
(2
∑i
1/2
 1

exp − (x − i )T ∑ i−1(x − i )

 2
g Using Bayes rule, the MAP discriminant functions become
gi (x) = P(
i
| x) =
P(x |
)P(
P(x)
i
i
)
=
1
(2
n/2
∑i
1/2
 1

exp − (x − i )T ∑ i−1(x − i )P(
 2

i
)
1
P(x)
n Eliminating constant terms
gi (x) = ∑ i
-1/2
 1

exp − (x − i )T ∑ i−1(x − i )P(
 2

i
)
n We take natural logs (the logarithm is monotonically increasing)
1
1
gi (x) = − (x − i )T ∑ i−1(x − i ) - log( ∑ i ) + log(P(
2
2
i
))
g This is known as a Quadratic Discriminant Function
g The quadratic term is know as the Mahalanobis distance
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
61
Ricardo Gutierrez-Osuna
Wright State University
Mahalanobis distance
g The Mahalanobis distance can be thought of vector distance
that uses a ∑i-1 norm
x1
Mahalanobis Distance
x-y
−1
2
∑i
−1
= (x − y)T ∑ i (x − y)
µ
x2
xi xi -
2
∑ −1
2
=K
=
n ∑-1 can be thought of as a stretching factor on the space
n Note that for an identity covariance matrix (∑i=I), the Mahalanobis
distance becomes the familiar Euclidean distance
g In the following slides we look at special cases of the Quadratic
classifier
n For convenience we will assume equiprobable priors so we can drop
the term log(P(ωi))
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
62
Ricardo Gutierrez-Osuna
Wright State University
Special case I: Σi=σ2I
g In this case, the discriminant
becomes
gi (x) = −(x − i )T (x − i )
n This is known as a MINIMUM
DISTANCE CLASSIFIER
n Notice the linear decision
boundaries
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
63
Ricardo Gutierrez-Osuna
Wright State University
Special case 2: Σi= Σ (Σ diagonal)
g In this case, the discriminant
becomes
1
gi (x) = − (x − i )T ∑ −1(x − i )
2
n This is known as a MAHALANOBIS
DISTANCE CLASSIFIER
n Still linear decision boundaries
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
64
Ricardo Gutierrez-Osuna
Wright State University
Special case 3: Σi=Σ (Σ non-diagonal)
g In this case, the discriminant
becomes
1
−1
gi (x) = − (x − i )T ∑ i (x − i )
2
n This is also known as a
MAHALANOBIS DISTANCE
CLASSIFIER
n Still linear decision boundaries
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
65
Ricardo Gutierrez-Osuna
Wright State University
Case 4: Σi=σi2I, example
g In this case the quadratic
expression cannot be
simplified any further
g Notice that the decision
boundaries are no longer
linear but quadratic
Zoom
out
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
66
Ricardo Gutierrez-Osuna
Wright State University
Case 5: Σi≠Σj general case, example
g In this case there are no
constraints so the quadratic
expression cannot be
simplified any further
g Notice that the decision
boundaries are also quadratic
Zoom
out
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
67
Ricardo Gutierrez-Osuna
Wright State University
Limitations of quadratic classifiers
g The fundamental limitation is the unimodal Gaussian
assumption
n For non-Gaussian or multimodal
Gaussian, the results may be
significantly sub-optimal
g A practical limitation is associated with the minimum
required size for the dataset
n If the number of examples per class is less than the number of
dimensions, the covariance matrix becomes singular and,
therefore, its inverse cannot be computed
g In this case it is common to assume the same covariance structure
for all classes and compute the covariance matrix using all the
examples, regardless of class
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
68
Ricardo Gutierrez-Osuna
Wright State University
Conclusions
g We can extract the following conclusions
n The Bayes classifier for normally distributed classes is quadratic
n The Bayes classifier for normally distributed classes with equal
covariance matrices is a linear classifier
n The minimum Mahalanobis distance classifier is optimum for
g normally distributed classes and equal covariance matrices and equal priors
n The minimum Euclidean distance classifier is optimum for
g normally distributed classes and equal covariance matrices proportional to
the identity matrix and equal priors
n Both Euclidean and Mahalanobis distance classifiers are linear
g The goal of this discussion was to show that some of the most
popular classifiers can be derived from decision-theoretic
principles and some simplifying assumptions
n It is important to realize that using a specific (Euclidean or Mahalanobis)
minimum distance classifier implicitly corresponds to certain statistical
assumptions
n The question whether these assumptions hold or don’t can rarely be
answered in practice; in most cases we are limited to posting and
answering the question “does this classifier solve our problem or not?”
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
69
Ricardo Gutierrez-Osuna
Wright State University
K Nearest Neighbor classifier
g The kNN classifier is based on non-parametric density
estimation techniques
n Let us assume we seek to estimate the density function P(x) from a
dataset of examples
n P(x) can be approximated by the expression
 V is the volume surroundin g x
k

where  N is the total number of examples
P(x) ≅
NV
 k is the number of examples inside V
n The volume V is determined by the
D-dim distance RkD(x) between x
and its k nearest neighbor
V=πR2
P(x) =
x
P(x) ≅
N R
R
k
k
=
NV N ⋅ c D ⋅ RDk (x)
g Where cD is the volume of the
unit sphere in D dimensions
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
70
k
Ricardo Gutierrez-Osuna
Wright State University
2
K Nearest Neighbor classifier
g We use the previous result to estimate the posterior probability
n The unconditional density is, again, estimated with
P(x |
i)=
ki
Ni V
n And the priors can be estimated by
P( i ) =
Ni
N
n The posterior probability then becomes
k i Ni
⋅
P(x | i )P( i ) Ni V N k i
=
=
P( i | x) =
k
P(x)
k
NV
n Yielding discriminant functions
gi (x) =
ki
k
g This is known as the k Nearest Neighbor classifier
NOSE 2nd Summer School
Lloret de Mar, Spain, October 2-5, 2000
Statistical Pattern Recognition
71
Ricardo Gutierrez-Osuna
Wright State University
K Nearest Neighbor classifier
g The kNN classifier is a very intuitive method
n Examples are classified based on their similarity with training data
g For a given unlabeled example xu∈ℜD, find the k “closest” labeled examples
in the training data set and assign xu to the class that appears most
frequently within the k-subset
g The kNN only requires
n An integer k
n A set of labeled examples
n A measure of “closeness”
[Figure: scatter plot of labeled training examples from several classes on axis 1 vs. axis 2, illustrating how an unlabeled example is assigned to the class that appears most frequently among its k nearest neighbors]
kNN in action: example 1
g We generate data for a 2-dimensional 3-class problem, where the
class-conditional densities are multi-modal and the classes are
non-linearly separable
g We used kNN with
n k = five
n Metric = Euclidean distance
kNN in action: example 2
g We generate data for a 2-dimensional 3-class
problem, where the likelihoods are unimodal
and distributed in rings around a common mean
n These classes are also non-linearly separable
g We used kNN with
n k = five
n Metric = Euclidean distance
kNN versus 1NN
[Figure: classification results obtained with 1-NN, 5-NN and 20-NN on the same dataset]
Characteristics of the kNN classifier
g Advantages
n Analytically tractable, simple implementation
n Nearly optimal in the large sample limit (N→∞)
g P[error]Bayes < P[error]1-NN < 2·P[error]Bayes
n Uses local information, which can yield highly adaptive behavior
n Lends itself very easily to parallel implementations
g Disadvantages
n Large storage requirements
n Computationally intensive recall
n Highly susceptible to the curse of dimensionality
g 1NN versus kNN
n The use of large values of k has two main advantages
g Yields smoother decision regions
g Provides probabilistic information: The ratio of examples for each class
gives information about the ambiguity of the decision
n However, too large a value of k is detrimental
g It destroys the locality of the estimation
Section IV: Validation
g Motivation
g The Holdout
g Re-sampling techniques
g Three-way data splits
Motivation
g Validation techniques are motivated by two
fundamental problems in pattern recognition: model
selection and performance estimation
g Model selection
n Almost invariably, all pattern recognition techniques have one or
more free parameters
g The number of neighbors in a kNN classification rule
g The network size, learning parameters and weights in MLPs
n How do we select the “optimal” parameter(s) or model for a
given classification problem?
g Performance estimation
n Once we have chosen a model, how do we estimate its
performance?
g Performance is typically measured by the TRUE ERROR RATE,
the classifier’s error rate on the ENTIRE POPULATION
Motivation
g If we had access to an unlimited number of examples these
questions have a straightforward answer
n Choose the model that provides the lowest error rate on the entire
population and, of course, that error rate is the true error rate
g In real applications we only have access to a finite set of
examples, usually smaller than we would like
n One approach is to use the entire training data to select our classifier
and estimate the error rate
g This naïve approach has two fundamental problems
n The final model will normally overfit the training data
g This problem is more pronounced with models that have a large number of
parameters
n The error rate estimate will be overly optimistic (lower than the true error rate)
g In fact, it is not uncommon to have 100% correct classification on training
data
n A much better approach is to split the training data into disjoint subsets:
the holdout method
The holdout method
g Split dataset into two groups
n Training set: used to train the classifier
n Test set: used to estimate the error rate of the trained classifier
Total number of examples
Training Set
Test Set
g A typical application of the holdout method is determining a
stopping point when training with the backpropagation algorithm
[Figure: MSE versus epochs for the training and test sets; the stopping point is chosen where the test set error reaches its minimum]
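A minimal holdout sketch (assuming NumPy; the classifier interface, function names and split ratio are placeholders, not prescribed by the slides): shuffle the data once, train on one part, and estimate the error rate on the held-out part:

    import numpy as np

    def holdout_error(X, y, train_fn, predict_fn, test_fraction=0.3, seed=0):
        """Single train-and-test split: returns the error rate on the test set."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_test = int(len(X) * test_fraction)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = train_fn(X[train_idx], y[train_idx])
        y_pred = predict_fn(model, X[test_idx])
        return float(np.mean(y_pred != y[test_idx]))

    # Hypothetical usage with a nearest-class-mean classifier
    def train_fn(X, y):
        return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

    def predict_fn(model, X):
        classes = list(model)
        means = np.stack([model[c] for c in classes])
        return np.array([classes[np.argmin(np.linalg.norm(means - x, axis=1))]
                         for x in X])

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
    y = np.array([0] * 40 + [1] * 40)
    print(holdout_error(X, y, train_fn, predict_fn))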
The holdout method
g The holdout method has two basic drawbacks
n In problems where we have a sparse dataset we may not be able to
afford the “luxury” of setting aside a portion of the dataset for testing
n Since it is a single train-and-test experiment, the holdout estimate of
error rate will be misleading if we happen to get an “unfortunate” split
g The limitations of the holdout can be overcome with a family of
resampling methods at the expense of more computations
n Cross Validation
g Random Subsampling
g K-Fold Cross-Validation
g Leave-one-out Cross-Validation
n Bootstrap
Random Subsampling
g Random Subsampling performs K data splits of the dataset
n Each split randomly selects a (fixed) number of examples without replacement
n For each data split we retrain the classifier from scratch with the training
examples and estimate Ei with the test examples
[Figure: the total set of examples is split K times (experiments 1, 2, 3, …); in each experiment a different randomly chosen subset is held out as test examples]
g The true error estimate is obtained as the average of the
separate estimates Ei
n This estimate is significantly better than the holdout estimate
E = (1/K) · Σi=1..K Ei
K-Fold Cross-validation
g Create a K-fold partition of the dataset
n For each of K experiments, use K-1 folds for training and the remaining
one for testing
[Figure: the total set of examples is divided into K folds; in each of the K experiments (1 through 4 shown) a different fold serves as the test examples and the remaining folds are used for training]
g K-Fold Cross validation is similar to Random Subsampling
n The advantage of K-Fold Cross validation is that all the examples in the
dataset are eventually used for both training and testing
g As before, the true error is estimated as the average error rate
E = (1/K) · Σi=1..K Ei
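A short K-fold cross-validation sketch (assuming NumPy; train_fn and predict_fn follow the same hypothetical interface as in the holdout sketch above): the K fold error rates are averaged into E = (1/K)·Σ Ei:

    import numpy as np

    def kfold_error(X, y, train_fn, predict_fn, K=4, seed=0):
        """K-fold cross-validation estimate of the error rate."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        folds = np.array_split(idx, K)
        errors = []
        for i in range(K):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
            model = train_fn(X[train_idx], y[train_idx])
            y_pred = predict_fn(model, X[test_idx])
            errors.append(np.mean(y_pred != y[test_idx]))
        return float(np.mean(errors))   # E = (1/K) * sum of the E_i

    # Leave-one-out cross-validation is the special case K = len(X)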
Leave-one-out Cross Validation
g Leave-one-out is the degenerate case of K-Fold Cross
Validation, where K is chosen as the total number of examples
n For a dataset with N examples, perform N experiments
n For each experiment use N-1 examples for training and the remaining
example for testing
[Figure: for N experiments (1, 2, 3, …, N), a single, different example is held out as the test example in each experiment]
g As usual, the true error is estimated as the average error rate on
test examples
E = (1/N) · Σi=1..N Ei
How many folds are needed?
g With a large number of folds
+ The bias of the true error rate estimator will be small (the estimator will
be very accurate)
- The variance of the true error rate estimator will be large
- The computational time will be very large as well (many experiments)
g With a small number of folds
+ The number of experiments and, therefore, computation time are
reduced
+ The variance of the estimator will be small
- The bias of the estimator will be large (conservative or larger than the
true error rate)
g In practice, the choice of the number of folds depends on the
size of the dataset
n For large datasets, even 3-Fold Cross Validation will be quite accurate
n For very sparse datasets, we may have to use leave-one-out in order to
train on as many examples as possible
g A common choice for K-Fold Cross Validation is K=10
Three-way data splits
g If model selection and true error estimates are to be computed
simultaneously, the data needs to be divided into three disjoint
sets
n Training set: a set of examples used for learning the parameters of the
model
g In the MLP case, we would use the training set to find the “optimal” weights
with the back propagation rule
n Validation set: a set of examples used to tune the meta-parameters of
a model
g In the MLP case, we would use the validation set to find the “optimal”
number of hidden units or determine a stopping point for the back
propagation algorithm
n Test set: a set of examples used only to assess the performance of a
fully-trained classifier
g In the MLP case, we would use the test set to estimate the error rate after we
have chosen the final model (MLP size and actual weights)
g After assessing the final model with the test set, YOU MUST NOT further
tune the model
Three-way data splits
g Why separate test and validation sets?
n The error rate estimate of the final model on validation data will
be biased (smaller than the true error rate) since the validation
set is used to select the final model
n After assessing the final model with the test set, YOU MUST
NOT tune the model any further
g Procedure outline
1. Divide the available data into training, validation and test set
2. Select architecture and training parameters
3. Train the model using the training set
4. Evaluate the model using the validation set
5. Repeat steps 2 through 4 using different architectures and training parameters
6. Select the best model and train it using data from the training and validation set
7. Assess this final model using the test set
n This outline assumes a holdout method
g If CV or Bootstrap are used, steps 3 and 4 have to be repeated for
each of the K folds
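The outline above can be condensed into a short sketch (hypothetical names; train_fn and predict_fn follow the same placeholder interface as in the earlier sketches, and the "candidates" stand for whatever architectures are under consideration): select on the validation set, retrain on training plus validation data, and report the test-set error only once at the end:

    import numpy as np

    def three_way_model_selection(X_tr, y_tr, X_val, y_val, X_te, y_te, candidates):
        """candidates: list of (train_fn, predict_fn) pairs, one per architecture.
        Returns the selected pair and its test-set error rate."""
        def error(predict_fn, model, X, y):
            return float(np.mean(predict_fn(model, X) != y))

        # Steps 2-5: train each candidate on the training set, score it on validation
        val_errors = []
        for train_fn, predict_fn in candidates:
            model = train_fn(X_tr, y_tr)
            val_errors.append(error(predict_fn, model, X_val, y_val))

        # Step 6: keep the best candidate and retrain on training + validation data
        best = int(np.argmin(val_errors))
        train_fn, predict_fn = candidates[best]
        final_model = train_fn(np.vstack([X_tr, X_val]),
                               np.concatenate([y_tr, y_val]))

        # Step 7: assess once on the test set, with no further tuning afterwards
        return (train_fn, predict_fn), error(predict_fn, final_model, X_te, y_te)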
Three-way data splits
[Figure: model selection with a three-way split. Each candidate (Model1 … Model4) is trained on the training set and evaluated on the validation set to obtain Error1 … Error4; the model with the minimum validation error is selected as the final model, and its error rate is then estimated on the test set as the final error]
Section V: Clustering
g Unsupervised learning concepts
g K-means
g Agglomerative clustering
g Divisive clustering
Unsupervised learning
g In this section we investigate a family of techniques which use
unlabeled data: the feature vector X without the class label ω
n These methods are called unsupervised since they are not provided the
correct answer
g Although unsupervised learning methods may appear to have
limited capabilities, there are several reasons that make them
extremely useful
n Labeling large data sets can be costly (i.e., speech recognition)
n Class labels may not be known beforehand (i.e., data mining)
n Very large datasets can be compressed by finding a small set of
prototypes (i.e., kNN)
g Two major approaches for unsupervised learning
n Parametric (mixture modeling)
g A functional form for the underlying density is assumed, and we seek to
estimate the parameters of the model
n Non-parametric (clustering)
g No assumptions are made about the underlying densities, instead we seek
a partition of the data into clusters
n These methods are typically referred to as clustering
Unsupervised learning
g Non-parametric unsupervised learning involves three steps
n Defining a measure of similarity between examples
n Defining a criterion function for clustering
n Defining an algorithm to minimize (or maximize) the criterion function
g Similarity measures
n The most straightforward similarity measure is distance: Euclidean, city
block or Mahalanobis
g Euclidean and city block measures are very sensitive to scaling of the axes,
so caution must be exercised
g Criterion function
n The most widely used criterion function for clustering is the
sum-of-squared-error criterion
JMSE = Σi=1..C Σx∈ωi ||x − µi||²   where   µi = (1/Ni) · Σx∈ωi x
g This criterion measures how well the data set X = {x(1, x(2, …, x(N} is
represented by the cluster centers µ = {µ(1, µ(2, …, µ(C} (C < N)
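For concreteness, a few lines (assuming NumPy; the function name is hypothetical) that evaluate this criterion for a given assignment of examples to clusters:

    import numpy as np

    def j_mse(X, labels):
        """Sum-of-squared-error criterion: sum over clusters of ||x - mu_i||^2,
        where mu_i is the mean of the examples assigned to cluster i."""
        total = 0.0
        for c in np.unique(labels):
            cluster = X[labels == c]
            mu = cluster.mean(axis=0)
            total += float(((cluster - mu) ** 2).sum())
        return total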
Unsupervised learning
g Once a criterion function has been defined, we must find a partition of
the data set that minimizes the criterion
n Exhaustive enumeration of all partitions, which guarantees the optimal solution,
is infeasible
n For example, a problem with 5 clusters and 100 examples yields about 10^67
different ways to partition the data
g The common approach is to proceed in an iterative fashion
n Find some reasonable initial partition and then
n Move samples from one cluster to another in order to reduce the criterion
function
g These iterative methods produce sub-optimal solutions but are
computationally tractable
g We will consider two groups of iterative methods
n Flat clustering algorithms
g These algorithms produce a set of disjoint clusters (a.k.a. a partition)
g Two algorithms are widely used: k-means and ISODATA
n Hierarchical clustering algorithms:
g The result is a hierarchy of nested clusters
g These algorithms can be broadly divided into agglomerative and divisive approaches
The k-means algorithm
g The k-means algorithm is a simple clustering procedure that
attempts to minimize the criterion function JMSE iteratively
1. Define the number of clusters
2. Initialize clusters by
• an arbitrary assignment of examples to clusters, or
• an arbitrary set of cluster centers (examples assigned to nearest centers)
3. Compute the sample mean of each cluster
4. Reassign each example to the cluster with the nearest mean
5. If the classification of all samples has not changed, stop; else go to step 3
n The k-means algorithm is widely used in the fields of signal processing
and communication for Vector Quantization
g Scalar signal values are usually quantized into a number of levels (typically
a power of 2 so the signal can be transmitted in binary)
g The same idea can be extended for multiple channels
n However, rather than quantizing each separate channel, we can obtain a more
efficient signal coding if we quantize the overall multidimensional vector by
finding a number of multidimensional prototypes (cluster centers)
g The set of cluster centers is called a “codebook”, and the problem of finding
this codebook is normally solved using the k-means algorithm
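A compact sketch of the procedure above (assuming NumPy; the function name is hypothetical, and initialization with randomly chosen examples as centers is one of the two options listed in step 2):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Plain k-means: alternate between assigning examples to the nearest
        center and recomputing each center as the mean of its cluster."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        # Step 2: initialize the centers with k randomly chosen examples
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        labels = None
        for _ in range(n_iter):
            # Step 4: assign each example to the cluster with the nearest mean
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 5: stop if no assignment changed
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # Step 3: recompute the sample mean of each (non-empty) cluster
            for c in range(k):
                if np.any(labels == c):
                    centers[c] = X[labels == c].mean(axis=0)
        return centers, labels

The returned centers play the role of the codebook in the vector quantization application described above.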
Hierarchical clustering
g k-means creates disjoint clusters, resulting in a “flat” data
representation (a partition)
g Sometimes it is desirable to obtain a hierarchical representation of data,
with clusters and sub-clusters arranged in a tree-structured fashion
n Hierarchical representations are commonly used in the sciences (i.e.,
biological taxonomy)
g Hierarchical clustering methods can be grouped in two general
classes
n Agglomerative (bottom-up, merging):
g Starting with N singleton clusters, successively merge clusters until one
cluster is left
n Divisive (top-down, splitting)
g Starting with a unique cluster, successively split the clusters until N singleton
examples are left
Hierarchical clustering
g The preferred representation for hierarchical clusters is the
dendrogram
n The dendrogram is a binary tree that shows the structure of the clusters
g In addition to the binary tree, the dendrogram provides the similarity measure
between clusters (the vertical axis)
n An alternative representation is based on sets: {{x1, {x2, x3}}, {{{x4, x5},
{x6, x7}}, x8}} but, unlike the dendrogram, sets cannot express
quantitative information
[Figure: dendrogram over examples x1–x8; the vertical axis indicates the similarity level at which clusters are merged, ranging from high to low similarity]
Divisive clustering
g Outline
n Define
g NC: Number of clusters
g NEX: Number of examples
1. Start with one large cluster
2. Find the “worst” cluster
3. Split it
4. If NC < NEX go to 2
g How to choose the “worst” cluster
n
n
n
n
Largest number of examples
Largest variance
Largest sum-squared-error
...
g How to split clusters
n Mean-median in one feature direction
n Perpendicular to the direction of largest variance
n …
g The computations required by divisive clustering are more
intensive than for agglomerative clustering methods and, thus,
agglomerative approaches are more common
Agglomerative clustering
g Outline
n Define
g NC: Number of clusters
g NEX: Number of examples
1. Start with NEX singleton clusters
2. Find nearest clusters
3. Merge them
4. If NC > 1 go to 2
g How to find the “nearest” pair of clusters
n Minimum distance:  dmin(ωi, ωj) = min ||x − y||,  x ∈ ωi, y ∈ ωj
n Maximum distance:  dmax(ωi, ωj) = max ||x − y||,  x ∈ ωi, y ∈ ωj
n Average distance:  davg(ωi, ωj) = (1/(Ni·Nj)) · Σx∈ωi Σy∈ωj ||x − y||
n Mean distance:     dmean(ωi, ωj) = ||µi − µj||
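A naive sketch of the bottom-up procedure using the minimum distance dmin, i.e. single linkage (assuming NumPy; the function name is hypothetical and the implementation is deliberately simple rather than efficient); it records each merge so that a dendrogram could be drawn from the result:

    import numpy as np

    def single_linkage(X, n_clusters=1):
        """Naive agglomerative clustering with the minimum (single-link) distance.
        Returns the surviving clusters as lists of example indices and the
        sequence of merges (cluster_a, cluster_b, distance)."""
        X = np.asarray(X, dtype=float)
        clusters = [[i] for i in range(len(X))]      # start with N_EX singletons
        merges = []
        while len(clusters) > n_clusters:
            # find the pair of clusters with the smallest d_min
            best = (None, None, np.inf)
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(np.linalg.norm(X[i] - X[j])
                            for i in clusters[a] for j in clusters[b])
                    if d < best[2]:
                        best = (a, b, d)
            a, b, d = best
            merges.append((clusters[a], clusters[b], d))
            clusters[a] = clusters[a] + clusters[b]  # merge the nearest pair
            del clusters[b]
        return clusters, merges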
Agglomerative clustering
g Minimum distance
n When dmin is used to measure distance between clusters, the algorithm is called
the nearest-neighbor or single-linkage clustering algorithm
n If the algorithm is allowed to run until only one cluster remains, the result is a
minimum spanning tree (MST)
n This algorithm favors elongated classes
g Maximum distance
n When dmax is used to measure distance between clusters, the algorithm is called
the farthest-neighbor or complete-linkage clustering algorithm
n From a graph-theoretic point of view, each cluster constitutes a complete subgraph
n This algorithm favors compact classes
g Average and mean distance
n The minimum and maximum distance are extremely sensitive to outliers since
their measurement of between-cluster distance involves minima or maxima
n The average and mean distance approaches are more robust to outliers
n Of the two, the mean distance is computationally more attractive
g Notice that the average distance approach involves the computation of NiNj distances
for each pair of clusters
Conclusions
g Dimensionality reduction
n PCA may not find the most discriminatory axes
n LDA may overfit the data
g Classifiers
n Quadratic classifiers are efficient for unimodal data
n kNN is very versatile but is computationally inefficient
g Validation
n A fundamental subject, oftentimes overlooked
g Clustering
n Helps understand the underlying structure of the data
ACKNOWLEDGMENTS
This research was supported by NSF CAREER award 9984426.
References
g Fukunaga, K., 1990, Introduction to statistical pattern recognition, 2nd edition,
Academic Pr.
g Duda, R. O.; Hart, P. E. and Stork, D. G., 2001, Pattern classification, 2nd Ed.,
Wiley.
g Bishop, C. M., 1995, Neural networks for Pattern Recognition, Oxford.
g Therrien, C. W., 1989, Decision, estimation, and classification : an introduction
to pattern recognition and related topics, Wiley.
g Devijver, P. A. and Kittler, J., 1982, Pattern recognition: a statistical approach,
Prentice-Hall.
g Ballard, D., 1997, An introduction to Natural Computation, The MIT Press.
g Silverman, B. W., 1986, Density estimation for statistics and data analysis,
Chapman & Hall.
g Smith, M., 1994, Neural networks for statistical modeling, Van Nostrand
Reinhold.
g Doak, J., 1992, An evaluation of feature selection methods and their application
to Computer Security, University of California at Davis, Tech Report CSE-92-18
g Kohavi, R. and Sommerfield, D., 1995, "Feature Subset Selection Using the
Wrapper Model: Overfitting and Dynamic Search Space Topology" in Proc. of
the 1st Intl’ Conf. on Knowledge Discovery and Data Mining, pp. 192-197.