Pattern Classification

Pattern classification
Basic principles and tools
Summary
• Why a lecture on pattern recognition?
• Introduction to Pattern Recognition (Duda - Sections 1.1-1.6)
• An example of unsupervised learning: PCA
• Tools
Intelligent media environment
• Ambient Intelligence: electronic environments that are sensitive and responsive to the presence of people.
• AmI = Ubiquitous computing + Ubiquitous communication + Intelligent social user interfaces
• AmI at Philips, a video: http://www.date-conference.com/conference/2003/keynotes/
• IBM video: smart supermarket

Ambient intelligence envisions a world where people are surrounded by intelligent and intuitive interfaces embedded in the everyday objects around them. These interfaces recognize and respond to the presence and behavior of an individual in a personalized and relevant way.
Wireless Sensor Networks
1. Smart environments need an "information feed" → sensors
2. Sensor data must be communicated, stored, processed → network
3. Networking anywhere, everywhere, with little infrastructure → wireless

Wireless sensor networks are the "sensory system" of the intelligent ambient "organism".
Humans naturally recognize faces, understand spoken words, read handwritten characters, identify a key in a bag by touch…
How do we provide intelligence to the "digital organism"?
What is pattern recognition?
"The assignment of a physical object or event to one of several pre-specified categories" -- Duda & Hart

• A pattern is an object, process or event that can be given a name.
• A pattern class (or category) is a set of patterns sharing common attributes and usually originating from the same source.
• During recognition (or classification), given objects are assigned to prescribed classes.
• A classifier is a machine which performs classification.
Examples of applications
• Optical Character Recognition (OCR)
  • Handwritten: sorting letters by postal code, input devices for PDAs.
  • Printed texts: reading machines for blind people, digitization of text documents.
• Biometrics
  • Face recognition, verification, retrieval.
  • Fingerprint recognition.
  • Speech recognition.
• Diagnostic systems
  • Medical diagnosis: X-ray, EKG analysis.
  • Machine diagnostics, waste detection.
• Military applications
  • Automated Target Recognition (ATR).
  • Image segmentation and analysis (recognition from aerial or satellite photographs).
Examples of applications
[Figure: application areas - Smart Objects; Smart Environment; Wearable and BAN. Associated tasks: localization, HCI, user awareness, cooperative work and playtime; gestures, natural interfaces; bio-feedback, rehabilitation and healthcare, assistive technologies; static and dynamic posture and activity monitoring (MicrelEye).]
Design space
• The design space is wide… two examples:
• Sequence of static postures: threshold-based algorithm, network with star topology.
  • Can be embedded in a microcontroller.
  • Can be distributed among nodes.
  • More nodes (to understand complex postures) means problems in terms of scalability, wearability, loss of real-time behavior, etc.
• Activity recognition (gait): SVM-based algorithm, one sensor (a minimal SVM sketch follows this list).
  • Extreme wearability.
  • Needs computational power.
  • Can understand more complex activities.
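As a rough illustration of the second example, here is a minimal sketch of an SVM-based activity classifier (Python with scikit-learn assumed; the per-window features and the two activities are invented for illustration and are not the actual MicrelEye pipeline):

```python
# Hedged sketch: SVM on invented per-window accelerometer features (not the real system).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Pretend features per window of one accelerometer: [mean, std, dominant frequency]
walking  = rng.normal([0.2, 1.0, 2.0], 0.1, size=(100, 3))
standing = rng.normal([0.0, 0.1, 0.0], 0.1, size=(100, 3))
X = np.vstack([walking, standing])
y = np.array([1] * 100 + [0] * 100)           # 1 = walking, 0 = standing

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)       # the SVM classifier
print("test accuracy:", clf.score(X_te, y_te))
```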
An Example
• "Sorting incoming fish on a conveyor according to species using optical sensing"
• The two species to distinguish: sea bass and salmon.

Material in the following slides is mainly taken from:
Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000.
• Problem Analysis
  • Set up a camera and take some sample images to extract features:
    • Length
    • Lightness
    • Width
    • Number and shape of fins
    • Position of the mouth, etc…
  • This is the set of all suggested features to explore for use in our classifier!
• Preprocessing
  • Use a segmentation operation to isolate fish from one another and from the background.
  • Information from a single fish is sent to a feature extractor, whose purpose is to reduce the data by measuring certain features.
  • The features are passed to a classifier.
• Classification
  • Select the length of the fish as a possible feature for discrimination.
The length is a poor feature alone!
Select the lightness as a possible feature.
• Threshold decision boundary and cost relationship
  • Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!); see the sketch below.
This is the task of decision theory.
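A minimal sketch of such a cost-driven threshold on lightness (Python/NumPy assumed; the lightness values and the two candidate thresholds are invented for illustration):

```python
import numpy as np

# Invented lightness values: here salmon tend to be darker, sea bass lighter.
salmon_lightness   = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
sea_bass_lightness = np.array([4.5, 5.0, 5.5, 6.0, 6.5])

def classify(lightness, threshold):
    # Below the threshold -> "salmon", otherwise -> "sea bass".
    return np.where(lightness < threshold, "salmon", "sea bass")

# Compare a symmetric threshold with one moved toward smaller lightness values:
# the latter sends fewer sea bass into the (expensive) salmon cans.
for t in (4.6, 3.8):
    bass_as_salmon = np.sum(classify(sea_bass_lightness, t) == "salmon")
    salmon_as_bass = np.sum(classify(salmon_lightness, t) == "sea bass")
    print(f"threshold {t}: sea bass -> salmon errors = {bass_as_salmon}, "
          f"salmon -> sea bass errors = {salmon_as_bass}")
```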
Feature extraction
Task: to extract features which are good for classification.
Good features:
• Objects from the same class have similar feature values.
• Objects from different classes have different feature values.
[Figure: examples of "good" features vs. "bad" features]
Feature vector
• Adopt the lightness and add the width of the fish: each fish is represented by the feature vector $\mathbf{x}^T = [x_1, x_2]$, with $x_1$ = lightness and $x_2$ = width.
Therefore… Basic concepts
A pattern is described by a feature vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ and a hidden state $y$.

Feature vector $\mathbf{x} \in X$
- A vector of observations (measurements).
- $\mathbf{x}$ is a point in the feature space $X$.

Hidden state $y \in Y$
- Cannot be directly measured.
- Patterns with the same hidden state belong to the same class.

Task
- To design a classifier (decision rule) $q: X \to Y$ which decides about the hidden state based on an observation.
In our case…
Task: fish recognition.
- The feature vector is $\mathbf{x} = [x_1, x_2]^T$, with $x_1$ = lightness and $x_2$ = width.
- The set of hidden states is $Y = \{H, J\}$ (the two species).
- The feature space is $X = \mathbb{R}^2$.
- Training examples: $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)\}$.
Linear classifier (sketched in code below):
$$q(\mathbf{x}) = \begin{cases} H & \text{if } (\mathbf{w} \cdot \mathbf{x}) + b \ge 0 \\ J & \text{if } (\mathbf{w} \cdot \mathbf{x}) + b < 0 \end{cases}$$
[Figure: in the (lightness, width) plane, the line $(\mathbf{w} \cdot \mathbf{x}) + b = 0$ separates the region where $y = H$ from the region where $y = J$; the two regions correspond to sea bass and salmon.]
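A minimal NumPy sketch of this linear rule (the weight vector, bias, and feature values are invented for illustration; which of H, J corresponds to which species depends on the training data, so the code just returns the abstract labels):

```python
import numpy as np

w = np.array([-1.0, 2.0])   # weights on [lightness, width] (invented)
b = 0.5                      # bias (invented)

def q(x):
    """Decide the hidden state from one feature vector x = [lightness, width]."""
    return "H" if np.dot(w, x) + b >= 0 else "J"

print(q(np.array([4.0, 3.0])))   # wide, not too light -> H
print(q(np.array([6.0, 1.5])))   # light and narrow    -> J
```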
• We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such "noisy features".
• Ideally, the best decision boundary would seem to be the one which provides optimal performance on the training samples (a boundary that separates them perfectly).
• However, our satisfaction is premature, because the central aim of designing a classifier is to correctly classify novel input.
Issue of generalization!
Overfitting and underfitting
Problem: a model that is too simple underfits the data, while a model that is too complex overfits it and generalizes poorly; we want a good fit in between. A small polynomial-fit sketch follows.
[Figure: three fits of the same data - underfitting, good fit, overfitting]
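A minimal NumPy sketch of the same idea using polynomial fits of increasing degree (the data and degrees are invented for illustration; the overfitted model shines on the training points but fails on fresh ones):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # noisy samples of a smooth curve

x_new = np.linspace(0, 1, 100)                            # "novel input"
y_new = np.sin(2 * np.pi * x_new)

for degree in (1, 3, 10):                                 # underfit, good fit, overfit
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    new_err   = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: training MSE = {train_err:.3f}, novel-data MSE = {new_err:.3f}")
```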
Components of PR system
[Block diagram: Pattern → Sensors and preprocessing → Feature extraction → Classifier → Class assignment, with a Teacher and a Learning algorithm adjusting the classifier.]
• Sensors and preprocessing (segmentation / windowing).
• Feature extraction aims to create discriminative features that are good for classification.
• A classifier.
• A teacher provides information about the hidden state -- supervised learning.
• A learning algorithm sets up the PR system from training examples.
Classifier
A classifier partitions the feature space $X$ into class-labeled regions such that
$$X = X_1 \cup X_2 \cup \ldots \cup X_{|Y|} \quad \text{and} \quad X_1 \cap X_2 \cap \ldots \cap X_{|Y|} = \emptyset$$
[Figure: example partitions of a 2-D feature space into regions $X_1$, $X_2$, $X_3$.]
Classification consists of determining to which region a feature vector $\mathbf{x}$ belongs.
The borders between decision regions are called decision boundaries.
Representation of classifier
A classifier is typically represented as a set of discriminant functions
$$f_i(\mathbf{x}): X \to \mathbb{R}, \quad i = 1, \ldots, |Y|$$
The classifier assigns a feature vector $\mathbf{x}$ to the $i$-th class if
$$f_i(\mathbf{x}) > f_j(\mathbf{x}) \quad \text{for all } j \neq i$$
[Diagram: the feature vector $\mathbf{x}$ is fed to the discriminant functions $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_{|Y|}(\mathbf{x})$; a max selector outputs the class identifier $y$.] A small sketch in code follows.
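A minimal NumPy sketch of this argmax scheme with linear discriminant functions (the weight matrix and bias vector are invented for illustration):

```python
import numpy as np

# Hypothetical linear discriminants f_i(x) = W[i] @ x + b[i] for |Y| = 3 classes.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, -0.5, 1.0])

def classify(x):
    """Assign x to the class whose discriminant function is the largest."""
    scores = W @ x + b             # f_1(x), ..., f_|Y|(x)
    return int(np.argmax(scores))  # index of the winning class

print(classify(np.array([2.0, 0.5])))   # prints the index of the max discriminant
```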
Post-processing and evaluation
• Voting
• Costs and risks
• Computational complexity (differentiating between learning and classifying)
The Design Cycle
• Data collection
• Feature Choice
• Model Choice
• Training
• Evaluation
• Computational Complexity
• Data Collection
  • How do we know when we have collected an adequately large and representative set of examples for training and testing the system?
• Feature Choice
  • Depends on the characteristics of the problem domain. Features should be simple to extract, invariant to irrelevant transformations, and insensitive to noise.
• Model Choice
  • Unsatisfied with the performance of our fish classifier, we may want to jump to another class of model.
• Training
  • Use data to determine the classifier. There are many different procedures for training classifiers and choosing models.
• Evaluation
  • Measure the error rate (or performance) and switch from one set of features to another.
• Computational Complexity
  • What is the trade-off between computational ease and performance?
  • How does an algorithm scale as a function of the number of features, patterns or categories?
Learning and Adaptation
• Supervised learning
  • A teacher provides a category label or cost for each pattern in the training set.
• Unsupervised learning
  • The system forms clusters or "natural groupings" of the input patterns.
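A minimal sketch of such unsupervised grouping with k-means (Python with scikit-learn assumed; the two Gaussian blobs are invented for illustration and no labels are given to the algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two "natural groupings" in a 2-D feature space.
blob_a = rng.normal([0.0, 0.0], 0.3, size=(50, 2))
blob_b = rng.normal([3.0, 3.0], 0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))   # the system discovers the two groups on its own
```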
Unsupervised learning: PCA
• Principal Component Analysis
  • Used widely because it is a method for extracting relevant information from confusing data sets.
  • It is simple and non-parametric.
  • It can be used to reduce a complex data set to a lower dimension, revealing hidden, simplified structures.
  • Starting point - we are the experimenters: there is a phenomenon to measure, but the data appear clouded, unclear, redundant.
Unsupervised learning: PCA as example
• The toy example: motion of the ideal spring.
  A ball of mass m is attached to a massless, frictionless spring. The ball is released a small distance away from equilibrium; it oscillates indefinitely along x, at a set frequency, about its equilibrium.
• We are ignorant: we don't know how many axes and dimensions are important to measure.
• We decide to use:
  - 3 cameras, not orthogonal;
  - each camera records at 200 Hz an image of the two-dimensional position of the ball (a projection);
  - we choose three camera axes {a, b, c}.
The Toy Example – cont'd
• How do we get from this data set to a simple equation of x?
• One goal of PCA is to compute the most meaningful basis to re-express a noisy data set.
• Goal in our example: to discover that "the dynamics are along the x-axis", i.e., that x - the unit basis vector along the x-axis - is the important dimension.
• Our data set at each time instant is the vector of the three camera readings, $\vec{X} = [x_A, y_A, x_B, y_B, x_C, y_C]^T$, where each camera contributes a 2-dimensional projection of the ball's position to the entire vector $\vec{X}$.
• If we record the ball's position for 10 minutes at 120 Hz, then we have recorded 10 × 60 × 120 = 72000 of these vectors.
Change of Basis
• Each sample X is an m-dimensional vector, where m is the number of measurement types (e.g., 6) → every sample is a vector lying in an m-dimensional vector space spanned by an orthonormal basis B, and can be expressed as a linear combination of the basis vectors $b_i$.
• Does there exist another basis B', a linear combination of B, that best re-expresses our data set?
• Linearity assumption:
  • restricts the set of potential bases;
  • formalizes the implicit assumption of continuity in the data set;
  • PCA is limited to re-expressing the data as a linear combination of its basis vectors.
• Let X be the original data set, with one column per sample (in time t); X is m x n (m = 6, n = 72000).
• Let Y be another m x n matrix, related to X by the transformation P:
$$\mathbf{P}\mathbf{X} = \mathbf{Y}$$
• P is the matrix that transforms X into Y.
• Geometrically, P is a rotation and a stretch taking X into Y.
• The rows of P, $\{p_1, \ldots, p_m\}$, are a set of new basis vectors for expressing the columns of X: the $j$-th coefficient of $y_i$ is the projection of $x_i$ onto the $j$-th row of P.
Variance and the goal
• The rows of P are the principal components of X.
• What is the best way to re-express X? What is a good choice for P?
• What does "best express" the data mean? We want to decipher "garbled" data. In a linear system, "garbled" refers to:
  • noise,
  • rotation, and
  • redundancy.
A. Noise and Rotation
Noise is measured relative to the measurement. A common measure is the signal-to-noise ratio (SNR), a ratio of variances:
$$SNR = \frac{\sigma^2_{signal}}{\sigma^2_{noise}}$$
A high SNR (>> 1) indicates high-precision data; a low SNR indicates noise-contaminated data.
• The SNR measures how "fat" the data cloud is.
• We assume that the directions with the largest variances in our measurement vector space contain the dynamics of interest. Maximizing the variance (and, by assumption, the SNR) corresponds to finding the appropriate rotation of the naive basis.
[Figure: simulated data of (xA, yA) for camera A.]
• This intuition corresponds to finding the direction p* in Figure 2b. How do we find p*? In the 2-dimensional case of Figure 2a, p* falls along the direction of the best-fit line for the data cloud. Thus, rotating the naive basis to lie parallel to p* would reveal the direction of motion of the spring for the 2-D case.
B. Redundancy
• Possible plots between two arbitrary measurement types r1 and r2 (Figure 3): in panel (a) there is no apparent relationship, that is, r1 is entirely uncorrelated with r2; r1 and r2 are statistically independent.
• At the other extreme, Figure 3c depicts highly correlated recordings.
• Clearly, in panel (c) it would be more meaningful to just have recorded a single variable, not both.
• Indeed, this is the very idea behind dimensionality reduction.
Covariance matrix
• Generalizing to higher dimensionality.
• Consider two sets of zero-mean measurements, $a = \{a_1, \ldots, a_n\}$ and $b = \{b_1, \ldots, b_n\}$.
• The variances are $\sigma_a^2 = \frac{1}{n}\sum_i a_i^2$ and $\sigma_b^2 = \frac{1}{n}\sum_i b_i^2$, and the covariance is $\sigma_{ab}^2 = \frac{1}{n}\sum_i a_i b_i$ (some texts normalize by $1/(n-1)$ instead; the choice does not matter for what follows).
• The covariance measures the degree of the linear relationship between two variables. A large (small) value indicates high (low) redundancy.
• Generalizing to m measurement vectors $x_1, \ldots, x_m$, we obtain an m x n matrix X.
• Each row of X corresponds to all measurements of a particular type ($x_i$).
• Each column of X corresponds to the set of measurements from one particular trial.
• Covariance matrix:
$$C_X = \frac{1}{n} X X^T$$
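A minimal NumPy sketch of this construction (synthetic data; the 1/n normalization follows the definition above):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 1000                       # 6 measurement types, 1000 trials (synthetic)
X = rng.normal(size=(m, n))          # each row = one measurement type, each column = one trial
X -= X.mean(axis=1, keepdims=True)   # make every measurement type zero-mean

C_X = (X @ X.T) / n                  # m x m covariance matrix (1/(n-1) is also common)
print(C_X.shape)                     # (6, 6): variances on the diagonal, covariances off it
```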
• CX is a square symmetric m x m matrix.
• The diagonal terms of CX are the variances of the particular measurement types.
• The off-diagonal terms of CX are the covariances between measurement types.
• CX captures the correlations between all possible pairs of measurements. The correlation values reflect the noise and redundancy in our measurements.
  • In the diagonal terms, by assumption, large (small) values correspond to interesting dynamics (or noise).
  • In the off-diagonal terms, large (small) values correspond to high (low) redundancy.
• Pretend we have the option of manipulating CX. We will suggestively define our manipulated covariance matrix CY. What features do we want to optimize in CY?
Diagonalize the Covariance Matrix
We want (1) to minimize redundancy, measured by covariance, and (2) to maximize the signal, measured by variance. Therefore CY must be diagonal.
To achieve this, PCA takes the easiest route:
- PCA assumes P is an orthonormal matrix.
- PCA assumes that the directions with the largest variances are the signals, and hence the most "important" or principal directions.
- P acts as a generalized rotation that aligns the basis with the axes of maximal variance.
1. Select a normalized direction in m-dimensional space along which the variance of X is maximized. Save this vector as p1.
2. Find another direction along which the variance is maximized; however, because of the orthonormality condition, restrict the search to all directions perpendicular to the previously selected directions. Save this vector as p2.
3. Repeat this procedure until m vectors are selected.
The result: Y = PX, and CY is diagonal.
SOLVING PCA
Two ways to obtain the algebraic solution:
• Eigenvectors of the covariance matrix
• A more general solution: SVD (singular value decomposition)
For a given data set X, the first method amounts to:
(1) Subtract the mean of each measurement type.
(2) Compute the eigenvectors of XX^T (these give the rows of P).
More generally, performing PCA takes three steps:
1. Organize the data set as an m x n matrix, where m is the number of measurement types and n is the number of trials.
2. Subtract off the mean for each measurement type (row x_i).
3. Calculate the SVD, or the eigenvectors of the covariance matrix.
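A minimal NumPy sketch of both recipes on synthetic data (for distinct variances the two give the same principal directions up to sign):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 6, 1000
X = rng.normal(size=(m, n))                     # m measurement types x n trials (synthetic)
X -= X.mean(axis=1, keepdims=True)              # step 2: subtract the mean of each row

# Recipe 1: eigenvectors of the covariance matrix
C_X = (X @ X.T) / n
eigvals, eigvecs = np.linalg.eigh(C_X)          # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
P = eigvecs[:, order].T                         # rows of P = principal components
Y = P @ X                                       # re-expressed data; cov(Y) is diagonal

# Recipe 2: SVD of the scaled data matrix
U, S, Vt = np.linalg.svd(X / np.sqrt(n), full_matrices=False)
# The columns of U match the rows of P up to sign, and S**2 are the variances along them.
print(np.allclose(np.abs(U.T), np.abs(P)))      # expected: True (distinct variances assumed)
```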
TOOLS and demonstration (30min)
– Commercial
– Open-source
  • WEKA: http://www.cs.waikato.ac.nz/~ml/weka/index.html
  • YALE (now RapidMiner: http://rapid-i.com/)
  • The R Project for Statistical Computing: http://www.r-project.org/
  • Pentaho – whole BI solutions: http://www.pentaho.com/
– Matlab