Course: T0283 - Computer Vision
Year: 2010
Lecture 10
Pattern Recognition and Classification II
Learning Objectives
After carefully listening to this lecture, students will be able to do the following:
Demonstrate the use of the PCA technique in face recognition
Explain the real-time robust object detection procedure developed by Viola and Jones
Feature Selection
What features to use? How do we extract them from
the image?
Using the images themselves as feature vectors is easy, but has the problem of high dimensionality
A 128 x 128 image = a 16,384-dimensional feature space!
What do we know about the structure of the categories in feature space?
Intuitively, we want features that result in well-separated classes
Dimensionality Reduction
Functions yi = yi(x) can reduce the dimensionality of the feature space → more efficient classification
If chosen intelligently, we won't lose much information, and classification becomes easier
Common methods
Principal components analysis (PCA): Projection
maximizing total variance of data
Fisher’s Linear Discriminant (FLD): Maximize ratio of
between-class variance to within-class variance
Geometric Interpretation of Covariance
The covariance C = X Xᵀ can be thought of as a linear transform that redistributes the variance of a unit normal distribution, where the zero-mean data matrix X is n (number of dimensions) x d (number of points)
adapted from Z. Dodds
Geometric Factorization of Covariance
The SVD of the covariance matrix, C = Rᵀ D R, describes the geometric components of the transform by extracting:
a diagonal scaling matrix D
a rotation matrix R
E.g., given the points
X = [2 -2 1 -1; 5 -5 -1 1]  (rows separated by semicolons; n = 2 dimensions, d = 4 points),
the covariance factors as
X Xᵀ = [2.5 5; 5 13] = [.37 -.93; .93 .37] [15 0; 0 .5] [.37 .93; -.93 .37]
The columns of the rotation factor [.37 -.93; .93 .37] (approximately the cosine and sine of 70°) are the "best" and second-best axes, and the diagonal entries 15 and 0.5 of the scaling factor are the major and minor axis lengths.
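This kind of factorization can also be checked numerically. Below is a minimal NumPy sketch (not part of the original slides; variable names are illustrative) that eigendecomposes the covariance of the example points and recovers a rotation and a diagonal scaling:

```python
import numpy as np

# Zero-mean data from the example above: n = 2 dimensions, d = 4 points.
X = np.array([[2.0, -2.0,  1.0, -1.0],
              [5.0, -5.0, -1.0,  1.0]])

# Covariance of the points (normalized here by the number of points).
C = X @ X.T / X.shape[1]

# Eigendecomposition C = R^T D R: D holds the axis lengths, R the rotation.
eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # largest ("best" axis) first
D = np.diag(eigvals[order])
R = eigvecs[:, order].T                   # rows of R are the principal axes

print(D)            # diagonal scaling (axis lengths), largest first
print(R)            # rotation; rows are the principal axes
print(R.T @ D @ R)  # reconstructs C
```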
PCA for Dimensionality Reduction
Any point in n-dimensional feature space can be expressed as a linear combination of the n eigenvectors (the rows of R) via a set of weights [ω1, ω2, …, ωn] (this is just a coordinate-system change)
By projecting points onto only the first k << n principal components (the eigenvectors with the largest eigenvalues), we essentially throw away the least important feature information
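A minimal sketch of this projection (NumPy; it continues the X and R variables from the previous sketch, and k is chosen by the caller):

```python
def project_onto_top_k(X, R, k):
    """Project zero-mean data X (n x d) onto the k principal axes with the
    largest eigenvalues. R has the eigenvectors as rows, sorted by decreasing
    eigenvalue; the result is a k x d matrix of coordinates (weights)."""
    return R[:k] @ X

# e.g. keep only the first principal component:
# Y = project_onto_top_k(X, R, k=1)
```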
Projection onto Principal Components
[Figure: projection of points from the full n-dimensional space (here n = 2) onto a k-dimensional subspace (here k = 1); adapted from Z. Dodds]
Face Recognition
[Figure: 20 faces (i.e., classes), with 9 examples (i.e., training data) of each; which class does the query face belong to?]
Simple Face Recognition
Idea: Search over the training set for the most similar image (e.g., in the SSD sense) and choose its class
This is the same as a 1-nearest-neighbor classifier when feature space = image space
Issues
Large storage requirements (nd, where n = image
space dimensionality and d = number of faces in training
set)
Correlation is computationally expensive
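A minimal sketch of this baseline in NumPy (the array layout is an assumption of the sketch, not the course's code): flatten each image into a row vector and assign the query the label of the training image with the smallest SSD.

```python
import numpy as np

def nearest_neighbor_classify(train_images, train_labels, query):
    """train_images: d x n array of flattened training faces; query: length-n vector."""
    ssd = np.sum((train_images - query) ** 2, axis=1)   # SSD to every training image
    return train_labels[np.argmin(ssd)]                 # class of the most similar image
```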
Eigenfaces
Idea: Compress image space to “face space” by projecting
onto principal components (“eigenfaces” = eigenvectors of
image space)
Represent each face as a low-dimensional vector (weights
on eigenfaces)
Measure similarity in face space for classification
Advantage: Storage requirements are (n + d) k instead of
nd
Eigenfaces: Initialization
Calculate eigenfaces
Compute the n-dimensional mean face Ψ
Compute the difference of every face from the mean face: Φj = Γj - Ψ (where Γj is the j-th training face)
Form the covariance matrix of these differences: C = A Aᵀ, where A = [Φ1, Φ2, …, Φd]
Extract eigenvectors ui from C such that Cui = λiui
The eigenfaces are the k eigenvectors with the largest eigenvalues
[Figure: example eigenfaces]
Eigenfaces: Initialization
Project faces into face space
Get the eigenface weights for every face in the training set
The weights [ωj1, ωj2, …, ωjk] for face j are computed via dot products ωji = uiᵀΦj
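In matrix form this is a single product; a one-line sketch (assuming U is the n x k matrix of eigenfaces and Phi the n x d matrix of difference faces from the previous slide):

```python
def face_space_weights(U, Phi):
    """Column j of the result holds the weights [w_j1, ..., w_jk] = U^T Phi_j."""
    return U.T @ Phi                      # k x d matrix of eigenface weights
```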
Calculating Eigenfaces
The obvious way is to perform an SVD of the covariance matrix, but this is often prohibitively expensive
E.g., for 128 x 128 images, C is 16,384 x 16,384
Consider the eigenvector decomposition of the d x d matrix AᵀA: AᵀA vi = λi vi. Multiplying both sides on the left by A, we have
A Aᵀ A vi = λi A vi, i.e., C (A vi) = λi (A vi)
So ui = A vi are the eigenvectors of C = A Aᵀ
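A minimal NumPy sketch of the whole initialization using this trick (not the original course code; faces is assumed to be a d x n array of flattened training images):

```python
import numpy as np

def compute_eigenfaces(faces, k):
    """faces: d x n array (d flattened training images, n pixels each)."""
    mean_face = faces.mean(axis=0)                # the mean face (Psi)
    A = (faces - mean_face).T                     # n x d matrix of difference faces (Phi_j)
    # Eigendecompose the small d x d matrix A^T A instead of the huge n x n covariance.
    eigvals, V = np.linalg.eigh(A.T @ A)
    top = np.argsort(eigvals)[::-1][:k]           # k largest eigenvalues
    U = A @ V[:, top]                             # u_i = A v_i are eigenvectors of C = A A^T
    U /= np.linalg.norm(U, axis=0)                # normalize each eigenface
    return mean_face, U                           # U is n x k
```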
Eigenfaces: Recognition
Project new face into face space
Classify
Assign the class of the nearest face from the training set
Or, precalculate class means over the training set and find the nearest class-mean face
[1 , 2, …,  8]
Original face
January 20, 2010
8 eigenfaces
T0283 - Computer Vision
Weights
adapted from Z. Dodds
16
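A matching recognition sketch under the same assumptions (mean_face and U come from compute_eigenfaces above; train_weights is a d x k array of precomputed training-face weights with labels train_labels):

```python
import numpy as np

def recognize(query, mean_face, U, train_weights, train_labels):
    """Project a flattened query face into face space and return the nearest class."""
    w = U.T @ (query - mean_face)                          # eigenface weights of the query
    distances = np.linalg.norm(train_weights - w, axis=1)  # distance to every training face
    return train_labels[np.argmin(distances)]
```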
Robust Real-time Object Detection
by
Paul Viola and Michael Jones
Presentation by Chen Goldberg
Computer Science
Tel Aviv University
June 13, 2007
About the paper
Presented in 2001 by Paul Viola and Michael Jones
(published 2002 – IJCV)
Specifically demonstrated (and motivated by) the
face detection task.
Placed a strong emphasis upon speed optimization.
It was allegedly the first real-time face detection system.
It has been widely adopted and re-implemented.
Intel distributes this algorithm in a computer vision
toolkit (OpenCV).
Framework scheme
The framework consists of:
a trainer
a detector
The trainer is supplied with positive and negative samples:
Positive samples - images containing the object.
Negative samples - images not containing the object.
The trainer then creates a final classifier; this is a lengthy process, calculated offline.
The detector applies the final classifier across a given input image.
Abstract detector
1. Iteratively sample image windows.
2. Run the final classifier on each window, and mark it accordingly.
3. Repeat with a larger window.
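A minimal sketch of this scanning loop (Python; the base window size, scale factor, and step are illustrative choices, and classify stands in for the final classifier):

```python
def detect(image_width, image_height, classify,
           base_size=24, scale_factor=1.25, shift=1):
    """Slide a square window over the image at increasing sizes;
    return the (x, y, size) of every window the classifier accepts."""
    detections = []
    size = base_size
    while size <= min(image_width, image_height):
        step = max(1, int(shift * size / base_size))
        for y in range(0, image_height - size + 1, step):
            for x in range(0, image_width - size + 1, step):
                if classify(x, y, size):          # run the final classifier on the window
                    detections.append((x, y, size))
        size = int(size * scale_factor)           # repeat with a larger window
    return detections
```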
Features
We describe an object using simple functions, also called Haar-like features.
Given a sub-window, the feature function calculates a brightness differential.
For example: the value of a two-rectangle feature is the difference between the sums of the pixels within the two rectangular regions.
Features example
Faces share many similar
properties which can be
represented with Haar-like features
For example, it is easy to notice
that:
The eye region is darker than
the upper-cheeks.
The nose bridge region is
brighter than the eyes.
Three challenges ahead
1. How can we evaluate features quickly? Feature calculation happens very frequently, and an image scale pyramid is too expensive to compute.
2. How do we obtain the best representative features possible?
3. How can we refrain from wasting time on image background (i.e., non-object regions)?
Introducing Integral Image
Definition: The integral image at location (x, y) is the sum of the pixel values above and to the left of (x, y), inclusive.
We can calculate the integral image representation of the image in a single pass.
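A minimal sketch of that single pass in NumPy (not the paper's code):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[0..y, 0..x], inclusive."""
    return np.cumsum(np.cumsum(img, axis=0), axis=1)
```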
Rapid evaluation of rectangular features
Using the integral image representation, one can compute the value of any rectangular sum in constant time.
For example, the integral sum inside rectangle D can be computed as:
ii(4) + ii(1) - ii(2) - ii(3)
As a result, two-, three-, and four-rectangle features can be computed with 6, 8, and 9 array references respectively.
Now that’s fast!
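A minimal sketch of that constant-time rectangle sum (the inclusive row/column indexing is an assumption of this sketch, not the paper's notation):

```python
def rect_sum(ii, y0, x0, y1, x1):
    """Sum of pixels in the inclusive rectangle rows y0..y1, columns x0..x1,
    using at most four references into the integral image ii."""
    total = ii[y1, x1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

# A two-rectangle Haar-like feature is then just the difference of two such sums.
```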
Scaling
The integral image enables us to evaluate all rectangle sizes in constant time.
Therefore, no image scaling is necessary.
Scale the rectangular features instead!
Feature selection
Given a feature set and a labeled training set of images, we create a strong object classifier.
However, we have 45,396 features associated with each image sub-window, so computing all of them is prohibitively expensive.
Hypothesis: a combination of only a small number of discriminative features can yield an effective classifier.
Variety is the key here: if we want a small number of features, we must make sure they compensate for each other's flaws.
Boosting
Boosting is a machine learning meta-algorithm for performing supervised learning.
It creates a "strong" classifier from a set of "weak" classifiers.
Definitions:
A "weak" classifier has an error rate < 0.5 (i.e., it does better than chance).
A "strong" classifier has an error rate of ε (i.e., our final classifier).
AdaBoost
Stands for "Adaptive Boosting".
AdaBoost is a boosting algorithm for searching out a small number of good classifiers which have significant variety.
AdaBoost accomplishes this by giving misclassified training examples more weight (thus increasing their chance of being classified correctly in the next round).
The weights tell the learning algorithm the importance of each example.
AdaBoost example
AdaBoost starts with a uniform distribution of "weights" over the training examples.
Select the classifier with the lowest weighted error (i.e., a "weak" classifier).
Increase the weights on the training examples that were misclassified.
(Repeat)
At the end, carefully make a linear combination of the weak classifiers obtained at all iterations:
hstrong(x) = 1 if α1h1(x) + … + αnhn(x) ≥ (1/2)(α1 + … + αn), and 0 otherwise
Slide adapted from a presentation by Qing Chen, Discover Lab, University of Ottawa
Back to Feature selection
We use a variation of AdaBoost for aggressive feature selection.
It is basically similar to the previous example.
Our training set consists of positive and negative images.
Each simple classifier depends on a single feature.
Simple classifier
A simple classifier depends on a single feature.
Hence, there are 45,396 classifiers to choose from.
For each classifier we set an optimal threshold such that the minimum number of examples are misclassified:
hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise
where
hj - the simple classifier
fj - the feature
θj - the threshold
pj - the parity, indicating the direction of the inequality sign
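A minimal sketch of this thresholded classifier (Python; the feature value, threshold, and parity are supplied by the caller):

```python
def simple_classifier(feature_value, threshold, parity):
    """h_j(x) = 1 if parity * f_j(x) < parity * threshold, else 0."""
    return 1 if parity * feature_value < parity * threshold else 0
```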
Feature selection pseudo-code
Given example images (x1, y1), …, (xn, yn), where yi = 0, 1 for negative and positive examples respectively.
Initialize weights w1,i = 1/(2m), 1/(2l) for training example i, where m and l are the number of negatives and positives respectively.
For t = 1 … T:
1) Normalize the weights so that wt is a distribution.
2) For each feature j, train a classifier hj and evaluate its error εj with respect to wt.
3) Choose the classifier ht with the lowest error εt.
4) Update the weights according to:
wt+1,i = wt,i βt^(1 - ei)
where ei = 0 if xi is classified correctly, 1 otherwise, and βt = εt / (1 - εt).
The final strong classifier is:
h(x) = 1 if Σt=1..T αt ht(x) ≥ (1/2) Σt=1..T αt, and 0 otherwise, where αt = log(1 / βt).
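A compact Python sketch of this loop under simplifying assumptions (features are precomputed into a matrix, labels are NumPy arrays, and thresholds are found by a naive scan); it illustrates the procedure rather than reproducing the authors' implementation.

```python
import numpy as np

def adaboost_select(F, y, T):
    """F: n_examples x n_features matrix of feature values, y: 0/1 label array.
    Returns a list of (feature index, threshold, parity, alpha) tuples."""
    n = len(y)
    w = np.where(y == 1, 1.0 / (2 * y.sum()), 1.0 / (2 * (n - y.sum())))
    chosen = []
    for _ in range(T):
        w = w / w.sum()                                   # 1) normalize weights
        best = None
        for j in range(F.shape[1]):                       # 2) one weak classifier per feature
            for theta in np.unique(F[:, j]):
                for p in (+1, -1):
                    h = (p * F[:, j] < p * theta).astype(int)
                    err = np.sum(w * (h != y))
                    if best is None or err < best[0]:
                        best = (err, j, theta, p, h)
        err, j, theta, p, h = best                        # 3) lowest weighted error
        beta = err / (1.0 - err)
        w = w * beta ** (1 - (h != y))                    # 4) shrink weights of correct examples
        chosen.append((j, theta, p, np.log(1.0 / beta)))
    return chosen

def strong_classify(x_features, chosen):
    """Weighted vote of the selected single-feature classifiers."""
    score = sum(a * (1 if p * x_features[j] < p * th else 0) for j, th, p, a in chosen)
    return 1 if score >= 0.5 * sum(a for _, _, _, a in chosen) else 0
```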
The Attentional Cascade
The overwhelming majority of windows are in fact negative.
Simpler boosted classifiers can reject many negative sub-windows while detecting all positive instances.
A cascade of gradually more complex classifiers achieves good detection rates.
Consequently, on average, far fewer features are calculated per window.
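A minimal sketch of this early-rejection logic (Python; stage_classifiers is a hypothetical list of boosted classifiers, each returning 1 to pass the window on and 0 to reject it):

```python
def cascade_classify(window, stage_classifiers):
    """Accept the window only if every stage accepts it; stop at the first rejection."""
    for stage in stage_classifiers:
        if stage(window) == 0:
            return 0          # most windows exit here after evaluating only a few features
    return 1
```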
Training a Cascaded Classifier
Subsequent classifiers are trained only on
examples which pass through all the previous
classifiers
The task faced by classifiers further down the
cascade is more difficult.
Training a Cascaded Classifier (cont.)
Given an overall false positive rate F and detection rate D, we would like to minimize the expected number of features evaluated per window.
Since this optimization is extremely difficult, the usual framework is to choose a maximum acceptable false positive rate and a minimum acceptable detection rate per layer.
N = n0 + Σi=1..K ( ni Πj<i pj )
where
N - the expected number of features evaluated per window
ni - the number of features in the i-th classifier
K - the number of classifiers (layers)
pi - the positive rate of the i-th classifier
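A direct transcription of this formula (Python; the per-layer feature counts and positive rates are hypothetical inputs):

```python
from math import prod

def expected_features(n0, n, p):
    """N = n0 + sum over layers i of n[i] * product of p[j] for j < i."""
    return n0 + sum(n_i * prod(p[:i]) for i, n_i in enumerate(n))
```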
Pseudo-Code for Cascade Trainer
The user selects values for f, the maximum acceptable false positive rate per layer, and d, the minimum acceptable detection rate per layer.
The user selects a target overall false positive rate Ftarget.
P = set of positive examples
N = set of negative examples
F0 = 1.0; D0 = 1.0; i = 0
While Fi > Ftarget:
i++
ni = 0; Fi = Fi-1
while Fi > f x Fi-1:
o ni++
o Use P and N to train a classifier with ni features using AdaBoost
o Evaluate the current cascaded classifier on a validation set to determine Fi and Di
o Decrease the threshold for the i-th classifier until the current cascaded classifier has a detection rate of at least d x Di-1 (this also affects Fi)
N = ∅
If Fi > Ftarget, then evaluate the current cascaded detector on the set of non-face images and put any false detections into the set N.