Frontiers in Computer Vision
Unsupervised learning of visual representations
and their use in object & face recognition

Gary Cottrell
Chris Kanan
Honghao Shan
Lingyun Zhang
Matthew Tong
Tim Marks
1
Collaborators
Honghao Shan
Chris Kanan
2
Collaborators
Tim Marks
Lingyun Zhang
Matt Tong
3
Efficient Encoding of the
world
Sparse Principal Components Analysis:
A model of unsupervised learning for early
perceptual processing (Honghao Shan)
The model embodies three constraints:
1. Keep as much information as possible
2. While trying to equalize the neural responses
3. And minimizing the connections.
4
Efficient Encoding of the world leads to
magno- and parvo-cellular response properties…

[Figure: learned receptive fields arranged by spatial and temporal extent - persistent, small RFs (midget?) vs. transient, large RFs (parasol?) - for models trained on grayscale images, color images, and video cubes.]

This suggests that these cell types exist because they are
useful for efficiently encoding the temporal dynamics of the world.
5
Efficient Encoding of the world leads to
gammatone filters, as in auditory nerves:

 Using exactly the same algorithm, applied to
speech, environmental sounds, etc.
6
Efficient Encoding of the world

 A single unsupervised learning algorithm leads to:
 Model cells with properties similar to those found in the retina, when applied to natural videos
 Model cells with properties similar to those found in the auditory nerve, when applied to natural sounds

 One small step towards a unified theory of temporal processing. (A stand-in sketch follows below.)
7
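Since the same learner is claimed to handle both modalities, here is a minimal stand-in sketch in Python. The names and parameters are mine, and scikit-learn's SparsePCA penalizes component sparsity rather than the connection sparsity of Shan's SPCA, so this is only a rough proxy for the model above:

# One unsupervised learner, two modalities (rough proxy for SPCA).
import numpy as np
from sklearn.decomposition import SparsePCA

def learn_filters(patches, n_units=64):
    # patches: (n_samples, n_dims) rows of whitened image patches
    # or short audio windows, flattened
    model = SparsePCA(n_components=n_units, alpha=1.0, random_state=0)
    model.fit(patches)
    return model.components_  # (n_units, n_dims) learned filters

# visual_filters = learn_filters(image_patches)    # center-surround-like
# auditory_filters = learn_filters(audio_windows)  # gammatone-like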
Unsupervised Learning of Hierarchical Representations
(RICA 2.0; cf. Shan et al., NIPS 19)

 Recursive ICA (RICA 1.0; Shan et al., 2008): alternately compress and expand the representation using PCA and ICA
 ICA was modified by a component-wise nonlinearity
 Receptive fields expanded at each ICA layer (a sketch follows below)
8
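A compact sketch of one RICA layer, under my own component choices (scikit-learn's PCA and FastICA). FastICA is complete rather than overcomplete, so the "expand" step is only approximated; the component-wise nonlinearity is deferred to the next slide's sketch, just as the bullets above defer it:

# One RICA layer: compress with PCA, re-code with ICA.
import numpy as np
from sklearn.decomposition import PCA, FastICA

def rica_layer(X, n_keep, seed=0):
    # X: (n_samples, n_dims) inputs (pixels, or a previous layer's outputs)
    Z = PCA(n_components=n_keep, whiten=True).fit_transform(X)     # compress
    S = FastICA(whiten=False, random_state=seed).fit_transform(Z)  # sparse code
    return S  # gaussianize(S) would be applied here (next slide)

# X1 = rica_layer(patches, n_keep=100)  # layer 1
# X2 = rica_layer(X1, n_keep=100)       # layer 2: larger effective RFs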
Unsupervised Learning of Hierarchical Representations
(RICA 2.0; cf. Shan et al., NIPS 19)

 ICA was modified by a component-wise nonlinearity:
 Think of ICA as a generative model: the pixels are sums of many independent random variables, and such sums tend toward Gaussian.
 Hence ICA prefers its inputs to be Gaussian-distributed.
 We apply an inverse cumulative Gaussian to the absolute value of the ICA components to "gaussianize" them (sketched below).
9
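One way to implement that nonlinearity, as a minimal NumPy sketch. I assume the ICA responses are roughly Laplacian, so their absolute values are roughly exponential; the authors may fit the CDF differently:

# Gaussianize: |ICA response| -> its (assumed) CDF -> inverse Gaussian CDF.
import numpy as np
from scipy.stats import norm

def gaussianize(S, eps=1e-6):
    A = np.abs(S)                      # magnitudes of the ICA components
    b = A.mean(axis=0, keepdims=True)  # per-component exponential scale
    U = 1.0 - np.exp(-A / b)           # CDF of |s| under the Laplacian assumption
    U = np.clip(U, eps, 1.0 - eps)     # keep the inverse CDF finite
    return norm.ppf(U)                 # strong -> positive tail, weak -> negative tail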
Unsupervised Learning of Hierarchical Representations
(RICA 2.0; cf. Shan et al., NIPS 19)

 Strong responses, either positive or negative, are mapped to the positive tail of the Gaussian; weak ones, to the negative tail; ambiguous ones, to the center.
10
Unsupervised Learning of Hierarchical Representations
(RICA 2.0; cf. Shan et al., NIPS 19)

 RICA 2.0: replace PCA by SPCA

[Diagram: alternating SPCA and ICA stages]
11
Unsupervised Learning of Hierarchical Representations
(RICA 2.0; cf. Shan et al., NIPS 19)

 RICA 2.0 results: a multiple-layer system with
 Center-surround receptive fields at the first layer
 Simple edge filters at the second (ICA) layer
 Spatial pooling of orientations at the third (SPCA) layer
 V2-like response properties at the fourth (ICA) layer
12
Unsupervised Learning of Hierarchical Representations
(RICA 2.0; cf. Shan et al., NIPS 19)

 V2-like response properties at the fourth (ICA) layer

These maps show strengths of connections to layer-1 ICA filters. Warm and cold colors are strong +/- connections, gray is weak connections; orientation corresponds to layer-1 orientation.

The left-most column displays two model neurons that show uniform orientation preference to layer-1 ICA features. The middle column displays model neurons that have nonuniform/varying orientation preference to layer-1 ICA features. The right column displays two model neurons that have location preference, but no orientation preference, to layer-1 ICA features.

The left two columns are consistent with Anzai, Peng, & Van Essen (2007). The right-hand column is a prediction.
13
Unsupervised Learning of Hierarchical Representations
(RICA 2.0; cf. Shan et al., NIPS 19)

 Dimensionality reduction & expansion might be a general strategy of information processing in the brain.
 The first step removes noise and reduces complexity; the second step captures the statistical structure.
 We showed that retinal ganglion cells and V1 complex cells may be derived from the same learning algorithm, applied to pixels in one case and to V1 simple-cell outputs in the other.
 This highly simplified model of early vision is the first one that learns the RFs of all early visual layers using a consistent theory - the efficient coding theory.
 We believe it could serve as a basis for more sophisticated models of early vision.
 An obvious next step is to train and thus make predictions about higher layers.
14
Nice, but is it useful?

 We showed in Shan & Cottrell (CVPR 2008) that we could achieve state-of-the-art face recognition with the non-linear ICA features and a simple softmax output.
 We showed in Kanan & Cottrell (CVPR 2010) that we could achieve state-of-the-art face and object recognition with a system that used an ICA-based salience map, simulated fixations, non-linear ICA features, and a kernel-density memory.
 Here I briefly describe the latter.
15
One reason why this might be a good idea…

 Our attention is automatically drawn to interesting regions in images.
 Our salience algorithm is automatically drawn to interesting regions in images.
 These are useful locations for discriminating one object (face, butterfly) from another.
16
Main Idea

 Training Phase (learning object appearances):
 Use the salience map to decide where to look. (We use the ICA salience map.)
 Memorize these samples of the image, with labels (Bob, Carol, Ted, or Alice). (We store the (compressed) ICA feature values.) A sketch follows below.
17
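A sketch of the training phase under my own structuring: the helpers ica_responses and log_density are hypothetical stand-ins for the model's ICA front end and its fitted feature density, and salience is taken as feature rarity, -log p(f), in the spirit of the SUN salience map the system builds on:

# Training: fixate the most salient locations, memorize labeled fragments.
import numpy as np

def train(images, labels, ica_responses, log_density, n_fixations=10):
    # ica_responses(img) -> (n_locations, n_features) ICA feature vectors
    # log_density(F)     -> (n_locations,) log p(f) at each location
    memory = []  # list of (feature_vector, label) pairs
    for img, label in zip(images, labels):
        F = ica_responses(img)
        salience = -log_density(F)  # rare features are salient
        for i in np.argsort(salience)[-n_fixations:]:
            memory.append((F[i], label))  # store the fixated fragment
    return memory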
Main Idea

 Testing Phase (recognizing objects we have learned):
 Now, given a new face, use the salience map to decide where to look.
 Compare new image samples to stored ones - the closest ones in memory get to vote for their label.
18
Stored memories of Bob
Stored memories of Alice
New fragments
Result: 7 votes for Alice, only 3 for Bob. It’s Alice!
19
Voting

 The voting process is based on Bayesian updating (with Naïve Bayes).
 The size of the vote depends on the distance from the stored sample, using kernel density estimation. (See the sketch below.)
 Hence NIMBLE: NIM with Bayesian Likelihood Estimation.
20
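A minimal sketch of that vote, under my reading of the slides: each fixation's likelihood under a class is a Gaussian-kernel density estimate over that class's stored fragments, and fixations are combined by naive-Bayes log-posterior updating (the kernel's normalizing constant is shared across classes, so it is dropped):

# NIMBLE-style voting: kernel-density likelihoods + naive-Bayes updating.
import numpy as np

def log_likelihood(f, fragments, h=1.0):
    # log Gaussian-KDE over one class's stored fragments (constant dropped)
    d2 = ((np.asarray(fragments) - f) ** 2).sum(axis=1)
    return np.logaddexp.reduce(-d2 / (2.0 * h * h)) - np.log(len(fragments))

def classify(fixations, memory_by_class, log_prior):
    # memory_by_class: {label: (n_stored, n_features) array of fragments}
    log_post = dict(log_prior)
    for f in fixations:  # Bayesian update, one fixation at a time
        for k, frags in memory_by_class.items():
            log_post[k] += log_likelihood(f, frags)
    return max(log_post, key=log_post.get)  # the label with the most "votes"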
Overview of the system

 The ICA features do double duty:
 They are combined to make the salience map - which is used to decide where to look
 They are stored to represent the object at that location
21
NIMBLE vs. Computer Vision

 Compare this to (most, not all!) computer vision systems:

[Pipeline: Image → Global Features → Global Classifier → Decision]

 One pass over the image, and global features.
22
[Video demo]
23
[Figure: belief after 1 fixation vs. belief after 10 fixations]
24
Robust Vision

 Human vision works in multiple environments - our basic features (neurons!) don't change from one problem to the next.
 We tune our parameters so that the system works well on Bird and Butterfly datasets - and then apply the system unchanged to faces, flowers, and objects.
 This is very different from standard computer vision systems, which are (usually) tuned to a particular domain.
25
Caltech 101: 101 Different Categories
AR dataset: 120 Different People, with different lighting, expression, and accessories
26
Flowers: 102 Different Flower Species
27
 ~7 fixations are required to achieve at least 90% of maximum performance
28
 So, we created a simple cognitive model that uses simulated fixations to recognize things.
 But it isn't that complicated.
 How does it compare to approaches in computer vision?
29
 Caveats:
 As of mid-2010.
 Only comparing to single-feature-type approaches (no "Multiple Kernel Learning" (MKL) approaches).
 Still superior to MKL with very few training examples per category.
30
[Results plot: accuracy vs. number of training examples (1, 5, 15, 30)]
31
[Results plot: accuracy vs. number of training examples (1, 2, 3, 6, 8)]
32
33
Again, best among single-feature-type systems - and with only 1 training instance, better than all systems.
34
 More neurally and behaviorally relevant gaze control and fixation integration.
 People don't randomly sample images.
 A foveated retina.
 Comparison with human eye-movement data during recognition/classification of faces, objects, etc.
35
 A biologically-inspired, fixation-based approach can work well for image classification.
 Fixation-based models can achieve, and even exceed, some of the best models in computer vision…
 …especially when you don't have a lot of training images.
36
 Software and paper available at www.chriskanan.com
 For more details, email: ckanan@ucsd.edu

This work was supported by the NSF (grant #SBE-0542013) to the Temporal Dynamics of Learning Center.
37
Thanks!
38
Sparse Principal Components Analysis

 We minimize: (equation rendered as an image in the original; see the sketch below)
 Subject to the following constraint: (equation rendered as an image in the original)
39
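Both equations were images in the original and did not survive extraction. The sketch below is my reconstruction of their likely shape from the three constraints on slide 4 (keep information, equalize responses, minimize connections), with y = Aᵀx the unit responses; the exact form in Shan's SPCA may differ:

\min_{A}\;
\underbrace{\mathbb{E}\,\lVert x - A\,y \rVert^{2}}_{\text{keep information}}
\;+\;
\lambda \underbrace{\sum_{ij} \lvert A_{ij} \rvert}_{\text{minimize connections}},
\qquad y = A^{\mathsf{T}} x,

subject to equalized responses, e.g. \operatorname{Var}(y_i) = \operatorname{Var}(y_j) for all i, j.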
The SPCA model as a neural net…

It is Aᵀ that is mostly 0…
40
Results

 …suggesting the 1/f power spectrum of images is where this is coming from…
41
Results

 The role of the sparsity parameter:
 Recall this reduces the number of connections…
42
Results

 The role of the sparsity parameter: a higher value means fewer connections, which alters the contrast sensitivity function (CSF).
 Matches recent data on malnourished kids and their CSFs: lower sensitivity at low spatial frequencies, but slightly better sensitivity at high spatial frequencies than normal controls…
43
 NIMBLE represents its beliefs using probability distributions.
 Simple nearest-neighbor density estimation is used to estimate P(fixation_t | Category = k).
 Evidence is combined over fixations/time using Bayesian updating (written out below).
44
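Written out, the update those bullets describe, assuming naive-Bayes independence across fixations and writing M_k for the stored fragments of category k:

P(C = k \mid f_{1:T}) \;\propto\; P(C = k) \prod_{t=1}^{T} \hat{P}(f_t \mid C = k),
\qquad
\hat{P}(f_t \mid C = k) \;=\; \frac{1}{\lvert M_k \rvert} \sum_{m \in M_k} K_h(f_t - m),

where K_h is a kernel (e.g. Gaussian) with bandwidth h.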