Video Tracking Using Learned Hierarchical Features
Li Wang, Ting Liu, Gang Wang, Kap Luk Chan, and Qingxiong Yang
IEEE Transactions on Image Processing, vol. 24, no. 4, April 2015
Abstract
Introduction
Tracking System Overview
Learning Features for Video Tracking
Experiments
Conclusions
Deep Learning
A new area of machine learning research, moving it closer to one of its original goals: Artificial Intelligence.
Multi-node, multi-layer networks.
Pixels -> lines and boundaries -> recognized objects.
It has achieved impressive performance on image classification, action recognition, speech recognition, and more.
[Figures: visualizations of the features learned at the first, second, third, and fourth layers.]
Introduction
N. Wang and D.-Y. Yeung, “Learning a deep compact
image representation for visual tracking,” in Proc. NIPS,
2013, pp. 809–817.
Drawback: DLT does not have an integrated objective function to bridge offline training and online tracking.
To address the issue in DLT [4], we propose a domain
adaptation based deep learning method to learn
hierarchical features for model-free object tracking.
Introduction
Offline learning
Online learning
Introduction
We employ the limited memory BFGS (L-BFGS)
algorithm [7] to solve the optimization problem in the
adaptation module.
J. Nocedal, “Updating quasi-Newton matrices with
limited storage,” Math. Comput., vol. 35, no. 151, pp.
773–782, 1980.
Abstract
Introduction
Tracking System Overview
Learning Features for Video Tracking
Experiments
Conclusions
Tracking System Overview
The tracking system with the adaptive structural local
sparse appearance model (ASLSA) [32] achieves very
good performance.
Hence, we integrate our feature learning method into
this system.
X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive
structural local sparse appearance model,” in Proc. IEEE
Conf. CVPR, Jun. 2012, pp. 1822–1829.
Tracking System Overview
An observation set of the target
A corresponding feature representation set
Appearance model
Motion model
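The appearance and motion models above fit the standard Bayesian (particle-filter) tracking formulation that ASLSA [32] uses; a sketch, with illustrative notation that is not necessarily the paper's:

```latex
% z_t: target state (e.g., affine parameters); o_{1:t}: observations up to frame t
p(z_t \mid o_{1:t}) \propto
  p(o_t \mid z_t) \int p(z_t \mid z_{t-1})\, p(z_{t-1} \mid o_{1:t-1})\, dz_{t-1}
```

Here $p(o_t \mid z_t)$ plays the role of the appearance model and $p(z_t \mid z_{t-1})$ the motion model.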
Abstract
Introduction
Tracking System Overview
Learning Features for Video Tracking
Experiments
Conclusions
Pre-Learning Generic Features
Pre-Learning Generic Features
Given the offline training patch x_i from the i-th frame, the corresponding learned feature is f_i = W x_i.
The feature transformation matrix W is learned by solving an unconstrained minimization problem.
Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, "ICA with reconstruction cost for efficient overcomplete feature learning," in Proc. NIPS, 2011, pp. 1017–1025.
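Following the cited RICA formulation (Le et al., NIPS 2011), the pre-learning objective plausibly takes a form like the following sketch; the weight λ, the penalty g, and the slowness term are reconstructions, not taken from the slides:

```latex
\min_{W} \sum_{i} \Big( \lambda \,\big\| W^{\top} W x_i - x_i \big\|_2^2
  \;+\; \textstyle\sum_{j} g\big( W_j x_i \big) \Big)
```

where g is a smooth sparsity penalty such as $g(u) = \log\cosh(u)$; the paper's temporal slowness constraint suggests an additional term penalizing $\| W x_i - W x_{i+1} \|$ between adjacent frames.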
Domain adaptation module
The first layer extracts features robust to local motion patterns, e.g., translations.
The second layer extracts features robust to more complicated motion transformations.
Domain adaptation module
Given a target video sequence, we employ ASLSA [32]
to track the target object in the first N frames and use
the tracking results as the training data for the
adaptation module.
The adapted feature is denoted as f_i = W x_i, where x_i is the object image patch in the i-th frame of the training data and W is the feature transformation matrix to be learned.
Domain adaptation module
We formulate the adaptation module by adding a regularization term that keeps W close to W_0, where W_0 denotes the pre-learned feature transformation matrix.
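A hedged sketch of what "adding a regularization term" could look like: the pre-learning cost evaluated on the first N tracked frames, plus a penalty keeping W near the pre-learned matrix W_0 (the weight ρ and the Frobenius norm are assumptions):

```latex
\min_{W} \; L_{\text{pre}}\big(W;\, x_{1:N}\big) \;+\; \rho \,\big\| W - W_0 \big\|_F^2
```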
Optimization
J. Nocedal, “Updating quasi-Newton matrices with limited storage,”
Math. Comput., vol. 35, no. 151, pp. 773–782, 1980.
X: object region
θ: an element of W
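As an illustration of solving such a regularized objective with L-BFGS, here is a minimal sketch using SciPy; the dimensions, toy data, cost terms, and weights (lam, rho) are all assumptions for demonstration, not the paper's configuration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, k, n = 8, 4, 50                          # patch dim, feature dim, #patches (toy)
X = rng.standard_normal((d, n))             # stand-in for tracked object patches
W0 = 0.1 * rng.standard_normal((k, d))      # stand-in for the pre-learned matrix
lam, rho = 0.1, 1.0                         # toy weights (assumptions)

def cost(w):
    W = w.reshape(k, d)
    recon = W.T @ (W @ X) - X                        # RICA-style reconstruction error
    return (lam * np.sum(recon ** 2) / n             # reconstruction cost
            + np.sum(np.log(np.cosh(W @ X))) / n     # smooth sparsity penalty
            + rho * np.sum((W - W0) ** 2))           # stay close to pre-learned W0

res = minimize(cost, W0.ravel(), method="L-BFGS-B")  # limited-memory BFGS
W = res.x.reshape(k, d)
print(cost(res.x) <= cost(W0.ravel()))               # optimization did not worsen the cost
```

Starting the optimizer from W0 mirrors the adaptation idea: the regularizer and the initialization both bias the solution toward the pre-learned features.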
Abstract
Introduction
Tracking System Overview
Learning Features for Video Tracking
Experiments
Conclusions
Experiments
First, we evaluate our learned hierarchical features to demonstrate their robustness to complicated motion transformations.
Second, we evaluate the temporal slowness constraint and the adaptation module in our feature learning algorithm.
Third, we evaluate our tracker's capability of handling typical problems in visual tracking.
Experiments
Then, we compare our tracker with 14 state-of-the-art
trackers.
Moreover, we present the comparison results between
DLT [4] and our tracker.
Finally, we present the generalizability of our feature learning algorithm on two other tracking methods.
Experiments
We use two measurements to quantitatively evaluate
tracking performances.
center location error
overlap rate
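The two measurements can be sketched as follows; the (x, y, w, h) bounding-box convention is an assumption:

```python
import numpy as np

def center_location_error(box_a, box_b):
    # Euclidean distance (in pixels) between the two bounding-box centers
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return float(np.hypot(ax - bx, ay - by))

def overlap_rate(box_a, box_b):
    # Intersection-over-union of two (x, y, w, h) boxes
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

print(center_location_error((0, 0, 10, 10), (10, 0, 10, 10)))  # 10.0
print(overlap_rate((0, 0, 10, 10), (5, 0, 10, 10)))            # ~0.333
```

Lower center location error and higher overlap rate indicate better tracking.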
The purple, green, cyan, blue, and red bounding boxes refer to ASLSA [32]_RAW, ASLSA [32]_HOG, ℓ1_APG [51], CT_DIF [36], and our tracker, respectively.
Experiments
N. Li and J. J. DiCarlo, “Unsupervised natural experience rapidly
alters invariant object representation in visual cortex,” Science, vol.
321, no. 5895, pp. 1502–1507, Sep. 2008.
Temporal slowness constraint is beneficial for learning
features robust to complicated motion transformations.
Experiments
Tracker's ability to handle typical problems
Our tracker vs. DLT
Experiments
To demonstrate the generalizability of our learned features, we integrate our feature learning algorithm into another baseline tracker, the incremental visual tracker (IVT) [12].
D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental
learning for robust visual tracking,” Int. J. Comput. Vis., vol. 77,
nos. 1–3, pp. 125–141, May 2008.
Experiments
In addition, we verify our learned features' generalizability by using the ℓ1_APG tracker [51] and evaluating performance on the same 12 sequences used for IVT.
ℓ1_APG can hardly handle these challenging sequences with complicated motion transformations.
In contrast, integrating our learned features into ℓ1_APG makes it possible to track objects successfully in 6 of the 12 sequences.
Abstract
Introduction
Tracking System Overview
Learning Features for Video Tracking
Experiments
Conclusions
Conclusions
We learn the generic features from auxiliary video
sequences by using a two-layer convolutional neural
network.
Moreover, we propose an adaptation module to adapt
the pre-learned features according to specific target
objects.
As a result, the adapted features are robust to both
complicated motion transformations and appearance
changes of specific target objects.
Thank you for listening!