ROBUST VISUAL TRACKING VIA TRANSFER LEARNING
Wenhan Luo† , Xi Li†† , Wei Li† , Weiming Hu†
† National Lab of Pattern Recognition, Institute of Automation, CAS, Beijing, China.
†† School of Computer Science, University of Adelaide, Australia.
ABSTRACT
In this paper, we propose a boosting-based tracking framework using transfer learning. To deal with complex appearance variations, the proposed framework utilizes discriminative information from previous frames to conduct tracking in the current frame, thereby transferring prior knowledge from the previous source data domain to the current target data domain and yielding a highly discriminative tracker for distinguishing the object from the background. The proposed tracking system has been tested on several challenging sequences, and experimental results demonstrate the effectiveness of the proposed framework.
Index Terms— tracking, transfer learning, boosting
1. INTRODUCTION
Object tracking has attracted increasing attention because of its crucial role in many applications such as surveillance, human-computer interaction, and video analysis. Various methods for object tracking have been proposed, and most can be grouped into two categories: appearance-model based and classifier based. Appearance-model based methods aim to model the statistical properties of tracked objects and carry out object localization by probabilistic model matching; examples include SMOG [1], IVT [2], and Eigentracking [3]. Their common drawback is that they ignore discriminative information from the background. In contrast, classifier-based methods treat object tracking as a classification problem and attempt to find the optimal decision boundary that well separates the object from the background. For instance, an ensemble of weak classifiers is learned online to adapt to changes in the object and the background [4]. Avidan's support vector tracker [5] constructs an optical flow-based SVM classifier for object/non-object classification. In [6], an online Adaboost feature selection algorithm is proposed for tracking. A randomized learning technique, Random Forests, is developed in [7] to select features online for visual tracking. Note that the key issue in the aforementioned classifier-based methods is how to construct robust classifiers for distinguishing the object from the background.
This work is partly supported by NSFC (Grant No. 60825204 and
60935002).
Thus, it is essential for trackers to have adaptive capabilities as both the object and the background change. Recently, other machine learning techniques have been employed for tracking. Babenko et al. [8] use multiple instance learning to avoid the drift problem and thus achieve superior results. Zha et al. [9] propose a tracking method based on transductive learning over labeled and unlabeled samples, which obtains greater class separability and hence better tracking performance.
More recently, transfer learning has become popular due to its superior performance. Traditional machine learning assumes that the training data and the test data follow the same distribution, but this assumption does not always hold. To address this problem, Dai et al. present a transfer learning framework called TrAdaboost [10] for classification. TrAdaboost extends traditional boosting-based learning methods to transfer knowledge from an old domain to a new domain, allowing an accurate classification model to be learned from only a small amount of new data together with a large amount of old data. Its performance, however, depends heavily on the relationship between the old (source) data and the new (target) data: brute-force leveraging of a source poorly related to the target may degrade the classifier, a phenomenon known as negative transfer. To reduce negative transfer, Yao et al. extend the boosting framework in [11] to transfer knowledge from multiple sources, which increases the chance of finding a source closely related to the target. As a result, two algorithms, MultiSourceTrAdaBoost and TaskTrAdaBoost, are proposed for object category recognition and object detection with satisfactory results.
To avoid inappropriate updating of the classifier, we propose a boosting-based framework that treats object tracking as a transfer learning problem. Generally speaking, video data (covering both the object and the background) near the current time are more related to the current scene than data far away from it. We therefore define the video data far from the current time as source data and the data near the current time as target data. Updating then amounts to learning the new knowledge contained in the target data so as to alter the classifier, and can be naturally treated as a transfer learning problem. Specifically, transferring knowledge from source to target keeps the classifier up to date, leading to a more stable tracker.
The remainder of the paper is organized as follows. In
Section 2 we introduce the boosting based transfer learning
framework for tracking. Experimental results are reported in
Section 3. Finally, the paper is concluded in Section 4.
2. TRACKING VIA TRANSFER LEARNING
The proposed transfer learning boosting tracker (named the TLBoost tracker) constructs classifiers to separate the object from the background. To keep them discriminative, the classifiers are trained online via transfer learning.
The workflow of the framework is shown in Fig. 1. In the current frame, a set of candidate patches is sampled around the object location estimated in the last frame, and the patch with the maximal likelihood under the latest classifiers is chosen as the current object state. Positive and negative instances are then selected in the current frame to serve as the target data, which represent the latest changes of the object and background. Next, we transfer knowledge from the source data (the target data of the last frame) to the target data; this enables the classifier to cope with variations of the object and background. Finally, the target data of the current frame serve as the source data for the next frame.
In the following, we successively introduce the patch representation, the classifiers, and the training of the classifiers, and finally illustrate how to track the object with the classifiers.
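The per-frame workflow just described can be sketched as a loop. This is a sketch of the control flow only, not the authors' implementation; the helpers `sample_candidates`, `select_instances`, and `transfer_update`, and the callable classifier, are hypothetical stand-ins for the components introduced in the following subsections.

```python
# Sketch of the per-frame TLBoost workflow described above.
# All helper names are hypothetical placeholders.

def track_sequence(frames, init_location, classifier,
                   sample_candidates, select_instances, transfer_update):
    location = init_location
    source_data = None
    for frame in frames:
        # 1. Sample candidate patches around the last estimated location.
        candidates = sample_candidates(frame, location)
        # 2. Pick the candidate the current classifier scores highest.
        location = max(candidates, key=classifier)
        # 3. Collect positive/negative instances near the new location;
        #    they form the target data for this frame.
        target_data = select_instances(frame, location)
        # 4. Transfer knowledge from the previous frame's data (source)
        #    to the current frame's data (target).
        if source_data is not None:
            classifier = transfer_update(classifier, source_data, target_data)
        # 5. This frame's target data becomes the next frame's source data.
        source_data = target_data
        yield location
```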
Fig. 1. The flow of the proposed framework.

2.1. Patch Representation

In our implementation, Haar-like features are adopted to represent a patch. Similar to [8], each feature consists of 2 to 6 rectangles weighted by real values. The feature's value is the weighted sum of the pixels within the rectangles.

2.2. Classifier

Weak classifier: each weak classifier h is a hypothesis composed of a Haar-like feature and two Gaussian distributions, N^+(\mu_+, \sigma_+) and N^-(\mu_-, \sigma_-). \mu_+ and \sigma_+ are the mean and standard deviation of the feature values over positive samples; \mu_- and \sigma_- are defined in a similar vein. For a given patch x, the label of a weak classifier h_n is softly determined by:

h_n(x) = \log \frac{p(y=1|x)}{p(y=0|x)}    (1)

where p(y=1|x) is the likelihood of the patch when it is labeled positive. This likelihood is formulated as:

p(y=1|x) = \frac{1}{\sqrt{2\pi}\,\sigma_+} \exp\left(-\frac{(f(x)-\mu_+)^2}{2\sigma_+^2}\right)    (2)

where f(x) is the feature value of patch x for a specific weak classifier; p(y=0|x) is defined in a similar way.

Strong classifier: the strong classifier is a weighted combination of weak classifiers:

H(x) = \sum_{n=1}^{N} \alpha_n h_n(x)    (3)

where \alpha_n is the weight of the corresponding weak classifier.

2.3. Selecting Weak Classifiers via Transfer Learning

Our approach maintains M weak classifiers at all times. First, we use the idea proposed by Babenko et al. [8] to update the M weak classifiers in parallel. Patches located within a radius r_1 of the object's current estimate are sampled as positive instances, and patches located at a radius between r_1 and r_2 (r_2 > r_1) are sampled as negative instances. The Haar-like feature values over these samples are then used to update the weak classifiers:

\mu_+ \leftarrow \beta\mu_+ + (1-\beta)\,\frac{1}{n_+}\sum_{y_i=1} f(x_i)    (4)

\sigma_+ \leftarrow \beta\sigma_+ + (1-\beta)\sqrt{\frac{1}{n_+}\sum_{y_i=1} (f(x_i)-\mu_+)^2}    (5)

where \beta is the learning rate, n_+ is the number of positive instances, and y_i = 1 indicates a positive label. The updating rules for \mu_- and \sigma_- are similarly defined.
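A minimal sketch of one such weak classifier, combining the Gaussian likelihoods of Eqs. (1)-(2) with the running updates of Eqs. (4)-(5). The class and attribute names are ours, the feature extractor is left abstract, and the update uses the pre-update mean in the standard-deviation term, as the formula is written.

```python
import math

class GaussianWeakClassifier:
    """One Haar-like feature plus two Gaussians N+(mu, sigma), N-(mu, sigma)."""

    def __init__(self, feature, beta=0.85):
        self.feature = feature          # callable: patch -> scalar f(x)
        self.beta = beta                # learning rate beta of Eqs. (4)-(5)
        self.mu_pos, self.sigma_pos = 0.0, 1.0
        self.mu_neg, self.sigma_neg = 0.0, 1.0

    @staticmethod
    def _likelihood(value, mu, sigma):
        # Gaussian likelihood as in Eq. (2).
        return (math.exp(-(value - mu) ** 2 / (2 * sigma ** 2))
                / (math.sqrt(2 * math.pi) * sigma))

    def __call__(self, patch):
        # Soft label: log-ratio of the two likelihoods, Eq. (1).
        v = self.feature(patch)
        pos = self._likelihood(v, self.mu_pos, self.sigma_pos)
        neg = self._likelihood(v, self.mu_neg, self.sigma_neg)
        return math.log(pos / neg)

    def update_positive(self, patches):
        # Running-average updates of Eqs. (4)-(5) over positive instances.
        vals = [self.feature(x) for x in patches]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - self.mu_pos) ** 2 for v in vals) / len(vals))
        self.mu_pos = self.beta * self.mu_pos + (1 - self.beta) * mean
        self.sigma_pos = self.beta * self.sigma_pos + (1 - self.beta) * std
```

An `update_negative` for \mu_- and \sigma_- would mirror `update_positive` over the negative instances.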
We then choose N (N < M) classifiers from the M classifiers. This step consists of two phases. In the first phase, we modify the traditional Adaboost algorithm to select weak classifiers sequentially over N iterations using the source data. In the traditional Adaboost framework, a weak classifier is selected by minimizing the classification error over the training set. In our modified framework, a weak classifier h_n is instead measured by its negative likelihood over the training set, formulated as:

\Upsilon_n = \sum_{i=1}^{S_N} (1-y_i)\,(H_{k-1}(x_i)+h_n(x_i)) - y_i\,(H_{k-1}(x_i)+h_n(x_i))    (6)

where y_i is the label and H_{k-1} is the strong classifier from the previous round. The current optimal weak classifier h_t is then determined by minimizing Equation (7):

t = \arg\min_i \Upsilon_i    (7)
Fig. 2. Images excerpted from the sequences. From top to bottom: david, boat, surfer, football and walking.

Fig. 3. RMSE plots for the sequences.

Its corresponding weight is calculated by:

\alpha_t = \exp(-\Upsilon_t/\theta)    (8)

where \theta is a regularization parameter. The pseudo-code of this phase is listed in Algorithm 1.

Algorithm 1 Phase 1 of the classifier selection
Input: source data S = \{x_i, y_i\}_{i=1}^{S_N}, all the weak classifiers \Phi = \{h_i\}_{i=1}^{M}, the maximum number of iterations N.
Output: set of candidate weak classifiers \Re
1. Empty the set of candidate weak classifiers: \Re \leftarrow \emptyset
2. for k \leftarrow 1 to N do
3.   for m \leftarrow 1 to M do
4.     Compute the negative likelihood \Upsilon_m
5.   Find the optimal classifier h_k via Equation (7)
6.   Calculate \alpha_k according to Equation (8)
7.   \Re \leftarrow \Re \cup \{h_k\}
Return \Re

In the second phase we import the target data to transfer knowledge from the source data to the target data. As the object and background undergo gradual changes over time, weak classifiers trained on the outdated source may no longer be discriminative enough. We therefore modify some of the weak classifiers obtained in the first phase so that the classifier adapts to these changes; this amounts to transferring knowledge from source to target. If a weak classifier performs worse than random guessing on the target data, it is reversed in order to improve the strong classifier. Reversing means swapping the means and standard deviations of the positive and negative distributions:

\mu_+ \Leftrightarrow \mu_-, \quad \sigma_+ \Leftrightarrow \sigma_-    (9)

The pseudo-code of this phase is shown in Algorithm 2.
Algorithm 2 Phase 2 of the classifier selection
Input: target data D = \{x_i, y_i\}_{i=1}^{D_N}, the set of candidate weak classifiers \Re = \{h_i, \alpha_i\}_{i=1}^{N}, the maximum number of iterations N.
Output: the final set of weak classifiers R
1. Empty the weak classifier set: R \leftarrow \emptyset
2. for m \leftarrow 1 to N do
3.   Calculate the error of h_m over the target data D:
     \epsilon \leftarrow \sum_{j=1}^{D_N} [y_j \neq \mathrm{sign}(h_m(x_j))]/D_N
4.   If \epsilon > 0.5 then h_m \leftarrow -h_m, \epsilon \leftarrow 1-\epsilon
5.   R \leftarrow R \cup \{(h_m, \alpha_m)\}
Return R

2.4. Tracking

We employ the tracking-by-detection strategy. All patches X = \{x_i\}_{i=1}^{C_N} located within a radius r of the object's estimate in the last frame are candidates. These patches are evaluated by the strong classifier, and the one with the maximal likelihood is chosen as the estimate of the object's state:

x_t = \arg\max_{x_i \in X} H(x_i)    (10)
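The error test and reversal of Algorithm 2 and Eq. (9) can be sketched as below. We assume each weak classifier exposes its Gaussian parameters through attributes (`mu_pos`, `mu_neg`, `sigma_pos`, `sigma_neg` are our names), so that "reversing" is the swap of Eq. (9).

```python
def adapt_to_target(candidates, target_data):
    """Phase 2 (Algorithm 2): reverse weak classifiers that do worse than
    chance on the target data by swapping their Gaussians, Eq. (9).

    candidates  -- list of (weak_classifier, alpha) pairs from Phase 1
    target_data -- list of (x, y) pairs with y in {0, 1}
    """
    final = []
    for h, alpha in candidates:
        # Classification error of h over the target data (sign test).
        wrong = sum(1 for x, y in target_data
                    if (1 if h(x) > 0 else 0) != y)
        eps = wrong / len(target_data)
        if eps > 0.5:
            # Worse than random guessing: swap the positive and negative
            # Gaussians so the weak classifier's decision flips, Eq. (9).
            h.mu_pos, h.mu_neg = h.mu_neg, h.mu_pos
            h.sigma_pos, h.sigma_neg = h.sigma_neg, h.sigma_pos
        final.append((h, alpha))
    return final
```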
3. EXPERIMENTS
Data sets: We tested our TLBoost tracker on five challenging video sequences, named david, boat, surfer, football and walking. Table 1 describes them.
Sequence | Frame length | Description
David    | 462 | 1, 2, 3, 4
Boat     | 407 | 2, 5
Surfer   | 172 | 2, 4, 5, 6
Football | 362 | 2, 6
Walking  | 140 | 4, 7

Table 1. Description of the sequences. Notes: 1. Illumination variation 2. Out-of-plane rotation 3. Expression changes 4. Scale changes 5. Blur 6. Occlusion 7. Rapid movement
Experimental setup: The object's state is represented as a rectangle of fixed size. To demonstrate that transfer learning results in a more stable tracker, we adopt the OAB tracker [6] and the MIL tracker [8] for comparison; both are classifier-based approaches that use Haar-like features for patch representation and a tracking-by-detection strategy, which makes the comparison meaningful. For fairness, the following parameters are identical for the three trackers. The search radius is 25. The radius for sampling positive samples is 5 (1 for the OAB tracker) and that for negative samples is 65. The number of weak classifiers M is set to 250 and the number of selected weak classifiers N is 50. The learning rate β is 0.85 and the regularization parameter θ is set to 0.4. We also compare with the Frag tracker [12], a deterministic tracker that is robust to occlusion thanks to its part-based appearance model.
Results: Fig. 2 shows sample images excerpted from the sequences. For quantitative comparison, we calculate the root mean square error every five frames. Note that for the TLBoost, MIL and OAB trackers we conduct five trials, as they involve slight randomness. From Table 2 we can see that on the boat and surfer sequences the proposed TLB tracker performs slightly worse than the MIL tracker; this is because the MIL tracker adopts multiple instance learning to avoid drifting, a mechanism our tracker lacks. On the remaining three sequences, our tracker achieves the best performance: with the help of transfer learning, it handles appearance changes of the object in a natural way.
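The evaluation metric can be sketched as follows. That the RMSE is computed over object-center coordinates, with the squared distance averaged over the sampled frames, is our assumption; the function name is ours.

```python
import math

def center_rmse(predicted, ground_truth, step=5):
    """Root mean square error between predicted and ground-truth object
    centers, sampled every `step` frames; points are (x, y) tuples."""
    pairs = list(zip(predicted, ground_truth))[::step]
    sq = [(px - gx) ** 2 + (py - gy) ** 2
          for (px, py), (gx, gy) in pairs]
    return math.sqrt(sum(sq) / len(sq))
```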
Sequence | OAB     | Frag    | MIL     | TLB
David    | 13.9845 | 15.3832 | 13.3943 | 7.5078
Boat     | 14.9304 | 18.3277 | 6.6299  | 7.6170
Surfer   | 16.7333 | 48.2638 | 12.2936 | 14.0845
Football | 52.4364 | 60.7873 | 29.8524 | 22.1962
Walking  | 18.9025 | 46.2120 | 29.4691 | 9.4324

Table 2. RMSE for the sequences. Bold indicates the best performance and green the second best.
4. CONCLUSION
In this paper we have extended the boosting framework for
transfer learning and applied it to object tracking. This framework allows us to obtain a more discriminative classifier by
transferring knowledge from the outdated training data to the
latest training data, leading to a more robust tracker. Experimental results on various challenging sequences have verified
the effectiveness of the proposed framework.
5. REFERENCES
[1] H. Wang, D. Suter, K. Schindler and C. Shen, Adaptive object tracking
based on an effective appearance filter, PAMI, pp. 1661-1667, 2007.
[2] D. A. Ross, J. Lim, R. S. Lin and M. H. Yang, Incremental learning
for robust visual tracking, IJCV, vol. 77, no. 1, pp. 125-141, 2008.
[3] M. J. Black and A. D. Jepson, Eigentracking: Robust matching
and tracking of articulated objects using a view-based representation,
IJCV, vol. 26, no. 1, pp. 63-84, 1998.
[4] S. Avidan, Ensemble tracking, PAMI, pp. 261-271, 2007.
[5] S. Avidan, Support vector tracking, PAMI, vol. 26, no. 8, pp. 1064-1072, 2004.
[6] H. Grabner and H. Bischof, On-line boosting and vision, in CVPR,
vol. 1, pp. 260-267, 2006.
[7] A. Saffari, C. Leistner, J. Santner, M. Godec and H. Bischof, On-line
random forests, in Proc. IEEE OLCV Workshop, 2009.
[8] B. Babenko, M.H. Yang and S. Belongie, Visual tracking with online
multiple instance learning, in CVPR, pp. 983-990, 2009.
[9] Y. Zha, Y. Yang and D. Bi, Graph-based transductive learning for robust visual tracking, Pattern Recognition, vol. 43, no. 1, pp. 187-196,
2010.
[10] W. Dai, Q. Yang, G.R. Xue and Y. Yu, Boosting for transfer learning,
Proceedings of the 24th international conference on Machine learning,
pp. 193-200, 2007.
[11] Y. Yao and G. Doretto, Boosting for transfer learning with multiple
sources, in CVPR, pp. 1855-1862, 2010.
[12] A. Adam, E. Rivlin and I. Shimshoni, Robust fragments-based tracking using the integral histogram, in CVPR, vol. 1, pp. 798-805, 2006.