ROBUST VISUAL TRACKING VIA TRANSFER LEARNING

Wenhan Luo†, Xi Li††, Wei Li†, Weiming Hu†
† National Lab of Pattern Recognition, Institute of Automation, CAS, Beijing, China.
†† School of Computer Science, University of Adelaide, Australia.

ABSTRACT

In this paper, we propose a boosting-based tracking framework using transfer learning. To deal with complex appearance variations, the proposed framework uses discriminative information from previous frames to carry out the tracking task in the current frame. It thereby transfers prior knowledge from the previous (source) data domain to the current (target) data domain, resulting in a highly discriminative tracker for distinguishing the object from the background. The proposed tracking system has been tested on several challenging sequences. Experimental results demonstrate the effectiveness of the proposed tracking framework.

Index Terms: tracking, transfer learning, boosting

1. INTRODUCTION

Object tracking has attracted increasing attention because of its crucial role in many applications such as surveillance, human-computer interaction, and video analysis. Various methods have been proposed for object tracking. Most of them can be grouped into two categories: appearance model based and classifier based. The appearance model based methods aim to model the statistical properties of tracked objects and carry out object localization by probabilistic model matching. Examples include SMOG [1], IVT [2], and Eigentracking [3]. However, their common drawback is that they ignore discriminative information from the background.

In contrast, classifier based methods treat object tracking as a classification problem and attempt to find the optimal decision boundary that best separates the object from the background. For instance, an ensemble of weak classifiers is learned online to adapt to changes in the object and the background [4].
Avidan's support vector tracker [5] constructs an optical-flow-based SVM classifier for object/non-object classification. In [6], an online Adaboost feature selection algorithm is proposed for tracking. A randomized learning technique called Random Forests is developed in [7] to select features online for visual tracking. Note that the key issue in the aforementioned classifier-based methods is how to construct robust classifiers for distinguishing the object from the background. Thus, it is essential for trackers to adapt as both the object and the background change.

Recently, other machine learning techniques have been applied to object tracking. Babenko et al. [8] use multiple instance learning to avoid the drift problem, and thus achieve superior results. Zha et al. [9] propose a tracking method based on transductive learning over labeled and unlabeled samples. This method can obtain greater class separability, leading to better tracking performance.

More recently, transfer learning has become popular. Traditional machine learning assumes that the training data and the test data follow the same distribution, but this assumption does not always hold. To address this problem, Dai et al. present a transfer learning framework for classification called TrAdaboost [10]. TrAdaboost extends traditional boosting-based learning methods so as to transfer knowledge from an old domain to a new domain. It allows an accurate classification model to be learned using only a tiny amount of new data and a large amount of old data. However, its performance depends heavily on the relationship between the old data (source) and the new data (target). Brute-force use of a source that is poorly related to the target may decrease the performance of the classifier, which is known as negative transfer. To reduce negative transfer, Yao et al. extend the boosting framework in [11] to transfer knowledge from multiple sources, which increases the chance of finding a source closely related to the target. As a result, two algorithms, MultiSourceTrAdaBoost and TaskTrAdaBoost, are proposed for object category recognition and object detection with satisfactory results.

To avoid inappropriate updating of the classifier, we propose a boosting-based framework that treats object tracking as a transfer learning problem. Generally speaking, the video data (including the object and background) near the current time are more related to the current scene than those far from the current time. We therefore define the video data far from the current time as source data and the video data near the current time as target data. Updating is equivalent to learning the new knowledge contained in the target data so as to alter the classifier. Therefore, the updating task can be naturally treated as a transfer learning problem. Specifically, transferring knowledge from source to target keeps the classifier up to date, which leads to a more stable tracker.

The remainder of the paper is organized as follows. In Section 2 we introduce the boosting-based transfer learning framework for tracking. Experimental results are reported in Section 3. Finally, the paper is concluded in Section 4.

This work is partly supported by NSFC (Grant No. 60825204 and 60935002).

2. TRACKING VIA TRANSFER LEARNING

The proposed transfer learning boost tracker (named TLBoost tracker) constructs classifiers to separate the object from the background. To ensure discrimination, its classifiers are trained online via transfer learning. The workflow of the framework is shown in Fig. 1. In the current frame, a set of candidate patches is sampled around the object location estimated in the last frame. Using the latest classifiers, the patch with the maximal likelihood is chosen as the current object state.
Then positive and negative instances are selected in the current frame to serve as the target data, which represent the latest changes of the object and background. Next, we transfer knowledge from the source data (the target data of the last frame) to the target data. This process enables the classifier to cope with variations of the object and background. Finally, the target data of the current frame serve as the source data for the next frame. In the following, we introduce in turn the patch representation, the classifiers, the training of the classifiers, and finally how to track the object with the classifiers.

Fig. 1. The flow of the proposed framework.

2.1. Patch Representation

In our implementation, Haar-like features are adopted to represent a patch. Similar to [8], each feature consists of 2 to 6 rectangles, each weighted by a real value. The feature's value is the weighted sum of the pixel values within its rectangles.

2.2. Classifier

Weak classifier: Each weak classifier $h$ is a hypothesis composed of a Haar-like feature and two Gaussian distributions, $N^+(\mu_+, \sigma_+)$ and $N^-(\mu_-, \sigma_-)$. Here $\mu_+$ and $\sigma_+$ are the mean and standard deviation of the feature values over positive samples; $\mu_-$ and $\sigma_-$ are defined analogously over negative samples.

Strong classifier: For a given patch $x$ and a weak classifier $h_n$, the label is softly determined by

$$h_n(x) = \log \frac{p(y=1 \mid x)}{p(y=0 \mid x)} \tag{1}$$

where $p(y=1 \mid x)$ is the likelihood of the patch when it is labeled positive. This likelihood is formulated as

$$p(y=1 \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma_+} \exp\left(-\frac{(f(x)-\mu_+)^2}{2\sigma_+^2}\right) \tag{2}$$

where $f(x)$ is the feature value of patch $x$ for the specific weak classifier; $p(y=0 \mid x)$ is defined in a similar way using $\mu_-$ and $\sigma_-$. The strong classifier is a weighted combination of weak classifiers:

$$H(x) = \sum_{n=1,\dots,N} \alpha_n h_n(x) \tag{3}$$

where $\alpha_n$ is the weight of the corresponding weak classifier.

2.3. Selecting Weak Classifiers via Transfer Learning

Our approach maintains $M$ weak classifiers at all times. First, we follow the idea proposed by Babenko et al. [8] to update the $M$ weak classifiers in parallel. Patches located within a radius $r_1$ of the object's current estimate are sampled as positive instances, and patches located at a distance between $r_1$ and $r_2$ ($r_2 > r_1$) from the object's current estimate are sampled as negative instances. The Haar-like feature values over these samples are then used to update the weak classifiers:

$$\mu_+ \leftarrow \beta\mu_+ + (1-\beta)\,\frac{1}{n_+} \sum_{i:\,y_i=1} f(x_i) \tag{4}$$

$$\sigma_+ \leftarrow \beta\sigma_+ + (1-\beta)\,\sqrt{\frac{1}{n_+} \sum_{i:\,y_i=1} (f(x_i)-\mu_+)^2} \tag{5}$$

where $\beta$ is the learning rate, $n_+$ is the number of positive instances, and $y_i = 1$ indicates that instance $i$ is labeled positive. The updating rules for $\mu_-$ and $\sigma_-$ are defined similarly.

We then choose $N$ ($N < M$) classifiers from the $M$ classifiers. This step consists of two phases. In the first phase, we modify the traditional Adaboost algorithm to select weak classifiers sequentially over $N$ iterations using the source data. In the traditional Adaboost framework, a weak classifier is selected by minimizing the classification error over the training set. In our modified Adaboost framework, a weak classifier $h_n$ is measured by the negative likelihood over the training set, formulated as

$$\Upsilon_n = \sum_{i=1,\dots,S_N} \Big[ (1-y_i)\big(H_{k-1}(x_i)+h_n(x_i)\big) - y_i\big(H_{k-1}(x_i)+h_n(x_i)\big) \Big] \tag{6}$$

where $y_i$ is the label and $H_{k-1}$ is the strong classifier from the previous round. The current optimal weak classifier $h_t$ is then the one that minimizes this negative likelihood:

$$t = \arg\min_{n} \Upsilon_n \tag{7}$$

Its corresponding weight is calculated by

$$\alpha_t = \exp(-\Upsilon_t/\theta) \tag{8}$$

where $\theta$ is a regularization parameter. The pseudo-code of this phase is listed in Algorithm 1.

Algorithm 1 Phase 1 of the classifier selection
Input: source data $S = \{x_i, y_i\}_{i=1}^{S_N}$, all the weak classifiers $\Phi = \{h_i\}_{i=1}^{M}$, the maximum number of iterations $N$.
Output: set of candidate weak classifiers $\Re$
1. Empty the set of candidate weak classifiers, $\Re \leftarrow \emptyset$
2. for $k \leftarrow 1$ to $N$ do
3.     for $m \leftarrow 1$ to $M$ do
4.         Compute the negative likelihood $\Upsilon_m$
5.     Find the optimal classifier $h_k$ via Equation (7).
6.     Calculate $\alpha_k$ according to Equation (8).
7.     $\Re \leftarrow \Re \cup h_k$
Return $\Re$

In the second phase, we bring in the target data to transfer knowledge from the source data to the target data. As the object and background undergo gradual changes over time, weak classifiers trained with the outdated source may not be discriminative enough for classification. We therefore modify some of the weak classifiers obtained in the first phase so that the classifier adapts to the changes of the object and background. This amounts to transferring knowledge from source to target. If a weak classifier performs worse than random guessing on the target data, it is reversed in order to improve the strong classifier. Reversal means swapping the means and standard deviations of the positive and negative samples:

$$\mu_+ \Leftrightarrow \mu_-, \qquad \sigma_+ \Leftrightarrow \sigma_- \tag{9}$$

The pseudo-code of this phase is shown in Algorithm 2.

Algorithm 2 Phase 2 of the classifier selection
Input: target data $D = \{x_i, y_i\}_{i=1}^{D_N}$, the set of candidate weak classifiers $\Re = \{h_i, \alpha_i\}_{i=1}^{N}$, the maximum number of iterations $N$.
Output: the final set of weak classifiers $R$
1. Empty the weak classifier set, $R \leftarrow \emptyset$
2. for $m \leftarrow 1$ to $N$ do
3.     Calculate the error of $h_m$ over the target data $D$: $\epsilon \leftarrow \sum_{j=1,\dots,D_N} [y_j \neq \mathrm{sign}(h_m(x_j))] / D_N$
4.     If $\epsilon > 0.5$ then $h_m \leftarrow -h_m$, $\epsilon \leftarrow 1 - \epsilon$
5.     $R \leftarrow R \cup (h_m, \alpha_m)$
Return $R$

2.4. Tracking

We employ the tracking-by-detection strategy for tracking. All the patches $X = \{x_i\}_{i=1}^{C_N}$ located within a radius $r$ of the object's estimated location in the last frame are candidates. These patches are evaluated by the strong classifier, and the one with the maximal likelihood is chosen as the estimate of the object's state. This can be formulated as

$$x_t = \arg\max_{x_i \in X} H(x_i) \tag{10}$$

Fig. 2. Images excerpted from the sequences. From top to bottom: david, boat, surfer, football and walking.

Fig. 3. RMSE plots for the sequences.

3. EXPERIMENTS

Data sets: We tested our TLBoost tracker on five challenging video sequences, named david, boat, surfer, football and walking.
Table 1 describes them.

Sequence    Frame length    Description
David       462             1, 2, 3, 4
Boat        407             2, 5
Surfer      172             2, 4, 5, 6
Football    362             2, 6
Walking     140             4, 7

Table 1. Description of the sequences. Notes: 1. Illumination variation; 2. Out-of-plane rotation; 3. Expression changes; 4. Scale changes; 5. Blur; 6. Occlusion; 7. Rapid movement.

Experimental setup: The object's state is represented as a rectangle of fixed size. To demonstrate that transfer learning results in a more stable tracker, we adopt the OAB tracker [6] and the MIL tracker [8] for comparison. Both are classifier based approaches that use Haar-like features for patch representation and the tracking-by-detection strategy for tracking, so comparing against them is a meaningful validation of our tracker. For fairness, the following parameters are identical for the three trackers. The search radius is 25. The radius for sampling positive samples is 5 (for the OAB tracker it is 1) and the radius for negative samples is 65. The number of features M is set to 250 and the number of selected weak classifiers N is 50. The learning rate β is 0.85. The regularization parameter θ is set to 0.4. We also conduct a comparison with the Frag tracker [12], which is a deterministic tracker. This tracker is robust to occlusion because its appearance model is part based.

Results: Fig. 2 shows sample images excerpted from the sequences. For quantitative comparison, we calculate the root mean square error (RMSE) every five frames. Note that for the TLBoost tracker, MIL tracker and OAB tracker, we run five trials, as they involve slight randomness. Table 2 shows that on the boat and surfer sequences the proposed TLBoost (TLB) tracker performs slightly worse than the MIL tracker. This is because the MIL tracker adopts multiple instance learning to avoid drifting, whereas our tracker has no such mechanism.
For the remaining three sequences, our tracker achieves the best performance. With the help of transfer learning, our tracker can handle appearance changes of the object in a natural way.

Sequence    OAB        Frag       MIL         TLB
David       13.9845    15.3832    13.3943†    7.5078*
Boat        14.9304    18.3277    6.6299*     7.6170†
Surfer      16.7333    48.2638    12.2936*    14.0845†
Football    52.4364    60.7873    29.8524†    22.1962*
Walking     18.9025†   46.2120    29.4691     9.4324*

Table 2. RMSE for the sequences. * indicates the best performance (lowest RMSE) and † the second best.

4. CONCLUSION

In this paper we have extended the boosting framework for transfer learning and applied it to object tracking. This framework allows us to obtain a more discriminative classifier by transferring knowledge from the outdated training data to the latest training data, leading to a more robust tracker. Experimental results on various challenging sequences have verified the effectiveness of the proposed framework.

5. REFERENCES

[1] H. Wang, D. Suter, K. Schindler and C. Shen, Adaptive object tracking based on an effective appearance filter, PAMI, pp. 1661-1667, 2007.
[2] D. A. Ross, J. Lim, R. S. Lin and M. H. Yang, Incremental learning for robust visual tracking, IJCV, vol. 77, no. 1, pp. 125-141, 2008.
[3] M. J. Black and A. D. Jepson, Eigentracking: Robust matching and tracking of articulated objects using a view-based representation, IJCV, vol. 26, no. 1, pp. 63-84, 1998.
[4] S. Avidan, Ensemble tracking, PAMI, pp. 261-271, 2007.
[5] S. Avidan, Support vector tracking, PAMI, vol. 26, no. 8, pp. 1064-1072, 2004.
[6] H. Grabner and H. Bischof, On-line boosting and vision, in CVPR, vol. 1, pp. 260-267, 2006.
[7] A. Saffari, C. Leistner, J. Santner, M. Godec and H. Bischof, On-line random forests, in Proc. IEEE OLCV Workshop, 2009.
[8] B. Babenko, M. H. Yang and S. Belongie, Visual tracking with online multiple instance learning, in CVPR, pp. 983-990, 2009.
[9] Y. Zha, Y. Yang and D. Bi, Graph-based transductive learning for robust visual tracking, Pattern Recognition, vol. 43, no. 1, pp. 187-196, 2010.
[10] W. Dai, Q. Yang, G. R. Xue and Y. Yu, Boosting for transfer learning, Proceedings of the 24th International Conference on Machine Learning, pp. 193-200, 2007.
[11] Y. Yao and G. Doretto, Boosting for transfer learning with multiple sources, in CVPR, pp. 1855-1862, 2010.
[12] A. Adam, E. Rivlin and I. Shimshoni, Robust fragments-based tracking using the integral histogram, in CVPR, vol. 1, pp. 798-805, 2006.
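
ILLUSTRATIVE SKETCH

As a supplement to Section 2, the following Python fragment sketches how the Gaussian weak classifier of Equations (1) and (2), the running update of Equations (4) and (5), the reversal test of Algorithm 2 with Equation (9), and the candidate selection of Equations (3) and (10) might fit together. It is a minimal sketch under stated assumptions, not the implementation used in the experiments. The names WeakClassifier, log_gauss, blend, adapt_to_target and track are hypothetical. Scalar feature values are assumed to be given (Haar-like feature extraction is omitted), Equation (5) is assumed to use the freshly updated mean, labels are assumed to be 0/1 with a sample predicted positive when the score is above zero, and a small floor on the standard deviation is added for numerical safety. Phase 1 (Algorithm 1) is not covered here.

```python
import math
from dataclasses import dataclass

EPS = 1e-9  # assumption: floor on sigma to avoid division by zero


def log_gauss(f, mu, sigma):
    """Log of the Gaussian density in Eq. (2)."""
    sigma = max(sigma, EPS)
    return -0.5 * ((f - mu) / sigma) ** 2 - math.log(math.sqrt(2.0 * math.pi) * sigma)


def blend(mu, sigma, values, beta):
    """Eqs. (4)-(5) for one class; assumes Eq. (5) uses the updated mean."""
    mu_new = beta * mu + (1.0 - beta) * sum(values) / len(values)
    spread = math.sqrt(sum((v - mu_new) ** 2 for v in values) / len(values))
    return mu_new, beta * sigma + (1.0 - beta) * spread


@dataclass
class WeakClassifier:
    """Hypothetical container: one feature with two Gaussian models."""
    mu_pos: float = 0.0
    sigma_pos: float = 1.0
    mu_neg: float = 0.0
    sigma_neg: float = 1.0

    def score(self, f):
        """Eq. (1): log-ratio of the positive and negative likelihoods."""
        return (log_gauss(f, self.mu_pos, self.sigma_pos)
                - log_gauss(f, self.mu_neg, self.sigma_neg))

    def update(self, pos_values, neg_values, beta=0.85):
        """Running update of both Gaussians with learning rate beta."""
        self.mu_pos, self.sigma_pos = blend(self.mu_pos, self.sigma_pos, pos_values, beta)
        self.mu_neg, self.sigma_neg = blend(self.mu_neg, self.sigma_neg, neg_values, beta)

    def reverse(self):
        """Eq. (9): swap the two Gaussians, which negates the score."""
        self.mu_pos, self.mu_neg = self.mu_neg, self.mu_pos
        self.sigma_pos, self.sigma_neg = self.sigma_neg, self.sigma_pos


def adapt_to_target(clf, values, labels):
    """Algorithm 2, steps 3-4: reverse clf if its target error exceeds 0.5.

    Assumption: labels are 0/1 and a sample is predicted positive if score > 0.
    """
    wrong = sum(1 for f, y in zip(values, labels) if (clf.score(f) > 0) != (y == 1))
    err = wrong / len(labels)
    if err > 0.5:
        clf.reverse()
        err = 1.0 - err
    return err


def track(classifiers, alphas, candidates):
    """Eqs. (3) and (10): index of the candidate patch with the highest H(x).

    candidates[i][n] is the value of the n-th selected feature on patch i.
    """
    def strong(features):
        return sum(a * c.score(f) for c, a, f in zip(classifiers, alphas, features))
    return max(range(len(candidates)), key=lambda i: strong(candidates[i]))
```

With equal standard deviations and means of +1 and -1, the score in this sketch reduces to 2f, so reversing the classifier simply flips the sign of its output. This matches step 4 of Algorithm 2, where a weak classifier that errs on more than half of the target data is replaced by its negation.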