Video Tracking Using Learned Hierarchical Features
Li Wang, Ting Liu, Gang Wang, Kap Luk Chan, and Qingxiong Yang
IEEE Transactions on Image Processing, vol. 24, no. 4, April 2015

Outline: Abstract, Introduction, Tracking System Overview, Learning Features for Video Tracking, Experiments, Conclusions

Introduction

Deep learning is a new area of machine learning research, moving machine learning closer to one of its original goals: artificial intelligence. Deep networks are multi-node and multi-layer, building representations from pixels, to lines and boundaries, to recognized objects. Deep learning has achieved impressive performance on image classification, action recognition, speech recognition, etc.

[Figure: visualizations of the features learned at the first, second, third, and fourth layers of a deep network]

N. Wang and D.-Y. Yeung, "Learning a deep compact image representation for visual tracking," in Proc. NIPS, 2013, pp. 809–817.

A drawback of DLT [4] is that it does not have an integrated objective function to bridge offline training and online tracking. To address this issue, we propose a domain-adaptation-based deep learning method to learn hierarchical features for model-free object tracking.

[Figure: framework overview, combining offline learning and online learning]

We employ the limited-memory BFGS (L-BFGS) algorithm [7] to solve the optimization problem in the adaptation module.

J. Nocedal, "Updating quasi-Newton matrices with limited storage," Math. Comput., vol. 35, no. 151, pp. 773–782, 1980.

Tracking System Overview

The tracking system with the adaptive structural local sparse appearance model (ASLSA) [32] achieves very good performance. Hence, we integrate our feature learning method into this system.

X. Jia, H. Lu, and M.-H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in Proc. IEEE Conf. CVPR, Jun. 2012, pp. 1822–1829.

Given an observation set of the target $x_{1:t} = \{x_1, \ldots, x_t\}$ and a corresponding feature representation set $z_{1:t} = \{z_1, \ldots, z_t\}$, tracking follows the standard Bayesian filtering formulation: the target state $s_t$ is estimated recursively via
$$p(s_t \mid z_{1:t}) \propto p(z_t \mid s_t) \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid z_{1:t-1})\, ds_{t-1},$$
where $p(z_t \mid s_t)$ is the appearance model and $p(s_t \mid s_{t-1})$ is the motion model.

Learning Features for Video Tracking

Pre-Learning Generic Features

Given the offline training patch $x_i$ from the $i$-th frame, the corresponding learned feature is $f_i = W x_i$. The feature transformation matrix $W$ is learned by solving an unconstrained minimization problem that combines the ICA reconstruction cost of [7's companion reference below] with a temporal slowness constraint on the features of consecutive frames:
$$\min_W \; \lambda \sum_i \|W^\top W x_i - x_i\|_2^2 + \sum_i \|W x_i - W x_{i+1}\|_1.$$

Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, "ICA with reconstruction cost for efficient overcomplete feature learning," in Proc. NIPS, 2011, pp. 1017–1025.

The first layer extracts features robust to local motion patterns, e.g., translations. From the second layer, we can extract features robust to more complicated motion transformations.
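To make the pre-learning step concrete, below is a minimal Python sketch (not the authors' code) of an objective of this form: the RICA reconstruction cost of Le et al. combined with a smooth-L1 temporal slowness penalty on features of consecutive frames. The slides employ L-BFGS for the adaptation module; using it here for pre-learning as well is an illustrative choice. The function names and the hyperparameters `lam` and `eps` are assumptions, and the paper's full objective may differ in details (e.g., an additional sparsity term).

```python
# Sketch of pre-learning a feature transformation matrix W from patches of
# consecutive frames: RICA reconstruction cost + smooth-L1 temporal slowness.
import numpy as np
from scipy.optimize import minimize

def objective_and_grad(w_flat, X, n_feat, lam=0.1, eps=1e-6):
    """X: (d, N) matrix of vectorized patches from consecutive frames.
    Returns J(W) and dJ/dW for
        J(W) = lam * sum_i ||W^T W x_i - x_i||^2              (reconstruction)
             + sum_i sum_j sqrt((W x_i - W x_{i+1})_j^2 + eps) (slowness).
    """
    d, _ = X.shape
    W = w_flat.reshape(n_feat, d)

    # Reconstruction cost and its gradient (standard RICA terms).
    R = W.T @ (W @ X) - X                      # residuals, (d, N)
    J_rec = lam * np.sum(R ** 2)
    G_rec = 2.0 * lam * (W @ (X @ R.T + R @ X.T))

    # Temporal slowness: smooth L1 on feature differences between frames.
    D = X[:, :-1] - X[:, 1:]                   # patch differences, (d, N-1)
    S = W @ D                                  # feature differences
    A = np.sqrt(S ** 2 + eps)                  # smooth |.| approximation
    J_slow = np.sum(A)
    G_slow = (S / A) @ D.T

    return J_rec + J_slow, (G_rec + G_slow).ravel()

def pre_learn_features(X, n_feat, n_iter=200, seed=0):
    """Learn W from auxiliary video patches by L-BFGS minimization."""
    d = X.shape[0]
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((n_feat, d)) * 0.01
    res = minimize(objective_and_grad, W0.ravel(), args=(X, n_feat),
                   jac=True, method="L-BFGS-B",
                   options={"maxiter": n_iter})
    return res.x.reshape(n_feat, d)

# Usage: X holds vectorized patches from consecutive frames of auxiliary
# videos, e.g. X = np.random.rand(256, 1000); W = pre_learn_features(X, 400).
# The learned feature for a patch x is then f = W @ x.
```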
Domain Adaptation Module

Given a target video sequence, we employ ASLSA [32] to track the target object in the first N frames and use the tracking results as the training data for the adaptation module. The adapted feature is denoted as $f_i = W x_i$, where $x_i$ is the object image patch in the $i$-th frame of the training data and $W$ is the feature transformation matrix to be learned.

We formulate the adaptation module by adding a regularization term, $\mu \|W - \widetilde{W}\|_F^2$, to the feature learning objective, where $\widetilde{W}$ denotes the pre-learned feature transformation matrix and $\mu$ balances adaptation to the target against staying close to the pre-learned features.

Optimization

The adaptation problem is solved with L-BFGS [7]. Notation: $x$ denotes the object region and $w$ an element of $W$.

J. Nocedal, "Updating quasi-Newton matrices with limited storage," Math. Comput., vol. 35, no. 151, pp. 773–782, 1980.

Experiments

First, we evaluate our learned hierarchical features to demonstrate their robustness to complicated motion transformations. Second, we evaluate the temporal slowness constraint and the adaptation module in our feature learning algorithm. Third, we evaluate our tracker's ability to handle typical problems in visual tracking. Then, we compare our tracker with 14 state-of-the-art trackers. Moreover, we present comparison results between DLT [4] and our tracker. Finally, we demonstrate the generalizability of our feature learning algorithm on two other tracking methods.

We use two measurements to quantitatively evaluate tracking performance: center location error and overlap rate.

[Figure: qualitative tracking results. The purple, green, cyan, blue, and red bounding boxes refer to ASLSA [32]_RAW, ASLSA [32]_HOG, ℓ1_APG [51], CT_DIF [36], and our tracker, respectively.]

N. Li and J. J. DiCarlo, "Unsupervised natural experience rapidly alters invariant object representation in visual cortex," Science, vol. 321, no. 5895, pp. 1502–1507, Sep. 2008.

The temporal slowness constraint is beneficial for learning features robust to complicated motion transformations.

[Figures: the tracker's ability to handle typical problems in visual tracking; our tracker vs. DLT]

To demonstrate the generalizability of our learned features, we integrate our feature learning algorithm into another baseline tracker, the incremental visual tracker (IVT) [12].

D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 125–141, May 2008.

In addition, we verify the generalizability of our learned features by using the ℓ1_APG tracker [51] and evaluating performance on the same 12 sequences used for IVT. ℓ1_APG can hardly handle these challenging sequences with complicated motion transformations. In contrast, integrating our learned features into ℓ1_APG succeeds in tracking objects in 6 of the 12 sequences.

Conclusions

We learn generic features from auxiliary video sequences by using a two-layer convolutional neural network. Moreover, we propose an adaptation module to adapt the pre-learned features to specific target objects. As a result, the adapted features are robust to both complicated motion transformations and appearance changes of specific target objects.

Thank you for listening!
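As an appendix-style note, here is a minimal sketch of the two quantitative measures used in the experiments, center location error and overlap rate, assuming axis-aligned bounding boxes given as (x, y, w, h) with (x, y) the top-left corner; the helper names are hypothetical, not from the paper's code.

```python
import numpy as np

def center_location_error(box_a, box_b):
    """Euclidean distance between the centers of two bounding boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ca = np.array([ax + aw / 2.0, ay + ah / 2.0])
    cb = np.array([bx + bw / 2.0, by + bh / 2.0])
    return float(np.linalg.norm(ca - cb))

def overlap_rate(box_a, box_b):
    """PASCAL-style overlap: area(A ∩ B) / area(A ∪ B)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Usage against a ground-truth box:
# cle = center_location_error((10, 10, 40, 60), (14, 12, 40, 60))
# ovr = overlap_rate((10, 10, 40, 60), (14, 12, 40, 60))
```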