Real-time Ship Tracking via Enhanced MIL Tracker

Fei Teng, Qing Liu, Jianming Guo
School of Automation, Wuhan University of Technology, Wuhan, 430070, China

Abstract

Ship tracking plays a key role in inland waterway CCTV (Closed-Circuit Television) video surveillance. Despite much development and success, long-term robust ship tracking remains challenging due to factors such as pose variation, illumination change, occlusion, and motion blur. Recently, a technique called "tracking by detection" has been developed and studied with promising results. A typical tracking-by-detection algorithm, MIL (Multiple Instance Learning), has become one of the most popular methods in the tracking domain. It is designed to alleviate the drift problem by using an MIL-based appearance model that represents training data in the form of bags. In this paper, we enhance the MIL tracker in two aspects. First, we propose a new bag model that naturally integrates the importance of the samples. Second, to address a potential overfitting problem, we propose a dynamic function that replaces the original logistic function for estimating instance probability. Experimental results on challenging inland waterway CCTV video sequences demonstrate that our tracker achieves favorable performance.

Keywords: CCTV, ship tracking, detection, MIL

1. Introduction

Recent years have witnessed great development of CCTV systems in the domain of inland waterway ship video surveillance [1-3]. Benefiting from the combined efforts of AIS (Automatic Identification System), GPS (Global Positioning System), and inland waterway electronic charts [4-6], such a system acts as an eye for observers to tell whether a ship is retrograding, overtaking, or parking without regulation.
Numerous tracking algorithms applicable to the inland waterway ship tracking domain have been proposed in the literature over the past couple of decades. Based on their appearance representation schemes, they can be categorized into two classes: generative models and discriminative ones. Generative algorithms typically learn an appearance model to represent the object and then search for the image region with minimal reconstruction error as the tracking result for the current frame. To deal with appearance variation, Ross et al. [7] propose a tracking method that incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes in the appearance of the target. The model update, based on incremental algorithms for principal component analysis, includes two important features: a method for correctly updating the sample mean, and a forgetting factor that ensures less modeling power is expended fitting older observations. Both features contribute measurably to overall tracking performance. More recently, Mei et al. [8] propose an ℓ1-tracker based on sparse representation, in which the object is modeled by a sparse linear combination of target and trivial templates, and partial occlusion is treated as arbitrary but sparse noise. Because the method essentially solves a series of ℓ1-minimization problems, its computational complexity is high, which limits its performance. Although dozens of methods have been proposed to speed up the ℓ1-tracker, they fail to substantially improve the trade-off between accuracy and time consumption. In short, these generative models do not take background information into account, throwing away useful cues that could help discriminate the object from the background.
On the other hand, discriminative approaches treat visual tracking as a binary classification problem: rather than building an exact representation of the target, they try to find a decision boundary between the object and its surrounding background within a local region. Discriminative approaches therefore utilize not only features extracted from the object but also features from the background. Collins et al. [9] have demonstrated that selecting discriminative features in an online manner can greatly improve tracking performance. Several successful tracking algorithms train their classifiers using boosting techniques. Grabner et al. propose an on-line AdaBoost feature selection algorithm that, depending on the background, selects the most discriminative features for tracking, yielding stable results and naturally handling appearance changes of the object such as illumination changes or out-of-plane rotations. However, because it uses the tracking result in the current frame as the only positive example for updating the classifier, even a slight inaccuracy in the tracked position introduces mislabeled training samples. These errors may accumulate over time, leading to model drift or even tracking failure. Grabner et al. [11] present an on-line semi-supervised boosting method whose main idea is to formulate the update process in a semi-supervised fashion, as the combined decision of a given prior and an on-line classifier; this requires no parameter tuning and significantly alleviates the drifting problem. Its weakness is that the classifier is trained with labeled examples from the first frame only, leaving the samples in subsequent frames unlabeled, which discards valuable motion information between frames. Motivated by the way multiple instance learning resolves a similar ambiguity in face detection, Babenko et al.
[12] propose to employ a multiple instance learning approach for the object tracking problem, resulting in the MIL tracker. Their seminal work uses an MIL-based appearance model that represents training data in the form of bags, obeying the rule that a positive bag must contain at least one positive example while all examples in a negative bag are negative. The positive examples are cropped around the object, while the negative ones are sampled far from the object location. The classifier is then trained in an online manner using a bag likelihood function. Because some flexibility is allowed in separating training examples, the MIL tracker can resolve the inherent labeling ambiguity to some degree, leading to more robust tracking. Zhang [13] points out that positive instances near the object location should contribute more to the bag probability, i.e., the weight of an instance near the object should be larger than that of one far from the object location; this integrates sample importance into the learning procedure and yields the weighted MIL (WMIL) tracker, which demonstrates superior performance. We observe that the Noisy-OR model used in the MIL tracker ignores the importance of the positive samples, so the MIL tracker may select less effective positive samples. Conversely, although WMIL integrates sample importance into the learning procedure, defining the bag probability as a weighted sum of instance probabilities weakens the discriminative power of the final strong classifier and may introduce new ambiguity: the normalized weights are not sufficient to preserve the desired property that if one instance in a bag has a high probability, the bag probability is high as well. Another important issue is that the standard logistic function, which maps the output of the strong classifier to the probability of an instance being positive, suffers from a potential overfitting problem that severely degrades the discriminative model.
Motivated by the above discussion, in this paper we enhance the MIL tracker in two aspects. First, we propose a new bag model that naturally integrates the importance of the samples. Second, to address the potential overfitting problem, we propose a dynamic function that replaces the logistic function for estimating instance probability. The remainder of this paper is organized as follows: Section 2 describes our enhanced WMIL tracker (EWMIL) in detail. Section 3 reports experiments on challenging CCTV video sequences that illustrate the effectiveness of the proposed method. Section 4 concludes with a brief summary.

2. EWMIL tracker

We introduce EWMIL according to the three standard components of a tracking system: image representation, appearance model, and motion model, presented in Sections 2.1, 2.2, and 2.3 respectively; Section 2.4 gives a brief summary of the proposed method.

2.1 Image representation

We represent each sample x by a feature vector f(x) = (f_1(x), ..., f_K(x))^T, where each feature f_k(·) is a Haar-like feature computed as the weighted sum of pixels in several rectangles. These features are easy to implement and efficient to compute using the integral image [14].

2.2 Appearance model

In both the MIL [12] and WMIL [13] trackers, a boosting classifier is trained to maximize the log-likelihood of bags under the gradient boosting framework [15]:

  ℒ = Σ_j log p(y_j | X_j)                                  (1)

Note that the likelihood is defined over bags, not instances. To compute (1), we need to express p(y_j | X_j), the probability of a bag being positive, in terms of its instances.
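The integral-image trick behind the Haar-like features of Section 2.1 can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names and the toy two-rectangle feature are our own.

```python
import numpy as np

def integral_image(img):
    """Padded summed-area table: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum over the rectangle [y, y+h) x [x, x+w) in O(1): four lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_feature(ii, rects, weights):
    """One Haar-like feature: weighted sum of rectangle sums."""
    return sum(w * rect_sum(ii, *r) for r, w in zip(rects, weights))

# usage: a two-rectangle (left minus right) feature on a toy 4x4 patch
img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
f = haar_feature(ii, [(0, 0, 4, 2), (0, 2, 4, 2)], [1.0, -1.0])
```

Once the integral image is built, every rectangle sum costs four array lookups regardless of its size, which is why densely evaluating many features per frame stays cheap.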
In the MIL tracker, the Noisy-OR (NOR) model is adopted for this:

  p(y_j | X_j) = 1 − ∏_i (1 − p(y_j | x_{ji}))              (2)

The NOR model does not treat positive samples discriminatively according to their importance to the bag probability: no matter how far or near a sample is from the object location, all samples contribute equally, which can confuse the classifier and makes the MIL tracker prone to selecting less effective features. To take the discriminative importance of samples into consideration, the WMIL tracker defines the bag probabilities as:

  p(y = 1 | X^+) = Σ_{i=0}^{N−1} w_{i0} p(y_1 = 1 | x_{1i})                                        (3)

  p(y = 0 | X^−) = Σ_{i=N}^{N+L−1} w p(y_0 = 0 | x_{0i}) = w Σ_{i=N}^{N+L−1} p(y_0 = 0 | x_{0i})   (4)

where X^+ = {x_{10}, x_{11}, ..., x_{1(N−1)}} is the positive bag containing N positive samples and X^− = {x_{0N}, x_{0(N+1)}, ..., x_{0(N+L−1)}} is the negative bag containing L negative ones. Without loss of generality, x_{10} is assumed to be the tracked result at the current frame, and each instance label is assumed to equal its bag label. w_{i0} is a weight function of the Euclidean distance between the locations of samples x_{1i} and x_{10}:

  w_{i0} = (1/c) exp(−|l(x_{1i}) − l(x_{10})|)              (5)

where l(·) is the location function and c is a normalization constant. In (4), w is constant, meaning that all negative instances contribute equally to the negative bag, i.e., it is assumed that samples far from the tracking location can be distinguished without ambiguity. Although WMIL integrates sample importance into the learning procedure, a bag probability made up of a normalized weighted sum of instance probabilities introduces new ambiguity that can still confuse the classifier, weakening the discriminative power of the final strong classifier.
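To make the contrast concrete, the two bag models above can be sketched as follows. This is a toy illustration under our own naming; the exponential weights follow (5), with the normalization constant c chosen so that the weights sum to one.

```python
import numpy as np

def noisy_or(inst_probs):
    """Eq. (2): NOR bag probability. One high-probability instance
    lifts the bag probability, but all instances weigh equally."""
    return 1.0 - float(np.prod(1.0 - np.asarray(inst_probs)))

def wmil_positive_bag(inst_probs, dists):
    """Eqs. (3) and (5): WMIL positive-bag probability as a weighted
    sum, weights decaying exponentially with distance to the tracked
    location and normalized to sum to one (our choice of c)."""
    w = np.exp(-np.abs(np.asarray(dists)))
    w = w / w.sum()
    return float(np.sum(w * np.asarray(inst_probs)))

# the ambiguity discussed above: one near-certain positive instance
probs = [0.99, 0.0, 0.0, 0.0]
dists = [0.0, 1.0, 2.0, 3.0]
nor_p = noisy_or(probs)                   # 0.99: the bag is confidently positive
wmil_p = wmil_positive_bag(probs, dists)  # noticeably lower than nor_p
```

The toy run shows the weakness argued in the text: even though one instance is almost certainly positive, the weighted sum drags the bag probability well below it.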
In EWMIL, we inherit the desirable property of NOR that if one instance in a bag has a high probability, the bag probability is comparatively high as well, while simultaneously taking into account the importance of each sample's contribution to the bag probability. Combining these two aspects, we propose a novel bag model:

  p(y_j | X_j) = 1 − ∏_{i=0}^{N+L−1} (1 − w_{i0} p(y_j | x_{ji}))     (6)

where w_{i0} is the same as in (5). We express the probability uniformly for both the positive bag X^+ and the negative bag X^−, since it empirically makes sense that applying weights to the negative samples helps the classifier better discriminate cluttered background and therefore better distinguish the object from it. In (6), as in both the MIL and WMIL trackers, p(y_j | x_{ji}), the probability of an example being positive, is computed using the standard logistic function:

  σ(x) = 1 / (1 + exp(−x))                                  (7)

where the argument x is the output of the strong classifier. This function increases so quickly with x that it forces the probability estimates to diverge toward 0 and 1 (i.e., overfitting); see Fig. 1 for illustration. Li [16] points out that after several rounds the optimal weak classifier can hardly be discriminated from the others, and the algorithm then selects weak classifiers essentially at random to build the final classifier, which degrades the discriminative model.

Fig. 1: Logistic function

Motivated by [16], we propose a novel dynamic function to estimate the instance probability instead of the original logistic function:

  σ(x) = 1 / (1 + exp(−x / k²))                             (8)

where k is the number of weak classifiers that constitute the final strong classifier. In our model, the overfitting problem is alleviated or even avoided by rescaling the classifier output to a range where σ remains strongly discriminative, resulting in a more stable and robust tracker.
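The two EWMIL ingredients above can be sketched briefly. This is our own minimal rendering (function names and the sum-to-one weight normalization are our assumptions), with (8) taken as the rescaled logistic form reconstructed above.

```python
import numpy as np

def ewmil_bag_prob(inst_probs, dists):
    """Eq. (6): weighted Noisy-OR. A single high-probability instance
    still yields a high bag probability, while distant instances are
    down-weighted via the eq. (5) weights (normalized here, our choice)."""
    w = np.exp(-np.abs(np.asarray(dists)))
    w = w / w.sum()
    return 1.0 - float(np.prod(1.0 - w * np.asarray(inst_probs)))

def dynamic_sigmoid(x, k):
    """Eq. (8): the logistic curve flattened by the number k of selected
    weak classifiers, keeping probability estimates away from the
    saturated values 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x / k**2))
```

Dividing the strong-classifier output by k² simply stretches the logistic curve: the same score maps to a probability much closer to 0.5 than under (7), which is exactly the anti-saturation effect argued above.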
2.3 Motion model

Our motion model is the same as in [12] and [13]: the object location at time t is equally likely to appear anywhere within a radius s of the tracked location at time t−1:

  p(l*_t | l*_{t−1}) ∝ 1 if ||l*_t − l*_{t−1}|| < s, and 0 otherwise     (9)

2.4 A brief summary

The main procedure of our tracking method is summarized in Table 1; for details not mentioned here, we refer readers to [12].

Table 1: EWMIL algorithm

Input: dataset {X_i, y_i}, i = 0, 1, where X_i = {x_{i1}, x_{i2}, ...} is a bag densely cropped from the current frame, i.e., X_1 = {x : ||l_t(x) − l_t(x_0)|| < α} and X_0 = {x : ξ < ||l_t(x) − l_t(x_0)|| < β}, where l_t(x_0) is the object location and α < ξ < β are the cropping radii.
1. Update all M weak classifiers in the pool with data {x_{ij}, y_i}.
2. Feature selection procedure:
   a. Compute the instance probabilities using (8).
   b. Compute the bag probabilities using (6).
   c. Choose the most discriminative feature, i.e., the one that maximizes ℒ in (1); repeat a-c K times.
3. Compose the strong classifier from the K selected weak classifiers.
4. Apply the trained classifier to samples cropped from the new frame and take the maximum-confidence location as the tracking result.
Output: the tracking location in the new frame.

3. Experiments

We compare our EWMIL tracker with the original MIL tracker and the WMIL tracker on 8 challenging video sequences, using average center location error (ACLE, in pixels) as the evaluation criterion. For a fair comparison, we set all parameters as in [12] and [13]. Quantitative results are shown in Table 2, and Fig. 2 presents screenshots of tracking results in real inland waterway CCTV videos. The experimental results demonstrate the superior performance of our EWMIL algorithm.
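Before the results, the search implied by the motion model (9) and step 4 of Table 1 can be sketched as follows. This is a hypothetical illustration under our own naming; `confidence` stands in for the trained strong classifier.

```python
def candidate_locations(prev_loc, s):
    """Eq. (9): the motion prior is uniform inside a disc of radius s
    around the previous location, so every pixel in the disc is an
    equally likely candidate."""
    y0, x0 = prev_loc
    return [(y0 + dy, x0 + dx)
            for dy in range(-s, s + 1)
            for dx in range(-s, s + 1)
            if dy * dy + dx * dx < s * s]

def track_step(prev_loc, s, confidence):
    """One tracking step, as in step 4 of Table 1: score every candidate
    with the strong classifier and return the maximum-confidence location."""
    return max(candidate_locations(prev_loc, s), key=confidence)

# toy run: a confidence map peaking at (1, 1)
best = track_step((0, 0), 3, lambda p: -((p[0] - 1) ** 2 + (p[1] - 1) ** 2))
```

Because the prior is uniform inside the disc, the posterior maximum reduces to a plain arg-max of the classifier confidence over the candidate crops.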
Table 2: Quantitative evaluation (ACLE in pixels) for eight challenging CCTV videos (red indicates the best and blue the second best in the original)

Sequence | Main Challenge                                | EWMIL | WMIL | MIL
CCTV 1   | Low video quality and motion blur             | 9     | 13   | 18
CCTV 5   | Low video quality and motion blur             | 12.5  | 14   | 20
CCTV 2   | Cluttered background and extreme scale change | 38    | 29   | 45
CCTV 6   | Cluttered background and extreme scale change | 31    | 26.5 | 37
CCTV 3   | Partial and even full occlusion               | 10    | 12   | 20
CCTV 7   | Partial and even full occlusion               | 10    | 11   | 19
CCTV 4   | Long-time illumination change                 | 16    | 18   | 15
CCTV 8   | Long-time illumination change                 | 13    | 13   | 24

Fig. 2: Screenshots of tracking results in several CCTV videos: (a) CCTV 1, (b) CCTV 2, (c) CCTV 3, (d) CCTV 4 (EWMIL, WMIL, and MIL results overlaid)

4. Conclusion

In this paper, we analyze and enhance the MIL tracker in two aspects. On one hand, we take into account the discriminative importance of the training samples' contributions to the bag probability, integrating sample importance naturally into the learning procedure. On the other hand, to address the potential overfitting problem, we propose a dynamic function that replaces the logistic function for estimating instance probability. Experimental results on challenging inland waterway CCTV video sequences demonstrate that our tracker achieves favorable performance.

5. Acknowledgments

This work is supported by the National Science Foundation of China (NSFC 51279152).

6. References

[1] F. Coudert, "Towards a new generation of CCTV networks: Erosion of data protection safeguards?," Computer Law & Security Review, vol. 25, no. 2, pp. 145-154, 2009.
[2] N. Dadashi, A. W. Stedmon, and T. P. Pridmore, "Semi-automated CCTV surveillance: The effects of system confidence, system accuracy and task complexity on operator vigilance, reliance and workload," Applied Ergonomics, vol. 44, no. 5, pp. 730-738, 2013.
[3] A. C. Davies and S. A. Velastin, "A progress review of intelligent CCTV surveillance systems," in Proc. Third Workshop, 2005 IEEE Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2005, pp. 417-423.
[4] N. Dadashi, A. Stedmon, and T. Pridmore, "Automatic components of integrated CCTV surveillance systems: Functionality, accuracy and confidence," in Proc. 6th IEEE Int. Conf. Advanced Video and Signal Based Surveillance (AVSS), 2009, pp. 376-381.
[5] S.-W. Park, K.-S. Lim, and J. W. Han, "Videos analytic retrieval system for CCTV surveillance," in Lecture Notes in Electrical Engineering, pp. 239-247.
[6] G. Thiel, "Automatic CCTV surveillance - towards the VIRTUAL GUARD," in Proc. IEEE Annual Int. Carnahan Conf. Security Technology, pp. 42-48.
[7] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125-141, 2008.
[8] X. Mei and H. Ling, "Robust visual tracking using l1 minimization," pp. 1436-1443.
[9] R. T. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1631-1643, 2005.
[10] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez, "Solving the multiple instance problem with axis-parallel rectangles," Artificial Intelligence, vol. 89, no. 1-2, pp. 31-71, 1997.
[11] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking," in Lecture Notes in Computer Science, pp. 234-247.
[12] B. Babenko, S. Belongie, and M.-H. Yang, "Visual tracking with online multiple instance learning," in Proc. 2009 IEEE Computer Society Conf. Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2009, pp. 983-990.
[13] K. Zhang and H. Song, "Real-time visual tracking via online weighted multiple instance learning," Pattern Recognition, vol. 46, no. 1, pp. 397-411, 2013.
[14] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," pp. I-511-I-518.
[15] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[16] X. Li and M. Tang, "Analysis of MILTrack," in ACM International Conference Proceeding Series, pp. 176-179.