
Real-time Ship Tracking via Enhanced MIL Tracker
Fei Teng, Qing Liu, Jianming Guo
School of Automation, Wuhan University of Technology, Wuhan, 430070, China
Abstract
Ship tracking plays a key role in inland waterway CCTV (Closed-Circuit Television) video surveillance.
While much development and success have been demonstrated, long-term robust ship tracking remains
challenging due to factors such as pose variation, illumination change, occlusion, and motion blur.
Recently, a technique called "tracking by detection" has been developed and studied with promising
results. A typical tracking-by-detection algorithm, MIL (Multiple Instance Learning), has become one
of the most popular methods in the tracking domain. It is designed to alleviate the drift problem by
using an MIL-based appearance model that represents training data in the form of bags. In this paper,
we enhance the MIL tracker in two aspects. First, we propose a new bag model that naturally integrates
the importance of the samples. Second, to address a potential overfitting problem, we propose a
dynamic function to estimate the probability of an instance in place of the original logistic function.
Experimental results on challenging inland waterway CCTV video sequences demonstrate that our tracker
achieves favorable performance.
Keywords: CCTV, ship tracking, detection, MIL
1. Introduction
Recent years have witnessed great development of CCTV systems in the domain of inland waterway
ship video surveillance [1-3]. Benefiting from the combined efforts of AIS (Automatic Identification
System), GPS (Global Positioning System), and inland waterway electronic charts [4-6], such a system
acts as an eye for observers, telling whether a ship is sailing against traffic, overtaking, or
parking in violation of regulations.
Numerous tracking algorithms that can be applied in the inland waterway ship tracking domain
have been proposed in the literature over the past couple of decades and they can be categorized into
two classes based on their different appearance representation schemes: generative models and
discriminative ones. Generative algorithms typically learn an appearance model to represent the object
and then search for the image region with minimal reconstruction error as the tracking result in the
current frame. To deal with appearance variation, Ross et al. [7] propose a tracking method that
incrementally learns a low-dimensional subspace representation, efficiently adapting online to changes
in the appearance of the target. The model update, based on incremental algorithms for principal
component analysis, includes two important features: a method for correctly updating the sample mean,
and a forgetting factor that ensures less modeling power is expended fitting older observations. Both
features contribute measurably to overall tracking performance. More recently, Mei et al. [8] propose
an l1-tracker based on sparse representation, in which the object is modeled by a sparse linear
combination of target and trivial templates, treating partial occlusion as arbitrary but sparse noise.
Because it essentially solves a series of l1-minimization problems, its computational complexity is
high, which limits its performance. Although dozens of methods have been proposed to speed up the
l1-tracker, they still fall short of a satisfactory trade-off between accuracy and time consumption.
In short, these generative models do not take background information into account, throwing away
useful cues that could help discriminate the object from the background.
On the other hand, discriminative approaches treat visual tracking as a binary classification
problem: rather than building an exact representation of the target, they seek decision boundaries
between the object and its surrounding background within a local region. Discriminative approaches
thus utilize features extracted not only from the object but also from the background. Collins et al.
[9] demonstrate that selecting discriminative features in an online manner can greatly improve
tracking performance. More recently, several successful tracking algorithms that train their
classifiers with boosting techniques have been proposed. One such approach [10] is an online AdaBoost
feature selection algorithm that, depending on the background, selects the most discriminative
features for tracking; it yields stable results and naturally handles appearance changes of the
object such as illumination changes or out-of-plane rotations. However, because it uses the tracking
result in the current frame as the only positive example when updating the classifier, even slight
inaccuracy of the tracked position introduces mislabeled training samples. These errors may
accumulate over time, leading to model drift or even tracking failure. Grabner et al. [11] present an
online semi-supervised boosting method whose main idea is to formulate the update process in a
semi-supervised fashion, as the combined decision of a given prior and an online classifier; this
requires no parameter tuning and significantly alleviates the drifting problem. Its weakness is that
the classifier is trained with labeled examples from the first frame only, leaving the samples in
subsequent frames unlabeled, which discards valuable motion information between frames.
To resolve a similar ambiguity in face detection, Babenko et al. [12] suggest applying a
multiple-instance learning approach to the object tracking problem and propose the MIL tracker. Their
seminal work uses an MIL-based appearance model to represent training data in the form of bags,
obeying the rule that a positive bag must contain at least one positive example, while all examples
in a negative bag are negative. The positive examples are cropped around the object, and the negative
ones far from the object location. The classifier is then trained in an online manner using the bag
likelihood function. Because some flexibility is allowed in separating training examples, the MIL
tracker can resolve the inherent ambiguity by itself to some degree, leading to more robust tracking.
Zhang [13] points out that a positive instance near the object location should contribute more to the
bag probability, i.e., the weight of an instance near the object should be larger than that of one
far from the object location; Zhang thereby integrates sample importance into the learning procedure
and proposes a weighted MIL (WMIL) tracker that demonstrates superior performance. We observe that
the Noisy-OR model used in the MIL tracker does not take the importance of the positive samples into
account, so the MIL tracker may select less effective positive samples. Conversely, although WMIL
integrates sample importance into the learning procedure, its bag probability, a weighted sum of
instance probabilities, seems to weaken the discriminative power of the final strong classifier and
may introduce new ambiguity: the normalized weights are not sufficient to preserve the desired
property that if one instance in a bag has a high probability, the bag probability should be high as
well. Another important issue is that the standard logistic function, which maps the output of the
strong classifier to the probability of an instance being positive, suffers from a potential
overfitting problem that severely degrades the discriminative model.
Motivated by the above discussion, in this paper we enhance the MIL tracker in two aspects. First,
we propose a new bag model that naturally integrates the importance of the samples. Second, to
address the potential overfitting problem, we propose a dynamic function to estimate the probability
of an instance in place of the logistic function.
The remainder of this paper is organized as follows: Section 2 describes our enhanced WMIL tracker
(EWMIL) in detail. Section 3 reports experiments on challenging CCTV video sequences that illustrate
the effectiveness of the proposed method. Finally, Section 4 concludes with a brief summary.
2. EWMIL tracker
We introduce our EWMIL tracker according to the three standard components of a tracking system:
image representation, appearance model, and motion model, presented in Sections 2.1, 2.2, and 2.3
respectively; Section 2.4 gives a brief summary of the proposed method.
2.1 Image representation
We represent each sample x by a feature vector f(x) = (f_1(x), \ldots, f_K(x))^T, where each feature
f_i(\cdot) is a Haar-like feature computed as a weighted sum of pixels over several rectangles. These
features are easy to implement and efficient to compute using the integral image [14].
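As an illustration, the rectangle sums underlying these Haar-like features can be computed in constant time once the integral image is built. The following sketch (the function names and rectangle encoding are our own, not from [14]) shows the idea:

```python
import numpy as np

def integral_image(img):
    # ii[r, c] holds the sum of img[:r, :c]; a zero-padded first
    # row/column makes the boundary cases uniform.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    # Sum of the h x w rectangle with top-left corner (r, c),
    # obtained with four lookups into the integral image.
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def haar_feature(ii, rects, weights):
    # A Haar-like feature: weighted sum of several rectangle sums.
    return sum(w * rect_sum(ii, *rc) for w, rc in zip(weights, rects))
```

For a 10x10 image of ones, `rect_sum(ii, 2, 2, 3, 4)` returns 12.0, the area of the rectangle, without re-scanning the pixels.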
2.2 Appearance model
In both MIL [12] and WMIL [13] tracker, a boosting classifier is trained to maximize the log likelihood
of bags under the gradient boosting framework [15]:
\mathcal{L} = \sum_i \log p(y_i \mid X_i)    (1)
Note that the likelihood is defined over bags, not instances. In order to compute (1), we need to
express p(y_i | X_i), the probability of a bag being positive, in terms of its instances. The MIL
tracker adopts the Noisy-OR (NOR) model for this purpose:
p(๐‘ฆ๐‘– |๐‘‹๐‘– ) = 1 − ∏๐‘— (1 − ๐‘(๐‘ฆ๐‘– |๐‘ฅ๐‘–๐‘— ))
(2)
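For concreteness, the NOR model of (2) can be sketched as follows (a minimal illustration in our own notation):

```python
import numpy as np

def nor_bag_probability(instance_probs):
    # Noisy-OR (eq. 2): the bag fails to be positive only if every
    # instance fails; all instances contribute symmetrically.
    p = np.asarray(instance_probs, dtype=float)
    return 1.0 - np.prod(1.0 - p)
```

Note that a bag with instance probabilities [0.5, 0.5] already gets bag probability 0.75, regardless of where those instances lie relative to the object.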
Clearly, the NOR model does not treat positive samples discriminatively according to their importance
to the bag probability: no matter how near or far a sample is from the object location, it contributes
equally, which can confuse the classifier and make the MIL tracker select less effective features. To
take the discriminative importance of samples into consideration, the WMIL tracker defines the bag
probability as follows:
p(y = 1 \mid X^+) = \sum_{j=0}^{N-1} w_{j0} \, p(y_1 = 1 \mid x_{1j})    (3)

p(y = 0 \mid X^-) = \sum_{j=N}^{N+L-1} w \, p(y_0 = 0 \mid x_{0j}) = w \sum_{j=N}^{N+L-1} p(y_0 = 0 \mid x_{0j})    (4)

where X^+ = \{x_{10}, x_{11}, \ldots, x_{1(N-1)}\} is the positive bag containing N positive samples
and X^- = \{x_{0N}, x_{0(N+1)}, \ldots, x_{0(N+L-1)}\} is the negative bag containing L negative
ones. Without loss of generality, x_{10} is assumed to be the tracked result at the current frame,
and the instance label is assumed to be the same as the bag label. w_{j0} is a weight function of the
Euclidean distance between the locations of sample x_{1j} and x_{10}:

w_{j0} = \frac{1}{c} \, e^{-\lVert l(x_{1j}) - l(x_{10}) \rVert}    (5)
where l(\cdot) is the location function and c is a normalization constant. In (4), w is held
constant, meaning that all negative instances contribute equally to the negative bag; i.e., it is
assumed that samples far from the tracking location can be distinguished without ambiguity.
We observe that although WMIL integrates sample importance into the learning procedure, a bag
probability formed as a normalized weighted sum of instance probabilities seems to introduce new
ambiguity that still confuses the classifier, thereby weakening the discriminative power of the final
strong classifier.
In our EWMIL, we inherit the desired merit of NOR that if one of the instances in a bag has a
high probability, the bag probability will be comparatively high as well. Simultaneously, the
importance of samples contributing to the bag probability is taken into consideration. We combine
these two aspects and propose a novel bag model:
p(๐‘ฆ๐‘– |๐‘‹๐‘– ) = 1 − ∏๐‘+๐ฟ−1
(1 − ๐‘ค๐‘—0 ๐‘(๐‘ฆ๐‘– |๐‘ฅ๐‘–๐‘— ))
๐‘—=0
(6)
+
where ๐‘ค๐‘—0 is the same as (5). We uniformly express the probability for both the positive bag {๐‘‹ } and
the negative bag {๐‘‹ − }, as it empirically makes sense that by applying weighs to the negative samples
can help the classifier to better discriminate the cluttered background and therefore better distinguish
the object from the background.
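A minimal sketch of this weighted bag model, combining the distance weights of (5) with the weighted Noisy-OR of (6). The normalization choice c = sum of the raw weights is our assumption for illustration; the paper only states that c is a normalization constant:

```python
import numpy as np

def distance_weights(locs, loc0):
    # Eq. (5): weights decay exponentially with the Euclidean distance
    # from the tracked location loc0. ASSUMPTION: we normalize by the
    # sum of the raw weights, one natural choice for the constant c.
    d = np.linalg.norm(np.asarray(locs, float) - np.asarray(loc0, float), axis=1)
    w = np.exp(-d)
    return w / w.sum()

def ewmil_bag_probability(instance_probs, weights):
    # Eq. (6): weighted Noisy-OR. One strong, heavily weighted instance
    # still drives the bag probability up, while instances far from the
    # object (small weights) contribute little.
    p = np.asarray(instance_probs, float)
    w = np.asarray(weights, float)
    return 1.0 - np.prod(1.0 - w * p)
```

Unlike the plain NOR of (2), an instance far from the object location is discounted by its weight before it can influence the bag probability.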
In (6), as in both the MIL and WMIL trackers, p(y_i | x_{ij}), the probability of an example being
positive, is computed using the standard logistic function:
\sigma(x) = \frac{1}{1 + e^{-x}}    (7)
where the argument x is the output of the strong classifier. We find that this function rises so
quickly with x that it forces the probability estimates to diverge toward 0 and 1 (i.e.,
overfitting); see Fig. 1 for an illustration. Li [16] points out that after several boosting rounds
the optimal weak classifier can hardly be discriminated from the others, so the algorithm selects
weak classifiers essentially at random to form the final classifier, which degrades the
discriminative model.
Fig.1: Logistic function
Motivated by [16], we propose a novel dynamic function to estimate the probability of an instance in
place of the original logistic function:

\sigma(x) = \frac{1}{1 + e^{-x / k^2}}    (8)
where k is the number of weak classifiers that constitute the final strong classifier. In our model
the overfitting problem is alleviated, or even avoided, by regularizing \sigma to a reasonable range
in which its value remains strongly discriminative, resulting in a more stable and robust tracker.
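The contrast between (7) and (8) can be seen numerically; the sketch below (variable names are ours) evaluates both on the same classifier output:

```python
import math

def logistic(x):
    # Standard logistic (eq. 7): saturates quickly, pushing
    # probability estimates toward 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

def dynamic_sigma(x, k):
    # Dynamic function (eq. 8): dividing the classifier output by k^2
    # (k = number of weak classifiers) flattens the curve, keeping the
    # probability in a mid-range where it stays discriminative.
    return 1.0 / (1.0 + math.exp(-x / (k * k)))
```

For a strong-classifier output of x = 10 with k = 15 weak classifiers, `logistic(10)` is already above 0.9999 (saturated), while `dynamic_sigma(10, 15)` stays near 0.51, leaving room to rank competing candidates.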
2.3 Motion model
Our motion model is the same as in [12] and [13]: the object location at time t is assumed equally
likely to appear within a radius s of the tracked location at time t-1:

p(l_t^* \mid l_{t-1}^*) \propto \begin{cases} 1 & \text{if } \lVert l_t^* - l_{t-1}^* \rVert < s \\ 0 & \text{otherwise} \end{cases}    (9)
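In implementation terms, (9) simply means enumerating every pixel location within radius s of the previous result as an equally likely candidate; a sketch (function name and pixel-grid enumeration are our own illustration):

```python
def crop_candidates(prev_loc, s, frame_shape):
    # Eq. (9): all locations within radius s of the previous location
    # are equally likely candidates; everything else has probability 0.
    r0, c0 = prev_loc
    H, W = frame_shape
    cands = []
    for r in range(max(0, r0 - s), min(H, r0 + s + 1)):
        for c in range(max(0, c0 - s), min(W, c0 + s + 1)):
            if (r - r0) ** 2 + (c - c0) ** 2 < s * s:
                cands.append((r, c))
    return cands
```

The classifier is then evaluated on the patch at each candidate, and the highest-confidence location becomes the new tracking result.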
2.4 A brief summary
The main procedure of our tracking method is summarized in Table 1; for details not mentioned here,
we refer readers to [12].
TABLE 1 EWMIL algorithm
Input: dataset \{X_i, y_i\}_{i=0}^{1}, where X_i = \{x_{i1}, x_{i2}, \ldots\} is the i-th bag densely
cropped from the current frame, i.e., X_1 = \{x : \lVert l_t(x) - l_t(x_0) \rVert < \alpha\} and
X_0 = \{x : \xi < \lVert l_t(x) - l_t(x_0) \rVert < \beta\}, where l_t(x_0) is the object location
and \alpha < \xi < \beta are the cropping radii.
1. Update all M weak classifiers in the pool with data \{x_{ij}, y_i\}.
2. Feature selection:
   a. Compute the instance probabilities using (8).
   b. Compute the bag probabilities using (6).
   c. Choose the most discriminative feature that maximizes \mathcal{L} in (1); repeat steps a-c K times.
3. Compute the strong classifier.
4. Apply the trained classifier to samples cropped in the new frame and take the maximum-confidence
   sample as the tracking result.
Output: the tracking location in the new frame.
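Step 2 of Table 1 is a greedy boosting-style selection. The sketch below makes a simplification for clarity: each candidate weak classifier is scored independently, whereas the actual gradient-boosting procedure of [12] scores candidates in combination with the classifiers already chosen. It picks the K features that maximize the bag log-likelihood of (1) under the bag model of (6):

```python
import numpy as np

def select_features(inst_probs, bag_weights, bag_labels, K):
    # inst_probs[m][i][j]: probability that instance j of bag i is
    # positive according to weak classifier m (from eq. (8)).
    # bag_weights[i][j]: the w_{j0} weights of eq. (5) for bag i.
    # bag_labels[i]: bag label y_i in {0, 1}.
    # NOTE: simplified sketch -- candidates are scored independently,
    # not jointly with previously chosen classifiers as in [12].
    M = len(inst_probs)
    chosen = []
    for _ in range(K):
        best, best_ll = None, -np.inf
        for m in range(M):
            if m in chosen:
                continue
            ll = 0.0  # bag log-likelihood of eq. (1)
            for p, w, y in zip(inst_probs[m], bag_weights, bag_labels):
                # weighted Noisy-OR bag probability, eq. (6)
                bag_p = 1.0 - np.prod(1.0 - np.asarray(w) * np.asarray(p))
                bag_p = min(max(bag_p, 1e-9), 1.0 - 1e-9)  # numerical guard
                ll += np.log(bag_p if y == 1 else 1.0 - bag_p)
            if ll > best_ll:
                best, best_ll = m, ll
        chosen.append(best)
    return chosen
```

With one informative classifier and one uninformative one, the informative classifier is selected first, mirroring step 2c.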
3. Experiments
We compare our EWMIL tracker with the original MIL tracker and the WMIL tracker on 8 challenging
video sequences, using average center location error (ACLE, in pixels) as the evaluation criterion.
For a fair comparison, we use the same parameter settings as in [12] and [13]. Quantitative results
are shown in Table 2, and Fig. 2 presents screenshots of tracking results in real inland waterway
CCTV videos. The experimental results demonstrate the superior performance of our EWMIL algorithm.
Table 2: Quantitative evaluation (ACLE in pixels) for eight challenging CCTV videos
(Red indicates the best and blue indicates the second)

Sequence  Main Challenge                                  EWMIL  WMIL  MIL
CCTV 1    Low video quality and motion blur                   9    13   18
CCTV 5    Low video quality and motion blur                12.5    14   20
CCTV 2    Cluttered background and extreme scale change      38    29   45
CCTV 6    Cluttered background and extreme scale change      31  26.5   37
CCTV 3    Partial and even full occlusion                    10    12   20
CCTV 7    Partial and even full occlusion                    10    11   19
CCTV 4    Long-term illumination change                      16    18   15
CCTV 8    Long-term illumination change                      13    13   24
(a) CCTV 1   (b) CCTV 2   (c) CCTV 3   (d) CCTV 4
Fig. 2: Screenshots of tracking results in several CCTV videos (bounding boxes: our EWMIL, WMIL, and MIL)
4. Conclusion
In this paper, we analyze and enhance the MIL tracker in two aspects. On one hand, we take into
account the discriminative importance of the training samples contributing to the bag probability
and naturally integrate this importance into the learning procedure. On the other hand, to address a
potential overfitting problem, we propose a dynamic function to estimate the probability of an
instance in place of the logistic function. Experimental results on challenging inland waterway CCTV
video sequences demonstrate that our tracker achieves favorable performance.
5. Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC 51279152).
6. References
[1] F. Coudert, “Towards a new generation of CCTV networks: Erosion of data protection
safeguards?,” Computer Law & Security Review, vol. 25, no. 2, pp. 145-154, 2009.
[2] N. Dadashi, A. W. Stedmon, and T. P. Pridmore, “Semi-automated CCTV surveillance: The effects
of system confidence, system accuracy and task complexity on operator vigilance, reliance and
workload,” Applied Ergonomics, vol. 44, no. 5, pp. 730-738, 2013.
[3] A. C. Davies, and S. A. Velastin, "A progress review of intelligent CCTV surveillance systems,"
Proceedings of the Third Workshop - 2005 IEEE Intelligent Data Acquisition and Advanced Computing
Systems: Technology and Applications, IDAACS 2005. pp. 417-423.
[4] N. Dadashi, A. Stedmon, and T. Pridmore, "Automatic components of integrated CCTV
surveillance systems: Functionality, accuracy and confidence," 6th IEEE International Conference on
Advanced Video and Signal Based Surveillance, AVSS 2009. pp. 376-381.
[5] S.-W. Park, K.-S. Lim, and J. W. Han, "Videos analytic retrieval system for CCTV surveillance,"
Lecture Notes in Electrical Engineering. pp. 239-247.
[6] G. Thiel, "Automatic CCTV surveillance - towards the VIRTUAL GUARD," IEEE Annual International
Carnahan Conference on Security Technology, Proceedings. pp. 42-48.
[7] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,”
International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125-141, 2008.
[8] X. Mei and H. Ling, "Robust visual tracking using l1 minimization," in Proc. IEEE International Conference on Computer Vision (ICCV), 2009, pp. 1436-1443.
[9] R. T. Collins, Y. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1631-1643, 2005.
[10] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez, “Solving the multiple instance problem with
axis-parallel rectangles,” Artificial Intelligence, vol. 89, no. 1-2, pp. 31-71, 1997.
[11] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking,"
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics). pp. 234-247.
[12] B. Babenko, S. Belongie, and M.-H. Yang, "Visual tracking with online multiple instance learning,"
2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR
Workshops 2009. pp. 983-990.
[13] K. Zhang, and H. Song, “Real-time visual tracking via online weighted multiple instance learning,”
Pattern Recognition, vol. 46, no. 1, pp. 397-411, 2013.
[14] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001, vol. 1, pp. 511-518.
[15] Y. Freund, and R. E. Schapire, “Decision-theoretic generalization of on-line learning and an
application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[16] X. Li, and M. Tang, "Analysis of MILTrack," ACM International Conference Proceeding Series. pp.
176-179.