Robust Object Tracking Based on Motion Consistency

Lijun He, Xiaoya Qiao, Shuai Wen and Fan Li *

Department of Information and Communication Engineering, School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; jzb2016125@mail.xjtu.edu.cn (L.H.); qxy0212@stu.xjtu.edu.cn (X.Q.); wen201388@stu.xjtu.edu.cn (S.W.)
* Correspondence: lifan@mail.xjtu.edu.cn; Tel.: +86-29-8266-8772

Received: 28 December 2017; Accepted: 10 February 2018; Published: 13 February 2018

Abstract: Object tracking is an important research direction in computer vision and is widely used in video surveillance, security monitoring, video analysis and other fields. Conventional tracking algorithms perform poorly in specific scenes, such as those with fast target motion and occlusion. The candidate samples may lose the true target due to its fast motion, and the appearance of the target may change with movement. In this paper, we propose an object tracking algorithm based on motion consistency. In the state transition model, candidate samples are obtained from the target state, which is predicted according to the temporal correlation. In the appearance model, we define the position factor to represent the different importance of candidate samples in different positions using the double Gaussian probability model. The candidate sample with the highest likelihood is selected as the tracking result by combining the holistic and local responses with the position factor. Moreover, an adaptive template updating scheme is proposed to adapt to the target's appearance changes, especially those caused by fast motion. The experimental results on a 2013 benchmark dataset demonstrate that the proposed algorithm performs better in scenes with fast motion and partial or full occlusion compared to the state-of-the-art algorithms.

Keywords: object tracking; motion consistency; state prediction; position factor; occlusion factor

1. Introduction

Object tracking is an important application for video sensor signal and information processing, which is widely applied in video surveillance, security monitoring, video analysis, and other areas. Although numerous methods have been proposed, it is still challenging to implement object tracking in particular scenes, such as sports scenes for player tracking and security scenes for criminal tracking. These scenes are characterized by fast motion of the target, occlusion and illumination variation. Improving the accuracy and robustness of tracking in these particular scenes is an open problem.

Object tracking is the determination of the target state in continuous video frames. The particle filter is widely used in object tracking; it uses the Monte Carlo method to simulate the probability distribution and is effective in estimating non-Gaussian and nonlinear states [1]. In the particle filter, the state transition model and the appearance model are two important types of probabilistic models. The state transition model is used to predict the current target state based on the previous target states, and can be divided into the Gaussian distribution model and the constant velocity model. The Gaussian distribution model assumes that the velocities between adjacent frames are uncorrelated. Thus, the candidate samples are predicted by simply adding a Gaussian disturbance to the previous target state [2–8]. This method has been widely used due to its simplicity and effectiveness.
The method performs well when the target moves short distances in random directions. However, in the presence of fast motion, a large variance and more particles are needed to avoid the loss of the true target, which results in a higher computational burden. The constant velocity model assumes a strong correlation between the velocity of the current state and those of previous states, and candidate samples are predicted from the previous states and velocities with the addition of a stochastic disturbance [9–15]. This can be regarded as adding a disturbance to a new state that is far away from the previous state in terms of velocity. Thus, when the velocity in the current frame is notably smaller than that in the previous frame, an improper variance of the disturbance will also cause the loss of the target.

The appearance model is used to represent the similarity between the real state of the target and the observation of the candidates. It can be divided into the generative model [8,15–20], the discriminative model [21–27] and the collaboration model [2–4,6,28–30]. Tracking based on the generative model focuses on foreground information and ignores background information. The candidates are represented sparsely by a foreground template set, and the candidate with the minimum reconstruction error can be selected as the target. Tracking based on the discriminative model considers target tracking as a classification process, and a classifier is trained and updated online to distinguish the foreground target from the background. Tracking based on the collaboration model takes advantage of the generative and discriminative models to make joint decisions.

Most existing target tracking methods have obvious disadvantages in complex scenes. The candidate samples selected in a state transition model without prediction may not include the real target, which might cause tracking drift or even tracking failure, especially when the target has fast motion in the video. Moreover, if the target's appearance changes drastically during movement, considering only the target's appearance features in the appearance model may reduce the matching accuracy between the target and the candidate samples.

In this paper, we propose an object-tracking algorithm based on motion consistency (MCT). MCT utilizes the temporal correlation between continuous video frames to describe the motion consistency of the target. In the state transition model, the candidate samples are selected based on the predicted target state. In the appearance model, the candidate samples are represented by the position factor, which is defined to describe the motion consistency of the target. The tracking result is determined by combining the position factor with the local and holistic responses. The main contributions of our proposed algorithm are summarized as follows:

1. Candidate sampling based on target state prediction

A conventional transition model may result in loss of the target if the number of candidate samples is limited, which may lead to tracking drift or even tracking failure. In this paper, candidate sampling based on target state prediction is proposed. Considering the motion consistency, the target state in the current frame is predicted based on the motion characteristics of the target in previous frames. Then, the candidate samples are selected according to the predicted target state to overcome the tracking drift.
2. Joint decision by combining the position factor with local and holistic responses

The appearance probability of the target may differ in different positions due to the motion consistency. Different from conventional algorithms, which neglect the temporal correlation of the video frames in the appearance model, the position factor is defined to characterize the importance of candidate samples in different positions according to the double Gaussian probability model. Meanwhile, the occlusion factor is proposed to rectify the similarity computed in the local representation to obtain the local response, and the candidates are represented sparsely by positive and negative templates in the holistic representation to obtain the holistic response. Then, the tracking decision is made by combining the position factor with the responses of the local and holistic representations. This approach makes full use of the temporal correlation of the target state and the spatial similarity of local and holistic characteristics.

3. Adaptive template updating based on target velocity prediction

The fast motion of the target will inevitably bring about appearance changes of the target and background. To adapt to these changes, an adaptive template updating approach based on the predicted velocity of the target is proposed. The positive template set is divided into dynamic and static template sets. When the target moves fast, the dynamic positive template set is updated to account for the appearance changes of the target. To keep the primitive characteristics of the target, the static positive template set is retained. Moreover, the negative holistic template set is updated continuously to adapt to the background changes.

The rest of the paper is organized as follows: Section 2 reviews the related work. In Section 3, an object tracking algorithm based on motion consistency is proposed. Section 4 gives the experimental results with quantitative and qualitative evaluations. The conclusions are presented in Section 5.

2. Related Work

A large number of algorithms have been proposed for object tracking, and an extensive review is beyond the scope of this paper. Most existing algorithms focus on the state transition model and the appearance model based on the particle filter. In the state transition model, the change in the target state is estimated and candidate samples are predicted under various assumptions. Most state transition models based on the particle filter can be divided into the Gaussian distribution model and the constant velocity model, according to whether the velocity is assumed to be temporally correlated with previous frames. Appearance models, which are used to represent the similarity between the real state of the target and the observation of the candidates, can be categorized into generative models, discriminative models and combinations of these two types of models.

2.1. State Transition Models

The state transition model simulates the change of the target state over time through particle prediction. Most algorithms use a Gaussian distribution model to predict candidate samples. The Gaussian distribution model regards the previous target state as the current state, and then obtains candidate samples by adding a Gaussian noise dispersion to the state space. For example, in [2,3], the particles are predicted by a Gaussian function with fixed mean and variance; thus the coverage of the particles is limited.
The Gaussian distribution model describes target motion best for short-distance motion with random directions, but it performs poorly when the target has long-distance motion in a certain direction. Furthermore, in [5], three types of particles whose distributions have different variances are prepared to adapt to different tasks. In visual tracking decomposition [7], the motion model is represented by a combination of multiple basic motion models, each of which covers a different type of motion using a Gaussian disturbance with a different variance. To a certain extent, this model can cover abrupt motion with a large variance. However, fixed models cannot adapt to complex practical situations. In addition, there are other algorithms which ignore the temporal correlation of the target in continuous frames. For example, the multi-instance learning tracker (MIL) [22] utilizes a simplified particle filtering method which sets a searching radius around the target in the previous frame and uses dense sampling to obtain samples. A small searching radius cannot adapt to fast motion; on the other hand, a large searching radius brings a high computational burden. Algorithms based on local optimum search [20,31] also obtain the candidates without prediction; thus, a larger searching radius is likewise necessary when fast motion occurs.

In subsequent studies, some algorithms use the constant velocity model to predict candidate samples, which assumes that the velocity of the target in the current frame is correlated with that in the previous frame. In [9,10], the motion of the target is modeled by adding the motion velocity multiplied by the frame interval (the motion distance) in the previous frame to the previous state and then disturbing it with Gaussian noise. In [11,12], the average velocity of the target in previous frames is considered to obtain the translation parameters that model the object motion. This is suitable for situations in which the velocity in the current frame is strongly correlated with that in the previous frame. However, the actual motion of the target is complex. When the target motion in the current frame is not correlated with that in the previous frame, an improper variance of the Gaussian disturbance and a finite number of particles will also lead to tracking drift or even failure.
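To make the distinction between the two families of state transition models concrete, the following minimal NumPy sketch contrasts them for the translation component of the state only. The particle count and noise level are illustrative values, not settings taken from any of the cited trackers.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_model(pos_prev, n_particles=300, sigma=4.0):
    """Gaussian distribution model: disturb the previous target position only."""
    return pos_prev + rng.normal(0.0, sigma, size=(n_particles, 2))

def constant_velocity_model(pos_prev, vel_prev, n_particles=300, sigma=4.0):
    """Constant velocity model: propagate by the previous velocity, then disturb."""
    return pos_prev + vel_prev + rng.normal(0.0, sigma, size=(n_particles, 2))

pos, vel = np.array([120.0, 80.0]), np.array([15.0, 2.0])   # fast horizontal motion
print(gaussian_model(pos).mean(axis=0))                 # particles stay centred on the old position
print(constant_velocity_model(pos, vel).mean(axis=0))   # particles are shifted by the previous velocity
```

Under fast motion, the first model must enlarge sigma (and the particle count) to cover the target, whereas the second model fails when the current velocity is not correlated with the previous one.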
2.2. Appearance Models

Tracking based on generative models focuses more on foreground information. The candidates are represented by the appearance model, and the candidate with the minimum reconstruction error can be selected as the target. In [8], multi-task tracking (MTT) represents each candidate by the same dictionary, considers each representation as a single task, and then formulates object tracking as a multi-task sparse learning problem with joint sparsity. In [19], a local sparse appearance model with histograms and mean shift is used to track the target. Tracking based on the L1 minimum (L1APG) in [12] models the target appearance by target templates and trivial templates with sparse representation. However, it neglects the background information. Thus, in the presence of background clutter or occlusion, tracking based on a generative model cannot distinguish the target from the background.

Tracking based on discriminative models, in essence, considers the target tracking process as a classification task and distinguishes the foreground target from the background using the classifier with the maximum classifier response. In [24,25], an online boosting classifier is proposed to select the most discriminating features from the feature pool. In [26], supervised and semi-supervised classifiers are trained online using the labeled sample in the first frame and unlabeled samples in subsequent frames. The discriminative model considers both background and foreground information to distinguish the target. However, a classifier with an improper updating scheme or limited training samples will lead to poor classification.

Tracking methods that combine generative and discriminative models exploit the advantages of these two models, fusing them and using them together to make decisions. Tracking via the sparse collaborative appearance model (SCM) [2] combines a sparse discriminative classifier with a sparse generative model to track the object. With the sparse discriminative classifier, the candidate samples are represented sparsely by holistic templates and a confidence value is computed. In the sparse generative model, the candidate samples are divided into patches and represented by a dictionary to compute the similarity between each candidate sample and the template. However, the positive templates remain fixed; thus the discriminative classifier cannot adapt to drastic appearance variations. In object tracking via key patch sparse representation [4], an overlapped patch sampling strategy is adopted to capture the local structure. In the occlusion prediction scheme, positive and negative bags are obtained by sampling patches within and outside the bounding box. The SVM (support vector machine) classifier is then trained on the labeled bags, by which the patches are predicted to be occluded or not. Meanwhile, the contribution factor is computed via sparse representation. However, the SVM classifier is not updated in a timely manner, which may cause error accumulation in occlusion decisions. In [6], an object tracking algorithm based on a cooperative appearance model is proposed. The discriminative model is established in the local representation with positive and negative dictionaries, the reconstruction error of each candidate is computed in the holistic representation, and impact regions are proposed to combine the local and holistic responses.

In this paper, we propose a tracking algorithm based on motion consistency, in which the candidate samples are predicted by a state transition model based on target state prediction. In addition, in the appearance model, the position factor is proposed to assist in target selection. As revealed in [2,3,5,7], candidate samples predicted simply by the Gaussian distribution model are effective for short-distance motion with random direction. The algorithms in [9–12] take the velocity of the target into consideration to a certain extent, which can adapt to motion that is correlated with the previous state, even long-distance motion. Inspired by the above work, in this paper, the predicted target state information is added to the conventional Gaussian distribution model. However, the proposed model is different from the constant velocity model in that only the dynamic candidate samples are predicted with motion prediction; in addition, some basic candidate samples around the previous state are still generated with a Gaussian disturbance. In this way, the varied candidate samples can adapt to complex motions in addition to fast motion.
Conversely, most tracking algorithms focus on the similarity between the candidate samples and templates, such as that in [2]. However, when the target appearance has a large variation, the true target cannot be selected by considering the similarity alone. Thus, in the appearance model, considering the temporal correlation of the target state, the position factor is proposed to characterize the importance of candidate samples in different positions and collaborates with the representation responses to select the target.

3. Robust Object Tracking Based on Motion Consistency

Conventional algorithms which ignore the temporal correlation of the target states in the state transition model may lose the true target, especially when fast motion occurs, and if the appearance of the target changes drastically, algorithms that only focus on the target's appearance features in the appearance model will struggle to separate the foreground target from the background. Meanwhile, an appropriate updating scheme is necessary to maintain the robustness of the algorithm. Thus, we propose a robust object tracking algorithm based on motion consistency.

The tracking process of our proposed MCT algorithm is shown in Figure 1. The target is assumed to be known in the first frame. In the state transition model, the target state, including the motion direction and distance in the current frame, is predicted according to the target states in previous frames. The candidate samples are predicted by the state transition model according to the predicted target state. Next, all the candidate samples are put into the appearance model. In the appearance model, we propose the position factor according to the double Gaussian probability model to assign weights to candidate samples located in different positions.
Meanwhile, in the local representation, the similarity between the candidate and the target is computed with the sparse representation by a dictionary to obtain the local response of each candidate. In the holistic representation, the holistic response of each candidate sample is computed with the sparse representation by positive and negative templates. Finally, the candidate sample with the highest likelihood is decided as the tracking result by combining the holistic and local responses with the position factor. To adapt to the appearance changes of the target, especially those caused by fast motion, the template sets should be updated adaptively based on the target velocity prediction.

Figure 1. The candidates are predicted by the target state prediction based state transition model. In the appearance model, the position factor is proposed to characterize the importance of candidate samples in different positions. In the target velocity prediction based updating model, the template sets are updated adaptively to ensure the robustness and effectiveness of the MCT algorithm.

3.1. State Transition Model Based on Target State Prediction

The state transition model is used to select the candidate samples in the current video frame. Conventional methods based on the particle filter usually use the Gaussian distribution model to obtain the candidate samples around the position where the target appears in the previous frame. When the target has fast motion, the selection of candidate samples without prediction may result in the loss of the true target. Therefore, considering the characteristics of motion consistency, we propose a state transition model based on target state prediction to predict candidate samples.

3.1.1. Target State Prediction
Target state prediction is the estimation of the target position at frame t. We predict a set of target motion directions and distances {(θ̂_t^1, l̂_t^1), (θ̂_t^2, l̂_t^2), ..., (θ̂_t^u, l̂_t^u), ..., (θ̂_t^U, l̂_t^U)} at frame t according to the motion consistency, as shown in Figure 2.

Figure 2. The relative motion of the target in the previous frame, including motion direction and motion distance, is already known. Then the target motion direction in the current frame is predicted by adding a prediction range of γ to the previous motion direction, and the motion distance is predicted according to the previous motion distance with a variation rate.

It is assumed that the target has a steady motion direction and velocity. We predict the target motion direction by using the motion direction of the target at frame t − 1. As shown in Figure 2, the relative motion direction of the target from frame t − 1 to frame t − 2 is θ_{t−1}. Then, the target motion direction is predicted as a set of values based on the value θ_{t−1}. Taking θ_{t−1} as the center, we select U angles uniformly in the interval [θ_{t−1} − γ, θ_{t−1} + γ], where γ is a constant, to determine the prediction range and reduce the estimation error. Then, the predicted values of the motion directions are:

$$\hat{\theta}_t^u = \theta_{t-1} - \gamma + \frac{2\gamma}{U-1}(u-1), \quad u = 1, 2, \ldots, U, \qquad (1)$$

where U is the number of predicted directions.

We also predict the motion distance based on the relative motion distance within the three frames before frame t. The predicted distance in each direction is:

$$\hat{l}_t^u = \hat{w}_t \times l_{t-1}, \qquad (2)$$

where l_{t−1} is the relative motion distance of the target from frame t − 1 to frame t − 2. ŵ_t describes the variation rate of the motion distance in the previous frames and is defined as:

$$\hat{w}_t = \alpha \times \frac{l_{t-1}}{l_{t-2}} + (1-\alpha) \times \frac{l_{t-2}}{l_{t-3}}, \qquad (3)$$

where α is a positive constant that is less than 1.

Thus, we obtain a set of target motion directions and distances {(θ̂_t^1, l̂_t^1), (θ̂_t^2, l̂_t^2), ..., (θ̂_t^u, l̂_t^u), ..., (θ̂_t^U, l̂_t^U)}.
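The direction and distance prediction of Equations (1)–(3) can be written compactly as follows. This is a minimal NumPy sketch; the values of γ and α are illustrative placeholders (the paper only states that α < 1), while U = 7 follows the experimental settings in Section 4.

```python
import numpy as np

def predict_directions(theta_prev, gamma, U):
    """Equation (1): U candidate directions centred on the previous motion direction."""
    u = np.arange(U)
    return theta_prev - gamma + 2.0 * gamma / (U - 1) * u

def predict_distance(l_prev, l_prev2, l_prev3, alpha):
    """Equations (2)-(3): predicted distance from the variation rate of recent distances."""
    w_hat = alpha * (l_prev / l_prev2) + (1.0 - alpha) * (l_prev2 / l_prev3)
    return w_hat * l_prev   # the same predicted distance l_hat is used for every direction

# Example with illustrative values (gamma and alpha are assumptions, not the paper's settings)
theta_hat = predict_directions(theta_prev=0.3, gamma=np.pi / 6, U=7)
l_hat = predict_distance(l_prev=12.0, l_prev2=10.0, l_prev3=9.0, alpha=0.7)
print(list(zip(theta_hat, [l_hat] * len(theta_hat))))   # the predicted set {(theta_hat_u, l_hat_u)}
```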
3.1.2. Three-Step Candidate Sampling

The conventional Gaussian distribution model generates the candidate samples near the target position in the previous frame. For random motion over a short distance, the true target position can be successfully covered by the samples. However, when fast motion occurs, these predicted samples may lose the true target position, which results in tracking drift or tracking failure. To avoid the loss of the true target position, a three-step candidate sampling scheme is proposed according to target state prediction, as illustrated in Figure 3.

Figure 3. First, the basic candidate samples (the blue dots) are obtained around the target location at frame t − 1. Second, the basic candidate samples with the largest relative distance in each predicted direction are selected as the benchmark candidate samples (the yellow dots). Third, more dynamic candidate samples (the green dots) are added in the predicted direction, beginning from the benchmark, within a predicted distance.

1. First step: Basic candidate sampling

The basic candidate samples in the current frame are predicted by the Gaussian distribution model.
The current state X_t^i is predicted from the state in the previous frame X_{t−1}^i and a disturbance by Gaussian noise X_Gauss as follows:

$$X_t^i = X_{t-1}^i + X_{Gauss}, \qquad (4)$$

where X_t^i represents the ith candidate in frame t, and the state of each candidate is represented by an affine transformation model X_t^i = {α_1^i, α_2^i, α_3^i, α_4^i, α_5^i, α_6^i} of six parameters, in which α_1^i, α_2^i represent translation, α_3^i, α_4^i represent scaling and α_5^i, α_6^i represent rotation. Each parameter in the state vector is assumed to obey a Gaussian distribution with a different mean and variance according to the actual situation. Next, the relative translation between the N basic candidate samples and the target in the previous frame is computed and converted to a polar coordinate form {(θ_t^1, l_t^1), (θ_t^2, l_t^2), ..., (θ_t^n, l_t^n), ..., (θ_t^N, l_t^N)}, which covers the direction and distance information of each candidate sample.

2. Second step: Benchmark candidate sampling

To make the candidate samples cover as large a region as possible where the target may appear, while considering both efficiency and effectiveness, we simplify the problem by selecting the benchmark with the largest relative distance in each predicted direction and then adding more dynamic candidate samples in the same direction within a predicted distance, beginning from the benchmark.

The benchmark candidate samples {(θ̂_t^1, l_max^1), (θ̂_t^2, l_max^2), ..., (θ̂_t^u, l_max^u), ..., (θ̂_t^U, l_max^U)} are selected from the basic candidate samples in accordance with the motion directions predicted by Equation (1) as follows:

$$(\hat{\theta}_t^u, l_{max}^u) = (\theta_t^j, l_t^j), \quad j = \underset{i \in \{i \mid \theta_t^i = \hat{\theta}_t^u,\ 1 \le i \le N\}}{\arg\max}\ l_t^i, \quad u = 1, 2, \cdots, U, \qquad (5)$$

In the θ̂_t^u direction, the basic candidate sample with the largest relative distance is selected as the benchmark candidate, as shown in Figure 3.
If there is no benchmark in the θ̂_t^u direction, the benchmark candidate sample is defined as (θ̂_t^u, 0), with a relative distance of zero.

3. Third step: Dynamic candidate sampling

Beginning from the benchmark, in each predicted direction, a total of M dynamic candidate samples are placed evenly at a distance of l̂_t^u / M apart, as shown in Figure 3 (a sketch of the whole sampling procedure follows at the end of this subsection). There are UM dynamic candidate samples:

$$\left(\hat{\theta}_t^u,\ l_{max}^u + m \times \frac{\hat{l}_t^u}{M}\right), \quad u = 1, 2, \cdots, U, \quad m = 1, 2, \cdots, M, \qquad (6)$$

Thus, we obtain the relative position information of all the candidate samples, including N basic candidate samples and UM dynamic candidate samples, {(θ_t^1, l_t^1), (θ_t^2, l_t^2), ..., (θ_t^N, l_t^N), (θ_t^{N+1}, l_t^{N+1}), ..., (θ_t^{N+UM}, l_t^{N+UM})} in frame t. Moreover, these relative polar coordinates are converted into translation parameters, and the remaining four parameters are assumed to obey the Gaussian distribution. All the candidate samples predicted in frame t are represented by X_t = {X_t^1, X_t^2, ..., X_t^{N+UM}}.
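The following sketch strings the three sampling steps together for the translation parameters only; the remaining four affine parameters would be perturbed with their own Gaussian noise as described above. The Gaussian standard deviation and the nearest-direction matching used here to associate a basic sample with a predicted direction in Equation (5) are simplifying assumptions for the example, not choices stated in the paper.

```python
import numpy as np

def sample_candidates(pos_prev, theta_hat, l_hat, N=250, M=50, sigma_xy=4.0, rng=None):
    """Three-step candidate sampling (translation part only).

    pos_prev : (x, y) target centre at frame t-1
    theta_hat: array of U predicted directions (Equation (1))
    l_hat    : array of U predicted distances (Equation (2))
    """
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: basic candidates from a Gaussian disturbance (Equation (4)).
    basic = pos_prev + rng.normal(0.0, sigma_xy, size=(N, 2))
    d = basic - pos_prev
    theta = np.arctan2(d[:, 1], d[:, 0])     # relative direction of each basic sample
    dist = np.hypot(d[:, 0], d[:, 1])        # relative distance of each basic sample

    dynamic = []
    for u, (th_u, l_u) in enumerate(zip(theta_hat, l_hat)):
        # Step 2: benchmark = basic sample with the largest distance whose direction is
        # nearest to th_u (Equation (5)); angle wrap-around is ignored for brevity.
        nearest = np.argmin(np.abs(theta[:, None] - theta_hat[None, :]), axis=1) == u
        l_max = dist[nearest].max() if nearest.any() else 0.0   # rule: no benchmark -> distance 0
        # Step 3: M dynamic candidates spaced l_u / M apart along th_u (Equation (6)).
        for m in range(1, M + 1):
            r = l_max + m * l_u / M
            dynamic.append(pos_prev + r * np.array([np.cos(th_u), np.sin(th_u)]))
    return np.vstack([basic, np.array(dynamic)])   # N + U*M candidate centres

# Example usage with illustrative predictions; N = 250, M = 50 and U = 7 match Section 4.
cands = sample_candidates(np.array([120.0, 80.0]),
                          theta_hat=np.linspace(0.0, 0.6, 7),
                          l_hat=np.full(7, 11.0))
print(cands.shape)   # (600, 2): 250 basic + 350 dynamic candidates
```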
3.2. Position Factor Based Cooperative Appearance Model

In the presence of fast motion, the target's appearance changes drastically. Therefore, it is difficult for algorithms that consider only the changes of the appearance features in the appearance model to distinguish the foreground target from a complex background. In view of this, we propose the position factor to reduce the effect of appearance changes caused by target movement. Due to the motion consistency, the probability of the target appearing differs from position to position. Generally, a candidate sample located where the target is likely to appear has a larger probability of being selected as the target. Thus, candidate samples of this kind should be assigned larger weights, while other candidate samples should be assigned smaller weights. Therefore, a quantitative parameter named the position factor is proposed to assign different weights to candidate samples in different positions and thereby rectify the representation error. The likelihood of a candidate sample is defined by combining the position factor with the responses of the local and holistic representations.

3.2.1. Position Factor Based on Double Gaussian Probability Model

The position factor of the ith candidate sample is computed by multiplying the direction score S_A^i and the distance score S_l^i:

$$F_i = S_A^i S_l^i, \quad i = 1, 2, \cdots, N + MU, \qquad (7)$$

To compute the direction and distance scores, a double Gaussian probability model is used according to the target states in previous frames, which includes the direction Gaussian probability model and the distance Gaussian probability model.

1. Direction score

The motion direction of the target at frame t is consistent with that in the previous frame. It is assumed that the probability of the target appearing in the θ_{t−1} direction is the largest, and the probability of the target appearing in the direction opposite θ_{t−1} is the smallest. Thus, a Gaussian probability model with θ_{t−1} mean error is proposed to simulate the distribution of the target's motion direction. Accordingly, a candidate sample lying in the θ_{t−1} direction has a large probability of being selected as the target and should be assigned a large weight, whereas a candidate sample lying in the direction opposite θ_{t−1} should be assigned a small weight. The weight assignment model is then transformed according to the trend of the direction Gaussian probability model, from which the direction score of the ith candidate sample is derived as follows:

$$S_A^i = \frac{T_1}{1-e^{-\pi}} e^{-(\theta_t^i - \theta_{t-1})^2} + \left(1 - \frac{T_1}{1-e^{-\pi}}\right), \quad i = 1, 2, \cdots, N + MU, \qquad (8)$$

where θ_t^i denotes the relative direction between the ith candidate and the target at frame t − 1, θ_{t−1} is the motion direction of the target at frame t − 1, and T_1 is a constant that depends inversely on (θ_{t−1}, p_max^A) and (θ_{t−1} − π, p_min^A).

2. Distance score

The motion velocities within two continuous frames are consistent. In the θ_{t−1} direction, the velocity of the target at frame t is consistent with that of the previous frame. Thus, according to the distance predicted by Equation (2), it is assumed that the probability of the target appearing near the relative distance l̂_t^u is the largest. Conversely, in the direction opposite θ_{t−1}, the motion velocity of the target is small, and the probability of the target at frame t appearing near the target position at frame t − 1 is the largest. Accordingly, in each direction there is a distance with the largest probability of target appearance, which depends on the direction in which the candidate sample is located:

$$l^i = \left(\frac{T_2}{1-e^{-\pi}} e^{-(\theta_t^i - \theta_{t-1})^2} + \left(1 - \frac{T_2}{1-e^{-\pi}}\right)\right) \hat{l}_t^u, \quad i = 1, 2, \cdots, N + MU, \qquad (9)$$

where T_2 is a constant that depends inversely on (θ_{t−1}, l̂_t^u) and (θ_{t−1} − π, k l̂_t^u), and k is a positive constant that is less than 1.

It is assumed that in the θ_t^i direction, the probability of the target appearing at the relative distance l^i is the largest, and that the probability for an infinite relative distance is the smallest. Thus, the probability of the candidate sample at (θ_t^i, l^i) being selected as the target is large, and the weight of this candidate is defined as p_max^l, whereas the probability of the candidate sample at (θ_t^i, ∞) being selected as the target is small, and the weight of this candidate is defined as p_min^l. According to this distribution feature, in the θ_t^i direction, the motion distance can be assumed to obey a Gaussian probability model with l^i mean error. Thus, the weight assignment model is transformed according to the trend of the distance Gaussian probability model, by which the distance score of the ith candidate sample is computed as follows:

$$S_l^i = \frac{T_3}{1-e^{-|\hat{l}_t^u - l^i|}} e^{-(l_t^i - l^i)^2} + \left(1 - \frac{T_3}{1-e^{-|\hat{l}_t^u - l^i|}}\right), \quad i = 1, 2, \cdots, N + MU, \qquad (10)$$

where l_t^i represents the relative distance of the ith candidate sample; l^i is the corresponding distance with the largest probability in the θ_t^i direction, computed by Equation (9); and T_3 is a constant that depends inversely on (l^i, p_max^l) and (∞, p_min^l).
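A small sketch of how the position factor can be evaluated for one candidate. The direction score and the expected distance follow Equations (8) and (9); the distance score below is a simplified stand-in for Equation (10) that keeps only its qualitative behaviour (a Gaussian falloff from 1 towards a floor p_min around the expected distance). The constants T1, T2 and p_min are placeholders, not the paper's values.

```python
import numpy as np

def direction_score(theta_i, theta_prev, T1=0.5):
    """Equation (8): large when the candidate lies along the previous motion direction."""
    c = 1.0 - np.exp(-np.pi)
    return (T1 / c) * np.exp(-(theta_i - theta_prev) ** 2) + (1.0 - T1 / c)

def expected_distance(theta_i, theta_prev, l_hat_u, T2=0.5):
    """Equation (9): the most likely relative distance in the candidate's direction."""
    c = 1.0 - np.exp(-np.pi)
    return ((T2 / c) * np.exp(-(theta_i - theta_prev) ** 2) + (1.0 - T2 / c)) * l_hat_u

def distance_score(l_i, l_star, p_min=0.3):
    """Simplified stand-in for Equation (10): Gaussian falloff around the expected distance,
    decaying from 1 towards p_min for distances far from l_star."""
    return p_min + (1.0 - p_min) * np.exp(-(l_i - l_star) ** 2)

def position_factor(theta_i, l_i, theta_prev, l_hat_u):
    """Equation (7): F_i = S_A^i * S_l^i."""
    l_star = expected_distance(theta_i, theta_prev, l_hat_u)
    return direction_score(theta_i, theta_prev) * distance_score(l_i, l_star)

# A candidate close to the predicted direction and distance receives a larger factor
print(position_factor(0.32, 11.0, theta_prev=0.30, l_hat_u=11.0))          # close to 1
print(position_factor(0.30 - np.pi, 3.0, theta_prev=0.30, l_hat_u=11.0))   # much smaller
```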
3.2.2. Local Representation with Occlusion Factor

In the local representation, the candidate samples are divided into patches and are sparsely represented by local features, from which the local information of the candidate samples is obtained. This local information is used for adapting to occlusion and other situations that may cause local appearance changes. Thus, in this paper, based on the sparse generative model (SGM) [2], the occlusion information of each patch is defined according to the reconstruction error of the local sparse representation. Furthermore, to adapt to occlusions, the occlusion factor is defined to rectify the similarity computed by SGM. The local response L of each candidate is then represented as follows:

$$L = L_c \beta_c, \qquad (11)$$

where L_c represents the similarity and β_c represents the occlusion factor.

1. Similarity function

First, the image is normalized to 32 × 32 pixels, from which M patches are derived by overlapping sliding windows. With the sparse representation, the patches are represented by a dictionary generated from the first frame by K-Means [32] to compute the reconstruction error. A patch with a large reconstruction error is regarded as an occluded part, and the corresponding element of the occlusion indicator o is set to zero, as follows:

$$o_m = \begin{cases} 1 & \varepsilon_m < \varepsilon_0 \\ 0 & \text{otherwise} \end{cases}, \quad m = 1, 2, \cdots, M, \qquad (12)$$

where ε_m denotes the reconstruction error of the mth patch and ε_0 is a constant used to determine whether the patch is occluded or not. Moreover, a weighted histogram is generated by concatenating the sparse coefficients with the indicator o. The similarity L_c of each candidate is computed on the basis of the weighted histograms of the candidate and the template.

2. Occlusion factor

To further adapt to occlusion in the appearance model, we propose the occlusion factor to rectify the similarity function. The number of patches whose reconstruction errors are larger than ε_0 is computed statistically by:

$$N_c = \sum_{m=1}^{M} o_m, \qquad (13)$$

where a larger value of N_c represents more information on the background or the occlusion. The occlusion factor is defined as follows:

$$\beta_c = \sigma e^{-(1-\omega_c)}, \qquad (14)$$

where σ is a constant and ω_c = N_c / M is the statistical average of N_c. According to Equation (14), there is a positive correlation between the occlusion factor and ω_c. When occlusion occurs, the value of the occlusion factor increases with the increase of ω_c.
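Assuming the per-patch reconstruction errors of a candidate have already been obtained from the sparse coding step, Equations (12)–(14) reduce to a few lines. The values of ε_0 and σ below are placeholders, and the similarity L_c is represented by a toy constant.

```python
import numpy as np

def occlusion_indicator(errors, eps0):
    """Equation (12): o_m = 1 if the patch reconstructs well, 0 otherwise."""
    return (errors < eps0).astype(float)

def occlusion_factor(errors, eps0, sigma=1.0):
    """Equations (13)-(14): beta_c = sigma * exp(-(1 - omega_c)), with omega_c = N_c / M."""
    o = occlusion_indicator(errors, eps0)
    omega_c = o.sum() / o.size
    return sigma * np.exp(-(1.0 - omega_c))

# Example: a candidate whose patches mostly reconstruct well vs. one that is half occluded
errors_clean = np.array([0.05, 0.04, 0.06, 0.03])
errors_occluded = np.array([0.05, 0.40, 0.50, 0.04])
Lc = 0.8  # toy similarity value standing in for the weighted-histogram similarity
print(Lc * occlusion_factor(errors_clean, eps0=0.1))      # Equation (11): L = Lc * beta_c
print(Lc * occlusion_factor(errors_occluded, eps0=0.1))   # smaller local response under occlusion
```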
3.2.3. Holistic Representation Based on Sparse Discriminative Classifier

In the holistic representation, the foreground target is distinguished from the background by the sparse discriminative classifier (SDC) [2]. Initially, n_p positive templates are drawn around the target location and n_n negative templates are drawn farther away from the target location, with the label +1 assigned to the positive templates and the label −1 assigned to the negative templates. In order to reduce the redundancy of the grayscale template sets, a feature selection scheme is proposed in SDC. First, the labels are represented sparsely by the positive and negative template sets, and a sparse vector s is generated from the sparse coefficient vector. Then the sparse vector s is constructed into a diagonal matrix S', and a projection matrix S is obtained by removing the all-zero rows from S'. Finally, both the templates and the candidates are reconstructed into the projected form by combining them with the projection matrix S. With the sparse representation, the projected candidates are represented by the projected template sets, and the reconstruction error of each projected candidate is used to compute its confidence value H_c.

3.2.4. Joint Decision by Combining Position Factors with Local and Holistic Responses

By combining the position factor with the local and holistic responses, the likelihood function of the ith candidate sample is computed as follows:

$$p_i = F_i L_i H_c^i, \quad i = 1, 2, \cdots, N + MU, \qquad (15)$$

For each candidate, the position factor is large if the candidate is located in the position where the target appears with the largest probability, and the local and holistic responses are large if the candidate is similar to the target. Thus, the candidate with the largest likelihood, calculated by Equation (15), is most likely to be identified as the target. The tracking result X_t in the current frame is chosen from the candidate samples by:

$$X_t = X_t^j, \quad j = \underset{i = 1, 2, \cdots, N + MU}{\arg\max}\ p_i, \qquad (16)$$

3.3. Adaptive Template Updating Scheme Based on Target Velocity Prediction

It is necessary to update the templates and the dictionary to capture the appearance changes of the target and to ensure the robustness and effectiveness of the appearance model. In SCM [2], the negative templates of the SDC model are updated every several frames from image regions away from the current tracking result, but the positive templates remain unchanged. In the SGM model, instead of updating the dictionary, the template histogram is updated by taking the most recent tracking result into account. However, the fixed positive templates in the SDC model cannot match the true target effectively when the target's appearance changes drastically, which will inevitably occur in the presence of fast motion. To adapt to the changes of the target's appearance, the positive templates also need to be updated. In view of this, an adaptive template updating scheme based on the prediction of the target velocity is proposed. The positive template set is divided into dynamic and static template sets, as follows:

$$n_p = \{p_d, p_s\}, \qquad (17)$$

where p_d represents the dynamic template set and p_s represents the static template set. Every several frames, we determine whether the target has fast motion by comparing the relative motion distance of the target with a threshold of p pixels. The variable p denotes the fast motion measure of the tracking process. When the relative motion distance l_t is larger than p, we deem that the target moves fast in the current frame; hence, according to motion consistency, the target in the next frame will also move with a large velocity, and the target's appearance may change at the same time. The dynamic positive template set is then updated to account for the appearance changes of the target. To keep the initial information of the target and avoid incorrect representation in case the target's appearance recovers, the static positive template set is retained, as follows:

$$n_p = \begin{cases} \{p_d^*, p_s\} & l_t \ge p \\ \{p_d, p_s\} & \text{otherwise} \end{cases} \qquad (18)$$

where p_d^* represents the updated dynamic template set, which is drawn around the target location in the current frame. By combining the updated dynamic set and the static template set, the target can be well matched with the positive template set. Besides, the negative templates are also updated in the SDC model every several frames in order to distinguish the target from the complex and changeable background during the tracking process.
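The per-frame decision of Equations (15) and (16) and the velocity-gated update of Equation (18) can be wired together as in the following sketch; the fast-motion threshold and the toy responses are illustrative only, and the template sets are represented abstractly.

```python
import numpy as np

def select_target(F, L, Hc):
    """Equations (15)-(16): pick the candidate with the largest likelihood p_i = F_i * L_i * Hc_i."""
    p = F * L * Hc
    j = int(np.argmax(p))
    return j, p[j]

def update_positive_templates(dynamic_set, static_set, l_t, p_pixels, draw_new_dynamic):
    """Equation (18): refresh the dynamic positive templates only when the target moves fast."""
    if l_t >= p_pixels:
        dynamic_set = draw_new_dynamic()   # templates drawn around the current tracking result
    return dynamic_set, static_set         # static templates keep the target's initial appearance

# Toy example with 600 candidates (250 basic + 350 dynamic, as in Section 4)
rng = np.random.default_rng(0)
F, L, Hc = rng.random(600), rng.random(600), rng.random(600)
j, p_j = select_target(F, L, Hc)
print("selected candidate:", j, "likelihood:", p_j)

dyn, stat = update_positive_templates(["d1"], ["s1"], l_t=25.0, p_pixels=20.0,
                                      draw_new_dynamic=lambda: ["d_new"])
print(dyn, stat)   # dynamic set refreshed because l_t >= p
```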
4. Experiments

To evaluate the performance of our proposed MCT algorithm, we test our tracker on the CVPR2013 (OTB2013) dataset, which has been widely used in evaluation [33]. OTB contains 50 video sequences, source codes for 29 trackers and their experimental results on the 50 video sequences. We compare our tracker with eight state-of-the-art tracking algorithms: the compressive tracking (CT) [23], distribution fields for tracking (DFT) [31], online robust image alignment (ORIA) [20], MTT [8], visual tracking via adaptive structural local sparse appearance model (ASLA) [18], L1APG [12], SCM [2], and MIL [22] trackers. The basic information of these trackers is listed in Table 1. As described in the related work, most of these algorithms ignore the temporal correlation of the target states: MTT, ASLA and SCM obtain candidates by the Gaussian distribution model, MIL and CT obtain candidates by dense sampling, and DFT and ORIA obtain candidates by local optimum search. Only L1APG considers the target motion speed in the previous frame and obtains candidates by the constant velocity model. Besides, the competing algorithms cover all the types of appearance model, but all of their appearance models are designed from the perspective of the target's appearance features. We evaluate the CT, DFT, ORIA, MTT, ASLA, L1APG, SCM, and MIL trackers by using the available source codes. We conduct experiments on all 50 video sequences and perform a concrete analysis on eight specific video sequences with different problems. All the problems we are concerned with are covered in these video sequences; thus, the effectiveness of the algorithm can be illustrated. The basic information of the eight test video sequences and their primary problems is shown in Table 2 [33].

Table 1. Basic information of the eight state-of-the-art tracking algorithms.

Algorithm   State Transition Model        Appearance Model
L1APG       Constant velocity model       Holistic representation
MTT         Gaussian distribution model   Holistic representation
ASLA        Gaussian distribution model   Local representation
SCM         Gaussian distribution model   Holistic & local representation
MIL         Dense sampling search         Holistic representation
CT          Dense sampling search         Holistic representation
DFT         Local optimum search          Holistic representation
ORIA        Local optimum search          Holistic representation

Table 2. Basic information of the eight test videos.

Video        Image Size   Target Size   Main Confronted Problems
Coke         640 × 480    48 × 80       Fast motion
Boy          640 × 480    35 × 42       Fast motion
Football     624 × 352    39 × 50       Occlusion
Jogging-1    352 × 288    25 × 101      Occlusion
Girl         128 × 96     31 × 45       Scale variation and rotation
Skating1     640 × 360    34 × 84       Scale variation and rotation
Basketball   576 × 432    34 × 81       Deformation and illumination variation
Sylvester    320 × 240    51 × 61       Illumination variation

In the experiments, for each test video sequence, all information is unknown except for the initial target position in the first frame. The number of candidate samples is 600, including 250 basic candidate samples and 350 dynamic candidate samples. The number of predicted motion directions is seven, and a total of 50 candidate samples are obtained in each direction. In the update, five positive templates are updated every five frames from image regions more than 20 pixels away from the current tracking result.

4.1. Quantitative Evaluation

4.1.1. Evaluation Metric

We choose three popular evaluation metrics: center location error, tracking precision and success rate.

The center location error (CLE) is defined as the Euclidean distance between the center position of the tracking result and the ground truth. A lower average center location error indicates better performance of the tracker. CLE is calculated by:

$$CLE = \frac{1}{N} \sum_{i=1}^{N} dis\left(C_t^i, C_{gt}^i\right), \qquad (19)$$

where N is the total number of frames and dis(·,·) represents the distance between the center position C_t^i of the tracking result and the ground truth C_{gt}^i in the ith frame.
The tracking precision is defined as the ratio of the number of frames in which the error between the target center point calculated by the tracking algorithm and the ground truth is smaller than a threshold th to the total number of frames, as follows:

$$precision(th) = \frac{1}{N} \sum_{i=1}^{N} F\left(dis(C_t^i, C_{gt}^i) < th\right), \qquad (20)$$

where F(·) is a Boolean function whose output is 1 when the expression inside the brackets is true and 0 otherwise, and the threshold th ∈ [0, 100]. The tracking precision depends on the choice of th; in the following experiments, the threshold th is set to 20 to compute the tracking precision.

The success rate is defined as the ratio of the number of frames for which the overlap rate is larger than λ to the total number of frames, as follows:

$$success(\lambda) = \frac{1}{N} \sum_{i=1}^{N} F(overlap_i \ge \lambda), \qquad (21)$$

where the overlap rate is defined as overlap = |τ_gt ∩ τ_t| / |τ_gt ∪ τ_t|, the threshold λ ∈ [0, 1] is fixed, τ_gt represents the region inside the true tracking box generated by the ground truth, τ_t represents the region inside the bounding box given by the tracker, |·| denotes the number of pixels in the region, and ∩ and ∪ denote the intersection and union of two regions, respectively. In the following experiments, instead of using the success rate value at a specific threshold λ, the area under the curve (AUC) is utilized to evaluate the algorithms.
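All three metrics can be computed directly from the per-frame results. The sketch below assumes axis-aligned bounding boxes given as (x, y, w, h) and ground-truth centres, which matches the usual OTB annotation format.

```python
import numpy as np

def center_location_error(centers_pred, centers_gt):
    """Equation (19): mean Euclidean distance between predicted and ground-truth centres."""
    return np.mean(np.linalg.norm(centers_pred - centers_gt, axis=1))

def precision(centers_pred, centers_gt, th=20.0):
    """Equation (20): fraction of frames whose centre error is below the threshold th."""
    d = np.linalg.norm(centers_pred - centers_gt, axis=1)
    return np.mean(d < th)

def overlap(box_a, box_b):
    """Overlap rate |A ∩ B| / |A ∪ B| for boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(boxes_pred, boxes_gt, lam=0.5):
    """Equation (21): fraction of frames whose overlap rate is at least lambda."""
    ov = np.array([overlap(a, b) for a, b in zip(boxes_pred, boxes_gt)])
    return np.mean(ov >= lam)

# Toy example on a few frames
pred_c = np.array([[100.0, 80.0], [105.0, 82.0], [140.0, 90.0]])
gt_c = np.array([[101.0, 80.0], [104.0, 83.0], [110.0, 88.0]])
print(center_location_error(pred_c, gt_c), precision(pred_c, gt_c, th=20))
print(success_rate([(98, 75, 40, 60), (160, 90, 40, 60)],
                   [(100, 78, 40, 60), (100, 78, 40, 60)], lam=0.5))
```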
4.1.2. Experiment on the CVPR2013 Benchmark

To evaluate the performance of our algorithm, we test the MCT tracker on the OTB dataset, which contains 50 videos and 29 algorithms along with their experimental results on the 50 video sequences. We compare our tracker with eight trackers in terms of overall performance and focus especially on the performance in the presence of fast motion (FM) and occlusion (OCC).

1. The comparison results for fast motion and occlusion

Fast motion (FM) and occlusion (OCC) are the two scenarios that we examine. The precision and success rate results for FM and OCC are shown in Figure 4. Our algorithm has the highest precision and success rate compared with the others. In the presence of FM, the tracking precision of MCT is 0.433, which is 10% higher than that of the SCM algorithm, and the success rate is 0.384, which is improved by 8.8% compared with the SCM algorithm. According to the motion consistency, the target state is estimated and merged into the state transition model, appearance model and updating scheme to promote the accuracy in the presence of fast motion. In the presence of OCC, the tracking precision of MCT is 0.655, which is 6% higher than that of the SCM algorithm, and the success rate is 0.539, which is 5.2% higher than that of the SCM algorithm. The performance of MCT is still the best in the presence of OCC. This finding is attributable to the occlusion factor, which is proposed to rectify the holistic and local responses to account for OCC.

Figure 4. The precision plot (a) and the success rate plot (b) of the tracking results on OTB for FM, and the precision plot (c) and the success rate plot (d) of the tracking results on OTB for OCC.

2. The comparison results for overall performance

The precision and success rate results for overall performance on the 50 video sequences are shown in Figure 5. Compared with the other algorithms, the precision and the success rate of MCT are the highest. The precision of MCT is 0.673, which is 6.5% higher than that of SCM, and the success rate is 0.546, which is improved by 4.7% compared with the SCM algorithm. Compared with the other eight trackers, MCT has the best performance.

Figure 5. The precision plot (a) and the success rate plot (b) of the tracking results on OTB for overall performances.

4.1.3. Quantitative Evaluation on Eight Specific Video Sequences

The following experiment is conducted on eight specific video sequences, namely, Basketball, Boy, Football, Jogging-1, Girl, Skating1, Coke, and Sylvester, and the performances of the algorithms are evaluated on the metrics of center location error, tracking precision and success rate.

1. Center location error comparison

The comparison of the center location errors is shown in Figure 6. For Basketball, the CLE of ORIA increases at approximately the 50th frame, and those of CT, MTT, L1APG and MIL increase at approximately the 200th frame. The CLEs of ASLA and SCM increase at approximately the 500th frame. DFT performs similarly to MCT before the 650th frame, but after the 650th frame, DFT has a sharply increasing CLE.
Almost all the methods fluctuate violently with a high CLE because the basketball player moves in a random manner with deformation and illumination variation. The CLE of MCT fluctuates between the 80th frame and the 200th frame, but it recovers soon after with a low CLE. For Boy, the CLE of ORIA increases at approximately the 100th frame because the motion blur begins. The CLEs of DFT, ASLA and SCM increase at approximately the 260th frame and cannot resume tracking because the boy moves fast with deformation, motion blur and scale variation. MCT performs better with smaller fluctuations compared with CT, MTT, L1APG and MIL. For Football, almost all the algorithms have a dramatic fluctuation at approximately the 170th frame because of deformation, but the CLE of MCT is the lowest. For Jogging-1, only MCT is stable with the lowest CLE. For Coke, partial and full occlusions appear from approximately the 30th frame, and the accuracies of the algorithms decrease, except for MCT. For Skating1, Girl and Sylvester, MCT maintains a stable performance with the lowest CLE.
Figure 6. Center location error comparison of the algorithms.
Table 3 lists the average center location errors of the eight comparison algorithms and MCT. Red indicates that the algorithm has the lowest CLE with the best performance, followed by blue and green. We can see that the MCT algorithm has the lowest CLE on seven of these tests and the lowest average CLE over all tests. To sum up, by the CLE comparison with the other algorithms, we can determine that our proposed algorithm, MCT, outperforms the others under different challenges.
Table 3. Average center location error of the algorithms.

Video       CT      DFT     ORIA    MTT     ASLA    L1APG   SCM     MIL     MCT
Basketball  89.11   18.03   152.09  106.80  82.63   137.53  52.90   91.92   12.79
Boy         9.03    106.31  132.90  12.77   106.07  7.03    51.02   12.83   2.10
Football    11.91   9.29    13.61   13.67   15.00   15.11   16.30   12.09   8.86
Jogging-1   92.49   31.44   50.43   108.03  104.58  89.52   132.83  96.34   5.38
Skating1    150.44  174.24  69.62   293.34  59.86   158.70  16.38   139.38  15.38
Girl        18.85   23.98   27.65   4.29    3.28    2.80    2.60    13.67   1.42
Coke        40.49   70.70   49.96   29.98   60.17   50.45   56.81   46.72   19.44
Sylvester   8.556   44.883  5.683   7.554   15.227  26.244  7.968   15.504  8.030
Average     52.610  59.859  62.743  72.054  55.852  60.923  42.101  53.557  9.175

Red is the best, blue is the second, green is the third.

2. Tracking precision comparison
Figure 7 shows the precision results of each algorithm for various values of the threshold th within [0, 100]. MCT is the first algorithm to reach 100% precision, except on Coke and Sylvester. However, for Coke, the precision of MCT is the highest in the range of [0, 80]. For Boy, Jogging-1, Girl, Football and Basketball, MCT is stable and is still the first algorithm to reach 100% precision. In the range of approximately [0, 15], the precision of MCT is lower than those of DFT and SCM on Basketball, and lower than those of DFT, CT, MTT, ORIA, and L1APG on Football, while the other algorithms suffer from drifting and struggle to reach 100% precision. In particular, for Football, when the threshold is larger than 18, MCT is the first to reach 100% precision; its curve rises at a high rate and then increases steadily. For Skating1, in the range of approximately [0, 20], MCT has similar performance to ASLA and SCM, but the precision of MCT is higher than those of the other algorithms. When the threshold is larger than 20, the precision of MCT is the highest. Table 4 lists the average precisions of the eight comparison algorithms and our algorithm MCT for a threshold of 20. The average precision of MCT is higher than those of the other eight algorithms on most of the test sequences.
Thus, MCT outperforms the other comparison algorithms.
Figure 7. Precision comparison of the algorithms.

Table 4. Average precision of the algorithms.

Video       CT     DFT    ORIA   MTT    ASLA   L1APG  SCM    MIL    MCT
Basketball  0.453  0.847  0.087  0.336  0.582  0.329  0.640  0.398  0.869
Boy         0.906  0.477  0.285  0.868  0.440  0.926  0.512  0.868  0.974
Football    0.877  0.903  0.860  0.860  0.846  0.845  0.834  0.875  0.907
Jogging-1   0.244  0.694  0.516  0.249  0.253  0.266  0.256  0.253  0.942
Skating1    0.231  0.189  0.529  0.160  0.717  0.217  0.833  0.271  0.843
Girl        0.808  0.757  0.722  0.953  0.963  0.968  0.969  0.960  0.981
Coke        0.594  0.313  0.503  0.703  0.404  0.510  0.528  0.532  0.819
Sylvester   0.910  0.610  0.939  0.920  0.844  0.735  0.916  0.844  0.915
Average     0.628  0.599  0.555  0.631  0.631  0.560  0.686  0.625  0.906

Red is the best, blue is the second, green is the third.

3. Success rate comparison
Figure 8 shows the success rates of each algorithm at different values of the threshold λ within [0, 1]. The area under the curve (AUC) of Figure 8 is utilized to evaluate the algorithms. For Boy, Jogging-1, Skating1, Coke and Girl, the AUC of MCT is the highest compared with the other algorithms. Especially for Boy, Jogging-1 and Girl, the success rate of MCT decreases stably at a low rate. For Basketball, the performance of MCT is similar to that of the DFT algorithm, but the performance of MCT is better than that of DFT at thresholds lower than approximately 0.3.
For Sylvester, the success rate of MCT is the highest at thresholds in [0, 0.3] and is close to 1, and the performance of MCT is similar to those of CT, SCM and ORIA.
Table 5 lists the average success rate (AUC) of the eight comparison algorithms and MCT. The MCT algorithm achieves the highest average success rate on Boy, Jogging-1, Skating1, Coke and Girl. According to the comparison results, MCT outperforms the other algorithms on the metric of success rate.
Figure 8. Success rate comparison of the algorithms.

Table 5. Average success rate of the algorithms.

Video       CT     DFT    ORIA   MTT    ASLA   L1APG  SCM    MIL    MCT
Basketball  0.278  0.604  0.072  0.223  0.398  0.256  0.475  0.247  0.526
Boy         0.587  0.418  0.189  0.502  0.389  0.725  0.397  0.491  0.818
Football    0.607  0.651  0.513  0.575  0.533  0.553  0.489  0.585  0.580
Jogging-1   0.212  0.218  0.221  0.211  0.213  0.208  0.212  0.214  0.717
Skating1    0.123  0.172  0.242  0.144  0.501  0.143  0.471  0.161  0.487
Girl        0.318  0.293  0.263  0.657  0.701  0.719  0.673  0.403  0.880
Coke        0.241  0.142  0.215  0.449  0.192  0.203  0.342  0.224  0.539
Sylvester   0.658  0.391  0.642  0.641  0.590  0.413  0.677  0.528  0.659
Average     0.378  0.361  0.295  0.425  0.440  0.403  0.467  0.357  0.650

Red is the best, blue is the second, green is the third.

4.2. Qualitative Evaluation
The qualitative analysis results of the algorithms on the eight test videos are shown in Figures 9–16. All the problems we are concerned with are covered in these video sequences: fast motion, scale variation, partial or full occlusion, deformation and illumination variation. We analyze the performance of the algorithms on the eight specific video sequences, each with different problems, one by one.
1. Fast motion
As shown in Figure 9, the content of the video sequence Boy is a dancing boy. In the process of dancing, the target shakes randomly with fast motion, as shown from the 88th to the 386th frame.
The boy's appearance blurs during the fast motion. The tracking box of ORIA cannot keep up with the fast motion of the target, and tracking drift occurs in the 108th frame. In the 160th frame, the target moves backwards, and the tracking box of ORIA loses the target completely. In the 293rd frame, the boy is squatting down; the tracking boxes of SCM and ASLA cannot keep up with the variation of the target, which leads to tracking failure after the 293rd frame. Besides, the tracking boxes of MTT, L1APG and MIL have tracking drift in the 293rd frame. As tracking continues, CT, MTT, SCM, L1APG, MIL and the other algorithms also cannot keep up with the target motion. Only MCT is able to properly track the target.
Figure 9. Tracking results on Boy.
For Coke, a can held by a man moves fast within a small area in a complex background with dense bushes. Figure 10 shows the tracking results on Coke. In the 42nd frame, because of occlusion by plants, the algorithms DFT, L1APG, MIL and ASLA have tracking drifts to different extents, while MCT and MTT perform well. Then the target keeps moving fast from the 42nd to the 67th frame. The motion directions vary rapidly, which results in tracking drifts with the L1APG, MTT and CT algorithms in the 67th frame. The tracking boxes of DFT, ORIA and ASLA lose the target completely in the 67th frame, while the MCT and MIL algorithms can accurately track the target in the presence of fast motion. After the 194th frame, all algorithms except MCT fail to track the target. Throughout the entire process of fast motion, the MCT algorithm can stably track the target.
Figure 10. Tracking results on Coke.
Our algorithm, MCT, performs notably well on videos that involve fast motion. This
is because the MCT algorithm takes into account motion consistency, so that targeted candidate samples can be generated after predicting the target's possible motion state during fast motion, which can cover the region where the target may appear. In addition, the position factor is proposed to decrease the influence of secondary samples, and an adaptive updating strategy is utilized to address the change in the target's appearance brought about by fast motion. All of these factors guarantee highly stable and accurate tracking.
2. Partial and full occlusion
Two ladies are jogging in the video Jogging-1. As shown in Figure 11, partial and full occlusions are caused by the pole during the jogging from the 69th to the 80th frame. In the 69th frame, the DFT algorithm is the first to fail when partial occlusion by the pole occurs, and it turns to track the jogger with the white shirt. In the 80th frame, after the jogger we track reappears without partial or full occlusion, all the algorithms lose the target completely except MCT. In the following frames, while the other algorithms fail to track the target, MCT can stably track the target until the occlusion completely disappears in the 129th frame. The tracking box of DFT has tracking drift in the 221st frame. In the 297th frame, all the algorithms except MCT have no overlap with the true target.
The tracking results on Football are shown in Figure 12. For Football, occlusion occurs when football players compete in the game. The players crowd together when contesting the ball, and partial or full occlusion occurs in this situation. In the 118th frame, the players are crowded in the same place and the target we track is in the middle of the crowd; then partial occlusion occurs at the head of the target, which causes confusion between different football players.
As the are crowded in the same place and the target we tracked is in the middle of the crowds, then the partial competition goes, the algorithms have a tracking drift in the frame 194th. In 292nd, only MCT and DFT occlusion occurs at the head of the target, which causes confusion of different football players. As the can accurately track the target under the full occlusion. The tracking box of L1APG, ASLA, MIL and competition goes, the algorithms have a tracking drift inplayer the frame 194th. In 292nd, onlyWhen MCT and other algorithms lose the target and turn to track another close to the target we tracked. DFT can accurately track the target under the full occlusion. The tracking box of L1APG, ASLA, MIL the occlusion disappears in the 309th frame, only MCT can still track the target. and otherInalgorithms lose thefactor targetis and turn in to the track another player close the target we tracked. MCT, the occlusion proposed local representation. Whento occlusion occurs, the information of theintarget increases, andonly the occlusion factor change correspondingly. Whenbackground the occlusion disappears the 309th frame, MCT can stillwill track the target. Partial andthe fullocclusion occlusionsfactor can be is rectified basedinonthe the local occlusion factor in appearance In MCT, proposed representation. When model. occlusion occurs, the background information of the target increases, and the occlusion factor will change correspondingly. Partial and full occlusions can be rectified based on the occlusion factor in appearance model. Sensors 2018, 18, 572 21 of 27 Sensors 2018, 18, x FOR PEER REVIEW Sensors 2018, 18, x FOR PEER REVIEW 21 of 27 21 of 27 Figure11. 11.Tracking Tracking results results on Figure onJogging-1. Jogging-1. Figure 11. Tracking results on Jogging-1. Figure 12. Tracking results on Football. Figure onFootball. Football. Figure12. 12.Tracking Tracking results results on 3. 3. Deformation and illumination variation Deformation and illumination variation 3. Deformation and illumination variation For Basketball, the target No.9 basketball player’s appearance varies in the game because of For Basketball, the target illumination No.9 basketball player’s appearance varies in the game because of inconstant motion. Moreover, changes because reporters take photos, especially when For Basketball, theMoreover, target No.9 basketball player’s appearance varies in the game because of inconstant motion. illumination changes because reporters take photos, especially when players are shooting. The tracking results of the nine algorithms are shown in Figure 13. As we can players are shooting. The tracking results of the nine algorithms are shown in Figure 13. As we can inconstant motion. Moreover, illumination changes because reporters take photos, especially when see, when the player starts to run, almost all the tracking boxes have tracking drift in the frame 140th, see,are when the starts to run, almost the tracking boxes occurs have drift in the frame 140th, players shooting. The tracking results ofallthe nine algorithms are tracking shown in Figure 13. AsL1APG, we can see, and ORIA loseplayer the target completely. When the deformation in 247th, ORIA, MTT, and ORIA lose the target completely. When the deformation occurs in 247th, ORIA, MTT, L1APG, whenCT theand player starts to run, almost all the tracking boxes have tracking drift in the frame 140th, MIL fail to track the target. 
3. Deformation and illumination variation
For Basketball, the appearance of the target, the No. 9 basketball player, varies during the game because of inconstant motion. Moreover, the illumination changes because reporters take photos, especially when players are shooting. The tracking results of the nine algorithms are shown in Figure 13. As we can see, when the player starts to run, almost all the tracking boxes have tracking drift in the 140th frame, and ORIA loses the target completely. When the deformation occurs in the 247th frame, ORIA, MTT, L1APG, CT and MIL fail to track the target. Only DFT and MCT can track the target with a slight tracking drift. In the 344th frame, the player turns around; MCT, SCM, DFT and ASLA succeed, while the other algorithms lose the true target. In the 510th frame, when the player moves towards a spectator, all the algorithms fail to track the target except MCT and DFT, and DFT has tracking drift. When the illumination changes in the 650th frame, only MCT and DFT can track the target.
For Sylvester, a man shakes a toy under the light. During the whole process, as shown in Figure 14, the illumination and the deformation of the toy change, and both in-plane and out-of-plane rotations occur. The tracking boxes drift when the toy begins to move under the light in the 264th frame. In the 467th frame, the toy lowers its head; in this situation, the DFT algorithm almost loses the target. Then, in the 613rd frame, when the toy rotates, the L1APG algorithm fails to track the target and the other algorithms have tracking drift. In the 956th frame, the toy is rotating under the light, and more algorithms, such as DFT and L1APG, lose the target. However, MCT performs well under deformation and illumination variation, and especially shows robustness to the decreasing illumination between the 264th and the 467th frame.
Figure 13. Tracking results on Basketball.
Figure 14. Tracking results on Sylvester.
First, MCT is based on the particle filtering framework, so that, to some extent, the diversity of candidate samples can handle elastic deformation. Secondly, holistic and local representations are utilized in the appearance model, which shows great robustness by enabling the holistic and local representations to capture the changes caused by the target's deformation and illumination variation. Finally, when the models are updated, appropriately updating the positive and negative templates can also adapt to the changes of appearance and environment.
4. Scale variation and rotation
For Skating1, a female athlete is skating on ice. The athlete slides continuously, so the relative distance to the camera changes continuously; thus, the athlete undergoes scale variation in the scene. Besides, the athlete is dancing in the process, so she also rotates in the scene. It can be seen in Figure 15 that, from the 24th frame to the 176th frame, the size of the target becomes smaller; the algorithms SCM, ASLA and MCT can track the target, while CT, DFT, ORIA, L1APG and MIL lose the true target in the 176th frame. From the 176th to the 371st frame, the target moves to the front of the stage and becomes larger with rotation on the stage, and almost all the algorithms fail to track except MCT and SCM, but SCM has tracking drift. In the 389th frame, SCM loses the athlete, and only MCT can track the target.
Figure 15. Tracking results on Skating1.
For Girl, a woman sitting on a chair varies her position and facing direction by rotating the chair. It can be seen in Figure 16 that the size of the target changes from the 11th to the 392nd frame with the rotation of the chair. Because of the chair's rotation, the size of the target becomes smaller in the 101st frame compared with that in the 11th frame. After that, the girl turns around and becomes larger in the 134th frame. In this process, all the algorithms have tracking drift, and ORIA loses the target completely from the 101st to the 487th frame. When the target turns around from the 101st to the 134th frame, the tracking boxes of ORIA, CT and MIL lose the target. Although the appearance of the target changes drastically from the 227th to the 487th frame, MCT performs well under this circumstance and shows robustness and accuracy compared with the other algorithms.
Based on the particle filter, six affine parameters, including parameters that describe scale variation and rotation, are used in MCT to represent the candidate samples. Thus, the diversity of scaling can be ensured in the prediction of candidate samples, which can adapt to scenes with scale variation.
Figure 16. Tracking results on Girl.
4.3. Discussion
In this part, we further illustrate the selection of related means in MCT and compare the tracking results and computational complexity of MCT and SCM.
4.3.1. Discussion about the Number of Particles
Most of the tracking methods within the particle filter framework exploit the Gaussian distribution model to predict candidate samples. Our algorithm, MCT, predicts targeted candidate samples by estimating the target motion direction and distance; in addition, the position factor is defined to assign weights to candidate samples in different positions. To prove the superiority of the MCT algorithm in terms of resource saving, we show that the samples predicted by our method achieve better performance with fewer particles than other algorithms. To do so, we conduct an experiment on MCT and SCM with 200, 400, and 600 particles. The test sequence consists of 17 videos from the CVPR2013 benchmark dataset with fast motion properties. The results are shown in Figure 17.
Figure 17. Comparison between MCT and SCM with different numbers of particles.
The success rate and tracking precision of MCT with 600 and 400 particles are higher than those of SCM with 600 particles; the success rate and tracking precision of MCT with 200 particles are higher than those of SCM with 400 and 200 particles. MCT can achieve better performance with fewer particles compared with SCM. In this paper, we predict the target state and incorporate the target state information into the algorithm. More candidate samples are obtained in the positions where the target is more likely to appear, and weights are assigned to all of the candidates according to the position factor. Thus, the impact of invalid samples on the computation decreases and better performance can be achieved with fewer particles.
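As an illustration of the difference between the two transition models discussed above, the sketch below contrasts plain Gaussian diffusion around the previous state with sampling around a predicted state. The four-dimensional (x, y, scale, rotation) state, the variances and the simple last-displacement extrapolation are our own illustrative assumptions, not the paper's actual state prediction or affine parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian(prev_state, n=600, sigma=(8.0, 8.0, 0.02, 0.01)):
    """Conventional transition model: diffuse particles around the previous state.

    The state here is (x, y, scale, rotation); sigma must be large to cover fast motion.
    """
    prev = np.asarray(prev_state, dtype=float)
    return prev + rng.normal(0.0, sigma, size=(n, prev.size))

def sample_predicted(prev_states, n=600, sigma=(4.0, 4.0, 0.02, 0.01)):
    """Motion-consistent sampling: centre the particles on a predicted state.

    The prediction below simply extrapolates the displacement between the two most
    recent states (an assumed constant-velocity-style prediction used for illustration).
    """
    s_prev, s_curr = (np.asarray(s, dtype=float) for s in prev_states[-2:])
    predicted = s_curr + (s_curr - s_prev)        # predicted direction and distance
    return predicted + rng.normal(0.0, sigma, size=(n, s_curr.size))

# With the same particle budget, predicted-state samples concentrate where a fast-moving
# target is expected, instead of spreading a large variance around the previous position.
history = [(100.0, 80.0, 1.00, 0.0), (118.0, 83.0, 1.01, 0.0)]
cands = sample_predicted(history, n=200)
print(cands.mean(axis=0))   # roughly (136, 86, 1.02, 0.0)
```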
4.3.2. Discussion of Position Factor
According to the motion consistency, the states of the target in successive frames are correlated. Therefore, the probability of the target appearing in each position is different, and the importances of candidate samples located in different positions also differ. Thus, a double Gaussian probability model based on the motion consistency in successive frames is proposed to simulate the target state, and from it the position factor is derived to assign weights to candidate samples and represent the differences between candidate samples at different locations. To demonstrate the significant impact of the position factor, an experiment is conducted with 600 particles on MCT and on MCT without the position factor (MCTW); the test sequence consists of 17 videos from the CVPR2013 benchmark dataset with fast motion properties. The results are shown in Figure 18. According to the experimental results, the precision of MCT is 6.8% higher than that of MCTW, and the success rate is 6.7% higher than that of MCTW. Thus, the position factor improves the tracking precision and success rate.
Figure 18. The precision plot (a) and the success rate plot (b) on MCT and MCTW for FM.
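The sketch below illustrates how a position factor built from two Gaussians centred on the predicted position can down-weight candidates that contradict the motion consistency. The separable per-coordinate form, the variances and the multiplicative combination with the appearance response are assumptions made for illustration; they are not the paper's exact double Gaussian model.

```python
import numpy as np

def position_factor(candidates, predicted_pos, sigma_x=10.0, sigma_y=10.0):
    """Illustrative position factor: product of two 1-D Gaussians around the predicted position.

    Candidates close to where motion consistency says the target should be receive
    weights near 1; distant (secondary) candidates are strongly down-weighted.
    """
    c = np.asarray(candidates, dtype=float)          # shape (n, 2): candidate centres
    px, py = predicted_pos
    gx = np.exp(-((c[:, 0] - px) ** 2) / (2.0 * sigma_x ** 2))
    gy = np.exp(-((c[:, 1] - py) ** 2) / (2.0 * sigma_y ** 2))
    return gx * gy

def score_candidates(position_weights, responses):
    """Combine the position factor with the (holistic/local) appearance responses."""
    return np.asarray(position_weights, dtype=float) * np.asarray(responses, dtype=float)

# Toy example: the second candidate has a slightly better appearance response but sits
# far from the predicted position, so the first candidate wins after weighting.
cands = np.array([[136.0, 86.0], [180.0, 40.0]])
weights = position_factor(cands, predicted_pos=(136.0, 86.0))
print(score_candidates(weights, responses=[0.80, 0.85]))
```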
4.3.3. Discussion of Computational Complexity
The computational complexity is an important evaluation factor, so we also evaluate the computational complexity of the MCT algorithm. Since both the MCT and SCM algorithms are based on the particle filter, we provide a complexity comparison between these two algorithms. The computational complexity is evaluated by the tracking speed: the higher the tracking speed, the lower the computational complexity of the algorithm. The experiments on computational complexity are all implemented on a lab computer with a 3.20 GHz CPU, 4.0 GB RAM and the Windows 10 operating system, and the results on the eight specific videos are shown in Table 6. The tracking speed of MCT is close to that of SCM. Simultaneously, the tracking precision of MCT achieves a 10% increment on fast motion compared with SCM, as shown in the results of the quantitative evaluation. This is because the target state prediction, the position factor and the target velocity prediction were developed in the state transition model, the appearance model and the updating scheme, respectively, compared with SCM.

Table 6. Tracking speed comparison of the algorithms.

Tracking Speed (FPS)  Coke    Sylvester  Skating1  Basketball  Jogging-1  Football  Boy     Girl    Average
SCM                   0.3138  0.3390     0.2780    0.3801      0.3759     0.4292    0.3512  0.3171  0.3480
MCT                   0.3089  0.2951     0.2557    0.3146      0.3345     0.3417    0.3498  0.3683  0.3211

(The target size decreases from Coke to Girl.)

In addition, the experimental results show that the computational complexity of the tracker is related to the target size. This is because, in the appearance model, the target is divided into patches, which are represented sparsely by the dictionary. In fact, the tracking speed is also influenced by other factors, such as the different update frequencies in videos with different scenes. Accordingly, the computational complexity of the tracker decreases if the size of the target becomes smaller, although fluctuations still exist.

5. Conclusions
In this paper, we propose an object tracking algorithm based on motion consistency. The target state is predicted based on the motion direction and distance and is later merged into the state transition model to predict additional candidate samples. The appearance model first exploits the position factor to assign weights to candidate samples to characterize the different importance of candidate samples in different positions.
At the same time, the occlusion factor is utilized to rectify the similarity in the local representation and obtain the local response, and the candidates are represented sparsely by positive and negative templates in the holistic representation to obtain the holistic response. The tracking result is determined by combining the position factor with the local and holistic responses of each candidate. The adaptive template updating scheme based on target velocity prediction is proposed to adapt to appearance changes when the object moves quickly by continuously updating the dynamic positive templates and negative templates. According to the qualitative and quantitative experimental results, the proposed MCT performs well against the state-of-the-art tracking algorithms, especially in dealing with fast motion and occlusion, on several challenging video sequences.
Supplementary Materials: The following are available online at www.mdpi.com/1424-8220/18/2/572/s1, Video S1: Figure 9-Boy, Video S2: Figure 10-Coke, Video S3: Figure 11-Jogging-1, Video S4: Figure 12-Football, Video S5: Figure 13-Basketball, Video S6: Figure 14-Sylvester, Video S7: Figure 15-Skating1, Video S8: Figure 16-Girl.
Acknowledgments: The authors would like to thank the National Science Foundation of China: 61671365, the National Science Foundation of China: 61701389, the Joint Foundation of Ministry of Education of China: 6141A020223, and the Natural Science Basic Research Plan in Shaanxi Province of China: 2017JM6018.
Author Contributions: In this work, Lijun He and Fan Li conceived the main idea, designed the main algorithms and wrote the manuscript. Xiaoya Qiao analyzed the data and performed the simulation experiments. Shuai Wen analyzed the data and provided important suggestions.
Conflicts of Interest: The authors declare no conflict of interest.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).