Article
Robust Object Tracking Based on Motion Consistency
Lijun He, Xiaoya Qiao, Shuai Wen and Fan Li *
Department of Information and Communication Engineering, School of Electronic and Information Engineering,
Xi’an Jiaotong University, Xi’an 710049, China; jzb2016125@mail.xjtu.edu.cn (L.H.);
qxy0212@stu.xjtu.edu.cn (X.Q.); wen201388@stu.xjtu.edu.cn (S.W.)
* Correspondence: lifan@mail.xjtu.edu.cn; Tel.: +86-29-8266-8772
Received: 28 December 2017; Accepted: 10 February 2018; Published: 13 February 2018
Abstract: Object tracking is an important research direction in computer vision and is widely used in
video surveillance, security monitoring, video analysis and other fields. Conventional tracking
algorithms perform poorly in specific scenes, such as those with fast target motion and occlusion.
The candidate samples may miss the true target because of its fast motion. Moreover, the appearance
of the target may change with movement. In this paper, we propose an object tracking algorithm
based on motion consistency. In the state transition model, candidate samples are obtained by the
target state, which is predicted according to the temporal correlation. In the appearance model,
we define the position factor to represent the different importance of candidate samples in different
positions using the double Gaussian probability model. The candidate sample with the highest likelihood
is selected as the tracking result by combining the holistic and local responses with the position factor.
Moreover, an adaptive template updating scheme is proposed to adapt to the target’s appearance
changes, especially those caused by fast motion. The experimental results on a 2013 benchmark
dataset demonstrate that the proposed algorithm performs better in scenes with fast motion and
partial or full occlusion compared to the state-of-the-art algorithms.
Keywords: object tracking; motion consistency; state prediction; position factor; occlusion factor
1. Introduction
Object tracking is an important application for video sensor signal and information processing,
which is widely applied in video surveillance, security monitoring, video analysis, and other areas.
Although numerous methods have been proposed, it is still a challenging problem to implement
object tracking in particular scenes, such as sports scenes for player tracking and in security scenes
for criminal tracking. These scenes are characterized by fast motion of the target, occlusion and
illumination variation. Improving the accuracy and robustness of tracking in these particular scenes is
an open problem.
Object tracking is the determination of the target state in continuous video frames. The particle
filter is widely used in object tracking, which uses the Monte Carlo method to simulate the probability
distribution and is effective in estimating the non-Gaussian and nonlinear states [1]. In the particle
filter, the state transition model and the appearance model are two important types of probabilistic
models. The state transition model is used to predict the current target state based on the previous
target states, which can be divided into the Gaussian distribution model and the constant velocity model.
The Gaussian distribution model assumes that the velocities between adjacent frames are uncorrelated.
Thus, the candidate samples are predicted by simply adding a Gaussian disturbance to the previous
target state [2–8]. This method has been widely used due to its simplicity and effectiveness. The method
performs well when the target moves short distances in random directions. However, in the presence
of fast motion, a large variance and more particles are needed to avoid the loss of the true target, which
will result in higher computational burden. The constant velocity model assumes a strong correlation
between the velocity of the current state and those of previous states, and candidate samples are
predicted from the previous states and velocities with the addition of a stochastic disturbance [9–15].
This effect can be regarded as adding a disturbance to a new state that is far away from the previous
state in terms of velocity. Thus, when the velocity in the current frame is notably smaller than that
in the previous frame, an improper variance of the disturbance will also cause the loss of the target.
The appearance model is used to represent the similarity between the real state of the target and
the observation of the candidates. It can be divided into generative model [8,15–20], discriminative
model [21–27] and collaboration model [2–4,6,28–30]. Tracking based on generative model focuses
on foreground information and ignores background information. The candidates are represented
sparsely by a foreground template set and the candidate with the minimum reconstruction error can
be selected as the target. Tracking based on discriminative model considers the target tracking process
as a classification process, and a classifier is trained and updated online to distinguish the foreground
target from the background. Tracking based on collaboration model takes advantage of the generative
and discriminative models to make joint decisions. Most existing target tracking methods have obvious
disadvantages in complex scenes. The candidate samples selected in the state transition model without
prediction may not include the real target, which might cause tracking drift or even tracking failure
especially when the target has fast motion in the video. Moreover, if the target’s appearance changes
drastically during movement, only considering the target’s appearance feature in appearance model
may reduce the matching accuracy between the target and the candidate samples.
In this paper, we propose an object-tracking algorithm based on motion consistency (MCT). MCT
utilizes the temporal correlation between continuous video frames to describe the motion consistency
of the target. In the state transition model, the candidate samples are selected based on the predicted
target state. In the appearance model, the candidate samples are represented by the position factor,
which is defined to describe the motion consistency of the target. The tracking result is determined by
combining the position factor with the local and holistic responses.
The main contributions of our proposed algorithm are summarized as follows:
1. Candidate sampling based on target state prediction
A conventional transition model may result in loss of the target if the number of candidate
samples is limited. This may lead to tracking drift or even tracking failure. In this paper, candidate
sampling based on target state prediction is proposed. Considering the motion consistency, the target
state in the current frame is predicted based on the motion characteristics of the target in previous
frames. Then, the candidate samples are selected according to the predicted target state to overcome
the tracking drift.
2. Joint decision by combining the position factor with local and holistic responses
The appearance probability of the target may differ in different positions due to the motion
consistency. Different from conventional algorithms, which neglect the temporal correlation of
the video frames in the appearance model, the position factor is defined to characterize the
importance of candidate samples in different positions according to the double Gaussian probability
model. Meanwhile, the occlusion factor is proposed to rectify the similarity computed in the local
representation to obtain the local response, and the candidates are represented sparsely by positive
and negative templates in the holistic representation to obtain the holistic response. Then, the tracking
decision is made by combining the position factor with the responses of the local and holistic
representations. This approach makes full use of the temporal correlation of the target state and
the spatial similarity of local and holistic characteristics.
3. Adaptive template updating based on target velocity prediction
The fast motion of the target will inevitably bring about appearance changes of the target and
background. To adapt to the changes, an adaptive template updating approach is proposed, which
is based on the predicted velocity of the target. The positive template set is divided into dynamic
and static template sets. When the target moves fast, the dynamic positive template set is updated
to account for the appearance changes of the target. To keep the primitive characteristics of the
target, the static positive template set is retained. Moreover, the negative holistic template set is updated
continuously to adapt to the background changes.
The rest of the paper is organized as follows: Section 2 reviews the related work. In Section 3,
an object tracking algorithm based on motion consistency is proposed. Section 4 gives the experimental
results with quantitative and qualitative evaluations. This paper’s conclusions are presented
in Section 5.
2. Related Work
A large number of algorithms have been proposed for object tracking, and an extensive review is
beyond the scope of this paper. Most existing algorithms mainly focus on both state transition model
and appearance model based on particle filter. In state transition model, the change in the target state is
estimated and candidate samples are predicted with various assumptions. Most of the state transition
models based on particle filter can be divided into Gaussian distribution model and constant velocity
model, according to the assumption on whether the velocity is temporally correlated with previous
frames. Appearance models, which are used to represent the similarity between the real state of the
target and the observation of the candidates, can be categorized into generative models, discriminative
models and combinations of these two types of models.
2.1. State Transition Models
The state transition model simulates the target state changes over time by particle prediction.
Most of the algorithms use a Gaussian distribution model to predict candidate samples. The Gaussian
distribution model regards the previous target state as the current state, and then obtains candidate
samples by adding a Gaussian noise disturbance to the state space. For example, in [2,3], the particles
are predicted by a Gaussian function with fixed mean and variance, thus the coverage of particles
is limited. It describes target motion best for short-distance motion with random directions, but it
performs poorly when the target has a long-distance motion in a certain direction. Furthermore, in [5],
three types of particles whose distributions have different variances are prepared to adapt to the
different tasks. In visual tracking decomposition [7], the motion model is represented by a combination
of multiple basic motion models, each of which covers a different type of motion using a Gaussian
disturbance with a different variance. To a certain extent, the model can cover abrupt motion with
a large variance. However, fixed models cannot adapt to complex practical situations. In addition,
there are other algorithms which ignore the temporal correlation of the target in continuous frames.
For example, multi-instance learning tracker (MIL) [22] utilizes a simplified particle filtering method
which sets a searching radius around the target in the previous frame and uses dense sampling to
obtain samples. A small searching radius cannot adapt to fast motion; on the other hand, a large searching
radius brings a high computation burden. Algorithms [20,31] based on local optimum search also obtain
the candidates without prediction; thus, a larger searching radius is likewise necessary when
fast motion occurs.
In subsequent studies, some of the algorithms use the constant velocity model to predict candidate
samples, which assumes that the velocity of the target in the current frame is correlated to that in the
previous frame. In [9,10], the motion of the target is modeled by adding the motion velocity with the
frame rate (motion distance) in the previous frame to the previous state and then disturbing it with
Gaussian noise. In [11,12], the average velocity of the target in previous frames is considered to obtain
the translation parameters to model the object motion. This is suitable for the situation that the velocity
in the current frame is strongly correlated with that in the previous frame. However, the actual motion
of the target is complex. When the target in the current frame is not correlated with that in the previous
frame, the improper variance of Gaussian disturbance and finite number of particles will also lead to
tracking drift or even failure.
2.2. Appearance Models
Tracking based on generative models focuses more on foreground information. The candidates
are represented by the appearance model and the candidate with the minimum reconstruction error
can be selected as the target. In [8], multi-task tracking (MTT) represents each candidate by the same
dictionary and considers each representation as a single task, and then formulates object tracking as a
multi-task sparse learning problem by joint sparsity. In [19], a local sparse appearance model with
histograms and mean shift is used to track the target. Tracking based on the L1 minimum (L1APG)
in [12] models target appearance by target templates and trivial templates with sparse representation.
However, it neglects the background information. Thus, in the presence of background clutter or
occlusion, tracking based on generative model cannot distinguish the target from the background.
Tracking based on discriminative models, in essence, considers the target tracking process as a
classification task and distinguishes the foreground target from the background using the classifier with
the maximum classifier response. In [24,25], an online boosting classifier is proposed to select the most
discriminating features from the feature pool. In [26], the supervised and semi-supervised classifiers
are trained on-line, in which the labeled first frame sample and unlabeled subsequent training samples
are used. The discriminative model considers the background and foreground information completely
to distinguish the target. However, a classifier with an improper updating scheme or limited training
samples will lead to bad classification.
Tracking methods that combine generative and discriminative models exploit the advantages of
these two models, fusing them and using them together to make decisions. Tracking via the sparse
collaborative appearance model (SCM) [2] combines a sparse discriminative classifier with a sparse
generative model to track the object. With the sparse discriminative classifier, the candidate samples are
represented sparsely by holistic templates and the confidence value is computed. In sparse generative
model, the candidate samples are divided into patches and represented by a dictionary to compute the
similarity between each candidate sample and the template. However, the positive templates remain unchanged;
thus the discriminative classifier cannot adapt to drastic appearance variations. In object tracking via
key patch sparse representation [4], the overlapped patch sampling strategy is adopted to capture
the local structure. In the occlusion prediction scheme, the positive and negative bags are obtained
by sampling patches within and outside the bounding box. Then the SVM (support vector machine)
classifier is trained by the bags with labels, by which the patches are predicted to be occluded or not.
Meanwhile, the contribution factor is computed via sparse representation. However, the SVM classifier
is not timely updated, which may cause error accumulation in occlusion decisions. In object tracking
via a cooperative appearance model [6], an object tracking algorithm based on cooperative appearance
model is proposed. The discriminative model is established in local representation with positive and
negative dictionaries. The reconstruction error of each candidate is computed in holistic representation,
and the impact regions are proposed to combine the local and holistic responses.
In this paper, we propose a tracking algorithm based on motion consistency, in which the candidate
samples are predicted by a state transition model based on target state prediction. In addition, in the
appearance model, the position factor is proposed to assist in target selection. As revealed in [2,3,5,7],
the candidate samples are predicted simply by using Gaussian distribution model, which is effective
for short-distance motion with random direction. Then, the algorithms in [9–12] take the velocity of the
target into consideration, to a certain extent, which can adapt to motion that is correlated to the previous
state, even long-distance motion. Inspired by the above work, in this paper, the predicted target state
information is added to the conventional Gaussian distribution model. However, the proposed model
is different from the constant velocity model in that only dynamic candidate samples are predicted with
motion prediction; also, some basic candidate samples around the previous state are predicted with
Gaussian disturbance. In this way, varied candidate samples can adapt to complex motions in addition
to fast motion. On the other hand, most tracking algorithms focus only on the similarity between the candidate
samples and templates, such as that in [2]. However, when the target appearance has a large variation,
the true target cannot be selected by only considering the similarity. Thus, in the appearance model,
considering the temporal correlation of the target state, the position factor is proposed to characterize the importance of candidate samples in different positions and collaborates with the representation responses to select the target.
3. Robust Object Tracking Based on Motion Consistency
Conventional algorithms which ignore the temporal correlation of the target states in the state transition model may lose the true target, especially when fast motion occurs, and if the appearance of the target changes drastically, algorithms that only focus on the target's appearance feature in the appearance model will struggle to separate foreground targets from the background. Meanwhile, an appropriate updating scheme is necessary to maintain the robustness of the algorithm. Thus we propose a robust object tracking algorithm based on motion consistency.
The tracking process of our proposed MCT algorithm is shown in Figure 1. The target is assumed to be known in the first frame. In the state transition model, the target state, including the motion direction and distance in the current frame, is predicted according to the target states in previous frames. The candidate samples are predicted by the state transition model according to the predicted target state. Next, all the candidate samples are put into the appearance model. In the appearance model, we propose the position factor according to the double Gaussian probability model to assign weights to candidate samples located in different positions. Meanwhile, in the local representation, the similarity between the candidate and the target is computed with the sparse representation by dictionary to obtain the local response of each candidate. In the holistic representation, the holistic response of each candidate sample is computed with the sparse representation by positive and negative templates. Finally, the candidate sample with the highest likelihood is decided as the tracking result by combining the holistic and local responses with the position factor. To adapt to the appearance changes of the target, especially those caused by fast motion, the template sets should be updated adaptively based on the target velocity prediction.
Figure 1. The candidates are predicted by the target state prediction based state transition model. In the appearance model, the position factor is proposed to characterize the importance of candidate samples in different positions. In the target velocity prediction based updating model, the template sets are updated adaptively to ensure the robustness and effectiveness of the MCT algorithm.
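To make the flow in Figure 1 concrete, the following is a minimal sketch of one MCT tracking step. The helper names (predict_target_state, sample_candidates, position_factor, local_response, holistic_response, update_templates) are hypothetical stand-ins for the models described in Sections 3.1-3.3; this illustrates the decision rule only and is not the authors' implementation.

```python
import numpy as np

def mct_track_one_frame(frame, prev_states, templates,
                        predict_target_state, sample_candidates,
                        position_factor, local_response,
                        holistic_response, update_templates):
    """One tracking step following Figure 1 (illustrative sketch only).

    prev_states: previous target states (most recent last).
    templates:   positive/negative template sets and the patch dictionary.
    The callable arguments are hypothetical implementations of Sections 3.1-3.3.
    """
    # State transition model: predict motion direction/distance, then sample candidates.
    predicted_state = predict_target_state(prev_states)                 # Section 3.1.1
    candidates = sample_candidates(prev_states[-1], predicted_state)    # Section 3.1.2

    # Appearance model: combine position factor with local and holistic responses.
    likelihoods = []
    for x in candidates:
        F = position_factor(x, prev_states)          # Eq. (7)
        L = local_response(frame, x, templates)      # Eq. (11)
        H = holistic_response(frame, x, templates)   # Section 3.2.3
        likelihoods.append(F * L * H)                # Eq. (15)

    best = int(np.argmax(likelihoods))               # Eq. (16)
    tracked_state = candidates[best]

    # Adaptive template update based on the predicted velocity (Section 3.3).
    templates = update_templates(frame, tracked_state, prev_states, templates)
    return tracked_state, templates
```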
3.1. State Transition Model Based on Target State Prediction
The state transition model is used to select the candidate samples in the current video frame. Conventional methods based on particle filter usually use the Gaussian distribution model to obtain the candidate samples around the position where the target appears in the previous frame. When the target has fast motion, the selection of candidate samples without prediction may result in the loss of the true target. Therefore, considering the characteristics of motion consistency, we propose a state transition model based on target state prediction to predict candidate samples.
3.1.1. Target State Prediction
Target state prediction is the estimation of the target position at frame t. We predict a set of target motion directions and distances {(θ̂t1, lˆt1), (θ̂t2, lˆt2), · · · , (θ̂tu, lˆtu), · · · , (θ̂tU, lˆtU)} at frame t according to the motion consistency, which is shown in Figure 2.
Figure 2. The relative motion of the target in the previous frame, including motion direction and motion distance, is already known. Then the target motion direction in the current frame is predicted by adding a prediction range of γ to the previous motion direction, and the motion distance is predicted according to the previous motion distance with its variation rate.
It is assumed that the target has a steady motion direction and velocity. We predict the target motion direction by using the motion direction of the target at frame t − 1. As shown in Figure 2, the relative motion direction of the target from frame t − 1 to frame t − 2 is θt−1. Then, the target motion direction is predicted as a set of values based on the value θt−1. Taking θt−1 as the center, we select U angles uniformly in the interval [θt−1 − γ, θt−1 + γ], where γ is a constant, to determine the prediction range to reduce the estimation error. Then, the predicted values of the motion directions are:
θ̂tu = θt−1 − γ + [2γ/(U − 1)] × (u − 1),  u = 1, 2, . . . , U,  (1)
where U is the number of predicted directions.
We also predict the motion distance based on the relative motion distances within the three frames before the current frame. The predicted distance in each direction is:
lˆtu = ŵt × lt−1,  (2)
where lt−1 is the relative motion distance of the target from frame t − 1 to frame t − 2. ŵt describes the variation rate of the motion distance in the previous frames and is defined as:
ŵt = α × (lt−1/lt−2) + (1 − α) × (lt−2/lt−3),  (3)
where α is a positive constant that is less than 1.
Thus, we obtain a set of target motion directions and distances {(θ̂t1, lˆt1), (θ̂t2, lˆt2), · · · , (θ̂tu, lˆtu), · · · , (θ̂tU, lˆtU)}.
3.1.2. Three-Step Candidate Sampling
The conventional Gaussian distribution model generates the candidate samples near the target position in the previous frame. For random motion over a short distance, the true target position can be successfully covered by the samples. However, when fast motion occurs, these predicted samples may lose the true target position, which results in tracking drift or tracking failure. To avoid the loss of the true target position, a three-step candidate sampling scheme is proposed according to target state prediction, which is illustrated in Figure 3.
Figure 3. First, the basic candidate samples (the blue dots) are obtained around the target location at frame t − 1. Second, the basic candidate samples with the largest relative distance in each predicted direction will be selected as the benchmark candidate samples (the yellow dots). Third, adding more dynamic candidate samples (the green dots) in the predicted direction begins from the benchmark within a predicted distance.
1. First step: Basic candidate sampling
The basic candidate samples in the current frame are predicted by the Gaussian distribution model. The current state Xti is predicted by the state in the previous frame Xt−1i and the disturbance by Gaussian noise XGauss as follows:
Xti = Xt−1i + XGauss,  (4)
where Xti represents the ith candidate in frame t, and the state of each candidate is represented by an affine transformation model Xti with six parameters {α1i, α2i, α3i, α4i, α5i, α6i}, in which {α1i, α2i} represents translation, {α3i, α4i} represents scaling and {α5i, α6i} represents rotation. Each parameter in the state vector {α1i, α2i, α3i, α4i, α5i, α6i} is assumed to obey a Gaussian distribution with a different mean and variance according to the actual situation. Next, the relative translation between the N basic candidate samples and the target in the previous frame is computed and converted to a polar coordinate form {(θt1, lt1), (θt2, lt2), · · · , (θtn, ltn), · · · , (θtN, ltN)}, which covers the direction and distance information of each candidate sample.
2. Second step: Benchmark candidate sampling
To make the candidate samples cover as large a region as possible where the target may appear, considering both efficiency and effectiveness, we simplify the problem by selecting the benchmark with the largest relative distance in each predicted direction and then adding more dynamic candidate samples in the same direction within a predicted distance, beginning from the benchmark.
The benchmark candidate samples {(θ̂t1, lmax1), (θ̂t2, lmax2), · · · , (θ̂tu, lmaxu), · · · , (θ̂tU, lmaxU)} are selected from the basic candidate samples in accordance with the motion direction predicted by Equation (1) as follows:
(θ̂tu, lmaxu) = (θtj, ltj),  j = argmax_{ i ∈ { i | θti = θ̂tu, 1 ≤ i ≤ N } } lti,  u = 1, 2, · · · , U,  (5)
In the θ̂tu direction, the basic candidate sample with the largest relative distance will be selected as
the benchmark candidate, as shown in Figure 3. If no basic candidate sample falls in the θ̂tu direction, the benchmark candidate sample is defined as (θ̂tu, 0), with a relative distance of zero.
3. Third step: Dynamic candidate sampling
Beginning from the benchmark, in each predicted direction, a total of M dynamic candidate
samples is placed evenly at a distance of lˆtu /M apart, as shown in Figure 3. There are UM dynamic
candidate samples:
(θ̂tu, lmaxu + m × lˆtu/M),  u = 1, 2, · · · , U,  m = 1, 2, · · · , M,  (6)
Thus,
we obtain the relative position information of all the candidate
samples including N basic candidate samples and UM dynamic candidate samples
{(θt1 , lt1 ), (θt2 , lt2 ), · · · , (θtN , ltN ), (θtN +1 , ltN +1 ), · · · , (θtN +UM , ltN +UM )} in the frame t.
Moreover,
these relative polar coordinates are converted into translation parameters, and the remaining four
parameters are assumed to obey the Gaussian distribution. All the candidate samples predicted in frame t are represented by Xt = {Xt1, Xt2, · · · , XtN+UM}.
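As a concrete illustration of Equations (1)-(6), the sketch below predicts the direction/distance set and then generates basic, benchmark and dynamic candidate samples in relative polar coordinates. The counts U = 7, N = 250 and M = 50 follow the sample numbers reported in Section 4, while gamma, alpha, sigma and the angular tolerance are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

def predict_directions_distances(theta_prev, l_prev, l_prev2, l_prev3,
                                 U=7, gamma=np.pi / 6, alpha=0.6):
    """Target state prediction, Eqs. (1)-(3); gamma and alpha are illustrative."""
    u = np.arange(1, U + 1)
    theta_hat = theta_prev - gamma + 2.0 * gamma / (U - 1) * (u - 1)        # Eq. (1)
    w_hat = alpha * (l_prev / l_prev2) + (1 - alpha) * (l_prev2 / l_prev3)  # Eq. (3)
    l_hat = np.full(U, w_hat * l_prev)                                      # Eq. (2)
    return theta_hat, l_hat

def three_step_sampling(theta_hat, l_hat, N=250, M=50, sigma=4.0, rng=None):
    """Three-step candidate sampling (Section 3.1.2) in relative polar coordinates."""
    rng = np.random.default_rng() if rng is None else rng

    # Step 1: basic candidates from a Gaussian disturbance around the previous position.
    dx, dy = rng.normal(0.0, sigma, N), rng.normal(0.0, sigma, N)
    theta_basic, l_basic = np.arctan2(dy, dx), np.hypot(dx, dy)
    samples = list(zip(theta_basic, l_basic))

    for theta_u, l_u in zip(theta_hat, l_hat):
        # Step 2: benchmark = basic sample with the largest distance near this direction
        # (an angular tolerance band stands in for the exact direction match of Eq. (5)).
        diff = np.angle(np.exp(1j * (theta_basic - theta_u)))
        mask = np.abs(diff) < np.pi / 36
        l_max = l_basic[mask].max() if mask.any() else 0.0  # missing benchmark -> distance 0

        # Step 3: M dynamic candidates spaced l_u / M apart beyond the benchmark, Eq. (6).
        for m in range(1, M + 1):
            samples.append((theta_u, l_max + m * l_u / M))
    return samples

# Example: previous direction 0.3 rad; recent relative distances 12, 10 and 9 pixels.
theta_hat, l_hat = predict_directions_distances(0.3, 12.0, 10.0, 9.0)
candidates = three_step_sampling(theta_hat, l_hat)   # 250 basic + 7 x 50 dynamic samples
```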
3.2. Position Factor Based Cooperative Appearance Model
In the presence of fast motion, the target's appearance changes drastically. Therefore, algorithms that only consider the changes of the appearance feature in the appearance model have difficulty distinguishing the foreground target from a complex background. In view of this, we propose the position factor to
reduce the effect of appearance changes caused by target movement. Due to the motion consistency,
there is a different probability of the target appearing in each position. Generally, a candidate sample located where the target is likely to appear has a larger probability of being selected as the target. Thus,
candidate samples of this kind should be assigned larger weights, while other candidate samples
should be assigned smaller weights. Therefore, a quantitative parameter named the position factor
is proposed to assign different weights to candidate samples in different positions to rectify the
representation error. The likelihood of candidate sample is defined by combining the position factor
with the responses of local and holistic representations.
3.2.1. Position Factor Based on Double Gaussian Probability Model
The position factor of the ith candidate sample is computed by multiplying the direction score SiA
and distance score Sli :
Fi = SiA Sli,  i = 1, 2, · · · , N + MU,  (7)
To compute the direction and distance scores, a double Gaussian probability model is used according
to the target states in previous frames, which includes the direction Gaussian probability model and
distance Gaussian probability model.
1. Direction score
The motion direction of the target at frame t is consistent with that in the previous frame. It is
assumed that the probability of the target appearing in the θt−1 direction is the largest, and the probability of the target appearing in the direction opposite to θt−1 is the smallest. Thus, a Gaussian probability model with mean θt−1 is proposed to simulate the distribution of the target's motion
direction. Accordingly, the probability of the candidate sample lying in θt−1 being selected as the
target is large and it should be assigned a large weight. In contrast, the candidate sample lying in
the direction opposite θt−1 should be assigned a small weight. Then, the weight assignment model
is transformed according to the trend of the direction Gaussian probability model, from which the
direction score of the ith candidate sample is derived as follows:
SiA = [T1/(1 − e−π)] e−(θti −θt−1 )² + (1 − T1/(1 − e−π)),  i = 1, 2, · · · , N + MU,  (8)
where θti denotes the relative direction between the ith candidate and the target at frame t − 1, θt−1 is the motion direction of the target at frame t − 1, and T1 is a constant that depends inversely on (θt−1, pmaxA) and (θt−1 − π, pminA).
2. Distance score
The motion velocities within two continuous frames are consistent. In the θt−1 direction, the
velocity of the target at frame t is consistent with that of the previous frame. Thus, according to the
distance predicted by Equation (2), it is assumed that the probability of the target appearing near the
relative distance lˆtu is the largest. Conversely, in the direction opposite θt−1 , the motion velocity of
the target is small, and the probability of the target at frame t appearing near the target at frame t − 1
is the largest. Accordingly, in each direction, there is a distance with the largest probability of target appearance, which depends on the direction in which the candidate sample is located:
li = ([T2/(1 − e−π)] e−(θti −θt−1 )² + (1 − T2/(1 − e−π))) lˆtu,  i = 1, 2, · · · , N + MU,  (9)
where T2 is a constant that depends inversely on (θt−1 , lˆtu ) and (θt−1 − π, klˆtu ), and k is a positive
constant that is less than 1.
It has been assumed that, in the θti direction, the probability of the target appearing at the relative distance li is the largest and the probability at an infinite relative distance is the smallest. Thus, the probability of the candidate sample at (θti, li) being selected as the target is large, and the weight of this candidate is defined as pmaxl, whereas the probability of the candidate sample at (θti, ∞) being selected as the target is small, and the weight of this candidate is defined as pminl. According to this distribution feature, in the θti direction, the motion distance can be assumed to obey a Gaussian probability model with li
mean error. Thus, the weight assignment model is transformed according to the trend of the distance
Gaussian probability model, by which the distance score of the ith candidate sample is computed as
follows:
Sli = [T3/(1 − e−lˆtu)] e−(lti −li )² + (1 − T3/(1 − e−lˆtu)),  i = 1, 2, · · · , N + MU,  (10)
where lti represents the relative distance of the ith candidate sample; li is the corresponding distance with the largest probability in the θti direction, which is computed by Equation (9); and T3 is a constant that depends inversely on (li, pmaxl) and (∞, pminl).
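A small sketch of the position factor of Equations (7)-(10) follows. The constants T1, T2 and T3 are illustrative assumptions (the paper ties them to the desired maximum and minimum weights), and the angular difference is wrapped to [−π, π] so that the direction opposite θt−1 receives the smallest weight.

```python
import numpy as np

def position_factor(theta_i, l_i, theta_prev, l_hat_u, T1=0.6, T2=0.6, T3=0.6):
    """Position factor F_i of one candidate, following Eqs. (7)-(10) (sketch).

    theta_i, l_i : relative direction and distance of the candidate.
    theta_prev   : target motion direction at frame t-1.
    l_hat_u      : predicted relative distance in this direction, Eq. (2).
    """
    # Wrapped angular difference between the candidate direction and theta_prev.
    dtheta = np.angle(np.exp(1j * (theta_i - theta_prev)))

    # Direction score, Eq. (8): largest along theta_prev, smallest opposite to it.
    c1 = T1 / (1.0 - np.exp(-np.pi))
    S_A = c1 * np.exp(-dtheta ** 2) + (1.0 - c1)

    # Most probable distance in this direction, Eq. (9).
    c2 = T2 / (1.0 - np.exp(-np.pi))
    l_star = (c2 * np.exp(-dtheta ** 2) + (1.0 - c2)) * l_hat_u

    # Distance score, Eq. (10): peaks at l_star and decays for distances far from it.
    c3 = T3 / (1.0 - np.exp(-l_hat_u))
    S_l = c3 * np.exp(-(l_i - l_star) ** 2) + (1.0 - c3)

    return S_A * S_l  # Eq. (7)
```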
3.2.2. Local Representation with Occlusion Factor
In the local representation, the candidate samples are divided into patches and are sparsely
represented by the local features, from which the local information of candidate samples is obtained.
The local information of candidates, obtained from the sparse representation, is used for adapting to
occlusion and other situations that may cause local appearance changes. Thus, in this paper, based on
the sparse generative model (SGM) [2], the occlusion information of each patch is defined according
to the reconstruction error with the local sparse representation. Furthermore, to adapt to occlusions,
the occlusion factor is defined to rectify the similarity computed by SGM. Thus, the local response L of
each candidate is represented as follows:
L = Lc βc,  (11)
where Lc represents the similarity and β c represents the occlusion factor.
1. Similarity function
First, the image is normalized to 32 × 32 pixels, from which M patches are derived by overlapping
sliding windows. With the sparse representation, the patches are represented by a dictionary generated
from the first frame by K-Means [32] to compute the reconstruction error. The patch with large
reconstruction error is regarded as an occluding part and the corresponding element of the occlusion
indicator o is set to zero, as follows:
om = 1 if εm < ε0, and om = 0 otherwise,  m = 1, 2, · · · , M,  (12)
where ε m denotes the reconstruction error of mth patch and ε 0 is a constant used to determine whether
the patch is occluded or not.
Moreover, a weighted histogram is generated by concatenating the sparse coefficient with the
indicator o. The similarity Lc of each candidate is computed on the basis of weighted histograms
between the candidate and the template.
2. Occlusion factor
To further adapt to occlusion in the appearance model, we propose the occlusion factor to rectify
the similarity function. The number of the patches whose reconstruction errors are larger than ε 0 is
computed statistically by:
Nc = ∑_{m=1}^{M} om,  (13)
where a larger value of Nc represents that there is more information on the background or the occlusion.
The occlusion factor is defined as follows:
βc = σ e−(1−ωc),  (14)
where σ is a constant and ωc = Nc /M is the statistical average of Nc . According to Equation (14),
there is a positive correlation between the occlusion factor and ωc . When occlusion occurs, the value
of the occlusion factor increases with the increase of ωc .
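The following sketch illustrates the occlusion handling of Equations (12)-(14) and the rectified local response of Equation (11), taking the per-patch reconstruction errors and the SGM similarity Lc as inputs; eps0 and sigma are illustrative constants, not the values used in the paper.

```python
import numpy as np

def local_response(recon_errors, L_c, eps0=0.04, sigma=1.0):
    """Local response L = Lc * beta_c with the occlusion factor (sketch).

    recon_errors : per-patch reconstruction errors from the sparse representation.
    L_c          : similarity computed from the weighted histograms (SGM).
    eps0, sigma  : illustrative constants for Eqs. (12) and (14).
    """
    recon_errors = np.asarray(recon_errors)
    o = (recon_errors < eps0).astype(float)    # occlusion indicator, Eq. (12)
    N_c = o.sum()                              # Eq. (13)
    omega_c = N_c / o.size                     # statistical average of N_c
    beta_c = sigma * np.exp(-(1.0 - omega_c))  # occlusion factor, Eq. (14)
    return L_c * beta_c                        # Eq. (11)
```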
3.2.3. Holistic Representation Based on Sparse Discriminative Classifier
In the holistic representation, the foreground target is distinguished from the background by
sparse discriminative classifier (SDC) [2]. Initially, np positive templates are drawn around the target
location and nn negative templates are drawn further away from the target location. In addition, the label +1 is assigned to the positive templates and the label −1 to the negative templates. In order
to reduce the redundancy of the grayscale template sets, a feature selection scheme is proposed in
SDC. First, the labels are represented sparsely by the positive and negative template sets and a sparse
vector s is generated by the sparse coefficient vector. Then the sparse vector s is constructed into a
diagonal matrix S’ and a projection matrix S is obtained by removing all-zero rows from S’. Finally,
both the templates and candidates are reconstructed into the projected form by combining them with
the projection matrix S. With the sparse representation, the projected candidates are represented by
projected template sets, and the reconstruction error of the projected candidate is used to compute the
confidence value Hc of each candidate.
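To make the feature-selection step more tangible, here is a simplified sketch in the spirit of the SDC model: the l1 (sparse) solver is replaced by plain least squares followed by thresholding, purely as a stand-in to show how the projection matrix is formed and how a confidence value can be derived from the positive and negative reconstruction errors. It is not the SCM implementation.

```python
import numpy as np

def sdc_confidence(candidate, pos_templates, neg_templates, thr=1e-3):
    """Simplified holistic confidence inspired by the SDC feature selection (sketch).

    candidate     : feature vector of one candidate, shape (d,).
    pos_templates : (d, n_p) positive template matrix; neg_templates: (d, n_n).
    Least squares + thresholding approximates the sparse coding steps.
    """
    A = np.hstack([pos_templates, neg_templates])          # template dictionary
    labels = np.hstack([np.ones(pos_templates.shape[1]),
                        -np.ones(neg_templates.shape[1])])
    # "Sparse" vector s: which feature dimensions help separate the labels.
    s, *_ = np.linalg.lstsq(A.T, labels, rcond=None)        # labels ~ A^T s
    keep = np.abs(s) > thr                                   # drop near-zero rows (projection)
    if not keep.any():
        keep = np.ones_like(keep)
    A_p, c_p = A[keep, :], candidate[keep]
    # Reconstruct the projected candidate with positive and negative templates.
    coef, *_ = np.linalg.lstsq(A_p, c_p, rcond=None)
    n_p = pos_templates.shape[1]
    err_pos = np.linalg.norm(c_p - A_p[:, :n_p] @ coef[:n_p])
    err_neg = np.linalg.norm(c_p - A_p[:, n_p:] @ coef[n_p:])
    return np.exp(err_neg - err_pos)   # larger when the candidate resembles the positives
```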
Sensors 2018, 18, 572
11 of 27
3.2.4. Joint Decision by Combining Position Factors with Local and Holistic Responses
By combining the position factor with the local and holistic responses, the likelihood function of
the ith candidate sample is computed as follows:
pi = Fi Li Hci,  i = 1, 2, · · · , N + MU,  (15)
For each candidate, the position factor is large if it is located in the position where the target
appears with the largest probability. Moreover, the local and holistic responses are large if the candidate
is similar to the target. Thus, the candidate with the largest likelihood, calculated by Equation (15),
will be possibly identified as the target. The tracking result Xt in the current frame is chosen from the
candidate samples by:
Xt = Xtj,  j = argmax_{i = 1, 2, · · · , N + MU} pi,  (16)
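Put together, Equations (15) and (16) amount to a weighted argmax over the candidate set; a minimal sketch, assuming the per-candidate scores have already been computed:

```python
import numpy as np

def select_target(position_factors, local_responses, holistic_confidences, candidates):
    """Joint decision of Eqs. (15)-(16): pick the candidate with the largest likelihood."""
    p = (np.asarray(position_factors)
         * np.asarray(local_responses)
         * np.asarray(holistic_confidences))   # Eq. (15)
    j = int(np.argmax(p))                      # Eq. (16)
    return candidates[j], p[j]
```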
3.3. Adaptive Template Updating Scheme Based on Target Velocity Prediction
It is necessary to update the templates and the dictionary to capture the appearance changes of
the target to ensure the robustness and effectiveness of the appearance model. In SCM [2], the negative
templates are updated in the SDC model every several frames from the image regions away from the
current tracking result, but the positive templates remain unchanged. In the SGM model, instead of updating the
dictionary, the template histogram is updated by taking the most recent tracking result into account.
However, the fixed positive templates in SDC model cannot match the true target effectively when
the target’s appearance changes drastically, which will inevitably occur in the presence of fast motion.
To adapt to the changes of the target’s appearance, the positive templates also need to be updated to
capture the appearance changes of the target. In view of this, an adaptive template updating scheme
based on the prediction of the target velocity is proposed.
The positive template set is divided into dynamic and static template sets, as follows:
np = {pd, ps},  (17)
where pd represents the dynamic template set and ps represents the static template sets. For every
several frames, we determine whether the target has fast motion by comparing the relative motion
distance of the target with a threshold of p pixels. The variable p denotes the fast motion measure of
the tracking process. When the relative motion distance lt is larger than p, we deem that the target
moves fast in the current frame, hence the target in the next frame will move with a large velocity
according to motion consistency. Thus, the target may have appearance change at the same time. Next,
the dynamic positive template set is updated to account for appearance changes of the target. To keep
the initial information of the target and avoid incorrect representation in case of the recovery of target’s
appearance, the static positive template set is retained, as follows:
np = {p∗d, ps} if lt ≥ p, and np = {pd, ps} otherwise,  (18)
where p∗d represents the updated dynamic template set, which is drawn around the target location in
the current frame. By combining the updated dynamic set and static template set, the target can be
well matched with the positive template set. Besides, the negative templates are also updated in the
SDC model every several frames in order to distinguish the target from the complex and changeable
background in the tracking process.
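The update rule of Equations (17) and (18) can be summarized in the sketch below, where draw_templates and draw_negative_templates are hypothetical helpers that crop template sets around (or away from) a given location, and p is the fast-motion threshold in pixels.

```python
def update_templates(pos_dynamic, pos_static, neg_templates,
                     frame, tracked_state, l_t, p,
                     draw_templates, draw_negative_templates):
    """Adaptive template update (Section 3.3, Eqs. (17)-(18)) - illustrative sketch.

    pos_dynamic / pos_static : lists forming the dynamic and static positive sets.
    l_t : relative motion distance of the current tracking result.
    p   : pixel threshold that flags fast motion.
    """
    if l_t >= p:
        # Fast motion: refresh the dynamic positive set around the new location,
        # keep the static set to preserve the target's primitive appearance.
        pos_dynamic = draw_templates(frame, tracked_state)
    # Negative templates are refreshed regularly to follow background changes.
    neg_templates = draw_negative_templates(frame, tracked_state)
    return {"positive": pos_dynamic + pos_static, "negative": neg_templates}
```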
4. Experiments
To evaluate the performance of our proposed MCT algorithm, we test our tracker on the CVPR2013
(OTB2013) dataset, which has been widely used in evaluation [33]. OTB contains 50 video sequences,
source codes for 29 trackers and their experimental results on the 50 video sequences. We compared our
tracker with eight state-of-the-art tracking algorithms: the compressive tracking (CT) [23], distribution
fields for tracking (DFT) [31], online robust image alignment (ORIA) [20], MTT [8], visual tracking via
adaptive structural local sparse appearance model (ASLA) [18], L1APG [12], SCM [2], and MIL [22]
trackers. The basic information of these trackers is listed in Table 1. As described in related work,
most algorithms ignore the temporal correlation of the target states. MTT, ASLA and SCM obtain
candidates by Gaussian distribution model, MIL and CT obtain candidates by dense sampling, DFT
and ORIA obtain candidates by local optimum search. Only L1APG considers the target motion speed in the previous frame and obtains candidates by the constant velocity model. Besides, the competing algorithms cover all the types of appearance model, but all of the appearance models are designed
from the perspective of the target’s appearance feature. We evaluate the CT, DFT, ORIA, MTT, ASLA,
L1APG, SCM, and MIL trackers by using the available source codes. We conduct experiments on all
50 video sequences and perform a concrete analysis on eight specific video sequences with different
problems. All the problems we are concerned with are covered in these video sequences; thus, the effectiveness of
the algorithm can be illustrated. The basic information of the eight test video sequences and primary
problems are shown in Table 2 [33].
Table 1. Basic information of the eight state-of-the-art tracking algorithms.
Algorithm | State Transition Model | Appearance Model
L1APG | Constant velocity model | Holistic representation
MTT | Gaussian distribution model | Holistic representation
ASLA | Gaussian distribution model | Local representation
SCM | Gaussian distribution model | Holistic & Local representation
MIL | Dense sampling search | Holistic representation
CT | Dense sampling search | Holistic representation
DFT | Local optimum search | Holistic representation
ORIA | Local optimum search | Holistic representation
Table 2. Basic information of the eight test videos.
Video | Image Size | Target Size | Main Confronted Problems
Coke | 640 × 480 | 48 × 80 | Fast motion
Boy | 640 × 480 | 35 × 42 | Fast motion
Football | 624 × 352 | 39 × 50 | Occlusion
Jogging-1 | 352 × 288 | 25 × 101 | Occlusion
Girl | 128 × 96 | 31 × 45 | Scale variation and rotation
Skating1 | 640 × 360 | 34 × 84 | Scale variation and rotation
Basketball | 576 × 432 | 34 × 81 | Deformation and illumination variation
Sylvester | 320 × 240 | 51 × 61 | Illumination variation
In the experiment, for each test video sequence, all information is unknown except for the initial
target position of the first frame. The number of candidate samples is 600 including 250 basic candidate
samples and 350 dynamic candidate samples. The number of motion directions we predicted is seven.
In each direction, a total of 50 candidate samples are obtained. In the update, five positive templates are updated every five frames from image regions more than 20 pixels away from the current tracking result.
4.1. Quantitative Evaluation
4.1.1. Evaluation Metric
We choose three popular evaluation metrics: center location error, tracking precision and
success rate.
The center location error (CLE) is defined as the Euclidean distance between the center position
of the tracking result and the ground truth. Lower average center location error indicates better
performance of the tracker. CLE is calculated by:
CLE = (1/N) ∑_{i=1}^{N} dis(Cti, Cgti),  (19)
where N is the total number of frames and dis(·, ·) represents the distance between the center position Cti of the tracking result and the ground truth Cgti in the ith frame.
The tracking precision is defined as the ratio of the number of frames for which the error between
the target center point calculated by the tracking algorithm and the groundtruth is smaller than the
threshold th to the total number of frames, as follows:
precision(th) = (1/N) ∑_{i=1}^{N} F(dis(Cti, Cgti) < th),  (20)
where F (·) is a Boolean function, the output of which is 1 when the expression inside brackets is true,
otherwise the output is 0, and the threshold th ∈ [0, 100]. The tracking precision depends on the choice
of th. The threshold th is set to 20 to compute the tracking precision in the following experiment.
The success rate is defined as the ratio of the number of frames for which the overlap rate is larger
than λ to the total number of frames, as follows:
success(λ) = (1/N) ∑_{i=1}^{N} F(overlapi ≥ λ),  (21)
where the overlap rate is defined as overlap = |τgt ∩ τt| / |τgt ∪ τt|, the threshold λ ∈ [0, 1] is fixed,
τgt represents the region inside the true tracking box generated by ground truth, τt represents the
region inside the bounding box tracked by the tracker, |·| denotes the pixels in the region, and ∩ and ∪
denote the intersection and union of two regions, respectively. In the following experiment, instead of
using the success rate value for a specific threshold λ, the area under the curve (AUC) is utilized to
evaluate the algorithms.
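All three metrics can be computed directly from the per-frame tracking output; a short sketch follows, with bounding boxes as (x, y, w, h) and the AUC of the success curve approximated by averaging over a grid of overlap thresholds.

```python
import numpy as np

def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ca = np.array([box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0])
    cb = np.array([box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0])
    return np.linalg.norm(ca - cb)

def overlap(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def evaluate(results, ground_truth, th=20, thresholds=np.linspace(0, 1, 21)):
    """CLE (Eq. 19), precision at th (Eq. 20) and success-rate AUC (Eq. 21)."""
    dists = np.array([center_error(r, g) for r, g in zip(results, ground_truth)])
    ious = np.array([overlap(r, g) for r, g in zip(results, ground_truth)])
    cle = dists.mean()
    precision = (dists < th).mean()
    auc = np.mean([(ious >= lam).mean() for lam in thresholds])
    return cle, precision, auc
```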
4.1.2. Experiment on CVPR2013 Benchmark
To evaluate the performance of our algorithm, we conduct an experiment to test the MCT tracker
on the OTB dataset. OTB contains 50 videos and 29 algorithms, along with their experimental results on
the 50 video sequences. We compared our tracker with eight trackers in terms of overall performances
and we especially focus on the performance in the presence of fast motion (FM) and occlusion (OCC).
1. The comparison results for fast motion and occlusion
Fast motion (FM) and occlusion (OCC) are the two scenarios that we examine. The results of
precision and success rate for FM and OCC are shown as Figure 4. Our algorithm has the highest
precision and success rate compared with others. In the presence of FM, the tracking precision of
MCT is 0.433, which is 10% higher than that of the SCM algorithm. The success rate is 0.384, which is
improved by 8.8% compared with the SCM algorithm. According to the motion consistency, the target
state is estimated and merged into the state transition model, appearance model and updating scheme
to promote the accuracy in the presence of fast motion. In the presence of OCC, the tracking precision
of MCT is 0.655, which is 6% higher than that of the SCM algorithm. The success rate is 0.539, which is
5.2% higher than that of the SCM algorithm. The performance of MCT is still the best in the presence
of OCC. This finding is attributable to the occlusion factor, which is proposed to rectify holistic and
local responses to account for OCC.
(a)
(b)
(a)
(b)
(c)
(d)
Figure 4. The precision(c)
plot (a) and the success rate plot (b) of the tracking
results on OTB for FM,
(d)
Figure 4. The precision plot
(a) and the success rate plot (b) of the tracking
results on OTB for FM,
and the precision plot (c) and the success rate plot (d) of the tracking results on OTB for OCC.
and the
precision
plot
(c)
and
the
success
rate
plot
(d)
of
the
tracking
results
on
OTB
Figure 4. The precision plot (a) and the success rate plot (b) of the tracking results
on for
OTBOCC.
for FM,
2.
and the
precision plot
(c) and
success
rate plot (d) of the tracking results on OTB for OCC.
The
comparison
results
for the
overall
performance
2.
The
results of precision
and
success
rate for overall performance on 50 video sequences are
The comparison
results for
overall
performance
2. The comparison results for overall performance
The
results
of precision
and success
ratealgorithms,
for overallthe
performance
onthe
50 video
areare
shown
shown
as Figure
5. Compared
with other
precision and
successsequences
rate of MCT
The
results
of
precision
and
success
rate
for
overall
performance
on
50
video
sequences
are
as Figure
5.
Compared
with
other
algorithms,
the
precision
and
the
success
rate
of
MCT
are
highest.
highest. The precision of MCT is 0.673, which is 6.5% higher than that of SCM. The success rate is
shownwhich
asof
Figure
5.isCompared
with
thethat
precision
and
the success
success
are
The precision
MCT
0.673,
isother
6.5%algorithms,
higher
than
of SCM.
The
rate
is MCT
0.546,
which
0.546,
is improved
bywhich
4.7%
compared
with the
SCM
algorithm.
Compared
withrate
theof
other
eight
highest.
The
precision
of
MCT
is
0.673,
which
is
6.5%
higher
than
that
of
SCM.
The
success
rate
is
trackers,by
MCT
has
the best performance.
is improved
4.7%
compared
with the SCM algorithm. Compared with the other eight trackers, MCT
0.546, which is improved by 4.7% compared with the SCM algorithm. Compared with the other eight
has the best performance.
trackers, MCT has the best performance.
(a)
(b)
Figure 5. The precision(a)
plot (a) and the success rate plot (b) of the tracking results
(b) on OTB for overall
performances.
Figure 5. The precision plot (a) and the success rate plot (b) of the tracking results on OTB for overall
Figure 5. The precision plot (a) and the success rate plot (b) of the tracking results on OTB for
performances.
overall performances.
Sensors 2018, 18, 572
15 of 27
4.1.3. Quantitative Evaluation on Eight Specific Video Sequences
Sensors 2018, 18, x FOR PEER REVIEW
15 of 27
The following experiment is conducted on eight specific video sequences, namely, Basketball, Boy, Football, Jogging-1, Girl, Skating1, Coke, and Sylvester, and the performances of the algorithms are evaluated on the metrics of center location error, tracking precision and success rate.

1. Center location error comparison

The comparison of the center location errors is shown in Figure 6. For Basketball, the CLE of ORIA increases at approximately the 50th frame, and those of CT, MTT, L1APG and MIL increase at approximately the 200th frame. The CLEs of ASLA and SCM increase at approximately the 500th frame. DFT performs similarly to MCT before the 650th frame, but after the 650th frame, DFT has a sharply increasing CLE. Almost all the methods fluctuate violently with a high CLE because the basketball player moves in a random manner with deformation and illumination variation. The CLE of MCT fluctuates between the 80th frame and the 200th frame, but it recovers soon after with a low CLE. For Boy, the CLE of ORIA increases at approximately the 100th frame because the motion blur begins. The CLEs of DFT, ASLA and SCM increase at approximately the 260th frame and cannot resume tracking again because the boy moves fast with deformation, motion blur and scale variation. MCT performs better with smaller fluctuations compared with CT, MTT, L1APG and MIL. For Football, almost all the algorithms have a dramatic fluctuation at approximately the 170th frame because of deformation, but the CLE of MCT is the lowest. For Jogging-1, only MCT is stable with the lowest CLE. For Coke, partial and full occlusions appear from approximately the 30th frame, and the accuracies of the algorithms decrease, except for MCT. For Skating1, Girl and Sylvester, MCT maintains a stable performance with the lowest CLE.
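The center location error used in these comparisons is simply the Euclidean distance between the centers of the tracked box and the ground-truth box in each frame. The following minimal Python sketch (our illustration, not the authors' evaluation code; the [x, y, w, h] box format is an assumption) computes the per-frame CLE curves of the kind plotted in Figure 6.

```python
import numpy as np

def center_location_error(pred_boxes, gt_boxes):
    """Per-frame Euclidean distance between predicted and ground-truth
    box centers. Boxes are rows of [x, y, w, h] in pixels."""
    pred = np.asarray(pred_boxes, dtype=float)
    gt = np.asarray(gt_boxes, dtype=float)
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)

# Two made-up frames: CLE is 0 for a perfect prediction and grows
# with the offset between the box centers.
cle = center_location_error([[10, 10, 40, 80], [30, 12, 40, 80]],
                            [[10, 10, 40, 80], [18, 12, 40, 80]])
print(cle)          # [ 0. 12.]
print(cle.mean())   # the average CLE of the kind reported in Table 3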
Figure 6. Center location error comparison of the algorithms.
Table 3 lists the average center location errors of the algorithms. Red indicates that the algorithm has the lowest CLE with the best performance, followed by blue and green. We can see that the MCT algorithm has the lowest CLE on seven of these tests and the lowest average CLE over all tests. To sum up, by the CLE comparison with the other algorithms, we can determine that our proposed algorithm, MCT, outperforms the others under different challenges.
Table 3. Average center location error of the algorithms.

Video        CT       DFT      ORIA     MTT      ASLA     L1APG    SCM      MIL      MCT
Basketball   89.11    18.03    152.09   106.80   82.63    137.53   52.90    91.92    12.79
Boy          9.03     106.31   132.90   12.77    106.07   7.03     51.02    12.83    2.10
Football     11.91    9.29     13.61    13.67    15.00    15.11    16.30    12.09    8.86
Jogging-1    92.49    31.44    50.43    108.03   104.58   89.52    132.83   96.34    5.38
Skating1     150.44   174.24   69.62    293.34   59.86    158.70   16.38    139.38   15.38
Girl         18.85    23.98    27.65    4.29     3.28     2.80     2.60     13.67    1.42
Coke         40.49    70.70    49.96    29.98    60.17    50.45    56.81    46.72    19.44
Sylvester    8.556    44.883   5.683    7.554    15.227   26.244   7.968    15.504   8.030
Average      52.610   59.859   62.743   72.054   55.852   60.923   42.101   53.557   9.175

Red is the best, blue is the second, green is the third.
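As a quick arithmetic check on Table 3 (a simple observation, not part of the original evaluation), each entry of the Average row is the mean of the eight per-sequence values in its column; for example, for the MCT column:

```python
import numpy as np

# MCT column of Table 3 (Basketball ... Sylvester); the Average row is the
# per-column mean over the eight sequences.
mct_cle = [12.79, 2.10, 8.86, 5.38, 15.38, 1.42, 19.44, 8.030]
print(round(np.mean(mct_cle), 3))   # 9.175, matching the table
```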
2. Tracking precision comparison

Figure 7 shows the precision results of each algorithm for various values of the threshold th within [0, 100]. MCT is the first algorithm to reach 100% precision, except on Coke and Sylvester. However, for Coke, the precision of MCT is the highest in the range of [0, 80]. For Boy, Jogging-1, Girl, Football and Basketball, MCT is stable and is still the first algorithm to reach 100% precision. In the range of approximately [0, 15], the precision of MCT is lower than those of DFT and SCM on Basketball and lower than those of DFT, CT, MTT, ORIA and L1APG on Football, while the other algorithms suffer from drifting and struggle to reach 100% precision. In particular, for Football, when the threshold is larger than 18, MCT is the first to reach 100% precision, rising at a high rate and then increasing steadily. For Skating1, in the range of approximately [0, 20], MCT has a performance similar to those of ASLA and SCM, but the precision of MCT is higher than those of the other algorithms. However, when the threshold is larger than 20, the precision of MCT is the highest. Table 4 lists the average precisions of the eight comparison algorithms and our algorithm MCT for a threshold of 20. The average precision of MCT is higher than those of the other eight algorithms on most of the test sequences. Thus, MCT outperforms the other comparison algorithms.
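The precision at a threshold th is the fraction of frames whose center location error does not exceed th; Figure 7 sweeps th over [0, 100] and Table 4 reports the value at th = 20. A minimal sketch, assuming a per-frame CLE array such as the one computed above (this is our illustration, not the benchmark's official script):

```python
import numpy as np

def precision_curve(cle, thresholds=np.arange(0, 101)):
    """Fraction of frames with center location error <= each threshold."""
    cle = np.asarray(cle, dtype=float)
    return np.array([(cle <= th).mean() for th in thresholds])

cle = np.array([3.0, 7.5, 12.0, 25.0, 60.0])   # hypothetical per-frame CLEs
curve = precision_curve(cle)
print(curve[20])   # precision at the 20-pixel threshold -> 0.6
```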
Figure 7. Precision comparison of the algorithms.
Table 4. Average precision of the algorithms.

Video        CT      DFT     ORIA    MTT     ASLA    L1APG   SCM     MIL     MCT
Basketball   0.453   0.847   0.087   0.336   0.582   0.329   0.640   0.398   0.869
Boy          0.906   0.477   0.285   0.868   0.440   0.926   0.512   0.868   0.974
Football     0.877   0.903   0.860   0.860   0.846   0.845   0.834   0.875   0.907
Jogging-1    0.244   0.694   0.516   0.249   0.253   0.266   0.256   0.253   0.942
Skating1     0.231   0.189   0.529   0.160   0.717   0.217   0.833   0.271   0.843
Girl         0.808   0.757   0.722   0.953   0.963   0.968   0.969   0.960   0.981
Coke         0.594   0.313   0.503   0.703   0.404   0.510   0.528   0.532   0.819
Sylvester    0.910   0.610   0.939   0.920   0.844   0.735   0.916   0.844   0.915
Average      0.628   0.599   0.555   0.631   0.631   0.560   0.686   0.625   0.906

Red is the best, blue is the second, green is the third.
3. Success rate comparison
Figure 8 shows the success rates of each algorithm at different values of the threshold λ within [0, 1]. The area under the curve (AUC) of Figure 8 is utilized to evaluate the algorithms. For Boy, Jogging-1, Skating1, Coke and Girl, the AUC of MCT is the highest compared with the other algorithms. Especially for Boy, Jogging-1 and Girl, the success rate of MCT decreases stably at a low rate. For Basketball, the performance of MCT is similar to that of the DFT algorithm, but the performance of MCT is better than that of DFT at thresholds lower than approximately 0.3. For Sylvester, the success rate of MCT is the highest at thresholds in [0, 0.3] and is close to 1, and the performance of MCT is similar to those of CT, SCM and ORIA.

Table 5 lists the average success rate (AUC) of the eight comparison algorithms and MCT. The MCT algorithm achieves the highest average success rate on Boy, Jogging-1, Skating1, Coke and Girl. According to the comparison results, MCT outperforms the other algorithms on the metric of success rate.

Figure 8. Success rate comparison of the algorithms.
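The success rate at a threshold λ counts the frames whose bounding-box overlap (intersection over union) with the ground truth exceeds λ, and the AUC reported in Table 5 is the mean of this curve over λ in [0, 1]. The sketch below is a small illustration of this metric under those assumptions, not the OTB toolkit itself:

```python
import numpy as np

def overlap_ratio(pred, gt):
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    ix = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    iy = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    inter = ix * iy
    union = pw * ph + gw * gh - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Success curve over overlap thresholds and its area under the curve."""
    ious = np.array([overlap_ratio(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    curve = np.array([(ious > t).mean() for t in thresholds])
    return curve, curve.mean()

# Hypothetical three-frame example: perfect, partial and failed overlap.
curve, auc = success_auc([[0, 0, 10, 10], [5, 5, 10, 10], [40, 40, 10, 10]],
                         [[0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 10, 10]])
print(auc)
```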
Table 5. Average success rate of the algorithms.

Video        CT      DFT     ORIA    MTT     ASLA    L1APG   SCM     MIL     MCT
Basketball   0.278   0.604   0.072   0.223   0.398   0.256   0.475   0.247   0.526
Boy          0.587   0.418   0.189   0.502   0.389   0.725   0.397   0.491   0.818
Football     0.607   0.651   0.513   0.575   0.533   0.553   0.489   0.585   0.580
Jogging-1    0.212   0.218   0.221   0.211   0.213   0.208   0.212   0.214   0.717
Skating1     0.123   0.172   0.242   0.144   0.501   0.143   0.471   0.161   0.487
Girl         0.318   0.293   0.263   0.657   0.701   0.719   0.673   0.403   0.880
Coke         0.241   0.142   0.215   0.449   0.192   0.203   0.342   0.224   0.539
Sylvester    0.658   0.391   0.642   0.641   0.590   0.413   0.677   0.528   0.659
Average      0.378   0.361   0.295   0.425   0.440   0.403   0.467   0.357   0.650

Red is the best, blue is the second, green is the third.
4.2. Qualitative Evaluation

The qualitative analysis results of the algorithms on eight test videos are shown in Figures 9–16. All of the problems of concern are covered in these video sequences: fast motion, scale variation, partial or full occlusion, deformation and illumination variation. We analyze the performance of the algorithms on the eight specific video sequences, problem by problem.

1. Fast motion

As shown in Figure 9, the content of the video sequence Boy is a dancing boy. In the process of dancing, the target shakes randomly with fast motion, as shown from the 88th to the 386th frame. The boy's appearance blurs during the fast motion. The tracking box of ORIA cannot keep up with the fast motion of the target, and tracking drift occurs in the 108th frame. In the 160th frame, the target moves backwards, and the tracking box of ORIA loses the target completely. In the 293rd frame, the boy is squatting down, and the tracking boxes of SCM and ASLA cannot keep up with the variation of the target, which leads to tracking failure after the 293rd frame. Besides, the tracking boxes of MTT, L1APG and MIL have tracking drift in the 293rd frame. As tracking continues, CT, MTT, L1APG, SCM, MIL and the other algorithms also cannot keep up with the target motion. Only MCT is able to properly track the target.
Figure 9. Tracking results on Boy.
For Coke, a can held by a man moves fast within a small area against a complex background with dense bushes. Figure 10 shows the tracking results on Coke. In the 42nd frame, because of occlusion by plants, the DFT, L1APG, MIL and ASLA algorithms have tracking drifts to different extents, while MCT and MTT perform well. Then the target keeps moving fast from the 42nd to the 67th frame. The motion directions vary rapidly, which results in tracking drifts for the L1APG, MTT and CT algorithms in the 67th frame. The tracking boxes of DFT, ORIA and ASLA lose the target completely in the 67th frame, while the MCT and MIL algorithms can accurately track the target in the presence of fast motion. After the 194th frame, all algorithms except MCT fail to track the target. Throughout the entire process of fast motion, the MCT algorithm can stably track the target.

Figure 10. Tracking results on Coke.

Our algorithm, MCT, performs notably well on videos that involve fast motion. This is because the MCT algorithm takes motion consistency into account, so that targeted candidate samples can be calculated after predicting the target's possible motion state during fast motion, which can cover the region where the target may appear. In addition, the position factor is proposed to decrease the influence of secondary samples, and an adaptive strategy is utilized to address the change in the target's appearance brought about by fast motion. All of these factors guarantee highly stable and accurate tracking.

2. Partial and full occlusion

Two ladies are jogging in the video Jogging-1. As shown in Figure 11, partial and full occlusions are caused by the pole during the jogging from the 69th to the 80th frame. In the 69th frame, the DFT algorithm is the first to fail when partial occlusion by the pole occurs, and it turns to track the jogger with the white shirt. In the 80th frame, after the jogger we track reappears without partial or full occlusion, all the algorithms lose the target completely except MCT. In the following frames, while the other algorithms fail to track the target, MCT can stably track the target until the occlusion completely disappears in the 129th frame. The tracking box of DFT has tracking drift in the 221st frame. In the 297th frame, all the algorithms except MCT have no overlap with the true target.

The tracking results on Football are shown in Figure 12. For Football, occlusion occurs when football players compete in the game. The players are crowded in the presence of ball robbing, and partial or full occlusion occurs in this situation. In the 118th frame, the players are crowded in the same place and the target we track is in the middle of the crowd; partial occlusion then occurs at the head of the target, which causes confusion between different football players. As the competition goes on, the algorithms have a tracking drift in the 194th frame. In the 292nd frame, only MCT and DFT can accurately track the target under the full occlusion. The tracking boxes of L1APG, ASLA, MIL and the other algorithms lose the target and turn to track another player close to the target we track. When the occlusion disappears in the 309th frame, only MCT can still track the target.

In MCT, the occlusion factor is proposed in the local representation. When occlusion occurs, the background information within the target region increases, and the occlusion factor changes correspondingly. Partial and full occlusions can be rectified based on the occlusion factor in the appearance model.
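The exact formulation of the occlusion factor is given earlier in the paper. Purely as an illustration of the idea described above, the sketch below flags local patches whose reconstruction error is large (treating them as occluded by background) and uses the occluded fraction to discount the local similarity; the per-patch errors, the threshold tau and the discounting rule are assumptions made for this example only, not the authors' exact definitions.

```python
import numpy as np

def occlusion_factor(patch_errors, tau=0.4):
    """Fraction of local patches whose reconstruction error exceeds tau;
    a rough indicator of how much of the target is covered by background."""
    errors = np.asarray(patch_errors, dtype=float)
    return (errors > tau).mean()

def rectified_local_response(patch_similarities, patch_errors, tau=0.4):
    """Discount the local similarity of occluded patches so that partially
    or fully occluded candidates are not rewarded for matching background."""
    sims = np.asarray(patch_similarities, dtype=float)
    errors = np.asarray(patch_errors, dtype=float)
    occluded = errors > tau
    occ = occluded.mean()
    # Keep only the similarity of visible patches and scale by visibility.
    return (1.0 - occ) * sims[~occluded].mean() if occ < 1.0 else 0.0

sims = [0.9, 0.8, 0.85, 0.2, 0.1]   # hypothetical per-patch similarities
errs = [0.1, 0.2, 0.15, 0.7, 0.9]   # hypothetical reconstruction errors
print(occlusion_factor(errs))       # 0.4 -> two of five patches occluded
print(rectified_local_response(sims, errs))
```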
Figure 11. Tracking results on Jogging-1.

Figure 12. Tracking results on Football.
3. Deformation and illumination variation
For Basketball, the appearance of the target, the No. 9 basketball player, varies in the game because of his inconstant motion. Moreover, the illumination changes because reporters take photos, especially when players are shooting. The tracking results of the nine algorithms are shown in Figure 13. As we can see, when the player starts to run, almost all the tracking boxes have tracking drift in the 140th frame, and ORIA loses the target completely. When the deformation occurs in the 247th frame, ORIA, MTT, L1APG, CT and MIL fail to track the target; only DFT and MCT can track the target, with a slight tracking drift. In the 344th frame, the player turns around; MCT, SCM, DFT and ASLA succeed, while the other algorithms lose the true target. In the 510th frame, when the player moves towards a spectator, all the algorithms fail to track except MCT and DFT; however, DFT has tracking drift. When the illumination changes in the 650th frame, only MCT and DFT can track the target.

For Sylvester, a man shakes a toy under the light. Within the whole process, as shown in Figure 14, the illumination and the deformation of the toy change, and in-plane and out-of-plane rotations occur. The tracking boxes drift when the toy begins to move under the light in the 264th frame. In the 467th frame, the toy lowers its head; in this situation, the DFT algorithm almost loses the target. Then, in the 613th frame, when the toy rotates, the L1APG algorithm fails to track the target and the other algorithms have tracking drift. In the 956th frame, the toy is rotating under the light, and more algorithms, such as DFT and L1APG, lose the target. However, MCT performs well under the circumstances of deformation and illumination variation, and it especially shows robustness during the illumination decrease between the 264th and the 467th frame.
Figure 13. Tracking results on Basketball.

Figure 14. Tracking results on Sylvester.
First, MCT is based on the particle filtering framework, so that, to some extent, the diversity of candidate samples can resolve elastic deformation. Secondly, holistic representation and local representation are both utilized in the appearance model, which shows great robustness by enabling the holistic and local representations to capture the changes caused by the target's deformation and the illumination variation. Finally, when the models are updated, appropriately updating the positive and negative templates can also adapt to the changes of appearance and environment.
4. Scale variation and rotation
For Skating1, a female athlete is skating on ice. The athlete continuously slides, so her relative distance to the camera changes continuously; thus, the athlete has scale variation in the scene. Besides, she is dancing during the process, so the athlete also rotates in the scene. It can be seen in Figure 15 that, from the 24th frame to the 176th frame, the size of the target becomes smaller, and the SCM, ASLA and MCT algorithms can track the target, while CT, DFT, ORIA, L1APG and MIL lose the true target in the 176th frame. From the 176th to the 371st frame, the target moves to the front of the stage and becomes larger with rotation on the stage, and almost all the algorithms fail to track except MCT and SCM, but SCM has tracking drift. In the 389th frame, SCM loses the athlete, and only MCT can track the target.
Figure 15. Tracking results on Skating1.
For Girl, a woman sitting on a chair varies her position and facing direction by rotating the chair. It can be seen in Figure 16 that the size of the target changes from the 11th to the 392nd frame with the rotation of the chair. Because of the chair's rotation, the size of the target becomes smaller in the 101st frame compared with that in the 11th frame. After that, the girl turns around and becomes larger in the 134th frame. In this process, all the algorithms have tracking drift, and ORIA loses the target completely from the 101st to the 487th frame. When the target turns around from the 101st to the 134th frame, the tracking boxes of ORIA, CT and MIL lose the target. Although the appearance of the target changes drastically from the 227th to the 487th frame, MCT performs well under this circumstance and shows robustness and accuracy compared with the other algorithms.

Based on the particle filter, six affine parameters, including parameters that indicate scaling and rotation, are used in MCT to represent the candidate samples. Thus, the diversity of scaling can be ensured in the prediction of candidate samples, which allows the tracker to adapt to scenes with scale variation.
Figure 16. Tracking results on Girl.
4.3. Discussion

In this part, we further illustrate the selection of related means in MCT and compare the tracking results and the computational complexity between MCT and SCM.
4.3.1. Discussion about the Number of Particles

Most of the tracking methods within the particle filter framework exploit the Gaussian distribution model to predict candidate samples. Our algorithm, MCT, predicts targeted candidate samples by estimating the target motion direction and distance; in addition, the position factor is defined to assign weights to candidate samples in different positions. To prove the superiority of the MCT algorithm in terms of resource saving, we show that the candidate samples predicted by our method achieve better performance with fewer particles than other algorithms. To do so, we conduct an experiment on MCT and SCM with 200, 400, and 600 particles. The test sequence consists of 17 videos with fast motion properties from the CVPR2013 benchmark datasets. The results are shown in Figure 17.
Figure 17. Comparison between MCT and SCM with different numbers of particles.
The success rate and tracking precision of MCT with 600 and 400 particles are higher than those of SCM with 600 particles; the success rate and tracking precision of MCT with 200 particles are higher than those of SCM with 400 and 200 particles. MCT can achieve better performance with fewer particles compared with SCM. In this paper, we predict the target state and incorporate the target state information into the algorithm. More candidate samples are obtained in the position where the target may appear with a larger probability, and weights are assigned to all of the candidates according to the position factor. Thus, the impact of invalid samples on the computation decreases and better performance can be achieved with fewer particles.
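As a rough illustration of this idea (the paper's state transition model works with affine parameters and its own disturbance model; the constant-velocity extrapolation and standard deviation below are assumptions made only for this sketch), candidate centers can be drawn around a position extrapolated from the two previous frames instead of around the last position alone:

```python
import numpy as np

def sample_candidates(prev_pos, curr_pos, n_particles=200,
                      sigma_pred=8.0, rng=None):
    """Draw candidate centers around a motion-consistent prediction.

    prev_pos, curr_pos : (x, y) target centers in the two previous frames.
    The predicted center extrapolates the last displacement, so fast but
    consistent motion is covered even with few particles.
    """
    rng = np.random.default_rng(rng)
    prev_pos = np.asarray(prev_pos, dtype=float)
    curr_pos = np.asarray(curr_pos, dtype=float)
    predicted = curr_pos + (curr_pos - prev_pos)   # constant-velocity guess
    return predicted + rng.normal(0.0, sigma_pred, size=(n_particles, 2))

candidates = sample_candidates(prev_pos=(100, 60), curr_pos=(130, 62),
                               n_particles=5, rng=0)
print(candidates)   # samples clustered near the extrapolated center (160, 64)
```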
4.3.2. Discussion of Position Factor

According to the motion consistency, the states of the target in successive frames are correlated. Therefore, the probability of the target appearing in each position is different, and the importance of candidate samples located in different positions also differs. Thus, a double Gaussian probability model based on the motion consistency in successive frames is proposed to simulate the target state, and based on it the position factor is proposed to assign weights to candidate samples, representing the differences between candidate samples at different locations. To demonstrate the significant impact of the position factor, an experiment is conducted with 600 particles on MCT and on MCT without the position factor (MCTW); the test sequence consists of 17 videos with fast motion properties from the CVPR2013 benchmark datasets. The results are shown in Figure 18. According to the experimental results, the precision of MCT is 6.8% higher than that of MCTW, and the success rate is 6.7% higher than that of MCTW. Thus, the position factor improves the tracking precision and success rate.
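The precise double Gaussian model is defined earlier in the paper; the sketch below only illustrates the weighting idea discussed here. It scores each candidate with a mixture of two Gaussians, one centred on the previous target position and one on the motion-predicted position, and normalises the scores into position-factor weights. The mixture weight alpha and the variances are assumptions made for this example, not the authors' parameters.

```python
import numpy as np

def position_factor(candidates, last_pos, predicted_pos,
                    sigma1=10.0, sigma2=10.0, alpha=0.5):
    """Weight candidates with a two-component Gaussian position model."""
    c = np.asarray(candidates, dtype=float)
    d1 = np.sum((c - np.asarray(last_pos, float)) ** 2, axis=1)
    d2 = np.sum((c - np.asarray(predicted_pos, float)) ** 2, axis=1)
    w = alpha * np.exp(-d1 / (2 * sigma1 ** 2)) \
        + (1 - alpha) * np.exp(-d2 / (2 * sigma2 ** 2))
    return w / w.sum()   # normalised position-factor weights

cands = [[130, 62], [160, 64], [200, 90]]   # hypothetical candidate centers
weights = position_factor(cands, last_pos=(130, 62), predicted_pos=(160, 64))
print(weights)   # candidates near either Gaussian centre receive larger weights
```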
Figure 18. The precision plot (a) and the success rate plot (b) on MCT and MCTW for FM.
4.3.3. Discussion of Computational Complexity

The computational complexity is an important evaluation factor, so we also evaluate the computational complexity of the MCT algorithm. Since both the MCT and SCM algorithms are based on the particle filter, we provide a complexity analysis between these two algorithms. The computational complexity is evaluated by the tracking speed: the higher the tracking speed, the lower the computational complexity of the algorithm. The experiments on computational complexity are all implemented on a computer in our lab with a 3.20 GHz CPU, 4.0 GB RAM and the Windows 10 operating system, and the results on the eight specific videos are shown in Table 6. The tracking speed of MCT is close to that of SCM. Simultaneously, the tracking precision of MCT achieves a 10% increment on fast motion compared with SCM, as shown in the results of the quantitative evaluation. This is because the target state prediction, the position factor and the target velocity prediction were developed in the state transition model, the appearance model and the updating scheme, respectively, compared with SCM.
Table 6. Tracking speed comparison of the algorithms.

Tracking Speed (FPS)   Coke     Sylvester   Skating1   Basketball   Jogging-1   Football   Boy      Girl     Average
SCM                    0.3138   0.3390      0.2780     0.3801       0.3759      0.4292     0.3512   0.3171   0.3480
MCT                    0.3089   0.2951      0.2557     0.3146       0.3345      0.3417     0.3498   0.3683   0.3211

(The target size decreases from Coke to Girl.)

In addition, the experimental results show that the computational complexity of the tracker is related to the target size. This is because, in the appearance model, the target is divided into patches, which are represented sparsely by the dictionary. In fact, the tracking speed is also influenced by other factors, such as the different update frequencies in videos with different scenes. Accordingly, the computational complexity of the tracker will decrease if the size of the target becomes smaller; meanwhile, the fluctuation will still exist.
5. Conclusions

In this paper, we propose an object tracking algorithm based on motion consistency. The target state is predicted based on the motion direction and distance and is later merged into the state transition model to predict additional candidate samples. The appearance model first exploits the position factor to assign weights to candidate samples to characterize the different importance of candidate samples in different positions. At the same time, the occlusion factor is utilized to rectify the similarity in the local representation and obtain the local response, and the candidates are represented sparsely by positive and negative templates in the holistic representation to obtain the holistic response. The tracking result is determined by the position factor with the local and holistic responses of each candidate. The adaptive template updating scheme based on target velocity prediction is proposed to adapt to appearance changes when the object moves quickly by updating the dynamic positive templates and negative templates continuously. According to qualitative and quantitative experimental results, the proposed MCT has good performance against the state-of-the-art tracking algorithms, especially in dealing with fast motion and occlusion, on several challenging video sequences.
Supplementary Materials: The following are available online at www.mdpi.com/1424-8220/18/2/572/s1, Video
S1: Figure 9-Boy, Video S2: Figure 10-Coke, Video S3: Figure 11-Jogging-1, Video S4: Figure 12-Football, Video S5:
Figure 13-Basketball, Video S6: Figure 14-Sylvester, Video S7: Figure 15-Skating1, Video S8: Figure 16-Girl.
Acknowledgments: The authors would like to thank the National Science Foundation of China: 61671365,
the National Science Foundation of China: 61701389, the Joint Foundation of Ministry of Education of China:
6141A020223, and the Natural Science Basic Research Plan in Shaanxi Province of China: 2017JM6018.
Author Contributions: In this work, Lijun He and Fan Li conceived the main idea, designed the main algorithms
and wrote the manuscript. Xiaoya Qiao analyzed the data, and performed the simulation experiments. Shuai Wen
analyzed the data, and provided important suggestions.
Conflicts of Interest: The authors declare no conflict of interest.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).