A Reliable Feature-based Framework for Vehicle Tracking in ADAS

2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
A Reliable Feature-Based Framework for Vehicle
Tracking in Advanced Driver Assistance Systems
Ngoc-Quan Ha-Phan (1), Thanh-Nguyen Truong (2), Vu-Hoang Tran (3), Ching-Chun Huang (4)
(1, 2, 3) Ho Chi Minh City University of Technology and Education, Vietnam
(1) E-mail: 19145008@student.hcmute.edu.vn
(2) E-mail: 19145158@student.hcmute.edu.vn
(3) E-mail: hoangtv@hcmute.edu.vn
(4) National Yang Ming Chiao Tung University, Taiwan
E-mail: chingchun@nycu.edu.tw
Abstract: Vehicle tracking has always been a vital aspect of
modern transportation systems. The topic has gained even more
interest with the introduction of Advanced Driver Assistance
Systems (ADAS) and Autonomous Vehicles. Most state-of-the-art
(SOTA) vehicle trackers, and their enhanced versions, rely on
mathematical motion models (e.g., the Kalman Filter) as their core
source of information. However, these models may produce
unreliable outputs, especially when objects exhibit complex motion
patterns. Hence, we propose a reliable feature-based tracking
framework that fully exploits distinct vehicle appearance, and we
conduct a comparative analysis with classic motion-based trackers.
Additionally, we revisit previously proposed track handling
strategies and incorporate a track management system designed
specifically for feature-based tracking. The proposed method
achieves the highest score on all selected multi-object-tracking
(MOT) evaluation metrics compared to the current SOTA methods on
the KITTI dataset. Notably, our approach produces markedly fewer
False Positive (FP) errors, which indicates that it limits
unreliable output.
I. INTRODUCTION
Multi-object tracking (MOT) has gained significant
popularity in the field of computer vision, finding applications
in various domains such as robotic perception, complex
medical imaging, and surveillance management. In the context
of Autonomous Vehicles and ADAS, object tracking plays a
critical role. By continuously monitoring the positions and
trajectories of vehicles, pedestrians, and other objects, the
tracking framework provides essential information for
autonomous systems to make decisions and take appropriate
actions. It is a vital component that contributes to the perceptual
aspects of autonomous vehicle systems, including navigation,
path planning, and even the advanced Vehicle-to-Infrastructure
(V2I) tasks related to traffic management.
Vehicle tracking has been dominated by the tracking-by-detection
(TBD) paradigm, which leverages the performance of modern object
detectors to enhance tracking accuracy. Specifically, TBD can be
seen as a data association approach that aims to match the
detections across frames in a sequence. Currently, two popular TBD
methods are being extensively researched and developed:
motion-based tracking and feature-based tracking.
Motion-based tracking has recently gained favor in the TBD field.
This approach relies heavily on mathematical motion models,
specifically the Kalman Filter, to calculate motion cues such as
velocity, direction, and acceleration. These cues are then used to
predict the object's location within a frame. Although its location
estimates are not always accurate, this method is widely used
because it is computationally efficient and robust to noise, such
as changes in lighting conditions or variations in object
appearance.
One example of a motion-based tracking method is SORT [1], which
uses a Kalman Filter to predict each object's state and adopts
Intersection over Union (IoU) and the Hungarian algorithm to
associate tracks across a video sequence. Although SORT performs
well in real-time scenarios, it struggles to track occluded
objects, leading to a relatively high number of identity switches.
To address the occlusion issue, DeepSORT [2] builds upon SORT and
takes object appearance into account. It integrates a pre-trained
Convolutional Neural Network (CNN) into the association process and
introduces a matching cascade mechanism that computes appearance
similarities before employing IoU and Hungarian matching. DeepSORT
produces relatively few identity switches, showing strong
performance in occlusion scenarios. Building on SORT and DeepSORT,
several approaches have been developed to overcome the remaining
limitations, such as improving detection quality [3], handling
non-linear motion [4], or addressing missing detection problems [5].
Fig. 1 System overview of proposed tracking framework
However, the heavy reliance on mathematical models
introduces a level of unreliability to the output of trackers in
ADAS. The assumptions made by these models often fail to
accurately capture all the complex and diverse motions
exhibited by on-road vehicles in real-world scenarios. For
instance, abnormal vehicle behaviors such as sudden lane changes,
hard braking, and rapid acceleration, together with variations in
camera motion, produce a wide range of vehicle motions.
Additionally, cameras in ADAS are not stationary but follow the
vehicle, which creates additional challenges for motion-based
tracking systems. Moreover, the appearance of on-road
vehicles remains relatively consistent in terms of size, shape,
color, and texture, and these characteristics do not undergo
significant changes during tracking. Therefore, feature-based
tracking can fully exploit these consistent appearance details
and leverage them for accurate tracking.
In light of this, we propose a fully feature-based multi-object
tracking framework that shifts the focus from mathematical models
to object appearance as the primary information for tracking. Our
approach builds upon a previously proposed tracker [6] that
analyzes rich vehicle features such as shapes, colors, or textures.
By utilizing these insights, our framework can effectively handle
complex real-world scenarios, including the varied motion patterns
of on-road objects and occlusions. To enhance the track management
strategy, we also revisit SORT and incorporate an adaptive
thresholding approach specifically designed for feature-based
tracking. This approach aims to reduce common errors associated
with feature-based vehicle tracking, such as those caused by
objects leaving the frame and by feature loss. Our contributions
are summarized as follows:
1. A novel Feature-based tracking framework: our
proposed tracking framework fully exploits object
appearance as tracking information, leading to reliable
and accurate information.
2. A Track Management System: we revisit SORT [1]
and design a track management system specialized for
feature-based tracking.
3. Adaptive Thresholding Approach: we propose an
adaptive thresholding approach to handle the problems
of feature-based trackers such as out-of-frame or feature
loss scenarios.
II. PROPOSED METHOD
In this section, we present a novel vehicle tracking
framework, as shown in Fig. 1, which incorporates an Object
Detector, a Feature-based Tracker, a Data Association
mechanism, and a Track Management strategy. Notably, we
introduce adaptive thresholding approaches specifically
designed for feature-based tracking methods. The following
subsections provide a comprehensive description of these
components, offering detailed insights into their methodologies.
A. Object Detector
Firstly, the accurate identification and precise localization of
tracking targets in image sequences are crucial for achieving
high-quality tracking performance. This object detection task
(OD) plays a pivotal role in determining the overall accuracy
and effectiveness of the tracking results. Furthermore, in
ADAS applications, it is essential to meet strict processing time
constraints. To meet these requirements, we evaluated several
single-stage OD models from the YOLO family, known for
their real-time capabilities, simplicity, and efficiency [7].
Among these models, YOLOv7 stood out due to its outstanding
accuracy on the Microsoft COCO dataset [8], as shown in
TABLE 1. Consequently, we chose YOLOv7 as our object detector to
provide accurate and efficient detection in real time, which in
turn supports reliable tracking in ADAS.
TABLE 1
COMPARISON OF BASELINE OBJECT DETECTORS [9]
Model                   #Param.↓  FLOPs↓   Size  AP^val↑  AP^val_50↑  AP^val_75↑
YOLOv4 [10]             64.4M     142.8G   640   49.7%    68.2%       54.3%
YOLOR-u5 (r6.1) [11]    46.5M     109.1G   640   50.2%    68.7%       54.6%
YOLOv4-CSP [12]         52.9M     120.4G   640   50.3%    68.6%       54.9%
YOLOR-CSP [11]          52.9M     120.4G   640   50.8%    69.5%       55.3%
YOLOv7 [9]              36.9M     104.7G   640   51.2%    69.7%       55.5%
B. Feature-based Tracker
The primary objective of the tracker is to estimate the
position of the object based on information from the previous
frame. In this paper, we utilize the idea of DiMP-50 Tracker
[6] to build our feature-based tracker. The tracker’s workflow
consists of two primary steps: Initialize and Track. In the
following parts of this subsection, we delve into the
methodology of these components, providing a detailed
overview of their functioning. Moreover, an additional aspect
of the tracker, which involves bounding box refinement using
IoUNet prediction, will be described.
Initialize: The Initialize step establishes an object identifier
for each object when it first appears, as shown in Fig. 1, based on
its characteristics. Fig. 2 illustrates the process. The input to
the component is an image containing the newly detected bounding
box of an object $o_i$, obtained from the object detector. Data
augmentation methods such as translation, rotation, blur, and
dropout are then applied to create a set of augmented image samples
that accounts for potential variations in object appearance. A
feature extractor, here ResNet-50, then produces a set of deep
feature maps $x_j$ from the augmented image samples. This set is
combined with the associated bounding box centroids $c_j$ to create
a set of training samples, denoted as
$S_{train} = \{(x_j, c_j)\}_{j=1}^{n}$. The training samples are
then used to train the online object identifier $f_i$, which
consists of a single convolutional layer that estimates the
likelihood that an input window contains the object.
The object identifier $f_i$ is designed as a compact network that
can be trained online, so that it can continuously and rapidly
adapt to changes in the object's appearance. It is trained for 10
epochs on the first frame, and then every 20 frames it is trained
for 20 epochs on the updated training dataset $S_{train}$. The
object identifier is trained using the regularized loss described
in (1).
$$L = \frac{1}{|S_{train}|} \sum_{(x, c) \in S_{train}} \| r(f_\theta(x), c) \|^2 + \lambda \|\theta\|^2, \tag{1}$$
where $\theta$ denotes the weights of the object identifier $f$ and
$\lambda$ is the regularization factor. The function $r$ captures
the difference, or error, between the predicted target confidence
score $f_\theta(x)$ and the ground-truth bounding box centroid $c$
at each spatial location. While $r$ can be straightforwardly
defined as the difference between the two inputs, to address the
issue of data imbalance and increase the model's emphasis on
positive data, we use a combination of least-squares regression and
hinge loss for $r$, as proposed in [6].
Fig. 2 Initialize component of tracker.
Fig. 3 Track component of tracker.
Track: After establishing the initial information, the tracker
estimates object states in the subsequent frames, which is handled
by the Track component. Fig. 3 describes how the Track component
operates on the subsequent frames of a video sequence for an object
$o_i$. To save processing time, the system relies on the bounding
box of the object in the previous frame, $B_{t-1}$, to determine
the search area for the current frame $t$, and the feature $x_t$ is
extracted only in this search area. This area delimits the most
probable position of the object in the current frame, enabling the
tracker to concentrate only on the potential location of the
object. Unlike in [6], where the search area is determined by
multiplying $B_{t-1}$ by a fixed scaling factor $S_c$, we vary
$S_c$ depending on the relative position of the object in the
previous frame, as calculated in (2).
$$AS_c = \frac{\min(d_x, d_y)}{D_{max}} S_c, \tag{2}$$
where, ASc denotes adaptive search area scale, Sc represents the
pre-defined default scaling factor, and dx and dy indicate the
closest distances from the center of the bounding box in the
previous frame to the edges of the input image, respectively.
Dmax signifies the maximum distance from the bounding box
center to the image's edge. If the bounding box center in the
previous frame is closer to the vertical edge, Dmax is calculated
as half the image's width; otherwise, it is calculated as half the
image's height. The adaptive search area scale dynamically
adjusts the search area when the object approaches the image's
edge, causing a partial disappearance and altering the object's
appearance. By restricting the search area, the system focuses
on the most relevant and reliable object features, enabling the
tracker to adapt better to appearance changes and reduce false
outputs.
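The adaptive scale in (2) can be illustrated with a minimal Python sketch. The function name, the 1242x375 image size, and the default scale of 5.0 are hypothetical choices for illustration. The sketch also assumes that d_x is the distance to the nearest left or right edge and d_y the distance to the nearest top or bottom edge, and it clamps the ratio to [0, 1] for centers outside the image, which the text does not specify.

```python
def adaptive_search_scale(cx, cy, img_w, img_h, default_scale):
    """Return AS_c = min(d_x, d_y) / D_max * S_c for a box center (cx, cy)."""
    d_x = min(cx, img_w - cx)  # distance to the nearest left/right edge
    d_y = min(cy, img_h - cy)  # distance to the nearest top/bottom edge
    # D_max is half the width when the left/right edge is closer,
    # otherwise half the height (ties resolved toward the width: assumption).
    d_max = img_w / 2 if d_x <= d_y else img_h / 2
    ratio = min(d_x, d_y) / d_max
    # Clamping is an assumption, not part of (2).
    return default_scale * max(0.0, min(1.0, ratio))


# A box at the image center keeps the default scale, while a box
# close to the left edge gets a much smaller search area.
center_scale = adaptive_search_scale(621.0, 187.5, 1242, 375, 5.0)
edge_scale = adaptive_search_scale(62.1, 187.5, 1242, 375, 5.0)
```

Under these assumptions the scale equals S_c at the image center and shrinks linearly toward zero as the box center approaches an edge.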
Using the extracted features xt, the online object identifier fi
is employed to generate a confidence score map across the
entire search area. This map reflects the likelihood of the
object's presence at different locations. The location with the
highest score is chosen as the centroid of the object's bounding
box in the current frame, representing the most probable
location of the object. By combining this centroid with the
width and height of the bounding box from the previous frame,
a coarse estimation of the object oi in current frame is obtained.
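The coarse estimation step can be sketched as follows. This is only an illustration: the function name is hypothetical, the score map is a plain nested list standing in for the identifier output, and the mapping from score-map cells to image pixels (cell centers on a uniform grid over the search area) is an assumption.

```python
def coarse_box_from_scores(score_map, search_box, prev_size):
    """Pick the highest-scoring cell and reuse the previous width and height.

    score_map  : 2-D list of confidence scores over the search area
    search_box : (x, y, w, h) of the search area, top-left corner in pixels
    prev_size  : (w, h) of the bounding box in the previous frame
    """
    sx, sy, sw, sh = search_box
    rows, cols = len(score_map), len(score_map[0])
    # Location of the maximum confidence score.
    r, c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: score_map[rc[0]][rc[1]],
    )
    # Map the center of the winning cell back to image coordinates.
    cx = sx + (c + 0.5) * sw / cols
    cy = sy + (r + 0.5) * sh / rows
    return (cx, cy, prev_size[0], prev_size[1])


scores = [[0.0] * 4 for _ in range(4)]
scores[1][2] = 0.9  # peak in row 1, column 2
coarse = coarse_box_from_scores(scores, (100, 50, 40, 40), (20, 10))
```

The returned box keeps the previous size, which is why a refinement step is still needed afterwards.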
Although the tracking module performs well in determining the
location of objects within the frame, its accuracy is limited
because it relies on a coarse estimation that reuses the size of
the previous bounding box. To address this limitation, the IoUNet
prediction approach, introduced in [13], is employed to refine and
regress a more precise target box. As shown in Fig. 3, after
obtaining a coarse estimation of the object's location (the new
bounding box centroid together with the bounding box width and
height from the previous frame), ten bounding box proposals,
denoted as $\tilde{B}_t$, are generated by adding uniform noise,
covering a range of potential object positions surrounding the
coarse estimation. Using the extracted features $x_{t-1}$ and
bounding box $B_{t-1}$ from the previous frame, as well as the
current extracted features $x_t$, IoUNet predicts an IoU score for
each proposal in $\tilde{B}_t$. The final estimation, $B_t$, is
obtained by averaging the three proposals with the highest IoU
scores.
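The proposal-and-average step can be sketched as below. The IoUNet itself is not reproduced: `score_fn` is a hypothetical stand-in for the predicted IoU score of a proposal, and the noise magnitude (10% of the box size) is an assumption, since the text only states that uniform noise is added.

```python
import random


def refine_box(coarse_box, score_fn, num_proposals=10, top_k=3,
               noise=0.1, seed=0):
    """Average the top_k highest-scoring noisy proposals around coarse_box.

    coarse_box : (cx, cy, w, h) coarse estimation
    score_fn   : callable returning a score for a proposal; stands in
                 for the IoU predicted by IoUNet (assumption)
    """
    rng = random.Random(seed)
    _, _, w, h = coarse_box
    # Noise range per coordinate, proportional to the box size (assumption).
    scales = (noise * w, noise * h, noise * w, noise * h)
    proposals = [
        tuple(v + rng.uniform(-s, s) for v, s in zip(coarse_box, scales))
        for _ in range(num_proposals)
    ]
    best = sorted(proposals, key=score_fn, reverse=True)[:top_k]
    return tuple(sum(p[k] for p in best) / len(best) for k in range(4))


# Toy scorer: prefers proposals whose center is close to (101, 66).
def toy_score(p):
    return -(abs(p[0] - 101.0) + abs(p[1] - 66.0))


refined = refine_box((100.0, 65.0, 20.0, 10.0), toy_score)
```

Averaging several high-scoring proposals rather than taking the single best one makes the result less sensitive to noise in any individual score.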
Furthermore, in order to maintain the object identifier's
effectiveness, new training samples will be incorporated into
the Strain dataset if they are predicted with a satisfactory level of
confidence. To prevent an excessive increase in memory size,
the oldest samples in Strain will be removed to make room for
the new ones. This process ensures that the object identifier
remains up to date and capable of accurately identifying objects.
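A minimal sketch of this bounded sample memory is given below, assuming a first-in, first-out policy. The class name, the capacity, and the confidence threshold are hypothetical; the text does not give their values.

```python
from collections import deque


class SampleMemory:
    """Fixed-size training set that drops its oldest samples first."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry on overflow.
        self.samples = deque(maxlen=capacity)

    def maybe_add(self, feature, centroid, confidence, min_confidence):
        """Store the sample only if it was predicted confidently enough."""
        if confidence < min_confidence:
            return False
        self.samples.append((feature, centroid))
        return True


memory = SampleMemory(capacity=2)
memory.maybe_add("x1", (10, 10), confidence=0.9, min_confidence=0.5)
memory.maybe_add("x2", (11, 10), confidence=0.2, min_confidence=0.5)  # rejected
memory.maybe_add("x3", (12, 10), confidence=0.8, min_confidence=0.5)
memory.maybe_add("x4", (13, 10), confidence=0.7, min_confidence=0.5)  # evicts x1
```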
C. Data Association Method
The core requirement of any tracking-by-detection method is the
ability to assign detections to existing targets. In detail, the
output of the tracker, which consists of the bounding boxes of the
targets, is associated with the measured bounding boxes given by
the object detector. We tackle the assignment problem with an
approach similar to SORT [1], which employs Intersection over Union
(IoU) and the Hungarian matching algorithm, as shown in Fig. 1.
Given $N$ detected boxes ($D$) from the object detector and $M$
generated track boxes ($T$) from the tracker, the IoU between each
detected box and every generated box is computed, resulting in the
cost matrix $mIoU(D, T)$ described in (3):
$$mIoU(D, T) = \begin{bmatrix} IoU(D_1, T_1) & \cdots & IoU(D_1, T_M) \\ \vdots & \ddots & \vdots \\ IoU(D_N, T_1) & \cdots & IoU(D_N, T_M) \end{bmatrix}, \tag{3}$$
where $IoU$ is the IoU score between a detected box $D$ and a
generated box $T$, calculated as in (4).
$$IoU(D, T) = \frac{D \cap T}{D \cup T}. \tag{4}$$
Given the cost matrix $mIoU$, the Hungarian Algorithm [14] performs
the final stage of the assignment process, resulting in three
possible outputs, namely matches ($T_m$), unmatched tracks ($T_u$),
and unmatched detections ($D_u$), as shown in Fig. 1. Each $T_m$
represents a pairing of $D$ and $T$ matched with the highest IoU
score. $T_u$ indicates tracks whose overlap is less than a
predefined $IoU_{min}$ threshold and, similarly, $D_u$ refers to
unmatched detections, which correspond to detections of new
objects.
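The association stage can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: boxes are assumed to be (x1, y1, x2, y2) tuples, the IoU matrix is padded to a square cost matrix with cost 1.0 (no overlap), and the threshold value 0.3 is hypothetical. The solver is a standard potentials-based Hungarian method that minimizes cost, so the cost is taken as 1 - IoU.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hungarian(cost):
    """Minimum-cost assignment on a square matrix; returns column of each row."""
    n, inf = len(cost), float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)   # row and column potentials
    p, way = [0] * (n + 1), [0] * (n + 1)     # p[j]: 1-based row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    col_of_row = [0] * n
    for j in range(1, n + 1):
        col_of_row[p[j] - 1] = j - 1
    return col_of_row


def associate(dets, tracks, iou_min=0.3):
    """Return (matches, unmatched_tracks, unmatched_dets) as index lists."""
    n = max(len(dets), len(tracks))
    if n == 0:
        return [], [], []
    m_iou = [[iou(d, t) for t in tracks] for d in dets]
    cost = [[1.0] * n for _ in range(n)]      # padding cells mean "no overlap"
    for i, row in enumerate(m_iou):
        for j, value in enumerate(row):
            cost[i][j] = 1.0 - value
    matches = [
        (i, j) for i, j in enumerate(hungarian(cost))
        if i < len(dets) and j < len(tracks) and m_iou[i][j] >= iou_min
    ]
    hit_d = {i for i, _ in matches}
    hit_t = {j for _, j in matches}
    unmatched_tracks = [j for j in range(len(tracks)) if j not in hit_t]
    unmatched_dets = [i for i in range(len(dets)) if i not in hit_d]
    return matches, unmatched_tracks, unmatched_dets


dets = [(0, 0, 10, 10), (20, 20, 30, 30), (100, 100, 110, 110)]
tracks = [(21, 21, 31, 31), (1, 1, 11, 11)]
matches, lost_tracks, new_dets = associate(dets, tracks)
```

In practice the hand-written solver could be replaced by an existing one, for example `scipy.optimize.linear_sum_assignment`, which also accepts rectangular matrices.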
D. Track Management
To handle the sets $T_m$, $T_u$, and $D_u$, our framework employs a
modified version of the track handling approach in [1], [2], as
shown in Fig. 1. Additionally, we propose an adaptive thresholding
method designed for the proposed system, which can also be used for
other feature-based trackers.
Track Initialization and Rejection: New tracks are initialized
whenever the set of unmatched detections $D_u$ is non-empty. Each
element of $D_u$ is initialized as a new track $T$ with the
associated parameters $(c, \gamma, h, a, b, S_c)$, which contain
the bounding box center position $c$, aspect ratio $\gamma$, and
height $h$, followed by the track management parameters: track age
$a$, track hit count $b$, and predefined search area scale $S_c$.
After initialization, each $T$ undergoes data association with the
detected boxes in the subsequent frame. The counter $b$ is
incremented after a successful association. In contrast, for every
unsuccessful association, $a$ and $S_c$ are incremented to
accommodate potential object re-entry, and both are reset to their
initial values if the track subsequently rematches with a detected
box. Any track whose $a$ exceeds a maximum age threshold $A_{Max}$
is considered to have permanently left the frame and is therefore
rejected. Unlike SORT and DeepSORT, which rely on a fixed age
threshold, we propose an adaptive thresholding approach
specifically designed for feature-based trackers, aiming to
minimize associated errors. $A_{Max}$ is defined similarly to
$AS_c$ in (2). With this Adaptive Maximum Age mechanism, when a
vehicle's trajectory starts to extend beyond the frame boundaries,
$A_{Max}$ gradually diminishes, so that these tracks are deleted
quickly. This adaptive approach helps the tracker handle objects
that move out of the frame, improving overall tracking performance.
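The age-based rejection rule can be sketched as follows. Several details are assumptions rather than statements from the text: A_Max is taken as the same edge-distance ratio as in (2) multiplied by a default maximum age, the search scale grows by a fixed step on each miss, and all numeric defaults (30 frames, 0.5 step, scale 5.0) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Track:
    center: tuple                # bounding box center (cx, cy)
    age: int = 0                 # a: consecutive unsuccessful associations
    hits: int = 0                # b: successful associations
    search_scale: float = 5.0    # S_c, hypothetical default
    default_scale: float = 5.0


def adaptive_max_age(center, img_w, img_h, default_max_age):
    """A_Max scaled by the edge-distance ratio of (2) (assumed form)."""
    cx, cy = center
    d_x = min(cx, img_w - cx)
    d_y = min(cy, img_h - cy)
    d_max = img_w / 2 if d_x <= d_y else img_h / 2
    ratio = max(0.0, min(1.0, min(d_x, d_y) / d_max))
    return default_max_age * ratio


def update_track(track, matched, img_w, img_h,
                 default_max_age=30, scale_step=0.5):
    """Update counters after association; return False if the track is rejected."""
    if matched:
        track.hits += 1
        track.age = 0                              # reset on re-match
        track.search_scale = track.default_scale
        return True
    track.age += 1
    track.search_scale += scale_step               # widen search for re-entry
    max_age = adaptive_max_age(track.center, img_w, img_h, default_max_age)
    return track.age <= max_age


central = Track(center=(621.0, 187.5))
leaving = Track(center=(12.42, 187.5))
keep_central = update_track(central, False, 1242, 375, default_max_age=3)
keep_leaving = update_track(leaving, False, 1242, 375, default_max_age=3)
```

With these assumptions a track lost near the image center survives a few missed frames, while a track lost near the border is dropped almost immediately.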
Track Confirm Mechanism: Tracking applications should also consider
scenarios in which objects appear only in a few initial frames.
Therefore, as described in [2], newly created tracks are considered
tentative during their first $H_{min}$ frames and are confirmed
after consecutive successful associations. During these $H_{min}$
frames, a tentative track is deleted immediately after an
unsuccessful association. By requiring consecutive matches, this
method helps minimize errors caused by false or short-lived
detections.
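The confirmation logic amounts to a small state machine, sketched below. The state names, the function name, and the value of H_min are hypothetical; confirmed tracks are left unchanged here because their deletion is governed by the age rule.

```python
TENTATIVE, CONFIRMED, DELETED = "tentative", "confirmed", "deleted"


def step_confirmation(state, hits, matched, h_min=3):
    """Advance the confirmation state of one track by one frame.

    Returns the new (state, hits) pair. A tentative track is deleted on
    its first miss and confirmed after h_min consecutive matches.
    """
    if state != TENTATIVE:
        return state, hits
    if not matched:
        return DELETED, hits
    hits += 1
    return (CONFIRMED if hits >= h_min else TENTATIVE), hits


# A track matched in three consecutive frames becomes confirmed.
state, hits = TENTATIVE, 0
for matched in (True, True, True):
    state, hits = step_confirmation(state, hits, matched)
```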
III. EXPERIMENTAL RESULTS
In this section, we present an in-depth tracking evaluation and
performance comparison to assess the proposed tracking framework
against existing methods on the KITTI Object Tracking Benchmark
[15]. For evaluation, the proposed method is tested on a chosen set
of sequences suited to vehicle tracking. For comparison, we ran the
proposed tracking framework and reproduced results from DeepSORT
[2] and SORT [1] using the same detection input. The results are
evaluated on several selected metrics. Furthermore, performance
analysis and ablation studies are conducted for deeper insight.
A. Dataset
Our tracking framework is evaluated on KITTI Object
Tracking Benchmark [15], a popular dataset utilized for many
tasks in Autonomous Vehicle applications. The dataset
incorporates data from various sensors including Color and
Greyscale Cameras, a Laser Scanner, and GPS/IMU navigation
systems. KITTI Image Dataset comprises a total of 21 Training
and 29 Testing sequences, all captured at a common frame rate
of 10 FPS using vehicle-mounted color cameras positioned at
a height of 1.65 meters above the ground.
While the KITTI Dataset can be used for tracking
applications on either vehicles or pedestrians, our framework
specifically focuses on vehicle tracking. Consequently, a total
of nine sequences, primarily composed of on-road vehicles, are
selected for evaluation. These chosen sequences cover a
diverse range of on-road scenarios spanning from highways to
urban areas for a realistic evaluation of the framework’s
performance.
B. Evaluation Metrics
We use the metrics MOTA, HOTA, and IDF1 [16]–[18] to
evaluate our tracking performance, with each metric providing
different criteria for tracking performance assessment.
MOTA [16] measures tracking performance at the detection level by
matching each ground-truth detection ($gtDet$) with a predicted
detection ($prDet$). The number of matched pairs of $gtDet$ and
$prDet$ is regarded as True Positives ($TP$). The number of
unmatched $prDet$ gives the number of False Positives ($FP$), while
the number of missed $gtDet$ gives the number of False Negatives
($FN$). Moreover, association performance in MOTA is measured with
the concept of Identity Switch ($IDSW$): an $IDSW$ is counted when
objects unexpectedly swap their identity during tracking. The final
MOTA score is calculated as in (5).
$$MOTA = 1 - \frac{|FP| + |FN| + |IDSW|}{|gtDet|}. \tag{5}$$
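As a worked example of (5), the sketch below recomputes MOTA from the error counts reported for the proposed method in TABLE 2. The number of ground-truth detections is not stated in the text; the value 11639 is inferred here as IDTP + IDFN from TABLE 3, which is an assumption.

```python
def mota(fp, fn, idsw, num_gt):
    """MOTA = 1 - (FP + FN + IDSW) / |gtDet|, following (5)."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fp + fn + idsw) / num_gt


# Counts for the proposed method (TABLE 2); num_gt inferred from TABLE 3.
NUM_GT = 8004 + 3635  # IDTP + IDFN (assumption)
proposed_mota = mota(fp=465, fn=2868, idsw=152, num_gt=NUM_GT)  # about 0.7006
sort_mota = mota(fp=775, fn=2928, idsw=153, num_gt=NUM_GT)      # about 0.6687
```

Under this assumption the recomputed values agree with the reported MOTA scores of 70.1 and 66.9 after rounding.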
On the other hand, IDF1 [18] measures tracking errors at the
association level through a bijective matching between the set of
ground-truth trajectories ($gtTraj$) and the set of predicted
trajectories ($prTraj$). Matches between $gtTraj$ and $prTraj$ are
counted as ID-True Positives ($IDTP$). The counts of ID-False
Negatives ($IDFN$) and ID-False Positives ($IDFP$) are the numbers
of remaining unmatched $gtDet$ and $prDet$, respectively. The final
IDF1 score is calculated as in (6).
$$IDF1 = \frac{|IDTP|}{|IDTP| + 0.5|IDFN| + 0.5|IDFP|}. \tag{6}$$
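As a worked example of (6), the sketch below recomputes IDF1 from the IDTP, IDFN, and IDFP counts reported in TABLE 3. Only the function name is introduced here for illustration.

```python
def idf1(idtp, idfn, idfp):
    """IDF1 = IDTP / (IDTP + 0.5 * IDFN + 0.5 * IDFP), following (6)."""
    denom = idtp + 0.5 * idfn + 0.5 * idfp
    return idtp / denom if denom > 0 else 0.0


proposed_idf1 = idf1(idtp=8004, idfn=3635, idfp=1232)  # about 0.767
sort_idf1 = idf1(idtp=7449, idfn=4190, idfp=2037)      # about 0.705
```

Because IDFP enters the denominator directly, the lower IDFP of the proposed method translates into a higher IDF1 even when IDTP values are close.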
HOTA [17] is a recently proposed metric that, unlike MOTA and IDF1,
considers errors at both of the levels mentioned above. This metric
balances the effects of detection and association errors on
tracking performance. At the detection level, HOTA counts $FN$,
$FP$, and $TP$ in a similar way to MOTA to calculate Detection
Accuracy ($DetA$). Additionally, by matching the set of
ground-truth identities ($gtID$) with predicted identities ($prID$)
over the set of $TP$s, HOTA proposes an association-level
evaluation that counts True Positive Associations ($TPA$), False
Negative Associations ($FNA$), and False Positive Associations
($FPA$). These three quantities are used to measure Association
Accuracy ($AssA$). The HOTA score over different localization
thresholds (0.05 to 0.95 in 0.05 intervals) is then calculated as
in (7).
$$HOTA = \int_{0}^{1} HOTA_\alpha \, d\alpha \approx \frac{1}{19} \sum_{\alpha \in \{0.05, 0.10, \ldots, 0.95\}} HOTA_\alpha, \tag{7}$$
where $HOTA_\alpha$, calculated in (8), is the HOTA score at a
localization threshold $\alpha$, which is the geometric mean of
$DetA$ and $AssA$ at $\alpha$.
$$HOTA_\alpha = \sqrt{DetA_\alpha \cdot AssA_\alpha}, \tag{8}$$
where $DetA_\alpha$ and $AssA_\alpha$ are $DetA$ and $AssA$
calculated with localization threshold $\alpha$ as in (9) and (10).
$$DetA_\alpha = \frac{|TP|}{|TP| + |FN| + |FP|}. \tag{9}$$
$$AssA_\alpha = \frac{1}{|TP|} \sum_{c \in TP} \frac{|TPA|}{|TPA| + |FNA| + |FPA|}. \tag{10}$$
Note that TP, FN, FP, TPA, FNA, and FPA in (9) and (10) are
calculated for a particular value of $\alpha$. However, for
clarity, the subscript is omitted.
C. Results
For comparison, we also reproduced two popular motion-based
tracking methods that employ association strategies similar to our
proposed method, namely DeepSORT [2] and SORT [1].
TABLE 2 and TABLE 3 report the results of our evaluation on the
MOTA and IDF1 metrics, respectively. Our method notably improves
the overall MOTA score. In detail, our method ranks first in FN and
FP, surpassing the second-best method (SORT) with a slight
reduction in FN (from 2928 to 2868) and a large reduction in FP
(from 775 to 465). The proposed method also secures the top
position on the other associated parameters, which are Mostly
Tracked ratio (MT), Mostly Lost ratio (ML), and Tracking Precision
(MOTP). Similarly, at the association level, the proposed method
leads in IDF1 due to its relatively low IDFP value compared with
the two other methods.
To conclude, these improvements support the accuracy and
reliability of our approach. The results on the HOTA metric
(TABLE 4) further support this, as the proposed method leads both
SORT and DeepSORT in detection (DetA) and association (AssA).
TABLE 2
EVALUATION RESULTS OF THREE METHODS ON MOTA METRIC.
Method         MOTA↑  MOTP↑  MT↑   ML↓   FN↓   FP↓   IDSW↓  Frag↓
SORT [1]       66.9   75.5   45.2  11.5  2928  775   153    202
DeepSORT [2]   63.3   74.7   44.9  12.5  2970  1169  137    178
Proposed       70.1   80.5   48    10    2868  465   152    211
TABLE 3
EVALUATION RESULTS OF THREE METHODS ON IDF1 METRIC.
Method         IDF1↑  IDTP↑  IDFN↓  IDFP↓
SORT [1]       70.52  7449   4190   2037
DeepSORT [2]   74.23  7971   3668   1867
Proposed       76.68  8004   3635   1232
TABLE 4
EVALUATION RESULTS OF THREE METHODS ON HOTA METRIC.
Method         HOTA↑  DetA↑  AssA↑
SORT [1]       54.65  53.85  56.62
DeepSORT [2]   56.46  52.21  62.62
Proposed       60.77  58.42  63.84
D. Ablation Studies
TABLE 5 summarizes the effect of the different proposed
improvements. We compare our baseline method with two variants that
add strategy improvements, namely Proposedv1 and Proposedv2.
Proposedv1 extends the Baseline method with the Adaptive Maximum
Age Threshold (AMax), resulting in a reduction in IDSW (-185) and
Frag (Fragmentation) (-23). This variant also achieves the best
overall HOTA and MOTA scores, which indicates that the proposed
method benefits considerably from the AMax mechanism.
Proposedv2 additionally integrates the Adaptive Search Scale (ASc)
and the Track Confirm Mechanism (Cfm). ASc is implemented to
minimize errors that may arise when objects move out of the frame,
which can result in feature loss and confuse the tracker. The track
confirm mechanism is used to minimize errors in scenarios where
tracks appear for only a few initial frames. By applying these two
strategies, Proposedv2 significantly reduces IDSW and Frag errors
and achieves the highest association-level score (IDF1). Still, the
improvement in IDF1 (+0.04) compared to Proposedv1 is not
considered significant, and there is a slight decrease in the
overall HOTA and MOTA scores. This trend can be attributed to the
fact that Proposedv2 focuses only on reducing, as much as possible,
errors caused by objective uncertainties.
TABLE 5
ABLATION STUDY ON HOTA, MOTA, IDF1 FOR VARIOUS STRATEGIES: ADAPTIVE
MAXIMUM AGE (AMax), ADAPTIVE SEARCH SCALE (ASc), TRACK CONFIRM (Cfm).
Method       AMax  ASc  Cfm  HOTA↑  IDF1↑  MOTA↑  IDSW↓  Frag↓
Baseline                     58.21  71.68  69.71  396    294
Proposedv1   🗸               61.77  76.64  71.24  211    271
Proposedv2   🗸     🗸    🗸    60.77  76.68  70.06  152    211
E. Limitations
Although the proposed method improves accuracy and reliability, it
still has a limitation: its performance in IDSW and Frag remains
poor compared to previous methods (especially DeepSORT). We
consider this to be a distinctive issue in feature-based tracking,
particularly evident in crowded scenarios, where uncertainties in
object searching are relatively high. For example, Fig. 4
illustrates a situation in which vehicles are in close proximity
and have identical appearances. The proposed method struggles to
differentiate between the objects, leading to uncertain and
incorrect outputs. In contrast, DeepSORT effectively handles the
situation by leveraging the accurate predictions generated by the
Kalman Filter. These findings indicate that future developments
should focus on enhancing the proposed method's capability to
predict object states in crowded environments.
Fig. 4 Types of false output generated by Proposed method
IV. CONCLUSIONS
In this paper, we have introduced a novel feature-based tracking
framework that leverages appearance information to enhance tracking
performance. We have also presented a new track management system
and an adaptive thresholding approach. These techniques have been
specifically designed to address the requirements of vehicle
tracking in ADAS applications. Through evaluation on the
well-established KITTI benchmark, our proposed method has
demonstrated its effectiveness by surpassing existing approaches in
terms of tracking accuracy and robustness. A series of ablation
studies were conducted to further emphasize the advantages of the
integrated track management system and adaptive thresholding
approach. Overall, our findings underscore the importance of
considering appearance information and implementing advanced track
management strategies in feature-based tracking. This paper
contributes to the existing body of knowledge in the field of
vehicle tracking and provides valuable insights for future
developments in ADAS applications.
REFERENCES
[1] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple Online
and Realtime Tracking,” arXiv [cs.CV], 2016.
[2] N. Wojke, A. Bewley, and D. Paulus, “Simple Online and Realtime
Tracking with a deep association metric,” arXiv [cs.CV], 2017.
[3] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu,
and X. Wang, “ByteTrack: Multi-object tracking by associating every
detection box,” arXiv [cs.CV], 2021.
[4] J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani,
“Observation-centric SORT: Rethinking SORT for robust multi-object
tracking,” arXiv [cs.CV], 2022.
[5] Y. Du, Z. Zhao, Y. Song, Y. Zhao, F. Su, T. Gong, and H. Meng,
“StrongSORT: Make DeepSORT Great Again,” arXiv [cs.CV], 2022.
[6] G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte, “Learning
discriminative model prediction for tracking,” arXiv [cs.CV], 2019.
[7] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B.
Lee, “A survey of modern deep learning based object detection
models,” arXiv [cs.CV], 2021.
[8] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P.
Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in
context,” arXiv [cs.CV], 2014.
[9] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7:
Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors,” arXiv [cs.CV], 2022.
[10] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal
speed and accuracy of object detection,” arXiv [cs.CV], 2020.
[11] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, “You Only Learn One
Representation: Unified Network for Multiple Tasks,” CoRR, vol.
abs/2105.04206, 2021.
[12] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Scaled-YOLOv4:
Scaling cross stage partial network,” arXiv [cs.CV], 16-Nov-2020.
[13] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ATOM:
Accurate tracking by overlap maximization,” arXiv [cs.CV], 2018.
[14] H. W. Kuhn, “The Hungarian Method for the Assignment Problem,”
Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955.
[15] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics:
The KITTI dataset,” Int. J. Rob. Res., vol. 32, no. 11, pp. 1231–1237,
2013.
[16] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object
tracking performance: The CLEAR MOT metrics,” EURASIP J.
Image Video Process., vol. 2008, 2008.
[17] J. Luiten, A. Os̆ep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and
B. Leibe, “HOTA: A higher order metric for evaluating multi-object
tracking,” Int. J. Comput. Vis., vol. 129, no. 2, pp. 548–578, 2021.
[18] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi,
“Performance measures and a data set for multi-target, multi-camera
tracking,” arXiv [cs.CV], 2016.