2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

A Reliable Feature-Based Framework for Vehicle Tracking in Advanced Driver Assistance Systems

Ngoc-Quan Ha-Phan1, Thanh-Nguyen Truong2, Vu-Hoang Tran3, Ching-Chun Huang4
123 Ho Chi Minh City University of Technology and Education, Vietnam
1 E-mail: 19145008@student.hcmute.edu.vn
2 E-mail: 19145158@student.hcmute.edu.vn
3 E-mail: hoangtv@hcmute.edu.vn
4 National Yang Ming Chiao Tung University, Taiwan
E-mail: chingchun@nycu.edu.tw

Abstract — Vehicle tracking has always been a vital aspect of modern transportation systems, and the task has gained even more interest with the introduction of Advanced Driver Assistance Systems (ADAS) and Autonomous Vehicles. Most state-of-the-art (SOTA) vehicle trackers, and their enhanced versions, commonly rely on mathematical motion models (e.g., the Kalman Filter) as their core source of information. However, these models may produce unreliable outputs, especially when objects exhibit complex motion patterns. Hence, we propose a reliable feature-based tracking framework that fully exploits distinctive vehicle appearance, and we conduct a comparative analysis with classic motion-based trackers. Additionally, we revisit previously proposed track handling strategies to incorporate a specially designed track management system for feature-based tracking. The proposed method achieves the highest score on all selected multi-object-tracking (MOT) evaluation metrics compared to current SOTA methods on the KITTI dataset. Notably, our approach produces significantly fewer False Positive (FP) errors, demonstrating its ability to minimize unreliable information.

I. INTRODUCTION

Multi-object tracking (MOT) has gained significant popularity in the field of computer vision, finding applications in various domains such as robotic perception, complex medical imaging, and surveillance management. In the context of Autonomous Vehicles and ADAS, object tracking plays a critical role. By continuously monitoring the positions and trajectories of vehicles, pedestrians, and other objects, the tracking framework provides essential information for autonomous systems to make decisions and take appropriate actions. It is a vital component of the perceptual stack of autonomous vehicle systems, supporting navigation, path planning, and even advanced Vehicle-to-Infrastructure (V2I) tasks related to traffic management.

Vehicle tracking has been dominated by the tracking-by-detection (TBD) paradigm, which leverages the strong performance of modern object detectors to enhance tracking accuracy. Specifically, TBD can be seen as a data association approach that aims to match detections across frames in a sequence. Currently, two popular TBD directions are being extensively researched and developed: motion-based tracking and feature-based tracking.

Motion-based tracking has recently gained favor in the TBD field. This approach relies heavily on mathematical motion models, typically the Kalman Filter, to estimate motion cues such as velocity, direction, and acceleration. These cues are then used to predict the object's location in the next frame. Despite its limited ability to accurately estimate object locations under complex motion, this method is widely employed due to its computational efficiency and robust performance in the presence of noise, such as changes in lighting conditions or object appearance variations.
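For concreteness, the sketch below shows a minimal constant-velocity Kalman prediction/update step of the kind SORT-style trackers use. The state layout (box centre plus velocity) and the noise matrices are illustrative assumptions, not the exact configuration of any cited tracker.

```python
# Minimal constant-velocity Kalman step for a box centre (illustrative only).
# State x = [cx, cy, vx, vy]; the measurement is the detected centre [cx, cy].
import numpy as np

dt = 1.0                                   # one frame between updates
F = np.array([[1, 0, dt, 0],               # constant-velocity transition
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],                # only the position is observed
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2                       # assumed process noise
R = np.eye(2) * 1.0                        # assumed measurement noise

def predict(x, P):
    """Propagate the state one frame ahead (the tracker's location guess)."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a matched detection centre z."""
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

x, P = np.array([100., 50., 0., 0.]), np.eye(4)
x, P = predict(x, P)                       # where the filter expects the car
x, P = update(x, P, np.array([103., 51.])) # fuse the new detection
```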
One example of a motion-based tracking method is SORT [1], which utilizes a Kalman Filter to predict each object's state and adopts Intersection over Union (IoU) with the Hungarian algorithm to associate tracks across a video sequence. Although SORT performs well in real-time scenarios, it struggles to track occluded objects efficiently, leading to a relatively high number of identity switches. Addressing the occlusion issue, DeepSORT [2] builds upon SORT and takes object appearance into account. It integrates a pre-trained Convolutional Neural Network (CNN) into the association process and introduces a matching cascade mechanism that computes appearance similarities before employing IoU and Hungarian matching (a minimal sketch of this appearance-matching idea follows at the end of this section). DeepSORT exhibits relatively few identity switches, showcasing its strong performance in occlusion scenarios.

[Fig. 1: System overview of the proposed tracking framework.]

Drawing upon the foundations laid by SORT and DeepSORT, several approaches have been developed to overcome the remaining limitations, such as improving detection quality [3], handling non-linear motion [4], or addressing missing-detection problems [5]. However, the heavy reliance on mathematical models introduces a level of unreliability to the output of trackers in ADAS. The assumptions made by these models often fail to capture the complex and diverse motions exhibited by on-road vehicles in real-world scenarios. For instance, abnormal vehicle behaviors such as sudden lane changes, hard braking, and rapid acceleration, as well as variations in camera motion, all contribute to the wide range of vehicle motions that commonly occur. Additionally, cameras in ADAS systems are not stationary but move with the ego vehicle, which creates further challenges for motion-based tracking systems.

Moreover, the appearance of on-road vehicles remains relatively consistent in terms of size, shape, color, and texture, and these characteristics do not change significantly during tracking. Feature-based tracking can therefore fully exploit these consistent appearance details and leverage them for accurate tracking. In light of this, we propose a fully feature-based multi-object tracking framework that shifts the focus from mathematical models to object appearance as the primary source of tracking information. Our approach builds upon a previously proposed tracker [6] that analyzes rich vehicle features such as shapes, colors, and textures. By utilizing these insights, our framework can effectively handle complex real-world scenarios, including the varied motion patterns of on-road objects and occlusions. To enhance the track management strategy, we also revisit SORT and incorporate an adaptive thresholding approach specifically designed for feature-based tracking. This approach aims to reduce common errors associated with feature-based vehicle tracking, such as out-of-frame handling and feature loss.

In summary, our contributions are as follows:
1. A novel feature-based tracking framework: our proposed tracking framework fully exploits object appearance as tracking information, leading to reliable and accurate outputs.
2. A track management system: we revisit SORT [1] and design a track management system specialized for feature-based tracking.
3. An adaptive thresholding approach: we propose an adaptive thresholding approach to handle problems specific to feature-based trackers, such as out-of-frame and feature-loss scenarios.
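Before moving to the proposed method, the appearance-matching idea referenced above can be sketched as a cosine-distance cost between stored track embeddings and new detection embeddings. The 128-D embedding size and the 0.4 gate below are assumptions for illustration, not DeepSORT's actual settings.

```python
# Cosine-distance appearance cost between track and detection embeddings
# (illustrative of DeepSORT-style appearance matching, not its exact code).
import numpy as np

def cosine_cost(track_feats: np.ndarray, det_feats: np.ndarray) -> np.ndarray:
    """Rows: tracks, columns: detections; smaller cost = more similar."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return 1.0 - t @ d.T                    # cosine distance in [0, 2]

rng = np.random.default_rng(0)
tracks = rng.normal(size=(3, 128))          # 128-D embeddings (assumed size)
dets = rng.normal(size=(4, 128))
cost = cosine_cost(tracks, dets)
cost[cost > 0.4] = 1e5                      # gate implausible pairs (assumed threshold)
```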
II. PROPOSED METHOD

In this section, we present a novel vehicle tracking framework, as shown in Fig. 1, which incorporates an Object Detector, a Feature-based Tracker, a Data Association mechanism, and a Track Management strategy. Notably, we introduce adaptive thresholding approaches specifically designed for feature-based tracking methods. The following subsections provide a comprehensive description of these components, offering detailed insights into their methodologies.

A. Object Detector

Firstly, the accurate identification and precise localization of tracking targets in image sequences are crucial for achieving high-quality tracking performance. This object detection (OD) task plays a pivotal role in determining the overall accuracy and effectiveness of the tracking results. Furthermore, ADAS applications must meet strict processing time constraints. To satisfy these requirements, we evaluated several single-stage OD models from the YOLO family, known for their real-time capability, simplicity, and efficiency [7]. Among these models, YOLOv7 stood out due to its outstanding accuracy on the Microsoft COCO dataset [8], as shown in TABLE 1. Consequently, we chose YOLOv7 as our object detector, thereby providing dependable, real-time object detection that supports reliable tracking in ADAS systems.

TABLE 1: COMPARISON OF BASELINE OBJECT DETECTORS [9]
| Model | #Param.↓ | FLOPs↓ | Size | AP_test↑ | AP50_test↑ | AP75_test↑ |
| YOLOv4 [10] | 64.4M | 142.8G | 640 | 49.7% | 68.2% | 54.3% |
| YOLOR-u5 (r6.1) [11] | 46.5M | 109.1G | 640 | 50.2% | 68.7% | 54.6% |
| YOLOv4-CSP [12] | 52.9M | 120.4G | 640 | 50.3% | 68.6% | 54.9% |
| YOLOR-CSP [11] | 52.9M | 120.4G | 640 | 50.8% | 69.5% | 55.3% |
| YOLOv7 [9] | 36.9M | 104.7G | 640 | 51.2% | 69.7% | 55.5% |

B. Feature-based Tracker

The primary objective of the tracker is to estimate the position of the object based on information from the previous frame. In this paper, we build our feature-based tracker on the idea of the DiMP-50 tracker [6]. The tracker's workflow consists of two primary steps: Initialize and Track. In the following parts of this subsection, we delve into the methodology of these components, providing a detailed overview of their functioning. Moreover, an additional aspect of the tracker, bounding box refinement using IoUNet prediction, will also be described.
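To make the two-step workflow concrete, the toy sketch below mimics the Initialize/Track loop: Initialize stores an appearance model for a new detection, and Track scans a search window around the previous box for the best appearance match. The class and its template-matching stand-in for the learned identifier are our own illustration, not the DiMP-50 implementation.

```python
# Toy sketch of a feature-based tracker's Initialize/Track loop
# (a grayscale template plays the role of the deep features and the online
# identifier; names and logic are illustrative, not the DiMP-50 API).
import numpy as np

class ToyFeatureTracker:
    def initialize(self, frame, box):
        """Store an appearance template for the first detection.
        box = (x, y, w, h) in pixels."""
        x, y, w, h = box
        self.box = box
        self.template = frame[y:y + h, x:x + w].astype(float)

    def track(self, frame, search_scale=2.0):
        """Score every position in a search window around the previous box
        and return the best-matching box (the 'coarse estimation')."""
        x, y, w, h = self.box
        m = int(max(w, h) * (search_scale - 1) / 2)        # search margin
        x0, y0 = max(x - m, 0), max(y - m, 0)
        region = frame[y0:y0 + h + 2 * m, x0:x0 + w + 2 * m].astype(float)
        best, best_pos = np.inf, (x, y)
        for dy in range(region.shape[0] - h + 1):
            for dx in range(region.shape[1] - w + 1):
                err = np.sum((region[dy:dy + h, dx:dx + w] - self.template) ** 2)
                if err < best:
                    best, best_pos = err, (x0 + dx, y0 + dy)
        self.box = (*best_pos, w, h)                       # keep previous size
        return self.box

frame0 = np.random.rand(120, 160)
tracker = ToyFeatureTracker()
tracker.initialize(frame0, (40, 30, 20, 16))
print(tracker.track(frame0))                               # recovers (40, 30, 20, 16)
```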
[Fig. 2: Initialize component of tracker.]

Initialize: The initialize step is responsible for establishing an object identifier when an object first appears, as shown in Fig. 1, based on its characteristics. Fig. 2 illustrates the initialize process. The input to the component is an image containing a newly detected bounding box o_i obtained from the object detector. Data augmentation methods such as translation, rotation, blur, and dropout are then applied to create a set of augmented image samples that accounts for potential variations in object appearance. The initialize process continues with a feature extractor, here ResNet-50, which produces a set of deep feature maps x_j from the augmented image samples. This set is combined with the associated bounding box centroids c_j to create a set of training samples, denoted as $S_{Train} = \{(x_j, c_j)\}_{j=1}^{n}$. The training samples are then used to train the online object identifier f_i, which consists of a single convolutional layer responsible for estimating the possibility that an input window contains the object. The object identifier f_i is designed as a compact network that can be trained online so that it can continuously and rapidly adapt to changes in the object's appearance. It is trained for 10 epochs on the first frame, and then re-trained for 20 epochs every 20 frames on the updated training set S_Train. The object identifier is trained using the regularized loss described in (1):

$$L = \frac{1}{|S_{Train}|} \sum_{(x,c) \in S_{Train}} \left\| r\big(f_\theta(x), c\big) \right\|^2 + \lambda \|\theta\|^2, \qquad (1)$$

where θ denotes the weights of the object identifier f and λ is the regularization factor. The function r captures the difference, or error, between the predicted target confidence score f_θ(x) and the ground-truth bounding box centroid c at each spatial location. While r can be straightforwardly defined as the difference between the two inputs, to address the issue of data imbalance and enhance the model's emphasis on positive data, we opt to use a combination of least-squares regression and hinge loss for r, as proposed in [6].

[Fig. 3: Track component of tracker.]

Track: After establishing the initial information, the tracker proceeds to the task of estimating object states in subsequent frames, which is handled by the Track component. Fig. 3 describes the operational mechanism of the track component on the subsequent frames of a video sequence for an object o_i. To save processing time, the system relies on the bounding box of the object in the previous frame, $B^i_{t-1}$, to determine the search area for the current frame t, and the feature map x_t is extracted only within this search area. The search area delimits the most probable position of the object in the current frame, enabling the tracker to concentrate only on the object's potential location. Unlike [6], where the search area is determined by multiplying $B^i_{t-1}$ by a fixed scaling factor S_c, we vary S_c depending on the relative position of the object in the previous frame, as calculated in (2):

$$AS_c = \frac{\min(d_x, d_y)}{D_{max}} S_c, \qquad (2)$$

where AS_c denotes the adaptive search area scale, S_c represents the pre-defined default scaling factor, and d_x and d_y indicate the closest distances from the center of the bounding box in the previous frame to the vertical and horizontal edges of the input image, respectively. D_max signifies the maximum possible distance from the bounding box center to the image's edge: if the bounding box center in the previous frame is closer to a vertical edge, D_max is calculated as half the image's width; otherwise, it is half the image's height. The adaptive search area scale dynamically shrinks the search area when the object approaches the image's edge, where partial disappearance alters the object's appearance. By restricting the search area, the system focuses on the most relevant and reliable object features, enabling the tracker to adapt better to appearance changes and reduce false outputs.

Using the extracted features x_t, the online object identifier f_i generates a confidence score map across the entire search area. This map reflects the likelihood of the object's presence at different locations. The location with the highest score is chosen as the centroid of the object's bounding box in the current frame, representing the most probable location of the object. By combining this centroid with the width and height of the bounding box from the previous frame, a coarse estimation of the object o_i in the current frame is obtained.
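Equation (2) translates directly into code. The sketch below is a minimal implementation of the adaptive scale rule; the default S_c value and the example coordinates (a KITTI-sized 1242×375 frame) are assumptions for illustration.

```python
# Adaptive search-area scale AS_c from Eq. (2): shrink the search window as
# the box centre nears an image edge (a minimal sketch of the paper's rule;
# the default sc and the example numbers are assumed for illustration).
def adaptive_search_scale(cx, cy, img_w, img_h, sc=5.0):
    dx = min(cx, img_w - cx)          # closest distance to a vertical edge
    dy = min(cy, img_h - cy)          # closest distance to a horizontal edge
    # D_max depends on which edge type is nearer to the centre.
    d_max = img_w / 2 if dx < dy else img_h / 2
    return min(dx, dy) / d_max * sc

# A centred object keeps roughly the full scale; one near the left edge
# gets a much smaller search window.
print(adaptive_search_scale(620, 187, 1242, 375))   # ~5.0
print(adaptive_search_scale(60, 187, 1242, 375))    # ~0.48
```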
Although the tracking module performs well in determining the location of objects within the frame, its accuracy is limited because the coarse estimation inherits the size of the previous bounding box. To address this limitation, the IoUNet prediction approach introduced in [13] is employed to refine and regress a more precise target box. As shown in Fig. 3, after obtaining a coarse estimation of the object's location, consisting of the predicted bounding box centroid together with the bounding box width and height from the previous frame, ten bounding box proposals, denoted as $\hat{B}^t$, are generated by adding uniform noise, covering a range of potential object positions around the coarse estimation. Using the extracted features x_{t-1} and the bounding box $B^i_{t-1}$ from the previous frame, as well as the current extracted features x_t, IoUNet predicts an IoU score for each proposal in $\hat{B}^t$. The final estimation, $B^i_t$, is obtained by averaging the three proposals with the highest IoU scores.

Furthermore, in order to maintain the object identifier's effectiveness, new training samples are incorporated into the S_Train dataset if they are predicted with a satisfactory level of confidence. To prevent an excessive increase in memory size, the oldest samples in S_Train are removed to make room for the new ones. This process ensures that the object identifier remains up to date and capable of accurately identifying objects.

C. Data Association Method

The core requirement of any tracking-by-detection method is the ability to assign detections to existing targets. In detail, the output of the tracker, which includes the bounding boxes of targets, is associated with the measured bounding boxes given by the object detector. We tackle the assignment problem using an approach similar to SORT [1], which employs Intersection over Union (IoU) and the Hungarian matching algorithm, as shown in Fig. 1. Given N detected boxes (D) from the object detector and M generated track boxes (T) from the tracker, the IoU distance between each detected box and all generated boxes results in the cost matrix mIoU(D, T), as described in (3):

$$mIoU(D, T) = \begin{bmatrix} IoU(D_1, T_1) & \cdots & IoU(D_1, T_M) \\ \vdots & \ddots & \vdots \\ IoU(D_N, T_1) & \cdots & IoU(D_N, T_M) \end{bmatrix}, \qquad (3)$$

where IoU(D, T), the IoU score between a chosen detected box D and a generated box T, is calculated as in (4):

$$IoU(D, T) = \frac{|D \cap T|}{|D \cup T|}. \qquad (4)$$

Given the cost matrix mIoU, the Hungarian algorithm [14] performs the final stage of the assignment process, resulting in three possible outputs, namely matched tracks (T_m), unmatched tracks (T_u), and unmatched detections (D_u), as shown in Fig. 1. Each element of T_m represents a pairing of D and T matched with the highest IoU score. T_u indicates tracks whose overlap is less than a predefined threshold IoU_min and, similarly, D_u refers to unmatched detections, which correspond to detections of new objects.
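A compact sketch of this association step is given below, assuming boxes in (x1, y1, x2, y2) form and an illustrative IoU_min of 0.3; SciPy's linear_sum_assignment provides the Hungarian step.

```python
# IoU cost matrix (Eqs. 3-4) plus Hungarian assignment, SORT-style.
# Boxes are (x1, y1, x2, y2); IOU_MIN = 0.3 is an assumed threshold.
import numpy as np
from scipy.optimize import linear_sum_assignment

IOU_MIN = 0.3

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (Eq. 4)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(dets, trks):
    """Return matched (det, trk) index pairs, unmatched dets, unmatched trks."""
    m = np.array([[iou(d, t) for t in trks] for d in dets])   # Eq. (3)
    rows, cols = linear_sum_assignment(-m)                    # maximize total IoU
    matches = [(r, c) for r, c in zip(rows, cols) if m[r, c] >= IOU_MIN]
    unmatched_d = [i for i in range(len(dets)) if i not in {r for r, _ in matches}]
    unmatched_t = [j for j in range(len(trks)) if j not in {c for _, c in matches}]
    return matches, unmatched_d, unmatched_t

dets = [(10, 10, 50, 50), (200, 80, 260, 140)]
trks = [(12, 11, 52, 49), (400, 60, 450, 120)]
print(associate(dets, trks))   # det 0 matches trk 0; det 1 and trk 1 unmatched
```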
D. Track Management

Dealing with the sets T_m, T_u, and D_u, our framework employs a modified version of the track handling approach of [1], [2], as shown in Fig. 1. Additionally, we propose an adaptive thresholding method designed for the proposed system, which can also be applied to other feature-based trackers.

Track Initialization and Rejection: New tracks are initialized from the set of unmatched detections D_u. Each element of D_u is initialized as a new track T with the associated parameters (c, γ, h, a, b, S_c), which comprise the bounding box center position c, aspect ratio γ, and height h, followed by the track management parameters: track age a, track hit count b, and the predefined search area scale S_c. After initialization, each T undergoes data association with the detected boxes in the subsequent frame. The counter b is incremented after every successful association. Conversely, after every unsuccessful association, a and S_c are incremented to accommodate potential object re-entry, and both are reset to their initial values if the track subsequently rematches with a detected box. Any track whose age a exceeds a maximum age threshold A_Max is considered to have permanently left the frame and is therefore rejected. Unlike SORT and DeepSORT, which rely on a fixed age threshold, we propose an adaptive thresholding approach specifically designed for feature-based trackers, aiming to minimize the associated errors. A_Max is defined similarly to AS_c, as described in (2). With this Adaptive Maximum Age mechanism, when a vehicle's trajectory starts to extend beyond the frame boundaries, A_Max gradually diminishes, facilitating the quick deletion of such tracks. This adaptive approach ensures that the tracker effectively handles objects that move out of the frame, enhancing overall tracking performance.

Track Confirm Mechanism: Tracking applications must also handle scenarios in which objects appear only in a few initial frames. Therefore, as described in [2], tracks are considered tentative during their first H_min frames and are confirmed only after consecutive successful associations. During this tentative period, a track is deleted immediately after an unsuccessful association. By requiring consecutive matches, this mechanism helps minimize errors caused by false or short-lived detections.
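The lifecycle above can be summarized in a short sketch. The constants H_MIN and A_MAX_BASE, and the one-unit-per-miss growth of the search scale, are illustrative assumptions rather than the paper's exact settings.

```python
# Track lifecycle sketch: tentative -> confirmed -> rejected, with an
# adaptive maximum age (assumed constants; simplified from the paper's rule).
H_MIN = 3            # frames a track stays tentative (assumed)
A_MAX_BASE = 30      # base maximum age in frames (assumed)

class Track:
    def __init__(self, box, sc=5.0):
        self.box, self.sc0, self.sc = box, sc, sc
        self.age, self.hits = 0, 0           # misses in a row / total matches
        self.confirmed, self.alive = False, True

    def mark_matched(self, box):
        self.box, self.hits = box, self.hits + 1
        self.age, self.sc = 0, self.sc0      # reset on successful re-match
        if self.hits >= H_MIN:
            self.confirmed = True

    def mark_missed(self, edge_ratio):
        """edge_ratio = min(dx, dy) / D_max, as in Eq. (2); near an edge the
        allowed age shrinks so out-of-frame tracks are deleted quickly."""
        self.age += 1
        self.sc += 1.0                       # widen the search for re-entry
        a_max = edge_ratio * A_MAX_BASE      # adaptive maximum age
        if self.age > a_max or not self.confirmed:
            self.alive = False               # tentative tracks die on first miss

t = Track((100, 50, 140, 80))
t.mark_matched((102, 51, 142, 81))           # one hit, still tentative
t.mark_missed(edge_ratio=0.9)                # tentative miss -> rejected
print(t.alive)                               # False
```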
III. EXPERIMENTAL RESULTS

In this section, we present an in-depth analysis of the tracking evaluation and a performance comparison assessing the proposed tracking framework against existing methods on the KITTI Object Tracking Benchmark [15]. For evaluation, the proposed method is tested on a chosen set of sequences specialized for vehicle tracking. For comparison, we executed the proposed tracking framework and reproduced the results of DeepSORT [2] and SORT [1] using the same detection input; the results are evaluated on several selected metrics. Furthermore, performance analysis and ablation studies are conducted for deeper insight.

A. Dataset

Our tracking framework is evaluated on the KITTI Object Tracking Benchmark [15], a popular dataset used for many tasks in Autonomous Vehicle applications. The dataset incorporates data from various sensors, including color and grayscale cameras, a laser scanner, and GPS/IMU navigation systems. The KITTI image dataset comprises 21 training and 29 testing sequences, all captured at a common frame rate of 10 FPS using vehicle-mounted color cameras positioned at a height of 1.65 meters above the ground. While the KITTI dataset can be used for tracking either vehicles or pedestrians, our framework specifically focuses on vehicle tracking. Consequently, a total of nine sequences, primarily composed of on-road vehicles, are selected for evaluation. These chosen sequences cover a diverse range of on-road scenarios, spanning from highways to urban areas, for a realistic evaluation of the framework's performance.

B. Evaluation Metrics

We use the MOTA, HOTA, and IDF1 metrics [16]–[18] to evaluate tracking performance, with each metric providing different criteria for assessment.

MOTA [16] measures tracking performance at the detection level by matching each ground-truth detection (gtDet) with a predicted detection (prDet). The number of matched pairs of gtDet and prDet is counted as True Positives (TP). The number of unmatched prDet gives the number of False Positives (FP), while the number of missed gtDet corresponds to False Negatives (FN). Moreover, association performance in MOTA is measured through the concept of Identity Switches (IDSW); an IDSW is counted whenever objects unexpectedly swap their identities during tracking. The final MOTA score is calculated as in (5):

$$MOTA = 1 - \frac{|FP| + |FN| + |IDSW|}{|gtDet|}. \qquad (5)$$

On the other hand, IDF1 [18] captures tracking errors at the association level by measuring a bijective matching between the sets of ground-truth trajectories (gtTraj) and predicted trajectories (prTraj). Matching between each gtTraj and prTraj yields ID-True Positives (IDTP). The counts of ID-False Negatives (IDFN) and ID-False Positives (IDFP) represent the remaining unmatched gtDet and prDet, respectively. The final IDF1 score is calculated as in (6):

$$IDF1 = \frac{|IDTP|}{|IDTP| + 0.5|IDFN| + 0.5|IDFP|}. \qquad (6)$$

HOTA [17] is a recently proposed metric that, unlike MOTA and IDF1, considers errors at both levels, balancing the effects of detection and association errors on tracking performance. At the detection level, HOTA counts FN, FP, and TP in a similar way to MOTA to calculate Detection Accuracy (DetA). Additionally, by matching the set of ground-truth identities (gtID) with predicted identities (prID) over the set of TPs, HOTA defines an association-level evaluation based on True Positive Associations (TPA), False Negative Associations (FNA), and False Positive Associations (FPA); these three concepts are used to measure Association Accuracy (AssA). The HOTA score over different localization thresholds (0.05 to 0.95 in 0.05 intervals) is calculated as in (7):

$$HOTA = \int_0^1 HOTA_\alpha \, d\alpha \approx \frac{1}{19} \sum_{\alpha = 0.05}^{0.95} HOTA_\alpha, \qquad (7)$$

where HOTA_α, calculated in (8), is the HOTA score at a given localization threshold α, defined as the geometric mean of DetA and AssA at that α:

$$HOTA_\alpha = \sqrt{DetA_\alpha \cdot AssA_\alpha}, \qquad (8)$$

where DetA_α and AssA_α are DetA and AssA calculated with localization threshold α, as in (9) and (10):

$$DetA_\alpha = \frac{|TP|}{|TP| + |FN| + |FP|}, \qquad (9)$$

$$AssA_\alpha = \frac{1}{|TP|} \sum_{c \in TP} \frac{|TPA(c)|}{|TPA(c)| + |FNA(c)| + |FPA(c)|}. \qquad (10)$$

Note that the quantities TP, FN, FP, TPA, FNA, and FPA in (9) and (10) are calculated for a particular value of α; for clarity, the subscript is omitted.
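To make Eqs. (5) and (6) concrete, the snippet below computes MOTA and IDF1 from raw error counts. The IDF1 call plugs in the proposed method's counts reported later in TABLE 3 and reproduces its 76.68 score; the |gtDet| value in the MOTA call is assumed, since the paper does not report it.

```python
# MOTA (Eq. 5) and IDF1 (Eq. 6) from raw error counts.
def mota(fp, fn, idsw, gt_dets):
    return 1.0 - (fp + fn + idsw) / gt_dets

def idf1(idtp, idfn, idfp):
    return idtp / (idtp + 0.5 * idfn + 0.5 * idfp)

# The proposed method's counts from TABLE 3 reproduce its reported IDF1:
print(round(idf1(8004, 3635, 1232) * 100, 2))    # 76.68

# MOTA with the proposed method's error counts and an assumed |gtDet|:
print(round(mota(465, 2868, 152, 11600) * 100, 1))
```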
C. Results

For comparison, we also reproduced two popular motion-based tracking methods that employ association strategies similar to our proposed method, namely DeepSORT [2] and SORT [1]. TABLE 2 and TABLE 3 show the results of our evaluation on the MOTA and IDF1 metrics, respectively. It is evident that our method notably improves tracking performance in the overall MOTA score. In detail, our method ranks first in FN and FP, surpassing the second-best method (SORT) with a slight reduction in FN and a substantial reduction in FP (from 775 to 465). The proposed method also secures the top position on the other associated parameters, namely Mostly Tracked ratio (MT), Mostly Lost ratio (ML), and Tracking Precision (MOTP). Similarly, at the association level, the proposed method is dominant in IDF1 due to its relatively low IDFP value in comparison with the two other methods.

To conclude, these enhancements confirm the accuracy and reliability of our approach. The results on the HOTA metric (TABLE 4) further support this conclusion, as the corresponding performance in both detection (DetA) and association (AssA) leads both SORT and DeepSORT.

TABLE 2: EVALUATION RESULTS OF THREE METHODS ON THE MOTA METRIC
| Method | MOTA↑ | MOTP↑ | MT↑ | ML↓ | FN↓ | FP↓ | IDSW↓ |
| SORT [1] | 66.9 | 75.5 | 45.2 | 11.5 | 2928 | 775 | 153 |
| DeepSORT [2] | 63.3 | 74.7 | 44.9 | 12.5 | 2970 | 1169 | 137 |
| Proposed | 70.1 | 80.5 | 48 | 10 | 2868 | 465 | 152 |

TABLE 3: EVALUATION RESULTS OF THREE METHODS ON THE IDF1 METRIC
| Method | IDF1↑ | IDTP↑ | IDFN↓ | IDFP↓ | Frag↓ |
| SORT [1] | 70.52 | 7449 | 4190 | 2037 | 202 |
| DeepSORT [2] | 74.23 | 7971 | 3668 | 1867 | 178 |
| Proposed | 76.68 | 8004 | 3635 | 1232 | 211 |

TABLE 4: EVALUATION RESULTS OF THREE METHODS ON THE HOTA METRIC
| Method | HOTA↑ | DetA↑ | AssA↑ |
| SORT [1] | 54.65 | 53.85 | 56.62 |
| DeepSORT [2] | 56.46 | 52.21 | 62.62 |
| Proposed | 60.77 | 58.42 | 63.84 |

D. Ablation Studies

TABLE 5 summarizes the comparison of the proposed method under different improvement proposals; we compare our Baseline method against two variants with strategy improvements, namely Proposedv1 and Proposedv2. Proposedv1 updates the Baseline with the Adaptive Maximum Age threshold (AMax), resulting in a reduction in IDSW (-185) and Frag (Fragmentation) (-23). This variant also demonstrates the best performance in the overall HOTA and MOTA scores, indicating that the proposed method benefits considerably from the AMax mechanism. Proposedv2 further integrates the Adaptive Search Scale (ASc) and the Track Confirm mechanism (Cfm). ASc is implemented to minimize errors that may arise when objects move out of frame, potentially resulting in feature loss and confusing the tracker. The track confirm mechanism, on the other hand, minimizes errors in scenarios where tracks appear for only a few initial frames. By applying these two strategies, Proposedv2 significantly reduces IDSW and Frag errors and achieves the highest association-level score (IDF1). Still, the improvement in IDF1 (+0.04) compared to Proposedv1 is not significant, and there is a slight decrease in the overall HOTA and MOTA scores. This trend can be attributed to the fact that Proposedv2 focuses on reducing, as much as possible, the errors caused by objective uncertainties.

TABLE 5: ABLATION STUDY ON HOTA, MOTA, AND IDF1 FOR VARIOUS STRATEGIES: ADAPTIVE MAXIMUM AGE (AMax), ADAPTIVE SEARCH SCALE (ASc), TRACK CONFIRM (Cfm)
| Method | AMax | ASc | Cfm | HOTA↑ | IDF1↑ | MOTA↑ | IDSW↓ | Frag↓ |
| Baseline | | | | 58.21 | 71.68 | 69.71 | 396 | 294 |
| Proposedv1 | ✓ | | | 61.77 | 76.64 | 71.24 | 211 | 271 |
| Proposedv2 | ✓ | ✓ | ✓ | 60.77 | 76.68 | 70.06 | 152 | 211 |

E. Limitations

Although the proposed method has achieved clear improvements in accuracy and reliability, it still exhibits a notable limitation: its performance in IDSW and Frag remains poor compared to previous methods (especially DeepSORT). We consider this a distinctive issue of feature-based tracking, particularly evident in crowded scenarios, where uncertainties in object searching are relatively high.

[Fig. 4: Types of false output generated by the proposed method.]
For example, Fig. 4 illustrates a situation in which vehicles are in close proximity and have nearly identical appearances. The proposed method struggles to differentiate between the objects, leading to uncertain and incorrect outputs. In contrast, DeepSORT handles the situation effectively by leveraging the accurate predictions generated by the Kalman Filter. These findings highlight that future developments should focus on enhancing the proposed method's capability to predict object states in crowded environments.

IV. CONCLUSIONS

In this paper, we have introduced a novel feature-based tracking framework that leverages appearance information to enhance tracking performance. We have also presented a new track management system and an adaptive thresholding approach. These techniques have been specifically designed to address the requirements of vehicle tracking in ADAS applications. Through extensive evaluation on the well-established KITTI benchmark, our proposed method has demonstrated its effectiveness by surpassing existing approaches in terms of tracking accuracy and robustness. A series of ablation studies was conducted to further emphasize the advantages of the integrated track management system and adaptive thresholding approach. Overall, our findings underscore the importance of considering appearance information and implementing advanced track management strategies in feature-based tracking. This paper contributes to the existing body of knowledge in the field of vehicle tracking and provides valuable insights for future developments in ADAS applications.

REFERENCES
[1] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple Online and Realtime Tracking,” arXiv [cs.CV], 2016.
[2] N. Wojke, A. Bewley, and D. Paulus, “Simple Online and Realtime Tracking with a Deep Association Metric,” arXiv [cs.CV], 2017.
[3] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “ByteTrack: Multi-Object Tracking by Associating Every Detection Box,” arXiv [cs.CV], 2021.
[4] J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani, “Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking,” arXiv [cs.CV], 2022.
[5] Y. Du, Z. Zhao, Y. Song, Y. Zhao, F. Su, T. Gong, and H. Meng, “StrongSORT: Make DeepSORT Great Again,” arXiv [cs.CV], 2022.
[6] G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte, “Learning Discriminative Model Prediction for Tracking,” arXiv [cs.CV], 2019.
[7] S. S. A. Zaidi, M. S. Ansari, A. Aslam, N. Kanwal, M. Asghar, and B. Lee, “A Survey of Modern Deep Learning Based Object Detection Models,” arXiv [cs.CV], 2021.
[8] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” arXiv [cs.CV], 2014.
[9] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors,” arXiv [cs.CV], 2022.
[10] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” arXiv [cs.CV], 2020.
[11] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, “You Only Learn One Representation: Unified Network for Multiple Tasks,” CoRR, vol. abs/2105.04206, 2021.
[12] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Scaled-YOLOv4: Scaling Cross Stage Partial Network,” arXiv [cs.CV], 2020.
[13] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ATOM: Accurate Tracking by Overlap Maximization,” arXiv [cs.CV], 2018.
[14] H. W. Kuhn, “The Hungarian Method for the Assignment Problem,” Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955.
[15] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision Meets Robotics: The KITTI Dataset,” Int. J. Rob. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[16] K. Bernardin and R. Stiefelhagen, “Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics,” EURASIP J. Image Video Process., vol. 2008, 2008.
[17] J. Luiten, A. Ošep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, “HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking,” Int. J. Comput. Vis., vol. 129, no. 2, pp. 548–578, 2021.
[18] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi, “Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking,” arXiv [cs.CV], 2016.