FACULTY OF ENGINEERING TECHNOLOGY, CAMPUS GROUP T LEUVEN

Visual perception for an autonomous race car
Implementation of a camera-based perception system

Evrard Rio, Okyay Kerem
Master of Science in Electronics and ICT Engineering Technology, Faculty of Engineering Technology, Campus GROUP T Leuven, Andreas Vesaliusstraat 13, 3000 Leuven, Belgium
Supervisor(s): Dr. Ir. Koen Eneman, Faculty of Engineering Technology, Campus GROUP T Leuven, Andreas Vesaliusstraat 13, 3000 Leuven, Belgium, <koen.eneman@kuleuven.be>
Master Thesis submitted to obtain the degree of Master of Science in Electronics and ICT Engineering Technology, Academic Year 2022 - 2023

ABSTRACT This paper presents a pipelined camera detection system that detects cones in frames and recovers their 3D position. It employs an object detection model, a keypoint regression model, and the Perspective-n-Point algorithm. The paper discusses requirements, related work, and system foundations, followed by the design, implementation, and evaluation of the pipeline. It outlines the datasets, libraries, and frameworks used, and describes the visual perception pipeline and its training schemes. Performance is evaluated per module and for the overall pipeline against the defined targets. The paper also explores improvements for each module, proposes alternative pipeline structures, and evaluates the pipeline against the requirements. It contributes an open design of a low-latency vision stack for high-performance autonomous racing and addresses common bottlenecks in deploying computer vision algorithms. The system achieves sub-80 ms latency and errors below 0.5 m at a range of 10 m. The system and its associated source code are available for future teams of Formula Electric Belgium to improve upon and innovate.

©Copyright KU Leuven — This master's thesis is an examination document that has not been corrected for any errors. Without written permission of the supervisor(s) and the author(s) it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilise parts of this publication should be addressed to KU Leuven, Campus GROUP T Leuven, Andreas Vesaliusstraat 13, 3000 Leuven, +32 16 30 10 30 or via e-mail fet.groupt@kuleuven.be. A written permission of the supervisor(s) is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.

1 INTRODUCTION

1.1 Nature and scope of the problem

Autonomous driving is one of the most complex and challenging problems being tackled today, requiring the cooperation of multiple fields such as computer vision, robotics, and machine learning. The development of autonomous vehicles depends on the efficient collaboration of diverse modules, including controls, perception, mapping, and actuators. Achieving Society of Automotive Engineers (SAE) Level 4 autonomy (49), which requires no driver attention even in emergency situations and challenging weather conditions, promises to significantly improve the safety and efficiency of autonomous vehicles.

Figure 1.1: Super Nova, the driverless car from FEB (2022)
Additionally, even after the map is defined, perception is still required to accurately localize the car’s position within the map and potentially improve it further. • Accurate color information is also crucial to reduce the ambiguity of possible trajectories and provides a means of narrowing the search space of paths. Allowing for quicker, more efficient, and effective heuristics to define the driving line. • In case LiDAR perception system fails or provides worse outputs due to weather or lighting conditions for e.g., a camera-based perception system can complement or even assume total control over the perception system. However, to achieve full autonomy, it is crucial to be able to operate a vehicle close to its limits, including slippery surfaces and avoidance maneuvers. Autonomous car racing is an ideal platform for developing and validating new technologies under challenging conditions. Self-driving race cars offer a unique opportunity to test software required in autonomous transport, such as redundant perception, failure detection, and control in extreme conditions. Testing such systems on closed tracks mitigates the risks of accidents and human injury. Formula Student Germany (23), the largest student engineering competition worldwide, introduced a new category for autonomous race cars in 2017. This category encourages student teams of any size to design and build fully functional autonomous race cars in around nine months. The goal is to foster the development of autonomous vehicles and accelerate progress in this field. In the case of an autonomous race car, the LiDAR’s advantage is the accurate localization ability in 3D space and being less resource-hungry than the neural networks used for image processing tasks. However, camera sensors are feature rich and can extract useful information like the color of a cone which is advantageous for path planning. As the right side of the track is marked with yellow cones and the left side of the track is marked with blue cones, knowing the cones’ color reduces the path’s ambiguity. Camera perception is also more stable in windy or rainy conditions (15) which may occur at the competition. In 2021, Formula Electric Belgium (FEB) (22) started developing their own autonomous race car shown in Figure 1.1. To provide an accurate tracking of the car position and track landmarks, FEB fused two sensors: a LiDAR and an INS/GNSS device (14). While this fusion provided accurate tracking in normal conditions, the possibility of sensor failure leading to an immediate loss of tracking could be catastrophic in a racing scenario. Therefore, there is a need to improve the current system’s efficiency and reliability. In light of this, extensive research and discussions with the past and present Formula Electric Driverless team have led to the conclusion that a complementary vision perception system is required. This thesis aims to improve the overall performance of an operational algorithm by implementing a camera-based perception stack that works in parallel to the LiDAR-based stack in providing data about the environment, namely the location and types of cones in the surrounding of the car. This approach has proven successful with other teams such as AMZ (9) and MIT (36) but is still novel, and the thesis aims to develop a perception stack that compares and combines the best practices of landmark detection to enhance autonomous driving capabilities of the FEB team. 
There are several reasons for including a vision perception module in an autonomous driving system as outlined in (17): 1.2 • Visual perception is essential for accurate mapping and localization, which is key to driving at higher speeds without incurring penalties by hitting land- Overview of driverless car software pipeline At the highest level of abstraction, the software for a driverless car is composed of three main algorithms that work 3 1.3 sequentially: Track Landmarks Detection; Car Localization and Landmarks Mapping; and Car Control. Analysis of related systems and designs To build upon the existing body of knowledge, this work is based on research by previous Formula Student teams around the world. In particular, the MIT/Delft team (50) and AMZ team (31) which have both made significant contributions in the area of autonomous driving perception systems in the context of the Formula Student Competitions. Namely, by fusing LiDAR and camera data to extract environmental information and as part of this strategy achieving top results in competitions. In the first step, Track Landmarks Detection, data from sensors such as cameras or laser rangefinders is used to make an estimation of the position of landmarks in the environment. Data from different kinds of sensors are fused using sensor fusion techniques such as Kalman filter, particle filter, or Extended Kalman Filter (EKF) (24) (51). In the case of an autonomous racecar application, EKF is often given preference because of its ability to deal with noisy measurements from multiple sensors (54). Prior works from formula student teams related to computer vision tasks have explored solutions to monocular/stereo depth estimation and critical system challenges that arise in real world systems. This paper builds on this prior research by compiling the most effective findings from individual studies and presenting a camera pipeline system that delivers the best perception system to Formula Electric Belgium. In the second step about Car Localization and Landmarks Mapping, SLAM (Simultaneous Localization and Mapping) (21) is the most popular approach for an autonomous racecar application. SLAM is a technique used by robots and autonomous systems to use sensor data to create a map of their environment and to simultaneously determine their own location within that map. There are many different approaches to SLAM, and the specific algorithm used depends on the type and configuration of the sensors, the characteristics of the environment, and the requirements of the system. In the case of Formula Electric Driverless, the GTSAM library (16) was used to implement SLAM due to its robustness and efficiency. 1.4 Outline The proposed work introduces a pipelined approach for a camera detection system that can detect cones in images and recover their 3D position using an object detection model, a keypoint regression model, and the Perspectiven-Point algorithm. The paper is structured in a sequential manner, starting with defining the requirements and discussing related work and foundations of the system, then the design concepts and procedure, followed by implementation and evaluation of the pipeline’s performance. In the last step, Car Control, the aim is to simultaneously keep the car within the track limits and maximize the speed. 
The Car Control algorithm typically uses advanced control techniques, such as model predictive control, reinforcement learning, or neural networks, to optimize the car's movements and maximize its speed and performance while ensuring safety and stability. The algorithm continuously updates the car's control outputs based on the sensor data and the car's dynamics to achieve the best lap times and race results.

Sensor fusion and SLAM optimization fall outside the scope of this work; the focus is on Track Landmarks Detection, implemented as a camera-based perception stack whose output, the locations and types of the cones surrounding the car, can be fused with LiDAR data and forwarded to the SLAM algorithm. For this, research was done into computer vision architectures suited to this application, i.e., real-time inference with high accuracy. The latency of the hardware and software stack is of critical importance, since the visual perception system of an autonomous vehicle often dominates the latency of the entire autonomy stack (50). Most of the models and theory in this work are based on work by other teams within the Formula Student competition community; these models and designs are discussed in the following sections.

The paper outlines the datasets used for training the models, the crucial libraries and frameworks employed, and the implementation details of the pipelined approach. The visual perception pipeline is defined, explaining the implementation of the models and their training schemes. The individual modules of the pipeline are evaluated, as well as the pipeline's overall performance, against the requirements and targets set. Furthermore, each module of the pipeline is analyzed for improvement potential, and new pipeline structures are considered to overcome the shortcomings encountered. The discussion section provides a general evaluation of the pipeline with respect to the defined requirements, and the future work section provides ideas for future teams to improve upon the proposed pipeline. The conclusion restates the paper's findings and contributions to the Formula Electric team, with a final word of acknowledgment for all members involved in the creation of this paper.

The paper provides two contributions: firstly, an open design of a low-latency vision stack for autonomous racing, and secondly, a comprehensive description of solutions to common bottlenecks in deploying state-of-the-art computer vision algorithms. The proposed method is not limited to the context of the Formula Student Driverless competition and can be adapted to other visual perception systems for autonomous platforms.

Figure 2.1: Standard track layout, (31)
Figure 2.2: Cases that define minimum look-ahead distance (case 1) and FOV (case 2) requirements, (50)
2 REQUIREMENTS ANALYSIS

The proposed perception system aims to precisely identify and locate the racetrack's environmental landmarks, namely traffic cones, in compliance with the regulations of the Formula Student Germany competition (23). As illustrated in Figure 2.1, the racetrack's left and right boundaries are demarcated by blue and yellow cones, respectively, with orange cones marking the start and finish points. The proposed vision stack aims to detect cones accurately by providing their color (blue, yellow, or orange) and position in 3D space (as coordinates), while avoiding false positives and remaining computationally efficient. The pipeline should be modular, easy to debug, and allow for interchangeable sub-modules. By adhering to these requirements, the perception system can operate efficiently, accurately, and with adequate coverage to provide reliable information for real-time decision-making, and thereby contribute to the overall performance of the autonomous vehicle.

To ensure the visual perception system does not become a bottleneck in the overall vehicle performance, four system requirements have been defined. These requirements have been adapted from the MIT/Delft team (50) and can be generalized to aid in the design of other visual perception systems. The four system requirements are:

1. Latency: the total time it takes for a landmark to be localized from the moment it is captured by the imaging sensor (camera). This puts an efficiency requirement on the system. For safety reasons, as discussed in (50), the latency should not exceed 200 ms.

2. Mapping Accuracy: the maximum acceptable error for landmark localization. It puts an accuracy requirement on the system. The maximum error was derived from the SLAM mapper's characteristics to be around 0.5 m, as per (50).

3. Horizontal Field-of-View (FOV): the arc of visibility, which puts a requirement on the system's coverage area. The horizontal FOV is lower bounded by hairpin U-turns, for which the competition rules dictate a minimum outside radius of 4.5 m (8). The system must perceive landmarks on the inside of a U-turn, as in Figure 2.2, in order to plan an optimal trajectory. Case 2 sets a minimum FOV of 101°, as put forth by (50), which was calculated under the same regulations.

4. Look-ahead Distance: the maximum straight-line distance over which accuracy is maintained. Considering that this vision system complements an existing LiDAR system, FEB arbitrarily defined this at 10 m.

3 RELATED WORK AND CAMERA FOUNDATIONS

3.1 Datasets

Two open-source datasets are used to train the required models: the MIT/Delft "RektNet Dataset" (20) and the Formula Student Objects in Context (FSOCO) dataset (7), a community project developed by the Formula Student community to enable all teams to work with as much data as possible. A comparison of key features of the datasets is shown in Table 3.1.

Table 3.1: Datasets as of February 2023

Dataset      Annotated Images   Classes   Keypoints defined   Teams contributed
MIT/Delft    8000               1         Yes                 2
FSOCO        11572              5         No                  18

The FSOCO dataset is a large and diverse collection of images containing cones, contributed by numerous Formula Student teams using different sensor setups and lighting conditions. This variety of sources makes the dataset an excellent choice for fine-tuning an object detection model such as YOLOv7 to detect cones, as it provides a wide range of examples for the model to learn from. The dataset contains annotations for five distinct classes: blue cones (left side of the track), yellow cones (right side of the track), small and large orange cones (start/finish of the track), and other cones (that do not comply with the rules).

The open-source MIT/Delft dataset was used for training the custom keypoint regression model, RektNet. However, it only contains data from these two teams and their specific setups, and is therefore not as diverse as the FSOCO dataset.

Another dataset of relevance is the COCO (Common Objects in Context) dataset (32), a large-scale image recognition, segmentation, and captioning dataset containing more than 330,000 images with more than 2.5 million object instances labeled with 80 object categories, such as person, car, and bicycle. Each image is accompanied by multiple annotations, including object bounding boxes. This dataset is often used to train large object detection models such as YOLOv7, which are later fine-tuned to a specific use case, such as detecting cones.

3.2 Cone Object Detection

Object detection is used in computer vision to identify and track objects of interest; in this case, the objects of interest are the cones that describe the track. Cone object detection aims to locate the bounding boxes of cone objects and accurately classify them into categories: blue, yellow, large orange, small orange, and unknown cone.

Recent years have seen significant advancements in object detection thanks to deep learning architectures based on convolutional neural networks (CNNs). Several architectures have been proposed, such as Faster R-CNN (47), SSD (Single Shot Detector) (33), and YOLO (You Only Look Once) (52), which have shown promising results on object detection problems. To evaluate the accuracy of these models, the COCO dataset (32) is a popular benchmark. For application in an autonomous race car, the real-time performance of such models must also be considered.

Figure 3.1: YOLOv7 performance on MS COCO dataset compared to other models, (52)

3.2.1 Evaluation Metrics for Object Detection

To understand the evaluation of the object detection model used, it is important to understand the most common metrics involved, namely Precision, Recall, and mean Average Precision (mAP). Precision measures what proportion of identifications are actually correct, whereas Recall measures what proportion of actual objects was identified correctly. These are defined as:

Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

Precision and Recall are related in that increasing one often decreases the other. A model with high precision has a low false-positive rate but may miss many true positives, leading to low recall; a model with high recall identifies most of the true positives but may also have a high false-positive rate, leading to low precision.

mAP (29) is another common metric for object detection models. It is a measure of the average precision (AP), relating to the quality of the predictions, across all object categories, and is calculated as the area under the Precision-Recall curve. mAP values are generally coupled with a threshold: for example, mAP@.5 means that a prediction is considered correct if the intersection-over-union (IoU) between the predicted bounding box and the ground-truth bounding box is greater than or equal to 0.5. IoU (48) measures how much two bounding boxes overlap; a higher value indicates a better overlap.
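To make these metrics concrete, the small sketch below computes the IoU of two axis-aligned boxes and precision/recall from detection counts. It is purely illustrative and not part of the evaluation code of this thesis; cv::Rect is used because the rest of the pipeline also represents boxes this way, and the numeric values are placeholders.

// Illustrative sketch: IoU between two boxes and precision/recall from counts.
#include <opencv2/core.hpp>
#include <cstdio>

double iou(const cv::Rect& a, const cv::Rect& b) {
    const double inter = (a & b).area();              // intersection rectangle
    const double uni   = a.area() + b.area() - inter; // union area
    return uni > 0.0 ? inter / uni : 0.0;
}

double precision(int tp, int fp) { return tp + fp > 0 ? double(tp) / (tp + fp) : 0.0; }
double recall(int tp, int fn)    { return tp + fn > 0 ? double(tp) / (tp + fn) : 0.0; }

int main() {
    cv::Rect pred(100, 120, 40, 60), truth(110, 125, 40, 60);
    double overlap = iou(pred, truth);
    // Under an mAP@.5 criterion, this prediction only counts as a true positive if IoU >= 0.5.
    std::printf("IoU = %.2f -> %s\n", overlap, overlap >= 0.5 ? "TP" : "FP");
    std::printf("P = %.2f, R = %.2f\n", precision(8, 2), recall(8, 3));
    return 0;
}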
3.2.2 YOLO

One of the most widely adopted object detection systems is YOLO, whose primary advantage lies in its ability to achieve high detection speed while maintaining accuracy. Since its first version in 2015 (46), YOLO has undergone numerous iterations to improve its performance. The latest version, YOLOv7, has demonstrated superior performance compared to other existing object detectors in terms of both speed and accuracy (52), as seen in Figure 3.1, making it an ideal candidate for solving the cone object detection problem in an autonomous race car.

YOLO models belong to the category of single-stage object detectors. Unlike two-stage detectors, which first generate region proposals and then refine them, YOLO predicts bounding boxes in a single inference step. The YOLO framework consists of three subcomponents: Backbone, Neck, and Head, shown in Figure 3.2.

Figure 3.2: Architecture of YOLOv4, (44)

The Backbone network is responsible for extracting features from the input image frames. Typically, it consists of a series of convolutional layers that learn to detect edges, textures, and other low-level features. The output of the Backbone is a feature map, a representation of the input image at a lower resolution, which speeds up the detection process and reduces the computational burden.

The Neck further processes this output using a series of convolutional layers with different kernel sizes and strides to capture features at different scales. By doing so, it obtains a unified feature representation that is used by the Head to generate predictions of the objects' locations and classes. The output of the Neck is a set of feature maps with different spatial resolutions, which are used to capture objects of different sizes and shapes.

The Head, also known as Dense Prediction, is the third and final component of the YOLO framework. Its main function is to generate predictions of the objects' locations and classes by processing the unified feature representation obtained from the Neck and applying non-maximum suppression to remove duplicate detections, making the detections more accurate. The Head consists of a set of convolutional layers that predict the coordinates of the bounding boxes and the probabilities of the objects' classes.

To improve the accuracy of object detection, YOLO uses anchor boxes: pre-defined bounding boxes of different sizes and aspect ratios that serve as references for the final bounding boxes. They are defined to capture the scale and aspect ratio of specific object classes and are typically chosen based on object sizes in the training dataset. During detection, the anchor boxes are tiled across the image, and the network predicts the probability and other attributes, such as background, intersection over union (IoU), and offsets, for every tiled anchor box. The predicted offsets, width, and height are then applied to each anchor box to obtain the final bounding box coordinates.
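The sketch below illustrates this anchor-based decoding step. It uses the generic YOLOv2/v3-style parameterization for clarity; the exact offset formulation of YOLOv7 differs, and the anchor size, grid cell, and raw outputs are made-up values.

// Generic anchor-box decoding sketch (not the exact YOLOv7 formulation).
#include <cmath>
#include <cstdio>

struct Box { double cx, cy, w, h; }; // center-based box, in pixels

// grid_x/grid_y: cell indices; stride: pixels per cell; (pw, ph): anchor size in pixels
// (tx, ty, tw, th): raw network outputs for this anchor
Box decodeAnchor(int grid_x, int grid_y, double stride,
                 double pw, double ph,
                 double tx, double ty, double tw, double th) {
    auto sigmoid = [](double v) { return 1.0 / (1.0 + std::exp(-v)); };
    Box b;
    b.cx = (grid_x + sigmoid(tx)) * stride; // predicted offset inside the cell
    b.cy = (grid_y + sigmoid(ty)) * stride;
    b.w  = pw * std::exp(tw);               // anchor size scaled by predicted factor
    b.h  = ph * std::exp(th);
    return b;
}

int main() {
    // Hypothetical 20x36 px anchor at cell (12, 7) of a stride-16 feature map.
    Box b = decodeAnchor(12, 7, 16.0, 20.0, 36.0, 0.3, -0.1, 0.2, 0.1);
    std::printf("box center (%.1f, %.1f), size %.1f x %.1f\n", b.cx, b.cy, b.w, b.h);
    return 0;
}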
3.3 Keypoint Detection

A keypoint detector makes use of features it can detect in an image, which can be of three kinds: flat regions, edges, and corners. By far the most interesting features are corners, which have gradients (changes in image brightness) in two directions and thus enable accurate localization of a single point. Commonly used feature extractors and descriptors include the Harris corner detector (25), SIFT (34), and SURF (11). The problem with such pre-existing feature extraction techniques is that they are designed to detect any feature that meets the criteria for being a feature point. This lack of specificity means that a technique like Harris corners cannot differentiate between feature points located on a cone and those located on the road, for example.

3.3.1 RektNet

In 2020, in the context of the competition, the MIT/Delft team developed a keypoint detector based on residual neural networks (26), called RektNet, implemented in the PyTorch framework and available in their public GitHub repository (19). To improve accuracy, RektNet contains two important network modifications.

First, the fully connected output layer was replaced with a convolutional layer. Convolutional layers outperform fully connected layers at predicting features which are spatially interrelated (geometrically dependent), as is the case for the keypoints of a cone, while also improving training convergence by having fewer parameters.

The architecture of RektNet is designed to detect seven characteristic keypoints within an 80 x 80 x 3 sub-image patch (bounding box). To maintain the input dimensions throughout the network, the structure is composed of ResNet blocks (27). RektNet starts with a convolution layer followed by batch normalization, which normalizes the activations and makes the network easier to train; the batch normalization output then goes through a rectified linear unit (ReLU, see A.1) as the non-linear activation. The next four blocks are basic residual blocks with increasing channel counts C ∈ {16, 32, 64, 128}. Figure 3.3 (left) depicts these blocks. The network works with tensors of size [height, width, number of keypoints, batch size], with height and width fixed at 80 pixels, the number of keypoints set to 7, and a batch size of 8, i.e., the number of data samples processed simultaneously during a forward pass. At the output, the network generates one heatmap per keypoint, where each coordinate of the heatmap corresponds to a coordinate in the image patch, and the expected value over the heatmap is used as the keypoint location.

Figure 3.3: Left: RektNet architecture; Right: Geometric interrelations of keypoints; from (50)
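A minimal sketch of this "expected value over the heatmap" step (soft-argmax) is shown below. It assumes the heatmap has already been normalized so that its values sum to one, and is illustrative rather than the exact RektNet post-processing.

// Illustrative soft-argmax: keypoint location as the expectation of pixel
// coordinates under a normalized heatmap (values assumed to sum to 1).
#include <vector>
#include <utility>
#include <cstdio>

std::pair<double, double> softArgmax(const std::vector<std::vector<double>>& heatmap) {
    double ex = 0.0, ey = 0.0;
    for (size_t y = 0; y < heatmap.size(); ++y)
        for (size_t x = 0; x < heatmap[y].size(); ++x) {
            ex += x * heatmap[y][x]; // E[x]
            ey += y * heatmap[y][x]; // E[y]
        }
    return {ex, ey};
}

int main() {
    // Tiny 3x3 example: most of the probability mass sits around (1, 1).
    std::vector<std::vector<double>> h = {
        {0.00, 0.05, 0.00},
        {0.05, 0.70, 0.10},
        {0.00, 0.10, 0.00}};
    auto [x, y] = softArgmax(h);
    std::printf("keypoint at (%.2f, %.2f) in heatmap coordinates\n", x, y);
    return 0;
}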
The second modification is an additional term in the loss function that leverages the geometric interrelations of the keypoints. The total loss is given as:

L_{total} = L_{loc} + L_{geo}

from (50), where L_{loc} is the location loss and L_{geo} is the geometric loss.

The location loss is the mean squared error (MSE) between the predicted and target keypoint locations:

L_{loc} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} \left( p_{ij} - tp_{ij} \right)^2

where N is the number of training examples, K is the number of keypoints, p_{ij} is the j-th coordinate of the i-th predicted keypoint, and tp_{ij} is the corresponding j-th coordinate of the i-th true keypoint.

The geometric loss term penalizes deviations from the expected geometric relationships between keypoints. Figure 3.3 (right) shows the collinearity of the keypoints: the unit vectors between collinear keypoints must have a unity dot product. The geometric loss is hence given by:

L_{geo} = \gamma_{horz} \left( 2 - V_{12} \cdot V_{34} - V_{34} \cdot V_{56} \right) + \gamma_{vert} \left( 4 - V_{01} \cdot V_{13} - V_{13} \cdot V_{35} - V_{02} \cdot V_{24} - V_{24} \cdot V_{46} \right)

from (50), where \gamma_{horz} and \gamma_{vert} are hyperparameters that determine the relative weight of the cone geometry term and V_{ij} are the unit vectors shown in Figure 3.3 (right).

3.4 Perspective-n-Point

Perspective-n-Point (PnP) is an algorithm in the field of computer vision and robotics that determines the position and orientation of a calibrated camera in a 3D environment from a set of 2D image points and their corresponding 3D world points. The PnP problem has been studied extensively over the years, leading to numerous algorithms for solving it. Before explaining the calculation behind PnP, it is important to understand the camera parameters involved, namely the intrinsic matrix K and the extrinsic matrix [R t], as well as the pinhole model of a camera.

3.4.1 Intrinsic Matrix

The internal camera parameters describe the intrinsic characteristics of a camera and are used to model its geometry and optics. Represented by K, the intrinsic matrix contains the five intrinsic parameters of the pinhole model of the camera (Figure 3.4):

K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

from (28), where (cx, cy) is the optical center (principal point) in pixels; s is the skew coefficient, which is non-zero if the image axes are not perpendicular; and (fx, fy) is the focal length in pixels, given by fx = F/px and fy = F/py, where F is the focal length in world units and (px, py) is the size of a pixel in world units. The parameters cx, cy, fx, fy are illustrated in Figure 3.4.

Figure 3.4: Camera pinhole model, (28)

3.4.2 Extrinsic Matrix

The extrinsic parameters are captured in the extrinsic matrix [R t]. The rotation matrix R transforms points from the world coordinate system to the camera coordinate system, while the translation vector t represents the position of the camera's origin. The camera's coordinate system has its origin at the optical center, and its x- and y-axes define the image plane.

3.4.3 Pinhole Model

The pinhole camera model (28) is a mathematical representation of the way cameras capture images. In this model, shown in Figure 3.4, a 3D point in the world is projected onto the image plane by drawing a straight line from the point through the pinhole to the image plane. The intersection of this line with the image plane determines the 2D location of the point's image. Since the image plane is flat, the resulting image is a 2D representation of the 3D world.

3.4.4 PnP calculation

To arrive at a unique solution to the pose estimation problem, at least three pairs of corresponding points are necessary. The camera pose consists of six degrees of freedom, namely three-dimensional rotation (roll, pitch, and yaw) and translation (x, y, z), and must therefore satisfy six constraints. Each pair of corresponding points provides two constraints, one per image coordinate, so three pairs of points are sufficient to solve for the six degrees of freedom of the camera pose.

Figure 3.5: 3D object model of a cone, (18)

Once the 3D object points are defined with respect to the world frame, as in Figure 3.5, the pose computation problem (35) consists of solving for the rotation R and translation t that minimize the reprojection error of the 3D-2D point correspondences, given the object points p_w in the world frame, their corresponding image projections p_f, and the camera's intrinsic matrix K. This translates to the equation below, from (28), which is solved for [R t]:

p_f = K \cdot [R \; t] \cdot p_w

Alternatively, the equation can be expressed in its extended form:

\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} & R_{13} & t_1 \\ R_{21} & R_{22} & R_{23} & t_2 \\ R_{31} & R_{32} & R_{33} & t_3 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}

In this way, the rotation R and translation vector t from the camera to the cone are obtained, giving the position of the cone in the environment relative to the car. Since the geometry of the cone is a revolution of a triangle around a vertical axis, the rotation matrix R is not useful; the only variable of interest in this application is the translation vector t.
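In practice this minimization is delegated to a solver such as OpenCV's solvePnP, which is also the function used later in the pipeline (Section 6.1.1). The sketch below is a minimal, self-contained illustration; the 3D cone model points, detected keypoints, and intrinsic values are placeholders, not the values used in this thesis.

// Minimal PnP illustration with OpenCV: recover the translation from the camera
// to a cone given its 3D model points and the detected 2D keypoints.
// All numeric values are placeholders; y is the height above the cone base.
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <iostream>

int main() {
    // 3D cone model (meters), defined with respect to the cone base (world frame).
    std::vector<cv::Point3f> objectPoints = {
        { 0.00f, 0.325f, 0.f},                          // apex
        {-0.04f, 0.217f, 0.f}, { 0.04f, 0.217f, 0.f},   // pair at 2/3 height
        {-0.08f, 0.108f, 0.f}, { 0.08f, 0.108f, 0.f},   // pair at 1/3 height
        {-0.11f, 0.000f, 0.f}, { 0.11f, 0.000f, 0.f}};  // pair at the base

    // Corresponding 2D keypoints detected by RektNet, in full-frame pixel coordinates.
    std::vector<cv::Point2f> imagePoints = {
        {640.f, 300.f}, {620.f, 340.f}, {660.f, 340.f},
        {608.f, 380.f}, {672.f, 380.f}, {596.f, 420.f}, {684.f, 420.f}};

    // Intrinsics as obtained from calibration (placeholder values).
    cv::Mat K = (cv::Mat_<double>(3, 3) << 1000, 0, 640,
                                           0, 1000, 512,
                                           0,    0,   1);
    cv::Mat dist = cv::Mat::zeros(5, 1, CV_64F); // distortion assumed negligible here

    cv::Mat rvec, tvec;
    if (cv::solvePnP(objectPoints, imagePoints, K, dist, rvec, tvec)) {
        // Only the translation is of interest: x (side), y (height), z (distance).
        std::cout << "cone position relative to camera: " << tvec.t() << std::endl;
    }
    return 0;
}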
3.5 Field of View

The field of view (FOV) refers to the angular extent of the scene captured by the camera, as shown in Figure 3.6, and is typically measured in degrees. It represents the solid angle subtended at the camera lens by the observed scene. The FOV is determined by several factors, including the focal length of the camera lens, the size of the sensor, and the distance between the camera and the object. A wider FOV is generally achieved with a shorter focal length and a larger sensor, while a narrower FOV is achieved with a longer focal length and a smaller sensor. Additionally, the distance between the camera and the object affects the FOV: objects that are farther away appear smaller and therefore cover less of the total FOV.

Figure 3.6: Field of View, (10)

3.6 Calibration

Camera calibration is the process of determining the intrinsic and extrinsic parameters of a camera. A common technique is to use a checkerboard pattern, as in Figure 3.7, because it provides an easily detectable target in the image. Calibration involves taking several pictures of the checkerboard from different orientations and positions while moving it around in the camera's field of view. The resulting images are processed to find the corners of the checkerboard squares, and the positions of these corners are then used to compute the intrinsic and extrinsic parameters of the camera.

Figure 3.7: Calibration interface of the camera calibration node

There are ready-to-use software options in ROS to calibrate cameras (40), which are used in this thesis. OpenCV's calibration function (42) calculates the horizontal and vertical FOV as:

FOV_x = 2 \arctan\left( \frac{w}{2 f_x} \right), \qquad FOV_y = 2 \arctan\left( \frac{h}{2 f_y} \right)

where w and h are respectively the width and height of the image the camera captures, and fx and fy are intrinsic parameters of the camera as discussed in Section 3.4.1.

Figure 3.8: Left: Stereo camera used; Right: Test setup on RC, camera visible in top right (28)

4 HARDWARE

4.1 Driverless PC

All software is executed on the Driverless PC mounted behind the cockpit, visible in Figure 1.1. This computer is equipped with an AMD Ryzen 5 3600 6-core processor, 16 GB of RAM, 240.1 GB of storage, an NVIDIA TU116 (GeForce GTX 1660 SUPER) GPU, and runs the Ubuntu 20.04 LTS 64-bit operating system.
These specifications guarantee the computer is suitable to meet the demands of the software application and enable realtime performance of the vision pipeline system. 4.2 Camera Monocular cameras and stereo cameras are both widely used in autonomous driving, each with different strengths and limitations. Monocular cameras are simple and inexpensive, relying on other sensors and/or algorithms to determine the distance and position of objects in the scene. However, they can be less accurate and reliable than stereo cameras, particularly in complex or cluttered environments. Stereo cameras use multiple lenses to capture images from different perspectives (typically two), creating a 3D model of the scene that allows for more accurate detection and tracking of objects. Despite their advantages, stereo cameras present their own set of challenges in the context of this research, such as synchronization, stereo calibration, and extra processing load on the hardware. The estimation of 3D poses from a single image captured by a monocular camera presents an ill-conditioned problem due to the inherent limitations in information resulting from the projection of 2D space into 3D space. This problem can be overcome by using prior information about landmarks in the scene which in the case of this project are traffic cones with known shape and size which can thus be modeled as in Figure 3.5. In our pipeline, we worked with a stereo camera provided by Formula Electric Belgium shown in Figure 3.8 on the left but using only one camera output, effectively working with a monocular setup. By doing so, we successfully circumvented any potential stereo synchronization issues. The output of the camera was sent to the Driverless PC via a high-speed USB-3 connection. 4.2.1 Camera Characteristics The camera utilized is the VEN-134-90U3C-D from Daheng Imaging (13). It offers a resolution of 1280x1024, a pixel size of 4.8 micrometers, a global shutter color CMOS sensor, and an adjustable frame rate capped at 90FPS. The camera also has a 12.7mm optical sensor size and provides a 60° horizontal and 72° vertical field of view. As stated in requirements Section 2, the minimum FOV required is 101° which means that in extreme situations such as hairpin U-turns, the camera should be complemented with LiDAR. Alternatively, another camera with a wider field of view but at least the same resolution should be used to address these scenarios and achieve the same performance or better. 4.2.2 Camera Positioning In the test setup, both cameras are mounted on a steel rod for stability and easy adjustment as seen in Figure 3.8 on the right. On the actual race car as in Figure 1.1, the camera would be positioned above the driver’s seat similarly to how is being done by other teams with similar setups (50) (31). This position offers the advantage that the visual overlap between cones is reduced to a minimum meaning that even the cones placed behind another (in same line of sight) can be perceived sufficiently well. 11 5 SOFTWARE 5.2.1 The software design approach adopted for this project was geared towards ensuring real-time reliability and the ability to interpret data and sub-module outputs. To achieve this, the system was broken down into distinct parts, with each sub-module undergoing debugging and testing. This approach indirectly leads to the optimization of the entire pipeline. The project was segmented into two primary tasks, each implemented as a separate software package. 
The first task is image acquisition, for which the package contains code for the camera driver. The second task is image processing, which involves retrieving cone positions from a frame, and the package contains code related to processing the images through neural networks and 3D coordinate estimation logic. 5.1 Image Acquisition Package This package contains the driver code for the camera VEN-134-90U3C-D from Daheng Imaging (13). Its primary function is to serve as a bridge between the camera hardware and the software system, thereby providing an interface that standardizes and documents how to access data from the imaging sensor. One of the advantages of this is that it allows users to decouple the camera from processing. This feature allows tuning of the camera parameters to meet individual needs or to use a different camera altogether. The package delivers an adjustable frame rate (FPS) captured by the image processing package of Section 5.2. An important factor to consider when developing this package was the resolution of the images since there is a fundamental trade-off between using high- and low-resolution images. While high-resolution images are useful for accurate keypoint detection, which improves the performance of the PnP algorithm, they also add extra latency and overhead when transporting and storing them. 5.2 Image Processing Package This package encompasses all the image processing steps necessary to produce a position estimate relative to the camera. It takes in the frames captured from the camera, which are sent from the image acquisition package of Section 5.1, as input, and outputs translation vector from the camera to the landmarks. The image processing package is composed of three distinct parts: Cone detection, Keypoint detection, and 3D pose estimation, each explained in the following sections. Cone Detection As noted in the related work Section 3.2, the YOLOv7 object detection algorithm is currently the most accurate and efficient model for object detection. Therefore, it was selected as the algorithm of choice. The input images are 640x640 BGR color images (OpenCV works with Blue-Green-Red color format), and the output generated are bounding boxes that indicate the location of detected cones within the frame. From the FSOCO dataset described in Section 3.1, 8000 random images were used for training and 1000 random images were used for test data. YOLOv7 and YOLOv7-tiny were trained with default parameters. The training script runs a genetic algorithm to find the best anchors for the images, then saves them to the model and starts training. The models are finetuned rather than trained from scratch to be more efficient. This means only the last layers of the model were fine-tuned with the FSOCO dataset, resulting in 100 total epochs being run on the 8000 training images. 5.2.2 Keypoint detection Including prior information about the geometry of the landmark can significantly improve the accuracy of the landmark localization process. As discussed in the related work Section 3.3.1, the MIT/Delft team has addressed this by developing RektNet, a powerful tool for detecting keypoints on the bounding boxes of the landmarks. The input is a normalized 80x80 image of a cone. This image was cropped according to the bounding box from the frame captured by the camera. After going through the RektNet, the output is a vector of keypoint coordinates for that cone image. 
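A minimal sketch of this preprocessing step is shown below, under the assumption that the bounding box has already been clipped to the frame; it is illustrative and not the project's exact code.

// Illustrative preprocessing for RektNet: crop the detected cone, resize it to
// 80x80 and normalize pixel values to [0, 1].
#include <opencv2/imgproc.hpp>
#include <opencv2/core.hpp>

cv::Mat prepareConePatch(const cv::Mat& frame, const cv::Rect& box) {
    cv::Mat patch;
    cv::resize(frame(box), patch, cv::Size(80, 80)); // crop, then scale to 80x80
    patch.convertTo(patch, CV_32FC3, 1.0 / 255.0);   // BGR bytes -> floats in [0, 1]
    return patch;
}

Note that keypoints predicted within this 80x80 patch must be mapped back to full-frame pixel coordinates before they are passed to the PnP stage, since PnP uses the intrinsics of the full camera frame.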
The MIT/Delft team documented the training scheme of RektNet in their paper (50) and provide a hands-on tutorial through their GitHub repository (19). RektNet is trained as in the tutorial using around 3.2k annotated images from (20), split 85/15 between training and validation. Before training, the training parameters are defined as in Table 5.1. Note that the L2 softargmax loss is good at penalizing large errors. Training keeps track of the best epoch and stops before running all epochs if the loss does not decrease further.

Table 5.1: RektNet training parameters

Number of keypoints     7
Learning rate           0.1
Rate of decay           0.999
Number of epochs        1024
Use geometric loss      Yes
Loss function           L2 Softargmax
Batch size              8
Maximum tolerance       8

5.2.3 Pose estimation

As outlined in Section 3.4, the PnP algorithm needs at least three points to obtain a unique solution. In the context of this thesis there are seven points that depict the cone, and there is prior knowledge of the cone's geometry (Figure 3.3). This allowed a 3D model of the cone to be created, as in Figure 3.5, which enables the PnP algorithm to retrieve the cone's 3D position relative to the camera as explained in Section 3.4. The inputs are the keypoints detected for a cone and the corresponding 3D model of the cone, which differs depending on whether the cone is large or small. The outputs are, as discussed in Section 3.4, the rotation vector R and the translation vector t. As the focus is solely on the position of the cone, given by the translation vector t, the rotation vector R may be disregarded.

6 IMPLEMENTATION

6.1 Frameworks

6.1.1 OpenCV

OpenCV (Open-Source Computer Vision) is a popular open-source computer vision and machine learning software library (3). The library contains more than 2,500 algorithms focused around computer vision. In this project, version 4.x was used to:

• Define the images/frames the camera sensor provides and store them as OpenCV matrices.
• Manipulate images before, e.g., forwarding them through the YOLO or RektNet network. This includes preprocessing steps such as resizing, changing the color model, or normalizing.
• Perform non-maximum suppression, i.e., filtering out overlapping boxes, using the NMSBoxes function from the built-in OpenCV dnn library (2).
• Return the rotation and translation vectors with the built-in solvePnP method (41).

6.1.2 LibTorch

LibTorch (1) is a C++ library that provides a wrapper to load, manipulate, and run PyTorch models in C++ code. It is built on top of the PyTorch C++ API and allows developers to use PyTorch's capabilities in C++ projects. PyTorch (4) itself is a popular open-source machine learning library based on Torch, a scientific computing framework, and provides a Python API for building and training machine learning models. However, there are use cases where Python might not be the best fit; for example, embedded systems or real-time applications may require a low-level language such as C++. This is where LibTorch comes in handy. In this project, LibTorch is used for several purposes, namely:

• Loading the YOLO and RektNet models in TorchScript format using the load method of the JIT library (PyTorch).
• Preprocessing the model inputs from OpenCV matrices to torch input tensors and applying preprocessing transformations on these tensors.

For this project, LibTorch was installed for Linux and the CUDA 11.7 computing platform, which is optimized for GPU use.
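The sketch below illustrates this usage pattern: loading a TorchScript module and converting an OpenCV matrix to an input tensor. The file name, input size, and device are placeholders and not necessarily those used in the project.

// Illustrative LibTorch usage: load a TorchScript model and run a BGR OpenCV
// image through it. File name, input size and device are placeholders.
#include <torch/script.h>
#include <opencv2/imgproc.hpp>
#include <opencv2/core.hpp>
#include <vector>

int main() {
    torch::jit::script::Module model = torch::jit::load("rektnet.torchscript.pt");
    model.to(torch::kCUDA);
    model.eval();

    cv::Mat patch(80, 80, CV_8UC3, cv::Scalar(0, 0, 0)); // stand-in for a cropped cone
    patch.convertTo(patch, CV_32FC3, 1.0 / 255.0);       // normalize to [0, 1]

    // HWC (OpenCV) -> NCHW (PyTorch), then move to the GPU.
    torch::Tensor input = torch::from_blob(patch.data, {1, 80, 80, 3}, torch::kFloat32)
                              .permute({0, 3, 1, 2})
                              .contiguous()
                              .to(torch::kCUDA);

    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(input);

    torch::NoGradGuard no_grad;
    torch::Tensor heatmaps = model.forward(inputs).toTensor(); // e.g. [1, 7, 80, 80]
    return heatmaps.dim() == 4 ? 0 : 1;
}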
6.1.3 ROS

ROS (Robot Operating System) (5) is a set of software libraries and tools that provide a flexible, extensible, and well-documented framework for building robot applications. The system is composed of a set of nodes, each with its own function, that communicate with each other through topics and thus inform each other of the state of the system as a whole. ROS provides a number of useful features and tools, of which the most relevant to this project are:

• Initializing ROS nodes and NodeHandles, which enable communication between nodes.
• A publish-subscribe communication model, which allows different parts of a robot system to send and receive messages to coordinate their actions. In this project, the image publisher node decodes the information from the camera and sends the frames as messages to the topic /left/image_raw as the "publisher". The inference node, as the "subscriber" to that topic, receives each message and thus the frame from which it may infer the locations and types of the cones present.

Besides these contributions, ROS possesses a large library of pre-built software components (called "packages") that can complement the system, including libraries to calibrate the camera, send messages between nodes, or test the system. ROS is also equipped with a suite of tools for building, debugging, and deploying robot applications, including RViz, a 3D visualization tool to display and interact with sensor data, robot models, and other 3D information in real-time, which will be useful when testing the car as a whole with the vision stack integrated.

6.2 Camera Pipeline

Although it might be justified to attempt an end-to-end image-to-pose estimation, or even an image-to-throttle-and-steering relation, this work employs a part-based approach. This allows for easy debugging and extensive testing, indirectly helping to optimize each sub-module until a certain level of performance is reached by the pipeline as a whole. A part-based approach also allows for easy replacement of sub-modules. This segregation is visible in, e.g., the inference package, where the sub-modules (YOLO, RektNet, and PnP) are separated into classes and their use can be easily controlled.

6.3 Implementation Notes

6.3.1 Cone Struct

From the moment a cone is detected by YOLO, it is treated as an entity defined by a struct, which is accessed in other sections of the code by reference (instead of by value) for efficiency. The Cone struct members are:

• Bounding Box: an OpenCV Rect object representing the bounding box.
• Confidence: the confidence associated with the bounding box.
• Class ID: the type of cone (yellow, blue, small orange, large orange, or unknown).
• Keypoints Vector: a vector of keypoints, each represented by a pair of integers for its coordinates.
• Translation Vector: a vector representing the translation from the base of the cone to the camera.
• Validity: a Boolean variable indicating the validity of the cone.

The validity of the cone is checked at every stage of the pipeline because, on rare occasions, a fallen cone is detected and/or the position estimate is unreasonable. These invalid cones are filtered out as a final step by the PnP handler if the returned coordinates exceed arbitrarily defined thresholds: ±5 m for x (side to side), ±1.5 m for y (up/down), and less than 2 m or more than 20 m for z (distance).

6.3.2 Cone 3D modeling

As explained in Section 3.4, Perspective-n-Point infers a rotation vector R and translation vector t to a landmark by using prior information about that landmark. In this project, this prior information is a 3D model of the cone defined by seven 3D points with respect to the base of the cone (the world frame), as shown in Figure 3.5. These points are located where keypoints are best detected by RektNet, namely at every 1/3 of the height of the cone. The 3DHandler class possesses two distinct 3D object models, one for small cones and one for large cones, so that the PnP estimation (which looks at the shape of the landmark) can be adapted and give accurate coordinates for both types. These models are represented by vectors of 3D points whose coordinates are determined dynamically from the height and radius of the cone, as sketched below.
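A condensed C++ sketch of how the cone entity and the 3D cone model could look is given below. The member names, cone dimensions, and keypoint heights are illustrative assumptions, not the code running on the car.

// Illustrative sketch of a Cone struct and the 3D cone model used for PnP.
#include <opencv2/core.hpp>
#include <vector>

struct Cone {
    cv::Rect boundingBox;                 // detection box from YOLOv7
    float confidence = 0.f;               // detection confidence
    int classId = 0;                      // yellow, blue, small/large orange, unknown
    std::vector<cv::Point2i> keypoints;   // 7 keypoints in full-frame pixels
    cv::Vec3d translation;                // cone base relative to the camera (m)
    bool valid = true;                    // set to false when filtered out
};

// Seven model points (meters) with respect to the cone base: the apex plus a
// left/right pair at each third of the cone's height, tapering with its radius.
std::vector<cv::Point3f> buildConeModel(float height, float baseRadius) {
    std::vector<cv::Point3f> pts;
    pts.emplace_back(0.f, height, 0.f);               // apex
    for (int level = 2; level >= 0; --level) {        // 2/3 h, 1/3 h, base
        float y = height * level / 3.f;
        float r = baseRadius * (1.f - y / height);    // cone tapers linearly
        pts.emplace_back(-r, y, 0.f);                 // left keypoint
        pts.emplace_back( r, y, 0.f);                 // right keypoint
    }
    return pts;
}

For example, buildConeModel(0.325f, 0.114f) would roughly approximate a small competition cone; the exact dimensions used in the thesis are those defined by the 3DHandler class.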
6.3.3 Pseudocode

For reference, and to provide a better understanding of the image acquisition and inference packages, the pseudocode for the main file of both packages is displayed below, each with an explanation.

Pseudocode for image acquisition:

Procedure: Image Acquisition
    init ROS node
    init camera device
    init image_transport ROS package to publish images
    set loop rate at which to acquire images from the camera
    while ros::ok()
        acquire image from camera as OpenCV Mat object
        transform OpenCV Mat object to a ROS image sensor message
        publish image to the respective topic

The image acquisition package initializes itself as a ROS node, the camera device, and the transport method used to communicate with the topic. Next, it continuously acquires images from the camera, transforms them into OpenCV Mat objects, and sends them to the topic, which will be used by the image inference package.

Pseudocode for image processing:

Procedure: Image Processing
    transform received message to OpenCV Mat object
    forward the image through the YOLOv7 network
    for cone in 8 largest cones:
        crop the image with the cone bounding box
        forward the cone image through RektNet
        apply PnP to the obtained keypoints
    filter out invalid cones

The image inference package, after initializing itself as a ROS node and subscribing to the topic containing the images, continuously processes the incoming images. This means forwarding each image through the YOLO object detection network to get the bounding boxes of the cones. The 8 largest boxes (in theory the closest cones) are then forwarded through the RektNet keypoint regression model to locate the keypoints, which are passed to the 3DHandler class that executes the PnP calculation and outputs the desired translation vector t.
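As a hedged C++ sketch of what the inference node's skeleton could look like with the ROS and OpenCV APIs described above, consider the following; YoloDetector, RektNet, and PnPHandler stand in for the project's actual classes, and the detection results are left as commented placeholders.

// Illustrative ROS inference node skeleton; the detector, keypoint and PnP
// classes are placeholders for the project's actual implementation.
#include <ros/ros.h>
#include <image_transport/image_transport.h>
#include <cv_bridge/cv_bridge.h>
#include <sensor_msgs/Image.h>
#include <opencv2/core.hpp>
#include <algorithm>
#include <vector>

struct Detection { cv::Rect box; int classId; float confidence; };

void processFrame(const sensor_msgs::ImageConstPtr& msg) {
    cv::Mat frame = cv_bridge::toCvCopy(msg, "bgr8")->image; // ROS message -> OpenCV Mat

    std::vector<Detection> cones = /* YoloDetector::detect(frame) */ {};

    // Keep the 8 largest boxes (in theory the closest cones).
    std::sort(cones.begin(), cones.end(), [](const Detection& a, const Detection& b) {
        return a.box.area() > b.box.area();
    });
    if (cones.size() > 8) cones.resize(8);

    for (const Detection& c : cones) {
        // keypoints   = RektNet::predict(frame(c.box));
        // translation = PnPHandler::solve(keypoints, c.classId);
        // publish or store the resulting cone position here
    }
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "inference_node");
    ros::NodeHandle nh;
    image_transport::ImageTransport it(nh);
    image_transport::Subscriber sub = it.subscribe("/left/image_raw", 1, processFrame);
    ros::spin(); // hand control to ROS; processFrame runs for every incoming frame
    return 0;
}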
7 EVALUATION

The system is composed of three main parts: object detection, keypoint detection, and 3D pose estimation. During the testing phase, object detection was assessed individually, while keypoint detection and 3D pose estimation were evaluated together. Furthermore, the performance of each part of the system was analyzed in terms of execution time. To test the system in a realistic setting, data was collected in the form of rosbags (6), a format used in ROS to store and log data as a collection of messages exchanged between ROS nodes. These messages can include sensor data (e.g., camera images, laser scans), state information, control commands, and any other data exchanged within the ROS ecosystem. For this work, only camera images were stored.

7.1 YOLO cone detection performance

Two YOLOv7 models were finetuned on the same FSOCO dataset (Section 3.1) and compared, namely YOLOv7 and YOLOv7-tiny. YOLOv7-tiny is a smaller model with fewer parameters; it therefore generally detects faster and is less resource-intensive, at the cost of accuracy compared to YOLOv7. The performance of the two trained models is compared on a test set of 1000 images from the FSOCO dataset; the results are displayed in Table 7.1. YOLOv7-tiny shows an overall Precision of 79.9%, Recall of 68.4%, and mAP@.95 of 0.451, compared to YOLOv7 with an overall Precision of 83.1%, Recall of 73.9%, and mAP@.95 of 0.502. Note that these averages are lowered significantly by the class of "Unknown" cones, but in a similar way for both models. The Precision/Recall curves and the confusion matrices for YOLOv7-tiny and YOLOv7 are shown in Appendix B.

Table 7.1: Test set results for YOLOv7-tiny and YOLOv7

                     YOLOv7-tiny                        YOLOv7
Class            Precision  Recall  mAP@.95     Precision  Recall  mAP@.95
Blue             0.895      0.755   0.515       0.912      0.8     0.566
Yellow           0.88       0.74    0.5         0.91       0.794   0.556
Small orange     0.867      0.744   0.506       0.901      0.8     0.556
Large orange     0.813      0.772   0.564       0.877      0.819   0.627
Unknown          0.541      0.412   0.172       0.543      0.479   0.205
All              0.799      0.684   0.451       0.831      0.739   0.502

7.2 Keypoint detection and PnP 3D pose estimation performance

Keypoint detection and pose estimation are evaluated for accuracy along the longitudinal (z) and latitudinal (x) coordinates. The remaining y axis corresponds to height, which remains constant on a flat surface, as is the case in the test and competition setup. Test results indicate highly accurate and consistent measurements along the y axis, with negligible variations; consequently, the y axis is deemed irrelevant for the evaluation.

7.2.1 Longitudinal error (z)

In this context, longitudinal error refers to the error in the system's estimation of the z coordinate of the translation vector, which indicates how far a cone is "in front" of the camera. A cone with 0 cm longitudinal error means that the system has correctly estimated the distance to the cone. Results for the longitudinal error relative to the ground truth are shown for all cones in Figure 7.3 and by means of a box plot in Figure 7.4, with different colors for each class and with the dotted red line representing the acceptable error range (requirement 2). The box plot of Figure 7.4 reveals that predictions generally tended to be placed further away than the ground-truth values. The errors were strongly grouped by class type; large orange cones were generally placed furthest, with the greatest error at almost 40 cm offset.

7.2.2 Latitudinal error (x)

Similarly, latitudinal error refers to the error in the system's estimation of the x coordinate of the translation vector, which indicates how far a cone is "to the side". A cone with 0 cm latitudinal error means that the system has correctly estimated the cone's x position. Results for the latitudinal error relative to the ground truth are shown by means of a box plot in Figure 7.5, again with different colors for each class and with the dotted red line representing the acceptable error range (requirement 2). The evaluation revealed that predictions were somewhat skewed depending on cone type.
7.2.3 Test setup for longitudinal and latitudinal error

For evaluating the longitudinal error z and the latitudinal error x, physical cones are placed every 50 cm along the longitudinal and latitudinal axes on a straight line, as shown in Figure 7.1, effectively positioned on a diagonal. The camera is then moved to different latitudinal positions (moving sideways, as shown in Sub-figure 7.1b) with respect to the starting cone, in order to capture the furthest cones without any overlap between cones, which can cause estimation errors. This test was performed for all cone classes. The data was then compiled in tables per cone class using Microsoft Excel (version 2304), and only data from non-overlapping cones was considered in the evaluation.

Figure 7.1: Testing setup. (a) View from camera at start position; (b) View from camera along latitudinal axis to the left

7.3 Pipeline latency

To assess the latency (requirement 1), different parts of the system were timed using the chrono C++ library (12), which includes a high-resolution clock. The results of these measurements are displayed in Figure 7.2. They show that the vision system can process each frame within 76 ms, corresponding to 13 frames per second. Analyzing the time distribution per submodule, YOLOv7 cone detection requires 7 ms (YOLOv7-tiny being 2-3 ms faster), followed by RektNet keypoint detection at 60 ms for 8 cones and, finally, the PnP calculation at 1 ms, also for 8 cones.

Figure 7.2: System latency by parts
Figure 7.3: Absolute distance error by cone type
Figure 7.4: Average longitudinal z error by cone class
Figure 7.5: Average latitudinal x error by cone class

8 DISCUSSION

8.1 YOLO performance

The experimental results indicate that YOLOv7 outperforms YOLOv7-tiny on the test set in terms of accuracy, despite both being trained with the same hyper-parameters. For this application, small differences in mAP@.95 are significant, as high precision is required. As explained in Section 3.2.1, a high mAP@.95 means the cone is accurately enclosed by the bounding box, which is very advantageous for identifying the keypoints of the cone in the next step. This suggests that the increased number of layers and parameters of YOLOv7 provides an advantage in detection accuracy.

Furthermore, the number of cones detected differed between the two models, especially for cones further away. This is related to Precision and Recall as defined in Section 3.2.1: after anchor boxes are defined and the final boxes are obtained, false positives are rarely observed; however, false negatives are still present, i.e., the model did not detect cones that were in the frame, because anchor boxes can only "filter out" measurements to increase the overall accuracy. YOLOv7 detected cones that were further away much better than YOLOv7-tiny, which failed to detect them entirely. For an application such as a racing car, this random error was considered unacceptable. The detection difference is illustrated in Figure 8.1, where the same frames were processed using both object detection models. Lastly, since processing one frame with YOLOv7 was only around 2-3 ms slower than with YOLOv7-tiny and the overall system remained well within the latency budget, YOLOv7 was the chosen model.

8.2 RektNet and PnP performance

As mentioned in Section 5.2.2, the RektNet model was initially trained on the MIT/Delft dataset.
8.2 RektNet and PnP performance

As mentioned in Section 5.2.2, the RektNet model was initially trained on the MIT/Delft dataset. Unfortunately, despite best efforts, the results achieved by the MIT/Delft team could not be replicated. While z distance errors remained below 50 cm up to a range of 10 m, as can be seen in Figures 7.3 and 7.4, the standard deviation of 19 cm was not as low as the 5 cm reported by the MIT/Delft team (50). One noteworthy observation made during testing is that the MIT/Delft team did not provide any information regarding x distance estimation, which is vital for a complete pose estimation analysis. Our results in Figure 7.5 show that the x distance estimates provided by RektNet were also accurate, with errors typically below 50 cm.

Figure 8.1: Object detection comparison. (a) Object detection from YOLOv7-tiny; (b) object detection from YOLOv7. YOLOv7-tiny detects fewer cones than YOLOv7, especially at greater distances.

The longitudinal and latitudinal errors were observed to be correlated with the type of cone, which could be attributed to RektNet's difficulty in accurately identifying keypoints depending on the cone's color. Furthermore, it was noted that the errors along both the longitudinal and latitudinal axes increased in a random manner when the camera shifted its focus away from the cone. This behavior could be attributed to inadequate camera calibration; despite our best efforts, consistent results were not achieved for different angles.
Testing also revealed that the performance of RektNet was not as robust as anticipated. Specifically, RektNet struggled to estimate the keypoints of overlapping cones in YOLOv7 predictions. This issue stems from the inherent architecture of RektNet (Section 3.3.1), which computes keypoints from heatmap values that can easily be influenced by any background cones present in a given prediction bounding box.
Overall, neither the latitudinal nor the longitudinal error threshold was breached; hence the mapping accuracy and look-ahead distance requirements were satisfied.

8.3 Pipeline speed performance

For safety reasons, as discussed in requirement 1, the latency should not exceed 200 ms. Since the measured latency stays well below this bound, the latency requirement is satisfied. In order to provide comprehensive environmental information during racing conditions, it is highly desirable for the vision stack to process as many frames as possible. The current inference time of 76 ms (Figure 7.2), equivalent to approximately 13 frames per second (FPS), is a good achievement in this regard. However, it falls short of the real-time processing performance achieved by the MIT/Delft team (50) of around 30 ms, equivalent to roughly 30 FPS. The RektNet keypoint detection appears to be a notable bottleneck in the processing pipeline, and further optimization opportunities are discussed in Section 9 as part of future work. Nonetheless, the vision stack effectively complements the LiDAR system, which operates at 10 FPS (43).

9 FUTURE WORK

The purpose of this camera-based system, as discussed in Section 1.1, is to complement the existing LiDAR system in providing the Simultaneous Localization and Mapping (SLAM) algorithm with information about the environment, namely the locations and classes of the cones. As mentioned in Section 1.2, leveraging the strengths of both camera-based visual SLAM and LiDAR can be done using an extended Kalman filter (EKF) (24) (51). An EKF is a mathematical algorithm used to estimate the state of a system in the presence of noise and uncertainty. In this case, the EKF can be used to fuse the data from the camera and the LiDAR to obtain a more accurate estimate of the car's location and a more accurate map of the environment. The integration of camera-based visual SLAM and LiDAR through an EKF can be achieved by first running the LiDAR-based SLAM pipeline to obtain a rough estimate of the car's pose and map, and then refining this estimate with the camera data, weighted by an appropriate factor, to improve the accuracy of the car's pose and of the map.
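As an illustration of this fusion step, the sketch below applies two sequential Kalman measurement updates to a single cone position: first the LiDAR observation, then the camera observation. Because the cone position is observed directly, the linear Kalman update suffices here (the EKF reduces to it when the measurement model is linear), and the Kalman gain plays the role of the weight factor mentioned above. All covariances and measurement values are made-up illustrative numbers; this is a simplified stand-in for a full SLAM back end, not the thesis implementation.

import numpy as np

def kf_update(x, P, z, R):
    """One Kalman measurement update for a directly observed 2D cone position.

    x, P : state estimate (cone position [x, z] in metres) and its covariance
    z, R : measurement and its noise covariance
    """
    H = np.eye(2)                      # measurement model: position observed directly
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain, i.e. the weight of the new data
    x = x + K @ (z - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Prior belief about one cone from the map, then fuse LiDAR and camera observations.
x, P = np.array([1.0, 8.0]), np.diag([0.5**2, 0.5**2])
x, P = kf_update(x, P, np.array([1.05, 7.90]), np.diag([0.05**2, 0.10**2]))  # LiDAR: low noise
x, P = kf_update(x, P, np.array([0.90, 7.60]), np.diag([0.20**2, 0.40**2]))  # camera: higher noise
print(x, np.sqrt(np.diag(P)))   # fused position and remaining 1-sigma uncertainty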
Formula Electric Driverless has discussed plans to run its entire system on a smaller computing machine such as a Jetson Xavier (37), and there are several solutions that can increase the runtime speed of the developed vision stack to accommodate this change. One effective technique is int8 quantization (30), which reduces the precision of each weight or activation value to 8 bits, significantly reducing the memory requirements and computational cost of a neural network. Additionally, parameter pruning (39) can be used to remove insignificant weights from a model, reducing its size and improving its efficiency.
Another important factor is the backbone of RektNet. Previous work (44) has demonstrated that CSPNet blocks (53) can reduce computing effort by up to 50% while maintaining similar or even better performance than standard ResNet blocks. Although integrating such blocks into the backbone was attempted during the thesis, it was not successful.
An important improvement this thesis did not manage to fully implement is passing the cones (cropped from the frame by their bounding boxes) as a batch to RektNet. This improvement is discussed at length in Section 3.4.2 of (44). Since the keypoint regression network was trained to accept batches of several images per inference, several crops could be passed through the network in a single forward pass. In contrast, the current solution iterates through each bounding box (cone) sequentially and thus performs several forward passes per frame, which is not ideal. However, to see this through, the inference needs to make use of TensorRT.
A final improvement could be the use of TensorRT (38), a software development kit developed by NVIDIA that optimizes and accelerates the inference of deep learning models. It is designed to provide high performance for deep learning applications by exploiting certain NVIDIA GPUs, such as the one in the Driverless PC, to achieve faster processing times. This thesis started implementing this solution and installed the required libraries on the Driverless PC, but was not able to fully implement it. Note that TensorRT first requires converting the PyTorch models to the ONNX format, so the code needs to be adapted to work with the models in this format instead. If this implementation is achieved, the processing could be even faster and would be optimized both for the currently used hardware and for the Jetson Xavier.
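As a starting point for such a TensorRT deployment, the sketch below shows the PyTorch-to-ONNX conversion step with a dynamic batch dimension, so that a variable number of cone crops per frame could be processed in a single inference. The network, crop size, keypoint count, and file name are placeholders standing in for the trained RektNet checkpoint, and building the actual TensorRT engine (for example with NVIDIA's trtexec tool or the TensorRT API) is not shown.

import torch

# Placeholder network standing in for the trained RektNet; a real export
# would load the actual model definition and its weights instead.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 7, kernel_size=3, padding=1),  # one heatmap per keypoint (7 assumed)
).eval()

# A batch of cone crops, as proposed above; 80x80 is an assumed crop size.
dummy_crops = torch.randn(8, 3, 80, 80)

torch.onnx.export(
    model,
    dummy_crops,
    "rektnet.onnx",
    input_names=["crops"],
    output_names=["heatmaps"],
    dynamic_axes={"crops": {0: "batch"}, "heatmaps": {0: "batch"}},
    opset_version=13,
)
# The resulting rektnet.onnx file can then be compiled into a TensorRT engine,
# for example: trtexec --onnx=rektnet.onnx --fp16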
10 CONCLUSION

The proposed work has implemented a pipelined approach for a camera detection system that can detect cones in images and recover their 3D position using a 2D object detector, a keypoint regression model, and the Perspective-n-Point algorithm. We have presented the challenges that required known solutions to be optimized into designs suitable for Formula Student Driverless. The paper has provided a comprehensive description of solutions to common bottlenecks encountered when deploying state-of-the-art computer vision algorithms, and has presented an open design of a tested low-latency vision stack for high-performance autonomous racing.
The end result of this work is a perception system that achieves a sub-80 ms latency from view to depth estimation, with errors smaller than 0.5 m at a distance of 10 m. This satisfies all requirements except for the FOV requirement, which can be overcome by using another high-resolution camera, as discussed in Subsection 4.2.1. This system and its associated source code are available for future teams of Formula Electric Belgium to improve upon and innovate further at http://github.com/BayKeremm/thesis-code.

ACKNOWLEDGEMENTS

We would like to thank Formula Electric Belgium for this amazing opportunity to work on a life-sized, real-time, and very demanding yet extremely interesting project. Not only did it help us learn new skills and improve our technical abilities, it also helped us understand the dynamics of such an organization. We therefore thank all the members, advisors, and generous sponsors of Formula Electric Belgium for making this project possible. We would also like to thank our colleague Antoine Bauer for providing the hardware to train YOLOv7.

BIBLIOGRAPHY

[1] Installing PyTorch C++ API. https://pytorch.org/cppdocs/installing.html.
[2] OpenCV: DNN module. https://docs.opencv.org/4.x/d6/d0f/group__dnn.html#ga9d118d70a1659af729d01b10233213ee.
[3] OpenCV Documentation. https://docs.opencv.org/4.x/index.html.
[4] PyTorch. https://pytorch.org/.
[5] Robot Operating System. https://www.ros.org/.
[6] rosbag - ROS Wiki. http://wiki.ros.org/rosbag.
[7] (2021). FSOCO Dataset. https://www.fsoco-dataset.com/.
[8] (2022). Formula Student Rules 2022. https://www.formulastudent.de/fileadmin/user_upload/all/2022/rules/FS-Rules_2022_v1.0.pdf.
[9] AMZ (2023). AMZ Racing. https://www.amzracing.ch/en/node/1255.
[10] asmag.com (2013). 7 points to heed when setting FoV. https://www.asmag.com/showpost/14308.aspx.
[11] Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF: Speeded Up Robust Features. In Leonardis, A., Bischof, H., and Pinz, A., editors, Computer Vision – ECCV 2006, pages 404–417, Berlin, Heidelberg. Springer Berlin Heidelberg.
[12] C++ (Last Updated: 1 October 2022). Chrono library date and time utilities. https://en.cppreference.com/w/cpp/chrono.
[13] Daheng Imaging (2023). Daheng Imaging Industrial Vision Cameras. https://www.get-cameras.com/dahengimaging.
[14] De Ryck, R. (2016). State Estimation for Autonomous Vehicles. PhD thesis, KU Leuven.
[15] De Silva, D., Roche, J., and Kondoz, A. (2017). Fusion of LiDAR and Camera Sensor Data for Environment Sensing in Driverless Vehicles.
[16] Dellaert, F. and Contributors, G. (2022). borglab/gtsam. https://github.com/borglab/gtsam.
[17] Dhall, A. (2018). Real-time 3D Pose Estimation with a Monocular Camera Using Deep Learning and Object Priors On an Autonomous Racecar. PhD thesis, ETH Zurich.
[18] Dhall, A., Dai, D., and Van Gool, L. (2019). Real-time 3D Traffic Cone Detection for Autonomous Driving. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 494–501.
[19] Driverless, M. and Core, C. (2021). MIT-Driverless-CV-TrainingInfra. https://github.com/cv-core/MIT-Driverless-CV-TrainingInfra.
[20] Driverless MIT/Delft (2021). RektNet Dataset. https://storage.cloud.google.com/mit-driverless-open-source.
[21] Durrant-Whyte, H. and Bailey, T. (2006). Simultaneous localization and mapping: part I. IEEE Robotics & Automation Magazine, 13(2):99–110.
[22] FEB (2023). Formula Electric Belgium. https://formulaelectric.be/.
[23] FSG (2023). Formula Student Germany. https://www.formulastudent.de/fsg/.
[24] Fujii, K. and Group, T. A.-S.-J. (2004). Extended Kalman Filter. https://www-jlc.kek.jp/2004sep/subg/offl/kaltest/doc/ReferenceManual.pdf.
[25] Harris, C. G. and Stephens, M. J. (1988). A Combined Corner and Edge Detector. In Alvey Vision Conference.
[26] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385.
[27] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition.
[28] HediVision (2019). Pinhole Camera Model. https://hedivision.github.io/Pinhole.html.
[29] Hui, J. (2020). mAP (Mean Average Precision) for Object Detection. https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173.
[30] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A. G., Adam, H., and Kalenichenko, D. (2017). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CoRR, abs/1712.05877.
[31] Kabzan, J., de la Iglesia Valls, M., Reijgwart, V., Hendrikx, H. F. C., Ehmke, C., Prajapat, M., Bühler, A., Gosala, N. B., Gupta, M., Sivanesan, R., Dhall, A., Chisari, E., Karnchanachari, N., Brits, S., Dangel, M., Sa, I., Dubé, R., Gawel, A., Pfeiffer, M., Liniger, A., Lygeros, J., and Siegwart, R. (2019). AMZ Driverless: The Full Autonomous Racing System. CoRR, abs/1905.05150.
[32] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T., editors, Computer Vision – ECCV 2014, pages 740–755, Cham. Springer International Publishing.
[33] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single Shot MultiBox Detector. In Leibe, B., Matas, J., Sebe, N., and Welling, M., editors, Computer Vision – ECCV 2016, pages 21–37, Cham. Springer International Publishing.
[34] Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110.
[35] Marchand, E., Uchiyama, H., and Spindler, F. (2016). Pose Estimation for Augmented Reality: A Hands-On Survey. IEEE Transactions on Visualization and Computer Graphics, 22(12):2633–2651.
[36] MIT (2023). MIT Driverless. http://driverless.mit.edu/.
[37] NVIDIA Corporation. NVIDIA Jetson AGX Xavier. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/.
[38] NVIDIA Corporation. TensorRT. https://developer.nvidia.com/tensorrt.
[39] O’Keeffe, S. and Villing, R. (2018). Evaluating pruned object detection networks for real-time robot vision. In 2018 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), pages 91–96.
[40] Open Source Robotics Foundation (Last Edited: 19 October 2020). ROS Camera Calibration. http://wiki.ros.org/camera_calibration.
[41] OpenCV (2021). solvePnP. https://docs.opencv.org/4.x/d5/d1f/calib3d_solvePnP.html.
[42] OpenCV (2022). Camera Calibration and 3D Reconstruction – Calibration Matrix Values. https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#calibrationmatrixvalues.
[43] Ouster (n.d.). Ouster OS1 Datasheet. https://data.ouster.io/downloads/datasheets/datasheet-rev7-v3p0-os1.pdf.
[44] Perauer, C. (2021). Development and Deployment of a Perception Stack for the Formula Student Driverless Competition.
[45] PyTorch. JIT Library. http://pytorch.org/docs/stable/generated/torch.jit.load.html.
[46] Redmon, J. and Farhadi, A. (2017). YOLO9000: Better, Faster, Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525.
[47] Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.
[48] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019). Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, Los Alamitos, CA, USA. IEEE Computer Society.
[49] SAE (2021). SAE Levels of Driving Automation Refined for Clarity and International Audience. https://www.sae.org/blog/sae-j3016-update.
[50] Strobel, K., Zhu, S., Chang, R., and Koppula, S. (2020). Accurate, Low-Latency Visual Perception for Autonomous Racing: Challenges, Mechanisms, and Practical Solutions. CoRR, abs/2007.13971.
[51] Thrun, S., Burgard, W., and Fox, D. (2005). Probabilistic Robotics. The MIT Press, Cambridge, MA, USA.
[52] Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.
[53] Wang, C.-Y., Liao, H.-Y. M., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., and Yeh, I.-H. (2020). CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[54] Xue, Z. and Schwartz, H. (2013). A comparison of several nonlinear filters for mobile robot pose estimation. In 2013 IEEE International Conference on Mechatronics and Automation (ICMA), pages 1087–1094.

APPENDICES

A  ReLU
B  YOLO Performance Metrics

APPENDIX A: RELU

Figure A.1: ReLU: activation function shape

APPENDIX B: YOLO PERFORMANCE METRICS

Figure B.1: YOLOv7-tiny: PR curve and confusion matrix for different cone types
Figure B.2: YOLOv7: PR curve and confusion matrix for different cone types