FACULTY OF ENGINEERING TECHNOLOGY
CAMPUS GROUP T LEUVEN
Visual perception for an
autonomous race car
Implementation of a camera-based perception system
Kerem OKYAY
Rio EVRARD
Supervisor(s):
Dr. Ir. Koen Eneman
Master Thesis submitted to obtain the degree
of Master of Science in Electronics and ICT
Engineering Technology
Academic Year 2022 - 2023
Faculty of Engineering Technology Campus GROUP T Leuven — Master Thesis submitted to obtain
the degree of Master of Science in Electronics and ICT Engineering Technology
Academic Year 2022 - 2023
Visual perception for an autonomous race car
Implementation of a camera-based perception system
Evrard Rio, Okyay Kerem
Master of Science in Electronics and ICT Engineering Technology, Faculty of Engineering Technology, Campus GROUP T
Leuven, Andreas Vesaliusstraat 13, 3000 Leuven, Belgium
Supervisor(s): Dr. Ir. Koen Eneman
Faculty of Engineering Technology, Campus GROUP T Leuven, Andreas Vesaliusstraat 13, 3000 Leuven, Belgium,
<koen.eneman@kuleuven.be>
ABSTRACT
This paper presents a pipelined approach for a camera detection system to detect cones in frames
and recover their 3D position. It employs an object detection model, a keypoint regression model,
and the Perspective-n-Point algorithm. The paper discusses requirements, related work, and system
foundations, followed by design, implementation, and evaluation of the pipeline. It outlines datasets,
libraries, and frameworks used, and describes the visual perception pipeline and its training schemes.
Performance evaluation includes module and overall pipeline performance against targets. The paper
explores improvements for each module, proposes alternative pipeline structures, and evaluates the
pipeline against requirements. It contributes an open design of a low-latency vision stack for high-performance autonomous racing and addresses bottlenecks in deploying computer vision algorithms. The system
achieves sub-80 ms latency and errors below 0.5 m at a range of 10 m. This system and its associated source code are
available for future teams of Formula Electric Belgium to improve upon and innovate.
©Copyright KU Leuven — This master’s thesis is an examination document that has not been corrected for any errors. Without written permission of the supervisor(s) and the author(s)
it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilise parts of this publication should
be addressed to KU Leuven, Campus GROUP T Leuven, Andreas Vesaliusstraat 13, 3000 Leuven, +32 16 30 10 30 or via e-mail fet.groupt@kuleuven.be A written permission of the
supervisor(s) is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in
scientific contests.
1 INTRODUCTION

1.1 Nature and scope of the problem
Autonomous driving is one of the most complex and challenging problems being tackled today, requiring the cooperation of multiple fields such as computer vision, robotics,
and machine learning. The development of autonomous
vehicles depends on the efficient collaboration of diverse
modules, including controls, perception, mapping, and actuators. Achieving Society of Automotive Engineers (SAE)
Level 4 autonomy (49), which requires no driver attention even in emergency situations and challenging weather
conditions, promises to significantly improve safety and efficiency of autonomous vehicles.
Figure 1.1: Super Nova, driverless car from FEB 2022
marks (cones). Additionally, even after the map is
defined, perception is still required to accurately localize the car’s position within the map and potentially
improve it further.
• Accurate color information is also crucial to reduce the ambiguity of possible trajectories and provides a means of narrowing the search space of paths, allowing for quicker, more efficient, and more effective heuristics to define the driving line.
• In case the LiDAR perception system fails or provides degraded outputs, e.g. due to weather or lighting conditions, a camera-based perception system can complement or even assume total control over the perception task.
However, to achieve full autonomy, it is crucial to be able to
operate a vehicle close to its limits, including slippery surfaces and avoidance maneuvers. Autonomous car racing
is an ideal platform for developing and validating new technologies under challenging conditions. Self-driving race
cars offer a unique opportunity to test software required
in autonomous transport, such as redundant perception,
failure detection, and control in extreme conditions. Testing such systems on closed tracks mitigates the risks of
accidents and human injury.
Formula Student Germany (23), the largest student engineering competition worldwide, introduced a new category
for autonomous race cars in 2017. This category encourages student teams of any size to design and build fully
functional autonomous race cars in around nine months.
The goal is to foster the development of autonomous vehicles and accelerate progress in this field.
In the case of an autonomous race car, the LiDAR's advantages are its accurate localization ability in 3D space and its lower resource consumption compared to the neural networks
used for image processing tasks. However, camera sensors are feature rich and can extract useful information like
the color of a cone which is advantageous for path planning. As the right side of the track is marked with yellow
cones and the left side of the track is marked with blue
cones, knowing the cones’ color reduces the path’s ambiguity. Camera perception is also more stable in windy or
rainy conditions (15) which may occur at the competition.
In 2021, Formula Electric Belgium (FEB) (22) started developing their own autonomous race car shown in Figure
1.1. To provide an accurate tracking of the car position and
track landmarks, FEB fused two sensors: a LiDAR and an
INS/GNSS device (14). While this fusion provided accurate tracking in normal conditions, the possibility of sensor
failure leading to an immediate loss of tracking could be
catastrophic in a racing scenario. Therefore, there is a
need to improve the current system’s efficiency and reliability. In light of this, extensive research and discussions
with the past and present Formula Electric Driverless team
have led to the conclusion that a complementary vision
perception system is required.
This thesis aims to improve the overall performance of
an operational algorithm by implementing a camera-based
perception stack that works in parallel to the LiDAR-based
stack in providing data about the environment, namely the
location and types of cones in the surrounding of the car.
This approach has proven successful with other teams
such as AMZ (9) and MIT (36) but is still novel, and the
thesis aims to develop a perception stack that compares
and combines the best practices of landmark detection to
enhance autonomous driving capabilities of the FEB team.
There are several reasons for including a vision perception
module in an autonomous driving system as outlined in
(17):
• Visual perception is essential for accurate mapping and localization, which is key to driving at higher speeds without incurring penalties by hitting land-
1.2 Overview of driverless car software pipeline
At the highest level of abstraction, the software for a driverless car is composed of three main algorithms that work
sequentially: Track Landmarks Detection; Car Localization and Landmarks Mapping; and Car Control.
1.3 Analysis of related systems and designs
To build upon the existing body of knowledge, this work
is based on research by previous Formula Student teams
around the world, in particular the MIT/Delft team (50) and the AMZ team (31), which have both made significant contributions in the area of autonomous driving perception systems in the context of the Formula Student competitions, namely by fusing LiDAR and camera data to extract environmental information and, as part of this strategy, achieving top results in competitions.
In the first step, Track Landmarks Detection, data from
sensors such as cameras or laser rangefinders is used to
make an estimation of the position of landmarks in the environment. Data from different kinds of sensors are fused
using sensor fusion techniques such as Kalman filter, particle filter, or Extended Kalman Filter (EKF) (24) (51). In
the case of an autonomous racecar application, EKF is
often given preference because of its ability to deal with
noisy measurements from multiple sensors (54).
Prior works from formula student teams related to computer vision tasks have explored solutions to monocular/stereo depth estimation and critical system challenges
that arise in real world systems. This paper builds on
this prior research by compiling the most effective findings
from individual studies and presenting a camera pipeline
system that delivers the best perception system to Formula Electric Belgium.
In the second step about Car Localization and Landmarks
Mapping, SLAM (Simultaneous Localization and Mapping)
(21) is the most popular approach for an autonomous
racecar application. SLAM is a technique used by robots
and autonomous systems to use sensor data to create a
map of their environment and to simultaneously determine
their own location within that map. There are many different approaches to SLAM, and the specific algorithm used
depends on the type and configuration of the sensors, the
characteristics of the environment, and the requirements
of the system. In the case of Formula Electric Driverless,
the GTSAM library (16) was used to implement SLAM due
to its robustness and efficiency.
1.4 Outline
The proposed work introduces a pipelined approach for a
camera detection system that can detect cones in images
and recover their 3D position using an object detection
model, a keypoint regression model, and the Perspectiven-Point algorithm. The paper is structured in a sequential
manner, starting with defining the requirements and discussing related work and foundations of the system, then
the design concepts and procedure, followed by implementation and evaluation of the pipeline’s performance.
In the last step, Car Control, the aim is to simultaneously keep the car within the track limits and maximize
the speed. The Car Control algorithm typically uses advanced control techniques, such as model predictive control, reinforcement learning, or neural networks, to optimize the car’s movements and maximize its speed and
performance while ensuring safety and stability. The algorithm continuously updates the car’s control outputs based
on the sensor data and the car’s dynamics to achieve the
best lap times and race results.
The paper outlines the datasets used for training the
models, the crucial libraries and frameworks employed
in computer vision, and the implementation details of the
pipelined approach. The visual perception pipeline is defined, explaining the implementation of the models and
their training schemes. The individual modules of the
pipeline are evaluated, including the performance of the
models used and the pipeline’s overall performance. The
performances are evaluated against the requirements or
targets set.
As the broader system is built around sensor fusion and hence SLAM optimization, this work focuses on Track Landmarks Detection by implementing a camera-based perception stack whose output data about the location and types of cones in the surroundings of the car can be fused with LiDAR data and forwarded to the SLAM algorithm. For this, research was
done into computer vision and architectures that would be
suited for this application i.e., real time inference and high
accuracy. Of critical importance in autonomous driving is
the latency of the hardware and software stack since the
visual perception system in an autonomous vehicle often
dominates the latency of the entire autonomy stack (50).
Furthermore, the paper analyzes each module of the
pipeline for improvement potential, and new pipeline structures are considered to overcome any possible shortcomings encountered. The discussion section provides a general evaluation of the pipeline with respect to the defined
requirements, and the future work section provides ideas
for future teams to improve upon the proposed pipeline.
The conclusion restates the paper’s findings and contributions to the Formula Electric Team, with a final word of
acknowledgment for all members involved in the creation
of this paper.
Most of the models and theory research was based on
work by different teams within the formula student competition community. These models and designs are discussed in the following section.
Figure 2.1: Standard track layout, (31)
Figure 2.2: Cases that define minimum look-ahead distance
(case 1) and FOV (case 2) requirements, (50)
The paper provides two contributions: firstly, an open design of a low-latency stack for autonomous racing, and secondly, a comprehensive description of the solutions to common bottlenecks in deploying state-of-the-art computer vision algorithms. The proposed method is not limited to the context of the Formula Student Driverless competition but can also be adapted to various visual perception systems used for other autonomous platforms.
2 REQUIREMENTS ANALYSIS

The proposed perception system aims to precisely identify and locate the racetrack's environmental landmarks, namely traffic cones, in compliance with the regulations outlined by the Formula Student Germany competition (23).

As illustrated in Figure 2.1, the racetrack's left and right boundaries are demarcated by blue and yellow cones, respectively, with orange cones marking the start and finish points. The proposed vision stack aims to detect cones accurately by providing their color (blue, yellow, or orange) and position in 3D space (as coordinates), while avoiding false positives and ensuring computational efficiency. The pipeline designed should be modular, easy to debug, and allow for interchangeable sub-modules.

To ensure the visual perception system does not become a bottleneck in the overall vehicle performance, four system requirements have been defined. These requirements have been adapted from the MIT/Delft team (50) and can be generalized to aid in the design of other visual perception systems. The four system requirements are:

1. Latency: refers to the total time it takes for a landmark to be localized from the moment it is captured by the imaging sensor (camera). This puts an efficiency requirement on the system. For safety reasons, as discussed in (50), the latency should not exceed 200 ms.

2. Mapping Accuracy: outlines the maximum acceptable error for landmark localization. It puts an accuracy requirement on the system. The maximum error was derived from the SLAM mapper's characteristics to be around 0.5 m as per (50).

3. Horizontal Field-of-View (FOV): refers to the arc of visibility and puts a requirement on the system's coverage area. The horizontal FOV is lower bounded by hairpin U-turns, for which the competition rules dictate the minimum radius (4.5 m outside radius) (8). The system must perceive landmarks on the inside of a U-turn, as in Figure 2.2, in order to plan an optimal trajectory. Case 2 sets a minimum FOV of 101°, as put forth by (50), which was calculated under the same regulations.

4. Look-ahead Distance: refers to the maximum straight-line distance over which accuracy is maintained. Considering that this vision system complements an existing LiDAR system, FEB arbitrarily defined this at 10 m.

By adhering to these requirements, the perception system can operate efficiently, accurately, and with adequate coverage to provide reliable information for real-time decision-making, and can thus contribute to the overall performance of the autonomous vehicle.

3 RELATED WORK AND CAMERA FOUNDATIONS

3.1 Datasets

There are two open-source datasets that are used to train the required models. The first one is the MIT/Delft "RektNet Dataset" (20), and the second one is the Formula Student Objects in Context (FSOCO) dataset (7), a community project developed by the Formula Student community to enable all teams to work with as much data as possible. A comparison of some key features of the two datasets is shown in Table 3.1.
Dataset      Annotated Images   Classes   Keypoints defined   Teams contributed
MIT/Delft    8000               1         Yes                 2
FSOCO        11572              5         No                  18

Table 3.1: Datasets as of February 2023

The FSOCO dataset is a large and diverse collection of images containing cones resulting from the contributions
of numerous formula student teams, each utilizing different sensor setups and lighting conditions. This variety of
sources makes the dataset an excellent choice for finetuning an object detection model such as YOLOv7 to detect
cones, as it provides a wide range of examples for the
model to learn from. The dataset contains annotations for
five distinct classes, including blue cones (right side of the
track), yellow cones (left side of the track), small and large
orange cones (start/finish of the track), and other cones
(that do not comply with the rules).
The open-source MIT/Delft dataset was used for training the custom keypoint regression model, RektNet. However, it only contains data from these two teams and their specific setups; therefore, it is not as robust as the FSOCO dataset.

Another dataset of relevance is the COCO (Common Objects in Context) dataset (32), a large-scale image recognition, segmentation, and captioning dataset containing more than 330,000 images with more than 2.5 million object instances labeled with 80 different object categories, such as person, car, bicycle, etc. Each image in the dataset is accompanied by multiple annotations, including object bounding boxes. This dataset is often used to train large object detection models such as YOLOv7, which are later finetuned to a specific use case, such as detecting cones.

3.2 Cone Object Detection

Object detection is used in computer vision to identify and track objects of interest; in this case the objects of interest are the cones that describe the track. Cone object detection aims to locate the bounding boxes of cone objects and accurately classify them into different categories, such as blue, yellow, large orange, small orange, and unknown cone.

Recent years have seen significant advancements in object detection, thanks to the development of deep learning architectures based on convolutional neural networks (CNNs). Several architectures have been proposed, such as Faster R-CNN (47), SSD (Single Shot Detector) (33), and YOLO (You Only Look Once) (52), which have shown promising results in solving object detection problems. To evaluate the accuracy of these models, the COCO dataset (32) is a popular benchmark. For application in an autonomous race car, the real-time performance of such models should obviously also be considered.

Figure 3.1: YOLOv7 performance on MS COCO dataset compared to other models, (52)

3.2.1 Evaluation Metrics for Object Detection

To understand the evaluation of the object detection model used, it is important to understand the most common metrics involved, namely Precision, Recall, and mean Average Precision (mAP).

Precision looks at what proportion of identifications is actually correct, whereas Recall calculates what proportion of the actual objects was identified correctly. These are defined as:

Precision = True Positives / (True Positives + False Positives)

Recall = True Positives / (True Positives + False Negatives)

Precision and Recall are related in that increasing one metric often leads to a decrease in the other. A model with high precision will have a low false positive rate, but may miss many true positives, leading to a low recall. A model with high recall will identify most of the true positives, but may also have a high false positive rate, leading to a low precision.

mAP (29) is another common metric used in object detection. mAP is a measure of the average precision (AP), relating to the quality of predictions across all object categories, and is calculated as the area under the Precision-Recall curve. Generally, mAP values are coupled with a threshold. For example, mAP@.5 means that a prediction is considered correct if the intersection-over-union (IoU) between the predicted bounding box and the ground truth bounding box is greater than or equal to 0.5. IoU (48) is a measure of how much two bounding boxes overlap, and a higher value indicates a better overlap.
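To make these metrics concrete, the short Python sketch below computes Precision, Recall, and the IoU of two axis-aligned boxes; it is an illustration of the definitions above, not code from the thesis pipeline.

def precision_recall(tp, fp, fn):
    # Precision and Recall as defined above, from raw detection counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def iou(box_a, box_b):
    # Intersection-over-Union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct for mAP@.5 when IoU >= 0.5:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # prints 0.333...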
3.2.2 YOLO

One of the most widely adopted object detection systems is YOLO, whose primary advantage lies in its ability to achieve high detection speed while maintaining accuracy. Since its first version in 2015 (46), YOLO has undergone numerous iterations to improve its performance. The latest version, YOLOv7, has demonstrated superior performance compared to other existing object detectors in terms of both speed and accuracy (52), as seen in Figure 3.1, making it an ideal candidate for solving the cone object detection problem in an autonomous race car.

YOLO object detection models belong to the category of single-stage object detectors. Unlike two-stage detectors, which first generate region proposals and then refine them, YOLO predicts bounding boxes in a single inference step. The YOLO framework consists of three subcomponents: Backbone, Neck, and Head, shown in Figure 3.2.

Figure 3.2: Architecture of YOLOv4, (44)

The Backbone network is responsible for extracting high-level features from the input image frames. Typically, it consists of a series of convolutional layers that learn to detect edges, textures, and other low-level features. The output of the Backbone network is a feature map, which is a representation of the input image at a lower resolution. This lower-resolution representation is used to speed up the detection process and reduce the computational burden.

The Neck further processes this output by using a series of convolutional layers with different kernel sizes and strides to capture features at different scales. By doing so, it obtains a unified feature representation that will be used by the Head to generate predictions of the objects' locations and classes. The output of the Neck network is a set of feature maps with different spatial resolutions, which are used to capture objects of different sizes and shapes.

The Head, also known as Dense Prediction, is the third and final component of the YOLO framework. Its main function is to generate predictions of the objects' locations and classes by processing the unified feature representation obtained from the Neck network and applying non-maximum suppression to remove duplicate detections, making the detections more accurate. The Head consists of a set of convolutional layers that predict the coordinates of the bounding boxes and the probabilities of the objects' classes.

To improve the accuracy of object detection, YOLO uses anchor boxes. Anchor boxes are pre-defined bounding boxes of different sizes and aspect ratios that are used to predict the final bounding boxes. These boxes are defined to capture the scale and aspect ratio of specific object classes and are typically chosen based on the object sizes in the training datasets. During detection, the predefined anchor boxes are tiled across the image. The YOLO network predicts the probability and other attributes, such as background, intersection over union (IoU), and offsets for every tiled anchor box. When the network predicts the final bounding boxes for each object, it uses the anchor boxes as a reference point: it predicts an offset, width, and height for each anchor box and then applies those values to the anchor box to obtain the final bounding box coordinates.

3.3 Keypoint Detection

A keypoint detector makes use of features it can detect in an image, which can be of three kinds: flat regions, edges, and corners. By far the most interesting features are the corners, which have gradients (changes in image brightness) in two directions, thus enabling accurate localization of a single point, namely the corner. Commonly used feature extraction models include the Harris corner detector (25) and the robust SIFT (34) and SURF (11) feature extractors and descriptors. The problem with using pre-existing feature extraction techniques like these is that they are designed to detect any type of feature that meets the criteria for being considered a feature point. This lack of specificity means that techniques like Harris corners cannot differentiate between feature points that are located on a cone and those located on the road, for example.

3.3.1 RektNet

In 2020, in the context of the competition, the MIT/Delft team developed a keypoint detector based on residual neural networks (26), called RektNet, implemented in the PyTorch framework and available in their public GitHub repository (19). To improve accuracy, RektNet contains two important network modifications.

First, the fully connected output layer was replaced with a convolutional layer. Convolutional layers outperform fully connected layers at predicting features that are spatially interrelated (geometrically dependent), as is the case for the keypoints of the cone, while also improving training convergence by having fewer parameters. The expected value over the heatmap is then used as the keypoint location.

The second modification is an additional term in the loss function to leverage the geometric interrelations of keypoints. The total loss Ltotal is given as:

Ltotal = Lloc + Lgeo

from (50), where Lloc is the location loss and Lgeo is the geometric loss.

The location loss is the mean squared error (MSE) loss between the predicted and target keypoint locations and is given by:

$$ L_{loc} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} \left( p_{ij} - tp_{ij} \right)^2 $$

where N is the number of training examples, K is the number of keypoints, p_ij is the jth coordinate of the ith predicted keypoint, and tp_ij is the corresponding jth coordinate of the ith true keypoint.

The introduced geometric loss term penalizes deviations from the expected geometric relationships between keypoints. Figure 3.3 on the right demonstrates the collinearity of the keypoints; the unit vectors between collinear keypoints must thus have a unit dot product. The geometric loss is hence given by:

$$ L_{geo} = \gamma_{horz} \left( 2 - V_{12} \cdot V_{34} - V_{34} \cdot V_{56} \right) + \gamma_{vert} \left( 4 - V_{01} \cdot V_{13} - V_{13} \cdot V_{35} - V_{02} \cdot V_{24} - V_{24} \cdot V_{46} \right) $$

from (50), where γhorz and γvert are hyperparameters that determine the relative weight of the cone geometry term and V_ij represent the vectors shown in Figure 3.3 on the right.

Figure 3.3: Left: RektNet architecture; Right: Geometric interrelations of keypoints; from (50)

The architecture of RektNet is designed to detect seven characteristic keypoints within an 80 x 80 x 3 sub-image patch (bounding box). To maintain the input dimensions throughout the network, the structure is composed of ResNet blocks (27). The input is a tensor of size [height, width, number of keypoints, batch size], with the height and width fixed at 80 pixels, the number of keypoints set to 7, and the batch size set to 8, which refers to the number of data samples processed simultaneously during the forward pass. At the output, the network generates one heatmap per keypoint, where each coordinate on the heatmap corresponds to the coordinate of the keypoint in the image patch.

RektNet starts with a convolution layer followed by batch normalization, which helps normalize the activations and makes the network easier to train. The batch normalization output then goes through a rectified linear unit (ReLU, A.1) as the non-linear activation. Following this, the next four blocks are basic residual blocks, with each block having an increasing number of channels C ∈ {16, 32, 64, 128}. Figure 3.3 on the left depicts these blocks.
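As an illustration of the loss terms above, the following PyTorch sketch combines the MSE location term with the two geometric terms. It is a simplified reconstruction, not the MIT/Delft implementation; the tensor layout [batch, keypoint, x/y], the keypoint indexing, and the γ values are assumptions.

import torch

def rektnet_loss(pred_kpts, true_kpts, gamma_horz=0.05, gamma_vert=0.05):
    # pred_kpts, true_kpts: [batch, 7, 2] keypoint coordinates (assumed layout).
    loc = torch.mean((pred_kpts - true_kpts) ** 2)  # L_loc: MSE over all coordinates

    def unit(a, b):
        # Unit vector V_ab from keypoint a to keypoint b, per batch element.
        v = pred_kpts[:, b] - pred_kpts[:, a]
        return v / (v.norm(dim=1, keepdim=True) + 1e-8)

    # Horizontal term: the left-right keypoint pairs at each height should be parallel.
    horz = 2 - (unit(1, 2) * unit(3, 4)).sum(1) - (unit(3, 4) * unit(5, 6)).sum(1)
    # Vertical term: keypoints along each slanted edge of the cone should be collinear.
    vert = (4 - (unit(0, 1) * unit(1, 3)).sum(1) - (unit(1, 3) * unit(3, 5)).sum(1)
              - (unit(0, 2) * unit(2, 4)).sum(1) - (unit(2, 4) * unit(4, 6)).sum(1))
    geo = gamma_horz * horz.mean() + gamma_vert * vert.mean()
    return loc + geo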
Figure 3.4: Camera Pinhole model, (28)
3.4 Perspective-n-Point
The Perspective-n-Point (PnP) is an algorithm in the field
of computer vision and robotics that seeks to determine
the position and orientation of a calibrated camera in a
3D environment using a set of 2D image points and their
corresponding 3D world points. Over the years, the PnP
problem has been extensively studied, leading to the development of numerous algorithms aimed at solving it.
Before explaining the calculation behind PnP, it is important to understand the camera parameters involved, namely the intrinsic matrix K and the extrinsic matrix [R t], as well as the pinhole model of a camera.
3.4.1 Intrinsic Matrix

The internal camera parameters describe the intrinsic characteristics of a camera and are used to model its geometry and optics. Represented by K, the intrinsic matrix contains five intrinsic parameters of the pinhole model of the camera (Figure 3.4). The matrix K is defined as:

$$ K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} $$

from (28), where (cx, cy) is the optical center (principal point), in pixels; s is the skew coefficient, which is non-zero if the image axes are not perpendicular; and (fx, fy) is the focal length in pixels, given by fx = F/px and fy = F/py, where F is the focal length in world units and (px, py) is the size of a pixel in world units. The parameters cx, cy, fx, fy can be better understood by looking at Figure 3.4.

3.4.2 Extrinsic Matrix

The extrinsic parameters are captured in the extrinsic matrix [R t]. The rotation matrix R transforms points from the world coordinate system to the camera coordinate system, while the translation vector t represents the position of the camera's origin. The camera's coordinate system has its origin at the optical center, and its x- and y-axes define the image plane.

3.4.3 Pinhole Model

The pinhole camera model (28) is a mathematical representation of the way cameras capture images. In this model, shown in Figure 3.4, a 3D point in the world is projected onto the image plane by drawing a straight line from the point through the pinhole to the image plane. The intersection of the line and the image plane determines the 2D location of the point's image in the camera. Since the image plane is flat, the resulting image is a 2D representation of the 3D world.

Figure 3.5: 3D object model of cone, (18)

3.4.4 PnP calculation

To arrive at a unique solution to the pose estimation problem, at least three pairs of corresponding points are necessary. The camera pose, which consists of six degrees-of-freedom, namely three-dimensional rota-
tion (roll, pitch, and yaw) and translation (x, y, z), must
satisfy six constraints. Each 2D-3D point correspondence provides two constraints, one per image coordinate. Hence, three pairs of points are sufficient to solve the six degrees of freedom of the camera pose.
Once the 3D points are defined as in Figure 3.5 with respect to the world frame, the pose computation problem (35) consists of solving for the rotation R and translation t that minimize the reprojection error of the 3D-2D point correspondences, given a set of defined object points with coordinates p_w with respect to the world frame, their corresponding image projections p_f on the frame, and the camera's intrinsic matrix K. This translates to the equation below from (28), which is solved for [R t]:
Figure 3.6: Field of View, (10)

$$ p_f = K \, [R\ t] \, p_w $$

Alternatively, the equation can be expressed in its extended form as follows:

$$ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} & R_{13} & t_1 \\ R_{21} & R_{22} & R_{23} & t_2 \\ R_{31} & R_{32} & R_{33} & t_3 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} $$
This way, one is able to calculate the rotation vector R and
translation vector t from the camera to the cone, giving the
position of the cone in the environment relative to the car.
Since the geometry of the cone consists of a revolution of
a triangle around a vertical axis, the rotation matrix R is
not useful. The only variable of interest in this application
is the translation vector t .
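A minimal Python/OpenCV sketch of this step is given below. The intrinsic matrix, the seven model points, and the pixel coordinates are placeholder values, not the calibrated parameters or the exact cone model of Figure 3.5; only the translation is kept, as discussed above.

import cv2
import numpy as np

# Camera intrinsics from calibration (placeholder values).
K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 512.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)  # assume negligible lens distortion

# Seven 3D points of the cone model in the cone base frame, in metres
# (illustrative values; the real model is derived from the cone height and radius).
object_points = np.array([
    [0.00, 0.00, 0.325],                       # apex
    [-0.04, 0.00, 0.217], [0.04, 0.00, 0.217], # upper pair
    [-0.06, 0.00, 0.108], [0.06, 0.00, 0.108], # middle pair
    [-0.08, 0.00, 0.000], [0.08, 0.00, 0.000], # base pair
])

# Corresponding 2D keypoints in full-frame pixel coordinates (placeholders).
image_points = np.array([[642, 300], [620, 340], [664, 340],
                         [608, 382], [676, 382], [596, 425], [688, 425]],
                        dtype=np.float64)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
if ok:
    # Only the translation is of interest: x (side), y (height), z (distance).
    print("cone position relative to camera [m]:", tvec.ravel())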
3.5 Field of View
The field of view (FOV) refers to the angular extent of the
scene that is captured by the camera as shown in Figure
3.6, typically measured in degrees. The FOV represents
the solid angle subtended by the camera lens at the object
being photographed or filmed. The FOV is determined by
various factors, including the focal length of the camera
lens, the size of the sensor, and the distance between the
camera and the object.
A wider FOV is generally achieved with a shorter focal length and a larger sensor, while a narrower FOV is achieved with a longer focal length and a smaller sensor. Additionally, the distance between the camera and the object being photographed or filmed affects the FOV: objects that are farther away appear smaller and therefore cover less of the total FOV.

OpenCV's calibration function (42) calculates the horizontal and vertical FOV as:

$$ \mathrm{FOV}_x = 2 \arctan\left(\frac{w}{2 f_x}\right), \qquad \mathrm{FOV}_y = 2 \arctan\left(\frac{h}{2 f_y}\right) $$

where h and w are respectively the height and width of the image the camera captures, and fx and fy are intrinsic parameters of the camera as discussed in 3.4.1.

3.6 Calibration

Camera calibration is the process of determining the intrinsic and extrinsic parameters of a camera.

Figure 3.7: Calibration interface by camera calibration node

A common technique for calibrating a camera is to use a checkerboard pattern, as in Figure 3.7, because it provides an easily detectable target in the image. The process of camera calibration involves taking several pictures of the checkerboard from different orientations and positions while moving the checkerboard around in the camera's field of view. The resulting images are then processed to find the corners of the checkerboard squares, and the positions of the corners in the image are used to compute the intrinsic and extrinsic parameters of the camera.

There are ready-to-use software options in ROS to calibrate cameras (40), which is what is used in this thesis.
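The sketch below illustrates the same procedure with OpenCV's Python API and then derives the FOV from the calibrated intrinsics using the formulas above; the checkerboard dimensions, square size, and image folder are assumptions, and in practice the ROS camera calibration package (40) performs these steps.

import cv2
import glob
import math
import numpy as np

pattern = (9, 6)       # inner corners of the assumed checkerboard
square = 0.025         # square size in metres (assumption)

# 3D corner positions of the flat checkerboard in its own frame.
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # calibration images (assumed folder)
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

h, w = gray.shape
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, (w, h), None, None)

# Field of view from the calibrated intrinsics, as in the formulas above.
fov_x = 2 * math.degrees(math.atan(w / (2 * K[0, 0])))
fov_y = 2 * math.degrees(math.atan(h / (2 * K[1, 1])))
print(f"RMS reprojection error: {rms:.3f} px, FOV: {fov_x:.1f} x {fov_y:.1f} deg")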
Figure 3.8: Left: Stereo camera used; Right: Test setup on RC, camera visible in top right (28)
4 HARDWARE

4.1 Driverless PC
All software is executed on the Driverless PC mounted
behind the cockpit visible in Figure 1.1. This computer
is equipped with an AMD Ryzen 5 3600 6-core processor, 16 GB of RAM, 240.1 GB of storage, an NVIDIA TU116 (GeForce GTX 1660 SUPER) GPU, and
runs on Ubuntu 20.04 LTS 64-bit operating system. These
specifications guarantee the computer is suitable to meet
the demands of the software application and enable realtime performance of the vision pipeline system.
4.2 Camera
Monocular cameras and stereo cameras are both widely
used in autonomous driving, each with different strengths
and limitations. Monocular cameras are simple and inexpensive, relying on other sensors and/or algorithms to determine the distance and position of objects in the scene.
However, they can be less accurate and reliable than
stereo cameras, particularly in complex or cluttered environments. Stereo cameras use multiple lenses to capture
images from different perspectives (typically two), creating a 3D model of the scene that allows for more accurate
detection and tracking of objects.
Despite their advantages, stereo cameras present their
own set of challenges in the context of this research, such
as synchronization, stereo calibration, and extra processing load on the hardware. The estimation of 3D poses from
a single image captured by a monocular camera presents
an ill-conditioned problem due to the inherent limitations
in information resulting from the projection of 2D space
into 3D space. This problem can be overcome by using
prior information about landmarks in the scene which in
the case of this project are traffic cones with known shape
and size which can thus be modeled as in Figure 3.5.
In our pipeline, we worked with a stereo camera provided
by Formula Electric Belgium shown in Figure 3.8 on the left
but using only one camera output, effectively working with
a monocular setup. By doing so, we successfully circumvented any potential stereo synchronization issues. The
output of the camera was sent to the Driverless PC via a
high-speed USB-3 connection.
4.2.1 Camera Characteristics
The camera utilized is the VEN-134-90U3C-D from Daheng Imaging (13). It offers a resolution of 1280x1024, a
pixel size of 4.8 micrometers, a global shutter color CMOS
sensor, and an adjustable frame rate capped at 90FPS.
The camera also has a 12.7mm optical sensor size and
provides a 60° horizontal and 72° vertical field of view.
As stated in requirements Section 2, the minimum FOV
required is 101° which means that in extreme situations
such as hairpin U-turns, the camera should be complemented with LiDAR. Alternatively, another camera with a
wider field of view but at least the same resolution should
be used to address these scenarios and achieve the same
performance or better.
4.2.2 Camera Positioning
In the test setup, both cameras are mounted on a steel rod
for stability and easy adjustment as seen in Figure 3.8 on
the right. On the actual race car, as in Figure 1.1, the camera would be positioned above the driver's seat, similarly to how it is done by other teams with similar setups (50) (31). This position offers the advantage that the visual overlap between cones is reduced to a minimum, meaning that even cones placed behind one another (in the same line of sight) can be perceived sufficiently well.
5 SOFTWARE
The software design approach adopted for this project was
geared towards ensuring real-time reliability and the ability to interpret data and sub-module outputs. To achieve
this, the system was broken down into distinct parts, with
each sub-module undergoing debugging and testing. This
approach indirectly leads to the optimization of the entire pipeline. The project was segmented into two primary
tasks, each implemented as a separate software package.
The first task is image acquisition, for which the package
contains code for the camera driver. The second task is
image processing, which involves retrieving cone positions
from a frame, and the package contains code related to
processing the images through neural networks and 3D
coordinate estimation logic.
5.1 Image Acquisition Package
This package contains the driver code for the camera
VEN-134-90U3C-D from Daheng Imaging (13). Its primary function is to serve as a bridge between the camera
hardware and the software system, thereby providing an
interface that standardizes and documents how to access
data from the imaging sensor. One of the advantages of
this is that it allows users to decouple the camera from processing. This feature allows tuning of the camera parameters to meet individual needs or to use a different camera
altogether.
The package delivers frames at an adjustable rate (FPS), which are captured by the image processing package of Section 5.2. An
important factor to consider when developing this package
was the resolution of the images since there is a fundamental trade-off between using high- and low-resolution
images. While high-resolution images are useful for accurate keypoint detection, which improves the performance
of the PnP algorithm, they also add extra latency and overhead when transporting and storing them.
5.2 Image Processing Package
This package encompasses all the image processing
steps necessary to produce a position estimate relative
to the camera. It takes in the frames captured from the
camera, which are sent from the image acquisition package of Section 5.1, as input, and outputs the translation vectors from the camera to the landmarks. The image processing
package is composed of three distinct parts: Cone detection, Keypoint detection, and 3D pose estimation, each
explained in the following sections.
5.2.1 Cone Detection
As noted in the related work Section 3.2, the YOLOv7 object detection algorithm is currently the most accurate and
efficient model for object detection. Therefore, it was selected as the algorithm of choice.
The input images are 640x640 BGR color images
(OpenCV works with Blue-Green-Red color format), and
the outputs generated are bounding boxes that indicate the
location of detected cones within the frame.
From the FSOCO dataset described in Section 3.1, 8000
random images were used for training and 1000 random
images were used for test data. YOLOv7 and YOLOv7-tiny
were trained with default parameters.
The training script runs a genetic algorithm to find the best
anchors for the images, then saves them to the model
and starts training. The models are finetuned rather than
trained from scratch to be more efficient. This means
only the last layers of the model were fine-tuned with the
FSOCO dataset, resulting in 100 total epochs being run
on the 8000 training images.
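A small sketch of how such a random split can be produced from the FSOCO image folder is shown below; the paths and folder layout are assumptions, not the team's actual scripts.

import random
from pathlib import Path

random.seed(0)
images = sorted(Path("fsoco/images").glob("*.jpg"))  # assumed dataset location
random.shuffle(images)

train, test = images[:8000], images[8000:9000]       # 8000 training / 1000 test images
for name, subset in [("train.txt", train), ("test.txt", test)]:
    Path(name).write_text("\n".join(str(p) for p in subset))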
5.2.2 Keypoint detection
Including prior information about the geometry of the landmark can significantly improve the accuracy of the landmark localization process. As discussed in the related
work Section 3.3.1, the MIT/Delft team has addressed this
by developing RektNet, a powerful tool for detecting keypoints on the bounding boxes of the landmarks.
The input is a normalized 80x80 image of a cone. This
image was cropped according to the bounding box from
the frame captured by the camera. After going through
the RektNet, the output is a vector of keypoint coordinates
for that cone image.
The MIT/Delft team provide the training scheme documentation of RektNet in their paper (50) and a hands-on tutorial through their GitHub repository (19). RektNet is trained as in the tutorial using around 3.2k annotated images from (20). The dataset was split 85/15 between training and validation. Before training, the training parameters are defined as in Table 5.1.
Note that L2 Softargmax is good at penalizing large errors.
Training keeps track of the best epoch and will stop before
running all epochs if the loss does not decrease further.
Number of keypoints     7
Learning rate           0.1
Rate of decay           0.999
Number of epochs        1024
Use geometric loss      Yes
Loss function           L2 Softargmax
Batch size              8
Maximum tolerance       8

Table 5.1: RektNet training parameters

5.2.3 Pose estimation

As outlined in Section 3.4, the PnP algorithm needs at least three points to obtain a unique solution. In the context of this thesis there are seven points that depict the cone, and there is prior knowledge of the cone's geometry (Figure 3.3). This allowed the creation of a 3D model of the cone as in Figure 3.5 and thus enables the PnP algorithm to retrieve the cone's 3D position in relation to the camera, as explained in Section 3.4. As our focus is solely on the position of the cone, given by the translation vector t, we may disregard the rotation vector R.

The inputs are the keypoints detected for a cone and the corresponding 3D model of the cone, which differs depending on whether the cone is large or small. The outputs are, as discussed in Section 3.4, the rotation vector R and the translation vector t.

6 IMPLEMENTATION

6.1 Frameworks

6.1.1 OpenCV

OpenCV (Open-Source Computer Vision) is a popular open-source computer vision and machine learning software library (3). The library contains more than 2,500 algorithms focused around computer vision and machine learning. In this project, version 4.x was used to:

• Define the images/frames the camera sensor provides and store them as OpenCV matrices.

• Manipulate images before, e.g., forwarding them through the YOLO or RektNet network. This includes preprocessing steps such as resizing, changing the color model, or normalizing.

• Perform non-maximum suppression, i.e., filtering out overlapping boxes, using the NMSBoxes function from the built-in OpenCV dnn library (2).

• Return the rotational and translational vectors with the built-in solvePnP method (41).

6.1.2 LibTorch

LibTorch (1) is a C++ library that provides a wrapper to load, manipulate and run PyTorch models in C++ code. It is built on top of the PyTorch C++ API and allows developers to use PyTorch's capabilities in C++ projects.

PyTorch (4) is a popular open-source machine learning library based on Torch, a scientific computing framework. PyTorch provides a Python API for building and training machine learning models. However, there are some use cases where Python might not be the best fit; for example, in embedded systems or real-time applications it might be necessary to use a low-level programming language such as C++. This is where LibTorch comes in handy.

In this project, LibTorch is used for several purposes, namely:

• Loading the YOLO and RektNet models in TorchScript format using the load method from the JIT library (PyTorch).

• Preprocessing the model inputs from OpenCV matrices to Torch input tensors and applying some preprocessing transformations on these tensors.

For this project, LibTorch was installed specifically for Linux and the CUDA 11.7 computing platform, which is optimized for GPU use.

6.1.3 ROS

ROS (Robot Operating System) (5) is a set of software libraries and tools that provide a flexible, extensible and well-documented framework for building robot applications. The system is composed of a set of nodes, each with their own function, that communicate with each other through topics and thus inform each other of the state of the system as a whole. It provides a number of useful features and tools, of which the most relevant to our project include:

• Initializing ROS nodes and NodeHandles, which enable communication between nodes.

• A publish-subscribe communication model, which allows different parts of a robot system to send and receive messages to coordinate their actions. In the case of this project, the image publisher node decodes the information from the camera and sends the frames as messages to the topic named /left/image_raw as the "publisher". The inference node, as the "subscriber" to that topic, then receives the message and thus the frame with which it may infer the locations and types of cones present.

Besides these contributions to our project, ROS possesses a large library of pre-built software components (called "packages") that can be used to complement the system. These packages include libraries to calibrate the camera, send messages between nodes, or test the system.

ROS is also equipped with a suite of tools for building, debugging, and deploying robot applications, including RVIZ, a 3D visualization tool to display and interact with various sensor data, robot models, and other 3D information in real time, which will be useful when testing the car as a whole with the vision stack integrated.
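As a minimal illustration of this publish-subscribe pattern, a Python subscriber to the /left/image_raw topic could look as follows (the actual inference node is written in C++; node and callback names here are placeholders).

import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

bridge = CvBridge()

def on_frame(msg):
    # Convert the ROS image message back to an OpenCV matrix for inference.
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    rospy.loginfo("received frame %dx%d", frame.shape[1], frame.shape[0])

rospy.init_node("inference_node")
rospy.Subscriber("/left/image_raw", Image, on_frame)
rospy.spin()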
6.2 Camera Pipeline
Although it might be justified to attempt an end-to-end image to pose estimation or even image to throttle and steering command relation, this work employs a part-based approach. This allows for easy debugging and extensive testing, indirectly helping to optimize each sub-module until a
certain level of performance is reached by the pipeline as
a whole. Besides this, a part-based approach allows for
easy replacement of sub-modules.
This separation is visible in, for example, the inference package, where the sub-modules (YOLO, RektNet and PnP) are implemented as separate classes whose use can be easily controlled.
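The kind of modular composition described here can be summarized by the following Python sketch; the class and method names are illustrative and do not correspond to the actual inference package API.

class ConeDetector:
    """Wraps the YOLOv7 model."""
    def detect(self, frame):
        # Return a list of (bounding_box, class_id, confidence).
        raise NotImplementedError

class KeypointRegressor:
    """Wraps RektNet."""
    def keypoints(self, cone_patch):
        # Return the seven 2D keypoints of a cropped cone image.
        raise NotImplementedError

class PnPHandler:
    """Wraps the PnP solver and the 3D cone models."""
    def locate(self, keypoints, large_cone=False):
        # Return the translation vector from the camera to the cone.
        raise NotImplementedError

class Pipeline:
    """Each stage is interchangeable as long as it keeps the same interface."""
    def __init__(self, detector, regressor, pnp):
        self.detector, self.regressor, self.pnp = detector, regressor, pnp

    def process(self, frame):
        cones = []
        for (x1, y1, x2, y2), class_id, conf in self.detector.detect(frame):
            patch = frame[y1:y2, x1:x2]            # crop the bounding box
            kpts = self.regressor.keypoints(patch)
            cones.append((class_id, self.pnp.locate(kpts)))
        return cones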
6.3 Implementation Notes

6.3.1 Cone 3D modeling

As explained in Section 3.4, Perspective-n-Point works by inferring a rotation vector R and translation vector t to a landmark by using prior information about the landmark. In this project, this prior information is a 3D model of the cone defined by seven 3D points with respect to the base of the cone, the world frame, as shown in Figure 3.5. These points are located where keypoints are best detected by RektNet, meaning at every 1/3 of the height of a cone. The 3DHandler class possesses two distinct 3D object models, one for small cones and one for large cones, so that the PnP estimation (which looks at the shape of the landmark) can be adapted and give accurate coordinate results for both types of cones. These objects are represented by vectors of 3D points, whose coordinates are determined dynamically by the height and radius of the cone.

6.3.2 Cone Struct

From the moment a cone is detected by YOLO, it is treated as an entity defined by a struct, which is accessed in other sections of the code by reference (instead of by value) for efficiency.

Cone Struct member data:

• Bounding Box: an OpenCV Rect object representing the bounding box.

• Confidence: the confidence associated with the bounding box.

• Class ID: the type of cone (Yellow, Blue, Small Orange, Large Orange or Unknown).

• Keypoints Vector: a vector of keypoints, each represented by a pair of integers for coordinates.

• Translation Vector: a vector representing the translation from the base of the cone to the camera.

• Validity: a Boolean variable indicating the validity of the cone.

The validity of the cone is checked at every stage of the pipeline because, on rare occasions, a fallen cone is detected and/or the position estimate is unreasonable. These invalid cones are filtered out as a final step by the PnP handler if the coordinates returned are beyond certain arbitrarily defined thresholds. These thresholds are ±5 m for x (side to side), ±1.5 m for y (up/down), and less than 2 m or more than 20 m for z (distance).

6.3.3 Pseudocode

For reference, and to provide a better understanding of the image acquisition and inference packages, the pseudocode for the main file of both packages is displayed below with an explanation for each.

Pseudocode for image acquisition

Procedure: Image Acquisition
    init ROS node
    init Camera device
    init image_transport ROS package to publish images
    set loop rate to acquire images from the camera
    while ros::ok()
        acquire image from camera as OpenCV Mat object
        transform OpenCV Mat object to a ROS image sensor message
        publish image to the respective topic

The image acquisition package initializes itself as a ROS node, the camera device, and the transport method to communicate with the topic. Next, it continuously acquires images from the camera as OpenCV Mat objects, transforms them into ROS image messages, and sends them to the topic, which will be used by the image inference package.
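For reference, a runnable Python equivalent of the acquisition procedure above is sketched here; the real driver is written in C++ against the Daheng SDK, so OpenCV's VideoCapture is used as a stand-in camera interface and the loop rate is an assumption.

import cv2
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

rospy.init_node("image_acquisition")
pub = rospy.Publisher("/left/image_raw", Image, queue_size=1)
bridge = CvBridge()
camera = cv2.VideoCapture(0)      # stand-in for the Daheng camera driver
rate = rospy.Rate(30)             # acquisition loop rate (assumed)

while not rospy.is_shutdown():
    ok, frame = camera.read()     # acquire image as an OpenCV Mat
    if ok:
        # Wrap the OpenCV matrix into a ROS sensor_msgs/Image and publish it.
        pub.publish(bridge.cv2_to_imgmsg(frame, encoding="bgr8"))
    rate.sleep()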
Pseudocode for image processing

Procedure: Image Processing
    transform received message to OpenCV Mat object
    forward the image through the YOLOv7 network
    for cone in 8 largest cones:
        crop the image with the cone bounding box
        forward the cone image through RektNet
        apply PnP to the obtained keypoints
    filter out invalid cones
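The Python equivalent of the model loading and box filtering glossed over in the pseudocode is sketched below; the project performs these steps in C++ with LibTorch and OpenCV, and the file names and thresholds here are assumptions.

import cv2
import numpy as np
import torch

# TorchScript exports of the fine-tuned models (file names assumed).
yolo = torch.jit.load("yolov7_cones.torchscript.pt").eval()
rektnet = torch.jit.load("rektnet.torchscript.pt").eval()

def preprocess(frame_bgr, size):
    # BGR OpenCV matrix -> resized, normalized NCHW float tensor.
    resized = cv2.resize(frame_bgr, (size, size))
    tensor = torch.from_numpy(resized).permute(2, 0, 1).float() / 255.0
    return tensor.unsqueeze(0)

def keep_best_boxes(boxes, scores, score_thr=0.25, nms_thr=0.45):
    # boxes as [x, y, w, h]; NMSBoxes removes overlapping detections.
    keep = cv2.dnn.NMSBoxes(boxes, scores, score_thr, nms_thr)
    return [boxes[i] for i in np.array(keep).flatten()]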
The image inference package, after initializing itself as a ROS node and subscribing to the topic containing the images, continuously processes the incoming images. This implies forwarding each image through the YOLO object detection network to get the bounding boxes of the cones. Then, the 8 largest boxes (in theory the closest cones) are forwarded through the RektNet keypoint regression model to locate the keypoints, which are passed to the 3DHandler class, which executes the PnP calculation and outputs the desired translation vector t.

7 EVALUATION

The system is composed of three main parts: object detection, keypoint detection, and 3D pose estimation. During the testing phase, object detection was assessed individually, while keypoint detection and 3D pose estimation were evaluated together. Furthermore, the performance of each part of the system was analyzed in terms of execution time.

To accurately test the system in a realistic setting, data was collected in the form of rosbags (6), a format used in ROS to store and log data as a collection of messages exchanged between various ROS nodes. These messages can include sensor data (e.g., camera images, laser scans), state information, control commands, and any other data exchanged within the ROS ecosystem. For this work, only camera images were stored.

7.1 YOLO cone detection performance

Two YOLOv7 models were finetuned on the same FSOCO dataset (Section 3.1) and compared, namely YOLOv7 and YOLOv7-tiny. YOLOv7-tiny is a smaller model with fewer parameters and therefore generally detects faster and is less resource-intensive, at the cost of accuracy when compared to YOLOv7.

The performance of the two trained YOLOv7 models is compared on a test set of 1000 images from the FSOCO dataset. Results are displayed in Table 7.1. YOLOv7-tiny shows an overall Precision of 79.9%, Recall of 68.4% and mAP@.95 of 0.451, compared to YOLOv7 with an overall Precision of 83.1%, Recall of 73.9% and mAP@.95 of 0.502. Note that these averages are lowered significantly by the class of "Unknown" cones, but in a similar way for both models.

The Precision/Recall curve and the confusion matrix for YOLOv7-tiny and YOLOv7 are shown in Appendix B.

7.2 Keypoint detection and PnP 3D pose estimation performance

Keypoint detection and pose estimation are evaluated for accuracy along the longitudinal (z) and latitudinal (x) coordinates. The remaining y axis corresponds to height, which remains constant on a flat surface, as is the case in the test and competition setup. Test results indicate highly accurate and consistent measurements along the y axis, with negligible variations. Consequently, the y axis is deemed irrelevant for the evaluation.

7.2.1 Longitudinal error (z)

In this context, longitudinal error refers to the error in the system's estimation of the z coordinate in the translation vector, which indicates how far a cone is "in front" of the camera. A cone with 0 cm longitudinal error means that the system has correctly estimated the distance to the cone.

Results for the longitudinal error relative to the ground truth are shown for all cones in Figure 7.3 and by means of a box plot in Figure 7.4, with different colors for each class and where the dotted red line represents the acceptable error range (requirement 2). The box plot of Figure 7.4 reveals that predictions generally tended to be further away than the ground truth values. The errors were strongly grouped by class type; large orange cones were generally placed the furthest off, with the greatest error of almost 40 cm.

7.2.2 Latitudinal error (x)

Similarly, latitudinal error refers to the error in the system's estimation of the x coordinate in the translation vector, which indicates how much a cone is "to the side". A cone with 0 cm latitudinal error means that the system has correctly estimated the cone's x position.

Results for the latitudinal error relative to the ground truth are shown by means of a box plot in Figure 7.5, again with different colors for each class and where the dotted red line represents the acceptable error range (requirement 2). The evaluation revealed that predictions were somewhat skewed depending on cone type.
                    YOLOv7-tiny                        YOLOv7
Class          Precision  Recall  mAP@.95     Precision  Recall  mAP@.95
Blue           0.895      0.755   0.515       0.912      0.8     0.566
Yellow         0.88       0.74    0.5         0.91       0.794   0.556
Small orange   0.867      0.744   0.506       0.901      0.8     0.556
Large orange   0.813      0.772   0.564       0.877      0.819   0.627
Unknown        0.541      0.412   0.172       0.543      0.479   0.205
All            0.799      0.684   0.451       0.831      0.739   0.502

Table 7.1: Test set results for YOLOv7-tiny and YOLOv7
7.2.3 Test setup for longitudinal and latitudinal error
For evaluating the longitudinal error z and the latitudinal error x, physical cones are placed every 50 cm along the longitudinal and latitudinal axes on a straight line, as shown in Figure 7.1, effectively positioning them on a diagonal line.
The camera is then moved along different latitudinal positions (moving sideways as shown in Sub figure 7.1b) with
respect to the starting cone to capture the furthest cones
without any overlapping between cones which can cause
estimation errors. This test was performed for all cone
classes. The data is then compiled in tables per cone
class using Microsoft Excel (version 2304) and only data
from non-overlapping cones are considered in the evaluation.
Figure 7.1: Testing setup. (a) View from camera at start position; (b) View from camera along the latitudinal axis to the left.

7.3 Pipeline latency
To assess the latency (requirement 1), different parts of
the system were timed using the chrono C++ library (12),
which includes a high-resolution clock.
The results of these measurements are displayed in Figure 7.2. It shows the vision system can process each
frame within 76 ms, corresponding to about 13 frames per second. Analyzing the time distribution of each submodule, we find that YOLOv7 cone detection requires 7 ms (YOLOv7-tiny being 2-3 ms faster), followed by RektNet keypoint detection at 60 ms for 8 cones and, finally, the PnP calculation at 1 ms, also for 8 cones.
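The measurements were gathered by bracketing each stage with a high-resolution clock; the same pattern in Python is shown below (the thesis implementation uses std::chrono in C++).

import time

def timed(stage_name, fn, *args):
    # Run one pipeline stage and report its wall-clock duration in milliseconds.
    start = time.perf_counter()
    result = fn(*args)
    print(f"{stage_name}: {(time.perf_counter() - start) * 1000.0:.1f} ms")
    return result

# Example: boxes = timed("YOLOv7", detector.detect, frame)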
Figure 7.2: System latency by parts
Figure 7.3: Absolute distance error by cone type
Figure 7.4: Average longitudinal z error by cone class
Figure 7.5: Average latitudinal x error by cone class
8 DISCUSSION

8.1 YOLO performance
The experimental results indicate that YOLOv7 outperforms YOLOv7-tiny on the test set in terms of accuracy, despite both models being trained with the same hyper-parameters. For this application, small differences in mAP@.95 are significant, as high precision is required. As explained in Evaluation Metrics Section 3.2.1, a high mAP@.95 means the cone is accurately enclosed by its bounding box, which is very advantageous for identifying the keypoints of the cone in the next step. This suggests that the increased number of layers and parameters of YOLOv7 provides an advantage in terms of detection accuracy.
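To make concrete what the metric demands, the sketch below computes the intersection over union (IoU) on which such thresholds are defined; it is a generic illustration, not the evaluation code used for Table 7.1.

```cpp
#include <algorithm>

// Axis-aligned bounding box in pixel coordinates.
struct Box { float x1, y1, x2, y2; };

// Intersection over Union between a predicted and a ground-truth box.
// The mAP@.95 metric evaluates detections at a strict IoU threshold, so a
// high score means the predicted box encloses the cone almost exactly.
float iou(const Box& a, const Box& b) {
    const float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    const float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    const float inter = std::max(0.f, ix2 - ix1) * std::max(0.f, iy2 - iy1);
    const float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (area_a + area_b - inter + 1e-9f);
}
```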
Furthermore, the number of cones detected differed between the two models, especially for cones further away. This is related to precision and recall as defined in Evaluation Metrics Section 3.2.1. After anchor boxes are defined and final boxes are obtained, false positives are rarely observed; however, false negatives are still present, i.e. the model fails to detect cones that are in the frame. This is because anchor boxes can only "filter out" detections to increase the overall accuracy. YOLOv7 detected cones that were further away much better than YOLOv7-tiny, which failed to detect them entirely. For an application such as a racing car, this unpredictable error was considered unacceptable. The detection difference is illustrated in Figure 8.1, where the same frames were processed using both object detection models.
Lastly, since the processing time of one frame by YOLOv7 was only around 2-3ms longer than by YOLOv7-tiny and the overall system was still well within the latency budget, YOLOv7 was the chosen model.
8.2 RektNet and PnP performance
As mentioned in Section 5.2.2, the RektNet model was initially trained on the MIT/Delft dataset. Unfortunately, despite best efforts, the results achieved by the MIT/Delft team could not be replicated. While z distance errors remained below 50cm up to a range of 10m, as can be seen in Figures 7.4 and 7.3, the standard deviation of 19cm was not as low as the 5cm reported by the MIT/Delft team (50). One noteworthy observation made during testing was that the MIT/Delft team did not provide any information regarding x distance estimation, which is vital for a complete pose estimation analysis. Our results in Figure 7.5 show that the x distance estimates provided by RektNet were also accurate, with errors typically below 50cm.
Figure 8.1: Object detection comparison: YOLOv7-tiny detects fewer cones than YOLOv7, especially at greater distances. (a) Object detection from YOLOv7-tiny. (b) Object detection from YOLOv7.
The longitudinal and latitudinal errors were observed to be
correlated with the type of cone, which could be attributed
to RektNet’s difficulty in accurately identifying keypoints
based on the cone’s color.
Furthermore, it was noted that errors along both the longitudinal and latitudinal axes increased in a random manner
when the camera shifted its focus away from the cone.
This behavior could be attributed to inadequate camera
calibration. Despite our best efforts, consistent results
were not achieved for different angles.
Testing also revealed that the performance of RektNet was
not as robust as anticipated. Specifically, RektNet struggled to estimate the keypoints of overlapping cones in
YOLOv7 predictions. This issue stems from the inherent
architecture of RektNet (3.3.1), which computes keypoints
based on heatmap values that can be easily influenced
by any background cones present in a given prediction
bounding box.
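To make this failure mode concrete, the sketch below shows one way keypoint coordinates can be read out of heatmaps with a hard argmax, assuming a [K, H, W] heatmap tensor per cone crop; the actual RektNet post-processing may weight the heatmap differently, but its sensitivity to background peaks is the same.

```cpp
#include <torch/torch.h>

// Extract (x, y) pixel coordinates for each of the K keypoint heatmaps
// produced for a single cone crop. The [K, H, W] layout is an assumption
// made for illustration.
torch::Tensor keypoints_from_heatmaps(const torch::Tensor& heatmaps) {
    const auto K = heatmaps.size(0);
    const auto W = heatmaps.size(2);
    // Flatten each map to K x (H*W) and take the index of its maximum response.
    auto flat = heatmaps.reshape({K, -1});
    auto idx  = flat.argmax(/*dim=*/1);            // [K]
    auto y    = torch::floor_divide(idx, W);       // row inside the crop
    auto x    = idx - y * W;                       // column inside the crop
    // A background cone inside the crop can create a second, stronger peak;
    // the argmax (and hence the keypoint) then jumps to that peak.
    return torch::stack({x, y}, /*dim=*/1);        // [K, 2]
}
```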
Overall, neither the latitudinal nor the longitudinal error exceeded its threshold boundary, hence the mapping accuracy and look-ahead distance requirements were satisfied.
8.3 Pipeline speed performance
For safety reasons, as discussed in requirement 1, the latency should not exceed 200ms. Since the measured latency of 76ms stays well below this bound, the latency requirement is satisfied.
In order to provide comprehensive environmental information during racing conditions, it is highly desirable for the vision stack to process as many frames per second as possible. The current inference time of 76ms (Figure 7.2), equivalent to approximately 13 frames per second (FPS), is a good achievement in this regard.
However, it falls short of the real-time processing performance typically achieved by the MIT/Delft team (50) at around 30ms, equivalent to roughly 30 FPS. The RektNet keypoint detection appears to be a notable bottleneck in the processing pipeline, and further optimization opportunities are discussed in Section 9 as part of future work. Nonetheless, the system effectively complements the LiDAR system, which operates at 10 FPS (43).
9 FUTURE WORK
The purpose of this camera-based system, as discussed in Section 1.1, is to complement the existing LiDAR system by providing the Simultaneous Localization and Mapping (SLAM) algorithm with information about the environment, namely the locations and classes of the cones.
As mentioned in Section 1.2, leveraging the strengths of
both camera-based visual SLAM and LiDAR can be done
using an extended Kalman filter (EKF) (24) (51). An EKF
is a mathematical algorithm used to estimate the state of
a system in the presence of noise and uncertainty. In this
case, the EKF can be used to fuse the data from the camera and LiDAR to obtain a more accurate estimate of the
car’s location and a more accurate map of the environment. The integration of camera-based visual SLAM and
LiDAR through EKF can be achieved by first performing
the LiDAR-based SLAM pipeline to obtain a rough estimate of the car's pose and map. This estimate can then be refined with the camera data, weighted by a factor that reflects its relative uncertainty, to improve the accuracy of the car's pose and the map.
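As a rough illustration of this weighting (only the measurement-update step for a single cone position, not the full EKF, and with placeholder noise values), the sketch below fuses a LiDAR-based estimate with a camera measurement using covariance-derived weights.

```cpp
#include <Eigen/Dense>
#include <iostream>

int main() {
    using Eigen::Matrix2d;
    using Eigen::Vector2d;

    // Prior: cone position (x, z) from the LiDAR-based SLAM pipeline,
    // with its covariance (placeholder values, roughly 30cm standard deviation).
    Vector2d x_prior(1.2, 8.0);
    Matrix2d P_prior = Matrix2d::Identity() * 0.09;

    // Camera measurement of the same cone from the vision pipeline,
    // with its own covariance (placeholder values, roughly 20cm standard deviation).
    Vector2d z_cam(1.35, 8.4);
    Matrix2d R_cam = Matrix2d::Identity() * 0.04;

    // Kalman measurement update with H = I (the camera measures position directly).
    Matrix2d S = P_prior + R_cam;            // innovation covariance
    Matrix2d K = P_prior * S.inverse();      // Kalman gain: the "weight factor"
    Vector2d x_post = x_prior + K * (z_cam - x_prior);
    Matrix2d P_post = (Matrix2d::Identity() - K) * P_prior;

    std::cout << "Fused cone position: " << x_post.transpose() << "\n";
    std::cout << "Fused covariance:\n" << P_post << "\n";
    return 0;
}
```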
Formula Electric Driverless has discussed plans to run its
entire system on a smaller computing machine such as
a Jetson Xavier (37), and there are several solutions that
can increase the runtime speed of the developed vision
stack to adapt to this change. One effective technique
is int8 quantization (30), which reduces the precision of
each weight or activation value to 8 bits, significantly reducing the memory requirements and computational cost
of a neural network. Additionally, parameter pruning (39)
can be used to remove insignificant weights from a model,
reducing its size and improving its efficiency.
Another important factor is the RektNet backbone. Previous work (44) has demonstrated that CSPNet blocks (53) can reduce computing effort by up to 50% while maintaining similar or even better performance than standard ResNet blocks. Although integrating such blocks into the backbone was attempted during the thesis, it was not successful.
An important improvement this thesis did not manage to fully implement is passing the cones (cropped from the frame by their bounding boxes) as a batch to RektNet. This improvement is discussed at length in Section 3.4.2 of (44). Since the keypoint regression network was trained to accept batches of several images per inference, this could be leveraged by passing all crops of a frame through the network in a single inference. In contrast, the current solution iterates through each bounding box or cone sequentially and thus performs several forward passes per frame, which is not ideal. However, in order to see this through, the inference needs to make use of TensorRT.
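Independently of the TensorRT question, the batching idea itself can be sketched with LibTorch as below; the model path and output shape are assumptions made for illustration, and every crop is assumed to have already been resized to RektNet's fixed input resolution.

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <vector>

// Run the keypoint network once on a batch of cone crops instead of once per
// crop. Assumes a TorchScript export of RektNet at "rektnet.pt".
torch::Tensor infer_keypoints_batched(const std::vector<torch::Tensor>& crops) {
    static torch::jit::script::Module rektnet = torch::jit::load("rektnet.pt");
    // crops: one [3, H, W] float tensor per detected cone, all the same size.
    torch::Tensor batch = torch::stack(crops, /*dim=*/0);  // [N, 3, H, W]
    torch::NoGradGuard no_grad;
    // A single forward pass replaces the per-cone loop of the current pipeline.
    return rektnet.forward({batch}).toTensor();            // e.g. [N, K, H, W] heatmaps
}
```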
A final improvement could be using TensorRT (38), a software development kit developed by NVIDIA that optimizes and accelerates the inference of deep learning models. It is designed to provide high performance for deep learning applications by exploiting certain NVIDIA GPUs, such as the one in the Driverless PC, to achieve faster processing times. This thesis started implementing this solution and installed the required libraries on the Driverless PC, but the implementation could not be completed. Note that TensorRT first requires converting the PyTorch models to the ONNX format, and thus the code needs to be adapted to work with the models in this format instead. If this implementation is achieved, however, the processing could be even faster and would be optimized both for the hardware currently used and for the Jetson Xavier.
10 CONCLUSION
The proposed work has implemented a pipelined approach for a camera detection system that can detect cones in images and recover their 3D position using a 2D object detector, a keypoint regression model, and the Perspective-n-Point algorithm. We have presented the challenges that required optimizing known solutions into designs suited to Formula Student Driverless. The paper has provided a comprehensive description of solutions to common bottlenecks encountered when deploying state-of-the-art computer vision algorithms and has presented an open design of a tested low-latency vision stack for high-performance autonomous racing.
The end result of this work is a perception system that achieves a sub-80ms latency from view to depth estimation, with errors smaller than 0.5m at a distance of 10m. It satisfies all requirements except for the FOV, which can be overcome by using another high resolution camera as discussed in Subsection 4.2.1. This system and its associated source code are available for future teams of Formula Electric Belgium to improve upon and innovate further at http://github.com/BayKeremm/thesis-code.
ACKNOWLEDGEMENTS

We would like to thank Formula Electric Belgium for this amazing opportunity to work on a life-sized, real-time and very demanding yet extremely interesting project. Not only did it help us learn new skills and improve our technical abilities, but it also helped us understand the dynamics of such an organization.

We therefore thank all the members, advisors, and generous sponsors of Formula Electric Belgium for making this project possible.

We would also like to thank our colleague Antoine Bauer for providing the hardware to train YOLOv7.

BIBLIOGRAPHY

[1] Installing PyTorch C++ API. https://pytorch.org/cppdocs/installing.html.
[2] OpenCV: DNN module. https://docs.opencv.org/4.x/d6/d0f/group__dnn.html#ga9d118d70a1659af729d01b10233213ee.
[3] OpenCV Documentation. https://docs.opencv.org/4.x/index.html.
[4] PyTorch. https://pytorch.org/.
[5] Robot Operating System. https://www.ros.org/.
[6] rosbag - ROS Wiki. http://wiki.ros.org/rosbag.
[7] (2021). FSOCO Dataset. https://www.fsoco-dataset.com/.
[8] (2022). Formula Student Rules 2022. https://www.formulastudent.de/fileadmin/user_upload/all/2022/rules/FS-Rules_2022_v1.0.pdf.
[9] AMZ (2023). AMZ Racing. https://www.amzracing.ch/en/node/1255.
[10] asmag.com (2013). 7 points to heed when setting FoV. https://www.asmag.com/showpost/14308.aspx.
[11] Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF: Speeded Up Robust Features. In Leonardis, A., Bischof, H., and Pinz, A., editors, Computer Vision – ECCV 2006, pages 404–417, Berlin, Heidelberg. Springer Berlin Heidelberg.
[12] C++ (Last Updated: 1 October 2022). Chrono library date and time utilities. https://en.cppreference.com/w/cpp/chrono.
[13] Daheng Imaging (2023). Daheng Imaging Industrial Vision Cameras. https://www.get-cameras.com/dahengimaging.
[14] De Ryck, R. (2016). State Estimation for Autonomous Vehicles. PhD thesis, KU Leuven.
[15] De Silva, D., Roche, J., and Kondoz, A. (2017). Fusion of LiDAR and Camera Sensor Data for Environment Sensing in Driverless Vehicles.
[16] Dellaert, F. and Contributors, G. (2022). borglab/gtsam. https://github.com/borglab/gtsam.
[17] Dhall, A. (2018). Real-time 3D Pose Estimation with a Monocular Camera Using Deep Learning and Object Priors On an Autonomous Racecar. PhD thesis, ETH Zurich.
[18] Dhall, A., Dai, D., and Van Gool, L. (2019). Real-time 3D Traffic Cone Detection for Autonomous Driving. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 494–501.
[19] Driverless, M. and Core, C. (2021). MIT-Driverless-CV-TrainingInfra. https://github.com/cv-core/MIT-Driverless-CV-TrainingInfra.
[20] Driverless MIT/Delft (2021). RektNet Dataset. https://storage.cloud.google.com/mit-driverless-open-source.
[21] Durrant-Whyte, H. and Bailey, T. (2006). Simultaneous localization and mapping: part I. IEEE Robotics & Automation Magazine, 13(2):99–110.
[22] FEB (2023). Formula Electric Belgium. https://formulaelectric.be/.
[23] FSG (2023). Formula Student Germany. https://www.formulastudent.de/fsg/.
[24] Fujii, K. and Group, T. A.-S.-J. (2004). Extended Kalman Filter. https://www-jlc.kek.jp/2004sep/subg/offl/kaltest/doc/ReferenceManual.pdf.
[25] Harris, C. G. and Stephens, M. J. (1988). A Combined Corner and Edge Detector. In Alvey Vision Conference.
[26] He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385.
[27] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition.
[28] HediVision (2019). Pinhole Camera Model. https://hedivision.github.io/Pinhole.html.
[29] Hui, J. (2020). MAP (Mean Average Precision) for Object Detection. https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173.
[30] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A. G., Adam, H., and Kalenichenko, D. (2017). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. CoRR, abs/1712.05877.
[31] Kabzan, J., de la Iglesia Valls, M., Reijgwart, V., Hendrikx, H. F. C., Ehmke, C., Prajapat, M., Bühler, A., Gosala, N. B., Gupta, M., Sivanesan, R., Dhall, A., Chisari, E., Karnchanachari, N., Brits, S., Dangel, M., Sa, I., Dubé, R., Gawel, A., Pfeiffer, M., Liniger, A., Lygeros, J., and Siegwart, R. (2019). AMZ Driverless: The Full Autonomous Racing System. CoRR, abs/1905.05150.
[32] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T., editors, Computer Vision – ECCV 2014, pages 740–755, Cham. Springer International Publishing.
[33] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single Shot MultiBox Detector. In Leibe, B., Matas, J., Sebe, N., and Welling, M., editors, Computer Vision – ECCV 2016, pages 21–37, Cham. Springer International Publishing.
[34] Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110.
[35] Marchand, E., Uchiyama, H., and Spindler, F. (2016). Pose Estimation for Augmented Reality: A Hands-On Survey. IEEE Transactions on Visualization and Computer Graphics, 22(12):2633–2651.
[36] MIT (2023). MIT Driverless. http://driverless.mit.edu/.
[37] NVIDIA Corporation. NVIDIA Jetson AGX Xavier. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/.
[38] NVIDIA Corporation. TensorRT. https://developer.nvidia.com/tensorrt.
[39] O’Keeffe, S. and Villing, R. (2018). Evaluating pruned object detection networks for real-time robot vision. In 2018 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), pages 91–96.
[40] Open Source Robotics Foundation (Last Edited: 19 October 2020). ROS Camera Calibration. http://wiki.ros.org/camera_calibration.
[41] OpenCV (2021). solvePnP. https://docs.opencv.org/4.x/d5/d1f/calib3d_solvePnP.html.
[42] OpenCV (2022). Camera Calibration and 3D Reconstruction – Calibration Matrix Values. https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#calibrationmatrixvalues.
[43] Ouster (n.d.). Ouster OS1 Datasheet. https://data.ouster.io/downloads/datasheets/datasheet-rev7-v3p0-os1.pdf.
[44] Perauer, C. (2021). Development and Deployment of a Perception Stack for the Formula Student Driverless Competition.
[45] PyTorch. JIT Library. http://pytorch.org/docs/stable/generated/torch.jit.load.html.
[46] Redmon, J. and Farhadi, A. (2017). YOLO9000: Better, Faster, Stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525.
[47] Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.
[48] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019). Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 658–666, Los Alamitos, CA, USA. IEEE Computer Society.
[49] SAE (2021). SAE Levels of Driving Automation Refined for Clarity and International Audience. https://www.sae.org/blog/sae-j3016-update.
[50] Strobel, K., Zhu, S., Chang, R., and Koppula, S. (2020). Accurate, Low-Latency Visual Perception for Autonomous Racing: Challenges, Mechanisms, and Practical Solutions. CoRR, abs/2007.13971.
[51] Thrun, S., Burgard, W., and Fox, D. (2005). Probabilistic Robotics. The MIT Press, Cambridge, MA, USA.
[52] Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.
[53] Wang, C.-Y., Liao, H.-Y. M., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., and Yeh, I.-H. (2020). CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[54] Xue, Z. and Schwartz, H. (2013). A comparison of several nonlinear filters for mobile robot pose estimation. 2013 IEEE International Conference on Mechatronics and Automation, IEEE ICMA 2013, pages 1087–1094.
APPENDICES
A  ReLU
B  YOLO Performance Metrics
APPENDIX A: RELU
Figure A.1: ReLU activation function shape
APPENDIX B: YOLO PERFORMANCE METRICS
Figure B.1: YOLOv7-tiny: PR curve and confusion matrix for different cone types
Figure B.2: YOLOv7: PR curve and confusion matrix for different cone types