The Application of Computer Vision in
Autonomous Vehicles
Bingchang Wu
------------------------------------------------------------------------------------------------------------------
Introduction
Due to recent rapid advancements in Artificial Intelligence (AI), Autonomous Vehicles (AVs) are
deemed by many to be the future of transportation, revolutionising the automotive industry. In
the past decade, Tesla has been one of the leading firms in the development of autonomous
driving with its Autopilot system. In Tesla’s most recent quarterly safety report (Q2), drivers
using Autopilot recorded one crash per 6.88 million miles driven, compared to one crash per 1.45
million miles for human drivers. This data suggests that AVs are more than 4 times less likely to
be involved in a traffic accident, and that their adoption could greatly improve road safety.
Tesla’s autonomous driving technology, rooted in AI and in particular computer vision,
processes visual data in real-time to detect and classify objects and respond to dynamic changes.
This paper aims to provide a comprehensive overview of the theory behind how computer vision
and various sensors enable a vehicle to interpret its surroundings. It also examines the current
capabilities and limitations of these technologies to analyse whether they can fully replace
manual driving.
Figure 1: Tesla’s autopilot
Image Processing
An image is a 2-dimensional representation of a scene in the visible-light spectrum. It is
essentially a matrix of pixels (the smallest units in an image), each represented by a number.
Figure 2: computers see images as a matrix of numerical values
In a binary image, there are only two colours: “0” representing black and “1” representing white
(1 pixel = 1 bit). Greyscale images (e.g. figure 2) have a range of shades between black and white,
varying from 0 to 255, where 0 = pure black and 255 = pure white (1 pixel = 8 bits).
Figure 4
Figure 3
In a colour image, each pixel is composed of three channels of Red, Green, and Blue
(RGB), each with an intensity between 0 and 255, and the pixel’s RGB value is represented as a
1x3 matrix [R, G, B] (1 pixel = 24 bits).
Figure 5: an image in a) Colour b) Greyscale c) Binary
Before deciding on the vehicle’s actions, the computer needs to extract the relevant information
from the image through a process called Image Processing – a field of signal processing. In the
context of autonomous vehicles, this is an important step. Raw visual data often includes too
much noise, motion blur, varying lighting, and other imperfections that can mislead the AI’s
judgement. For example, poor lighting might make it difficult to distinguish a pedestrian from a
background object.
In image processing, the camera captures an image and inputs it to the system before the system
interprets it. The output either enhances the image or extracts its characteristics, which helps in
identifying the contents precisely. We will now briefly review the steps to achieve this.
Figure 6
Noise Reduction
Image noise (digital noise) is a variation/deviation of brightness in the image
(meaningless/incorrect pixels). Noise reduction is a major step in image processing, used to
ensure the visual input data is clean, which is essential for accurately interpreting surroundings
and preventing misidentifications, such as misreading the colour of a traffic light. There are
different causes of image noise depending on its type.
Figure 7
Figure 8
Smoothing is the process of suppressing noise through various filters, reducing the impact of
corrupted pixels and making the image appear “smoother”. Filtering is used to
enhance an image through removal or modification of certain features in the image.
Figure 9
In linear filtering, the value of the output pixel is a linear combination of values of
neighbouring pixels to the input pixel. One example would be the box filter, where the value of
each pixel in the output image is the average of the pixels in its neighbourhood.
Consider the image represented by the matrix below(Figure10):
Figure 10
We can identify that the three noise pixels (circled in red) are mathematical anomalies relative to
their neighbouring values (e.g. 61 next to 50s and 110 next to 100s). We can suppress this noise
by applying the box filter. Convolution is the mathematical process used to apply a filter to an
image. The box kernel (a small matrix used to modify the pixels of an image through
convolution) is passed over the entire image, and at each location the pixel value is replaced by
the weighted average of the surrounding pixel values. This process removes the noise by
“blurring” it into the surroundings.
In our example, let the kernel be a 3x3 matrix as shown:
Figure 11
The weight of each value in the kernel will be 1/9, as there are 9 values in the kernel and the total
weight needs to add to 1 to compute the average.
In figure 12, the value to replace “58” would be: (8 × 50 + 58) × 1/9 ≈ 50.89 ≈ 51
Figure 12
Figure 13 below shows the end result after convolving the whole image:
Figure 13
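The worked example above can be sketched in a few lines of Python; the 3×3 patch below mirrors the “58 surrounded by 50s” case, and the function name is illustrative:

```python
def box_filter_at(img, r, c):
    """Box filter at one pixel: the average of the 3x3 neighbourhood centred at (r, c)."""
    total = sum(img[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1))
    return total / 9

# The noisy "58" surrounded by eight 50s, as in the worked example
patch = [[50, 50, 50],
         [50, 58, 50],
         [50, 50, 50]]
print(round(box_filter_at(patch, 1, 1), 2))  # 50.89, which rounds to 51
```

Passing this computation over every pixel (via convolution with a 1/9-weighted kernel) produces the smoothed image in figure 13.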
Figure 14: convolving a coloured image using the box filter
A limitation of the box filter is that all pixels in the neighbourhood contribute equally to the
output pixel, leading to a uniform blur that loses important image details like the edge sharpness
of an object. The Gaussian Filter is a more practical option, applying a weighted average where
the weights are determined by a Gaussian function. The weights decrease exponentially as the
distance from the centre increases. This approach looks more natural and preserves edges, whose
importance we will discuss later.
Mathematically, the filter uses the Gaussian function to determine the weights of neighbouring
pixels in the kernel:

G(x, y) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))

where x and y are the horizontal and vertical distances from the centre and σ is the standard
deviation of the distribution. The function creates a bell-shaped distribution where the vertical
height can be seen as the weight given to the pixel at that coordinate; higher weights are given to
pixels closer to the centre.
Figure15: visual representation of Gaussian function
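A small sketch of how the kernel weights could be built from G(x, y), assuming a 3×3 kernel with σ = 1 (the function name and defaults are illustrative):

```python
import math

def gaussian_kernel(size=3, sigma=1.0):
    """Build a kernel from G(x, y), then normalise so the weights sum to 1."""
    k = size // 2
    g = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)
          for x in range(-k, k + 1)] for y in range(-k, k + 1)]
    total = sum(sum(row) for row in g)
    return [[v / total for v in row] for row in g]

kernel = gaussian_kernel()
# The centre pixel receives the largest weight; the corners the smallest,
# which is exactly the bell shape shown in figure 15.
```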
Figure16: Gaussian filter maintaining facial features whilst smoothing the image
Non-linear filters
Non-linear filters use more complex operations than applying a linear function to the
neighbourhood of the input pixel. One of the common noise types is impulse noise. Images
corrupted by this noise would get dark pixels in bright regions and bright pixels in dark regions.
An effective solution is the non-linear Median Filter. This filter replaces the
input pixel with the median value of its neighbouring pixels within the kernel. It can easily
remove impulse noise, whilst preserving edges better than linear filters.
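A minimal sketch of the median filter on one pixel, with a hypothetical impulse-noise patch:

```python
import statistics

def median_filter_at(img, r, c):
    """Replace pixel (r, c) with the median of its 3x3 neighbourhood."""
    neighbourhood = [img[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    return statistics.median(neighbourhood)

# A bright impulse-noise pixel (255) sitting in a dark region
patch = [[10, 12, 11],
         [13, 255, 10],
         [11, 12, 13]]
print(median_filter_at(patch, 1, 1))  # 12 - the outlier is discarded entirely
```

Unlike the box filter’s average, the median never mixes the outlier’s value into the result, which is why edges survive better.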
Figure 17
Figure18
Edge Detection
It is essential for autonomous vehicles to distinguish different objects on the road. Edge
detection is the process of identifying sharp changes in intensity/colour in an image to locate the
edges/boundaries of objects. A common method of edge detection is the Sobel Operator. It uses
two 3x3 convolution kernels, one for detecting horizontal edges and one for vertical edges (see
figure 18).
Figure 18
The Sobel operator uses convolution kernels to calculate gradients. These two kernels are
convolved over the input image, with each pixel multiplied by the corresponding weight in the
Sobel kernel. The resulting values from each kernel are then summed to find the values of
Gx and Gy. The resulting Gradient Approximation can be calculated with:

G = √(Gx² + Gy²)
After calculating the gradient magnitude (G), a threshold is compared to it. If G > threshold, the
pixel will be considered an edge. A binary edge map is produced after thresholding.
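The Sobel computation above can be sketched for a single pixel; the patch with a hard vertical boundary and the threshold value are illustrative:

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to horizontal edges

def sobel_at(img, r, c, threshold=100):
    """Gradient magnitude G = sqrt(Gx^2 + Gy^2) at pixel (r, c), then threshold."""
    gx = sum(SOBEL_X[a][b] * img[r - 1 + a][c - 1 + b] for a in range(3) for b in range(3))
    gy = sum(SOBEL_Y[a][b] * img[r - 1 + a][c - 1 + b] for a in range(3) for b in range(3))
    g = math.sqrt(gx * gx + gy * gy)
    return g, g > threshold  # (magnitude, is this pixel an edge?)

# A sharp vertical boundary between dark (0) and bright (255) columns
patch = [[0, 0, 255],
         [0, 0, 255],
         [0, 0, 255]]
print(sobel_at(patch, 1, 1))  # (1020.0, True)
```

Repeating this at every pixel and keeping only the True results yields the binary edge map described above.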
Figure 19
Figure 20
Colour Space Conversion
In autonomous vehicles, the YUV colour space is commonly used for separating luminance(Y)
from chrominance(U&V). This is because when detecting edges, only the light-intensity(Y) is
needed, and colours would add extra unnecessary dimensions to the task. However, the vehicles
still need to detect colours(U&V) for traffic lights, thus the separation is crucial(figure21).
Figure 21
The conversion formula is given by(figure22):
Figure 22
For the implementation of these image processing techniques, a popular library is OpenCV, an
open-source computer vision library essential for real-time image processing.
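Figure 22’s exact coefficients are not reproduced here; as an illustration, one common (BT.601) form of the RGB-to-YUV conversion can be sketched as:

```python
def rgb_to_yuv(r, g, b):
    """One common (BT.601) RGB-to-YUV conversion; figure 22's exact
    coefficients may differ slightly."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance - used for edge detection
    u = 0.492 * (b - y)                     # chrominance (blue projection)
    v = 0.877 * (r - y)                     # chrominance (red projection)
    return y, u, v

# Pure white carries full luminance and (near-)zero chrominance:
# Y is approximately 255, U and V are approximately 0
print(rgb_to_yuv(255, 255, 255))
```

This separation is what lets the vehicle run edge detection on Y alone while still reading traffic-light colour from U and V.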
Figure 23
Image Recognition
After processing the raw visual data, the next step is to interpret the data through Image
Recognition. This is the core theory behind autonomous vehicles, as they need to be able to
recognise things like pedestrians, other vehicles, traffic lights/signs from the images(figure24).
Figure 24
However, computers cannot recognise/distinguish the contents of an image as easily as humans can.
Traditional image recognition methods relied on manually defined features, which isn’t practical,
as there are too many variations within the same type of object and too many uncertainties in real life.
A solution to this problem is Deep Learning, a sub-field of machine learning that lets computers
learn features themselves directly from raw data instead of being explicitly programmed.
Deep Learning models train Neural Networks, algorithms inspired by the neurons
of the human brain. A type of neural network specialised for image recognition is the
Convolutional Neural Network (CNN), designed to process grid-like data such as images.
Once the raw visual data goes through image processing to reduce noise and enhance features,
it becomes suitable for interpretation by CNNs. These networks rely on clear and relevant data
to learn patterns and make accurate classifications.
Figure 25
To recognise an object in an image, a CNN needs to recognise patterns in the image that match
known (trained) information. However, we cannot simply program the CNN to recognise only
this matrix arrangement (figure 25) as “8”, because there are countless variations/positions in an
image that can represent the object. We need the CNN to recognise all the distinct features that
make up any “8”.
A CNN is composed of layers, each with a specific function/purpose to the overall process of
feature detection and object recognition. These layers work sequentially, transforming the input
data and extracting relevant features to make accurate predictions.
Figure 26
Each hidden layer performs specific operations on the pixels from the previous layer to extract
part of the features (e.g. edges), then inputs them to the next layer. All identified features are
combined in the output layer to classify the whole object in the image.
We will now look at the hidden layers (figure 27).
Figure 27
Convolution Layer
Just like in image processing, these layers apply kernels(filters) to the input image to detect
features such as edges and patterns. Consider the following example(figure28):
Figure 28
In the convolution layer, the kernel is slid over the pixels of the image in reading order, moving
one pixel at a time (stride 1). At each step, the kernel’s weights and the pixels it covers are
multiplied element-wise, and the results are summed into a single value mapped to a new
matrix called the Convolved Feature Map. As you can see (figure 29):
(1×1)+(1×0)+(1×1)+(0×0)+(1×1)+(1×0)+(0×1)+(0×0)+(1×1) = 4, which is the first value of the
new matrix. The result (figure 30) is shown on the right:
Figure 29
Figure 30
Figure 31
To find the dimensions of the output feature map, the formula n-f+1 is used, where n is the
length/width of the input image, f is the length/width of the kernel, and
the result is the length/width of the feature map. As in the example
above, the feature map size is 5-3+1=3 for length and width.
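Both the sliding-kernel computation and the n−f+1 output size can be reproduced in a short sketch; the 5×5 image below is an assumed stand-in for figure 28, chosen so the first output value is the 4 computed above:

```python
def conv2d(img, kernel):
    """Stride-1 'valid' convolution: the output is (n - f + 1) x (n - f + 1)."""
    n, f = len(img), len(kernel)
    out = n - f + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(f) for b in range(f))
             for j in range(out)] for i in range(out)]

image = [[1, 1, 1, 0, 0],    # assumed 5x5 binary image
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],         # assumed 3x3 filter
          [0, 1, 0],
          [1, 0, 1]]

fmap = conv2d(image, kernel)
print(len(fmap), fmap[0][0])  # 3 4 -> a 3x3 feature map whose first value is 4
```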
Figure 32
Different filters are used for detecting different features, and each convolution layer can apply
multiple filters (figure 32 shows different possible features/edges a CNN might look for). For
example, one filter might detect all horizontal edges, while another detects circles. Each filter
produces its own output feature map, in which only the feature that filter is looking for is
highlighted. In CNNs, multiple convolution layers are often stacked to progressively recognise
more complex features. Feature maps from multiple convolution layers are combined and input
to the following layers. This creates a Feature Hierarchy where surface layers detect low-level
features like edges/texture, and deeper layers put them together to detect high-level features like
traffic signs.
One limitation of convolutional layers is that they rely heavily on the constant lighting
conditions and image angles/perspectives of the data they were trained on, so they can struggle to
recognise images with varying lighting conditions or unusual perspectives.
ReLU Layer
Once the feature maps are extracted, they are passed to the ReLU (Rectified Linear Unit) Layer,
whose function is defined as: f(x) = max(0, x)
Figure 33
This layer simply sets all negative values in the feature maps to 0. This introduces non-linearity
(figure 33) into the CNN, because images involve complex patterns (such as curves) that linear
models cannot capture. The output from this layer is a Rectified Feature Map (figure 34).
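The ReLU layer is a one-liner in practice; a minimal sketch on an illustrative feature map:

```python
def relu_layer(feature_map):
    """Apply f(x) = max(0, x) element-wise: all negative values become 0."""
    return [[max(0, v) for v in row] for row in feature_map]

print(relu_layer([[-3, 5], [2, -1]]))  # [[0, 5], [2, 0]]
```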
Figure 34
Pooling Layer
The rectified feature map is the input to the Pooling Layer, where Max Pooling is used to
significantly reduce the spatial dimensions (width/height) of the input feature maps.
Figure 35
Figure 36
This is done to reduce computation time whilst focusing on key features (figure 35), which is
important for autonomous vehicles to react quickly to sudden changes. Max pooling also
enhances features in the image (figure 36). To understand max pooling, consider the following
4x4 matrix representing a rectified feature map (figure 37):
Figure 37
The most common max pooling filter size is 2x2. In max pooling, there are no weights in the
filter to multiply by. Instead, the largest value among the pixels covered by the filter is selected
and mapped onto a new matrix called the Pooled Map. The stride of the filter is 2, meaning it
moves 2 steps at a time starting from the top-left corner. Once the example feature map is pooled,
the resulting pooled map should look like this (figure 38):
Figure 38
Since the greatest value is always picked, features are made sharper, as in figure 36.
However, one downside of pooling layers is that the reduction in spatial dimensions can
sometimes lead to a loss of information (especially for small objects in the image). Losing
these small objects or fine details could potentially impact the safety of autonomous vehicles.
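A minimal sketch of 2×2, stride-2 max pooling (the feature-map values are illustrative, not figure 37’s):

```python
def max_pool(fm, size=2, stride=2):
    """2x2 max pooling with stride 2: keep the largest value in each window."""
    return [[max(fm[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fm[0]) - size + 1, stride)]
            for i in range(0, len(fm) - size + 1, stride)]

rectified = [[1, 3, 2, 1],
             [4, 6, 5, 0],
             [1, 2, 9, 7],
             [3, 0, 8, 4]]
print(max_pool(rectified))  # [[6, 5], [3, 9]] - a 4x4 map halved to 2x2
```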
Fully Connected Layer
Figure 39
So far, we have looked at how different layers work together to extract features from images. But
how do we classify these features under a particular label (object), and how do we make the model
learn to associate certain features with that label? Fully Connected Layers are used to classify the
image into a particular category after the features have been extracted by the convolution and max
pooling layers.
Figure 40
Consider figure 40 above. After obtaining the pooled feature maps, we flatten them into
1-dimensional arrays and pass them into the fully connected layers. The 4 layers shown consist of
1 input layer and 3 fully connected layers. For multi-class classification, the output
layer (the final fully connected layer) has n neurons, where n is the number of categories the
model is trained to classify. Each neuron in a layer has a weight (connection) to each neuron in the
subsequent layer. These weights are trainable parameters that our model needs to learn. These
layers perform the classification and learn to associate features with categories (e.g. associating
headlights with vehicles).
Activation Functions are applied in the output layer to compute and decide on the output
class (e.g. vehicles, pedestrians) based on the features extracted in previous layers. The Sigmoid
Function is often used in binary classification (deciding between 2 classes):

σ(x) = 1 / (1 + e^(−x))
Figure 41
As figure 41 above shows, the sigmoid function maps an input value to between 0 and 1. This can
be modelled as a probability distribution, where the output is the probability of one of the two
outcomes. The outcome with the higher probability is the category the image is classified as. For
example, if the output of the function is higher than 0.5 (closer to 1) the vehicle would “go”,
whereas if the output is less than 0.5 (closer to 0), the vehicle would “stop”. The threshold of 0.5
used in this example can be adjusted to specific requirements; in autonomous vehicles the
threshold is set depending on the level of confidence needed for an action.
For multi-class classification, we use the SoftMax Function, given by:

σ(z_i) = e^(z_i) / Σ_{j=1..K} e^(z_j)

where z_i is the raw output for class i from the CNN, K is the total number of classes to classify,
and σ(z_i) is the SoftMax output for class i, which predicts the probability of that class. Consider
the example below (figure 42):
Figure 42
In figure 42, the output value from each neuron in the output layer is passed through the SoftMax
function, and the result on the right is the probability of each class (total probability = 1).
The class with the highest probability is what the object in the image is classified as.
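The two activation functions above can be sketched directly from their formulas (the raw outputs fed to SoftMax are illustrative):

```python
import math

def sigmoid(x):
    """Maps any input into (0, 1) - used for binary go/stop style decisions."""
    return 1 / (1 + math.exp(-x))

def softmax(z):
    """sigma(z_i) = e^(z_i) / sum_j e^(z_j): raw scores -> class probabilities."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

print(sigmoid(0))                  # 0.5 - exactly on the decision threshold
probs = softmax([2.0, 1.0, 0.1])   # illustrative raw CNN outputs
print(round(sum(probs), 6))        # 1.0 - the probabilities sum to 1
```

The class with the largest softmax value (here the first) would be the predicted label.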
With an understanding of how CNNs process visual data, it is important to now examine how
these technologies perform in unpredictable real-world scenarios.
Object Detection and Tracking
After an overview of the fundamentals of image recognition and CNNs, we will now look at an
advanced real-life application that extends beyond static feature recognition.
Figure 43
After extracting features with the CNN, object detection locates objects in the image, and
tracking monitors these objects as they move across frames. For object detection, a bounding box
is predicted around each potential object, which is then classified into one of the predefined
categories the CNN is trained to recognise, such as pedestrians, cyclists, vehicles, etc.
Once objects are detected, each object in a bounding box is assigned a unique ID. Tracking
algorithms such as optical flow are used to predict the object’s position based on previous
movements.
However, traditional object detection methods such as sliding-window detection and region-based
CNNs (R-CNN) have limitations. Their multi-step pipeline of generating region proposals
(candidate bounding boxes), feature extraction, object classification, etc. is too computationally
intensive and slow, which isn’t practical for real-time use such as autonomous driving.
YOLO (You Only Look Once) is a state-of-the-art object detection algorithm developed as a
more efficient approach, outperforming previous object detection algorithms in speed. Unlike
traditional multi-stage methods, YOLO divides the image into a grid, and each grid cell predicts
bounding boxes and their class probabilities for each recorded class, allowing it to detect
multiple objects in a single image pass. Its unique output label format (figure 44) allows for
simultaneous prediction of object properties, reducing computation time and making YOLO
suitable for the fast-paced decision-making process in autonomous vehicles.
Figure 44
Given an input image, YOLO first lays a grid (usually 19x19) over the image, as shown in figure 44
above. The algorithm performs detection on all grid cells at once. In the output label,
Pc indicates the probability that an object is present in a specific grid cell/bounding box (1 being
very confident). Bx and By indicate the predicted centre of the detected object, and Bh and Bw are
the height and width of the bounding box. Cn’s value indicates the probability of the object in
the cell belonging to class n.
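Reading one grid cell’s output label can be sketched as follows; the flat layout [Pc, Bx, By, Bh, Bw, C1, …, Cn], the class names, and the confidence threshold are all assumptions for illustration:

```python
def parse_yolo_cell(label, class_names, pc_threshold=0.5):
    """Interpret one grid cell's label [Pc, Bx, By, Bh, Bw, C1, ..., Cn]."""
    pc = label[0]
    if pc < pc_threshold:
        return None  # not confident any object is present in this cell
    bx, by, bh, bw = label[1], label[2], label[3], label[4]
    class_probs = label[5:]
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return {"class": class_names[best], "confidence": pc,
            "centre": (bx, by), "box_hw": (bh, bw)}

cell = [0.9, 0.5, 0.4, 0.2, 0.3, 0.1, 0.8, 0.1]  # hypothetical cell output
print(parse_yolo_cell(cell, ["pedestrian", "vehicle", "cyclist"])["class"])  # vehicle
```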
Once the YOLO algorithm has detected and classified objects within the vehicle’s surroundings,
the autonomous vehicle must then make real-time decisions based on this information. The
classified images provide critical data, such as the location, type, and speed of various objects
like other vehicles, pedestrians, and traffic signs. The vehicle’s onboard decision-making system
driven by complex algorithms and machine learning models analyses this data to determine the
safest course of action. For example, if the vehicle detects a pedestrian crossing the street, it will
calculate the pedestrian’s trajectory and decide whether to slow down or stop entirely to avoid a
collision. Similarly, if another vehicle is detected in an adjacent lane attempting to merge, the
autonomous vehicle may decide to adjust its speed to maintain a safe distance. This
decision-making process integrates inputs from multiple sensors and continuously updates in
response to changing conditions, allowing the vehicle to navigate complex environments. We
will now look at the sensors that provide this additional data for the decision-making process of
AVs.
Sensors in Autonomous Vehicles
While computer vision techniques like CNNs and YOLO allow vehicles to understand their
surroundings autonomously, there are limitations that affect their performance. Computer vision
relies heavily on clear visual input and constant lighting conditions, both of which are challenged
in extreme weather. Cameras also convert 3D information into 2D images, which makes accurate
distance measurement challenging using computer vision alone. Various sensors are used in
autonomous vehicles to provide additional environmental data to the system and overcome these
limitations, which we will briefly review (figure 45).
Figure 45: different sensors in autonomous vehicles
LiDAR (Light Detection and Ranging) sensors compensate for the cameras’ lack of depth
perception by providing high-resolution 3D maps of the vehicle’s surrounding terrain in
real-time, independently of ambient lighting. LiDAR works by emitting millions of short laser
pulses per second in all directions and measuring the time each pulse takes to reflect off objects
and be recaptured by the sensor. The distance is measured by:

d = (c · t) / 2

where c is the speed of light and t is the round-trip time of the pulse.
Each pulse provides a distance measurement, and combining millions of these forms a point
cloud, a high-density collection of spatial data used to create a detailed map. LiDAR is the main
sensor used by many AV companies, such as Waymo and Cruise (leading firms in the industry).
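The distance formula can be sketched as (the 100 ns return time is an illustrative value):

```python
C = 299_792_458  # speed of light in m/s

def lidar_distance(round_trip_time):
    """d = c * t / 2: the pulse travels to the object and back again."""
    return C * round_trip_time / 2

# A pulse returning after 100 nanoseconds -> an object roughly 15 m away
print(round(lidar_distance(100e-9), 2))  # 14.99
```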
Figure 46: LiDAR’s point cloud map
Nonetheless, LiDAR is very expensive, consumes a lot of power for computation, and is
notoriously unreliable in rainy or foggy/dusty conditions due to the scattering of laser pulses.
Ultrasonic sensors and radar work in a similar manner, emitting waves (sound and radio
respectively) and measuring the return time to judge distance. Ultrasonic sensors are mainly used
for parking assistance due to their short range. Radar is mainly used to measure the distance to
other vehicles for adaptive cruise control and collision-avoidance systems. It uses the Doppler
Effect to measure the relative speed of objects: when an object is moving closer, the return
frequency increases, and vice versa when it is moving away. The formula

v_r = c · (f_r − f_t) / (2 · f_t)

can be used to calculate the relative velocity v_r of an object with respect to your vehicle, where
f_t is the transmitted frequency and f_r is the received frequency. Nonetheless, radar
measurements lack detail, often failing to recognise smaller objects like pedestrians and cyclists.
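The Doppler formula can be sketched as (the 77 GHz carrier, typical of automotive radar, and the 10 kHz shift are illustrative values):

```python
C = 299_792_458  # speed of light in m/s

def radar_relative_velocity(f_transmitted, f_received):
    """v_r = c * (f_r - f_t) / (2 * f_t); positive means the object is approaching."""
    return C * (f_received - f_transmitted) / (2 * f_transmitted)

f_t = 77e9           # 77 GHz automotive radar carrier
f_r = f_t + 10e3     # received frequency shifted up by 10 kHz
print(round(radar_relative_velocity(f_t, f_r), 2))  # 19.47 m/s, closing in
```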
Other sensors overcome other limitations. Infrared Cameras are used in night-time driving, when
light intensity is too low for consistent image recognition, by capturing infrared radiation instead
of visible light. In environments without clear visual landmarks, GPS and IMUs (Inertial
Measurement Units) are used to maintain accurate localisation. GPS provides precise location
data to the system, and IMUs measure acceleration, angular velocity, and orientation (used for
odometry), which is important for navigation in AVs.
To improve precision in real-time estimation of the autonomous vehicle’s position and relative
velocity of other nearby vehicles, we need to fuse measurements from the various sensors
introduced. This is where the Kalman Filter comes in, one of the most famous algorithms
widely used in various engineering fields including robotics and autonomous vehicles. The
purpose is to estimate the state of a changing system from a series of incomplete and
noisy/uncertain data. The Kalman filter has two main stages: Prediction and Correction.
In the prediction stage, the filter uses state prediction to estimate the position of the object being
tracked (this can be a car ahead or the vehicle itself) based on its current position and velocity. It
also predicts how uncertain this prediction is (the predicted uncertainty), based on the object’s
speed and the data’s noise level. In the correction stage, the filter updates its prediction using new
measurements by calculating a weighted average between the prediction and the new measurement.
The weights depend on how accurate the filter believes each source of data to be.
Consider the following example of estimating the 1D position of a car:
Figure 47: Kalman filter
Let’s say at time t−1 the car is estimated by your vehicle’s radar to be around 10 m ahead
(Xt−1 ≈ 10). This is modelled by a Gaussian distribution whose mean µ of 10 is the most
probable actual position of the car (though it could be slightly further/closer, within the
variance). The new position at time t can be estimated using the equations of motion (SUVAT)
with the car’s current position and velocity, which we calculate to be 15 m ahead (Xt). However,
at time t the radar measures the distance to be around 17 m (Yt). We can combine the probability
distributions of the radar’s measurement (Yt) and the state estimate (Xt) into an optimal estimate
distribution using the Kalman filter’s correction stage.
The filter calculates a Kalman Gain, a factor that determines how strongly the new measurement
influences the adjustment of the current estimate. The formula is:

KG = predicted uncertainty / (predicted uncertainty + measurement uncertainty)
A higher Kalman Gain indicates more trust in the new measurement (as the predicted uncertainty
is higher), while a lower gain indicates more reliance on the prediction. In the example, Xt’s
predicted position distribution has a lower probability density than Yt’s (Yt’s curve is higher).
Thus the filter trusts the radar’s estimate more in this case (perhaps because the car’s speed was
constantly changing), so more weight is put on Yt’s distribution (the optimal estimate is closer to
Yt). Using the equation

Estimate(t) = Estimate(t−1) + KG · [Measurement(t) − Estimate(t−1)],

if KG is 0.75, the mean µ of the optimal estimate is 15 + 0.75 × (17 − 15) = 16.5, the best
estimate of the car’s actual position.
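The correction step in the example can be sketched as (the uncertainty values of 3 and 1 are hypothetical, chosen to give the KG of 0.75 used above):

```python
def kalman_correct(predicted, measured, pred_uncertainty, meas_uncertainty):
    """One correction step: KG weights the measurement against the prediction."""
    kg = pred_uncertainty / (pred_uncertainty + meas_uncertainty)
    estimate = predicted + kg * (measured - predicted)
    return estimate, kg

# Prediction Xt = 15 m, radar measurement Yt = 17 m
estimate, kg = kalman_correct(15, 17, 3, 1)
print(kg, estimate)  # 0.75 16.5
```

With a noisier radar (larger measurement uncertainty), KG falls and the estimate stays closer to the prediction, which is exactly the weighting behaviour described above.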
The Kalman filter’s ability to filter noise and fuse different sensor data makes it invaluable for
autonomous vehicle safety despite imperfect data.
Autonomous Vehicles in the Real-World
After reviewing the technical processes, we will now analyse the future of AVs. To discuss
whether autonomous vehicles will completely replace manual driving, we should analyse their
performance in the real world. I personally experienced an autonomous taxi ride in Abu Dhabi
(UAE) in April 2023. The journey was quite smooth and the vehicle followed all standard
driving practices (e.g. remaining under the speed limit, using indicator lights when switching
lanes). However, the vehicle wasn’t completely autonomous, as there was still a taxi driver in the
driving seat in case of emergencies, which indicates that the current technology is still immature.
In 2022, the National Highway Traffic Safety Administration (NHTSA) reported that Tesla’s
vehicles on Autopilot had a crash rate of 0.18 per million miles driven, compared to 0.8 for
human drivers. At first glance, this might suggest that AVs are more reliable than human drivers.
However, the crashes involved were often rear-end collisions or cases where the system failed to
recognise stationary objects, showing a major flaw in the current technology.
Figure 48
Human intervention rates are one measure of AV reliability. Data from the California DMV
(Department of Motor Vehicles) shows that Waymo and Cruise vehicles average 0.09 and 0.05
disengagements per 1,000 miles, respectively. This suggests that current AV technology is not
yet fully autonomous, particularly in unpredictable environments.
The consequences of deploying this not-yet-mature technology on public roads can be dangerous
and ethically questionable, as reflected in March 2018, when Elaine Herzberg became the first
pedestrian ever killed by an autonomous vehicle (an Uber prototype AV). While Herzberg was
detected, Uber’s system failed to infer her motion and did not apply emergency braking
(figure 48). In 2020, Toyota’s e-Palette AV collided with a visually impaired pedestrian crossing
the road due to misidentification. These real-world incidents show that although AVs statistically
result in fewer accidents on average, their ability to respond effectively to unpredictable edge
cases (which a human driver can often handle) remains a significant concern.
Limitations of current technology
Although visual navigation combined with sensors such as LiDAR resolves many of the
limitations of computer vision alone, many limitations of AVs remain unsolved and must be
addressed before they can be fully integrated into society.
AVs perform notably poorly in adverse weather conditions. MIT’s AgeLab research showed that
heavy rain and snow can reduce the effectiveness of LiDAR and cameras (the two main sensors
used in AVs), decreasing the vehicle’s ability to detect and respond to obstacles. This limitation
shows that AVs cannot yet adapt to varied weather conditions, making their broader
implementation in place of human drivers very difficult.
Currently, AVs rely on highly detailed maps and well-marked roads to ensure safety. They
usually follow predetermined paths and rely on high-definition (HD) maps that provide detailed
road information, including specific lane markings and even the positions of trees. Such mapped
areas are few, so AVs have very limited areas in which to operate, making them less accessible.
A study from Stanford University found that AVs’ performance declines significantly in rural
areas where road markings may be faded or GPS signals, essential for navigation, are less reliable.
Driving a vehicle requires a high level of understanding and prediction of human behaviour
beyond obstacle avoidance. Scenarios that are easy for humans to predict can be challenging for
AVs: when a pedestrian steps off the pavement and disappears behind a parked car, a human
driver expects them to reappear on the other side to cross the road, but computers cannot foresee
this yet.
Incorporating more advanced AI to solve the above limitations requires much more
computational power to make split-second decisions. To overcome this, researchers need to
develop more efficient algorithms to speed up processing times.
Another limitation is that current AI models may not respond well to edge cases, having never
been trained on these very uncommon scenarios. Future advancements could focus on integrating
AI that learns continuously from new data in real-time, similar to human learning.
Conclusion
As for whether autonomous vehicles can fully replace manual driving, the current answer is no,
due to the limitations discussed above. AVs are still very niche, able to operate safely only in
specific locations and weather conditions, whilst still at risk from edge cases and unpredictable
human behaviour. This niche status is also reflected in their cost, as AVs consume much more
energy than regular vehicles and require many expensive sensors. This economic limitation is a
significant barrier to the widespread adoption of AVs, considering the scale required to fully
replace manual driving. For AVs to serve as a replacement for manual
driving, continuous development in AI and improvements in production efficiency and cost
reduction will be essential. Despite these challenges, AVs remain a promising prospect, with an
estimated 1.3 million fewer car accidents if they eventually fully replace manual driving, albeit
a distant future. Until then, AVs will likely serve as a complementary alternative, enhancing
safety and convenience in specific contexts rather than fully replacing human drivers.
Bibliography
1.How self-driving cars work | Synopsys (page 1, 24, 25)
https://www.synopsys.com/glossary/what-is-autonomouscar.html#:~:text=An%20autonomous%20car%20is%20a,in%20the%20vehicle%20at%20all.
2.Tesla (page 1, 23)
Autopilot | Tesla
3.Tesla Vehicle Safety Report | Tesla (page 1, 23, 24)
4.Image Processing playlist by Shiram Vasudevan (page 2-8)
https://youtube.com/playlist?list=PL3uLubnzL2TkQ5ZpBIpX34t0cpggGMuIF&si=8MGZO0IfP
PPHXvRE
5.Image Processing | First principles of Computer Vision (page 2-4, 9)
https://youtube.com/playlist?list=PL2zRqk16wsdorCSZ5GWZQr1EMWXs2TDeu&si=MmuINyxy0c-AcZC
6.Example Gaussian Filter | Udacity (page 5, 6)
https://youtu.be/-AuwMJAqjJc?si=aI_RNkSJSmVcbsiM
7."Image Processing 101: Understanding Image Filters and Convolutions" by Towards Data
Science (page 2-6)
A Comprehensive Guide to Image Processing: Fundamentals | by YaΔmur ÇiΔdem AktaΕ |
Towards Data Science
8."A Comprehensive Guide to Edge Detection in Image Processing" by Analytics Vidhya
(page 7, 8)
Comprehensive Guide to Edge Detection Algorithms - Analytics Vidhya
9.Convolutional Neural Network Tutorial (CNN) | How CNN Works | Deep Learning
Tutorial | Simplilearn (page 10-13)
10.Convolutional Neural Network Coding Lane Playlist (page 10-16)
https://youtube.com/playlist?list=PLuhqtP7jdD8CD6rOWy20INGM44kULvrHu&si=KW
SOZHoV3cwuWI6H
11.C4W1L09 Pooling Layers | DeepLearningAI (page 14, 15)
https://youtu.be/8oOgPUO-TBY?si=O_qSKlxsmRwz_ogk
12.What is Activation function in Neural Network ? Types of Activation Function in
Neural Network (page 16, 17)
https://youtu.be/Y9qdKsOHRjA?si=1Tfv6WrLHjQqnQyJ
13.The Sigmoid Function Clearly Explained (page 16, 17)
https://youtu.be/TPqr8t919YM?si=2fDgdEXtZQfmqxfo
14.What is YOLO algorithm? | Deep Learning Tutorial 31 (page 18, 19)
https://youtu.be/ag3DLKsl2vk?si=pCPf4oQVqYdgmPcR
15.Kalman Filter for Beginners (page 22, 23)
https://youtu.be/bm3cwEP2nUo?si=Tn7TUWb-q9pWRpji
16.Lidar vs. Tesla: the race for fully self driving cars (page 20, 21)
https://youtu.be/pUtJ8HPZRkw?si=Crlqstysa_Zeqrj9
17.Consumer Reports, 2022 Evaluation of Tesla’s Autopilot (page 23, 24)
18.The Truth About Self Driving Cars (page 23, 24)
https://youtu.be/d5TiaIYdug4?si=QyfIob8ugVRpjH5V
19.Stanford University, 2021 Study on Autonomous Vehicles in Urban and Rural Settings
(page 24)