Deep Learning Localization for Self-driving Cars Thesis

Rochester Institute of Technology
RIT Scholar Works
Thesis/Dissertation Collections
Deep Learning Localization for Self-driving Cars
Suvam Bag
Follow this and additional works at: http://scholarworks.rit.edu/theses
Recommended Citation
Bag, Suvam, "Deep Learning Localization for Self-driving Cars" (2017). Thesis. Rochester Institute of Technology. Accessed from
This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact ritscholarworks@rit.edu.
Deep Learning Localization for Self-driving Cars
Suvam Bag
February 2017
A Thesis Submitted
in Partial Fulfillment
of the Requirements for the Degree of
Master of Science
Computer Engineering
Department of Computer Engineering
Deep Learning Localization for Self-driving Cars
Suvam Bag
Committee Approval:
Dr. Raymond W. Ptucha Advisor
Associate Professor
Dr. Shanchieh J. Yang
Dr. Clark G. Hochgraf
Associate Professor
I would like to thank the Machine Intelligence Lab of Rochester Institute of Technology (RIT) for providing me the resources as well as inspiration to complete this
project, my adviser Dr. Raymond W. Ptucha for assisting me throughout my thesis
and my colleague Mr. Vishwas Venkatachalapathy for his valuable feedback. I would
also like to thank the autonomous people mover’s team of RIT for their help.
I dedicate this thesis to my parents for their endless support.
Smart cars have been present in our lives for a long time but only in the form of
science fiction. A number of movies and authors have visualized smart cars capable
of traveling to different locations and performing different activities. However, this
remained a nearly impossible task, almost a myth, until Stanford and then Google
were able to create the world’s first autonomous cars. The Defense Advanced
Research Projects Agency (DARPA) Grand Challenges brought this difficult problem
to the forefront and initiated much of the baseline technology that has made today’s
limited autonomous driving cars possible. These cars will make our roadways safer,
our environment cleaner, our roads less congested, and our lifestyles more efficient.
Despite the numerous challenges that remain, scientists generally agree that it is no
longer impossible. Besides several other challenges associated with building a smart
car, one of the core problems is localization. This project introduces a novel approach
for advanced localization performance by applying deep learning in the field of visual
odometry. The proposed method will have the ability to assist or replace a purely
Global Positioning System based localization approach with a vision based approach.
Signature Sheet
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Localization and mapping
1.3 Lack of identification of proper objects
1.4 Weather/Climate/Time of day
1.5 Vehicle to vehicle communication
1.6 Visual odometry
1.7 Visual odometry in localization
1.8 Loop closure detection in Simultaneous Localization and Mapping (SLAM)
1.9 CNN in vision based localization
2 Thesis Statement
3 Thesis Objectives
4 Background/Related Work
4.1 Artificial intelligence in self-driving cars
4.2 Localization
4.3 Deep learning
4.3.1 History and evolution
Traditional techniques
Convolutional Neural Networks
Case studies
Deep learning frameworks
5 Time of day/Weather invariant
5.1 Time of day
5.2 Weather invariant
6 Proposed method
6.1 Google Street View
6.1.1 Field of View
6.1.2 Heading
6.1.3 Pitch
6.2 Google Maps API
6.3 Dataset1
6.4 Dataset2
6.5 Autonomous golf cart
6.6 Camera
6.7 GPS ground truth
6.8 Dataset3
6.9 Dataset4
6.10 Dataset5
6.11 Datasets with smaller inter-class distance
6.12 Classifier
6.13 Time of day invariant
6.14 Weather invariant
6.15 Hierarchical
7 Results
8 Future Work
9 Conclusion
List of Figures
1.1 Google self-driving car.
1.2 Localization of a smart car [2].
1.3 Effect of different weather/climate [3].
1.4 Multi-hop system of ad-hoc networks [4].
1.5 Classification results from the course CS231n - Convolutional Neural
Networks for Visual Recognition [5].
4.1 Stanley [21].
4.2 Flowchart of the Boss software architecture [23].
4.3 Boss (http://www.tartanracing.org/gallery.html).
4.4 (left) GPS localization induces an error greater than or equal to 1 meter
(right) No noticeable error in particle filter localization [25].
4.5 Confusion matrix for individual traffic light detection [27].
4.6 (a) Left: Detections returned by Felzenszwalb’s state of the art car
detector [31]. (b) Right: Detections returned by the algorithm proposed in [30]. False positives are shown in red and the true positives
are shown in green.
4.7 Architecture of LeNet-5. Each plane is a feature map, i.e. a set of units
whose weights are constrained to be identical [38].
4.8 Nearest Neighbor classification using L1 distance example [5].
4.9 Backpropagation visualized through a circuit diagram [5].
4.10 ReLU activation function, which is zero when x<0 and then linear with
slope 1 when x>0 [5].
4.11 A 3-layer neural network with three inputs, two hidden layers of 4
neurons each and one output layer [5].
4.12 A cartoon depicting the effects of different learning rates. While lower
learning rates give linear improvements, the higher they go, the more
exponential the improvements become. Higher learning rates will decay the loss
faster, but they get stuck at worse values of loss (green line). This
is because there is too much “energy” in the optimization and the
parameters are bouncing around chaotically, unable to settle in a nice
spot in the optimization landscape [5].
4.13 Convolutional layer - input layer [32×32×3], filter dimension [5×5×3],
activation maps [28×28×1] (http://image.slidesharecdn.com/case-studyof-cnn-160501201532/95/case-study-of-convolutional-neural-network-5638.jpg?cb=1462133741).
4.14 Max pooling with a stride of 2 [5].
Street View car [58].
Street View images from Dataset1.
Zoomed in portion of Dataset2: Distance between locations based on
viewpoint.
Routes created out of Google Maps API.
Locations/classes of Dataset2.
RIT’s autonomous golf cart.
(left) Hikvision Bullet IP camera (right) Cameras on the golf cart.
Locations from Dataset3.
Images from Dataset3.
Locations from Dataset4.
Images from Dataset4.
Locations from Dataset5.
Images from Dataset5.
The network’s input is 150,528 dimensional, and the number of neurons
in the network’s remaining layers is given by 253,440-186,624-64,896-64,896-43,264-4096-4096-1000 [39].
Block diagram of the algorithm.
(left) Original image (middle) Brighter-afternoon (right) Darker-evening.
(left) Original image, (middle left) Synthetically added rain, (middle
right) Rain removed by the approach from [56], (right) Rain removed
by the approach from [57].
Venn diagram for hierarchical approach where Pn -(latitude,longitude)
and R-GPS precision/smaller region.
Validation loss vs number of epochs in Dataset2.
Correct vs incorrect predictions - Dataset3.
Validation accuracy vs number of epochs in Dataset3.
Ground truth of Google Street View vs predicted class from cross testing.
Motion estimation using visual odometry and deep learning localization.
List of Tables
Boundary coordinates of the rectangle.
Dataset1: parameters.
Dataset1 details.
Dataset2: parameters.
Dataset2 details.
Configuration of Hikvision cameras.
Dataset3 details.
Dataset4 details.
Dataset5 details.
Datasets with smaller inter-class distance.
Training hyperparameters.
Results.
Cross-testing with the three models.
Prediction from the model trained on the three datasets combined.
Results - Datasets with smaller inter-class distance.
Weather invariant results.
ANN     Artificial Neural Network
API     Application Program Interface
CNN     Convolutional Neural Network
DARPA   Defense Advanced Research Projects Agency
GPS     Global Positioning System
GPU     Graphics Processing Unit
IMU     Inertial Measurement Unit
IP      Internet Protocol
KNN     K-Nearest Neighbor
LIDAR   Light Detection and Ranging
RADAR   Radio Detection and Ranging
ReLU    Rectified Linear Unit
RMS     Root Mean Square
SLAM    Simultaneous Localization And Mapping
SVM     Support Vector Machine
Chapter 1
Introduction
Motivation
Research and development of autonomous vehicles is becoming more and more popular in the automotive industry. It is believed that autonomous vehicles are the future
for easy and efficient transportation that will make for safer, less congested roadways.
According to the Department of Transportation, road accidents in the US in 2014 took a
human toll of 32,000 deaths and 2.31 million injuries and cost $1 trillion. In recent
years, nearly all states have passed laws prohibiting the use of handheld devices while
driving. Nevada took a different approach. In a first for any state, it passed a law
that legalizes texting, provided one does so in a self-driving autonomous car. This
places Nevada at the forefront of innovation.
Google’s vast computing resources are crucial to the technology used in self-driving
cars. Google’s self-driving cars memorize the road infrastructure in minute detail.
They use computerized maps to determine where to drive, and to anticipate road
signs, traffic lights and roadblocks long before they are visible to the human eye.
They use specialized lasers, radar and cameras to analyze traffic at a speed faster
than the human brain can process. And they leverage the cloud to share information
at blazing speed. These self-driving cars have now traveled nearly 1.5 million miles
on public highways in California and Nevada. They have driven from San Francisco
to Los Angeles and around Lake Tahoe, and have even descended crooked Lombard
Street in San Francisco. They drive anywhere a car can legally drive. According to
Sebastian Thrun, “I am confident that our self-driving cars will transform mobility.
By this I mean they will affect all aspects of moving people and things around and
result in a fundamentally improved infrastructure.”
Figure 1.1: Google self-driving car.
Two examples include improvement in mobility and use of efficient parking. Take
today’s cities: they are full of parked cars. It is estimated that the average car is
immobile 96 percent of its lifetime. This situation leads to a world full of underused
cars and occupied parking spaces. Self-driving cars will enable car sharing even in
spread-out suburbs. A car will come to you just when you need it. And when you are
done with it, the car will just drive away, so you won’t even have to look for parking.
Self-driving cars can also change the way we use our highways. The European
Union has recently started a program to develop technologies for vehicle platoons
on public highways. Platooning is technical lingo for self-driving cars that drive so
closely together that they behave more like trains than individual cars. Research at
the University of California, Berkeley, has shown that the fuel consumption of trucks
can be reduced by up to 21 percent simply by drafting behind other trucks. And it is
easy to imagine that our highways can bear more cars if cars drive closer together.
Last but not least, self-driving cars will be good news for the millions of Americans
who are blind or have a brain injury, Alzheimer’s or Parkinson’s disease. Tens of millions
of Americans are denied the privilege of operating motor vehicles today because of
issues related to disability, health, or age.
How does the car see the road? General Motors’ Super Cruise and other similar systems do more
than just see the road. Using an array of sensors, lasers, radar, cameras, and GPS
technology, they can actually analyze a car’s surroundings. Super Cruise is a combination of two technologies. The first is the increasingly common adaptive cruise
control, which uses a long-range radar (more than 100 meters) in the grille to keep
the car a uniform distance behind another vehicle while maintaining a set speed. The
second, lane-centering, uses multiple cameras with machine-vision software to read
road lines and detect objects. This information is sent to a computer that processes
the data and adjusts the electrically assisted steering to keep the vehicle centered in
the lane. Because Super Cruise is intended only for highways, General Motors will
use the vehicle’s GPS to determine its location before allowing the driver to engage
the feature. In addition, General Motors is also considering using short-range radars
(30 to 50 meters) and extra ultrasonic sensors (3 meters) to enhance the vehicle’s
overall awareness. Cars with park-assist systems already have four similar sensors in
the front and in the rear of the car. General Motors is also experimenting with cost-effective LIDAR units, which are more powerful and accurate than ultrasonic sensors.
It’s unclear whether LIDAR will make it into the same vehicle as Super Cruise.
A smart car is a complex but wonderful example of technical innovation by humankind. Thousands of design constraints, complex hardware and software, as well as
sophisticated machine learning are involved in creating this technical marvel. While some
of these challenges lie within the car, others relate to the car’s environment, such as
communication with other cars, called Vehicle to Vehicle (V2V) communication, or
positioning the car locally with the help of LIDAR scans, lane markers, etc. Some
of the key challenges are described below.
Localization and mapping
One of the most important requirements for a self-driving car is to know where it is. In
technical terms this is called localization. Localization essentially lays the groundwork
for autonomous cars, and it cannot be achieved with a single sensor or
a single simple algorithm. Another task self-driving cars must tackle is mapping.
A static map from Google Maps or other satellite-generated maps
is not enough for navigation, as the real-world environment is quite dynamic.
To adapt to these changes, the car creates a local map
of its own and integrates it with the global static map from Google Maps to
identify its exact location. This local map is created by the various sensors
present in the car, such as a high-dimensional LIDAR, a RADAR and multiple cameras.
After getting a sense of its local environment, the car can continue executing its
navigation algorithm to reach its destination. Most autonomous cars, such as
Google’s smart car, utilize the A* path planning algorithm along with a number of
filters to predict and improve their position estimates [1]. A* is at the very core
of path planning in the Google car, where it has been improved considerably
using dynamic planning, control parameters, stochastic solutions, path smoothing,
etc. Path planning, navigation, and obstacle avoidance have numerous challenges
associated with them, many of which have not been solved yet. Although it is difficult
to visualize the complete algorithm in a single image, Figure 1.2 tries to portray
how navigation algorithms can be combined with local maps to drive an autonomous car.
Figure 1.2: Localization of a smart car [2].
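As a minimal sketch of the A* search mentioned above, the following Python example finds a shortest path on a small occupancy grid. The grid layout, the 4-connected unit-cost moves and the Manhattan-distance heuristic are assumptions made for illustration; they are not taken from the planner of any particular car, which is not described in detail here.

```python
import heapq


def a_star(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) and 1 (blocked) cells.

    Returns the path as a list of (row, col) cells, or None if no path exists.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates the cost of unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # (estimated total, cost so far, cell)
    came_from = {start: None}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


# Hypothetical local map: the middle row is blocked except for one gap.
local_map = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
route = a_star(local_map, (0, 0), (2, 0))
print(route)
```

In a real vehicle the cost of each move would also reflect dynamics, smoothness and obstacle clearance, which is where the improvements listed above come in.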
One of the primary challenges faced by any car, be it smart or not, is to locate
itself and plan its route based on inaccurate GPS coordinates. The average root mean
square GPS precision of Google Maps is 2-10 meters. This precision is sufficient
for a human being to drive and plan his/her next move, but it is not sufficient for a
self-driving car. For example, let’s consider that the car has to take a left turn upon
reaching a certain point. If this point is not located accurately, the car can stray out of its
lane and cause an accident. This is exactly where the sensors help. While the detailed
algorithm used by Google smart cars is unknown, a fraction of that knowledge does
help us in understanding how to use these sensors to improve localization. Their
primary tools are Google Maps for path planning, Google Street View for learning
maps, and different sensors for lane marking, corner detection, obstacle detection,
etc. To improve the sensor data, various filters are also used, such as Monte Carlo and
Kalman filters.
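How a Kalman filter can tighten a noisy GPS fix can be sketched in one dimension. The example below is an illustration only: the variances (a GPS standard deviation of 5 m, which lies within the 2-10 meter range quoted above, and a more precise hypothetical lane-marker sensor) are assumed values, and a real vehicle would track a multi-dimensional state.

```python
def kalman_predict(mean, var, motion, motion_var):
    """Shift the position estimate by the commanded motion; uncertainty grows."""
    return mean + motion, var + motion_var


def kalman_update(mean, var, measurement, meas_var):
    """Fuse the current estimate with one noisy measurement; uncertainty shrinks."""
    gain = var / (var + meas_var)  # how much to trust the measurement
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var


# Assumed starting point: a GPS fix along the road at 100 m with variance 25 m^2.
mean, var = 100.0, 25.0
# The car moves about 2 m forward (assumed odometry variance 0.25 m^2).
mean, var = kalman_predict(mean, var, motion=2.0, motion_var=0.25)
# A hypothetical, more precise local sensor reports 101 m (variance 0.25 m^2).
mean, var = kalman_update(mean, var, measurement=101.0, meas_var=0.25)
print(mean, var)
```

After the update the variance is smaller than that of either source alone, which is the reason such filters are combined with local sensors rather than relying on GPS by itself.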
Lack of identification of proper objects
Despite using state-of-the-art object recognition algorithms, the Google smart car
lacks the ability to identify what we may consider to be simple objects. For example,
it fails to differentiate between pedestrians and police officers. It cannot differentiate
between puddles and potholes, nor can it recognize people performing specific actions,
such as a police officer signaling the car to stop. It also has difficulty parking itself in a
parking lot. Although these tasks might seem relatively easy compared to what has already
been achieved, they are actually difficult and require sophisticated machine learning. The
better the learning algorithm, the better the car will be able to recognize these actions
or objects on the road.
Weather/Climate/Time of day
The capability of most sensors changes drastically under different weather conditions
and at different times of day. For example, a LIDAR performs very well in clear conditions
but its accuracy drops significantly in rainy or snowy conditions. A lot of
research and improvement is needed on this front so that autonomous cars
become acceptable in every country and every state, and perform
with the same precision throughout the day. Figure 1.3 shows the effect of different
weather conditions at the same location. It is evident that a model learned on
images from any single type of climate would behave very differently on the others. A lot
of research has been, and still is, devoted to making models invariant to weather and climate.
Figure 1.3: Effect of different weather/climate [3].
Vehicle to vehicle communication
Moving on to a different challenge faced by a system of smart cars, it is
essential that such a system in the future allows the cars to communicate
with each other. This is where vehicle to vehicle (V2V) communication comes into the
picture. V2V communication proposes that cars will communicate with each other
in real time, creating an interconnected smart system. V2V communication is
not necessarily limited to smart cars. Nearly every manufacturer today is working
on implementing this technology in its cars, hoping to significantly reduce the
number of road accidents every year. With V2V communication, each car lets other cars
know its position, speed, future intentions, etc. Naturally we can see why this technology will
be imperative in smart cars. One of the many challenges of designing such a system
is to create a multi-hop system of ad-hoc networks (Figure 1.4) that is powerful, secure, and
effective enough to communicate in real time. A number of models
have been proposed to make this realizable, and a lot of research is going on in this area.
Figure 1.4: Multi-hop system of ad-hoc networks [4].
Visual odometry
Visual odometry (VO) [6] is a form of localization which has been proposed for
use on cars for some time. This mode of localization is not limited to smart cars.
In fact, significant research is going on in universities as well as
manufacturing industries to develop VO models. A critical task of VO is to identify
specific landmarks or objects in the surrounding environment of the moving vehicle
to improve the car’s vision, position, communication with other cars, etc.
Visual odometry in localization
VO plays a key role in an autonomous car’s path navigation. For example, let’s say
an autonomous car is driving through unknown territory where it cannot connect
to the satellite map due to a weak signal, or gets inaccurate data due to GPS error.
Based on previously learned databases, the vehicle can identify certain key objects to
help determine its location. A number of well-known feature extraction techniques are used
for VO, namely Scale-Invariant Feature Transform (SIFT) [7], Speeded Up Robust Features (SURF) [8], Binary Robust Invariant Scalable Keypoints (BRISK) [9],
BRIEF [10], ORB [11], etc. Although SIFT has been shown to be highly dependable
in extracting a large number of features, it is a computationally intensive algorithm
and not suitable for real-time applications like this.
Loop closure detection in Simultaneous Localization and
Mapping (SLAM)
Simultaneous Localization and Mapping (SLAM) has been the driving force behind
autonomous cars’ primary algorithms for a long time. It involves a lot of difficult tasks
which have been partially or completely achieved over the years. However, one of the
challenges associated with SLAM is solving the loop closure problem using visual information in life-long situations. Loop closure detection is primarily defined
as the computer’s ability to detect whether it has returned to a previously visited place after traveling a certain distance. The difficulty of this task lies in the strong appearance changes
that a place undergoes due to dynamic elements, illumination, weather or seasons. Based
on research in academia as well as in industry, this area of SLAM has not been
perfected yet. Vision-based localization plays a key role here. There are
some well-known existing approaches to this problem, such as FAB-MAP, which is
a topological appearance-based SLAM. There are multiple papers on this approach,
and it has been regarded as one of the most reliable and stable approaches [19].
Since changes in illumination and weather affect place recognition to a significant
extent, [20] proposed a different approach called Seq-SLAM to address this
problem. Seq-SLAM removes the need for global matching performance by calculating the best candidate matching location within every local navigation sequence,
instead of calculating the single location most likely given a current image. Following
these approaches, other improved methods have
used traditional and modified feature detectors to address this problem
[14]. A very interesting part of this problem is bidirectional loop closure detection
[15], which tests the autonomous car’s ability to detect its location irrespective of its
direction of approach.
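The idea of matching a whole local sequence rather than a single image can be sketched as follows. This is a simplified illustration and not the published Seq-SLAM method [20]: it assumes that the query and the database were recorded at the same speed, it represents each frame by a short hypothetical descriptor vector, and it uses a plain sum of absolute differences as the frame distance.

```python
def frame_difference(a, b):
    """Sum of absolute differences between two frame descriptors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_sequence_match(query, database):
    """Find where a short query sequence best aligns with the database.

    Returns (start_index, score): the database index at which the query
    sequence starts, and the accumulated difference along that alignment.
    """
    n = len(query)
    best_start, best_score = None, float("inf")
    for start in range(len(database) - n + 1):
        # Accumulate the difference over the whole sequence, so one
        # ambiguous frame cannot decide the match on its own.
        score = sum(frame_difference(query[k], database[start + k]) for k in range(n))
        if score < best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical descriptors of five previously visited places.
database = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]
# Two consecutive frames seen now, slightly changed by lighting.
query = [[2.1, 2.0], [3.0, 2.9]]
print(best_sequence_match(query, database))
```

A low accumulated score at some alignment suggests that the vehicle is revisiting that stretch of the route, which is the loop closure event described above.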
CNN in vision based localization
In recent years deep learning has revolutionized the world of machine learning, pattern
recognition, computer vision, robotics etc. In many of the cases, it has been found
that deep learning is able to produce better detection ability due to its sophisticated
filtering through its multiple layers. CNNs are feed-forward artificial neural networks
where the individual neurons are tiled in such a way that they respond to overlapping
regions in the visual field. In the world of deep learning, CNNs have mostly proved
to yield better results than traditional techniques which use hand-crafted features
for detection [12]. In 2015, Lin et al. [13] presented an excellent
research paper using CNNs for ground-to-aerial geolocalization. This paper once again
demonstrates the importance of vision-based localization and its ability to improve current
localization methods including GPS. Figure 1.5 shows an example from a live CNN
running in the browser from the course CS231n - Convolutional Neural Networks for
Visual Recognition [5], predicting a random test image in real time.
Figure 1.5: Classification results from the course CS231n - Convolutional Neural Networks
for Visual Recognition [5].
The primary advantage of using CNNs in vision-based localization is their ability
to preprocess images through their layers. CNNs have proved to be more effective in
matching images and identifying objects in recent years and may make traditional feature detection techniques obsolete in the future. One of the major advantages of the
CNN architecture is that it learns features jointly with the classifier, giving it an advantage
over hand-crafted features paired with conventional classifiers. The disadvantage of
CNNs lies in their requirement for huge training datasets, which makes training
computationally expensive. However, despite the arduous training time, processing test
frames is extremely efficient, making CNNs suitable for real-time applications. Deep learning frameworks like Caffe [16], Torch [17], TensorFlow [18], etc. have addressed the
training problem with the help of GPUs and have made the learning process much
more efficient. To combat the need for large datasets, a concept called transfer learning is often used. Transfer learning is a technique for learning new layer weights in a
neural network for a given dataset, starting from filters pre-learned on a different dataset. It
often works quite well, because neural networks can adapt their weights
when the datasets are somewhat similar in nature. Since these pre-learned weights are often
open sourced on the internet, users can take advantage of them quite easily, thus avoiding
the need to train on huge datasets end to end.
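The principle of transfer learning can be illustrated with a deliberately tiny, pure-Python sketch. Everything in it is assumed for illustration: the two "pre-learned" filters stand in for the early layers of a network trained on a large dataset, the four samples stand in for a small new dataset, and the new final layer is a single logistic unit trained by gradient descent. A real experiment would use a framework such as those cited above together with published weights.

```python
import math

# Hypothetical filters standing in for layers pre-learned on another dataset.
PRETRAINED_FILTERS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_features(x):
    """Apply the pre-learned filters with a ReLU; these weights are never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(f, x))) for f in PRETRAINED_FILTERS]


def train_head(samples, labels, epochs=1000, lr=0.1):
    """Train only the new final layer (logistic regression) on frozen features."""
    weights = [0.0] * len(PRETRAINED_FILTERS)
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = frozen_features(x)
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    """Return class 1 if the new layer's score is positive, else class 0."""
    feats = frozen_features(x)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if z > 0 else 0


# Small hypothetical dataset for the new task.
samples = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels = [1, 1, 0, 0]
weights, bias = train_head(samples, labels)
print([predict(x, weights, bias) for x in samples])
```

Only the handful of parameters in the new layer are learned here, which is why far less data is needed than for end-to-end training.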
Chapter 2
Thesis Statement
Identifying the location of an autonomous car with the help of visual sensors can be an
excellent alternative to traditional approaches like Global Positioning Systems (GPS),
which are often inaccurate or unavailable due to insufficient network coverage. Recent
revolutionary research in deep learning has produced excellent results in different
domains, leading to the proposition of this thesis, which intends to solve the problem
of localization in smart cars using deep learning models on visual data.
Chapter 3
Thesis Objectives
The primary objective of the thesis is to develop an efficient algorithm that helps an
autonomous car localize itself more accurately and as close to real time as
possible. This will be done by utilizing deep CNNs to identify the car’s location. Experiments will be done to determine if Google Street View can be used either for the
supervised component of localization on the RIT campus, or as a source for transfer learning. If
new data has to be collected, a camera with GPS tagged frames will be utilized,
whereby experiments will determine the amount of data needed to be recorded per
locale for accurate localization. The efficacy of the CNN models across different
weather/light conditions will be investigated.
An efficient model will not only improve smart cars’ localization, but it will also
improve traditional cars’ vision and might be able to reduce the number of accidents.
End-to-end learning will be compared with fine-tuning existing models trained on
ImageNet [17] such as AlexNet [18], VGGNet [19], GoogLeNet [20] and ResNet [21], including
rigorous manipulation of different hyperparameters. Extensive experiments will be
conducted before establishing this method as a novel alternative to localization by GPS.
Chapter 4
Background/Related Work
Artificial intelligence in self-driving cars
As of 2016, autonomous vehicles are no longer products of science fiction or just long-term visions of research and development departments of different corporations. This
success began with the self-driving car “Stanley”, which won the
2005 DARPA Grand Challenge [21].
Figure 4.1: Stanley [21].
Besides other traditional sensors, this car had five laser range finders for measuring
cross-sections of the terrain ahead up to 25m in front of the vehicle, a color camera for
long-range road perception, and two 24 GHz RADAR sensors for long range detection
of large obstacles. Despite winning the challenge, it left open a number of important
problems like adapting to dynamic environments from a given static map or the ability
to differentiate between objects with subtle differences. One of the important results
observed from this race and highly relevant to this thesis research was the fact that
during 4.7% of the challenge, the GPS reported an error of 60 cm or more. This highlighted
the importance of online mapping and path planning in the race. It also proved that
a system solely dependent on GPS coordinates for navigation in self-driving cars is
not sufficient, as the error tolerance for autonomous vehicles is around 10 cm. The
real time update of the global map based on the local environment helped Stanley to
eliminate this problem in most of the cases.
The 2005 DARPA Grand Challenge was conducted on a desert track. Stanley
won this challenge, but a lot of new challenges were foreseen from the results. The
next challenge conducted by DARPA was in an urban environment. Stanford introduces the successor to Stanley named ”Junior” [22]. Junior was equipped with
five laser rangefinders, a GPS-aided inertial navigation system and five radars as its
environment perception sensors. The vehicle had an obstacle detection range of upto
120 meters and the ability to attain a maximum velocity of 30mph. A combination
of planning, perception followed by control helped in its navigation. The software
architecture primarily consisted of five modules - sensor interfaces, perception modules, navigation modules, drive-by-wire interface and global services. The perception
modules were responsible for segmenting the environment data into moving vehicles
and static obstacles. They also provided precision localization of the vehicle relative
to the digital map of the environment. One of the major successes of this car was
its successful completion of the journey with almost flawless static obstacle detection.
However, it was found that the GPS-based inertial position computed by the
software system was generally not accurate enough to perform reliable lane keeping
without sensor feedback. Hence Junior used an error correction system for accurate
localization with the help of feedback from other local sensors. This fine-grained
localization used two types of information: road reflectivity and curb-like obstacles.
The reflectivity was sensed using the laser range finders, pointed towards the ground.
The filter for localization was a 1-D histogram filter which was used to estimate the
vehicle’s lateral offset relative to the provided GPS coordinates for the desired path
to be followed. Based on the reflectivity and the curbs within the vehicle’s visibility,
the filter would estimate the posterior distribution of any lateral offset. In a way
similar to reinforcement learning, it favored offsets for which lane marker reflectivity
patterns aligned with the lane markers or the road side from the supplied coordinates
of the path. It also penalized offsets for which an observed curb would reach into the
driving corridor of the assumed coordinates. As a result, at any point in time the
vehicle estimated a fine-grained offset to the location measured by the GPS-based
system. A lateral offset of one meter was common in the challenge. Without
this error correction system, the car would often have gone off the road or hit a curb.
It was observed that velocity estimates from the pose estimation system were much
more stable than the position estimates, even when GPS feedback was not strong.
X and Y velocities were particularly resistant to jumps because they were partially
observed by wheel odometry.
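The measurement update of such a histogram filter can be illustrated with a minimal
Python sketch. This is not Junior's implementation: the bin spacing, the Gaussian
likelihood and its width `sigma`, and the two observed offsets are assumptions chosen
purely for illustration, and the curb term is omitted.

```python
import math

def gaussian_likelihood(candidate, observed, sigma):
    """Unnormalized likelihood of an observation given a candidate offset."""
    return math.exp(-0.5 * ((candidate - observed) / sigma) ** 2)

def measurement_update(belief, offsets, observed_offset, sigma=0.2):
    """Bayes rule on a discrete grid: prior times likelihood, then normalize."""
    posterior = [b * gaussian_likelihood(o, observed_offset, sigma)
                 for b, o in zip(belief, offsets)]
    total = sum(posterior)
    return [p / total for p in posterior]

# Candidate lateral offsets from -1.0 m to +1.0 m in 0.25 m bins (assumed grid).
offsets = [-1.0 + 0.25 * i for i in range(9)]
belief = [1.0 / len(offsets)] * len(offsets)  # uniform prior over the bins

# Two hypothetical offsets suggested by lane-marker reflectivity, in meters.
for observed in (0.5, 0.45):
    belief = measurement_update(belief, offsets, observed)

# The most probable bin is the correction applied to the GPS-based position.
best_offset = offsets[belief.index(max(belief))]
print(best_offset)
```

In this sketch the belief concentrates on the bin closest to the observations, and
that bin plays the role of the fine-grained lateral offset described above.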
Junior finished second in the DARPA Urban Challenge; the winner was a
different autonomous vehicle called ”Boss”, developed by the Tartan Racing team.
The team was composed of students, staff and researchers from many institutions such as
Carnegie Mellon University and General Motors. Many members of this team,
as well as of Junior’s, would later be associated with the Google self-driving car project.
This included the Tartan team’s technology leader, Chris Urmson, and one of the
most celebrated people in the autonomous vehicles industry, Junior’s team leader
Sebastian Thrun.
Very similar to Junior, Boss also had an arsenal of sensors with a very advanced
and complex system architecture [23]. The software architecture of Boss was broadly
divided into the following layers: 1) motion planning subsystem, 2) perception subsystem,
3) mission planner, 4) behavioral system. The motion planning system was
responsible for handling static and dynamic objects under two different scenarios,
structured driving and unstructured driving. While structured driving concentrates
on typical marked roads, unstructured driving concentrates on more difficult scenarios
such as parking lots. Unstructured driving is more difficult because of the lack of
lanes and markings. Here, Boss used a four dimensional search space (position,
orientation, direction of travel). In both cases, the end result is a trajectory
good enough to reach the destination. The perception module was responsible for
sensing the local environment with the help of sensors. This module created a local
dynamic map, which was merged with the static global map, while simultaneously
localizing the vehicle. The mission planner also played a key role in the navigation
of the car. Using only the navigation wasn’t good enough for winning the race. This
is where the mission planner helped. It took all the constraints of different routes
into consideration and created an optimal path. Finally, the behavioral system took
all the information provided by the mission planner and fed it to the motion planner.
It also handled all the errors when there were problems. This subsystem was
roughly divided into three sub-components: lane driving, intersection handling, and
goal selection. The flowchart of the process is shown in Figure 4.2.
Figure 4.2: Flowchart of the Boss software architecture [23].
As in Stanley and Junior, a local path is generated in Boss using a navigation algorithm
called Anytime D* [24]. This search algorithm is efficient in updating its
solution to account for new information concerning the environment like a dynamic
obstacle. Since D* is influenced by dynamic programming, it already has all the
possible routes stored in memory and can just update and select a new route avoiding
the dynamic obstacle’s coordinates. The motion planner stores a set of trajectories
to a few immediate local goals near the centerline path to create a robust desired
path in case it needs to avoid static and dynamic obstacles and modify its path. The
local goals are placed at a fixed longitudinal distance down the centerline path but
vary in lateral offset from the path to provide several options for the planner. [23]
presents a detailed description of all the modules and their sub-components. To keep
the discussion relevant to this research, the localization module is discussed
in greater detail than the others. Roadmap localization in particular is
one of the most important requirements of autonomous vehicles. Boss was capable
of either estimating road geometry or localizing itself relative to roads with known
geometry. One of the primary challenges in urban driving is responding to the abrupt
changes in the shape of the roads and local disturbances. Given that the shape and
location of paved roads change infrequently, their approach was to localize relative
to paved roads and estimate the shape of dirt roads, which change geometry more
frequently. The pose was incorporated into the coordinate frame from the GPS feedback. To do this, it combined data from a commercially available position estimation
system and measurements of road lane markers with an annotated map. Eventually
the nominal performance was improved to a 0.1m planar expected positioning error.
Normally a positioning accuracy of 0.1m would be sufficient to blindly localize within
a lane, but the correction signals were frequently disrupted by even small amounts of
overhead vegetation. Once disrupted, the signal’s reacquisition took approximately
a half hour. Thus, relying on the corrections was not viable for urban driving. Furthermore, lane geometries might not be known to meter accuracies a priori. It was
critically important to be localized correctly relative to the lane boundaries, because
crossing over the lane center could have disastrous consequences. During the test, the
error in the lateral position reached up to 2.5m, more than enough to put
the vehicle either off the road or in another lane if not compensated for. To conclude,
the roadmap localization system played an important role during the challenge. The
error estimate in the roadmap localization system was less than 0.5m for most of the
challenge, but there were more than 16 minutes during which the error was greater than
0.5m, with a peak error of 2.5m. Without the roadmap localization system being
active, Boss would most likely have been either off the road or in a wrong lane for a
significant amount of time.
Following the success of the various autonomous vehicles in the 2007 ”DARPA
Urban Challenge”, Google started building its own self-driving car, collaborating
with various academic institutions and industries. Unfortunately not many research
papers are available on the Google self-driving car, but many of its
sensor fusion technologies and algorithms were inspired by the ones used in the
self-driving cars that participated in the DARPA challenges. Research that is pertinent
to this thesis is discussed in the following sections.
Figure 4.3: Boss (http://www.tartanracing.org/gallery.html).
Localization is one of the most important factors in autonomous driving, if not the
most important one. Although localizing a robot within a certain tolerance is possible
using existing technologies like GPS without much effort, it is often not sufficient for
autonomous driving. Precise localization with respect to the dynamically changing
environment is quite challenging and is a problem researchers have been trying to
tackle for quite some time. Levinson et al. [25] proposed a method of high-accuracy
localization of mobile vehicles by utilizing static maps of urban environments but
updating them with the help of GPS, IMU, wheel odometry, and LIDAR data acquired
by the vehicle. It also removed the dynamic objects in the environment, providing
a 2-D surface image of ground reflectivity in the infrared spectrum with 5cm pixel
resolution. For the final step, a particle filter method was used for correlating LIDAR
measurements with the map. The primary contribution of this research was an
innovative method to separate the dynamic obstacles from the static ones and create
a final map which could be used for autonomous driving. It addressed the problem of
environment dynamics by reducing the map to only those features that were very likely
to be static. In particular, using 3-D LIDAR information, only the flat road surface
was retained, thereby removing the imprints of potentially dynamic objects like
non-stationary cars. The resulting map was then simply an overhead image of the road
surface, where the image brightness corresponded to the infrared reflectivity. Once
the map was built, a particle filter method was used to localize the vehicle in real
time. This system was able to track the location of the vehicle with a relative accuracy
of ~10cm in most cases. Experiments were also conducted to track the vehicle using
only GPS data. This failed within just 10 meters, showing that GPS alone is
insufficient for autonomous driving.
Figure 4.4: (left) GPS localization induces an error of 1 meter or more. (right) No
noticeable error in particle filter localization [25].
While [25] considered the dynamic obstacles on the map as binary data, i.e. either
true or false, [26] treated them as probabilities, resulting in a probabilistic grid map
where every cell was represented as its own Gaussian distribution over remittance
values. Using offline SLAM to align multiple passes of the same environment, possibly
separated in time by days or even months, it was possible to build an increasingly
robust understanding of the world that could be exploited for localization. Instead of
having to explicitly decide whether each measurement in the grid either was or was
not part of the static environment, the sum of all observed data would be considered
and the variances from each section of the map would be modeled. Each cell in the
final map would store both the average infrared reflectivity observed at the particular
location and the variance of those values, very similar to a Kalman filter.
As in [25], the filter (here a 2-dimensional histogram filter) comprised a motion update,
to reduce confidence in the estimate based on motion, and a measurement update, to
increase confidence in the estimate based on sensor data. Although the motion update
was theoretically quadratic in the number of cells, in practice considering only the
neighboring cells was perfectly acceptable. The motion update would be followed
by the measurement update, in which the incoming laser scans would be used to refine
the vehicle’s velocity. The results were quite impressive: during a ten-minute drive, an
RMS lateral correction of 66cm was necessary, and the localizer was able to correct
large errors of up to 1.5 meters. The resulting error after localization was extremely
low, with an RMS value of 9cm. The autonomous car was able to drive in densely
populated urban environments like 11th Avenue in downtown Manhattan, which
would have been impossible before without such high precision in localization. In
conclusion, this method was able to improve the precision of the best GPS/IMU systems
available by an order of magnitude, both vertically and horizontally, thus enabling
decimeter-level accuracy which is more than sufficient for autonomous driving.
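The idea of storing a mean and a variance per map cell can be sketched in a few lines
of Python. The sketch below is only an illustration and not the implementation of [26]:
the class name `MapCell`, the use of Welford's running update, the variance floor
`min_var` and all reflectivity values are assumptions.

```python
import math

class MapCell:
    """One grid cell holding a running mean and variance of observed reflectivity."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        """Welford's update: fold one more observation into the cell."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0

    def likelihood(self, value, min_var=1e-3):
        """Gaussian density of a new reading under the cell's distribution."""
        var = max(self.variance, min_var)
        return math.exp(-0.5 * (value - self.mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# A cell that looked the same on every pass (hypothetical readings).
stable = MapCell()
for reading in (10.0, 11.0, 9.0, 10.0):
    stable.add(reading)

# A cell that was sometimes covered by a parked car (hypothetical readings).
unstable = MapCell()
for reading in (10.0, 60.0, 12.0, 58.0):
    unstable.add(reading)

print(stable.mean, stable.variance)
print(unstable.mean, unstable.variance)
```

A reading that disagrees with a low-variance cell is penalized much more strongly than
the same disagreement with a high-variance cell, so unreliable parts of the map
contribute less to the localization without ever being labeled as dynamic explicitly.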
Detection of traffic lights is another important factor in autonomous driving. Although
there has been past research on algorithms to detect traffic lights by applying
simple techniques of computer vision, they have never been applicable in the real world
because of various issues. According to [27], the existing reliable systems for determining
traffic light state information at the time required explicit communication between
a traffic signal and vehicle. [27] presents a passive camera-based pipeline for traffic
light state detection, using imperfect localization and prior knowledge of traffic light
location. Taking advantage of temporal information from recorded video data and
location data from GPS, the proposed technique estimated the actual light location
and state using a histogram filter.
One of the problems arising from vision data is the inability to distinguish between
tail lights and traffic lights, especially at night. Hence [27] boiled the problem down
to two points: a) inferring the image region which corresponded to the traffic light,
and b) inferring its state by analyzing the acquired intensity pattern. Ideally, both
problems could be solved by taking advantage of temporal consistency. The
choice of the detection grid as a reference frame assumed several structured error
components, allowing the light’s position within the grid to change slowly over time.
Given this temporal constraint, and the vision algorithm performing within the limits
of reasonable expectations, a histogram filter could be applied to infer the image region
of the light and determine the color. This approach obtained fairly good results,
achieving 91.7% accuracy for individual light detections and 94.0% accuracy across
intersections at three times of the day. The confusion matrix of the classification
results for individual light detections is shown in Figure 4.5.
Figure 4.5: Confusion matrix for individual traffic light detection [27].
Since correct object detection in real time can play a very important role in
the behavior of autonomous cars in dealing with different situations, a lot of research
has been done in this field as well. [28] suggests a new track classification method,
based on a mathematically principled method of combining log odds estimators. The
method was fast enough for real time use, and non-specific to object class, while
performing well (98.5% accuracy) on the task of classifying correctly-tracked, well
segmented objects into car, pedestrian, bicyclist, and background classes. A common
problem in the world of machine learning is the availability of sufficient data for
training datasets. Another problem is to recognize new data using a model trained
on a different dataset. There are cases where labeled data is unavailable, and
segmentation is preferred over classification. Different kinds of learning algorithms work
in each case, such as unsupervised learning in the last case. [29] proposes a method based
on the expectation-maximization (EM) algorithm. EM iteratively 1) trains a classifier
and 2) extracts useful training examples from unlabeled data by exploiting tracking
information. Given only three examples per object class, their research reports the
same accuracy as a fully supervised approach on a large amount of data.
Robotics applications can usually benefit a lot from semi-supervised learning, as
they often face situations where they have to make decisions from unforeseen examples
in a timely manner. In reality, the behavior of an autonomous vehicle depends
far more on proper detection of other cars on the road than of other objects,
as localizing other cars in its view is essential in the decision making process.
Despite being highly important, efficiently detecting cars in real-world images is fairly
difficult because of various problems like occlusion and unexpected occurrences. [30]
takes advantage of context and scale to build a monocular single-frame image based
car detector. The system proposed by this paper uses a probabilistic model to combine
various evidences for both context and scale to locate cars in real-world images. By
using a calibrated camera and localization on a road map, it was possible for the
authors of the paper to obtain context and scale information from a single image
without the use of a 3D laser. A latent SVM was used to learn a set of car templates
representing six different car orientations for the outline of the car, referred to as
”root” templates and six ”part” templates containing more detailed representations of
different sections of a car in an image. Each of these templates was convolved with the
gradient of the test image at multiple scales with probable locations at high responses.
For each bounding box, two scores were computed based on how scale-appropriate
the size of the box was given its location on the image. This helped in removing
some of the false positives. The scores were computed by estimating the real world
height of the object using the camera’s known position and orientation. Considering
the autonomous vehicle had some sort of localization and mapping system, the search
for the cars could be further fine-tuned by eliminating unlikely locations in the image
like the sky or a tree, basically limiting the search region to the roads. Using the
appearance, scale and context scores, a new prediction for each bounding box was
estimated using the dual form of L2-regularized logistic regression.
Figure 4.6: (a) Left: Detections returned by Felzenszwalb’s state of the art car detector
[31]. (b) Right: Detections returned by the algorithm proposed in [30]. False positives are
shown in red and the true positives are shown in green.
This algorithm achieved an average precision of 52.9% which was 9.4% better
than the baseline at the time. The primary contribution of this paper was achieving
good accuracy using only a vision based system, thus paving the way for eliminating
the need for expensive sensors like LIDAR or RADAR. The results are shown
in Figure 4.6. It must be noted here that although these results were impressive at
the time, recent deep learning based object detection frameworks
have proved to be better at identifying and localizing objects in images. Some of
these frameworks are discussed in further detail in the next section. A lot
of research has been done on precise tracking of other vehicles leveraging both 3D
and 2D data. [32] takes advantage of 3D sparse laser data and dense color 2D data
from the camera to obtain accurate velocity estimates of mobile vehicles by combining
points from both data sources into a dense colored point cloud. A precise estimate
of the vehicle’s velocity could be obtained using a color-augmented search algorithm
to align the dense color point clouds from successive time frames. Improving on
[32], [33] proposes a method which combines 3D shape, color and motion cues to
accurately track moving objects in real time. Adding color and motion information
gave a large benefit, especially for distant objects or objects under heavy occlusions,
where detailed 3D shape information might not be available. This approach claimed
to outperform the baseline accuracy by at least 10% with a runtime overhead of 0.7
M. Yu and U.D. Manu in [65] developed an Android application for Stanford
navigation using GPS tagged visual data. They used only 114 training images with
three to five images per location for 35 locations. Hand-tuned features and classical
techniques such as SIFT, K-means and RANSAC were used for classification. This
paper obtained a validation accuracy of 71% on 640×480×3 resolution images and
42% on 480×420×3 and 320×240×3 resolution images. The prediction time for a
query image was 50 seconds on average, hence not real time. A.R. Zamir and M. Shah
in [66] used an approach somewhat similar to this research. They created a dataset
using 10k images from Google Street View with each image at a 12m separation.
For each location, the dataset had five images out of which four were side-view images
and the other one covered the upper hemisphere view. This paper used SIFT for its
features and the Nearest Neighbor tree (FLANN) as its classifier. 60% of the images
from the test set were predicted within 100 meters of the ground truth. This paper
used a concept called the Confidence of Localization (COL) to improve the accuracy.
Deep learning
History and evolution
The last five years, starting from 2012, have ushered in a lot of success in the
world of machine learning, especially due to the boom in deep learning. Although it
may seem that deep neural networks were invented very recently, they were conceived
of in the 1980s. While these early architectures did not have the exact structure
that is present today, their underlying concept is very similar. Before diving into
the detailed working mechanism of Convolutional Neural Networks (CNNs), revisiting
their origin and why they became successful in recent years can lead to a
better understanding of deep learning. [34] presented the first general, working
learning algorithm for supervised deep feedforward multilayer perceptrons. In 1971, [35]
described a deep network with 8 layers. It was trained on a computer identification
system known as ”Alpha”. Other working deep learning architectures, especially
those built from ANNs, date back to 1980 [36]. The architecture of this network was
relatively simple compared to networks that are present today. It was composed of
alternating cells known as simple and complex cells in a sandwich type of architecture
used for unsupervised learning. While the simple cells had modifiable parameters, the
complex cells were used for pooling. Due to various constraints, one of them being
limited processing power in the hardware, these networks didn’t perform quite as well
as alternate techniques. Backpropagation, one of the fundamental concepts in training
a network, was first applied by Yann LeCun et al. to a deep neural network for the
purpose of recognizing handwritten ZIP codes on mail for the US Postal Service [37].
The input of the network consisted of normalized images of isolated digits. Despite
being applied almost 20 years back, it produced excellent results with a 1% error
rate for the specific application. Due to the hardware constraints, it wasn’t suitable
for general use at the time. Training the full network took approximately
3 days. The first convolutional neural network with backpropagation as we know it
today was proposed in [38] by Yann LeCun et al. in 1998 for document recognition.
This was a 6-layer network composed of three convolutional layers, two pooling layers
(subsampling) and a fully connected layer at the end. The name of the network was
LeNet-5. Each layer in a convolutional neural network
is explained in detail in the next section. Although these kinds of networks were very
successful in handling smaller size images or other problems like character or word
recognition, it was thought until 2011 that these networks wouldn’t be able to handle
larger, more complex images and problems, hence the use of traditional methods like
object detectors using hand-tuned features and classifiers.
Figure 4.7: Architecture of LeNet-5. Each plane is a feature map i.e. a set of units whose
weights are constrained to be identical. [38].
2012 was the year when convolutional neural networks really started making their
mark in the world of machine learning. [39] presented a deep convolutional neural net
which produced significantly better results in the ImageNet classification challenge
[40] compared to the traditional techniques. The architecture was called AlexNet
after the name of the lead author of the paper, Alex Krizhevsky. Besides advancing the
understanding of deep neural networks and showing how to mold them to achieve
good results, this paper contributed in many other important ways by introducing
various important building blocks of neural networks such as ReLUs and a new
regularization technique called ”dropout”. Many prominent researchers believe that
multiple factors contributed to the success of this network, including:
1) Data - The number of samples used to train the model increased from the order of
$10^3$ to $10^6$ compared to previous techniques.
2) Computational power - NVIDIA GPUs with CUDA library support provided an
approximately 20x speedup.
3) Algorithm:
a) Deeper: More layers (8 weight layers)
b) Fancy regularization: Dropout [41]
c) Fancy non-linearity: ReLU [42]
These terms are explained in detail in the next few sections. Detailed examples and
the results are discussed in the Case studies section.
Traditional techniques
K-Nearest Neighbor
The simplest classifier in machine learning is the K-nearest neighbor classifier. There’s
practically no learning involved in this classifier; rather, the prediction is made by
direct comparison of training and testing images. However, rather than relying on a
single fixed inter-pixel comparison between two images, different hyperparameters can
be chosen to obtain a better result. The choice of distance, i.e. the measure of the
difference between pixels, plays an important role in the Nearest Neighbor classifier.
Two distances are commonly used:
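1) L1 distance:
$$ d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right| \qquad (4.1) $$
2) L2 distance:
$$ d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2} \qquad (4.2) $$
where the sums run over all pixels p of the two images $I_1$ and $I_2$.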
Figure 4.8: Nearest Neighbor classification using L1 distance example[5].
Although this is an elementary classifier, it still outperforms random guessing.
Random guessing would produce 10% accuracy on classification of the CIFAR-10 dataset,
which has 10 classes, but the Nearest Neighbor classifier using L1 distance produced
approximately 38.6% accuracy. K-Nearest Neighbor is a modified version of the Nearest
Neighbor classifier, where the top ’k’ closest images are chosen in the training dataset
instead of the single closest image. These images are then used to vote on the label of
the test image. When k=1, the classifier reduces to the Nearest Neighbor classifier. Higher
values of ’k’ have a smoothing effect that makes the classifier more resistant to outliers.
The value of ’k’ and the other hyperparameters are often chosen by experimenting
with the validation set results. The primary disadvantages of this method are - a)
The classifier must remember all of the training data for comparison during testing
leading to memory constraints when the size of the dataset is large, and b) Classifying
a test image is computationally expensive.
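The voting procedure described above can be written as a short Python sketch. The
function names, the four-pixel "images" and their labels are hypothetical and only
serve to make the example self-contained.

```python
from collections import Counter

def l1_distance(a, b):
    """Sum of absolute pixel differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    """Square root of the sum of squared pixel differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train_images, train_labels, query, k=1, distance=l1_distance):
    """Label the query by majority vote among its k closest training images."""
    ranked = sorted(range(len(train_images)),
                    key=lambda i: distance(train_images[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set: every image is flattened to a list of pixel values.
train_images = [[0, 0, 0, 0], [10, 10, 10, 10], [12, 9, 11, 10], [255, 250, 251, 255]]
train_labels = ["dark", "grey", "grey", "bright"]
query = [6, 4, 5, 4]

print(knn_predict(train_images, train_labels, query, k=1))
print(knn_predict(train_images, train_labels, query, k=3))
```

With k=1 the single closest image decides the label, while with k=3 the two "grey"
neighbors outvote it, which illustrates the smoothing effect of larger values of k.
Note that every prediction scans the complete training set, which is the memory and
computation cost mentioned above.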
Support Vector Machine
Moving on from a direct inter-pixel comparison to predict images, a more complex
relation can be built between key features of images and used to build a classifier.
This approach is more intuitive and robust. It has two primary components: a score
function that maps the raw data to class scores, and a loss function that quantifies
the agreement between the predicted scores and the ground truth labels. These two
fundamental concepts are used in almost all classifiers, from linear classifiers
like SVMs to neural networks and CNNs. The only difference is that the nature of
the functions becomes more complex as they are applied in more complex networks.
$$ f(x_i, W, b) = W x_i + b \qquad (4.3) $$
In the above equation, the image $x_i$ has all of its pixels concatenated into a single
column vector of shape [D × 1]. The matrix W (of size [K × D]) and the vector b (of
size [K × 1]) are the parameters of the function. D is the size of each image and K
is the total number of classes in the dataset. The parameters in W are often
called the weights, and b is called the bias vector because it influences the output
scores without interacting with the actual data $x_i$. After training, we only need to
keep the learned parameters. New test images can simply be forwarded through the function
and classified based on the computed scores. Lastly, classifying a test image involves
a single matrix multiplication and addition, which is significantly faster than
comparing a test image to all training images as in the K-Nearest Neighbor approach.
A linear classifier computes the score of a class as a weighted sum of all of its pixel
values across all three of its color channels. Depending on precisely what values are
set for these weights, the function has the capacity to reward or punish certain colors
at certain positions in the image based on the sign of each weight.
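As a small illustration of the score function, the following Python sketch evaluates
W x + b for a toy problem with K = 3 classes and D = 4 pixel values. All numbers are
made up for the example.

```python
def linear_scores(W, x, b):
    """Return one score per class: each row of W dotted with x, plus the bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias
            for row, bias in zip(W, b)]

# Hypothetical parameters: K = 3 rows (classes), D = 4 columns (pixel values).
W = [[1, 0, -1, 2],
     [0, 2, 1, -1],
     [-1, 1, 0, 0]]
b = [0, 1, -2]

# A hypothetical image flattened to D = 4 pixel values.
x = [3, 1, 2, 4]

scores = linear_scores(W, x, b)
predicted_class = scores.index(max(scores))
print(scores, predicted_class)
```

A positive weight rewards a bright pixel at that position for the class in question
and a negative weight punishes it, which is why the first class obtains the highest
score in this example.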
The loss function is used to determine how far the predicted score is from the actual
score in terms of numbers. Hence the lower the loss, the more reliable the predicted
score from the classifier. The Multiclass SVM loss for the i-th example is
$$ L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta) \qquad (4.4) $$
where $L_i$ is the loss for the i-th example,
$y_i$ is the label that specifies the index of the correct class,
$s_j = f(x_i, W, b)_j$ is the score for the j-th class, where $x_i$ is the image, and
$\Delta$ is a hyperparameter that sets the margin by which the score of the correct
class must exceed the other scores for the loss to be zero.
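The Multiclass SVM loss for a single example can be computed directly from its
definition. The Python sketch below uses made-up scores and margins; the function name
`svm_loss` is an assumption for illustration.

```python
def svm_loss(scores, correct_class, delta=1.0):
    """Sum the margin violations of all incorrect classes for one example."""
    correct_score = scores[correct_class]
    return sum(max(0.0, s - correct_score + delta)
               for j, s in enumerate(scores) if j != correct_class)

# Hypothetical scores for three classes; class 0 is the correct one.
scores = [13.0, -7.0, 11.0]

# With a margin of 10, class 2 is too close to the correct class and is penalized.
print(svm_loss(scores, correct_class=0, delta=10.0))

# With a margin of 1, the correct class wins by enough and the loss vanishes.
print(svm_loss(scores, correct_class=0, delta=1.0))
```

Only the classes whose score comes within the margin of the correct class score
contribute to the loss; all others contribute zero.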
The loss function stated above leaves one problem unaccounted for. The set of weights W
that classifies every example correctly, i.e. that satisfies $L_i = 0$ for all i, is
not unique: many similar W meet every condition. This loophole is often taken care of
by adding a term called the regularization penalty R(W) to the loss function. The most
commonly used regularization is the L2 norm, shown below in equation 4.5. The L2 norm
intuitively ”rewards” smaller weights through an element-wise quadratic penalty over all
the parameters:
$$ R(W) = \sum_k \sum_l W_{k,l}^2 \qquad (4.5) $$
It should be noted here, that the regularization term is not a function of the data
but based on the weights. Hence adding both the data loss and the regularization
loss, the full Multiclass SVM loss becomes
$$ L = \frac{1}{N} \sum_i L_i + \lambda R(W) \qquad (4.6) $$
where N is the number of training examples and $\lambda$ is a hyperparameter used for
weighing the regularization penalty. This hyperparameter is usually determined by
cross-validation.
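The effect of the regularization term can be seen in a short Python sketch. The weight
matrix, the per-example losses and the value of the regularization strength `lam` are
all hypothetical.

```python
def l2_penalty(W):
    """Sum of the squares of all entries of the weight matrix."""
    return sum(w * w for row in W for w in row)

def full_loss(per_example_losses, W, lam):
    """Average data loss plus the weighted regularization penalty."""
    data_loss = sum(per_example_losses) / len(per_example_losses)
    return data_loss + lam * l2_penalty(W)

W = [[1.0, -2.0],
     [0.0, 3.0]]
W_scaled = [[2.0 * w for w in row] for row in W]

# Hypothetical per-example data losses for N = 2 training examples.
losses = [8.0, 0.0]

print(l2_penalty(W), l2_penalty(W_scaled))
print(full_loss(losses, W, lam=0.5))
```

Doubling every weight quadruples the penalty, so among several weight matrices with
the same data loss the regularized loss prefers the one with the smallest weights.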
Now that the loss has been computed for each prediction, it has to be used to
improve the future predictions. This is where backpropagation comes into the picture.
To understand backpropagation, a few other terms need to be explained.
1) Gradient - Equation 4.3 intuitively tells us that the weight matrix W needs to be
set or updated in the best possible manner to
get the right score for each prediction. Now there are various ways to initialize this
weight matrix, but none of them will ever be perfect. Hence we need to update the
value of W iteratively, making it slightly better each time. We need to find a direction
in the weight-space that would lower our loss. The trick to finding this direction
is related to a fundamental concept of mathematics: the derivative.
Since the derivatives along each dimension of the input space make up the gradient of
the loss function, they help us find the direction of steepest descent.
$$ \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \qquad (4.7) $$
When the function of interest takes a vector of numbers instead of a single number,
the derivatives with respect to each component are called partial derivatives, and
the gradient is simply the vector of these partial derivatives
in each dimension. The gradient tells us the slope of the loss function along every
dimension, which we can use to make an update.
$$ W_{\text{gradient}} = f_{\text{evaluate\_gradient}}(\text{loss}, \text{data}, W) \qquad (4.8) $$
where $W_{\text{gradient}}$ is the gradient and $f_{\text{evaluate\_gradient}}$ is the
function for evaluating the gradient.
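One simple way to evaluate a gradient is to apply the limit definition of the
derivative with a small but finite h to each dimension in turn. The Python sketch below
does this for a toy loss function; the function names, the loss and the value of h are
assumptions, and in practice analytic gradients are normally preferred because they are
exact and faster.

```python
def numerical_gradient(f, w, h=1e-5):
    """Approximate each partial derivative of f at w with a forward difference."""
    base = f(w)
    grad = []
    for i in range(len(w)):
        w_step = list(w)
        w_step[i] += h  # nudge a single dimension by h
        grad.append((f(w_step) - base) / h)
    return grad

def toy_loss(w):
    """Hypothetical loss whose exact gradient is [2 * w[0], 3]."""
    return w[0] ** 2 + 3.0 * w[1]

grad = numerical_gradient(toy_loss, [2.0, -1.0])
print(grad)
```

The result is close to the exact gradient [4, 3], with a small error that comes from
using a finite h instead of the limit.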
2) Learning rate - The step size or the learning rate plays one of the most important
roles in training a network. Intuitively, the learning rate is the rate at which we should
step in the direction provided by the gradient.
3) Backpropagation - It can be defined as a way of computing gradients of expressions through recursive application of the chain rule.
Figure 4.9: Backpropagation visualized through a circuit diagram [5].
In Figure 4.9, the entire circuit can be visualized as a mini network where the gates
play the key role in shaping the final output. On receiving some input, every gate
gets activated and produces an output value. It also computes the local gradient
of its output value with respect to its inputs. Similar to the neurons in a neural
network, as explained later, these gates work independently of each other and are
not aware of the rest of the circuit. Once the loss is computed in the forward pass, each gate
learns the gradient of the final output of the entire circuit with respect to its own
output value. By the chain rule, this gradient is multiplied by the local gradients
and passed back to the inputs, so the inputs with smaller gradients lose their importance gradually,
while the others gain importance. In this way the gates communicate with each other
and learn about the effect of each input on the final output, and a network is modeled
in a similar fashion.
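The forward and backward passes through such a circuit can be sketched in a few lines. The circuit used here, f = (x + y) * z with one addition gate and one multiplication gate, is a hypothetical example chosen for illustration and is not necessarily the one drawn in Figure 4.9.

```python
def forward_backward(x, y, z):
    """Forward and backward pass through the circuit f = (x + y) * z."""
    # Forward pass: each gate computes its output from its inputs.
    q = x + y          # addition gate
    f = q * z          # multiplication gate

    # Backward pass: start at the output and apply the chain rule.
    df_dq = z          # local gradient of the multiplication gate w.r.t. q
    df_dz = q          # local gradient of the multiplication gate w.r.t. z
    df_dx = df_dq * 1.0  # addition gate passes the gradient through unchanged
    df_dy = df_dq * 1.0
    return f, (df_dx, df_dy, df_dz)
```

Each gate only needs its own inputs and the gradient arriving from the gate after it, which is what makes the recursive application of the chain rule possible.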
4) Gradient descent/parameter update - The procedure of repeatedly evaluating the
gradient of the loss function and then performing a
parameter update is called gradient descent. There are different types of gradient
descent techniques. One of them, vanilla gradient descent, is shown below -
\[ W \leftarrow W - \text{learning\_rate} \cdot W_{gradient} \qquad (4.9) \]
where W is the weight matrix and W_gradient is the gradient. Equations 4.8 and 4.9 show
how the weights are updated based on the loss, the gradient and the gradient descent step.
This loop is essentially at the core of training all neural networks.
Momentum update is another approach that almost always enjoys better convergence
rates on deep networks.
\[ v \leftarrow mu \cdot v - \text{learning\_rate} \cdot W_{gradient} \qquad (4.10) \]
\[ W \leftarrow W + v \qquad (4.11) \]
where W, W_gradient and learning_rate are the same as in equation 4.9. Equation 4.10
integrates the velocity and equation 4.11 integrates the position. v is initialized at
zero and mu is an additional hyperparameter in this case. Other parameter update
methods include Adagrad [45], RMSprop [46] and Adam [47], which adapt the learning rate for each parameter.
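The vanilla and momentum updates can be written directly from equations 4.9 to 4.11. This is a minimal sketch in which the weights are a flat list of floats; the function names and the toy loss used in the usage note are assumptions for illustration only.

```python
def vanilla_step(W, W_gradient, learning_rate):
    """Equation 4.9: step against the gradient."""
    return [w - learning_rate * g for w, g in zip(W, W_gradient)]


def momentum_step(W, v, W_gradient, learning_rate, mu):
    """Equations 4.10 and 4.11: integrate the velocity, then the position."""
    v = [mu * vi - learning_rate * g for vi, g in zip(v, W_gradient)]
    W = [w + vi for w, vi in zip(W, v)]
    return W, v
```

As a usage example, for the toy loss w**2 the gradient is 2*w, so calling `momentum_step(W, v, [2.0 * W[0]], 0.1, 0.9)` repeatedly, starting from `W = [1.0]` and `v = [0.0]`, drives W towards zero.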
Neural Networks
Until now, the input variable, i.e. the image, has been assumed to be linearly related
to the output, i.e. the score. However, this is rare in real-world data,
which is why linear classifiers do not always perform well. For example, the relation
between the image and the score can be as shown in equation 4.12.
\[ s = W_2 \max(0, W_1 x) \qquad (4.12) \]
where W1 could be, for example, a [100×3072] weight matrix transforming the
image into a 100-dimensional hidden vector. The function max(0, W1 x) is a non-linearity
applied elementwise. Finally, the matrix W2 would then be of size [10×100], giving a vector of
10 class scores. The non-linearity plays an important role here, as it
separates the equation above from that of a linear classifier. Equation 4.12 is that
of a 2-layer neural network.
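The 2-layer score function can be sketched with plain Python lists. The helper names are hypothetical, and the tiny matrices in the usage note stand in for the [100×3072] and [10×100] matrices mentioned above.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def two_layer_scores(W1, W2, x):
    """s = W2 max(0, W1 x): a linear map, an elementwise threshold, a linear map."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)
```

For example, with `W1 = [[1, -1], [-1, 1]]`, `W2 = [[1, 2], [3, 4], [0, -1]]` and `x = [2, 1]`, the hidden vector is `[1, 0]` and the three class scores are `[1, 3, 0]`. Without the `max(0, ...)` step the two matrices would collapse into a single linear classifier.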
The circuit in Figure 4.9 can be thought of as a small network. In a neural network,
the non-linear functions applied at the neurons, in place of the simple addition and
multiplication gates of the circuit, are known as activation functions. They play a key role in
both neural and convolutional neural networks, and a lot of research has been
focused on them in recent years. A few activation functions
are commonly used -
1) Sigmoid -
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
2) Tanh -
\[ \tanh(x) = 2\sigma(2x) - 1 \]
3) ReLU [42] -
\[ f(x) = \max(0, x) \]
The third activation function, ReLU, is one of the most commonly used activation
functions in neural networks and CNNs. It was found to greatly accelerate
(e.g. by a factor of 6 in [39]) the convergence of stochastic gradient descent compared
to the sigmoid and tanh functions. It is argued that this is because of its linear, non-saturating form. The implementation of this function is also fairly easy, as it only
involves thresholding at zero. The disadvantage of this function is
that ReLU units can be fragile during training and can become ineffective inside
the network. For example, a large gradient passing through a ReLU neuron could
cause the weights to update in such a way that the neuron will never activate on
any datapoint again, making all the future gradients through that unit zero. That
is, ReLU units can irreversibly die during training since they can get knocked off
the data manifold. For example, we may find that as much as 40% of the network
becomes dead (i.e. neurons that never activate across the entire training dataset)
if the learning rate is set too high. Hence setting the right learning rate is very
important while training a neural network.
Figure 4.10: ReLU activation function, which is zero when x < 0 and then linear with slope
1 when x > 0 [5].
4) Leaky ReLU [43] -
\[ f(x) = \mathbb{1}(x < 0)(\alpha x) + \mathbb{1}(x \geq 0)(x) \]
where α is a small constant.
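The four activation functions listed above translate directly into code. This is a sketch for scalar inputs; the default `alpha = 0.01` for the leaky ReLU is an assumed value, since the text only says that α is a small constant.

```python
import math


def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x)), squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """tanh written as a scaled and shifted sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def relu(x):
    """f(x) = max(0, x): a simple threshold at zero."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but with a small slope alpha for negative inputs."""
    return alpha * x if x < 0 else x
```

The small negative slope of `leaky_relu` keeps a non-zero gradient for negative inputs, which is one way of addressing the dying ReLU problem described above.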
A typical neural network is usually modeled as a collection of neurons arranged in multiple
layers as an acyclic graph. The most common layer type in a
regular neural network is the fully connected layer, in which every neuron is
connected to all the neurons of the adjacent layer, while the neurons within a single
layer do not share any connections.
The final layer in a neural network usually represents class scores in classification
and real-valued targets in the case of regression. Unlike other layers, this final
layer usually doesn’t have an activation function.
Figure 4.11: A 3-layer neural network with three inputs, two hidden layers of 4 neurons
each and one output layer [5].
Training neural networks can be tricky because of the many hyperparameters and design choices involved, such as the kind of weight initialization and the pre-processing of data
by mean subtraction, normalization etc. A recently developed technique by Ioffe and
Szegedy [48] called Batch Normalization reduces the burden of perfect weight initialization in neural networks by explicitly forcing the activations throughout a network
to take on a unit Gaussian distribution right from the beginning of the training. Usually a BatchNorm layer is inserted immediately after fully connected layers and before
non-linearities. There are several ways of controlling neural networks to prevent
overfitting - 1) L2 regularization, 2) L1 regularization, 3) max norm constraints and
4) dropout. Dropout [44] is an extremely effective, simple and recently introduced regularization technique that complements the other methods (L1, L2, max norm).
Dropout is implemented while training by only keeping a neuron active with some
probability p (a hyperparameter), or setting it to zero otherwise.
Observing the
change in loss while training neural networks can be useful, as it is evaluated on individual batches during forward propagation. Figure 4.12 is a cartoon diagram showing
the loss over time. The effects of different learning rates can be observed in the diagram.
The second important quantity to track while training a classifier is the validation/training accuracy. The model should be trained carefully so that it doesn’t
overfit the training set, which would cause it to perform poorly on unseen
test examples. Other factors which can be monitored are the ratio of weights to updates,
the activation/gradient distributions per layer, first-layer visualizations etc.
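The dropout rule described above, keeping a neuron active with probability p and zeroing it otherwise, can be sketched as follows. Dividing the surviving activations by p (often called inverted dropout) is an assumption added here so that the expected activation is unchanged and no rescaling is needed at test time; the function name is hypothetical.

```python
import random


def dropout(activations, p, rng=random):
    """Training-time dropout on a list of activations.

    Each activation is kept with probability p and set to zero otherwise.
    Kept values are scaled by 1/p (inverted dropout, an assumption here).
    """
    return [a / p if rng.random() < p else 0.0 for a in activations]
```

At test time the layer is simply left unchanged, since the scaling during training already compensates for the dropped units.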
Figure 4.12: A cartoon depicting the effects of different learning rates. While lower
learning rates give linear improvements, the higher the rate goes, the more exponential the improvements
become. Higher learning rates will decay the loss faster, but they get stuck at worse values
of loss (green line). This is because there is too much "energy" in the optimization and
the parameters are bouncing around chaotically, unable to settle in a nice spot in the
optimization landscape [5].
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are very similar to ordinary neural networks
in the sense that they are made of neurons that have learnable weights and biases. In
the case of CNNs, the weight parameters are filter coefficients. Each layer
receives some inputs, performs a convolution operation and optionally follows it with
an activation function. The whole network still expresses a single differentiable score
function from the input to the output and has a loss function (e.g. SVM/Softmax) on
the last (fully-connected) layer. Regular neural networks receive an input in the form
of a concatenated vector and transform it through multiple hidden layers. Each
hidden layer is made up of a set of neurons, where each neuron is fully connected
to all neurons in the previous layer but completely independent of the other neurons in
its own layer. A fully connected neuron in the first hidden layer, given an input image
of size 200×200×3, would have 120,000 weights, which is computationally expensive
and often not required. CNNs improve on regular neural networks in this case. They
have neurons arranged in 3 dimensions: width, height and depth (the depth being 3 in the case of RGB
images). The neurons in a layer are only connected to a small region of the
layer before it, instead of to all of its neurons in a fully-connected manner. Moreover,
the final output layer has a dimension of 1×1×(number of classes), as the CNN
architecture reduces the full image into a single vector of class scores, arranged
along the depth dimension. A simple CNN is a sequence of layers, each of which transforms
one volume of activations into another through a differentiable function. Traditionally,
CNNs use three types of layers - the convolutional layer, the pooling layer and
the fully-connected layer. Some architectures insert other layers in between,
such as activation function layers, batch normalization [48] and dropout, to improve
the results depending on the input data. In summary -
1) A ConvNet architecture is a list of layers that transform the image volume into
an output volume (e.g. holding the class scores)
2) There are a few distinct types of layers - convolutional/fully-connected/ReLU/pooling
3) Each layer accepts an input 3D volume and transforms it to an output 3D volume
through a differentiable function
4) Each layer may or may not have parameters (e.g. convolutional/fully-connected
do, ReLU/pooling don’t)
5) Each layer may or may not have additional hyperparameters (e.g. convolutional/fully-connected/pooling do, ReLU doesn’t)
The commonly used layers are briefly described below -
A) Convolutional layer (CONV layer)
The convolutional layer is the core building block of a CNN and does most of the
computational heavy lifting. These layers learn the filter weights by performing a
convolution operation between the filters and the input volume along its width and height.
The filters must have the same depth as the input volume, i.e. in the case of the initial
input layer, the depth of the filter will be 3, as the image itself has a depth
of 3 (RGB). A 2-dimensional activation map is produced as the output for each filter.
Intuitively, the network will learn filters that activate when they see some type of
visual feature like an edge, or more abstract and complex concepts in higher layers of
the network. Each filter produces its own activation map, all of which are stacked
together along the depth dimension. In Figure 4.13, the input volume is 32×32×3, and
if two filters of dimension 5×5×3 are applied, then two activation
maps of size [28×28×1] are produced. A couple of terms need to be mentioned here
for understanding the internal dimensions of the network - 1) stride - the rate of
sliding the filter spatially in the convolution operation, e.g. if the stride is 1, then
the filters are moved one pixel at a time, and 2) padding - used to pad the input
volume with zeros around the border, giving the user control over the spatial size of
the output volumes. It is often used to preserve the spatial size of the input volume so
that the input and the output width and height are the same.
Figure 4.13: Convolutional layer - input layer [32×32×3], filter dimension
[5×5×3], activation maps [28×28×1] (http://image.slidesharecdn.com/case-study-of-cnn160501201532/95/case-study-of-convolutional-neural-network-5-638.jpg?cb=1462133741).
To summarize, the CONV layer -
1) Accepts a volume of size W1 × H1 × D1
2) Requires four hyperparameters: the number of filters (K), their spatial extent (F), the
stride (S) and the amount of zero padding (P)
3) Produces a volume of size W2 × H2 × D2 where:
W2 = (W1 - F + 2P)/S + 1
H2 = (H1 - F + 2P)/S + 1
D2 = K
4) With parameter sharing, introduces F*F*D1 weights per filter, for a total of
(F*F*D1)*K weights and K biases
5) In the output volume, the d-th depth slice (of
size W2 × H2) is the result of performing a valid convolution of the d-th filter over the
input volume with a stride of S, offset by the d-th bias
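The size formulas in the summary above can be checked with a short helper. The function names are hypothetical, and the check that the filter tiles the input evenly is an added assumption, since the formulas only give whole numbers in that case.

```python
def conv_output_shape(w1, h1, d1, k, f, s, p):
    """Output volume (W2, H2, D2) of a CONV layer.

    k filters of spatial extent f, stride s, zero padding p. The input depth d1
    does not affect the output shape; it is accepted to mirror the summary.
    """
    if (w1 - f + 2 * p) % s or (h1 - f + 2 * p) % s:
        raise ValueError("filter does not tile the padded input evenly")
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2, k


def conv_parameter_count(d1, k, f):
    """(weights, biases) of a CONV layer with parameter sharing."""
    return f * f * d1 * k, k
```

For the example of Figure 4.13, `conv_output_shape(32, 32, 3, 2, 5, 1, 0)` gives a 28×28×2 volume, and `conv_parameter_count(3, 2, 5)` gives 150 weights and 2 biases. With a padding of 2 the same filters would preserve the 32×32 spatial size.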
B) Pooling layer
The pooling layer reduces the spatial size of the representation, which reduces the
number of parameters and the amount of computation in the network. Pooling enables the network
to learn a series of progressively more abstract concepts. There are a few common
types of pooling techniques like max pooling (Figure 4.14), average pooling, L2-norm
pooling etc. To summarize, the pooling layer -
1) Accepts a volume of size W1 × H1 × D1
2) Requires two hyperparameters: the spatial extent F and the stride S
3) Produces a volume of size W2 × H2 × D2 where:
W2 = (W1 - F)/S + 1
H2 = (H1 - F)/S + 1
D2 = D1
4) Introduces zero parameters since it computes a fixed function of the input
5) It is not common to use zero-padding for pooling layers
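Max pooling on a single depth slice can be sketched as below. The slice is given as a list of rows; the default 2×2 window with a stride of 2 is an assumed, commonly used setting, and the function name is hypothetical.

```python
def max_pool(channel, f=2, s=2):
    """Max-pool one 2-D depth slice (a list of rows) with extent f and stride s."""
    h1, w1 = len(channel), len(channel[0])
    h2 = (h1 - f) // s + 1
    w2 = (w1 - f) // s + 1
    return [
        [
            # Largest value inside the f x f window whose top-left corner
            # is at row r * s, column c * s.
            max(channel[r * s + i][c * s + j] for i in range(f) for j in range(f))
            for c in range(w2)
        ]
        for r in range(h2)
    ]
```

Applied to every depth slice independently, this leaves the depth unchanged (D2 = D1) while the width and height shrink, and it introduces no parameters.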
Figure 4.14: Max pooling with a stride of 2 [5].
C) Fully-connected layer (FC layer)
From a mathematical point of view, the neurons in both the CONV layers and the
FC layers compute dot products, hence they have the same functional form.
The only difference between these two layers is that the neurons in the CONV layer
are connected only to a local region of the input and share parameters, while the
neurons in the FC layer are independent of each other but are connected to all the
activations in the previous layer. Hence the activations of an FC layer can be computed with a
matrix multiplication followed by a bias offset.
Case studies [5]
1) LeNet [37] - The first successful applications of CNNs were developed by Yann
LeCun in the 1990s. Of these, the best known is the LeNet architecture, which was
used to read zip codes, digits, etc.
2) AlexNet [39] - The first work that broadly popularized CNNs in computer vision
was AlexNet, developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton.
AlexNet was the winner of the ImageNet ILSVRC challenge in 2012 and significantly
outperformed the runner-up, by a margin of about 10 percentage points in top-5 error. It had a very
similar architecture to LeNet, but was deeper. The architecture is explained
in further detail in the section "Proposed Method".
3) ZF Net [49] - The ILSVRC 2013 winner was a CNN from Matthew Zeiler and
Rob Fergus, which became known as ZFNet. This network was created by tweaking
the AlexNet architecture hyperparameters. The size of the convolutional layers in
the middle was increased and the size of the stride and filter on the first layer was
reduced.
4) GoogLeNet [50] - The ILSVRC 2014 winner was a CNN from Google. This paper
introduced a new module called the inception module to form a different type of architecture. The inception module dramatically reduced the number of parameters in the
network to 4 million, compared to 60 million in AlexNet. Additionally, this paper
uses average pooling instead of FC layers at the top of the ConvNet, which removes
a large number of parameters. There are also several follow-up versions of
GoogLeNet, most recently Inception-v4 [51].
5) VGGNet [52] - The runner-up in ILSVRC 2014 was the CNN architecture called
VGGNet. The primary contribution of this paper was in showing that the depth of the
network plays a critical role in its performance. The recommended
network contains 16 CONV/FC layers and features an extremely homogeneous architecture that only performs 3×3 convolutions and 2×2 pooling throughout
the entire network.
6) ResNet [53] - The Residual Network developed by Kaiming He et al. was the winner
of ILSVRC 2015. It features unique skip connections with heavy use of batch normalization. Like GoogLeNet, the architecture removes the fully connected layers at
the end of the network. ResNets are currently considered by many to be the state-of-the-art
CNN architecture.
Deep learning frameworks
1) Theano [54] - Theano is a Python library that allows you to define, optimize,
and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
It was first developed at the University of Montreal and has been powering large-scale
computationally intensive scientific investigations since 2007. Theano features a tight
integration with NumPy, a Python library of optimized numerical routines.
It also has GPU support and dynamic C code generation for evaluating expressions.
2) Caffe [16] - Caffe is a deep learning framework made with expression, speed, and
modularity in mind. It is developed by the Berkeley Vision and Learning Center
(BVLC) and by community contributors. The framework is a BSD-licensed C++
library with Python and MATLAB bindings for training and deploying convolutional
neural networks. It also provides GPU support through CUDA. Caffe has
a strong online community, from which users can benefit significantly, e.g. through pretrained models.
3) Torch [17] - Torch is a scientific computing framework with wide support for
machine learning algorithms. The frontend uses the scripting language LuaJIT, with
an underlying C/CUDA implementation in the backend. Lua being one of the
fastest scripting languages, Torch provides excellent speed, especially since it was built with
GPU support in mind from the start. Some of the core features include -
• a powerful N-dimensional array
• lots of routines for indexing, slicing, transposing, ...
• interface to C, via LuaJIT
• linear algebra routines
• neural network, and energy-based models
• numeric optimization routines
• Fast and efficient GPU support
• Embeddable, with ports to iOS, Android and FPGA backends
Among others, it is used by the Facebook AI Research Group, IBM, Yandex and
the Idiap Research Institute.
4) TensorFlow [18] - TensorFlow, released by Google in 2015, is the most recent of these frameworks.
It was developed with a focus on easy access to machine learning
models and algorithms, ease of use and easy deployment on heterogeneous
machines such as mobile devices.
Chapter 5
Time of day/Weather invariant
Time of day
The primary focus of this research is to develop a vision-based approach which can
substitute for or assist a traditional GPS-based localization method. The results of this
process are affected by the quality of the datasets and the model trained on them.
For example, a model trained on a dataset of images taken in the time frame
of 9am-10am will not be as good a predictor for images taken at a different time of the
day, like the late afternoon or evening. This is primarily because there is a significant
change in the illumination of the images at different times of the day. Hence, a method
that filters out this illumination change, so that a
single model can be used at any time of the day, would help a lot. Although
this is a problem for this research, it is not new in the computer vision world and a
lot of research has been conducted in the past to resolve it. An algorithm inspired by
[55] is used later in this research, and the results looked promising. [55] presents
a model of brightness perception that receives High Dynamic Range (HDR)
radiance maps and outputs Low Dynamic Range (LDR) images while retaining the
important visual features. The algorithm proposed in that paper is motivated by
a popular assumption about human vision: human vision responds to local changes
in contrast rather than to global stimuli levels. From this assumption, the primary
goal in that paper was to find the reflectance/perception gain L(x,y) such that when it
divides the input I(x,y), it produces the reflectance/perceived sensation R(x,y)
of the scene, in which the local contrast is appropriately enhanced.
\[ \frac{I(x, y)}{L(x, y)} = R(x, y) \] [55]
Weather invariant
Similar to time of day, weather plays an equally important role in
prediction. While the removal of the drastic changes that inclement weather
(heavy rain or snow) causes in an image hasn’t been perfected yet, a
good amount of research has been done recently on removing rain streaks. [56] takes
advantage of the strength of convolutional neural networks in learning both high-level
and low-level features from images. This paper proposes a new deep convolutional
neural network called DerainNet, which has the ability to learn the non-linear mapping
between rainy and clean image detail layers from data. Each image
is decomposed into a low-frequency base layer and a high-frequency detail layer.
The network is trained on the detail layer rather than in the image domain. It is
further discussed in the ’Proposed Method’ and ’Results’ sections. While [56]
takes advantage of CNNs, [57] proposes a different method using Gaussian mixture
models (GMMs). Both approaches treat the problem as a layer decomposition
one, i.e. the superimposition of two layers, the base layer (without rain) and the detail
layer (with rain). [57] uses simple patch-based priors for both the background and
rain layers. These priors are based on GMMs and have the ability to accommodate
multiple orientations and scales of the rain streaks. It is important to understand
that removing rain from a single frame is a lot more difficult than removing it from video streams, as
the approaches treating video streams assume that they have a static background
and can treat weather as removable temporal information.
Chapter 6
Proposed method
The primary objective of this research is to come up with an alternative to solely
GPS-dependent localization for autonomous or semi-autonomous vehicles. The alternative approach may be used to assist the GPS output or as its replacement. The motivation behind this comes from the fact that GPS data is not
always available and is not always precise enough for accurate localization. The
eye is the best perception sensor of the environment in human beings. From
just a quick glance, the human eye is capable of understanding its surroundings in
real time, including depth information. This accurate perception helps a driver
control the vehicle in the desired way. The most valuable part of this perception is
depth, which is one of the primary factors contributing to precise localization. It is
evident from this that vision sensors can be equally powerful if used
the right way. Although scientists are still far from understanding the detailed
working mechanism of the human brain, including how it processes visual data,
recent advances in machine learning have proven to be very successful in mimicking
some of the human visual system. Given this progress, it is believed that convolutional neural networks have the ability to identify locations accurately if trained on
visual data for each of those locations.
Multiple datasets were created for this project. Although each of these
datasets was created in a unique way, the fundamental approach was the same for
all of them. At first, a set of classes was formed, each class being a unique pair
of GPS coordinates along with images from different viewpoints. These datasets
were then classified to observe whether each location can be identified based on the visual
stimulus. Other experiments were also conducted on these datasets, such as testing
weather invariant methods.
Google Street View
Google Street View is a technology that provides panoramic views of the world, primarily from roads. Using an image stitching algorithm, it displays panoramas
of stitched images from different viewpoints. Most of the images are captured using
a car, but other modes of transportation are also used in some cases. The current
version of Street View uses JavaScript extensively and also provides a JavaScript application programming interface integrated with Google Maps, which has been used
extensively in this project. The images are not taken from one specific camera, as
the version of the camera has changed over the years. The cameras are usually
mounted on the roof of the car when recording data. The ground truth of the data is
recorded using GPS, wheel speed and inertial navigation sensor data.
The advantage of Street View data is the availability of images of a single unique
location from different viewpoints, controlled by the field of view, the heading and the pitch.
Field of View
The field of view determines the horizontal field of view of the camera. It is expressed
in degrees, with a maximum value of 120. Field of view essentially represents zoom,
with its magnitude being inversely proportional to the level of zoom. The default
value is 90.
Figure 6.1: Street View car [58].
Heading
The heading is the most useful feature provided by Google Street View. It gives the
user the ability to look at a 360 degree view from a location. This helps a user or a car
particularly when facing a different heading while being at the same location.
The magnitudes are - North: 0 (or 360), East: 90, South: 180, West: 270.
Pitch
The pitch specifies the up or down angle of the camera relative to the Street View
vehicle. Positive values mean the camera is facing upward, with 90 meaning vertically
up, while negative values mean the camera is facing downward, with -90 meaning vertically down. The default value is 0.
The different viewpoints help in increasing the robustness of the dataset. They also
help in solving a problem related to navigation of the car. In particular, these viewpoints
can be used to estimate the pose of the car at a certain location
(x, y). Alternatively, if estimating pose is not required, they can be used to make
the model richer for predicting just the location coordinates (x, y). For example,
it is difficult to predict the direction of the car while traveling, and the heading helps
especially in this case. Similarly, if the road is not particularly smooth, the pitch
helps, while the field of view helps in handling the zoom.
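To make the three viewpoint parameters concrete, the sketch below builds a request URL for one image. It assumes the HTTP-based Street View Static API (with its `size`, `location`, `heading`, `fov`, `pitch` and `key` query parameters) rather than the JavaScript API used in this project; the function name, the image size and the `YOUR_API_KEY` placeholder are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint of the Street View Static API (assumed here for illustration).
STREET_VIEW_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview"


def street_view_url(lat, lng, heading, fov=90, pitch=0,
                    size="640x640", key="YOUR_API_KEY"):
    """Build the URL of one Street View image for a location and viewpoint."""
    if not 0 <= heading <= 360:
        raise ValueError("heading must be between 0 and 360 degrees")
    if not 0 < fov <= 120:
        raise ValueError("field of view must be at most 120 degrees")
    if not -90 <= pitch <= 90:
        raise ValueError("pitch must be between -90 and 90 degrees")
    params = {
        "size": size,
        "location": f"{lat:.6f},{lng:.6f}",
        "heading": heading,
        "fov": fov,
        "pitch": pitch,
        "key": key,
    }
    return f"{STREET_VIEW_ENDPOINT}?{urlencode(params)}"
```

Calling this function for several headings, pitches and fields of view at the same coordinates would give the set of images that make up one class.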
Google Maps API
The Google Maps API helps users visualize maps and access other
useful features like directions and Street View. Its usage for specific datasets is
explained and shown under the individual datasets.
The first dataset was created with the intention of covering the entire campus of the
university. The four furthest coordinates of the campus were selected as the boundary
coordinates, and connecting these boundary
coordinates formed a virtual rectangle. This was a brute force method
in which Street View data was collected at the GPS locations within the area of this rectangle to form
the dataset. It should be noted here that this was the first attempt in this project
to test the fundamental idea of identifying locations using vision data
only. The latitudinal and longitudinal differences between neighboring locations were .0002
and .0001 respectively. Since Google Maps gives approximately a root mean square
(RMS) precision of 10cm on a six decimal point latitude/longitude, the latitudinal
and longitudinal differences for this dataset correspond to about 20m and
10m respectively. The parameters of the Street View used in creating this dataset are
given below.
Table 6.1: Boundary coordinates of the rectangle.
Southwest tip   43.080089,-77.685000
Southeast tip   43.080089,-77.666080
Northwest tip   43.087605,-77.685000
Northeast tip   43.087605,-77.666080
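The brute force sampling of the rectangle in Table 6.1 can be sketched as a regular grid with the latitude and longitude spacings quoted above. The function names are hypothetical, the grid is only an illustration of the idea, and not every grid point necessarily has Street View imagery. The conversion to metres follows the rule of thumb used in the text (about 10cm per 0.000001 degree).

```python
def coordinate_grid(south, west, north, east, lat_step=0.0002, lng_step=0.0001):
    """Regularly spaced (latitude, longitude) pairs inside a bounding rectangle."""
    # Number of grid lines that fit inside the rectangle in each direction;
    # the small epsilon guards against floating point round-off.
    rows = int((north - south) / lat_step + 1e-9) + 1
    cols = int((east - west) / lng_step + 1e-9) + 1
    return [
        (round(south + r * lat_step, 6), round(west + c * lng_step, 6))
        for r in range(rows)
        for c in range(cols)
    ]


def approx_metres(step_degrees, metres_per_microdegree=0.10):
    """Convert a coordinate difference to metres using the 10cm rule of thumb."""
    return step_degrees / 1e-6 * metres_per_microdegree
```

For example, `coordinate_grid(43.080089, -77.685000, 43.087605, -77.666080)` starts at the southwest tip, and `approx_metres(0.0002)` gives about 20m.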
Table 6.2: Dataset1: parameters.
Heading Field of View
Table 6.3: Dataset1 details.
Image resolution
No. of images
No. of classes No. of images/class
Figure 6.2: Street View images from Dataset1.
While the first dataset was created in a brute force fashion, the second dataset was
created much more selectively. The objective was to create a dataset in which
the images from each class were less similar in nature and restricted to the
viewpoint of roads inside the campus. Since most of the data from Google Street
View is acquired from roads, this eliminates images of places which are
inaccessible to vehicles. As this research is focused on autonomous cars, this made
sense. A zoomed in view of some of the locations is shown in Figure 6.3. It
can be seen in this figure that while the distance between two locations is relatively
large on a straight road, it is relatively small during a turn or when the road is curvy.
This is because images taken along a straight road are very similar over a given distance,
the primary difference being only the zooming effect. On the other hand,
images look quite different at each location when the road is curvy, hence the
locations are closer to each other. This helps in building a better model.
Figure 6.3: Zoomed in portion of Dataset2: Distance between locations based on viewpoint.
The Google Maps API was used extensively for creating this dataset. Twenty-six
important landmarks were chosen such that they would cover almost the
entire university campus. Using these landmarks, an application was created with the
Google Maps API. This API uses HTML5 as its primary platform, which leverages
both HTML as a markup language and JavaScript for web applications. The primary
objective of the application was to create paths between a user’s start and end locations
and store all the coordinates along the path in a similar fashion as shown in Figure
6.3. There was one more unique feature in the application - the user could create
a path that passes through user-selected waypoints before reaching the
end location. This is shown in Figure 6.4. The start and end locations in
Figure 6.4 are the green marker ‘A’ and the red marker respectively, and all the other
green markers are waypoints selected by the user. The blue line shows the route
between the start and end locations, passing through the waypoints. The
individual route segments between each pair of waypoints are also shown on the right.
Taking various permutations of these locations (start, end, waypoints), a set of
coordinates was collected throughout the campus from the paths, and
redundant coordinates were eliminated before creating Dataset2.
Figure 6.4: Routes created out of Google Maps API.
Finally, the dataset was created from these coordinates in a similar fashion to
the first dataset. The coordinates/classes are shown in Figure 6.5.
Table 6.4: Dataset2: parameters.
Heading Field of View
Table 6.5: Dataset2 details.
Image resolution
No. of images
No. of classes No. of images/class
Figure 6.5: Locations/classes of Dataset2.
The rest of the datasets were created by driving a golf cart around the university
campus while capturing images using a set of cameras and collecting the GPS ground
truth using an open source library. A brief description of the golf cart, the cameras
used and the library used for collecting ground truth is given below.
Autonomous golf cart
Rochester Institute of Technology has recently been conducting research in the field
of autonomous driving through different programs. A number of students
have been involved in senior design projects as well as graduate research, developing new
technologies and experimenting with traditional ones related to autonomous vehicles. A
stable platform is essential for testing these algorithms. Students have
converted a traditional golf cart into an autonomous one by adding multiple sensors and a Central Processing Unit (CPU) with Ubuntu installed on it to handle all the
feedback from these sensors and run other operations. This only sheds light on the
software of the golf cart, but a lot of other complex electrical and mechanical work is
also integral to its operation. Further details can be found at the Autonomous
People Mover Phase3’s website - http://edge.rit.edu/edge/P16241/public/Home.
Figure 6.6: RIT’s autonomous golf cart.
As part of this research, a couple of forward facing Hikvision Bullet IP cameras were
installed on the roof of the golf cart, pointing straight ahead, in order to collect data from
the university campus (Figure 6.7). Since these cameras operate over IP, a
static IP address had to be assigned to the cameras in order to process images from them.
Due to the high frame rate and the encoding format, a few issues became bottlenecks while collecting data from the cameras. The system’s data pipeline was getting
jammed because it could not process the data in real time. Scripts were written to
handle the data pipeline efficiently in the form of a queue. The queue is an
important concept in embedded systems software development. At a high level,
a queue ensures that the system doesn’t start processing an item in the pipeline until
all the data before it have been successfully processed.
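The queue-based handling described above can be illustrated with a minimal sketch. This is not the script used on the golf cart; the function name `run_pipeline`, the `process` callback and the queue size are assumptions chosen for illustration. A bounded first-in-first-out queue makes the producer wait whenever the consumer falls behind, so frames are processed strictly in arrival order.

```python
import queue
import threading

def run_pipeline(frames, process, max_pending=8):
    """Feed frames through a bounded FIFO queue to a single worker thread."""
    pending = queue.Queue(maxsize=max_pending)  # put() blocks when the queue is full
    results = []

    def worker():
        while True:
            frame = pending.get()
            if frame is None:       # sentinel: no more frames will arrive
                break
            results.append(process(frame))

    consumer = threading.Thread(target=worker)
    consumer.start()
    for frame in frames:
        pending.put(frame)          # waits instead of overrunning the worker
    pending.put(None)
    consumer.join()
    return results
```

With a single worker, the results come back in the same order as the input frames, which is the property the text relies on.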
Table 6.6: Configuration of Hikvision cameras.
Image resolution
Frame rate Max. bitrate
Video encoding
Figure 6.7: (left) Hikvision Bullet IP camera (right) Cameras on the golf cart.
GPS ground truth
An open source library/tool [60] was used for collecting the ground truth for the
datasets created by driving the golf cart around. It uses the OSX Core Location [59]
framework to get the latitude and longitude of a specific location. Since Core Location
uses cellular tower triangulation in addition to Wi-Fi positioning, the Wi-Fi hotspot
of a cellular phone was used while recording the ground truth,
to ensure a stable internet connection.
This was the first dataset created by collecting images from the university campus
while driving the golf cart around. The images were collected in early August with
the objective of recording images from summer. A number of images were acquired
randomly for each location. While images for some of the locations were recorded
twice, by driving the golf cart towards and away from the location (in a round trip
fashion), for others, the images were taken with the cameras facing in one direction
(either facing towards or away from the location). The former case provided more
robustness to some locations. The average distance between locations was (3.8m,
7.1m) in the (vertical, horizontal) directions respectively. For data augmentation,
10× crops were used, i.e. random crops taken out of the original
images. Locations/classes of this dataset are shown in Figure 6.8 and some images
in Figure 6.9.
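The random-crop augmentation mentioned above can be sketched as follows. The crop size, the use of ten crops per image and the plain nested-list image representation are assumptions made for illustration; the thesis does not state the exact crop dimensions.

```python
import random

def random_crops(image, crop_h, crop_w, n_crops=10, seed=None):
    """Return n_crops random crop_h x crop_w windows from a 2-D list image."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    crops = []
    for _ in range(n_crops):
        top = rng.randint(0, height - crop_h)   # randint bounds are inclusive
        left = rng.randint(0, width - crop_w)
        crops.append([row[left:left + crop_w] for row in image[top:top + crop_h]])
    return crops
```

Each crop keeps the class label of the image it came from, so the number of training samples per location grows without any new recording.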
Table 6.7: Dataset3 details.
Image resolution
No. of images
No. of classes
No. of images/class: Not constant
Average inter-class distance: (3.8m, 7.1m)
Figure 6.8: Locations from Dataset3.
Since different weather conditions, climates and times of the year affect the environment, they
also affect a model trained on images from these different types of environments.
Figure 6.9: Images from Dataset3.
Hence, as part of this research, datasets were created at different times of the year
and separate models were trained on them. Dataset4 was created around the end of
October during the fall season. The average distance between locations was (1.8m,
6.4m) in the (vertical, horizontal) directions respectively. For data augmentation,
scaling and rotational transforms were used in addition to the 10× method. The
scaling operation was performed using a perspective transform in the range of (30,-30),
in steps of 10 units per zoom level. For rotation, the ranges were
(10,50) and (310,350) degrees, with successive images rotated 10 degrees apart.
This added to the robustness of the dataset. Locations/classes of this dataset are
shown in Figure 6.10 and some images in Figure 6.11.
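The augmentation ranges above can be enumerated explicitly. The sketch below only lists the parameter values; whether the endpoints and the identity offset of 0 were included in the original experiments is an assumption, and the function names are hypothetical.

```python
def rotation_angles(step=10):
    """Angles in degrees: 10..50 and 310..350, in steps of `step`."""
    return list(range(10, 51, step)) + list(range(310, 351, step))

def scaling_offsets(step=10):
    """Perspective-transform offsets from -30 to 30, in steps of `step`."""
    return list(range(-30, 31, step))
```

Under these assumptions each source image yields ten rotated variants and seven scaling offsets (one of which, 0, leaves the image unchanged).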
Table 6.8: Dataset4 details.
Image resolution
No. of images
No. of classes
No. of images/class: Not constant
Average inter-class distance: (1.8m, 6.4m)
Figure 6.10: Locations from Dataset4.
Figure 6.11: Images from dataset4.
Dataset5 was created around early January during the winter season. The average
distance between locations was (1.4m, 3.9m) in the (vertical, horizontal) directions
respectively. For data augmentation, scaling and rotational transforms were used
in addition to the 10× method, as for Dataset4. To make the dataset more robust to
changes, Gaussian blur was also added to these images using a [5×5] averaging
kernel with different means in the range of (20,30), in steps of 2.
Table 6.9: Dataset5 details.
Image resolution
No. of images
No. of classes
No. of images/class: Not constant
Average inter-class distance: (1.4m, 3.9m)
Figure 6.12: Locations from Dataset5.
Figure 6.13: Images from dataset5.
Datasets with smaller inter-class distance
Datasets 3-5 were recorded from a human-driven golf cart, so they were subject to variable
speeds in different parts of the campus. For this reason the average distance was
taken as an estimate of inter-class separation, instead of a constant value between
each location. The localization accuracy, i.e. the real world precision achieved from
training models on these datasets, was high (∼25cm). But since the average inter-class distance was much larger than the precision in the case of Datasets 3-5, a further
diagnosis was needed to evaluate the performance of similar models trained on data
at a smaller inter-class distance equal to the precision (∼25cm). In order to test
this, Dataset6 (6a..6e) was created with adjacent classes 25cm
apart. A video stream of one second per location was collected at
35 frames/sec using a cellular phone. Two out of these five datasets were created with
the locations in a straight line, two with locations at a small angle to each other, and
the final one with locations both in straight line and at an angle to each other at some
points. Like previous datasets, both the classification and localization accuracies were
high. Hence two more datasets were created. The locations in both of them were in a
straight line, but the inter-class separation and the viewpoints per class were different.
Dataset7 had a 25cm inter-class distance with a 360 degree viewpoint per class from
a ten second video stream for each location. Thus each location was represented with
varied images. Dataset8 was collected in the same way as Dataset6, discussed
before, but with only 5cm inter-class separation distance. For data augmentation,
scaling (0,-20,20), rotation (30,50,330,350), Gaussian blur (kernel - [5×5], means 25,30) and 10× crop were used for Datasets 6a-6e and Dataset8 while only 10× crop
was used for augmenting Dataset7.
Table 6.10: Datasets with smaller inter-class distance details.
Datasets(6a-6e) Dataset7 Dataset8
No. of images
No. of classes
Images/class(before augmentation)
RMS Inter-class distance
An architecture inspired by the original AlexNet architecture, the winner of the
ImageNet ILSVRC challenge, was used to classify the datasets discussed above. The
original architecture performed very well in the ImageNet challenge with 1000 classes
and it delivered equally promising results in this research as well. The original architecture and the modifications to it used in the experiments are discussed below.
Although many modifications of the original architecture have been produced, AlexNet initially consisted of eight layers - five CONV layers followed by three
FC layers. The last FC layer is connected to a softmax function, which produces a
probability distribution over the classes. Some of
the CONV layers were followed by pooling layers. While the pooling layers reduce
the computation after each layer by decreasing the spatial size of the representation, they also enable the extraction of an abstract hierarchy of filters with increasing
receptive fields. In addition to making a more robust feature for classification, there is
some evidence these higher level features prevent overfitting. Overfitting is a common problem in machine learning, where the trained model has difficulty predicting
unseen data from the test set because it has a huge number of parameters but not enough samples to learn from. This is often due to the complexity of the dataset.
Since pooling shrinks the representation, and with it the number of parameters in the layers that follow, it often helps in reducing
overfitting. AlexNet uses ReLU as the non-linear mapping function, i.e. the activation function. CNNs usually train faster when using ReLU as the activation function.
One more concept was used in this architecture to reduce overfitting: ‘dropout’. Dropout randomly deactivates each neuron with a
certain probability during training, which prevents neurons from relying too heavily on one another. It is worth mentioning here that many researchers believe
this was one of the important factors in making this architecture successful.
Dropout was used in the first two FC layers.
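The softmax and dropout operations described above can be sketched in a few lines. This is an illustration only, not the Torch implementation used in the experiments. The dropout shown is the common "inverted" variant, which rescales the surviving activations during training; the drop probability of 0.5 is an assumed default.

```python
import math
import random

def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(activations, drop_prob=0.5, rng=random):
    """Inverted dropout: zero each activation with probability drop_prob (training only)."""
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

At test time dropout is simply skipped, and the class with the largest softmax probability is taken as the predicted location.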
Figure 6.14: The network’s input is 150,528 dimensional, and the number of neurons in
the network’s remaining layers is given by 253,440-186,624-64,896-64,896-43,264-4096-4096-1000 [39].
The modified architecture used in this research is from [63]. The primary difference
from the original architecture is the addition of batch-normalization layers after all
the CONV and FC layers except for the last one. Often, input data is divided
into batches before being sent into the pipeline. This helps in parallel processing.
However, if the distribution of the inputs to a layer varies widely between these batches, the model might
not train in the expected fashion. Batch normalization is often used to reduce
this internal covariate shift. The code-base for training the models was taken from
the popular imagenet-multiGPU code developed by Mr. Soumith Chintala under the
Torch framework [64]. The hyperparameter values used for training on the datasets are
given below. These values tended to produce good results in most cases.
Table 6.11: Training hyperparameters.
Batch-size No. of epochs learning rate
Momentum Weight decay
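As a brief illustration of the batch-normalization layers mentioned above, the sketch below normalizes a single feature across one mini-batch and then applies a scale and shift. It is a simplified training-time version; the parameter names `gamma`, `beta` and `eps` follow common usage and are assumptions, and the running statistics used at test time are omitted.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```

After this step every mini-batch presents the next layer with roughly zero-mean, unit-variance inputs, regardless of how the raw activations were distributed.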
Time of day invariant
Using a gamma correction method inspired by [55], images were augmented to
different illumination conditions based on different times of the day. For example,
from dusk going into the evening the brightness is usually low, whereas
it’s the opposite in the early morning. The block diagram of the process is
shown below.
Figure 6.15: Block diagram of the algorithm.
The Gamma matrix is the most important part of this block diagram. The user
doesn’t have direct control over the input image, but she does have the ability to
modify the Gamma matrix to get the desired reflectance/perceived sensation (R(x,y)).
Comparing Figure 6.15 and Equation 6.1 below, it can be observed that the mean/median of
the histogram of the image (I(x,y)) or a constant factor in the range of 0-255 is tied
to the term L(x,y). In Figure 6.16 below, the original image is on the left, while the
images in the middle and on the right are gamma corrected images.
\[ \frac{I(x, y)}{L(x, y)} = R(x, y) \quad [55] \tag{6.1} \]
Figure 6.16: (left) Original Image (middle) Brighter-afternoon (right) Darker-evening.
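The effect of gamma correction on brightness can be illustrated with the simplest scalar form, a power-law mapping applied to every pixel. This is not the Gamma-matrix method of [55]; it is a minimal sketch assuming 8-bit intensities and a single hypothetical `gamma` value for the whole image.

```python
def gamma_correct(pixels, gamma):
    """Apply power-law gamma correction to 8-bit intensities (0-255)."""
    return [round(255 * (p / 255) ** gamma) for p in pixels]
```

In this form a gamma below 1 brightens the image (as in the afternoon example) and a gamma above 1 darkens it (as in the evening example), while 0 and 255 are left unchanged.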
Weather invariant
It’s difficult to acquire several datasets for different weather conditions and keep
models trained on these datasets in memory for real time use. Alternatively, if it
were possible to remove the effects of harsh weather from the images, it would be
easier to leverage the power of a single model trained on normal images for the same
purpose. Motivated by this idea, the approaches in [56] and [57] (previously discussed
in Chapter Background/related work) were applied in this research to remove the
effect of heavy rain from images. Since it is difficult to acquire data during heavy
rain, synthetic rain was added to images using Adobe Photoshop [61] (Figure 6.17
(middle)). It is also possible to add snowfall on images using a similar technique [62]
(picture not shown here). However, since snow adds significant changes to the texture
of the environment and ground itself, adding only snowfall to images wouldn’t produce
real world scenarios. This is a well known problem, and actually leads to a different
research domain, as to how to generate mock real world winter images. In conclusion,
experiments were conducted for only rainy conditions as part of this research. To
test the strength of the model trained on normal images in predicting rain-removed
images, random classes were chosen from the validation set in the Google Street
View dataset, followed by adding rain, and then removing the same by the methods
mentioned above. Out of the 166 classes in the Google Street View dataset, rain
was added and removed randomly on 84 classes. A comparison between the two
approaches from [56] and [57] in removing rain is shown in the figure below.
Figure 6.17: (left) Original image, (middle left) Synthetically added rain, (middle right)
Rain removed by the approach from [56], (right) Rain removed by the approach from [57].
If the accuracy of a model is not very high, a hierarchical approach can be undertaken for better results. The objective of a hierarchical approach in this research is
to address multiple problems at a time:
1) Improve precision of GPS sensors
2) Optimize testing time
3) Improve localization accuracy
Every GPS sensor has a precision which is usually known beforehand from various
tests. This approach proposes to improve this precision by dividing the dataset into
smaller parts (circles) containing only a handful of coordinates. The radius of each
part can either be the precision of the sensor or a small distance. If the radius is the
former, then we can consider an arbitrary number of points inside the circle based on
the precision of the sensor. If the radius is the latter, we will consider the number of
points given by the sensor within that distance. The scenario is explained with the
help of the Venn diagram in Figure 6.18.
Figure 6.18: Venn diagram for hierarchical approach where Pn -(latitude,longitude) and
R-GPS precision/smaller region.
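The division of the dataset into small circular regions can be sketched as follows. This is one possible interpretation, not the implementation used in this research: it assumes a greedy grouping in which the first point of each region serves as its centre, and it uses the haversine formula for the distance between coordinates. The function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def group_by_radius(points, radius_m):
    """Greedily assign each point to the first region whose centre is within radius_m."""
    regions = []  # list of (centre, members)
    for point in points:
        for centre, members in regions:
            if haversine_m(centre, point) <= radius_m:
                members.append(point)
                break
        else:
            regions.append((point, [point]))
    return regions
```

A separate, smaller classifier could then be trained on the images of each region, with the GPS reading used only to select which region's model to query.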
According to Esa SiJainti, a Finnish Google maps developer, the precision from
GPS Latitude/Longitude and Google Maps API respectively is:
a) 5-6 decimal places - sufficient in most cases
b) 4 decimal places - appropriate for detail maps
c) 3 decimal places - good enough for centering on cities
d) 2 decimal places - appropriate for countries
Precision related to physical distance from Google Maps:
6 decimal places - 10cm
5 decimal places - 1m
4 decimal places - 11m
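The distances quoted above follow from the length of one degree of latitude, which is roughly 111 km. The short sketch below reproduces the arithmetic; the constant is an approximation, and the east-west figure shrinks with the cosine of the latitude.

```python
import math

METRES_PER_DEGREE_LAT = 111320.0  # approximate length of one degree of latitude

def decimal_place_resolution_m(places, latitude_deg=0.0):
    """Return (north-south, east-west) ground distance of one unit in the given decimal place."""
    step_deg = 10.0 ** -places
    north_south = step_deg * METRES_PER_DEGREE_LAT
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west
```

For example, four decimal places correspond to about 11 m north-south, five to about 1.1 m and six to about 11 cm, in line with the list above.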
Realistically, the GPS sensors available for everyday use are much worse. Experiments are being conducted at present to verify these claims and to test whether
the precision can indeed be improved.
Chapter 7
Table 7.1: Results.
Dataset1 Dataset2
No. of images
No. of classes
Classification Accuracy
Localization Accuracy
Not constant
Not constant
(2cm, 7cm)
Not constant
In the table above, the classification accuracies shown are for the validation sets.
Dataset1 did not produce a very high validation accuracy (∼36.8%). This was most
likely for two reasons. The first is a property of deep neural networks: they need a large number of samples per class for good prediction, which
Dataset1 did not have. However, the primary reason behind the poor performance of the classifier was the nature of the images in Dataset1, which were very
similar in many cases when they were in close proximity but lacked distinctive
features, like areas of empty grassland.
The prediction results from Dataset2 were much better than those from the previous dataset.
As explained in the earlier chapter, this dataset was created much more intelligently,
making each class somewhat distinct from the others with the help of the Google Maps
API. Also, the number of images per class was 4× that of the previous dataset. As a result,
the best validation accuracy for the lowest loss obtained was ∼75.37% after training
for 25 epochs. The loss vs validation accuracy for 25 epochs is shown in Figure 7.1.
Figure 7.1: Validation loss vs number of epochs in Dataset2.
The deep learning classifier produced higher than 95% accuracy for Datasets 3
through 5. Often a classifier produces high accuracy while overfitting, hence
different experiments were conducted to make sure that the high accuracy reported by
the classifier is legitimate. In Figure 7.2, location predictions from the trained model
are shown for a total of 714 samples from the test set of Dataset3. The red
markers represent both the ground truth and the correct predictions from the model,
since they coincide, while the green markers show the locations where the
model’s predictions didn’t match the ground truth.
After averaging the differences between all the incorrect predictions
and their ground truth, it was found that the average differences for the latitudes
and longitudes were .0000020 and .0000022 respectively for Dataset3. Considering that
.0001 represents approximately 10m in Google Maps, the average error shift
vertically (latitude) was about 20cm and horizontally (longitude) was about 22cm
whenever the model predicted incorrectly in the case of Dataset3. Similarly, the localization accuracies were also computed for Datasets 4-5. It should be noted here that the
images collected for each class from Dataset3 onward didn’t have the same amount of
variation in viewpoints as the Google Street View images. The lack of variation wouldn’t
affect the navigation of an autonomous vehicle much, as it typically follows, with minimum variation, the same route as during the creation of the dataset.
The validation accuracy was 97.5% when classifying Dataset3, with an average real
world distance of (20cm, 22cm) between the incorrect predictions and the ground
truth, when each class was located at an average distance of (3.8m, 7.1m) in the vertical
and horizontal directions respectively. Hence ideally, the models should be trained on
the Street View images so that new images can be predicted using that robust model.
This is discussed further in the next chapter.
Figure 7.2: Correct vs incorrect predictions - Dataset3.
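The localization-accuracy calculation described above can be sketched as follows. The sketch assumes that absolute differences were averaged and uses the text's rule of thumb that .0001 degrees is about 10m; the function name and the coordinate pairs in the usage note are hypothetical.

```python
METRES_PER_DEGREE = 10.0 / 0.0001  # rule of thumb from the text: 0.0001 degrees is about 10 m

def mean_localization_error_m(predicted, truth):
    """Mean |lat| and |lon| error in metres over the incorrect predictions only."""
    wrong = [(p, t) for p, t in zip(predicted, truth) if p != t]
    if not wrong:
        return 0.0, 0.0
    lat_err = sum(abs(p[0] - t[0]) for p, t in wrong) / len(wrong)
    lon_err = sum(abs(p[1] - t[1]) for p, t in wrong) / len(wrong)
    return lat_err * METRES_PER_DEGREE, lon_err * METRES_PER_DEGREE
```

With this conversion, an average latitude difference of .0000020 corresponds to 0.2 m, i.e. the 20cm reported for Dataset3.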
Datasets 3 to 5 were created during three different seasons - summer, fall and
winter respectively. Although the time of the year was different and hence the images
varied significantly, many of the locations/ground truth were the same. The results
of cross-testing using the models are shown below in Table 7.2. It shows performance
of each model on all the datasets. Although they performed quite well when the
test set was from the same dataset, they did not perform well when it came to other
test sets. After careful investigation, it was found that this was primarily because of
Figure 7.3: Validation accuracy vs number of epochs in Dataset3.
the change in precision of the ground truth recorded by the GPS sensor during the
collection of each dataset. However, when a model was built on the three datasets
combined together, it improved significantly in predicting the individual datasets from
the three seasons. The results are shown in Table 7.3.
Table 7.2: Cross-testing with the three models.
Dataset3(summer) Dataset4(fall) Dataset5(winter)
Table 7.3: Prediction from the model trained on the three datasets combined.
Test set - datasets combined
Dataset3 Dataset4 Dataset5
As evident from Table 7.4, both the classification accuracy and the localization
accuracy were high in case of Dataset6 and Dataset8. Similar to Datasets 3-5, the
localization accuracy was calculated as the average of the sum of the difference between the incorrect predictions and ground truth value for the same location. It was
Table 7.4: Results - Datasets with smaller inter-class distance.
Dataset6(6a-6e) Dataset7 Dataset8
No. of images
No. of classes
Images/class(before augmentation)
RMS Inter-class distance
Classification accuracy
Localization accuracy
concluded from these results that CNNs are good at differentiating between classes
even if the difference in depth between the locations is small. On the other hand, the
classifier didn’t perform as well in the case of Dataset7. This was because of the number of different
viewpoints included in the same class, making the images in each class varied.
Table 7.5: Weather invariant results.
Original accuracy
Rain added to validation set Rain removed by [56]
Rain removed by [57]
Rain was synthetically added and removed on Dataset2. Table 7.5 shows that
heavy rain alters the dataset significantly, enough to confuse the model and bring
down the accuracy by almost 20%. Surprisingly, the approach from [56] didn’t work
well and actually lowered the validation accuracy further. However, [57] performed well and produced a final validation accuracy almost the same as the original, showing that it was able to remove the effects of heavy rain or at least negate
its effect on the classifier.
Chapter 8
Future Work
The classification accuracy on the Street View datasets wasn’t as high as on the other
datasets. The primary reason behind this is the inclusion of several viewpoints for
each location in the Street View dataset, making the dataset robust to changes, but
also making it harder for the classifier to predict locations from new images. There are a few
different solutions to improve the accuracy:
1) A more powerful classifier than AlexNet such as GoogLeNet or ResNet. These
classifiers have proved to be more powerful in recent years due to their intuitive
learning techniques.
2) Accuracy can be further increased using a hierarchical approach as discussed
under the ‘Proposed Method’. The larger region considered would essentially be
divided into groups of smaller regions with fewer locations, and separate
models would be trained on images from these smaller regions.
3) One of the hyperparameters in the Street View is the heading which essentially
represents the pose. Since the primary objective of this research was to determine the
2D coordinate of the location, all the poses for that location were included in the
same class. This made the classifier’s job harder. Instead, if the pose is taken into
account for future work, two birds can be killed with one stone - (a) pose estimation
and (b) each class representing (x,y,pose) will have more unique images, making it
easier for the classifier to predict, albeit with more classes to predict.
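One way to form classes that represent (x,y,pose) is to split the heading into a fixed number of bins and combine the bin with the location index. The sketch below is only an illustration of that labeling scheme; the number of heading bins (four) and the function name are assumptions, not values from this research.

```python
def pose_class_id(location_index, heading_deg, num_headings=4):
    """Map a (location, heading) pair to a single class index."""
    bin_width = 360 / num_headings
    heading_bin = int((heading_deg % 360) // bin_width)
    return location_index * num_headings + heading_bin
```

With N locations and four heading bins the classifier would have 4N classes, and the predicted class can be decoded back into a location (integer division by four) and a coarse pose (remainder).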
Since it’s difficult to collect images for large regions manually, the ideal way to
conduct the experiments discussed in the former sections would be to train the models
on datasets created out of the Google Street View, and use them to predict test images
while actually driving a car. It should be noted here that this approach would need
the camera configurations more or less similar to the one used for acquiring Google
Street View images. And further for evaluating the performance (test runs), the
ground truth of the test images should be the same as that of Street View i.e. - the
precision and other configurations of the two GPS sensors should be the same.
Figure 8.1: Ground truth of Google Street View vs predicted class from cross testing.
In Figure 8.1 the same images from Figure 7.2 have been predicted by the model
trained on Google Street View data from Dataset3. The red markers are locations
when the model predicted correctly, whereas the green are the incorrect ones. As
evident from the figure, the results weren’t as good as Figure 7.2. This is primarily
because of the reasons discussed before. Provided similar GPS sensors and camera
configurations, this method would work out well.
The same location on a map can appear different at different times of the day due
to factors like traffic, people etc. Adapting to the environment by neglecting dynamic
obstacles during testing has been a hot topic of research for many years, as evident
in [25] and [26]. Combining these methods with the proposed method in this research
would yield more robust results.
Similar to dynamic obstacles, systems need to adapt to different illumination
conditions at different times of the day and weather in order to perform efficiently.
Although a few attempts have been made in this thesis, such as adjusting to different light
conditions by gamma correction and rain removal techniques, a lot of research is still left
to be done in this field. For example, a solution good enough to adapt to snowy
conditions still doesn’t exist, unless new datasets are created from these conditions.
Localization can assist robots in a number of ways and is the primary module in
the navigation system in the world of robotics. As a conclusion to this thesis, an
innovative approach to motion estimation using visual odometry, aided by localization, is shown. This approach was accepted at the EI2017 “Autonomous Vehicles
and Machines” conference and was well received by the reviewers. It is a combination
of two master’s theses. The motion estimation algorithm using visual odometry was
designed by Vishwas Venkatachalapathy as part of his master’s thesis in the Computer Engineering department of Rochester Institute of Technology. Motion estimation using visual odometry is a concept by which the displacement of a vehicle is
tracked through optical flow between consecutive images. The details of the exact
approach are beyond the scope of this document. Interested readers are encouraged to
contact the Computer Engineering department of Rochester Institute of Technology to read his thesis manuscript, or to read the paper “Motion Estimation Using
Visual Odometry and Deep Learning Localization”. As part of his thesis, a stereo
dataset was collected from the university campus the same way Dataset3 was
created in this research. Although it’s a highly efficient method, visual odometry
tends to acquire a small error every few images, resulting in drifting away from the
ground truth over time. With the assistance of a highly accurate localization module, it is possible to keep the vehicle on track. In Figure 8.2, a block diagram of
the complete process is explained. From a higher level of understanding, the Visual
Odometry module calculates the 2D coordinates and pose from a sequence of images
and seeks the help of the localization module after every ‘n’ frames. The localization
module keeps track of the ground truth, which the Visual Odometry module can use to stay on track and help in the overall navigation of the vehicle. The
advantage of this process is that it is entirely vision based. Since encoders/IMUs (for
tracking displacement) are error prone and GPS data (for localization) is not always
precise or available, a vision based approach can be an excellent alternative.
Figure 8.2: Motion estimation using visual odometry and deep learning localization.
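The interaction between the two modules can be sketched as a simple loop. This is a high-level illustration only, not the algorithm from the accepted paper: `vo_steps` stands for the per-frame displacements reported by visual odometry, and `localize` is a hypothetical callback that returns the position predicted by the localization model (or None when no prediction is available).

```python
def track_position(start, vo_steps, localize, n):
    """Integrate visual-odometry displacements, resetting to the localizer every n frames."""
    x, y = start
    path = []
    for frame_index, (dx, dy) in enumerate(vo_steps, start=1):
        x, y = x + dx, y + dy                  # dead-reckoning update (accumulates drift)
        if frame_index % n == 0:
            fix = localize(frame_index)        # hypothetical localization call
            if fix is not None:
                x, y = fix                     # replace the drifted estimate
        path.append((x, y))
    return path
```

Between corrections the estimate drifts with the odometry error, and every n-th frame the drift is removed, so the accumulated error stays bounded by what can build up over n frames.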
Chapter 9
Visual data can prove to be very rich if utilized in the right way for the right objective.
In this research, it has been shown that visual data can aid or even act as a substitute
for traditional GPS based localization for semi-autonomous and autonomous cars.
A validation accuracy greater than 95% was achieved on the datasets created by
driving the golf cart around the campus. This method was able to achieve an average
localization accuracy of (11cm, 23cm) with an average distance of (2.3m, 5.8m) for
(latitude, longitude) respectively between locations in case of the datasets created by
driving a golf cart around the campus. It also achieved a high localization
accuracy (∼27cm) when the classes were a lot closer, in the case of the datasets collected with a
phone. Since the average RMS GPS precision is between 2-10m, the precision achieved
by this method (∼25cm) is a lot higher. It should be noted here that this precision is
without the aid of any error correction system, which could improve it to ∼2-10cm.
After performing experiments on different datasets with variable distances between locations, it was observed that CNNs are capable of giving both high classification
and localization accuracies if the variance in the images in a single class is small.
On the contrary, if the data in a single class has a high variance, it becomes more
difficult for the classifier to predict with high confidence. However the localization
accuracy was promising throughout the experiments on most of the datasets. This
unique observation can be further utilized to determine if the pose should be included
in the same class or form separate classes.
Weather plays a very important role in any vision based application because
of its unstable nature. In this thesis, a few state of the art methods from recent
publications have been utilized to remove the harsh effects of weather from images.
It was clearly evident from the results, that further research needs to be conducted
in this field to make systems entirely resistant to weather effects. Smart cars are
gradually becoming a part of our society. A lot of research is currently under way
to make visual localization methods more robust. The intention of this research
was to provide a solution to an existing localization problem that many cars face in
navigation. With the high accuracies obtained through various experiments, it proved
to be a viable solution.
[1] S. Thrun ; J.J. Leonard; ”Simultaneous Localization and Mapping,” in Springer
Handbook of Robotics, vol 23, no. 7-8, 2008, pp. 871-889
[3] http://echibbs.blogspot.com/2016/06/summer-and-easy-reading-whychange.html
[4] http://www.motorauthority.com/news/1060101 six-cities-named-for-newvehicle-to-vehicle-v2v-communications-trials
[5] http://cs231n.stanford.edu
[6] D. Scaramuzza and F. Fraundorfer, Visual Odometry [Tutorial], IEEE Robotics
& Automation Magazine, vol. 18, no. 4, pp. 80-92, 2011.
[7] M. Fiala and A. Ufkes, Visual odometry using 3-dimensional video input, in
Proceedings - 2011 Canadian Conference on Computer and Robot Vision, CRV
2011, 2011, pp. 86-93.
[8] D. G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J.
Comput. Vis., vol. 60, no. 2, pp. 91-110, 2004.
[9] S. Leutenegger, M. Chli, and R. Y. Siegwart, BRISK: Binary Robust invariant scalable keypoints, in Proceedings of the IEEE International Conference on
Computer Vision, 2011, pp. 2548-2555.
[10] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, BRIEF: Binary robust independent elementary features, in Lecture Notes in Computer Science (including
subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2010, vol. 6314 LNCS, no. PART 4, pp. 778-792.
[11] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, ORB: An efficient alternative to SIFT or SURF, in Proceedings of the IEEE International Conference on
Computer Vision, 2011, pp. 2564-2571.
[12] Z. Chen, O. Lam, A. Jacobson, and M. Milford, Convolutional Neural Network-based Place Recognition, 2013.
[13] T. Lin, J. Hays, and C. Tech, Learning Deep Representations for Ground-to-Aerial Geolocalization, pp. 5007-5015, 2015.
[14] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Bronte, Fast and
effective visual place recognition using binary codes and disparity information,
2014 IEEE/RSJ Int. Conf. Intell. Robot. Syst., no. SEPTEMBER, pp. 3089-3094,
[15] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Gamez, Bidirectional loop closure detection on panoramas for visual navigation, IEEE Intell.
Veh. Symp. Proc., no. JUNE, pp. 1378-1383, 2014.
[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, Caffe, Proc. ACM Int. Conf. Multimed. - MM 14, pp.
675-678, 2014.
[17] R. Collobert, K. Kavukcuoglu, and C. Farabet, Torch7: A matlab-like environment for machine learning, 2011.
[18] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving,
M. Isard, Y. Jia, L. Kaiser, M. Kudlur, J. Levenberg, D. Man, R. Monga, S.
Moore, D. Murray, J. Shlens, B. Steiner, I. Sutskever, P. Tucker, V. Vanhoucke,
V. Vasudevan, O. Vinyals, P. Warden, M. Wicke, Y. Yu, and X. Zheng, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,
arXiv:1603.04467v2, p. 19, 2015.
[19] M. Cummins and P. Newman, FAB-MAP: Appearance-Based Place Recognition
and Mapping using a Learned Visual Vocabulary Model, Proc. 27th Int. Conf.
Mach. Learn., pp. 310, 2010.
[20] M. J. Milford and G. F. Wyeth, SeqSLAM: Visual route-based navigation for
sunny summer days and stormy winter nights, in Proceedings - IEEE International Conference on Robotics and Automation, 2012, pp. 1643-1649.
[21] Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J.,
Fong, P., et al. (2007). Stanley: The robot that won the DARPA Grand Challenge. Springer Tracts in Advanced Robotics, 36, 1-43.
[22] Montemerlo, M., Becker, J., Bhat, S., Dahlkamp, H., Dolgov, D., Ettinger, S.,
Haehnel, D., et al. (2009). Junior: The Stanford entry in the urban challenge.
Springer Tracts in Advanced Robotics (Vol. 56, pp. 91-123).
[23] Urmson, C., Anhalt, J., Bagnell, D., Baker, C., Bittner, R., Clark, M. N., Dolan,
J., et al. (2009). Autonomous driving in Urban environments: Boss and the
Urban Challenge. Springer Tracts in Advanced Robotics (Vol. 56, pp. 1-59).
[24] Likhachev,
[25] Levinson,
[26] Levinson,
tional Conference on Robotics and Automation (pp. 4372-4378). Retrieved
[27] Levinson, J., Askeland, J., Dolson, J., & Thrun, S. (2011). Traffic light mapping,
localization, and state detection for autonomous vehicles. Proceedings - IEEE
International Conference on Robotics and Automation (pp. 5784-5791).
[28] Teichman, A., Levinson, J., & Thrun, S. (2011). Towards 3D object recognition
via classification of arbitrary object tracks. Proceedings - IEEE International
Conference on Robotics and Automation (pp. 4034-4041).
[29] Teichman, A., & Thrun, S. (2012). Tracking-based semi-supervised learning.
The International Journal of Robotics Research, 31(7), 804-818.
[30] Held, D., Levinson, J., & Thrun, S. (2012). A Probabilistic Framework for
Car Detection in Images using Context and Scale. International Conference on
Robotics and Automation, 1628-1634.
[31] Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object Detection with Discriminatively Trained Part Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1-20. Retrieved from
http://cs.brown.edu/~pff/papers/lsvm-pami.pdf
[32] Held, D., Levinson, J., & Thrun, S. (2013). Precision tracking with sparse 3D and
dense color 2D data. Proceedings - IEEE International Conference on Robotics
and Automation (pp. 1138-1145).
[33] Held, D., Levinson, J., Thrun, S., & Savarese, S. (2014). Combining 3D Shape,
Color, and Motion for Robust Anytime Tracking. Robotics: Science and Systems
[34] Ivakhnenko, Alexey (1965). Cybernetic Predicting Devices. Kiev: Naukova
[35] Ivakhnenko, Alexey (1971). "Polynomial theory of complex systems". IEEE
Transactions on Systems, Man and Cybernetics (4): 364-378.
[36] Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model
for a mechanism of pattern recognition unaffected by shift in position". Biol.
Cybern. 36: 193-202. doi:10.1007/bf00344251. PMID 7370364.
[37] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E.,
Hubbard, W., & Jackel, L. D. (1990).
Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information Processing Systems, 396-404.
[38] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2323.
[39] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet
Classification with Deep Convolutional Neural Networks. Advances in
Neural Information Processing Systems 25 (NIPS 2012), 1-9.
[40] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., &
Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database.
2009 IEEE Conference on Computer Vision and Pattern Recognition, 2-9.
[41] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov,
R. R. (2012). Improving neural networks by preventing co-adaptation of feature
detectors. ArXiv e-prints, 1-18. Retrieved from http://arxiv.org/abs/1207.0580
[42] Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted
Boltzmann Machines. Proceedings of the 27th International Conference on Machine Learning, (3), 807-814.
[43] Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve
Neural Network Acoustic Models. Proceedings of the 30th International Conference on Machine Learning (p. 6). Retrieved from
https://web.stanford.edu/~awni/papers/relu_hybrid_icml2013_final.pdf
[44] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R.
(2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
Journal of Machine Learning Research, 15, 1929-1958.
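[45] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for
Online Learning and Stochastic Optimization. Journal of Machine Learning
Research, 12, 2121-2159.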
[46] Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by
a running average of its recent magnitude. COURSERA: Neural Networks for
Machine Learning.
[47] Kingma, D., & Ba, J. (2014). Adam: A Method for Stochastic Optimization.
International Conference on Learning Representations, 1-13.
[48] Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Arxiv, 1-11.
[49] Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional
networks. Lecture Notes in Computer Science (including subseries Lecture Notes
in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8689, pp.
818-833). Springer Verlag.
[50] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., et
al. (2015). Going deeper with convolutions. Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (Vol. 07-12-June-2015, pp. 1-9). IEEE Computer Society.
[51] https://arxiv.org/abs/1602.07261
[52] Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks
for Large-Scale Image Recognition. ImageNet Challenge, 1-10.
[53] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. Arxiv.Org, 171-180.
[54] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., et al. (2010). Theano: a CPU and GPU Math Expression Compiler. Proceedings of the Python for Scientific Computing Conference (SciPy), 1-7. Retrieved from
http://www-etud.iro.umontreal.ca/~wardefar/publications/theano_scipy2010.pdf
[55] Brajovic, V. (2004). Brightness perception, dynamic range and noise: a unifying
model for adaptive image sensors. Computer Vision and Pattern Recognition,
2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference
on, 2, II-189-II-196 Vol. 2. doi:10.1109/CVPR.2004.1315163
[56] X. Fu, J. Huang, X. Ding, Y. Liao and J. Paisley, "Clearing the Skies: A deep
network architecture for single-image rain removal," arXiv:1609.02087v1 [cs.CV],
7 Sep 2016.
[57] Y. Li, R. T. Tan, X. Guo, J. Lu and M. S. Brown, "Rain Streak Removal Using
Layer Priors," CVPR 2016.
[58] https://en.wikipedia.org/wiki/Google_Street_View
[59] https://en.wikipedia.org/wiki/IOS_SDK#Core_Location
[60] https://github.com/robmathers/WhereAmI/blob/master/README.md
[61] http://www.photoshopessentials.com/photo-effects/rain/
[62] http://www.photoshopessentials.com/photo-effects/photoshop-snow/
[63] http://arxiv.org/abs/1404.5997
[64] S. Chintala, "https://github.com/soumith/imagenet-multiGPU.torch", Copyright (c) 2016, Soumith Chintala. All rights reserved.
[65] M. Yu and U.D. Manu, Stanford Navigation Android Phone Navigation System
based on the SIFT image recognition algorithm,
[66] A. R. Zamir and M. Shah, "Accurate Image Localization Based on Google Maps
Street View," Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part
IV. LNCS, vol. 6314, pp. 255-268. Springer, Heidelberg (2010).