Deep Learning Localization for Self-driving Cars

Suvam Bag
sb5124@rit.edu

February 2017

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering

Department of Computer Engineering
Rochester Institute of Technology

Committee Approval:
Dr. Raymond W. Ptucha, Advisor, Associate Professor
Dr. Shanchieh J. Yang, Professor
Dr. Clark G. Hochgraf, Associate Professor

Acknowledgments

I would like to thank the Machine Intelligence Lab of Rochester Institute of Technology (RIT) for providing me the resources as well as the inspiration to complete this project, my adviser Dr. Raymond W. Ptucha for assisting me throughout my thesis, and my colleague Mr. Vishwas Venkatachalapathy for his valuable feedback. I would also like to thank RIT's autonomous people mover team for their help.

Dedication

I dedicate this thesis to my parents for their endless support.

Abstract

Smart cars have been present in our lives for a long time, but only in the form of science fiction. A number of movies and authors have visualized smart cars capable of traveling to different locations and performing different activities. However, this remained a seemingly impossible task, almost a myth, until Stanford and then Google actually created the world's first autonomous cars. The Defense Advanced Research Projects Agency (DARPA) Grand Challenges brought this difficult problem to the forefront and initiated much of the baseline technology that has made today's limited autonomous driving cars possible. These cars will make our roadways safer, our environment cleaner, our roads less congested, and our lifestyles more efficient. Despite the numerous challenges that remain, scientists generally agree that it is no longer impossible. Besides several other challenges associated with building a smart car, one of the core problems is localization. This project introduces a novel approach for advanced localization performance by applying deep learning in the field of visual odometry. The proposed method will have the ability to assist or replace a purely Global Positioning System based localization approach with a vision based approach.

Contents

Signature Sheet
Acknowledgments
Dedication
Abstract
Table of Contents
List of Figures
List of Tables
Acronyms
1 Introduction
  1.1 Motivation
  1.2 Localization and mapping
  1.3 Lack of identification of proper objects
  1.4 Weather/Climate/Time of day
  1.5 Vehicle to vehicle communication
  1.6 Visual odometry
  1.7 Visual odometry in localization
  1.8 Loop closure detection in Simultaneous Localization and Mapping (SLAM)
  1.9 CNN in vision based localization
2 Thesis Statement
3 Thesis Objectives
4 Background/Related Work
  4.1 Artificial intelligence in self-driving cars
  4.2 Localization
  4.3 Deep learning
    4.3.1 History and evolution
    4.3.2 Traditional techniques
    4.3.3 Convolutional Neural Networks
    4.3.4 Case studies
    4.3.5 Deep learning frameworks
5 Time of day/Weather invariant
  5.1 Time of day
  5.2 Weather invariant
6 Proposed method
  6.1 Google Street View
    6.1.1 Field of View
    6.1.2 Heading
    6.1.3 Pitch
  6.2 Google Maps API
  6.3 Dataset1
  6.4 Dataset2
  6.5 Autonomous golf cart
  6.6 Camera
  6.7 GPS ground truth
  6.8 Dataset3
  6.9 Dataset4
  6.10 Dataset5
  6.11 Datasets with smaller inter-class distance
  6.12 Classifier
  6.13 Time of day invariant
  6.14 Weather invariant
  6.15 Hierarchical
7 Results
8 Future Work
9 Conclusion

List of Figures

1.1 Google self-driving car.
1.2 Localization of a smart car [2].
1.3 Effect of different weather/climate [3].
1.4 Multi-hop system of ad-hoc networks [4].
1.5 Classification results from the course CS231n - Convolutional Neural Networks for Visual Recognition [5].
4.1 Stanley [21].
4.2 Flowchart of the Boss software architecture [23].
4.3 Boss (http://www.tartanracing.org/gallery.html).
4.4 (left) GPS localization induces greater than or equal to 1 meter error; (right) no noticeable error in particle filter localization [25].
4.5 Confusion matrix for individual traffic light detection [27].
4.6 (a) Left: Detections returned by Felzenszwalb's state of the art car detector [31]. (b) Right: Detections returned by the algorithm proposed in [30]. False positives are shown in red and the true positives are shown in green.
4.7 Architecture of LeNet-5. Each plane is a feature map, i.e. a set of units whose weights are constrained to be identical [38].
4.8 Nearest Neighbor classification using L1 distance example [5].
4.9 Backpropagation visualized through a circuit diagram [5].
4.10 ReLU activation function, which is zero when x < 0 and then linear with slope 1 when x > 0 [5].
4.11 A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and one output layer [5].
4.12 A cartoon depicting the effects of different learning rates. While lower learning rates give linear improvements, higher rates make the improvements more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters bounce around chaotically, unable to settle in a nice spot in the optimization landscape [5].
4.13 Convolutional layer - input layer [32×32×3], filter dimension [5×5×3], activation maps [28×28×1] (http://image.slidesharecdn.com/case-study-of-cnn-160501201532/95/case-study-of-convolutional-neural-network-5-638.jpg?cb=1462133741).
4.14 Max pooling with a stride of 2 [5].
6.1 Street View car [58].
6.2 Street View images from Dataset1.
6.3 Zoomed in portion of Dataset2: distance between locations based on viewpoint.
6.4 Routes created out of Google Maps API.
6.5 Locations/classes of Dataset2.
6.6 RIT's autonomous golf cart.
6.7 (left) Hikvision Bullet IP camera; (right) cameras on the golf cart.
6.8 Locations from Dataset3.
6.9 Images from Dataset3.
6.10 Locations from Dataset4.
6.11 Images from Dataset4.
6.12 Locations from Dataset5.
6.13 Images from Dataset5.
6.14 The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440-186,624-64,896-64,896-43,264-4096-4096-1000 [39].
6.15 Block diagram of the algorithm.
6.16 (left) Original image; (middle) brighter - afternoon; (right) darker - evening.
6.17 (left) Original image; (middle left) synthetically added rain; (middle right) rain removed by the approach from [56]; (right) rain removed by the approach from [57].
6.18 Venn diagram for the hierarchical approach, where Pn = (latitude, longitude) and R = GPS precision/smaller region.
7.1 Validation loss vs number of epochs in Dataset2.
7.2 Correct vs incorrect predictions - Dataset3.
7.3 Validation accuracy vs number of epochs in Dataset3.
8.1 Ground truth of Google Street View vs predicted class from cross testing.
8.2 Motion estimation using visual odometry and deep learning localization.

List of Tables

6.1 Boundary coordinates of the rectangle.
6.2 Dataset1: parameters.
6.3 Dataset1 details.
6.4 Dataset2: parameters.
6.5 Dataset2 details.
6.6 Configuration of Hikvision cameras.
6.7 Dataset3 details.
6.8 Dataset4 details.
6.9 Dataset5 details.
6.10 Datasets with smaller inter-class distance details.
6.11 Training hyperparameters.
7.1 Results.
7.2 Cross-testing with the three models.
7.3 Prediction from the model trained on the three datasets combined.
7.4 Results - Datasets with smaller inter-class distance.
7.5 Weather invariant results.

Acronyms

2D    two-dimensional
3D    three-dimensional
ANN   Artificial Neural Network
API   Application Program Interface
CNN   Convolutional Neural Network
DARPA Defense Advanced Research Projects Agency
GPS   Global Positioning System
GPU   Graphical Processing Unit
IMU   Inertial Measurement Unit
IP    Internet Protocol
KNN   K-Nearest Neighbor
LIDAR Light Detection and Ranging
RADAR Radio Detection and Ranging
ReLU  Rectified Linear Unit
RMS   Root Mean Square
SLAM  Simultaneous Localization And Mapping
SVM   Support Vector Machine

Chapter 1
Introduction

1.1 Motivation

Research and development of autonomous vehicles is becoming more and more popular in the automotive industry. It is believed that autonomous vehicles are the future of easy and efficient transportation, making for safer, less congested roadways. In 2014, according to the Department of Transportation, besides the human toll of 32,000 deaths and 2.31 million injuries in the US, the costs amounted to $1 trillion. In recent years, nearly all states have passed laws prohibiting the use of handheld devices while driving. Nevada took a different approach. In a first for any state, it passed a law that legalizes texting, provided one does so in a self-driving autonomous car. This places Nevada at the forefront of innovation. Google's vast computing resources are crucial to the technology used in self-driving cars.
Google's self-driving cars memorize the road infrastructure in minute detail. They use computerized maps to determine where to drive, and to anticipate road signs, traffic lights and roadblocks long before they are visible to the human eye. They use specialized lasers, radar and cameras to analyze traffic at a speed faster than the human brain can process. And they leverage the cloud to share information at blazing speed. These self-driving cars have now traveled nearly 1.5 million miles on public highways in California and Nevada. They have driven from San Francisco to Los Angeles and around Lake Tahoe, and have even descended crooked Lombard Street in San Francisco. They drive anywhere a car can legally drive. According to Sebastian Thrun, "I am confident that our self-driving cars will transform mobility. By this I mean they will affect all aspects of moving people and things around and result in a fundamentally improved infrastructure."

Figure 1.1: Google self-driving car.

Two examples include improvement in mobility and more efficient use of parking. Take today's cities: they are full of parked cars. It is estimated that the average car is immobile 96 percent of its lifetime. This situation leads to a world full of underused cars and occupied parking spaces. Self-driving cars will enable car sharing even in spread-out suburbs. A car will come to you just when you need it. And when you are done with it, the car will just drive away, so you won't even have to look for parking.

Self-driving cars can also change the way we use our highways. The European Union has recently started a program to develop technologies for vehicle platoons on public highways. Platooning is technical lingo for self-driving cars that drive so closely together that they behave more like trains than individual cars. Research at the University of California, Berkeley, has shown that the fuel consumption of trucks can be reduced by up to 21 percent simply by drafting behind other trucks. And it is easy to imagine that our highways can bear more cars if cars drive closer together.

Last but not least, self-driving cars will be good news for the millions of Americans who are blind or have brain injury, Alzheimer's or Parkinson's disease. Tens of millions of Americans are denied the privilege of operating motor vehicles today because of issues related to disability, health, or age.

How does the car see the road? Super Cruise and other similar systems do more than just see the road. Using an array of sensors, lasers, radar, cameras, and GPS technology, they can actually analyze a car's surroundings. Super Cruise is a combination of two technologies. The first is the increasingly common adaptive cruise control, which uses a long-range radar (more than 100 meters) in the grille to keep the car a uniform distance behind another vehicle while maintaining a set speed. The second, lane-centering, uses multiple cameras with machine-vision software to read road lines and detect objects. This information is sent to a computer that processes the data and adjusts the electrically assisted steering to keep the vehicle centered in the lane. Because Super Cruise is intended only for highways, General Motors will use the vehicle's GPS to determine its location before allowing the driver to engage the feature. In addition, General Motors is also considering using short-range radars (30 to 50 meters) and extra ultrasonic sensors (3 meters) to enhance the vehicle's overall awareness.
Cars with park-assist systems already have four similar sensors in the front and in the rear of the car. General Motors is also experimenting with cost-effective LIDAR units, which are more powerful and accurate than ultrasonic sensors. It is unclear whether LIDAR will make it into the same vehicle as Super Cruise.

A smart car is a complex but wonderful example of technical innovation by humankind. Thousands of design constraints, complex hardware and software, as well as sophisticated machine learning are involved in creating this technical marvel. While some of these challenges are associated with the car itself, others relate more to the car's environment, such as communication with other cars, called Vehicle to Vehicle Communication (V2V), or positioning the car locally with the help of LIDAR scans, lane markers, etc. Some of the key challenges are described below.

1.2 Localization and mapping

One of the most important aspects of a self-driven car is knowing where it is. In technical terms this is called localization. Localization essentially lays the groundwork for autonomous driving, and it cannot be achieved from a single sensor or a single simple algorithm. Another task self-driving cars must tackle is mapping. Even with a static map from Google Maps or other satellite-generated maps, this is not enough for navigation, because the real-world environment is quite dynamic. To adapt to these changes, the underlying technology of the car creates a local map of its own and integrates it with the global static map from Google Maps to identify the exact location of the car. This local map is created by the various sensors present in the car, such as a high dimensional LIDAR, a RADAR and multiple cameras. After getting a sense of its local environment, the car can continue executing its navigation algorithm to reach its destination point. Most autonomous cars, such as Google's smart car, utilize the A* path planning algorithm along with a number of filters to predict and improve the position estimate [1]. A* is at the very core of the path planning algorithm in the Google car. This algorithm has been improved considerably using dynamic planning, control parameters, stochastic solutions, path smoothing, etc. Path planning, navigation, and obstacle avoidance have numerous challenges associated with them, many of which have not been solved yet. Although it is difficult to visualize the complete algorithm in a single image, Figure 1.2 tries to portray how navigation algorithms can be combined with local maps to drive an autonomous vehicle.

Figure 1.2: Localization of a smart car [2].

One of the primary challenges faced by any car, smart or not, is to locate itself and plan its route based on inaccurate GPS coordinates. The average root mean square GPS precision of Google Maps is 2-10 meters. This precision is sufficient for a human being to drive and plan his/her next move, but it is not sufficient for a self-driving car. For example, let's consider that the car has to take a left turn upon reaching a certain point. If this point is not exactly accurate, the car will certainly sway out of its lane and cause an accident. This is exactly where the sensors help. While the detailed algorithm used by Google's smart cars is unknown, a fraction of that knowledge does help us understand how to use these sensors to improve localization. Their primary tools are Google Maps for path planning, Google Street View for learning maps, and different sensors for lane marking, corner detection, obstacle detection, etc. To improve the sensor data, various filters are also used, such as Monte Carlo and Kalman filters.
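To make the filtering idea concrete, the sketch below fuses wheel-odometry predictions with noisy GPS fixes using a one-dimensional Kalman filter. It is a minimal illustration only; the noise variances and measurement values are made-up assumptions, not numbers from any production system.

# Minimal 1-D Kalman filter fusing odometry predictions with noisy GPS fixes.
# All numbers (noise variances, measurements) are illustrative assumptions.

def kalman_step(x, p, u, z, q=0.05, r=9.0):
    """One predict/update cycle.
    x, p : current position estimate and its variance
    u    : odometry displacement since the last step (prediction)
    z    : GPS position measurement
    q, r : process (odometry) and measurement (GPS) noise variances
    """
    # Predict: move the estimate by the odometry reading, inflate uncertainty.
    x_pred = x + u
    p_pred = p + q

    # Update: blend the prediction with the GPS fix, weighted by uncertainty.
    k = p_pred / (p_pred + r)          # Kalman gain
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

if __name__ == "__main__":
    x, p = 0.0, 1.0                     # initial estimate and its variance
    odometry = [1.0, 1.0, 1.0, 1.0]     # metres travelled per step
    gps = [1.4, 1.8, 3.2, 4.1]          # noisy GPS readings (metres)
    for u, z in zip(odometry, gps):
        x, p = kalman_step(x, p, u, z)
        print(f"estimate = {x:.2f} m, variance = {p:.3f}")

Because the GPS variance is much larger than the odometry variance in this toy setup, the filter leans on odometry and only gently corrects toward the GPS fixes, which is exactly the behavior described above.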
1.3 Lack of identification of proper objects

Despite using state-of-the-art object recognition algorithms, the Google smart car lacks the ability to identify what we may consider to be simple objects. For example, it fails to differentiate between pedestrians and police officers. It cannot differentiate between puddles and potholes, nor can it recognize people performing specific actions, such as a police officer signaling the car to stop. It also has difficulty parking itself in a parking lot. Although these might seem relatively easy compared to what has already been achieved, they are actually difficult and require sophisticated machine learning. The better the learning algorithm, the better the car will be able to recognize these actions or objects on the road.

1.4 Weather/Climate/Time of day

The capability of most sensors changes drastically under different weather conditions and at different times of day. For example, a LIDAR performs very well in clear conditions, but its accuracy drops significantly in rainy or snowy conditions. A lot of research and improvement is needed on this front so that autonomous cars become globally acceptable in every country and in every state, besides performing with the same precision throughout the day. In Figure 1.3, the effect of different weather/climate is shown for the same location. It is clearly evident that a model trained on images from any single type of climate would behave very differently on the others. A lot of research has been, and still is, going on to make models invariant to weather and climate.

Figure 1.3: Effect of different weather/climate [3].

1.5 Vehicle to vehicle communication

Moving on to a different challenge faced by a system of smart cars, it is essential that such a system in the future allows the cars to communicate with each other. This is where vehicle to vehicle (V2V) communication comes into the picture. V2V communication proposes that cars will communicate with each other in real time, creating an inter-communicated smart system. V2V communication is not necessarily limited to smart cars. Nearly every manufacturer today is working on implementing this technology in their cars, hoping to significantly reduce the number of road accidents every year. V2V communication lets other cars know a vehicle's position, speed, future intentions, etc. Naturally, we can see why this technology will be imperative for smart cars. One of the many challenges of designing such a system is to create a multi-hop system of ad-hoc networks (Figure 1.4) powerful, secure, and effective enough to communicate in real time. A number of models have been proposed to make this realizable and a lot of research is going on in this field.

Figure 1.4: Multi-hop system of ad-hoc networks [4].

1.6 Visual odometry

Visual odometry (VO) [6] is a form of localization whose implementation on cars has been proposed for some time. This mode of localization is not limited to smart cars only. In fact, there has been significant research going on in universities as well as in manufacturing industries to develop VO models.
A critical task of VO is to identify specific landmarks or objects in the surrounding environment of the moving vehicle to improve the car's vision, position, communication with other cars, etc.

1.7 Visual odometry in localization

VO plays a key role in an autonomous car's path navigation. For example, suppose an autonomous car is driving through an unknown territory where it cannot connect to the satellite map due to a weak signal, or receives inaccurate data due to GPS error. Based on previously learned databases, the vehicle can identify certain key objects to help determine its location. A number of well-known feature extraction techniques are used for VO, namely Scale-Invariant Feature Transform (SIFT) [7], Speeded Up Robust Features (SURF) [8], Binary Robust Invariant Scalable Keypoints (BRISK) [9], BRIEF [10], ORB [11], etc. Although SIFT has been shown to be highly dependable in extracting a large number of features, it is a computationally intensive algorithm and not suitable for real time applications like this.
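To make the classical feature-based step concrete, the sketch below detects and matches ORB keypoints between two consecutive frames using OpenCV. It is an illustrative fragment only: the frame file names are placeholders, and a full VO system would go on to estimate relative pose (e.g., via the essential matrix) from the matched points.

# Sketch of the classical feature-matching step used in visual odometry,
# using ORB from OpenCV. The image paths below are placeholders.
import cv2

img1 = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance (ORB descriptors are binary strings).
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# The best matches would normally feed an essential-matrix / pose estimate.
print(f"{len(matches)} matches; best distance = {matches[0].distance}")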
1.8 Loop closure detection in Simultaneous Localization and Mapping (SLAM)

Simultaneous Localization and Mapping (SLAM) has been the driving force behind autonomous cars' primary algorithms for a long time. It involves a number of difficult tasks which have been partially to completely achieved over the years. However, one of the challenges associated with SLAM is solving the loop closure problem using visual information in life-long situations. The term loop closure detection is primarily defined as the computer's ability to detect whether or not it is in the same place after traveling a certain distance. The difficulty of this task lies in the strong appearance changes that a place suffers due to dynamic elements, illumination, weather or seasons. Based on research in academia as well as in industry, this area of SLAM hasn't been perfected yet. Obviously, vision based localization plays a key role here. There are some well-known existing approaches to this problem, such as FAB-MAP, a topological appearance based SLAM. There are multiple papers on this approach and it has been regarded as one of the more reliable and stable approaches [19]. Since changes in illumination and weather affect place recognition to a significant extent, [20] proposed a different approach called Seq-SLAM to address this problem. Seq-SLAM removes the need for global matching by calculating the best candidate matching location within every local navigation sequence, instead of calculating the single most likely location given the current image. Following these novel approaches, there have been other improved methods which have used traditional and modified feature detectors to address this problem [14]. A very interesting part of this problem is bidirectional loop closure detection [15], which tests the autonomous car's ability to detect its location irrespective of its direction of approach.

1.9 CNN in vision based localization

In recent years deep learning has revolutionized the world of machine learning, pattern recognition, computer vision, robotics, etc. In many cases, it has been found that deep learning is able to produce better detection ability due to its sophisticated filtering through multiple layers. CNNs are feed-forward artificial neural networks where the individual neurons are tiled in such a way that they respond to overlapping regions in the visual field. In the world of deep learning, CNNs have mostly proved to yield better results than traditional techniques which use hand-crafted features for detection [12]. In 2015, T. Lin, J. Hays and their colleagues [13] presented an excellent research paper using CNNs for ground to aerial geolocalization. This paper once again proves the importance of vision based localization and its ability to improve current localization methods, including GPS. Figure 1.5 shows an example from a live CNN running in the browser from the course CS231n - Convolutional Neural Networks for Visual Recognition [5], predicting a random test image in real time.

Figure 1.5: Classification results from the course CS231n - Convolutional Neural Networks for Visual Recognition [5].

The primary advantage of using CNNs in vision based localization is their ability to preprocess images through their layers. CNNs have proved to be more effective at matching images or identifying objects in recent years and may make traditional feature detection techniques obsolete in future years. One of the major advantages of the CNN architecture is its co-learning of features with the classifier, giving it an advantage over hand-crafted features paired with conventional classifiers. The disadvantage of CNNs lies in their requirement of huge training datasets, which makes training computationally expensive. However, despite the arduous training time, processing test frames is extremely efficient, making CNNs suitable for real time applications. Deep learning frameworks like Caffe [16], Torch [17], Tensorflow [18], etc. have addressed the training problem with the help of GPUs and have made the learning process much more efficient. To combat the need for large datasets, a concept called transfer learning is often used. Transfer learning is a technique used to learn new layer weights in a neural network for a given dataset from filters pre-learned on a different dataset. It often works out quite well due to the ability of neural networks to modify their weights, provided the datasets are somewhat similar in nature. Since these weights are often open sourced on the internet, users can take advantage of them quite easily, thus avoiding the need to train huge datasets end to end.
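As an illustration of the transfer-learning idea just described, the sketch below loads ImageNet-pretrained weights in PyTorch and relearns only the final layer for a set of location classes. The framework choice, the number of classes, and the decision to freeze all earlier layers are illustrative assumptions, not the exact configuration used later in this thesis.

# Minimal transfer-learning sketch (illustrative): reuse ImageNet-pretrained
# filters and retrain only the final layer to predict location classes.
import torch
import torch.nn as nn
import torchvision.models as models

num_locations = 50                               # assumed number of place classes

model = models.resnet18(pretrained=True)         # filters pre-learned on ImageNet
for param in model.parameters():
    param.requires_grad = False                  # freeze the transferred weights
model.fc = nn.Linear(model.fc.in_features, num_locations)  # new trainable head

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# Training would then loop over (image, location_label) batches as usual.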
Chapter 2
Thesis Statement

Identifying the location of an autonomous car with the help of visual sensors can be an excellent alternative to traditional approaches like Global Positioning Systems (GPS), which are often inaccurate or unavailable due to insufficient signal coverage. Recent revolutionary research in deep learning has produced excellent results in different domains, leading to the proposition of this thesis, which intends to solve the problem of localization in smart cars using deep learning models on visual data.

Chapter 3
Thesis Objectives

The primary objective of this thesis is to develop an efficient algorithm that helps an autonomous car localize itself more accurately and in as close to real time as possible. This will be done by utilizing deep CNNs to identify its location. Experiments will be done to determine if Google Street View can be used either for the supervised component of localization on the RIT campus, or as transfer learning. If new data has to be collected, a camera with GPS tagged frames will be utilized, whereby experiments will determine the amount of data that needs to be recorded per locale for accurate localization. The efficacy of the CNN models across different weather/light conditions will be investigated. An efficient model will not only improve smart cars' localization, but it may also improve traditional cars' vision and might be able to reduce the number of accidents. End-to-end learning will be compared with fine tuning existing models trained on ImageNet [17], like AlexNet [18], VGGNet [19], GoogLeNet [20], and ResNet [21], including rigorous manipulation of different hyperparameters. Extensive experiments will be conducted before establishing this method as a novel alternative to localization by GPS.

Chapter 4
Background/Related Work

4.1 Artificial intelligence in self-driving cars

As of 2016, autonomous vehicles are no longer products of science fiction or just long-term visions of the research and development departments of different corporations. The success began with the self-driving car "Stanley", which won the 2005 DARPA Grand Challenge [21].

Figure 4.1: Stanley [21].

Besides other traditional sensors, this car had five laser range finders for measuring cross-sections of the terrain ahead up to 25m in front of the vehicle, a color camera for long-range road perception, and two 24 GHz RADAR sensors for long range detection of large obstacles. Despite winning the challenge, it left open a number of important problems, such as adapting to dynamic environments from a given static map or the ability to differentiate between objects with subtle differences. One of the important results observed from this race, and highly relevant to this thesis research, was the fact that during 4.7% of the challenge the GPS reported 60cm error or more. This highlighted the importance of online mapping and path planning in the race. It also proved that a system solely dependent on GPS coordinates for navigation in self-driving cars is not sufficient, as the error tolerance for autonomous vehicles is around 10cm. The real time update of the global map based on the local environment helped Stanley eliminate this problem in most cases.

The 2005 DARPA Grand Challenge was conducted on a desert track. Stanley won this challenge, but a lot of new challenges were foreseen from the results. The next challenge conducted by DARPA was in an urban environment. Stanford introduced the successor to Stanley, named "Junior" [22]. Junior was equipped with five laser rangefinders, a GPS-aided inertial navigation system and five radars as its environment perception sensors. The vehicle had an obstacle detection range of up to 120 meters and the ability to attain a maximum velocity of 30mph. A combination of planning and perception, followed by control, helped in its navigation. The software architecture primarily consisted of five modules: sensor interfaces, perception modules, navigation modules, drive-by-wire interface and global services. The perception modules were responsible for segmenting the environment data into moving vehicles and static obstacles. They also provided precision localization of the vehicle relative to the digital map of the environment. One of the major successes of this car was its completion of the journey with almost flawless static obstacle detection. However, it was found that the GPS-based inertial position computed by the software system was generally not accurate enough to perform reliable lane keeping without sensor feedback. Hence Junior used an error correction system for accurate localization with the help of feedback from other local sensors.
This fine-grained localization used two types of information: road reflectivity and curb-like obstacles. The reflectivity was sensed using the laser range finders pointed towards the ground. The filter for localization was a 1-D histogram filter, which estimated the vehicle's lateral offset relative to the GPS coordinates provided for the desired path. Based on the reflectivity and the curbs within the vehicle's visibility, the filter would estimate the posterior distribution of the lateral offset. In a manner similar to reinforcement learning, it favored offsets for which lane marker reflectivity patterns aligned with the lane markers or the road side from the supplied path coordinates. It also penalized offsets for which an observed curb would reach into the driving corridor of the assumed coordinates. As a result, at any point in time the vehicle estimated a fine-grained offset to the location measured by the GPS-based system. A lateral offset of one meter was common in the challenge. Without this error correction system, the car would have gone off the road or often hit a curb. It was observed that velocity estimates from the pose estimation system were much more stable than the position estimates, even when GPS feedback was not strong. X and Y velocities were particularly resistant to jumps because they were partially observed by wheel odometry.

Junior finished second in the DARPA Urban Challenge, but the winner was a different autonomous vehicle called "Boss", developed by the Tartan Racing Team. The team was composed of students, staff and researchers from many institutions, such as Carnegie Mellon University and General Motors. Many members of this team, as well as Junior's, would later be associated with the Google self-driving car project. This included the Tartan team's technology leader, Chris Urmson, and one of the most celebrated people in the autonomous vehicles industry, Junior's team leader Sebastian Thrun.

Very similar to Junior, Boss also had an arsenal of sensors with a very advanced and complex system architecture [23]. The software architecture of Boss was broadly divided into the following layers: 1) motion planning subsystem, 2) perception subsystem, 3) mission planner, 4) behavioral system. The motion planning subsystem was responsible for handling static and dynamic objects under two different scenarios, structured driving and unstructured driving. While structured driving concentrates on typical marked roads, unstructured driving concentrates on more difficult scenarios such as parking lots. Unstructured driving is more difficult because of the lack of lanes and markings; for it, Boss used a four dimensional search space (position, orientation, direction of travel). In both cases, the end result is a trajectory good enough to reach the destination. The perception module was responsible for sensing the local environment with the help of sensors. This module created a local dynamic map, which was merged with the static global map, while simultaneously localizing the vehicle. The mission planner also played a key role in the navigation of the car. Using only the navigation wasn't good enough to win the race; this is where the mission planner helped. It took all the constraints of the different routes into consideration and created an optimal path. Finally, the behavioral system took all the information provided by the mission planner and fed it to the motion planner. It also handled the errors when there were problems. This subsystem was roughly divided into three sub-components: lane driving, intersection handling, and goal selection. The flowchart of the process is shown in Figure 4.2.

Figure 4.2: Flowchart of the Boss software architecture [23].

Like Stanley and Junior, a local path is generated in Boss using a navigation algorithm called Anytime D* [24]. This search algorithm is efficient at updating its solution to account for new information concerning the environment, such as a dynamic obstacle. Since D* is derived from dynamic programming, it already has all the possible routes stored in memory and can simply update and select a new route avoiding the dynamic obstacle's coordinates.
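Anytime D* belongs to the same family of heuristic graph searches as the A* algorithm mentioned in Section 1.2. A minimal A* planner over a toy occupancy grid is sketched below for intuition; the grid and costs are made up, and a real planner would search over a lattice of vehicle poses rather than grid cells.

# Minimal A* search over a 2-D occupancy grid (0 = free, 1 = obstacle).
# Anytime D*, as used by Boss, builds on the same idea but can repair its plan
# incrementally when the map changes; this sketch is for intuition only.
import heapq

def astar(grid, start, goal):
    h = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])    # Manhattan heuristic
    rows, cols = len(grid), len(grid[0])
    open_set = [(h(start, goal), 0, start, None)]           # (f, g, node, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue                                        # already expanded
        came_from[node] = parent
        if node == goal:                                    # reconstruct the path
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + h((nr, nc), goal), ng, (nr, nc), node))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))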
The motion planner stores a set of trajectories to a few immediate local goals near the centerline path to create a robust desired path, in case it needs to avoid static and dynamic obstacles and modify its path. The local goals are placed at a fixed longitudinal distance down the centerline path but vary in lateral offset from the path to provide several options for the planner. [23] presents a detailed description of all the modules and their sub-components. To keep the discussion related to this research, the localization module is discussed in greater detail than the others. In particular, roadmap localization is one of the most important requirements of autonomous vehicles. Boss was capable of either estimating road geometry or localizing itself relative to roads with known geometry. One of the primary challenges in urban driving is responding to abrupt changes in the shape of the roads and to local disturbances. Given that the shape and location of paved roads change infrequently, their approach was to localize relative to paved roads and estimate the shape of dirt roads, which change geometry more frequently. The pose was incorporated into the coordinate frame from the GPS feedback. To do this, the system combined data from a commercially available position estimation system and measurements of road lane markers with an annotated map. Eventually the nominal performance was improved to a 0.1m expected planar positioning error. Normally a positioning accuracy of 0.1m would be sufficient to blindly localize within a lane, but the correction signals were frequently disrupted by even small amounts of overhead vegetation. Once disrupted, reacquisition of the signal took approximately half an hour. Thus, relying on the corrections was not viable for urban driving. Furthermore, lane geometries might not be known to meter accuracy a priori. It was critically important to be localized correctly relative to the lane boundaries, because crossing over the lane center could have disastrous consequences. During the test, the error in the lateral position reached up to 2.5m, more than enough to put the vehicle either off the road or in another lane if not compensated for. In conclusion, the roadmap localization system played an important role during the challenge. Its error estimate was less than 0.5m for most of the challenge, but there were more than 16 minutes when the error was greater than 0.5m, with a peak error of 2.5m. Without the roadmap localization system being active, Boss would most likely have been either off the road or in a wrong lane for a significant amount of time.
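Both Junior's lateral-offset filter and Boss's roadmap localizer are, at their core, low-dimensional Bayesian filters over an offset relative to the mapped road. A minimal 1-D histogram filter in that spirit is sketched below; the bin spacing, motion blur, and measurement likelihood are illustrative assumptions, not taken from either system.

# Sketch of a 1-D histogram filter over lateral offset (metres), in the spirit
# of the filters described for Junior and Boss. All values are assumed.
import numpy as np

offsets = np.arange(-2.0, 2.05, 0.1)            # candidate lateral offsets in metres
belief = np.ones_like(offsets) / len(offsets)   # start with a uniform belief

def motion_update(belief, kernel=(0.25, 0.5, 0.25)):
    """Diffuse the belief into neighbouring bins to reflect motion uncertainty."""
    spread = np.convolve(belief, kernel, mode="same")
    return spread / spread.sum()

def measurement_update(belief, likelihood):
    """Multiply the prior by the per-bin measurement likelihood and renormalise."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Toy measurement: lane-marker reflectivity suggests the true offset is near +0.5 m.
likelihood = np.exp(-0.5 * ((offsets - 0.5) / 0.3) ** 2)

belief = motion_update(belief)
belief = measurement_update(belief, likelihood)
print(f"most likely lateral offset: {offsets[np.argmax(belief)]:.1f} m")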
Following the success of the various autonomous vehicles in the 2007 DARPA Urban Challenge, Google started building its own self-driving car, collaborating with various academic institutions and industries. Unfortunately, not many research papers are available on the Google self-driving car, but many of its sensor fusion technologies, algorithms, etc. were inspired by the ones used in the self-driving cars that participated in the DARPA challenges. Research that is pertinent to this thesis is discussed in the following sections.

Figure 4.3: Boss (http://www.tartanracing.org/gallery.html).

4.2 Localization

Localization is one of the most important factors in autonomous driving, if not the most important one. Although localizing a robot within a certain tolerance is possible using existing technologies like GPS without much effort, it is often not sufficient for autonomous driving. Precise localization with respect to the dynamically changing environment is quite challenging and is a problem researchers have been trying to tackle for quite some time. Levinson et al. [25] proposed a method of high-accuracy localization of mobile vehicles by utilizing static maps of urban environments but updating them with the help of GPS, IMU, wheel odometry, and LIDAR data acquired by the vehicle. It also removed the dynamic objects in the environment, providing a 2-D surface image of ground reflectivity in the infrared spectrum with 5cm pixel resolution. For the final step, a particle filter method was used for correlating LIDAR measurements with the map. The primary contribution of this research was an innovative method to separate the dynamic obstacles from the static ones and create a final map which could be used for autonomous driving. It addressed the problem of environment dynamics by reducing the map to only those features that were very likely to be static. In particular, using the 3-D LIDAR information, only the flat road surface was retained, thereby removing the imprints of potentially dynamic objects like non-stationary cars. The resulting map was then simply an overhead image of the road surface, where the image brightness corresponded to the infrared reflectivity. Once the map was built, a particle filter method was used to localize the vehicle in real time. This system was able to track the location of the vehicle with a relative accuracy of about 10cm in most cases. Experiments were also conducted to track the vehicle using only GPS data; this failed within just 10 meters, proving that GPS alone is insufficient for autonomous driving.

Figure 4.4: (left) GPS localization induces greater than or equal to 1 meter error; (right) no noticeable error in particle filter localization [25].
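The particle-filter step in [25] can be summarised as weighting pose hypotheses by how well the reflectivity observed under each hypothesis matches the prior map, then resampling. The sketch below shows only that weight-and-resample core, with a synthetic scoring function standing in for the real map correlation; it is not the authors' implementation.

# Core of a particle-filter measurement update and resampling step, in the
# spirit of the map-matching localizer in [25]. The score function and data
# below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
particles = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # (x, y) pose hypotheses
weights = np.ones(len(particles)) / len(particles)

def measurement_update(particles, weights, match_score):
    """Re-weight each pose hypothesis by how well its predicted scan matches the map."""
    w = weights * match_score(particles)
    return w / w.sum()

def resample(particles, weights, rng):
    """Draw a new, equally weighted particle set in proportion to the weights."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

# Placeholder score: pretend the map correlation peaks at pose (0.3, -0.2).
match_score = lambda p: np.exp(-np.sum((p - np.array([0.3, -0.2])) ** 2, axis=1))

weights = measurement_update(particles, weights, match_score)
particles = resample(particles, weights, rng)
print("posterior mean pose:", particles.mean(axis=0))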
While [25] considered the dynamic obstacles on the map as binary data, i.e. either true or false, [26] treats them as probabilities, resulting in a probabilistic grid map where every cell is represented by its own Gaussian distribution over remittance values. Using offline SLAM to align multiple passes of the same environment, possibly separated in time by days or even months, it was possible to build an increasingly robust understanding of the world that could be exploited for localization. Instead of having to explicitly decide whether each measurement in the grid was or was not part of the static environment, a sum of all observed data would be considered and the variances from each section of the map would be modeled. Each cell in the final map would store both the average infrared reflectivity observed at that particular location and the variance of those values, very similar to a Kalman filter. Like [25], the 2-dimensional histogram filter comprised a motion update, to reduce confidence in the estimate based on motion, and a measurement update, to increase confidence in the estimate based on sensor data. Although the motion update is theoretically quadratic in the number of cells, in practice considering only the neighboring cells is perfectly acceptable. The motion update would be followed by the measurement update, in which the incoming laser scans would be used to refine the vehicle's velocity. The results were quite impressive: during a ten-minute drive, an RMS lateral correction of 66cm was necessary, and the localizer was able to correct large errors of up to 1.5 meters. The resulting error after localization was extremely low, with an RMS value of 9cm. The autonomous car was able to drive in densely populated urban environments like 11th Avenue in downtown Manhattan, which would previously have been impossible without such high precision in localization. In conclusion, this method was able to improve the precision of the best GPS/IMU systems available by an order of magnitude, both vertically and horizontally, thus enabling decimeter-level accuracy, which is more than sufficient for autonomous driving.

Detection of traffic lights is another important factor in autonomous driving. Although there has been past research on algorithms to detect traffic lights by applying simple computer vision techniques, they have never been applicable in the real world because of various issues. According to [27], the reliable systems existing at the time for determining traffic light state information required explicit communication between a traffic signal and the vehicle. [27] presents a passive camera-based pipeline for traffic light state detection, using imperfect localization and prior knowledge of traffic light locations. Taking advantage of temporal information from recorded video data and location data from GPS, the proposed technique estimated the actual light location and state using a histogram filter. One of the problems arising from vision data is the inability to distinguish between tail lights and traffic lights, especially at night. Hence [27] reduced the problem to two parts: a) inferring the image region which corresponded to the traffic light, and b) inferring its state by analyzing the acquired intensity pattern. Ideally, both problems could be solved by taking advantage of temporal consistency. The choice of the detection grid as a reference frame assumed several structured error components, allowing the light's position within the grid to change slowly over time. Given this temporal constraint, and with the vision algorithm performing within the limits of reasonable expectations, a histogram filter could be applied to infer the image region of the light and determine its color. This approach was able to get fairly good results, achieving 91.7% accuracy for individual light detections and 94.0% accuracy across intersections at three times of the day. The confusion matrix of the classification results for individual light detections is shown in Figure 4.5.

Figure 4.5: Confusion matrix for individual traffic light detection [27].
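Accuracy figures like those above are conventionally summarized with a confusion matrix over the light states. A minimal sketch of computing one with scikit-learn follows; the label lists are made-up placeholders, not results from [27].

# Building a confusion matrix for traffic-light state classification.
# The ground-truth and predicted label lists below are made-up placeholders.
from sklearn.metrics import accuracy_score, confusion_matrix

states = ["red", "yellow", "green"]
y_true = ["red", "red", "green", "yellow", "green", "red"]
y_pred = ["red", "green", "green", "yellow", "green", "red"]

print(confusion_matrix(y_true, y_pred, labels=states))  # rows = truth, cols = prediction
print("accuracy:", accuracy_score(y_true, y_pred))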
Since correct object detection in real time can play a very important role in how autonomous cars deal with different situations, a lot of research has been done in this field as well. [28] suggests a new track classification method, based on a mathematically principled method of combining log odds estimators. The method was fast enough for real time use and non-specific to object class, while performing well (98.5% accuracy) on the task of classifying correctly-tracked, well segmented objects into car, pedestrian, bicyclist, and background classes. A common problem in the world of machine learning is the availability of sufficient data for training. Another problem is recognizing new data using a model trained on a different dataset. There are also cases where labeled data is unavailable, or where segmentation is preferred over classification. Different kinds of learning algorithms work in each case, such as unsupervised learning in the last case. [29] proposes a method based on the expectation-maximization (EM) algorithm. EM iteratively 1) trains a classifier and 2) extracts useful training examples from unlabeled data by exploiting tracking information. Given only three examples per object class, their research reports the same accuracy as a fully supervised approach on a large amount of data. Robotics applications can usually benefit a lot from semi-supervised learning, as they often face situations where they have to make decisions from unforeseen examples in a timely manner.

In reality, the behavior of an autonomous vehicle depends much more on the proper detection of other cars on the road than on other objects, as localizing other cars in its view is essential to the decision making process. Despite being highly important, efficiently detecting cars in real-world images is fairly difficult because of various problems like occlusion, unexpected occurrences, etc. [30] takes advantage of context and scale to build a monocular, single-frame, image based car detector. The system proposed by this paper uses a probabilistic model to combine various evidences for both context and scale to locate cars in real-world images. By using a calibrated camera and localization on a road map, it was possible for the authors of the paper to obtain context and scale information from a single image without the use of a 3D laser. A latent SVM was used to learn a set of car templates representing six different car orientations for the outline of the car, referred to as "root" templates, and six "part" templates containing more detailed representations of different sections of a car in an image. Each of these templates was convolved with the gradient of the test image at multiple scales, with probable locations at high responses. For each bounding box, two scores were computed based on how scale-appropriate the size of the box was given its location in the image. This helped in removing some of the false positives. The scores were computed by estimating the real world height of the object using the camera's known position and orientation.
Considering that the autonomous vehicle had some sort of localization and mapping system, the search for cars could be further fine-tuned by eliminating unlikely locations in the image, like the sky or a tree, essentially limiting the search region to the roads. Using the appearance, scale and context scores, a new prediction for each bounding box was estimated using the dual form of L2-regularized logistic regression.

Figure 4.6: (a) Left: Detections returned by Felzenszwalb's state of the art car detector [31]. (b) Right: Detections returned by the algorithm proposed in [30]. False positives are shown in red and the true positives are shown in green.

This algorithm achieved an average precision of 52.9%, which was 9.4% better than the baseline at the time. The primary contribution of this paper was achieving a good accuracy using only a vision based system, thus advancing the idea of eliminating the need for expensive sensors like the LIDAR or the RADAR. The results are shown in Figure 4.6. It must be noted here that although these results were impressive at the time, more recent deep learning based object detection frameworks have proved to be better at identifying and localizing objects in images. Some of these frameworks are discussed in further detail in the next section.

A lot of research has also been done on precise tracking of other vehicles, leveraging both 3D and 2D data. [32] takes advantage of 3D sparse laser data and dense color 2D data from the camera to obtain accurate velocity estimates of mobile vehicles by combining points from both sources into a dense colored point cloud. A precise estimate of the vehicle's velocity could be obtained using a color-augmented search algorithm to align the dense color point clouds from successive time frames. Improving on [32], [33] proposes a method which combines 3D shape, color and motion cues to accurately track moving objects in real time. Adding color and motion information gave a large benefit, especially for distant objects or objects under heavy occlusion, where detailed 3D shape information might not be available. This approach claimed to outperform the baseline accuracy by at least 10% with a runtime overhead of 0.7 milliseconds.

M. Yu and U.D. Manu in [65] developed an Android application for navigation at Stanford using GPS tagged visual data. They used only 114 training images, with three to five images per location for 35 locations. Hand-tuned features and linear classifiers were used for classification, with techniques like SIFT, K-means and RANSAC. This paper obtained a validation accuracy of 71% on 640×480×3 resolution images and 42% on 480×420×3 and 320×240×3 resolution images. The prediction time for a query image was 50 seconds on average, hence not real time. A.R. Zamir and M. Shah in [66] used an approach somewhat similar to this research. They created a dataset using 10k images from Google Street View, with the images at a 12m separation. For each location, the dataset had five images, of which four were side-view images and the fifth covered the upper hemisphere view. This paper used SIFT for its features and a nearest neighbor tree (FLANN) as its classifier. 60% of the images from the test set were predicted within 100 meters of the ground truth. This paper also used a concept called the Confidence of Localization (COL) to improve the accuracy.
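Results of the form "60% of test images within 100 meters of ground truth" are computed from the great-circle distance between predicted and true coordinates. A small helper in that spirit is sketched below; the sample coordinates are arbitrary and do not come from any of the cited datasets.

# Great-circle (haversine) distance between predicted and ground-truth GPS
# coordinates, used to score localization as "within N metres of ground truth".
# The sample coordinates are arbitrary.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Distance in metres between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

predictions = [(43.0845, -77.6749), (43.0860, -77.6790)]   # predicted (lat, lon)
ground_truth = [(43.0848, -77.6752), (43.0851, -77.6768)]  # true (lat, lon)

errors = [haversine_m(*p, *g) for p, g in zip(predictions, ground_truth)]
within_100m = sum(e <= 100 for e in errors) / len(errors)
print(errors, within_100m)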
4.3 Deep learning

4.3.1 History and evolution

The last five years, starting from the year 2012, have ushered in a lot of success in the world of machine learning, especially due to the boom in deep learning. Although it may seem that deep neural networks were invented very recently, their roots go back several decades. Although these early architectures did not have the exact structure that is present today, their underlying concept is very similar. Before diving into the detailed working mechanism of Convolutional Neural Networks (CNNs), revisiting their origin and why they became successful in recent years can lead to a better understanding of deep learning. [34] presented the first general, working learning algorithm for supervised, deep, feedforward multilayer perceptrons. In 1971, [35] described a deep network with 8 layers, trained by a computer identification system known as "Alpha". Other deep learning architectures, especially those built from ANNs, date back to 1980 [36]. The architecture of this network was relatively simple compared to the networks that are present today. It was composed of alternating cells, known as simple and complex cells, in a sandwich type of architecture used for unsupervised learning. While the simple cells had modifiable parameters, the complex cells were used for pooling. Due to various constraints, one of them being the limited processing power of the hardware, these networks didn't perform quite as well as alternative techniques.

Backpropagation, one of the fundamental concepts in learning a network, was first applied by Yann LeCun et al. to a deep neural network for the purpose of recognizing handwritten ZIP codes on mail for the US Postal Service [37]. The input of the network consisted of normalized images of isolated digits. Despite being applied decades ago, it produced excellent results with a 1% error rate for the specific application. Due to hardware constraints, it wasn't suitable for general use at the time; training the full network took approximately 3 days. The first convolutional neural network with backpropagation as we know it today was proposed by Yann LeCun et al. in [38] in 1998 for document recognition. This was a 6-layer network composed of three convolutional layers, two pooling (subsampling) layers and a fully connected layer at the end. The name of the network was LeNet-5. A detailed explanation of each layer in a convolutional neural network is given in the next section. Although these kinds of networks were very successful in handling smaller images and problems like character or word recognition, it was thought until 2011 that they wouldn't be able to handle larger, more complex images and problems, hence the continued use of traditional methods like object detectors built on hand-tuned features and classifiers.

Figure 4.7: Architecture of LeNet-5. Each plane is a feature map, i.e. a set of units whose weights are constrained to be identical [38].
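For orientation, the conv-pool-conv-pool-fully-connected layout of Figure 4.7 can be written in a few lines in a modern framework. The sketch below follows that general layout in PyTorch rather than reproducing LeNet-5 exactly; the layer sizes and activations are illustrative.

# A LeNet-style convolutional network (conv -> pool -> conv -> pool -> fc),
# written in PyTorch for illustration; layer sizes follow the general layout
# of Figure 4.7 rather than reproducing LeNet-5 exactly.
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # C1, S2
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # C3, S4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNetStyle()(torch.randn(1, 1, 32, 32))   # one 32x32 grayscale digit
print(logits.shape)                                # torch.Size([1, 10])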
Besides furthering the understanding of deep neural networks and showing how to mold them to achieve good results, this paper contributed in many other important ways by introducing various important building blocks of neural networks, such as ReLUs and a new regularization technique called "dropout". Many prominent researchers believe that multiple factors contributed to the success of this network, including:

1) Data - an increase from the order of 10^3 to 10^6 in the number of samples used to train the model compared to previous techniques.
2) Computational power - an NVIDIA GPU with CUDA library support, providing approximately a 20x speedup.
3) Algorithm:
   a) Deeper: more layers (8 weight layers)
   b) Fancy regularization: dropout [41]
   c) Fancy non-linearity: ReLU [42]

These terms are explained in detail in the next few sections. Detailed examples and results are discussed in the Case studies section.

4.3.2 Traditional techniques

4.3.2.1 K-Nearest Neighbor

The simplest classifier in machine learning is the K-Nearest Neighbor classifier. There is practically no learning involved in this classifier; rather, the prediction is made by direct comparison of training and testing images. The choice of distance, i.e. how the difference between pixels is measured, is itself a hyperparameter and plays an important role in the Nearest Neighbor classifier. Two distances are commonly used:

1) L1 distance:
$d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$ (4.1)

2) L2 distance:
$d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$ (4.2)

Figure 4.8: Nearest Neighbor classification using L1 distance example [5].

Although this is an elementary classifier, it still outperforms random guessing. A random guess would produce 10% accuracy on the CIFAR-10 dataset, which has 10 classes, whereas the Nearest Neighbor classifier using the L1 distance produced approximately 38.6% accuracy. K-Nearest Neighbor is a modified version of the Nearest Neighbor classifier, in which the top 'k' closest images in the training dataset are chosen instead of the single closest image. These images are then used to vote on the label of the test image. When k=1, the classifier reduces to the Nearest Neighbor. Higher values of 'k' have a smoothing effect that makes the classifier more resistant to outliers. The value of 'k' and the other hyperparameters are often chosen by experimenting on the validation set. The primary disadvantages of this method are: a) the classifier must remember all of the training data for comparison during testing, leading to memory constraints when the dataset is large, and b) classifying a test image is computationally expensive.
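To make the procedure concrete, a minimal NumPy sketch of the (K-)Nearest Neighbor prediction described above is given below. It assumes images have been flattened into row vectors and labels are integer class indices; all names are illustrative rather than taken from any particular code base.

import numpy as np

def knn_predict(train_images, train_labels, test_image, k=1, metric="l1"):
    """Predict the label of one test image by comparing raw pixels.

    train_images: (N, D) array of flattened training images
    train_labels: (N,) array of integer class labels
    test_image:   (D,)  flattened query image
    """
    diff = train_images - test_image                 # broadcast over all N rows
    if metric == "l1":
        dists = np.abs(diff).sum(axis=1)             # equation 4.1
    else:
        dists = np.sqrt((diff ** 2).sum(axis=1))     # equation 4.2
    nearest = np.argsort(dists)[:k]                  # indices of the k closest images
    votes = train_labels[nearest]
    return np.bincount(votes).argmax()               # majority vote among the k neighbors

Note that every call scans the entire training set, which is exactly the memory and test-time cost discussed above.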
4.3.2.2 Support Vector Machine

Moving on from direct inter-pixel comparison, a more complex relation can be built between key features of images and used to build a classifier. This approach is more intuitive and robust. It has primarily two components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. These two fundamental concepts are used in almost all classifiers, from linear classifiers like SVMs to neural networks and CNNs. The only difference is that the functions become more complex as they are applied in more complex networks.

$f(x_i, W, b) = W x_i + b$ (4.3)

In the above equation, the image $x_i$ has all of its pixels concatenated into a single column vector of shape [D × 1]. The matrix W (of size [K × D]) and the vector b (of size [K × 1]) are the parameters of the function, where D is the size of each image and K is the total number of classes in the dataset. The parameters in W are commonly called the weights, and b is called the bias vector because it influences the output scores without interacting with the actual data $x_i$. After training, only the learned weights need to be kept; new test images can simply be forwarded through the function and classified based on the computed scores. Classifying a test image then involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images as in the K-Nearest Neighbor approach. A linear classifier computes the score of a class as a weighted sum of all of its pixel values across all three of its color channels. Depending on precisely what values are set for these weights, the function has the capacity to reward or punish certain colors at certain positions in the image based on the sign of each weight.

The loss function measures how far the predicted scores are from the actual scores; the lower the loss, the more reliable the prediction from the classifier. The Multiclass SVM loss for the i-th example is

$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$ (4.4)

where $L_i$ is the loss for the i-th example, $y_i$ is the label that specifies the index of the correct class, $s_j = f(x_i, W)_j$ is the score for the j-th class given the image $x_i$, and $\Delta$ is a hyperparameter (the margin) that keeps the SVM loss positive.

The loss function stated above has one problem left unaccounted for: it can admit an entire set of similar W that all classify the training examples correctly, i.e. every condition is met, including $L_i = 0$ for all i. This ambiguity is usually removed by adding a regularization penalty R(W) to the loss function. The most commonly used regularization is the L2 norm shown in 4.5, which intuitively "rewards" smaller weights through an element-wise quadratic penalty over all the parameters.

$R(W) = \sum_k \sum_l W_{k,l}^2$ (4.5)

It should be noted that the regularization term is not a function of the data but only of the weights. Adding the data loss and the regularization loss, the full Multiclass SVM loss becomes

$L = \frac{1}{N} \sum_i L_i + \lambda R(W)$ (4.6)

where N is the number of training examples and λ is a hyperparameter weighting the regularization penalty. This hyperparameter is usually determined by cross-validation.
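A minimal, vectorized sketch of equations 4.3-4.6 is shown below (NumPy; the bias term is omitted for brevity and the variable names are illustrative).

import numpy as np

def multiclass_svm_loss(W, X, y, delta=1.0, lam=0.5):
    """Full Multiclass SVM loss of equation 4.6 (data loss + L2 regularization).

    W: (K, D) weight matrix, X: (N, D) images as rows, y: (N,) correct class indices.
    """
    N = X.shape[0]
    scores = X.dot(W.T)                                  # (N, K), equation 4.3 without bias
    correct = scores[np.arange(N), y][:, None]           # score of the true class, s_{y_i}
    margins = np.maximum(0, scores - correct + delta)    # hinge terms of equation 4.4
    margins[np.arange(N), y] = 0                         # the j == y_i term is excluded
    data_loss = margins.sum() / N
    reg_loss = lam * np.sum(W ** 2)                      # R(W), equation 4.5
    return data_loss + reg_loss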
Now that a loss can be computed for each prediction, it has to be used to improve future predictions. This is where backpropagation comes into the picture. To understand backpropagation, a few other terms need to be explained.

1) Gradient - Equation 4.3 intuitively tells us that the weight matrix W needs to be set, or updated, in the best possible manner to get the right score for each prediction. There are various ways to initialize this weight matrix, but none of them will ever be perfect; hence the value of W must be updated iteratively, making it slightly better each time. We need to find a direction in weight-space that lowers the loss. The trick to finding this direction is a fundamental concept of calculus: the derivative. Since the derivative along each dimension of the input space gives the slope of the loss function, it points toward the direction of steepest descent for a given step size.

$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ (4.7)

When the function of interest takes a vector of numbers instead of a single number, the derivatives are called partial derivatives, and the gradient is simply the vector of these partial derivatives along each dimension. The gradient gives the slope of the loss function along every dimension, which we can use to make an update.

W_gradient = evaluate_gradient(loss, data, W) (4.8)

where W_gradient is the gradient and evaluate_gradient is the function that evaluates it.

2) Learning rate - The step size, or learning rate, plays one of the most important roles in training a network. Intuitively, the learning rate is how far we step in the direction provided by the gradient.

3) Backpropagation - This can be defined as a way of computing gradients of expressions through recursive application of the chain rule.

Figure 4.9: Backpropagation visualized through a circuit diagram [5].

In Figure 4.9, the entire circuit can be visualized as a mini network in which the gates play the key role in shaping the final output. On receiving some input, every gate gets activated and produces an output value. It also computes the local gradient of its inputs with respect to that output value. Similar to the neurons in a neural network, as explained later, these gates are mutually independent and do not affect each other's outputs. Once the loss is computed from the forward pass, each gate learns the gradient of its output value with respect to the final output of the entire circuit. Due to the chain rule, gates with smaller gradient values gradually lose their importance, while the others gain importance. In this way the gates communicate with each other, learn about the effect of each input on the final output, and shape the network accordingly.

4) Gradient descent/parameter update - After computing the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called gradient descent. There are different types of gradient descent techniques. The vanilla gradient descent update is shown below:

W += -learning_rate * W_gradient (4.9)

where W is the weight matrix and W_gradient is the gradient. It is evident from 4.8 and 4.9 how the weights are updated based on the loss, the gradient and the gradient descent rule. This loop is essentially at the core of all neural networks. The momentum update is another approach that almost always enjoys better convergence rates on deep networks:

v = mu * v - learning_rate * W_gradient (4.10)
W += v (4.11)

where W, W_gradient and the learning rate are the same as in equation 4.9. Equation 4.10 integrates the velocity and 4.11 integrates the position. v is initialized at zero, and mu is an additional hyperparameter. Other parameter update methods include Adagrad [45], RMSprop [46] and Adam [47].
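The two update rules above translate directly into a few lines of NumPy; this is only a sketch of equations 4.9-4.11 with illustrative default values.

import numpy as np

def sgd_step(W, W_gradient, learning_rate=1e-3):
    """Vanilla parameter update of equation 4.9."""
    return W - learning_rate * W_gradient

def momentum_step(W, v, W_gradient, learning_rate=1e-3, mu=0.9):
    """Momentum update of equations 4.10 and 4.11.

    v is the velocity, initialized to zeros and carried between iterations.
    """
    v = mu * v - learning_rate * W_gradient   # integrate velocity (4.10)
    W = W + v                                 # integrate position (4.11)
    return W, v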
4.3.2.3 Neural Networks

Until now, the input variable, i.e. the image, has been assumed to be linearly related to the output, i.e. the score. However, this is rarely the case for real-world data, which is why linear classifiers do not always perform well. For example, the relation between the image and the score could instead take the form shown in 4.12:

$s = W_2 \max(0, W_1 x)$ (4.12)

where $W_1$ could be, for example, a [100×3072] weight matrix transforming the image into a 100-dimensional hidden vector. The function $\max(0, W_1 x)$ is a non-linearity. Finally, the matrix $W_2$ would then be of size [10×100], giving a vector of [1×10] class scores. The non-linearity plays an important role here, as it is what separates the equation above from that of a linear classifier. Equation 4.12 describes a 2-layer neural network. Figure 4.9 can also be considered a 2-layer neural network, and the gates (addition and multiplication) are essentially activation functions. Activation functions play a key role in both neural networks and convolutional neural networks, and a lot of research has been focused on them in recent years. A few activation functions are commonly used:

1) Sigmoid -
$\sigma(x) = \frac{1}{1 + e^{-x}}$ (4.13)

2) Tanh -
$\tanh(x) = 2\sigma(2x) - 1$ (4.14)

3) ReLU [42] -
$f(x) = \max(0, x)$ (4.15)

ReLU is one of the most commonly used activation functions in neural networks and CNNs. It was found to greatly accelerate (e.g. by a factor of 6 in [39]) the convergence of stochastic gradient descent compared to the sigmoid and tanh functions, arguably because of its linear, non-saturating form. The implementation of this function is also fairly easy, as it only involves thresholding at zero. Unfortunately, ReLU units can be fragile during training and can become ineffective inside the network. For example, a large gradient passing through a ReLU neuron could cause the weights to update in such a way that the neuron never activates on any datapoint again, making all future gradients through that unit zero. That is, ReLU units can irreversibly die during training, since they can get knocked off the data manifold. As much as 40% of the network may become dead (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. Hence setting the right learning rate is very important while training a neural network.

Figure 4.10: ReLU activation function, which is zero when x < 0 and then linear with slope 1 when x > 0 [5].

4) Leaky ReLU [43] -
$f(x) = \mathbb{1}(x < 0)\,\alpha x + \mathbb{1}(x \geq 0)\,x$ (4.16)

where α is a small constant.

A typical neural network is modeled as a collection of neurons arranged in multiple layers, forming an acyclic graph. The most common layer type in a regular neural network is the fully connected layer, in which every neuron is connected to all the neurons of the adjacent layers, while neurons within a single layer share no connections. The final layer of a neural network usually represents class scores in the case of classification and real-valued quantities in the case of regression; unlike the other layers, this final layer has no activation function.

Figure 4.11: A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and one output layer [5].

Training neural networks can be tricky given the many hyperparameters involved, such as the kind of weight initialization and the pre-processing of the data by mean subtraction, normalization, etc.
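To make equation 4.12 and the ReLU non-linearity concrete, a minimal forward pass of such a two-layer network is sketched below (NumPy; biases are omitted and the shapes follow the example dimensions above).

import numpy as np

def two_layer_scores(x, W1, W2):
    """Forward pass of the 2-layer network in equation 4.12: s = W2 * max(0, W1 * x).

    x:  (3072,)      flattened 32x32x3 image
    W1: (100, 3072)  first-layer weights transforming the image into a hidden vector
    W2: (10, 100)    second-layer weights mapping the hidden vector to class scores
    """
    hidden = np.maximum(0, W1.dot(x))   # ReLU non-linearity, equation 4.15
    return W2.dot(hidden)               # (10,) vector of class scores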
A recently developed technique by Ioffe and Szegedy [48] called Batch Normalization reduces the burden of careful weight initialization by explicitly forcing the activations throughout the network to take on a unit Gaussian distribution from the very beginning of training. Usually a BatchNorm layer is inserted immediately after fully connected layers and before non-linearities.

There are several ways of controlling neural networks to prevent overfitting: 1) L2 regularization, 2) L1 regularization, 3) max-norm constraints and 4) dropout. Dropout is an extremely effective, simple and recently introduced regularization technique [44] that complements the other methods (L1, L2, max-norm). Dropout is implemented during training by keeping a neuron active only with some probability p (a hyperparameter), and setting it to zero otherwise.

Observing how the loss changes while training a neural network is useful, as it is evaluated on individual batches during forward propagation. Figure 4.12 is a cartoon diagram showing the loss over time; the effect of different learning rates can be observed from the diagram. The second important quantity to track while training a classifier is the validation/training accuracy. The model should be trained carefully so that it does not overfit the training set, which would cause it to perform poorly on unforeseen test examples. Other quantities that can be monitored are the ratio of the weight updates to the weight magnitudes, the activation/gradient distributions per layer, first-layer visualizations, etc.

Figure 4.12: A cartoon depicting the effects of different learning rates. With lower learning rates the improvements are roughly linear; with higher learning rates they look more exponential. Higher learning rates decay the loss faster, but they get stuck at worse values of loss (green line), because there is too much "energy" in the optimization and the parameters bounce around chaotically, unable to settle in a nice spot in the optimization landscape [5].
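Dropout as described above is commonly implemented in its "inverted" form, where the surviving activations are rescaled by 1/p at training time so that nothing extra has to be done at test time. The sketch below assumes that variant, which differs slightly from the original formulation in [44]; names and defaults are illustrative.

import numpy as np

def dropout(activations, p=0.5, train=True):
    """Inverted dropout: keep each neuron active with probability p during training."""
    if not train:
        return activations                              # test time: identity
    mask = (np.random.rand(*activations.shape) < p) / p  # rescale survivors by 1/p
    return activations * mask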
4.3.3 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are very similar to ordinary neural networks in the sense that they are made of neurons that have learnable weights and biases; in the case of CNNs, the weight parameters are filter coefficients. Each layer receives some inputs, performs a convolution operation and optionally follows it with an activation function. The whole network still expresses a single differentiable score function from the input to the output, with a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer. Regular neural networks receive an input in the form of a concatenated vector and transform it through multiple hidden layers, where each neuron is fully connected to all neurons in the previous layer but completely independent of the other neurons in its own layer. Given an input image of size 200×200×3, a single fully connected neuron in the first hidden layer would have 120,000 weights, which is computationally expensive and often unnecessary. CNNs improve on regular neural networks in this case: their neurons are arranged in 3 dimensions - width, height and depth (the depth being 3 for RGB input images). These neurons in a layer are connected only to a small region of the layer before it, instead of to all of its neurons in a fully-connected manner. Moreover, the final output layer has dimensions 1×1×(number of classes), as the CNN architecture reduces the full image into a single vector of class scores arranged along the depth dimension.

A simple CNN is a sequence of layers, each of which transforms one volume of activations into another through a differentiable function. Traditionally a CNN architecture has three kinds of layers: the convolutional layer, the pooling layer and the fully-connected layer. Other layers are of course squeezed in between in some architectures to improve the results, depending on the input data: activation function layers, batch normalization [48], dropout, etc. In summary:

1) A ConvNet architecture is a list of layers that transform the image volume into an output volume (e.g. holding the class scores).
2) There are a few distinct types of layers - convolutional, fully-connected, ReLU, pooling, etc.
3) Each layer accepts an input 3D volume and transforms it into an output 3D volume through a differentiable function.
4) Each layer may or may not have parameters (e.g. convolutional/fully-connected layers do, ReLU/pooling layers don't).
5) Each layer may or may not have additional hyperparameters (e.g. convolutional/fully-connected/pooling layers do, ReLU doesn't).

The commonly used layers are briefly described below.

A) Convolutional layer (CONV layer)

The convolutional layer is the core building block of a CNN and does most of the computational heavy lifting. These layers learn filter weights by performing a convolution operation between the filters and the input volume along its width and height. The filters must have the same depth as the input volume, i.e. for the initial input layer the filter depth is 3, since the image itself has a depth of 3 (RGB). A 2-dimensional activation map is produced as the output of each filter. Intuitively, the network learns filters that activate when they see some type of visual feature, such as an edge in the early layers or more abstract and complex concepts in the higher layers of the network. Each filter produces its own activation map, all of which are stacked together along the depth dimension. In Figure 4.13, the input volume is 32×32×3; if two filters of dimension 5×5×3 are applied, two activation maps of size [28×28×1] are produced. Two terms need to be mentioned here to understand the internal dimensions of the network: 1) stride - the rate at which the filter slides spatially during the convolution operation, e.g. if the stride is 1, the filter is moved one pixel at a time, and 2) padding - zeros padded around the border of the input volume, giving the user control over the spatial size of the output volume; it is often used to preserve the spatial size so that the input and output width and height are the same.

Figure 4.13: Convolutional layer - input layer [32×32×3], filter dimension [5×5×3], activation maps [28×28×1] (http://image.slidesharecdn.com/case-study-of-cnn160501201532/95/case-study-of-convolutional-neural-network-5-638.jpg?cb=1462133741).

To summarize, the CONV layer:
1) Accepts a volume of size W1×H1×D1.
2) Requires four hyperparameters: the number of filters (K), their spatial extent (F), the stride (S), and the amount of zero padding (P).
3) Produces a volume of size W2×H2×D2, where W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1, and D2 = K.
4) With parameter sharing, it introduces F·F·D1 weights per filter, for a total of (F·F·D1)·K weights and K biases.
5) In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, offset by the d-th bias.
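The arithmetic in the summary above can be checked with a few lines of Python; the example call reproduces the Figure 4.13 dimensions (a sketch assuming no zero padding).

def conv_output_shape(W1, H1, D1, K, F, S, P):
    """Output volume and parameter count for a CONV layer with K filters of extent F,
    stride S and zero padding P, applied to a W1 x H1 x D1 input."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    weights = F * F * D1 * K          # shared weights across spatial positions
    biases = K
    return (W2, H2, D2), weights, biases

# Example from Figure 4.13: 32x32x3 input, two 5x5x3 filters, stride 1, no padding
print(conv_output_shape(32, 32, 3, K=2, F=5, S=1, P=0))   # ((28, 28, 2), 150, 2)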
B) Pooling layer

The pooling layer helps reduce the spatial size of the representation, and thereby the number of parameters and the computation in the network. Pooling also enables the network to learn a series of progressively more abstract concepts. There are a few common types of pooling, such as max pooling (Figure 4.14), average pooling and L2-norm pooling. The pooling layer:
1) Accepts a volume of size W1×H1×D1.
2) Requires two hyperparameters: the spatial extent F and the stride S.
3) Produces a volume of size W2×H2×D2, where W2 = (W1 - F)/S + 1, H2 = (H1 - F)/S + 1, and D2 = D1.
4) Introduces zero parameters, since it computes a fixed function of the input.
5) It is not common to use zero-padding for pooling layers.

Figure 4.14: Max pooling with a stride of 2 [5].

C) Fully-connected layer (FC layer)

From a mathematical point of view, the neurons in both the CONV layers and the FC layers compute dot products, hence they have the same functional form. The only difference between the two layers is that the neurons in a CONV layer are connected only to a local region of the input and share parameters, while the neurons in an FC layer are independent of each other but connected to all the activations in the previous layer. Hence FC activations can be computed with a matrix multiplication followed by a bias offset.
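Before moving on to specific architectures, the pooling arithmetic above can be illustrated with a naive loop-based sketch; real frameworks implement this far more efficiently, so this is only to make the output-size formula concrete.

import numpy as np

def max_pool(volume, F=2, S=2):
    """Max pooling over an input of shape (H1, W1, D) with window F and stride S."""
    H1, W1, D = volume.shape
    H2 = (H1 - F) // S + 1
    W2 = (W1 - F) // S + 1
    out = np.zeros((H2, W2, D))
    for i in range(H2):
        for j in range(W2):
            window = volume[i * S:i * S + F, j * S:j * S + F, :]
            out[i, j, :] = window.max(axis=(0, 1))   # depth slices pooled independently
    return out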
4.3.4 Case studies [5]

1) LeNet [37] - The first successful applications of CNNs were developed by Yann LeCun in the 1990s. Of these, the best known is the LeNet architecture, which was used to read ZIP codes, digits, etc.

2) AlexNet [39] - The first work that broadly popularized CNNs in computer vision was AlexNet, developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. AlexNet won the ImageNet ILSVRC challenge in 2012, outperforming the runner-up by a margin of about 10% top-5 error. It had an architecture very similar to LeNet, but was deeper. The architecture is explained in further detail in the 'Proposed method' chapter.

3) ZF Net [49] - The ILSVRC 2013 winner was a CNN from Matthew Zeiler and Rob Fergus, which became known as ZFNet. This network was created by tweaking the AlexNet architecture hyperparameters: the convolutional layers in the middle were enlarged, and the stride and filter size of the first layer were decreased.

4) GoogLeNet [50] - The ILSVRC 2014 winner was a CNN from Google. This paper introduced a new building block called the inception module, forming a different type of architecture. The inception module dramatically reduced the number of parameters in the network to 4 million, compared to AlexNet's 60 million. Additionally, the paper uses average pooling instead of FC layers at the top of the ConvNet, eliminating a large number of parameters. There are several follow-up versions to GoogLeNet, most recently Inception-v4 [51].

5) VGGNet [52] - The runner-up in ILSVRC 2014 was the CNN architecture called VGGNet. The primary contribution of this paper was showing that the depth of the network plays a critical role in its performance. Their recommended network contains 16 CONV/FC layers and features an extremely homogeneous architecture that only performs 3×3 convolutions and 2×2 pooling throughout the entire network.

6) ResNet [53] - The Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features unique skip connections and heavy use of batch normalization. Like GoogLeNet, the architecture removes the fully connected layers at the end of the network. ResNets are currently considered by many to be the state-of-the-art CNN architecture.

4.3.5 Deep learning frameworks

1) Theano [54] - Theano is a Python library that allows the user to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It was first developed at the University of Montreal and has been powering large-scale, computationally intensive scientific investigations since 2007. Theano features tight integration with NumPy, a Python library of optimized numerical routines, and it offers GPU support and dynamic C code generation for evaluating expressions faster.

2) Caffe [16] - Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying convolutional neural networks, and it provides GPU support through the CUDA library. Caffe has a strong online community which users can take significant advantage of, e.g. for pretrained models.

3) Torch [17] - Torch is a scientific computing framework with wide support for machine learning algorithms. The frontend uses the scripting language LuaJIT with an underlying C/CUDA implementation in the backend. LuaJIT being one of the fastest scripting languages, Torch provides excellent speed, especially since it was built with GPU support in mind from the start. Some of the core features include:
• a powerful N-dimensional array
• lots of routines for indexing, slicing, transposing, ...
• an interface to C, via LuaJIT
• linear algebra routines
• neural network and energy-based models
• numeric optimization routines
• fast and efficient GPU support
• embeddability, with ports to iOS, Android and FPGA backends
Among others, it is used by the Facebook AI Research group, IBM, Yandex and the Idiap Research Institute.

4) TensorFlow [18] - TensorFlow is the most recent of these frameworks, released by Google in 2015. It was developed with a focus on easy access to machine learning models and algorithms, ease of use and easy deployment on heterogeneous machines such as mobile devices.

Chapter 5

Time of day/Weather invariant

5.1 Time of day

The primary focus of this research is to develop a vision based approach which can substitute for, or assist, a traditional GPS based localization method. The results of this process are affected by the quality of the datasets and the models trained on them. For example, a model trained on a dataset created from images captured between 9am and 10am will not be a good predictor for images taken at a different time of the day, such as the late afternoon or evening.
This is primarily because there is a significant change in the illumination of the images at different times of the day. Hence, a method that could filter out this illumination change, so that a single model could be used at any time of the day, would help a great deal. Although this is a problem for this research, it is not new in the computer vision world, and a lot of work has been done in the past to resolve it. An algorithm inspired by [55] is used later in this research, and the results looked promising. [55] presents a model of brightness perception that receives High Dynamic Range (HDR) radiance maps and outputs Low Dynamic Range (LDR) images while retaining the important visual features. The algorithm proposed in that paper is motivated by a popular assumption about human vision: that it responds to local changes in contrast rather than to global stimulus levels. Under this assumption, the primary goal of the paper is to find the reflectance/perception gain L(x,y) such that dividing the input I(x,y) by it produces the reflectance, or perceived sensation, R(x,y) of the scene, in which the local contrast is appropriately enhanced.

$I(x, y) \cdot \frac{1}{L(x, y)} = R(x, y)$ [55] (5.1)

5.2 Weather invariant

Similar to the time of day, weather plays an equally important role in prediction quality. While techniques for removing the effect of drastic changes to an image due to inclement weather (heavy rain or snow) have not been perfected yet, a good amount of research has recently been done on removing rain streaks. [56] takes advantage of the strength of convolutional neural networks to learn both high-level and low-level features from images. The paper proposes a deep convolutional neural network called DerainNet, which learns the non-linear mapping between rainy and clean image detail layers from data. Each image is decomposed into a low-frequency base layer and a high-frequency detail layer, and the network is trained on the detail layer rather than in the image domain. This is discussed further in the 'Proposed method' and 'Results' chapters. While [56] relies on CNNs, [57] proposes a different method using Gaussian mixture models (GMMs). Both approaches treat the problem as one of layer decomposition, i.e. the superimposition of two layers: the base layer (without rain) and the detail layer (with rain). [57] uses simple patch-based priors for both the background and rain layers. These priors are based on GMMs and can accommodate multiple orientations and scales of rain streaks. It is important to understand that removing rain from a single frame is much more difficult than removing it from video streams, since approaches that operate on video assume a static background and can treat weather as removable temporal information.

Chapter 6

Proposed method

The primary objective of this research is to develop an alternative to solely GPS-dependent localization for autonomous or semi-autonomous vehicles. The alternative approach may be used to assist the GPS output or to replace it. The motivation comes from the fact that GPS data is not always available and is not always precise enough for accurate localization. The eye acts as the best perception sensor of the environment for human beings.
From just a quick glance, the human eye is capable of understanding its surroundings in real time, including depth information. This accurate perception helps a driver control the vehicle in the desired way. The most valuable part of this perception is depth, which is one of the primary factors contributing to precise localization. It follows that vision sensors can be equally powerful if used the right way. Although scientists are still far from understanding the detailed working mechanism of the human brain, including how it processes visual data, recent advances in machine learning have been very successful in mimicking some aspects of the human visual system. Building on this progress, it is believed that convolutional neural networks have the ability to identify locations accurately if trained on visual data for each of those locations.

Multiple datasets were created for this project. Although each of these datasets was created in a unique way, the fundamental approach was the same for all of them. First, a set of classes was formed, each class being a unique pair of GPS coordinates along with images from different viewpoints. These datasets were then classified to observe whether each location could be identified based on visual stimulus alone. Other experiments were conducted on these datasets as well, such as testing weather-invariant methods.

6.1 Google Street View

Google Street View is a technology that provides panoramic views of the world, primarily from roads. Using a very efficient image stitching algorithm, it displays panoramas stitched together from images taken at different viewpoints. Most of the images are captured using a car, but other modes of transportation are also used in some cases. The current version of Street View uses JavaScript extensively and also provides a JavaScript application programming interface integrated with Google Maps, which has been used extensively in this project. The images are not taken from one specific camera, as the camera version has changed over the years; the cameras are usually mounted on the roof of the car when recording data. The ground truth of the data is recorded using a GPS sensor, a wheel speed sensor and inertial navigation sensor data. The advantage of Street View data is the availability of images for a single unique location from different viewpoints, controlled by the field of view, heading and pitch.

Figure 6.1: Street View car [58].

6.1.1 Field of View

The field of view determines the horizontal field of view of the camera. It is expressed in degrees, with a maximum value of 120. The field of view essentially represents zoom, with its magnitude being inversely proportional to the level of zoom. The default value is 90.

6.1.2 Heading

The heading is the most useful feature provided by Google Street View. It gives the user the ability to look at a 360 degree view from a location. This helps a user, or a car, particularly when facing a different heading at the same location. The values used are North: 90, East: 180, South: 270, West: 360.

6.1.3 Pitch

The pitch specifies the up or down angle of the camera relative to the Street View vehicle. Positive values mean the camera is facing upward, with 90 meaning vertically up, while negative values mean the camera is facing downward, with -90 meaning vertically down. The default value is 0.

The different viewpoints help increase the robustness of the dataset. They also help solve a problem related to the navigation of the car: these viewpoints can be exploited to estimate the pose of the car at a certain location (x, y). Alternatively, if estimating pose is not required, they can be used to make the model richer for predicting just the location coordinates (x, y). For example, it is difficult to predict the direction the car is traveling in; the heading helps especially in this case. If the road is not particularly smooth, the pitch helps, while the field of view helps in handling zoom.
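As an illustration of how these viewpoint parameters map onto an image request, the sketch below queries the Street View Static API for one location and heading. The endpoint and parameter names reflect the publicly documented API; an API key is required, and the exact URL and quota rules should be checked against current documentation rather than taken from this sketch.

import requests

def fetch_street_view(lat, lng, heading, pitch=0, fov=90, size="256x256", key="YOUR_API_KEY"):
    """Download one Street View image for a (lat, lng) location and viewpoint."""
    url = "https://maps.googleapis.com/maps/api/streetview"
    params = {
        "location": f"{lat},{lng}",
        "heading": heading,   # direction of the camera, in degrees
        "pitch": pitch,       # camera tilt, -90 (down) to 90 (up)
        "fov": fov,           # horizontal field of view, up to 120
        "size": size,         # image resolution, e.g. 256x256
        "key": key,
    }
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.content   # raw JPEG bytes

Looping this call over a grid of coordinates, headings and pitches yields a dataset in the form described in the following sections.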
6.2 Google Maps API

The Google Maps API helps users in a variety of ways, from visualizing maps to accessing other useful features like directions and Street View. Its use for specific datasets is explained and shown under the individual datasets below.

6.3 Dataset1

The first dataset was created with the intention of covering the entire campus of the university. The four furthest coordinates of the campus were selected as the boundary coordinates, and all the GPS locations were collected from the virtual rectangle formed by connecting these boundary coordinates. This was a brute-force method in which the Street View data was collected within the area of this rectangle to form the dataset. It should be noted that this was the first attempt in this project to test the fundamental idea of identifying locations using vision data only. The latitudinal and longitudinal differences between adjacent locations were .0002 and .0001 respectively. Since Google Maps gives approximately a root mean square (RMS) precision of 10 cm for a six-decimal-place latitude/longitude, the latitudinal and longitudinal spacings for this dataset correspond to about 20 m and 10 m respectively. The Street View parameters used in creating this dataset are given below.

Table 6.1: Boundary coordinates of the rectangle.
Southwest tip: 43.080089, -77.685000
Southeast tip: 43.080089, -77.666080
Northwest tip: 43.087605, -77.685000
Northeast tip: 43.087605, -77.666080

Table 6.2: Dataset1 parameters.
Heading: 0, 20, 40, 60, ..., 360
Field of view: 90
Pitch: -10, 0, 10

Table 6.3: Dataset1 details.
Image resolution: 256×256×3
No. of images: 87k
No. of classes: 1530
No. of images/class: 54

Figure 6.2: Street View images from Dataset1.

6.4 Dataset2

While the first dataset was created in a brute-force fashion, the second dataset was created much more intelligently. The objective was to create a dataset in which the images of each class were less similar to each other and restricted to the viewpoint of the roads inside the campus. Since most Google Street View data is acquired from the roads, this eliminates images of places that are inaccessible to vehicles; as this research is focused on autonomous cars, this made sense. A zoomed-in view of some of the locations is shown in Figure 6.3. It can be seen in this figure that while the distance between two locations is relatively large on a straight road, it is smaller on a turn or a curvy road. This is because, on a straight road, nearby images look very similar, the primary difference being only a zooming effect, whereas on a curvy road the images look quite different at each location, so locations can be placed closer to each other. This helps in building a better model.
Figure 6.3: Zoomed-in portion of Dataset2: distance between locations based on viewpoint.

The Google Maps API was used extensively for creating this dataset. Twenty-six important landmarks were chosen such that they would cover almost the entire university campus. Using these landmarks, an application was created with the Google Maps API. This API uses HTML5 as its primary platform, which leverages HTML as a markup language and JavaScript for web applications. The primary objective of the application was to create paths between a user's start and end locations and to store all the coordinates along the path, in a similar fashion to Figure 6.3. The application had one more unique feature: the user could create a path that passes through user-selected waypoints before reaching the end location, as shown in Figure 6.4. The start and end locations in Figure 6.4 are the green marker 'A' and the red marker respectively, and all the other green markers are waypoints selected by the user. The blue line shows the route between the start and end locations while passing through the other waypoints; the individual route segments between each pair of waypoints are also shown on the right. Taking various permutations of these locations (start, end, waypoints), coordinates were collected from paths throughout the campus, and redundant coordinates were eliminated before creating Dataset2.

Figure 6.4: Routes created with the Google Maps API.

Finally, the dataset was created from these coordinates in a similar fashion to the first dataset. The coordinates/classes are shown in Figure 6.5.

Table 6.4: Dataset2 parameters.
Heading: 0, 5, 10, 15, ..., 360
Field of view: 90
Pitch: -10, 0, 10

Table 6.5: Dataset2 details.
Image resolution: 256×256×3
No. of images: 36k
No. of classes: 166
No. of images/class: 216

Figure 6.5: Locations/classes of Dataset2.

The rest of the datasets were created by driving a golf cart around the university campus while capturing images with a set of cameras and collecting the GPS ground truth using an open source library. A brief description of the golf cart, the cameras and the library used for collecting the ground truth is given below.

6.5 Autonomous golf cart

Rochester Institute of Technology has recently been doing excellent research in the field of autonomous driving through different programs. A number of students have been involved in senior design projects as well as graduate research, inventing and experimenting with technologies related to autonomous vehicles. A stable platform is of quintessential importance for testing these algorithms. Students have converted a traditional golf cart into an autonomous one by adding multiple sensors and a Central Processing Unit (CPU) running Ubuntu to handle the feedback from these sensors and run other operations. This only sheds light on the software of the golf cart; a lot of other complex electrical and mechanical work is also integral to its operation. Further details can be found at the Autonomous People Mover Phase 3 website - http://edge.rit.edu/edge/P16241/public/Home.

Figure 6.6: RIT's autonomous golf cart.
6.6 Camera

As part of this research, a pair of forward-facing Hikvision Bullet IP cameras was installed on the roof of the golf cart, pointing straight ahead, in order to collect data from the university campus (Figure 6.7). Since these cameras operate over the IP protocol, a static IP had to be registered to the cameras in order to process images from them (http://edge.rit.edu/edge/P15242/public/Code/GitRepository/Computer/CameraCmds). Due to the high frame rate and the encoding format, a few issues became bottlenecks while collecting data from the cameras: the system's data pipeline was getting jammed, unable to process the data in real time. Scripts were written to handle the data pipeline efficiently in the form of a queue. The queue is an important concept in embedded systems software development; at a high level, a queue ensures that the system does not start processing a piece of data in a pipeline until all the data before it has been successfully processed.

Table 6.6: Configuration of the Hikvision cameras.
Image resolution: 704×576×3
Frame rate: 25
Max. bitrate: 4096
Video encoding: H.264

Figure 6.7: (left) Hikvision Bullet IP camera (right) Cameras on the golf cart.

6.7 GPS ground truth

An open source library/tool [60] was used for collecting the ground truth for the datasets created by driving the golf cart around. It uses the OSX Core Location [59] framework to get the latitude and longitude of a specific location. Since Core Location uses cellular tower triangulation in addition to Wi-Fi positioning, the Wi-Fi hotspot of a cellular phone was used while recording the ground truth, to ensure a stable internet connection.

6.8 Dataset3

This was the first dataset created by collecting images from the university campus while driving the golf cart around. The images were collected in early August with the objective of recording summer imagery. A set of images was acquired for each location. For some locations the images were recorded twice, by driving the golf cart both towards and away from the location (in a round-trip fashion); for others, the images were taken with the cameras facing in only one direction (either towards or away from the location). The former case provided more robustness for those locations. The average distance between locations was (3.8m, 7.1m) in the (vertical, horizontal) directions respectively. For data augmentation, 10× crops were used, i.e. random crops taken out of the original images. The locations/classes of this dataset are shown in Figure 6.8 and some images in Figure 6.9.

Table 6.7: Dataset3 details.
Image resolution: 256×256×3
No. of images: 71200
No. of classes: 89
No. of images/class: Not constant
Average inter-class distance: (3.8m, 7.1m)

Figure 6.8: Locations from Dataset3.

Figure 6.9: Images from Dataset3.

6.9 Dataset4

Since different weather, climate and times of the year affect the environment, they also affect a model trained on images from these different environments. Hence, as part of this research, datasets were created at different times of the year and separate models were trained on them. Dataset4 was created around the end of October, during the fall season. The average distance between locations was (1.8m, 6.4m) in the (vertical, horizontal) directions respectively. For data augmentation, scaling and rotational transforms were used in addition to the 10× method.
While the scaling operation was performed using a perspective transform in the range of (-30, 30), in steps of 10 units for each zooming effect, the rotations covered the ranges (10, 50) and (310, 350) degrees, with each image rotated 10 degrees from the previous one. This added to the robustness of the dataset. The locations/classes of this dataset are shown in Figure 6.10 and some images in Figure 6.11.

Table 6.8: Dataset4 details.
Image resolution: 256×256×3
No. of images: 81360
No. of classes: 115
No. of images/class: Not constant
Average inter-class distance: (1.8m, 6.4m)

Figure 6.10: Locations from Dataset4.

Figure 6.11: Images from Dataset4.

6.10 Dataset5

Dataset5 was created around early January, during the winter season. The average distance between locations was (1.4m, 3.9m) in the (vertical, horizontal) directions respectively. For data augmentation, scaling and rotational transforms were used in addition to the 10× method, similar to Dataset4. To make the dataset more robust to changes, Gaussian blur was also added to these images, using a [5×5] kernel with blur parameters in the range of (20, 30), in steps of 2.

Table 6.9: Dataset5 details.
Image resolution: 256×256×3
No. of images: 94300
No. of classes: 115
No. of images/class: Not constant
Average inter-class distance: (1.4m, 3.9m)

Figure 6.12: Locations from Dataset5.

Figure 6.13: Images from Dataset5.

6.11 Datasets with smaller inter-class distance

Datasets 3-5 were recorded from a human-driven golf cart, and hence were subject to variable speeds in different parts of the campus. For this reason the average distance was used as an estimate of the inter-class separation, instead of a constant spacing between locations. The localization accuracy, i.e. the real-world precision achieved by models trained on these datasets, was high (~25cm). But since the average inter-class distance was much larger than this precision for Datasets 3-5, a further diagnosis was needed to evaluate the performance of similar models trained on data with a smaller inter-class distance, equal to the precision (~25cm). To test this hypothesis, Dataset6 (6a-6e) was created with consecutive locations 25 cm apart. A video stream of one second per location was collected at 35 frames/sec using a cellular phone. Two of these five datasets were created with the locations in a straight line, two with the locations at a small angle to each other, and the final one with locations both in a straight line and at an angle to each other at some points. As with the previous datasets, both the classification and localization accuracies were high, so two more datasets were created. The locations in both of them were in a straight line, but the inter-class separation and the viewpoints per class were different. Dataset7 had a 25 cm inter-class distance with a 360 degree viewpoint per class, taken from a ten-second video stream for each location; thus each location was represented by varied images. Dataset8 was collected in a similar fashion to Dataset6, but with only a 5 cm inter-class separation. For data augmentation, scaling (0, -20, 20), rotation (30, 50, 330, 350), Gaussian blur (kernel [5×5], parameters 25 and 30) and 10× crops were used for Datasets 6a-6e and Dataset8, while only 10× crops were used for augmenting Dataset7.
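A rough sketch of the rotation and blur augmentations described above is given below, using OpenCV. The perspective-transform scaling is omitted, and the blur parameter is interpreted here as the Gaussian sigma, which is an assumption; the exact parameter values used in the experiments are those listed in the text above.

import cv2

def augment(image, angle=None, blur_sigma=None):
    """Apply an optional rotation (degrees) and an optional 5x5 Gaussian blur."""
    out = image
    if angle is not None:
        h, w = out.shape[:2]
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)   # rotate about the center
        out = cv2.warpAffine(out, M, (w, h))
    if blur_sigma is not None:
        out = cv2.GaussianBlur(out, (5, 5), blur_sigma)
    return out

# Rotations in 10-degree steps within (10, 50) and (310, 350) degrees
angles = list(range(10, 51, 10)) + list(range(310, 351, 10))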
Table 6.10: Datasets with smaller inter-class distance - details.
                                      Datasets 6a-6e   Dataset7   Dataset8
No. of images                         ~99k             ~94k       ~99k
No. of classes                        27               27         27
Images/class (before augmentation)    35               350        35
RMS inter-class distance              25cm             25cm       5cm

6.12 Classifier

An architecture inspired by the original AlexNet, the winner of the ImageNet ILSVRC challenge, was used to classify the datasets discussed above. The original architecture performed very well in the ImageNet challenge with 1000 classes, and it delivered equally promising results in this research. The original architecture, and the modifications to it used in these experiments, are discussed below.

Although many modifications of the original architecture have since been produced, the initial AlexNet consisted of eight layers: five CONV layers followed by three FC layers. The last FC layer is connected to a softmax function (the log probability of the prediction), which produces a distribution over the classes. Some of the CONV layers were followed by pooling layers. While the pooling layers reduce the computation after each layer by decreasing the spatial size of the representation, they also enable the extraction of an abstract hierarchy of filters with increasing receptive field. In addition to producing more robust features for classification, there is some evidence that these higher-level features help prevent overfitting. Overfitting is a common problem in machine learning, where the trained model has difficulty predicting unforeseen data from the test set because it has a huge number of parameters but not enough samples to learn from, often due to the complexity of the dataset. Since the pooling layer reduces the number of parameters, it often helps reduce overfitting. AlexNet uses ReLU as its non-linear mapping, i.e. its activation function; CNNs usually train faster with ReLU activations. One more interesting concept was used in this architecture to reduce overfitting: dropout. Dropout randomly removes/deactivates neurons with a certain probability during training, which prevents the network from relying too heavily on any individual neuron. It is worth mentioning that many researchers believe this was one of the important factors behind the architecture's success. Dropout was used in the first two FC layers.

Figure 6.14: The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440-186,624-64,896-64,896-43,264-4096-4096-1000 [39].

The modified architecture used in this research is from [63]. The primary difference from the original architecture is the addition of batch normalization layers after all the CONV and FC layers except the last one. The input data is typically divided into batches before being sent through the pipeline, which helps with parallel processing; however, if there is a large variance between these batches, the model might not train as expected. Batch normalization is used to neutralize this internal covariate shift. The code base for training the models was the popular imagenet-multiGPU code developed by Soumith Chintala for the Torch framework [64]. The hyperparameter values used for training on the datasets are given below; these values tended to produce good results in most cases.

Table 6.11: Training hyperparameters.
Batch size   Epochs   Learning rate   Momentum   Weight decay
256          1-8      0.001           0.9        0.0005
256          9-15     0.0005          0.9        0.0005
256          16-25    0.0001          0.9        0
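The schedule in Table 6.11 can be written as a simple epoch-indexed lookup. The actual training used the Torch imagenet-multiGPU code base [64], so the Python sketch below is only illustrative of the values, not of the training code itself.

def training_hyperparameters(epoch):
    """Learning-rate/weight-decay schedule of Table 6.11 (batch size 256, momentum 0.9)."""
    if epoch <= 8:
        return {"learning_rate": 1e-3, "weight_decay": 5e-4}
    if epoch <= 15:
        return {"learning_rate": 5e-4, "weight_decay": 5e-4}
    return {"learning_rate": 1e-4, "weight_decay": 0.0}   # epochs 16-25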
6.13 Time of day invariant

Using a gamma correction method inspired by [55], images were augmented to different illumination conditions corresponding to different times of the day. For example, going from dusk into the evening the brightness is usually low, whereas the opposite holds in the early morning. The block diagram of the process is shown below.

Figure 6.15: Block diagram of the algorithm.

The Gamma matrix is the most important part of this block diagram. The user does not have direct control over the input image, but she does have the ability to modify the Gamma matrix to obtain the desired reflectance/perceived sensation R(x,y). Comparing Figure 6.15 and equation 6.1 below, it can be observed that the mean/median of the histogram of the image I(x,y), or a constant factor in the range 0-255, plays the role of the term L(x,y). In Figure 6.16 below, the original image is on the left, while the images in the middle and on the right are gamma-corrected versions.

$I(x, y) \cdot \frac{1}{L(x, y)} = R(x, y)$ [55] (6.1)

Figure 6.16: (left) Original image (middle) Brighter - afternoon (right) Darker - evening.

6.14 Weather invariant

It is difficult to acquire separate datasets for different weather conditions and to keep models trained on all of them in memory for real-time use. Alternatively, if the effects of harsh weather could be removed from the images, a single model trained on normal images could be leveraged for the same purpose. Inspired by this idea, the approaches in [56] and [57] (previously discussed in the Background/related work chapter) were applied in this research to remove the effect of heavy rain from images. Since it is difficult to acquire data during heavy rain, synthetic rain was added to images using Adobe Photoshop [61] (Figure 6.17, middle). It is also possible to add snowfall to images using a similar technique [62] (picture not shown here). However, since snow significantly changes the texture of the environment and the ground itself, adding only snowfall to images would not produce realistic scenarios; this is a well-known problem that leads to a different research domain, namely how to generate realistic synthetic winter images. Consequently, experiments were conducted only for rainy conditions as part of this research. To test the ability of a model trained on normal images to predict rain-removed images, random classes were chosen from the validation set of the Google Street View dataset, rain was added, and then removed by the methods mentioned above. Out of the 166 classes in the Google Street View dataset, rain was added and removed on 84 randomly chosen classes. A comparison of the two rain-removal approaches from [56] and [57] is shown in the figures below.

Figure 6.17: (left) Original image, (middle left) Synthetically added rain, (middle right) Rain removed by the approach from [56], (right) Rain removed by the approach from [57].
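Returning to the time-of-day augmentation of Section 6.13: a plain pixel-wise gamma correction, rather than the full local-contrast model of [55], is sketched below to show how the brighter and darker variants of Figure 6.16 can be produced. The gamma values are purely illustrative.

import numpy as np

def relight(image, gamma):
    """Simulate a different time of day by gamma-correcting an 8-bit image.

    gamma < 1 brightens the image (e.g. afternoon), gamma > 1 darkens it (e.g. evening).
    """
    normalized = image.astype(np.float32) / 255.0
    corrected = np.power(normalized, gamma)
    return (corrected * 255.0).astype(np.uint8)

# brighter = relight(img, 0.6);  darker = relight(img, 1.8)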
6.15 Hierarchical

If the accuracy of a model is not very high, a hierarchical approach can be taken to improve the results. The objectives of a hierarchical approach in this research are to attack multiple problems at once:
1) Improve the effective precision of GPS sensors
2) Optimize testing time
3) Improve localization accuracy

Every GPS sensor has a precision which is usually known beforehand from various tests. This approach proposes to improve on that precision by dividing the dataset into smaller parts (circles) containing only a handful of coordinates. The radius of each part can either be the precision of the sensor or some small distance. If the radius is the former, an arbitrary number of points inside the circle can be considered, based on the precision of the sensor; if the radius is the latter, the points reported by the sensor within that distance are considered. The scenario is explained with the help of the Venn diagram in Figure 6.18.

Figure 6.18: Venn diagram for the hierarchical approach, where Pn is a (latitude, longitude) pair and R is the GPS precision/smaller region.

According to Esa SiJainti, a Finnish Google Maps developer, the precision from GPS latitude/longitude and the Google Maps API is, respectively:
a) 5-6 decimal places - sufficient in most cases
b) 4 decimal places - appropriate for detailed maps
c) 3 decimal places - good enough for centering on cities
d) 2 decimal places - appropriate for countries

Precision related to physical distance from Google Maps:
6 decimal places - 10 cm
5 decimal places - 1 m
4 decimal places - 11 m

Realistically, the GPS sensors available for everyday use are much worse. Experiments are currently being conducted to verify these claims and to test whether the precision can indeed be improved.

Chapter 7

Results

Table 7.1: Results.
Dataset    No. of images   No. of classes   Images/class   Classification accuracy   Localization accuracy
Dataset1   ~87k            1530             54             36.8%                     N/A
Dataset2   ~36k            166              216            75.37%                    N/A
Dataset3   ~71k            89               Not constant   97.5%                     (20cm, 22cm)
Dataset4   ~81k            115              Not constant   98.5%                     (2cm, 7cm)
Dataset5   ~94k            115              Not constant   98.7%                     (10cm, 40cm)

The classification accuracies shown in the table above are for the validation sets. Dataset1 did not produce a very high validation accuracy (~36.8%), most likely for two reasons. The first is a property of deep neural networks: they need a large number of samples per class for good prediction, which Dataset1 did not have. The primary reason behind the poor performance of the classifier, however, was the nature of the images in Dataset1, many of which were very similar: locations in close proximity often lacked distinctive features, such as areas of empty grassland. The prediction results for Dataset2 were much better. As explained in the earlier chapter, this dataset was created much more intelligently, making each class somewhat distinct from the others with the help of the Google Maps API; the number of images per class was also 4× that of the previous dataset. As a result, the best validation accuracy at the lowest loss was ~75.37% after training for 25 epochs. The loss versus validation accuracy over 25 epochs is shown in Figure 7.1.

Figure 7.1: Validation loss vs number of epochs for Dataset2.

The deep learning classifier produced accuracies higher than 95% for Datasets 3 through 5. A classifier can report high accuracy while overfitting, so different experiments were conducted to make sure that the high accuracy reported by the classifier was legitimate. In Figure 7.2, location predictions from the trained model are shown for a total of 714 samples from the test set of Dataset3. While the red markers represent the ground truth, the green markers show the locations where the model's predictions did not match the ground truth.
Chapter 7

Results

Table 7.1: Results.

Dataset     No. of images   No. of classes   Images/class   Classification accuracy   Localization accuracy
Dataset1    ∼87k            1530             54             36.8%                     N/A
Dataset2    ∼36k            166              216            75.37%                    N/A
Dataset3    ∼71k            89               Not constant   97.5%                     (20cm, 22cm)
Dataset4    ∼81k            115              Not constant   98.5%                     (2cm, 7cm)
Dataset5    ∼94k            115              Not constant   98.7%                     (10cm, 40cm)

In the table above, the classification accuracies shown are for the validation sets. Dataset1 did not produce a very high validation accuracy (∼36.8%), most likely for two reasons. The first is a property of deep neural networks: they need a large number of samples per class for good prediction, which was not available in Dataset1. However, the primary reason behind the poor performance of the classifier was the nature of the images in Dataset1, which in many cases were very similar, lying in close proximity to each other but lacking distinctive features, such as areas of empty grassland.

The prediction results on Dataset2 were much better than on the previous dataset. As explained in the earlier chapter, this dataset was created much more intelligently, making each class somewhat distinct from the others with the help of the Google Maps API. In addition, the number of images per class was 4× that of the previous dataset. As a result, the best validation accuracy at the lowest loss was ∼75.37% after training for 25 epochs. The loss versus validation accuracy over 25 epochs is shown in Figure 7.1.

Figure 7.1: Validation loss vs number of epochs in Dataset2.

The deep learning classifier produced higher than 95% accuracy for Datasets 3 through 5. Since a classifier can report high accuracy while overfitting, different experiments were conducted to make sure that the high accuracy reported by the classifier was legitimate. In Figure 7.2, location predictions from the trained model are shown for a total of 714 samples from the test set of Dataset3. The green markers show the locations where the model's predictions did not match the ground truth, while the red markers represent both the ground truth and the correct predictions, since the two coincide. After averaging the differences between all the incorrect predictions and their ground truth, it was found that the average differences for latitude and longitude were 0.0000020 and 0.0000022 degrees respectively for Dataset3. Considering that 0.0001 degrees represents approximately 10m in Google Maps, the average error whenever the model predicted incorrectly on Dataset3 was about 20cm vertically (latitude) and about 22cm horizontally (longitude). The localization accuracies for Datasets 4-5 were computed in the same way.

It should be noted that the images collected for each class from Dataset3 onward did not have the same amount of viewpoint variation as Google Street View. This lack of variation would not affect the navigation of an autonomous vehicle much, as it typically follows the same route, with minimal variation, as the one driven when the dataset was created. The validation accuracy was 97.5% when classifying Dataset3, with an average real-world distance of (20cm, 22cm) between the incorrect predictions and the ground truth, when each class was located at an average distance of (3.8m, 7.1m) in the vertical and horizontal directions respectively. Hence, ideally, the models should be trained on the Street View images so that new images can be predicted using that more robust model. This is discussed further in the next chapter.

Figure 7.2: Correct vs incorrect predictions - Dataset3.
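The conversion from the average degree differences to the centimetre figures quoted above can be reproduced with the same rule of thumb (0.0001 degrees is roughly 10 m on Google Maps). The short snippet below is only a sanity check of that arithmetic; it ignores the fact that the metric length of a degree of longitude shrinks with latitude.

```python
# Rule of thumb used above: 0.0001 degrees ~ 10 m on Google Maps,
# i.e. one degree ~ 100,000 m. This is approximate.
METRES_PER_DEGREE = 10.0 / 0.0001   # = 100,000 m per degree

avg_lat_err_deg = 0.0000020   # average latitude error on Dataset3
avg_lon_err_deg = 0.0000022   # average longitude error on Dataset3

lat_err_cm = avg_lat_err_deg * METRES_PER_DEGREE * 100
lon_err_cm = avg_lon_err_deg * METRES_PER_DEGREE * 100
print(f"latitude error  ~ {lat_err_cm:.0f} cm")   # ~20 cm
print(f"longitude error ~ {lon_err_cm:.0f} cm")   # ~22 cm
```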
Datasets 3 to 5 were created during three different seasons: summer, fall and winter respectively. Although the time of year was different and hence the images varied significantly, many of the locations/ground truth points were the same. The results of cross-testing with the three models are shown in Table 7.2, which reports the performance of each model on all three test sets. Although the models performed quite well when the test set came from the same dataset, they did not perform well on the other test sets. After careful investigation, it was found that this was primarily because the precision of the ground truth recorded by the GPS sensor changed between the collection of each dataset. However, when a model was built on the three datasets combined, it improved significantly at predicting the individual datasets from the three seasons. The results are shown in Table 7.3.

Figure 7.3: Validation accuracy vs number of epochs in Dataset3.

Table 7.2: Cross-testing with the three models.

                      Dataset3 (summer)   Dataset4 (fall)   Dataset5 (winter)
Dataset3 (summer)     97.5%               16.3%             21.8%
Dataset4 (fall)       10.5%               98.5%             14.9%
Dataset5 (winter)     31.3%               42.1%             98.7%

Table 7.3: Predictions from the model trained on the three datasets combined.

Test set: datasets combined   92.1%
Dataset3                      98.7%
Dataset4                      82%
Dataset5                      35.7%

As evident from Table 7.4, both the classification accuracy and the localization accuracy were high in the case of Dataset6 and Dataset8. As with Datasets 3-5, the localization accuracy was calculated as the average difference between the incorrect predictions and the ground truth values for the same locations.

Table 7.4: Results - datasets with smaller inter-class distance.

Dataset            No. of images   No. of classes   Images/class (before augmentation)   RMS inter-class distance   Classification accuracy   Localization accuracy
Dataset6 (6a-6e)   ∼99k            27               35                                   25cm                       99.3%                     30cm
Dataset7           ∼94k            27               350                                  25cm                       89.1%                     27cm
Dataset8           ∼99k            27               35                                   5cm                        96.4%                     27cm

It was concluded from these results that CNNs are good at differentiating between classes even if the difference in depth between the locations is small. On the other hand, the classifier did not perform as well on Dataset7. This was because of the number of different viewpoints included in the same class, which made the images within each class highly varied.

Table 7.5: Weather invariant results.

Original accuracy              75.37%
Rain added to validation set   55.7%
Rain removed by [56]           41.2%
Rain removed by [57]           75.7%

Rain was synthetically added to and removed from Dataset2. Table 7.5 shows that heavy rain alters the dataset significantly, enough to confuse the model and bring the accuracy down by almost 20%. Surprisingly, the approach from [56] did not work well and actually lowered the validation accuracy further. However, [57] performed well and produced a final validation accuracy almost the same as the original, showing that it was able to remove the effects of heavy rain, or at least negate its effect on the classifier.
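The protocol behind Table 7.5 can be summarised by the sketch below. It assumes a generic classify(image) wrapper around the trained Dataset2 model, and add_synthetic_rain and remove_rain are placeholders standing in for the Photoshop rain overlay [61] and the de-raining methods of [56] and [57]; none of these components are reimplemented here.

```python
def evaluate(classify, samples):
    """Top-1 accuracy of `classify` over (image, true_class) pairs."""
    correct = sum(1 for img, label in samples if classify(img) == label)
    return correct / len(samples)

def run_weather_experiment(classify, val_samples, add_synthetic_rain, remove_rain):
    """Reproduce the three conditions reported in Table 7.5:
    clean validation images, the same images with synthetic rain,
    and the rained images after a de-raining step."""
    rained   = [(add_synthetic_rain(img), y) for img, y in val_samples]
    derained = [(remove_rain(img), y)        for img, y in rained]
    return {
        "original":     evaluate(classify, val_samples),
        "rain added":   evaluate(classify, rained),
        "rain removed": evaluate(classify, derained),
    }
```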
Chapter 8

Future Work

The classification accuracy on the Street View datasets was not as high as on the other datasets. The primary reason is the inclusion of several viewpoints for each location in the Street View dataset, which makes the dataset robust to changes but also makes it harder for the classifier to predict locations from new images. There are a few different ways to improve the accuracy:

1) Use a more powerful classifier than AlexNet, such as GoogLeNet or ResNet. These classifiers have proved to be more powerful in recent years due to their learning techniques.

2) Accuracy can be further increased using a hierarchical approach, as discussed in the 'Proposed Method' chapter. The larger region under consideration would essentially be divided into groups of smaller regions with fewer locations, and separate models would be trained on images from these smaller regions.

3) One of the hyperparameters in Street View is the heading, which essentially represents the pose. Since the primary objective of this research was to determine the 2D coordinates of the location, all the poses for a given location were included in the same class, which made the classifier's job harder. If the pose is instead taken into account in future work, two birds can be killed with one stone: (a) the pose is estimated along with the location, and (b) each class, representing (x, y, pose), will contain more distinctive and consistent images, making it easier for the classifier to predict, albeit with more classes to predict.

Since it is difficult to collect images for large regions manually, the ideal way to conduct the experiments discussed in the earlier sections would be to train the models on datasets created from Google Street View and use them to predict test images while actually driving a car. It should be noted that this approach would need camera configurations more or less similar to the one used for acquiring the Google Street View images. Furthermore, for evaluating the performance (test runs), the ground truth of the test images should be the same as that of Street View, i.e., the precision and other configurations of the two GPS sensors should be the same.

Figure 8.1: Ground truth of Google Street View vs predicted class from cross-testing.

In Figure 8.1, the same images as in Figure 7.2 have been predicted by the model trained on the Google Street View data corresponding to Dataset3. The red markers are locations where the model predicted correctly, whereas the green markers are the incorrect ones. As evident from the figure, the results were not as good as in Figure 7.2, primarily because of the reasons discussed above. Provided similar GPS sensors and camera configurations, this method would work well.

The same location on a map can appear different at different times of the day due to factors like traffic, people, etc. Adapting to the environment by neglecting dynamic obstacles during testing has been a hot topic of research for many years, as evident in [25] and [26]. Combining those methods with the method proposed in this research would yield more robust results. Similar to dynamic obstacles, systems need to adapt to different illumination conditions at different times of the day and to different weather in order to perform efficiently. Although a few attempts have been made in this thesis, such as adjusting to different light conditions by gamma correction and applying rain removal techniques, a lot of research is still left to be done in this field. For example, a solution good enough to adapt to snowy conditions still does not exist, unless new datasets are created from those conditions.

Localization can assist robots in a number of ways and is the primary module of the navigation system in the world of robotics. As a conclusion to this thesis, an innovative approach to motion estimation using visual odometry, aided by localization, is shown. This approach was accepted at the EI2017 "Autonomous Vehicles and Machines" conference and was received well by the reviewers. It is a combination of two master's theses. The motion estimation algorithm using visual odometry was designed by Vishwas Venkatachalapathy as part of his master's thesis in the Computer Engineering department of Rochester Institute of Technology. Motion estimation using visual odometry is a concept by which the displacement of a vehicle is tracked through optical flow between consecutive images. The details of the exact approach are beyond the scope of this document. Interested readers are encouraged to contact the Computer Engineering department of Rochester Institute of Technology to read his thesis manuscript, or to read the paper "Motion Estimation Using Visual Odometry and Deep Learning Localization". As part of his thesis, a stereo dataset was collected on the university campus in the same way Dataset3 was created in this research. Although it is a highly efficient method, visual odometry tends to accumulate a small error every few images, resulting in drift away from the ground truth over time. With the assistance of a highly accurate localization module, it is possible to keep the vehicle on track. A block diagram of the complete process is shown in Figure 8.2. At a high level, the visual odometry module calculates the 2D coordinates and pose from a sequence of images and seeks the help of the localization module after every 'n' frames. The localization module keeps track of the ground truth, which the visual odometry can take advantage of in order to stay on track and help in the overall navigation of the vehicle. The advantage of this process is that it is entirely vision based. Since encoders/IMUs (for tracking displacement) are error prone and GPS data (for localization) is not always precise or available, a vision based approach can be an excellent alternative.

Figure 8.2: Motion estimation using visual odometry and deep learning localization.
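The interaction sketched in Figure 8.2 can be summarised as a simple control loop, shown below. This is a schematic sketch rather than the actual implementation: estimate_motion (frame-to-frame visual odometry) and localize (the CNN localization module returning a fix in world coordinates) are placeholders, and the blending step used to pull the drifting odometry estimate back towards the localization fix is an assumption made for illustration.

```python
import numpy as np

def fuse_pose(vo_pose, loc_pose, alpha=0.8):
    """Blend the drifting VO estimate towards the localization fix.
    alpha close to 1 trusts the localization module more."""
    return (1.0 - alpha) * vo_pose + alpha * loc_pose

def run(frames, estimate_motion, localize, n=10):
    """Track (x, y) with visual odometry, correcting drift every n frames
    using the deep-learning localization module."""
    pose = np.zeros(2)          # current (x, y) estimate
    prev = None
    for i, frame in enumerate(frames):
        if prev is not None:
            pose = pose + np.asarray(estimate_motion(prev, frame))  # VO step
        if i % n == 0:
            fix = localize(frame)        # CNN predicts a known location
            if fix is not None:
                pose = fuse_pose(pose, np.asarray(fix))
        prev = frame
        yield pose
```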
Chapter 9

Conclusion

Visual data can prove to be very rich if utilized in the right way for the right objective. In this research, it has been shown that visual data can aid, or even act as a substitute for, traditional GPS based localization for semi-autonomous and autonomous cars. A validation accuracy greater than 95% was achieved on the datasets created by driving the golf cart around the campus. On those datasets, the method achieved an average localization accuracy of (11cm, 23cm) with an average distance of (2.3m, 5.8m) between locations for (latitude, longitude) respectively. It also achieved a high localization accuracy of ∼27cm when the classes were much closer together, in the case of the datasets collected with a phone. Since the average RMS GPS precision is between 2-10m, the ∼25cm precision achieved by this method is considerably higher. It should be noted that this precision was obtained without the aid of any error correction system, which could improve it further to ∼2-10cm.

After performing experiments on different datasets with variable distances between locations, it was observed that CNNs are capable of giving both high classification and high localization accuracies if the variance of the images within a single class is small. On the contrary, if the data in a single class has high variance, it becomes more difficult for the classifier to predict with high confidence. However, the localization accuracy was promising throughout the experiments on most of the datasets. This observation can be further utilized to determine whether the pose should be included in the same class or should form separate classes.

Weather plays a very important role in any vision based application because of its unstable nature. In this thesis, a few state of the art methods from recent publications have been utilized to remove the harsh effects of weather from images. It was clearly evident from the results that further research needs to be conducted in this field to make systems entirely resistant to weather effects.

Smart cars are gradually becoming a part of our society, and a lot of research is currently under way to make visual localization methods more robust. The intention of this research was to provide a solution to an existing localization problem that many cars face in navigation. With the high accuracies obtained through the various experiments, it proved to be a viable solution.

Bibliography

[1] S. Thrun and J. J. Leonard, "Simultaneous Localization and Mapping," in Springer Handbook of Robotics, vol. 23, no. 7-8, 2008, pp. 871-889.
[2] NVIDIA Drive - Software Development Kit for Self Driving Cars, https://developer.nvidia.com/driveworks
[3] http://echibbs.blogspot.com/2016/06/summer-and-easy-reading-whychange.html
[4] http://www.motorauthority.com/news/1060101 six-cities-named-for-newvehicle-to-vehicle-v2v-communications-trials
[5] http://cs231n.stanford.edu
[6] D. Scaramuzza and F. Fraundorfer, "Visual Odometry [Tutorial]," IEEE Robotics & Automation Magazine, vol. 18, no. 4, pp. 80-92, 2011.
[7] M. Fiala and A. Ufkes, "Visual odometry using 3-dimensional video input," in Proceedings - 2011 Canadian Conference on Computer and Robot Vision, CRV 2011, 2011, pp. 86-93.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91-110, 2004.
[9] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary Robust Invariant Scalable Keypoints," in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2548-2555.
[10] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary Robust Independent Elementary Features," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2010, vol. 6314 LNCS, no. PART 4, pp. 778-792.
[11] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2564-2571.
[12] Z. Chen, O. Lam, A. Jacobson, and M. Milford, "Convolutional Neural Network-based Place Recognition," 2013.
[13] T. Lin, J. Hays, and C. Tech, "Learning Deep Representations for Ground-to-Aerial Geolocalization," pp. 5007-5015, 2015.
[14] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Bronte, "Fast and effective visual place recognition using binary codes and disparity information," 2014 IEEE/RSJ Int. Conf. Intell. Robot. Syst., pp. 3089-3094, 2014.
[15] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Gamez, "Bidirectional loop closure detection on panoramas for visual navigation," IEEE Intell. Veh. Symp. Proc., pp. 1378-1383, 2014.
[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe," Proc. ACM Int. Conf. Multimed. - MM 14, pp. 675-678, 2014.
[17] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A Matlab-like environment for machine learning," 2011.
[18] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, J. Shlens, B. Steiner, I. Sutskever, P. Tucker, V. Vanhoucke, V. Vasudevan, O. Vinyals, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," arXiv:1603.04467v2, p. 19, 2015.
[19] M. Cummins and P. Newman, "FAB-MAP: Appearance-Based Place Recognition and Mapping using a Learned Visual Vocabulary Model," Proc. 27th Int. Conf. Mach. Learn., pp. 3-10, 2010.
[20] M. J. Milford and G. F. Wyeth, "SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights," in Proceedings - IEEE International Conference on Robotics and Automation, 2012, pp. 1643-1649.
[21] Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., Fong, P., et al. (2007). Stanley: The robot that won the DARPA Grand Challenge. Springer Tracts in Advanced Robotics, 36, 1-43.
[22] Montemerlo, M., Becker, J., Bhat, S., Dahlkamp, H., Dolgov, D., Ettinger, S., Haehnel, D., et al. (2009). Junior: The Stanford entry in the Urban Challenge. Springer Tracts in Advanced Robotics (Vol. 56, pp. 91-123).
[23] Urmson, C., Anhalt, J., Bagnell, D., Baker, C., Bittner, R., Clark, M. N., Dolan, J., et al. (2009). Autonomous driving in urban environments: Boss and the Urban Challenge. Springer Tracts in Advanced Robotics (Vol. 56, pp. 1-59).
[24] Likhachev, M., Ferguson, D., Gordon, G., Stentz, A., & Thrun, S. (2005). Anytime Dynamic A*: An Anytime, Replanning Algorithm. 262-271. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Anytime+Dynamic+A+*:
[25] Levinson, J., Montemerlo, M., & Thrun, S. (2008). Map-Based Precision Vehicle Localization in Urban Environments. Robotics: Science and Systems III, 121-128. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.222&rep=rep1&type=pdf
[26] Levinson, J., & Thrun, S. (2010). Robust vehicle localization in urban environments using probabilistic maps. 2010 IEEE International Conference on Robotics and Automation (pp. 4372-4378). Retrieved from http://www.scopus.com/inward/record.url?eid=2-s2.0-77955779566&partnerID=tZOtx3y1
[27] Levinson, J., Askeland, J., Dolson, J., & Thrun, S. (2011). Traffic light mapping, localization, and state detection for autonomous vehicles. Proceedings - IEEE International Conference on Robotics and Automation (pp. 5784-5791).
[28] Teichman, A., Levinson, J., & Thrun, S. (2011). Towards 3D object recognition via classification of arbitrary object tracks. Proceedings - IEEE International Conference on Robotics and Automation (pp. 4034-4041).
[29] Teichman, A., & Thrun, S. (2012). Tracking-based semi-supervised learning. The International Journal of Robotics Research, 31(7), 804-818. Retrieved from http://ijr.sagepub.com/content/31/7/804.short
[30] Held, D., Levinson, J., & Thrun, S. (2012). A Probabilistic Framework for Car Detection in Images using Context and Scale. International Conference on Robotics and Automation, 1628-1634.
[31] Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2009). Object Detection with Discriminatively Trained Part Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1-20. Retrieved from http://cs.brown.edu/~pff/papers/lsvm-pami.pdf
[32] Held, D., Levinson, J., & Thrun, S. (2013). Precision tracking with sparse 3D and dense color 2D data. Proceedings - IEEE International Conference on Robotics and Automation (pp. 1138-1145).
[33] Held, D., Levinson, J., Thrun, S., & Savarese, S. (2014). Combining 3D Shape, Color, and Motion for Robust Anytime Tracking. Robotics: Science and Systems (RSS).
[34] Ivakhnenko, Alexey (1965). Cybernetic Predicting Devices. Kiev: Naukova Dumka.
[35] Ivakhnenko, Alexey (1971). "Polynomial theory of complex systems". IEEE Transactions on Systems, Man and Cybernetics (4): 364-378.
[36] Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position". Biol. Cybern. 36: 193-202. doi:10.1007/bf00344251. PMID 7370364.
[37] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten Digit Recognition with a Back-Propagation Network. Advances in Neural Information Processing Systems, 396-404. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.5076 and http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.5076&rep=rep1&type=pdf
[38] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2323.
[39] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (NIPS 2012), 1-9. Retrieved from https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[40] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2-9.
[41] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. ArXiv e-prints, 1-18. Retrieved from http://arxiv.org/abs/1207.0580
[42] Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. Proceedings of the 27th International Conference on Machine Learning, (3), 807-814.
[43] Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. Proceedings of the 30th International Conference on Machine Learning (p. 6). Retrieved from https://web.stanford.edu/~awni/papers/relu_hybrid_icml2013_final.pdf
[44] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, 1929-1958.
[45] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159. Retrieved from http://jmlr.org/papers/v12/duchi11a.html
[46] Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
[47] Kingma, D., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 1-13. Retrieved from http://arxiv.org/abs/1412.6980
[48] Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Arxiv, 1-11. Retrieved from http://arxiv.org/abs/1502.03167
[49] Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8689, pp. 818-833). Springer Verlag.
[50] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., et al. (2015). Going deeper with convolutions. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 07-12-June-2015, pp. 1-9). IEEE Computer Society.
[51] https://arxiv.org/abs/1602.07261
[52] Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. ImageNet Challenge, 1-10. Retrieved from http://arxiv.org/abs/1409.1556
[53] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. Arxiv.Org, 7(3), 171-180. Retrieved from http://arxiv.org/pdf/1512.03385v1.pdf
[54] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., et al. (2010). Theano: a CPU and GPU Math Expression Compiler. Proceedings of the Python for Scientific Computing Conference (SciPy), 1-7. Retrieved from http://www-etud.iro.umontreal.ca/~wardefar/publications/theano_scipy2010.pdf
[55] Brajovic, V. (2004). Brightness perception, dynamic range and noise: a unifying model for adaptive image sensors. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2, pp. II-189 - II-196. doi:10.1109/CVPR.2004.1315163
[56] X. Fu, J. Huang, X. Ding, Y. Liao and J. Paisley, "Clearing the Skies: A deep network architecture for single-image rain removal," arXiv:1609.02087v1 [cs.CV], 7 Sep 2016.
[57] Y. Li, R. T. Tan, X. Guo, J. Lu and M. S. Brown, "Rain Streak Removal Using Layer Priors," CVPR 2016.
[58] https://en.wikipedia.org/wiki/Google_Street_View
[59] https://en.wikipedia.org/wiki/IOS_SDK#Core_Location
[60] https://github.com/robmathers/WhereAmI/blob/master/README.md
[61] http://www.photoshopessentials.com/photo-effects/rain/
[62] http://www.photoshopessentials.com/photo-effects/photoshop-snow/
[63] http://arxiv.org/abs/1404.5997
[64] S. Chintala, https://github.com/soumith/imagenet-multiGPU.torch, Copyright (c) 2016, Soumith Chintala, All rights reserved.
[65] M. Yu and U. D. Manu, Stanford Navigation - Android Phone Navigation System based on the SIFT image recognition algorithm.
[66] A. R. Zamir and M. Shah, "Accurate Image Localization Based on Google Maps Street View," in Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 255-268. Springer, Heidelberg (2010).