ISCF
Landing strategy of an aerial vehicle on moving targets using deep learning algorithms
Presented by: Giacomo ROSATO, Gabriele SORO, Reda TALEB
Presented to: Pedro Castillo Garcia, Armando Alatorre Sevilla

TABLE OF CONTENTS
1. Introduction
2. Aruco marker
3. Landing strategy
4. Target detection
   4.1. OpenCV
      4.1.1. Python Libraries
      4.1.2. Camera Calibration
      4.1.3. Marker detection with OpenCV
         4.1.3.1. Short overview about the Levenberg-Marquardt algorithm
      4.1.4. Results
         4.1.4.1. Results using the laptop camera
         4.1.4.2. Comparison results using the Optitrack system
   4.2. YOLO
   4.3. MASK R-CNN
      4.3.1. Types of Mask R-CNN
      4.3.2. Which one we chose
      4.3.3. Libraries used
      4.3.4. General explanation
      4.3.5. Code explanation in detail
         4.3.5.1. Mask R-CNN training
      4.3.6. Results and discussions
         4.3.6.1. Real-time implementation results
         4.3.6.2. Real-time implementation discussion
         4.3.6.3. Drone-captured video implementation results
         4.3.6.4. Drone-captured video implementation discussion
   4.4. Position Comparisons
5. Speed Target Estimation
   5.1. OpenCV
6. CONCLUSIONS
REFERENCES
ORIGINAL GANTT DIAGRAM
UPDATED GANTT DIAGRAM

1. Introduction
Aerial vehicles, such as drones and quadcopters, have become increasingly popular in recent years for a wide range of applications, including delivery, search and rescue, and inspection. One important challenge in the operation of aerial vehicles is the ability to land autonomously and accurately on a moving target. This is particularly useful in scenarios where the vehicle needs to land on a platform that is itself in motion, such as a boat or a moving vehicle. To achieve a successful landing on a moving target, the aerial vehicle must be able to accurately estimate the position, orientation, and velocity of the target. Deep learning algorithms can be used to address these challenges by providing a means to process visual data from cameras mounted on the vehicle and make real-time predictions about the target's motion. In order to obtain a dynamic landing trajectory, our goal can be divided into two main steps:
1) Finding a deep learning algorithm to estimate target information, such as position and orientation. The algorithm receives image data captured from a camera positioned at the base of the drone, pointing vertically towards the ground.
2) Implementing an observer algorithm to estimate the target's velocity. Having a target speed estimate improves the accuracy and reliability of the landing, since it allows the vehicle to anticipate the future position of the target and adjust its trajectory accordingly, especially when the target is moving quickly.
Once we find the dynamic trajectory, thanks to an already implemented tracking controller, we should be able to control the quadcopter and make it land correctly and safely.

2. Aruco marker
The choice of an appropriate marker on the moving target is an important aspect for better detectability and trackability by the vehicle's camera while landing is performed. These markers can have different shapes, such as QR codes, bar codes or circular patterns. The most widely used in this field are the ArUco markers [1]. We decided to use them for their robustness, accuracy and ease of use. They consist of square shapes with a unique black and white pattern that makes them easy to detect. The marker IDs are defined in a dictionary, either predefined in the ArUco module or manually created by the user.
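For illustration, a marker such as the one in Figure 1 can be generated with the OpenCV ArUco module in a few lines. This is only a minimal sketch: the dictionary name, marker side length and output file name are our own choices for the example, not requirements of the project.

```python
import cv2

# Load the predefined 6x6 dictionary (250 markers) and render marker ID 10,
# i.e. a marker like the one shown in Figure 1, as a 400x400 pixel image.
aruco_dict = cv2.aruco.Dictionary_get(cv2.aruco.DICT_6X6_250)
marker_img = cv2.aruco.drawMarker(aruco_dict, 10, 400)

# Save the marker so it can be printed and attached to the landing platform.
cv2.imwrite("aruco_6x6_id10.png", marker_img)
```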
One of the main advantages is that the ArUco libraries are implemented in C++ and usable in Python together with OpenCV. Their design is meant to provide a quick 3D position of the camera with respect to the marker. In addition, the detection algorithm was implemented with a Hamming code to guarantee high resistance to false detections. It consists of a set of check bits that are interleaved with the data bits in the marker. The check bits are chosen such that each data bit is a function of a subset of the check bits. When the marker is detected by the camera, the image of the marker is processed to extract the data bits and the check bits. The algorithm can then determine the locations of the errors and correct them (the number of correctable bits is given by the minimum distance). In the early experiments we used an ArUco marker with ID_10 on a 6x6 grid (Figure 1).

Fig. 1 6x6 ID_10 ArUco Marker

Later on, we considered how to improve the landing robustness while the quadcopter gets closer to the target, and we found an ingenious approach [3] that consists in using two different ArUco markers, one within the other. This feature is especially useful at the late stages of the landing, when the outer marker is too close to the camera and may not be entirely captured by the camera's frame. In this situation, the inner marker can provide additional information for the aerial vehicle to accurately position itself for landing. When selecting a particular ID for the outer ArUco marker, attention should be paid that the inner ArUco marker replaces a single black "pixel" in its centre. Therefore the outer ArUco marker ID selection is limited to the ones with a black block in the centre, while it is recommended to select an inner ArUco marker in such a way that the number of its black pixels dominates over the number of white ones. This guarantees a more reliable detection. Therefore, as markers, we used two 7x7 ArUco markers with outer marker ID_33 and inner marker ID_29, as shown in Figure 2.

Fig. 2 7x7 Outer marker ID_33, Inner marker ID_29

3. Landing strategy
The idea behind the use of ArUco markers for a dynamic landing strategy is the following [2]: once the coordinates and the distance from the marker are computed, the quadcopter first has to reach the centre of the marker to switch to "land" mode. In this mode, the drone must maintain the same position with respect to the target and smoothly descend to a certain small height H, at which the engines are turned off.

4. Target detection
For our purpose, there are many different approaches to detect the moving target in real time. We mainly considered three of them: OpenCV, YOLO and Mask R-CNN deep learning algorithms. From different studies, in particular [5], we found that each one has pros and cons:

OpenCV
● Pros:
- It is a widely used library for computer vision tasks, and has a large community of developers and users.
- It contains a wide range of pre-built algorithms and functions that can be used for real-time object detection.
- It is generally considered to be less computationally expensive than deep learning-based algorithms.
● Cons:
- OpenCV's object detection algorithms may not be as accurate as more recent deep learning-based algorithms, especially in case of marker obstruction.

YOLO
● Pros:
- It is a real-time object detection system that is able to achieve high detection accuracy while maintaining a fast processing speed.
- It uses a single convolutional neural network for end-to-end object detection, which makes it less computationally expensive than other approaches like region-based CNNs.
● Cons:
- Less accurate than some other object detection models, such as Faster R-CNN and Mask R-CNN, especially when it comes to detecting small objects.
- Not compatible with ArUco; it is difficult to obtain the position and orientation information of the markers.

MASK R-CNN
● Pros:
- It is a state-of-the-art object detection model that is able to achieve high detection accuracy while also generating instance segmentation masks.
- It is a two-stage detection model: its first step generates region proposals, which are then fed to a CNN. Having a region proposal stage makes the model considerably more accurate.
● Cons:
- Mask R-CNN is computationally expensive, especially when it comes to generating instance segmentation masks, so it might not be the best approach for real-time object detection in resource-constrained environments such as aerial vehicles.
- Mask R-CNN also requires a large amount of data and computational resources to train the model.

There were many other deep learning approaches, such as SSD and Faster R-CNN, but we took into account the ones that were best suited to our needs. Here are some comparisons from [5] that give an idea of the deep learning performances in terms of accuracy and speed:

Fig. 3 Speed vs. Accuracy for object detection methods
Fig. 4 Accuracy comparison for different sizes of target objects

Despite the outstanding performance of YOLO, we used Mask R-CNN, which is an extension of Faster R-CNN that adds an additional branch for predicting segmentation masks with even better accuracy.

4.1. OpenCV
In the following section we present the reasons, the methodologies and the approach of why and how we implemented the ArUco marker detection with OpenCV:
- OpenCV is a widely used and well-documented computer vision library, making it easy to find resources and guidance on using its various functions.
- The OpenCV Python library provides a range of functions specifically designed for detecting and tracking ArUco markers. These functions make it easy to implement ArUco marker detection and tracking in a Python script.
- The OpenCV Python library is highly efficient, allowing for real-time processing of images and videos. This is important when working with drones, as accurate and timely position estimation is crucial for safe operation.
- OpenCV is compatible with a wide range of programming languages, including Python, which makes it easy to integrate with other libraries and tools.
- OpenCV is optimised for fast and efficient image and video processing, which makes it well-suited for use with ArUco markers, which require fast and accurate detection and tracking. Despite Python being a high-level programming language, the OpenCV functions are optimised in the sense that they are written in C++ (a lower-level language), while Python works only as an easy-to-use wrapper for calling those C++ functions.

In summary, using the OpenCV Python library to implement ArUco marker detection and tracking provides a reliable, efficient, and customizable solution for accurately estimating the position of a moving target.

4.1.1.
Python Libraries For the implementation of the OpenCV approach we used many libraries. In Particular: - - - - - Opencv2- contrib which is an additional library that extends the functionality of the openCV library. It contains a collection of algorithms and utilities that are not included in the core openCV library as, for example, the sub-library used to generate, detect, and estimate the pose of the ArUco Marker. The Numpy library is used for scientific computing. It provides a range of functions and data structures for working with numerical data, including support for multi-dimensional arrays, matrices, and mathematical operations. The Time library in Python is a built-in library that provides functions for working with time and dates. It includes a range of functions for obtaining the current time, formatting time and dates, and performing time-related calculations. Matplotlib is a Python library that is used for data visualisation. It provides a wide range of functions and tools for creating plots, charts, and other types of visualisations. The Math library provides a range of mathematical functions and constants. It includes functions for performing basic mathematical operations, such as addition, subtraction, multiplication, and division, as well as more advanced functions, such as trigonometric and logarithmic functions. The xlsxwriter library provides a range of functions and tools for creating and writing to Microsoft Excel files. It allows developers to create spreadsheet files, add data to them, and apply formatting and styling to the data. 4.1.2. Camera Calibration An important step before approaching the code development of the recognition itself is to calibrate the camera that we want to use for the Marker recognition. In fact, When a camera is not calibrated, it can produce distorted or skewed images, which can impact the accuracy of image analysis and object detection algorithms. There are several benefits to calibrating a camera in openCV: 1) Improved accuracy: Calibrating a camera allows for more accurate measurements and estimates of object positions and sizes in the images produced by the camera. This is important for applications that rely on precise measurements, such as robotics or surveying. 2) Enhanced reliability: Calibrating a camera can improve the reliability of image and video processing algorithms by reducing the impact of distortion and alignment issues. This can lead to more consistent results and fewer errors. 8 3) Enhanced performance: Calibrating a camera can also improve the performance of image and video processing algorithms by reducing the amount of data that needs to be processed. This can lead to faster processing times and improved efficiency. Indeed, we can correct for distortion and alignment issues in the images produced by the camera, which can reduce the amount of data that needs to be processed. This can lead to faster processing times and improved efficiency. The camera calibration process involves estimating the intrinsic and extrinsic parameters of the camera. The intrinsic parameters describe the properties of the camera itself, such as the focal length, principal point, and distortion coefficients. The extrinsic parameters describe the position and orientation of the camera in relation to the world. In particular, the parameter that we need are: - camera matrix: A 3x3 matrix that represents the intrinsic parameters of the camera. It includes information such as the focal length and principal point of the camera. 
distortion coefficients: A vector of 4 or 5 parameters that represent the distortion caused by the camera's lenses. These coefficients can be used to correct for distortion in the image. rotation vectors: A list of 3x1 vectors that represent the rotational components of the extrinsic parameters of the camera. translation vectors: A list of 3x1 vectors that represent the translational components of the extrinsic parameters of the camera. You can use the camera matrix and distortion coefficients to correct for distortion in the image and project 3D points onto the image plane, and the rotation and translation vectors to compute the pose of the camera in relation to the calibration pattern. To calibrate a camera in openCV, we typically need to capture a series of images of a calibration pattern, such as a chessboard or a set of points. These images are used to estimate the intrinsic and extrinsic parameters of the camera. In our case, we decided to implement the calibration using a 9x6 chessboard. To compute this task we used three main functions built-in OpenCV functions: 1) findChessboardCorners that is used to detect the corners of a chessboard pattern in an image. It takes in an image as input and returns the coordinates of the corners of the chessboard pattern using a combination of different detection algorithms between Canny edge detection, Sobel edge detection, Laplacian edge detection and Pattern Matching. In particular, the parameter used are: - the image: the input image that should be a 2D array of pixels greyscale, - the patternSize: the number of rows and columns in the chessboard pattern His outputs are: - retval: A Boolean value indicating whether the chessboard pattern was found in the image. If set to True, the chessboard pattern was found, and the corners have been detected - corners: A list of points representing the corners of the chessboard pattern in the image. These points are returned as a NumPy array of (x, y) coordinates, with the top-left corner first, followed by the remaining corners in row-major order. 2) cornerSubPix that takes an image and a set of initial corner points as input, and uses optimization algorithms to refine the position of the corner points to more accurately reflect the location of the corners in the image. It takes as input parameter: - image: The input image in which the corner points are located - corners: A list of points representing the initial corner points of the chessboard pattern in the image. These points should be in the same order as the chessboard squares - winSize: The size of the search window used to refine the corner points. This is specified as a tuple of (width, height). 9 - zeroZone: The size of the "dead zone" in the middle of the search window. This is specified as a tuple of (width, height). - criteria: A tuple of termination criteria for the optimization algorithm, including the maximum number of iterations and the required accuracy. The cornerSubPix function is typically used in conjunction with the findChessboardCorners function. The output is a vector of the new optimised corner 3) The drawChessboardCorners function in openCV is a function that is used to visualise the corners of a chessboard pattern in an image. It takes in an image and a set of corner points as input, and overlays the corner points on the image to create a visual representation of the chessboard pattern. The parameter for this function are: - image: The input image on which the chessboard corners will be overlaid. 
- patternSize: The size of the chessboard pattern in the image - corners: A list of points representing the corners of the chessboard pattern in the image, results of the cornerSubPix function - patternWasFound: A boolean value indicating whether the chessboard pattern was found in the image. If set to True, the corners will be overlaid on the image. If set to False, no overlays will be drawn. When we run the Script, we visualise the frame and the chessboard with lines defining the chessboard itself. An example is reported below. Example of chessboard recognition We need to store multiple images in different orientations and positions in order for the algorithm to understand the camera distortion. Examples are reported below. Examples of different images used for calculating camera distortion coefficients After we took a good amount of images, we need another algorithm that computes all the distortion matrices. To achieve this task we use the cv2.calibrateCamera function. To compute the intrinsic and extrinsic parameters of the camera, cv2.calibrateCamera uses an optimization algorithm to minimise the error between the observed 2D points in the images and the projected 3D points on the image plane. The optimization process involves adjusting the camera matrix and distortion coefficients until the error is minimised. Once the optimization process is complete, cv2.calibrateCamera returns the optimised camera matrix and distortion coefficients, as well as the rotation and translation vectors that describe the pose of the camera in 10 relation to the calibration pattern. These matrices and vectors can then be used to correct for distortion and compute the pose of the camera in other images. In particular, the specific optimization algorithm used by cv2.calibrateCamera is the Levenberg-Marquardt algorithm, which is a popular choice for non-linear optimization problems. The Levenberg-Marquardt algorithm is an iterative method that involves linearizing the optimization problem at each iteration and solving for the optimal update using a combination of the gradient descent and the Gauss-Newton method. The Levenberg-Marquardt algorithm has several advantages, including fast convergence, good robustness, and the ability to handle large numbers of variables. It is generally considered to be a reliable and efficient optimization method for camera calibration and other non-linear optimization problems. 4.1.3. Marker detection with OpenCV First, we give a general idea of the approach used to detect the marker and its position. OpenCV provides a number of functions for detecting ArUco markers in images. The main function for detecting ArUco markers is cv2.aruco.detectMarkers, which takes an image and a dictionary of ArUco markers as inputs and returns a list of detected markers and their positions in the image. To use cv2.aruco.detectMarkers, you will first need to get a dictionary of ArUco markers using the cv2.aruco.Dictionary_get function. The dictionary specifies the properties of the ArUco markers, such as the number of bits, the number of markers, and the marker layout. Once you have a dictionary of ArUco markers, you can use the cv2.aruco.detectMarkers function to detect the markers in an image. This function returns a list of detected markers, along with their positions and orientations in the image. 
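Before detailing these functions, it is worth recapping the calibration step of section 4.1.2 in code form. The sketch below shows how findChessboardCorners, cornerSubPix, drawChessboardCorners and cv2.calibrateCamera fit together; the image folder, square size and output file names are placeholder assumptions for illustration, not the exact values used in our experiments.

```python
import glob
import cv2
import numpy as np

# 3D coordinates of the 9x6 inner chessboard corners, assuming a unit square size.
objp = np.zeros((9 * 6, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2)

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)
obj_points, img_points = [], []

for fname in glob.glob("calibration_images/*.png"):  # hypothetical folder
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, (9, 6))
    if found:
        # Refine the corner locations to sub-pixel accuracy.
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
        obj_points.append(objp)
        img_points.append(corners)
        # Optional visual check of the detected pattern.
        cv2.drawChessboardCorners(img, (9, 6), corners, found)

# Levenberg-Marquardt optimisation over all views returns the camera matrix,
# the distortion coefficients and one rotation/translation vector per image.
ret, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
np.savez("calibration.npz", camera_matrix=camera_matrix, dist_coeffs=dist_coeffs)
```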
We now describe the parameter and the output for those main functions: - - - cv2.aruco.Dictionary_get is a function in the OpenCV ArUco module that returns a predefined ArUco marker dictionary with a given name. ArUco marker dictionaries specify the properties of ArUco markers, such as the number of bits, the number of markers, and the marker layout. To use cv2.aruco.Dictionary_get, you need to provide a string argument specifying the name of the dictionary you want to retrieve. OpenCV includes a number of predefined dictionaries with names such as DICT_4X4_50, DICT_4X4_100, and DICT_6X6_250. the DetectorParameters_create function creates and returns a pointer to an instance of the DetectorParameters struct, which stores various parameters used in feature detection algorithms in OpenCV, such as the maximum number of features to detect, the minimum quality of image features to retain, and the size of the neighbourhood to consider when performing feature matching. The output of this function is used in the cv2.aruco.detectMarkers. cv2.aruco.detectMarkers is used to detect ArUco markers in the input image. The function takes as input the image, a dictionary of markers to search for, and a DetectorParameters object that specifies various parameters for the detection process. It returns the IDs of the detected markers and the corners of each marker in the image. These algorithms include the thresholding in which the input image is first thresholded to create a binary image, where pixels in the image are either black or white. This helps to separate the markers from the background. Contour detection in which the image is then processed to find the contours (boundaries) of connected regions of white pixels. Each contour is assumed to correspond to a marker in the image. Then there’s the marker identification phase where the contours are analysed to determine whether they correspond to valid ArUco markers. This is done by searching for the required number of black and white borders around the contour, and checking that the border pixels are arranged in the correct pattern. And finally the corner detection: Once a marker has been identified, the corners of the marker are detected using 11 - - the perspective transformation of the marker. This allows the position and orientation of the marker to be determined. Cv2.aruco.estimatePoseSingleMarkers is used to estimate the pose of each ArUco marker detected in the input image image. The function takes as input the corner points of the markers, the dimension in centimetres of the markers in the real world, and the camera matrix and distortion coefficients of the camera. It returns the rotation and translation vectors for each marker, which can be used to determine the 3D pose of the marker in the camera coordinate system. It uses a perspective-n-point (PnP) [Levenberg-Marquardt] algorithm. This function uses the iterative PnP algorithm implemented in the solvePnP() function to estimate the rotation and translation vectors of the marker. This algorithm uses an iterative approach to minimise the error between the observed 2D image points and the projected 3D object points. Cv2.drawFrameAxes is a function that draws 3D coordinate frame axes on an image or video frame. It takes in a 3D point and a rotation vector, and it draws the X, Y, and Z axes of the coordinate frame centred at the given point, with the X, Y, and Z axes oriented according to the given rotation vector. 
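Putting these functions together, the core of the detection and pose-estimation loop looks roughly like the following sketch. The dictionary, the printed marker size and the calibration file name are assumptions made for the example, not necessarily the exact values of our setup.

```python
import cv2
import numpy as np

# Calibration results from the previous step (file name is a placeholder).
calib = np.load("calibration.npz")
camera_matrix, dist_coeffs = calib["camera_matrix"], calib["dist_coeffs"]

aruco_dict = cv2.aruco.Dictionary_get(cv2.aruco.DICT_7X7_250)
params = cv2.aruco.DetectorParameters_create()
MARKER_SIZE_CM = 15.0  # hypothetical printed side length of the marker

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict, parameters=params)
    if ids is not None:
        rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
            corners, MARKER_SIZE_CM, camera_matrix, dist_coeffs)
        for rvec, tvec in zip(rvecs, tvecs):
            # tvec = [x, y, z] of the marker centre in camera coordinates (cm).
            cv2.drawFrameAxes(frame, camera_matrix, dist_coeffs, rvec, tvec, 5.0)
        cv2.aruco.drawDetectedMarkers(frame, corners, ids)
    cv2.imshow("ArUco detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```

The translation vector returned by estimatePoseSingleMarkers is the quantity used in the Results section below to compute the marker-camera distance.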
After the detection and the drawing phase we obtain an image such as the one shown here. The data shown on the marker are the distance between the marker and the camera, the rotation of the y-axis (green) with respect to the camera, and the x and y positions of the centre of the marker with respect to the centre of the camera.

4.1.3.1 Short overview about the Levenberg-Marquardt algorithm
The iterative PnP algorithm, also known as the Levenberg-Marquardt algorithm, is an optimization method that is used to determine the 3D pose of an object in a 2D image. It is implemented in the solvePnP() function in OpenCV. Here is a high-level overview of how the iterative PnP algorithm works: 1) Initialise the rotation and translation vectors with some initial guess. 2) Project the 3D object points onto the image plane using the current rotation and translation vectors to obtain the predicted 2D image points. 3) Calculate the error between the observed 2D image points and the predicted 2D image points. 4) Update the rotation and translation vectors to minimise the error. 5) Repeat steps 2-4 until the error is below a certain threshold or the maximum number of iterations is reached. The iterative PnP algorithm uses an iterative approach to minimise the error between the observed 2D image points and the projected 3D object points. At each iteration, it adjusts the rotation and translation vectors to reduce the error between the observed and predicted points. This process continues until the error is below a certain threshold or the maximum number of iterations is reached. One of the advantages of the iterative PnP algorithm is that it can handle a wide range of configurations and point correspondences, making it a versatile and robust method for estimating the pose of an object in an image.

4.1.4. Results
The main vector that we use for extrapolating the data that we need is the translation vector (t-vec) obtained at each iteration of the previous algorithm. The detection part resides inside a while loop, and the translation vector is only generated when a marker is found. Since in our case we want to handle two nested markers, we developed a function that, based on the ID of the detected ArUco marker, decides whether it is the bigger marker or the smaller one. Based on this, we have a different t-vec for each estimation, depending on the marker detected at the current iteration. We define the two vectors as

$t_{i,m} = [x_{i,m} \; y_{i,m} \; z_{i,m}]^T, \qquad t_{i,M} = [x_{i,M} \; y_{i,M} \; z_{i,M}]^T$

in which the index i refers to the i-th iteration, m refers to the smaller marker and M refers to the bigger marker. This approach allows us to determine, depending on the marker (bigger or smaller), the position of the centre of the marker in the image with respect to the centre of the camera, as well as the height of the camera with respect to the marker. Given those parameters we can compute the Euclidean distance between the marker and the camera, defined as

$d_i = \sqrt{x_{i,m}^2 + y_{i,m}^2 + z_{i,m}^2}$

Here we defined the equation for the smaller marker, but it works analogously for the bigger one. As discussed before, we tested the approach on two different markers. The first time we used a simple marker, and after that we decided to improve it by using two nested markers.

4.1.4.1 Results using the laptop camera
For the first tests we used our own laptop camera to compare the real data, such as the distance measured by hand, with the estimations of the OpenCV algorithm.
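Before presenting the results, the marker selection and distance computation just described (section 4.1.4) can be sketched as follows. The IDs follow Figure 2, and the rolling mean over the last N samples is the fallback for missed detections discussed below; this is an illustrative sketch rather than our exact implementation.

```python
from collections import deque
import numpy as np

OUTER_ID, INNER_ID = 33, 29          # IDs of the nested markers (Figure 2)
recent_distances = deque(maxlen=10)  # last N samples used for the rolling mean

def target_distance(ids, tvecs):
    """Return the Euclidean marker-camera distance, preferring the outer marker."""
    detected = ids.flatten().tolist() if ids is not None else []
    if OUTER_ID in detected:
        idx = detected.index(OUTER_ID)
    elif INNER_ID in detected:
        idx = detected.index(INNER_ID)
    else:
        # No detection: fall back to the mean of the last N valid samples.
        return float(np.mean(recent_distances)) if recent_distances else None
    x, y, z = tvecs[idx].flatten()
    d = float(np.sqrt(x**2 + y**2 + z**2))
    recent_distances.append(d)
    return d
```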
Note: every graph in this paragraph has the distance (cm) on the Y-axis and the time (seconds) on the X-axis. First we present some tests referring to the case in which we only have one marker:

Fixed Marker – Real distance of 98 cm
Fixed Marker – Real distance of 6.06 m

As can be seen, the algorithm works well for small distances, estimating the real value accurately and with a small standard deviation. We can also see that we do not lose frames, since we have 30 samples per second, which is the framerate of the camera. This changes as the distance between the marker and the camera increases. In that case we have much more noise, the mean of the prediction is further from the real value, and we start losing detections (we have fewer samples per second). We then tested the algorithm while approaching the camera with the marker. The main problem that we encountered is that when we approach the camera, the marker becomes too big for it to be detected. The implemented function in OpenCV, in fact, is not able to recognise the marker given only a portion of it; the whole marker needs to be inside the acquired image. In the image below on the left we show that the minimum distance that we get is about 23 cm, which is too large a distance to make the drone land safely. Implementing a smaller marker is not a good way to resolve this issue as it does not allow the marker to be detected at longer distances. These are the reasons why we chose to implement the nested marker. In the image on the right we show an example of approaching with the new marker.

Approaching the marker with the simple marker
Approaching the marker with the nested marker

As we can see, we now have a much smaller distance at which the drone could land safely. A thing to notice in the right image is that there is a distance of ~80 cm at which the algorithm starts switching from the detection of the bigger marker to the smaller one. Since the smaller marker is very small, its estimation at that point is not perfect. Since in this case we can still detect the bigger marker, which is more accurate at this distance, we decided to compute the mean of the last 5-10 acquired samples in order to have a more realistic target distance. Here below are the results while we vary the distance.

Varying the distance w/ last 10 samples mean

As a final improvement on this part of the detection we had to handle the case where there are no detections. The simplest idea to implement was that, when we do not have a detection, we compute the mean of the last 5-10 samples. This solution works well when the detection of the marker is not continuous, as it allows its position to keep being updated, giving weight to new detections while still considering the previous ones. We provide an example below. The green points are the cases when there is no detection.

Approaching the marker adding random obstructions

4.1.4.2 Comparison results using Optitrack system
In this section we explain how we compared our results with those obtained with the Optitrack system. First we need to define the parameters that we obtain when using this new sensor. Of course, the first data that we need is a video, recorded from the drone, that captures the moving target. The Optitrack system allows us to have multiple data for both the target and the drone.
The most useful data for us are:
- x = x position (Target)
- y = y position (Target)
- dx = velocity along the x-axis
- dy = velocity along the y-axis
- V = speed
- x_Drone = x position (Drone)
- y_Drone = y position (Drone)
- yaw_Drone = yaw angle (Drone)

We have the data stored in an Excel document in which the first column is the instant at which the sample was taken. To get the data that we needed from the camera, we ran our algorithm taking as input the video recorded by the camera of the drone. We stored the data in an Excel document that was then imported into MATLAB together with the Optitrack data. In this case, to better generalise the problem and better understand whether the positions obtained with the Optitrack system and with the camera are comparable, we decided to evaluate the x and y variation of the marker with respect to the camera by translating all the signals to the origin. The results obtained are shown below:

Comparison between Optitrack System and Camera System

As can be seen in the graphs, the sensor can accurately approximate the variation of the X's and Y's, in some cases producing the same data as the Optitrack system. In the red line it can be seen that in some cases the signal flattens; those are the cases in which the marker is not detected by the camera, while of course we still have the signal from the Optitrack system. To better understand how well the algorithm works, we plot below the error obtained for the X and Y variation. As can be seen above, the errors clearly increase when the marker is not detected. Overall, these results are interesting with regard to the ability of the algorithm to estimate the position of the target, but we can see that even when the marker is correctly detected, in some cases we have an error which is too high considering the relatively low distance between the marker and the camera (~2 m in this case). To have more confidence in what we obtained we should work with more data and improve some aspects such as the calibration coefficients of the camera. We provide below, for the sake of completeness, the distance provided by the Optitrack system.

4.2. YOLO
YOLO [6] is a convolutional neural network (CNN) architecture for object detection in images. The key idea behind YOLO is to perform object detection in a single pass of the network, rather than using a two-stage approach like region proposal followed by classification. This allows YOLO to process images in real time, making it suitable for use in applications such as video surveillance, self-driving cars, and robotics. In order to accomplish this, YOLO divides an input image into a grid of cells and each cell is responsible for predicting a set of bounding boxes, which are rectangles that tightly enclose the objects of interest. Each bounding box also has an associated class probability, which indicates the likelihood that the box contains an object of a particular class. For example, it can predict the class "car" with high probability and the class "person" with lower probability for a box that encloses a car. YOLO uses a single convolutional neural network architecture that takes an entire image as input, and outputs a set of bounding boxes and class probabilities for each cell in the grid. This makes it more efficient and faster than many other object detection algorithms that rely on multiple stages or networks. YOLO is primarily designed for object detection and classification, rather than for specifically extracting the position and orientation of an ArUco marker [7].
While it is possible to use YOLO to detect Aruco markers in an image, it may not be the most efficient or accurate way to extract the marker's position and orientation. Once YOLO has been trained to detect Aruco markers, it can output bounding boxes around detected markers in an image. However, the position and orientation information of the markers is not directly encoded in the bounding boxes themselves, so additional post-processing would be required to extract this information. Using YOLO to detect Aruco markers has some drawbacks that need to be considered: - - - - Training Data: To use YOLO to detect Aruco markers, a dataset of images with Aruco markers in various poses and backgrounds needs to be collected and used to train the model, which can be a time-consuming and resource-intensive task. Detection Efficiency: YOLO is primarily designed for general object detection and not for specific shapes such as markers, which means it may not be as efficient or accurate as other algorithms specifically designed for detecting Aruco markers such as the ArUco library. Processing Time: YOLO is a complex model and processing an image through it can take a significant amount of time, making it less suitable for real-time applications. Marker's information: YOLO's detection output is represented in terms of bounding boxes, which are used to enclose the objects of interest, this may not provide the most accurate or detailed information about the position and orientation of the marker, which may require more specific information like marker corner positions. Complexity: Using YOLO to detect Aruco markers requires some modification to the architecture of YOLO and post-processing step to extract the marker's position and orientation information which is not straight forward. Segmentation-based methods using a recurrent neural network (RNN) may be a better choice for detecting ArUco markers than using YOLO, as they are specifically designed for object segmentation and can provide more accurate and detailed information about the marker's position and shape. In a segmentation-based approach, the model would be trained to predict a binary mask indicating the pixels that belong to the marker, rather than predicting a bounding box around the marker, like in YOLO. The prediction of a binary mask can provide a more fine-grained representation of the marker's shape, and this is crucial for ArUco marker detection as the markers have specific geometric shapes. In the next section we will talk about a Mask R-CNN approach that implements segmentation. 18 4.3. MASK R-CNN Deep learning is a subfield of machine learning that is concerned with creating artificial neural networks that can learn from data and make predictions or decisions without being explicitly programmed to do so. Deep learning algorithms are based on the structure and function of the human brain, using layers of interconnected nodes (neurons) to analyze and understand data. These networks are trained using large amounts of data and powerful computing resources, allowing them to learn and improve over time. Deep learning has been used in a wide range of applications, including image and speech recognition, natural language processing, and computer vision. The Mask R-CNN (Regional Convolutional Neural Network) algorithm is a state-of-the-art deep learning model for object detection and instance segmentation. 
It is built on top of the Faster R-CNN architecture and extends it by adding a parallel branch for predicting an object mask in addition to the object bounding box. The Mask R-CNN algorithm has two main stages: the first stage is a Region Proposal Network (RPN) which proposes regions of interest (RoIs) that might contain objects. The second stage is the detection and instance segmentation stage, where the RoIs are fed into a fully convolutional network that generates class-specific object bounding boxes and object masks. It can be trained on different backbones, including ResNet, FPN, and others. The model we used is maskrcnn_resnet50_fpn, a Mask R-CNN built on the FPN and ResNet-50 architecture. This model is pre-trained on the COCO dataset, a large-scale object detection, segmentation, and captioning dataset containing 80 object classes and more than 330K images, which makes it a very powerful model for object detection tasks.

4.3.1. Types of Mask R-CNN:
There are several types of Mask R-CNN models, including:
● Mask R-CNN (original)
● FPN-Mask R-CNN
● RetinaNet-Mask R-CNN
● Cascade R-CNN
● Hybrid Task Cascade (HTC)
● Grid R-CNN
● Libra R-CNN
● PANet
● TensorMask
● YOLACT
Each of these models has a different architecture and is suited for different types of tasks and environments. Some are faster but less accurate, while others are more accurate but slower. The original Mask R-CNN model is a two-stage architecture, while FPN-Mask R-CNN and RetinaNet-Mask R-CNN are one-stage architectures. Cascade R-CNN and HTC are designed to improve the accuracy of the model by adding more stages to the network.

4.3.2. Which one we chose
The Mask R-CNN model that we used in our work is a two-stage object detection and instance segmentation model: a pre-trained Mask R-CNN model with a ResNet-50-FPN backbone. The two stages are: the Region Proposal Network (RPN), which generates object proposals, or regions of interest (RoIs) that may contain objects; and the detection and instance segmentation network, which classifies the RoIs and generates masks for each object. It is based on the ResNet-50 FPN (Feature Pyramid Network) backbone architecture, which is pre-trained on the COCO (Common Objects in Context) dataset.

4.3.3. Libraries used
We also used several libraries in our work. These include:
● Torchvision: This library is a part of the PyTorch ecosystem and contains popular pre-trained models and datasets for computer vision tasks such as object detection and image classification. We used this library to load a pre-trained instance segmentation model (Mask R-CNN with a ResNet-50-FPN backbone) for detecting the marker in the video.
● Numpy: This library is a fundamental package for scientific computing with Python. We used it to perform array operations, such as resizing the frames from the video, and mathematical operations such as calculating the centre of the bounding box and the area of the marker.
● Matplotlib: This library is a plotting library for the Python programming language. we used it to create various plots and visualizations of the results obtained from our algorithm such as Object's position over time, Distance using length, width, and area over time, x and y position over time. ● OpenCV: This library is a library of programming functions mainly aimed at real-time computer vision. we used it to read and write videos, to resize the frames from the video, and to display the frames with the predicted masks overlaid on them. ● Pickle: This library is used to store and retrieve data in a binary format. we used it to save the results obtained from our algorithm in a pickle file that can be loaded later and used to plot the results. All these libraries are widely used in computer vision and machine learning tasks, and they provide a lot of functionality and ease of use, which makes them ideal for our work. 21 4.3.4. General Explanation: In this part, we will be detailing the steps taken in our work to detect and track a marker in a video using instance segmentation. In the first step, we trained an instance segmentation model using the Mask R-CNN architecture and the ResNet-50-FPN backbone on a custom dataset of marker images. The goal was to train the model to accurately detect and segment the marker in new images. To achieve this, we used the pre-trained Mask R-CNN model from the torchvision library, and replaced the pre-trained head with a new one that is suitable for our custom dataset. We then loaded our custom dataset and used it to train the model for a number of iterations, adjusting the hyperparameters as needed to improve its performance. The choice to use the Mask R-CNN architecture and the ResNet-50-FPN backbone was based on their proven success in object detection and instance segmentation tasks, and on the availability of pre-trained models that can be fine-tuned on a custom dataset. This allowed me to save time and resources compared to training a model from scratch In the second step, we used the pre-trained Mask R-CNN model with a ResNet-50-FPN backbone, and finetuned the model on a dataset of images of the marker. The goal of this step was to train a model that could accurately detect the marker in the video and provide the bounding box coordinates of the marker. To fine-tune the model, we had to create a dataset of images of the marker. This dataset was created by manually annotating the images with bounding box coordinates using a tool such as CVAT. we also had to define the number of classes (in this case, 2: marker and background) and the number of training and validation images.After training the model, we saved it as a .torch file and loaded it into the script that would be used to detect the marker in the video. This allowed us to use the trained model to make predictions on new images and get the bounding box coordinates of the marker. This step was crucial because it allowed us to detect the marker in the video, which is a necessary step to track the marker's position over time. Without a trained model, it would not be possible to detect the marker in the video and therefore track its position. In the third step of our work, we chose to use the area of the bounding box in pixels as a metric for determining the distance of the marker from the camera. The reason for this choice is that the width and length of the bounding box can be affected by the tilt of the marker, resulting in inaccurate distance measurements. 
However, the area of the bounding box is not affected by the tilt of the marker and provides a more reliable measurement of the distance. To calculate the distance using the area of the bounding box, we first obtained the area of the bounding box by multiplying the width and length. We then compared this value to the known area of the marker at different distances from the camera. Using this information, we were able to create a relationship between the area of the bounding box and the distance of the marker from the camera. Finally, we used this relationship to calculate the distance of the marker in real-time as the video was being captured. We added this distance information to a list and plotted the results over time to visualise the motion of the marker. In the fourth point, we used the results obtained from the previous steps to plot the data and compare it with the real data obtained from the Optitrack system. This step helped us to visualise the performance of our algorithm and compare it with the real data, which helped us to evaluate the accuracy of our algorithm and identify any potential errors or areas of improvement. 22 4.3.5. Code explanation in details 4.3.5.1. Mask R-CNN Training ● Preparation of the data set Before training one crucial operation would be the creation of the dataset in order to train the model, this is very important for the next part, since the COCO Dataset isn’t enough, by retraining the model we are modifying the outer layers of our model in order to recognise the marker from the background, in this part we had 6 attempts to reach a good training, in the last attempt we used 205 pictures, each picture combined with it annotation. So we took 205 images of the marker from different angles and distances. These images were then annotated to create the masks. This was done by using a tool called CVAT, which allows you to draw bounding boxes around the marker and create a binary mask for each image. ● Creating the masks and reorganising the files In the first part of the work, a custom dataset was created to train a Mask R-CNN model. This was done by creating a set of images of a marker and their corresponding masks, which are binary images that indicate the pixels that belong to the marker. Once the images and masks were ready, they were split into a training set and a validation set. This is important to ensure that the model is not overfitting and is able to generalise well to new images. After splitting the data, the images and masks were transformed to a format that can be used by the model. This involved converting the images to a format that can be used by PyTorch, and normalising the pixel values. The next step was to create a custom dataset class that inherits from the PyTorch's Dataset class. This class was used to load the images and masks, and apply the necessary transformations. The custom dataset class also had to implement the __getitem__ and __len__ methods, which are used to retrieve a sample from the dataset and the length of the dataset, respectively. Finally, the custom dataset was used to train a Mask R-CNN model using transfer learning. This was done by loading a pre-trained model and fine-tuning it on the custom dataset. The model was trained for a number of epochs and its performance was monitored using the validation set. 23 ● Training the model This part of the code is focused on training a Mask R-CNN model using a pre-trained model from torchvision called maskrcnn_resnet50_fpn. This model is pre-trained on the COCO dataset. 
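A minimal sketch of this model setup, using the standard torchvision API and the checkpoint path quoted later in this section, is shown below; the following paragraphs walk through the same steps in prose. It is an illustration of the described approach rather than the exact training script.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Pre-trained Mask R-CNN with a ResNet-50-FPN backbone (COCO weights).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the box-prediction head: 2 classes (marker + background).
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

# Load the fine-tuned weights saved during training (path taken from the report).
model.load_state_dict(torch.load("correctmodels/10000.torch", map_location=device))
model.to(device)
model.eval()  # inference mode for the real-time tests
```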
The first step is to load the pre-trained model and set it to run on the available device, either the GPU or the CPU. Then, the model's head is replaced with a new one using the FastRCNNPredictor, which takes in the number of input features for the classifier and the number of classes in the dataset. This is done because the pre-trained model is trained on the COCO dataset, which has 80 classes, but in our case we only need 2 classes. After that, the model is loaded with the state dictionary from the previously trained model and moved to the device. Finally, the model is set to evaluation mode, so that it is in inference mode.

● Testing the model in real time using the computer's webcam
In this script we want to test the model in real time. To do that we use the webcam, and the goal is to extract both the pose (x and y) of the marker and its distance from the camera in real time. The trained model is loaded with pre-trained weights and its final layers are replaced with a new head called "FastRCNNPredictor" that can predict two classes. Then a state dict is loaded from the file "correctmodels/10000.torch", which contains the fine-tuned weights. The script captures frames from the camera, resizes them to 600x600 pixels and converts them to tensors that can be processed by the model. The frames are passed through the model in inference mode, and the model's output, which includes object masks, bounding boxes and scores, is used to draw the segmented objects on the video frames. A few variables are defined at the beginning of the script, including lists that are used to store the distance computed using length, width and area, and the x, y coordinates of the bounding boxes over time. First, the script defines the length and width of the marker in the real world in centimeters, as well as the focal length of the camera in pixels. It then uses the length of the bounding box returned by the Mask R-CNN model, in pixels, to calculate the distance from the camera with the formula (marker_length_in_centimeters * focal_length) / marker_length_in_pixels, appends the result to the distance_values_length list and displays the calculated distance on the video frame using the cv2.putText() function. Next, it calculates the distance from the camera using the marker's width, using the formula (marker_width_in_centimeters * focal_length) / marker_width_in_pixels; the result is added to the distance_values_width list and displayed on the video frame. Lastly, it calculates the distance from the camera using the area of the marker in pixels, with the formula (400 * focal_length_distance) / (area_px), only if the area of the marker is greater than 1000 pixels; the result is added to the distance_values_area list and displayed on the video frame. It then creates a time_width range with the length of the distance_values_width list. If the area of the marker is less than 1000 pixels, it appends the last distance_using_width, distance_using_length and distance_using_area values to the distance_values_width, distance_values_length and distance_values_area lists respectively.
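For reference, the three distance estimates described above reduce to a few lines of Python. The formulas and the 1000-pixel area threshold come from the script itself; the numeric constants below are illustrative placeholders, and the association of "length" with the vertical box side is our own assumption.

```python
MARKER_LENGTH_CM = 20.0        # real-world marker size (illustrative values)
MARKER_WIDTH_CM = 20.0
FOCAL_LENGTH_PX = 800.0        # focal length in pixels, from the calibration
FOCAL_LENGTH_DISTANCE = 800.0  # scale factor used by the area-based formula

distance_values_length, distance_values_width, distance_values_area = [], [], []

def update_distances(box):
    """box = (x1, y1, x2, y2) bounding box of the marker, in pixels."""
    x1, y1, x2, y2 = box
    length_px, width_px = (y2 - y1), (x2 - x1)
    area_px = length_px * width_px

    # Pinhole-style estimates from the box length and width.
    distance_values_length.append(MARKER_LENGTH_CM * FOCAL_LENGTH_PX / length_px)
    distance_values_width.append(MARKER_WIDTH_CM * FOCAL_LENGTH_PX / width_px)

    if area_px > 1000:
        # Area-based estimate, less sensitive to the marker being tilted.
        distance_values_area.append(400 * FOCAL_LENGTH_DISTANCE / area_px)
    elif distance_values_area:
        # Marker too small or lost: repeat the last valid estimate.
        distance_values_area.append(distance_values_area[-1])
```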
The script uses the OpenCV library to capture frames from the camera and display the output video. The segmented objects are displayed on the video frames using random colours. The centre x, y coordinates of the bounding boxes are also calculated and stored in the x_list and y_list. The script continues to capture and process frames until the camera is closed. Here are some examples of what we used it for:

● Testing the model on a video representing the real data
The code uses OpenCV's VideoCapture class to read a video file, and in each frame of the video, the frame is resized to a specific size, converted to a tensor and passed through the model to get the predictions. The predictions contain the bounding boxes, masks, and scores for each instance segmented in the image. This code calculates the distance between the camera and the object detected in the video frames using three different methods: using the length of the object, using the width of the object, and using the area of the object. The same computation as before is then used to compute the pose and distance of the marker in the video. For each instance, the code extracts the mask, bounding box, and score, and if the score is above a certain threshold, it is considered a valid instance. Then, the code calculates the centre of the bounding box, and appends the x and y values of the centre to the x_list and y_list respectively. Finally, the code overlays the mask on the original image, and the resulting image is displayed on the screen.

● Plotting the results
Here we focus on plotting and analysing the results obtained from the previous steps. The code begins by importing the necessary libraries such as matplotlib and pickle. Then, it loads the data from the pickle files that were created in the previous steps. These pickle files include the time, the distance values calculated using length, width, and area, as well as the x and y positions of the marker.

Some of the results of the plotting script

The code then creates a figure with 2 rows and 2 columns of subplots. The first subplot is a scatter plot that shows the x and y positions of the marker over time. The x-axis represents the x-position and the y-axis represents the y-position.
The colour of the points on the scatter plot represents the time at which the position of the marker was recorded. The second subplot is a line plot that shows the distance values calculated using the length, width, and area of the bounding box of the marker over time. The x-axis represents the time and the y-axis represents the distance values. The plot has three lines, one for each method of calculating distance. The third subplot is a line plot that shows the x-position of the marker over time. The x-axis represents the time and the y-axis represents the x-position. The fourth subplot is a line plot that shows the y-position of the marker over time. The x-axis represents the time and the y-axis represents the y-position. Finally, the code shows the plots and saves them to a file.

4.3.6. Results and discussions

4.3.6.1 Real time implementation results
Object's position over time:
- Translation of the marker from down-right position to up-left position
- Translation of the marker from down-left position to up-right position
- Translation of the marker from up position to down position
- Translation of the marker from right position to left position
Marker to camera distance over time:
- Marker-to-camera distance using the length of the bounding box in the computation
- Marker-to-camera distance using the width of the bounding box in the computation
- Marker-to-camera distance using the area of the bounding box in the computation
- Marker-to-camera distance using the three methods

4.3.6.2 Real time implementation discussion
The experiment aimed to track the position and distance of a marker from the webcam over a period of time. The first set of figures showed the position estimation of the marker's translation from position A to B, and the results were quite satisfying. Most of the noise generated in these figures was mainly a result of human errors. The goal of these plots was to determine whether the position was being accurately captured by the camera, and it can be inferred that the code is ready for pose estimation in the drone-captured video. In the next set of figures, we presented the distance estimation from the marker to the webcam over time. It was observed that the results obtained using the length and the width are essentially identical, while the one using the area of the bounding box differs. The method using the area computation was found to be much more accurate at far distances and when the orientation of the marker was off or even a little bit tilted. This highlights the benefit of using the mask area directly, since it is not affected by the orientation of the marker and offers more accurate results.

4.3.6.3 Drone captured video implementation results
- Tracking of the marker's X position over time
- Tracking of the marker's Y position over time
- Tracking of the marker's x and y over time
- Marker-to-drone distance using the three methods

4.3.6.4 Drone captured video implementation discussion
The experiment aimed to track the position and distance of a marker from a drone's camera over a period of time.
4.3.6. Results and discussions

4.3.6.1 Real time implementation results

Object's position over time:
Translation of the marker from the down-right position to the up-left position
Translation of the marker from the down-left position to the up-right position
Translation of the marker from the up position to the down position
Translation of the marker from the right position to the left position

Marker-to-camera distance over time:
Marker-to-camera distance using the length of the bounding box in the computation
Marker-to-camera distance using the width of the bounding box in the computation
Marker-to-camera distance using the area of the bounding box in the computation
Marker-to-camera distance using the three methods

4.3.6.2 Real time implementation discussion
The experiment aimed to track the position and the distance of a marker from the webcam over a period of time. The first set of figures shows the position estimation of the marker's translation from position A to position B, and the results are quite satisfying; most of the noise in these figures is due to human error while moving the marker. The goal of these plots was to determine whether the position was being captured accurately by the camera, and from them we can infer that the code is ready for pose estimation on the drone-captured video.

The next set of figures presents the distance estimation from the marker to the webcam over time. The distances computed from the length and the width of the bounding box are nearly identical, whereas the one computed from the area behaves differently: the area-based method proved considerably more accurate at far distances and when the orientation of the marker was off or slightly tilted. This highlights the benefit of using the mask area directly, since it is not affected by the orientation of the marker and gives more accurate results.

4.3.6.3 Drone captured video implementation results

Tracking of the marker's X position over time
Tracking of the marker's Y position over time
Tracking of the marker's X and Y positions over time
Marker-to-drone distance using the three methods

4.3.6.4 Drone captured video implementation discussion
The experiment aimed to track the position and the distance of a marker from the drone's camera over a period of time. The first two figures show the position estimation of the marker from the captured video: the first one tracks the X coordinate over time and the second one the Y coordinate. The results are good but not entirely satisfactory, since they contain a lot of noise. This noise could be removed with a filter, but when we tried one it did not work as expected: the filter slowed the real-time implementation down considerably, so we preferred to prioritise speed over noise filtering. The third figure shows both X and Y over time; here the noise is more clearly visible, as shown by the points at the extremities of the figure. The last figure presents the distance estimation from the marker to the drone over time. The results again differ between the distances computed from the length and the width, which are similar to each other, and the one computed from the area of the bounding box; as expected, the area-based method is much more accurate.

4.4 Position Comparisons
In order to understand how well the algorithms perform, we exported the results of the different algorithms to Excel files and then imported all the data into MATLAB. Below we show the comparison between OpenCV and Mask R-CNN for the X and Y coordinates, as well as the comparison between all three approaches.

OpenCV vs. Mask R-CNN
OpenCV vs. Mask R-CNN vs. OptiTrack

As can be seen in the figures above, the approach that gives the most accurate results in this case is OpenCV. Mask R-CNN is also a valid approach to take into consideration, since it recognises the marker most of the time, but at this stage it is not reliable enough at estimating the position accurately. The real improvement brought by Mask R-CNN is that it recognises the marker even when it is partially obstructed: between seconds 11 and 13 the OpenCV algorithm does not detect the marker at all, while Mask R-CNN still returns some data despite the partial visibility.

5. Speed Target Estimation
Given the promising results obtained for the evaluation of the distance between the marker and the camera, we now want to work on the observer algorithm, in particular to estimate the x and y velocity of the marker with respect to the camera. As mentioned before, this is needed because, ideally, before starting to land we would like to align the drone (camera) with the platform (marker); once they are aligned (the speed of the platform with respect to the drone is approximately zero), we start reducing the distance and make the drone land on the platform.

5.1 OpenCV
With the OpenCV approach, as a first step we decided to compute the mean speed from the estimated positions, taking into account the last N samples. In particular, for the x-axis (and analogously for the y-axis):

v_x = (x_k - x_(k-N)) / (t_k - t_(k-N))

where x_k and t_k are the current position and timestamp, and N is usually between 5 and 10, so that the velocity is not biased by samples that are too old. All speeds are expressed in cm/s.
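The following is a minimal sketch of this sliding-window estimator, assuming positions in centimetres and timestamps in seconds; the class and variable names are illustrative and not taken from our code.

```python
from collections import deque


class MeanSpeedEstimator:
    """Mean speed over the last N samples, as described above (positions in cm, time in s)."""

    def __init__(self, n_samples=10):
        # Keep the current sample plus at most the N previous ones.
        self.samples = deque(maxlen=n_samples + 1)

    def update(self, t, x, y):
        """Add a (timestamp, x, y) measurement and return (vx, vy) in cm/s, or None."""
        self.samples.append((t, x, y))
        if len(self.samples) < 2:
            return None
        t_old, x_old, y_old = self.samples[0]   # oldest stored sample, at most N steps back
        dt = t - t_old
        if dt <= 0:
            return None
        return (x - x_old) / dt, (y - y_old) / dt


# Hypothetical usage, feeding the positions produced by the marker detector:
estimator = MeanSpeedEstimator(n_samples=10)
for t, x, y in [(0.0, 0.0, 0.0), (0.1, 1.0, 0.0), (0.2, 2.1, 0.1)]:
    v = estimator.update(t, x, y)
    if v is not None:
        print(f"t = {t:.1f} s, vx = {v[0]:.1f} cm/s, vy = {v[1]:.1f} cm/s")
```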
We now present some tests.

Still target with N = 10
Moving target from left to right and vice versa with N = 10

In each test, the upper plot shows the position along the x-axis (blue) and the y-axis (orange), while the lower plot shows the corresponding speed along the x-axis (blue) and the y-axis (orange). Analysing the plots of the still target, we can see that the marker does not move and, accordingly, its estimated speed oscillates around 0. The plots of the moving target are more interesting, as they show the target moving along the x-axis: when the target moves from left to right with respect to the camera, its speed is positive. Moreover, between seconds 4 and 5 the position plot shows the marker increasing its x-position more slowly than before, and indeed the speed plot has a negative peak in this time range, meaning that the velocity along that axis decreases. Finally, since the marker is essentially still along the y-axis (orange lines), its velocity along the y-axis remains constant and close to 0.

We provide below two further test examples, one in which the target moves from bottom to top and vice versa, and one in which it moves from the bottom-left corner to the top-right corner and vice versa. The same considerations as in the previous example apply.

Moving target from bottom to top and vice versa with N = 10
Moving target from bottom-left to top-right and vice versa with N = 10

The results shown so far were obtained using only a PC-integrated webcam; nevertheless, they give an idea of how the system works and show that the results we obtained are at least theoretically correct. To really verify them, we would need to run this algorithm on a camera that records the moving platform, with the coordinates and the speed of the platform along the x-axis and the y-axis known from a sensor other than the camera.

6. CONCLUSIONS
In conclusion, the goal of this work was to track the position and distance of a marker from a webcam or a drone camera over a period of time. We implemented a real-time object detection and instance segmentation model using Mask R-CNN, based on the ResNet-50 FPN architecture pre-trained on the COCO dataset. The results of the real-time implementation were quite satisfying, with the position of the marker being captured accurately by the camera. We also found that using the area of the bounding box in the computation of the marker-to-camera distance was much more accurate at far distances and when the orientation of the marker was off or tilted. In the drone-captured video implementation we faced some noise issues, but we still managed to track the marker's position and distance over time. Overall, the results of this work demonstrate the potential of deep learning models for real-time object detection and instance segmentation in various applications.

Finally, our experimentation with using an ArUco marker and OpenCV to land a drone has been a valuable learning experience. The results we obtained for the estimation of the x and y position were quite promising and comparable to the OptiTrack system. For the velocity estimation, however, we only obtained theoretically good results: during the testing phase we did not achieve the desired outcome. The reason for this could be the limited time available for the experimentation and the complexity of the task itself. In our opinion, even with more time we would have needed further testing to achieve good results, and we would probably have had to apply some filtering (for instance a Kalman filter) to improve them. The results of this project have highlighted the potential of using ArUco markers and OpenCV for drone navigation. While we did not achieve the desired results for velocity estimation, the results for x and y position estimation were comparable to the OptiTrack system.
Therefore, this project has shown that ArUco markers and OpenCV can potentially be effective tools for drone navigation.

REFERENCES
[1] Adam Marut, Konrad Wojtowicz, Krzysztof Falkowski. ArUco markers pose estimation in UAV landing aid system. 2019.
[2] Igor Lebedev, Aleksei Erashov, Aleksandra Shaba. Accurate Autonomous UAV Landing Using Vision-Based Detection of ArUco-Marker. In: A. Ronzhin et al. (Eds.), ICR 2020, LNAI 12336, pp. 179–188, Springer Nature Switzerland AG, 2020. https://doi.org/10.1007/978-3-030-60337-3_18
[3] Artur Khazetdinov, Tatyana Tsoy, Evgeni Magid, Aufar Zakiev, Mikhail Svinin. Embedded ArUco: a novel approach for high precision UAV landing. International Siberian Conference on Control and Communications (SIBCON), 2021.
[4] Jamie Wubben, Francisco Fabra, Carlos T. Calafate, Tomasz Krzeszowski, Johann M. Marquez-Barja, Juan-Carlos Cano, Pietro Manzoni. Accurate Landing of Unmanned Aerial Vehicles Using Ground Pattern Recognition. 12 December 2019.
[5] Pranav Adarsh, Pratibha Rathi, Manoj Kumar. YOLO v3-Tiny: Object Detection and Recognition using one stage improved model. 2020 6th International Conference on Advanced Computing & Communication Systems (ICACCS).
[6] https://docs.ultralytics.com/ - YOLO documentation
[7] https://docs.opencv.org/4.x/d5/dae/tutorial_aruco_detection.html - ArUco libraries for OpenCV

ORIGINAL GANTT DIAGRAM

UPDATED GANTT DIAGRAM