ISCF
Landing strategy of an aerial vehicle on moving targets
using deep learning algorithms
Presented by:
Giacomo ROSATO
Gabriele SORO
Reda TALEB
Presented to:
Pedro Castillo Garcia
Armando Alatorre Sevilla
TABLE OF CONTENTS
1. Introduction
2. ArUco marker
3. Landing strategy
4. Target detection
4.1. OpenCV
4.1.1. Python Libraries
4.1.2. Camera Calibration
4.1.3. Marker detection with OpenCV
4.1.3.1. Short overview of the Levenberg-Marquardt algorithm
4.1.4. Results
4.1.4.1. Results using the laptop camera
4.1.4.2. Comparison results using Optitrack system
4.2. YOLO
4.3. MASK R-CNN
4.3.1. Types of Mask R-CNN
4.3.2. Which one we chose
4.3.3. Libraries used
4.3.4. General Explanation
4.3.5. Code explanation in details
4.3.5.1. Mask R-CNN Training
4.3.6. Results and discussions
4.3.6.1. Real time implementation results
4.3.6.2. Real time implementation discussion
4.3.6.3. Drone captured video implementation results
4.3.6.4. Drone captured video implementation discussion
4.4. Position Comparisons
5. Speed Target Estimation
5.1. OpenCV
6. CONCLUSIONS
REFERENCES
ORIGINAL GANTT DIAGRAM
UPDATED GANTT DIAGRAM
1. Introduction
Aerial vehicles, such as drones and quadcopters, have become increasingly popular in recent years for a
wide range of applications, including delivery, search and rescue, and inspection. One important
challenge in the operation of aerial vehicles is the ability to land autonomously and accurately on a
moving target. This is particularly useful in scenarios where the vehicle needs to land on a platform that
is itself in motion, such as a boat or a moving vehicle.
To achieve a successful landing on a moving target, the aerial vehicle must be able to accurately estimate
the position, orientation, and velocity of the target. Deep learning algorithms can be used to address
these challenges by providing a means to process visual data from cameras mounted on the vehicle and
make real-time predictions about the target's motion.
In order to get a dynamic landing trajectory, our goal can be divided into two main steps:
1) Finding a deep learning algorithm to estimate target information, such as position and
orientation. The algorithm receives images data, captured from a camera positioned at the base
of the drone, which points vertically to the ground.
2) Implementing an observer algorithm to estimate the target’s velocity.
Having a target speed estimation gives a better accuracy and reliability of the landing since it
allows one to anticipate the future position and adjust its trajectory accordingly, especially when
the target is moving quickly.
Once we find the dynamic trajectory, thanks to an already implemented tracking controller, we should
be able to control the quadcopter and make it land correctly and safely.
2. ArUco marker
The choice of an appropriate marker on the moving target is an important aspect for better detectability and trackability by the vehicle's camera while the landing is performed.
These markers can have different shapes, such as QR codes, bar codes or circular patterns. The most widely used in this field are the ArUco markers [1].
We decided to use them for their robustness, accuracy and ease of use.
They consist of square shapes with a unique black and white pattern in order to be easily detected. The marker IDs are defined in a dictionary, either predefined in the ArUco module or manually created by the user.
One of the main advantages is that the libraries are implemented in C++ and usable in Python along with OpenCV.
Their design is intended to provide a quick 3D position of the camera with respect to the marker.
In addition, the detection algorithm was implemented with Hamming codes to guarantee a high resistance to false detections. A Hamming code consists of a set of check bits that are interleaved with the data bits in the marker. The check bits are chosen such that each one is a function of a subset of the data bits. When the marker is detected by the camera, the image of the marker is processed to extract the data bits and the check bits. The algorithm can then determine the locations of the errors and correct them (the number of correctable bits is given by the minimum distance of the code).
In the early experiments we used the ArUco marker with ID_10 and a 6x6 bit grid (Figure 1).
Fig. 1 6x6 ID_10 ArUco Marker
Later on, we thought about how to improve the landing robustness while the quadcopter gets closer to the target, and we found an ingenious approach [3] that consists of using two different ArUco markers, one within the other.
This feature is especially useful at late stages of the landing, when the outer marker is too close to the camera
and may not be entirely captured by the camera's frame. In this situation, the inner marker can provide
additional information for the aerial vehicle to use to accurately position itself for landing.
When selecting a particular ID for the outer ArUco marker, note that the inner ArUco marker replaces a single black "pixel" in its centre. The outer ArUco marker ID selection is therefore limited to the IDs with a black block in the centre, while it is recommended to select an inner ArUco marker whose black pixels outnumber its white ones. This guarantees a more reliable detection.
Therefore, as markers, we used two 7x7 ArUco with outer marker ID_33 and inner marker ID_29, as shown
in Figure 2.
Fig. 2 7x7 Outer marker ID_33,
Inner marker ID_29
3. Landing strategy
The idea behind the use of ArUco markers for a dynamic landing strategy is the following [2]:
Once the coordinates and the distance from the marker are computed, the quadcopter first has to reach the centre of the marker in order to switch to "land" mode. In this mode, the drone must maintain the same position with respect to the target and smoothly descend to a certain small height H, at which the engines are turned off.
4. Target detection
For our purpose, there are many different approaches to detect the moving target in real time.
We mainly considered three of them: OpenCV, and the YOLO and Mask R-CNN deep learning algorithms.
From different studies, in particular [5], we found there are pros and cons for each one:
OpenCV
● Pros:
- It is a widely used library for computer vision tasks, and has a large community of developers and users.
- It contains a wide range of pre-built algorithms and functions that can be used for real-time object detection.
- Generally considered to be less computationally expensive than deep learning-based algorithms.
● Cons:
- OpenCV's object detection algorithms may not be as accurate as more recent deep learning-based algorithms, especially in case of marker obstruction.
Yolo
● Pros:
- It is a real-time object detection system that is able to achieve high detection accuracy while maintaining a fast processing speed.
- It uses a single convolutional neural network for end-to-end object detection, which makes it less computationally expensive than other approaches like region-based CNNs.
● Cons:
- Less accurate than some other object detection models, such as Faster R-CNN and Mask R-CNN, especially when it comes to detecting small objects.
- Not compatible with ArUco; issues in getting position and orientation information of the markers.
MASK R-CNN
● Pros:
- State-of-the-art object detection model that is able to achieve high detection accuracy while also generating instance segmentation masks.
- It is a two-stage detection model: its first stage generates region proposals, which are fed to a CNN; having region proposals makes the detection much more accurate.
● Cons:
- Mask R-CNN is computationally expensive, especially when it comes to generating instance segmentation masks; it might not be the best approach for real-time object detection in resource-constrained environments such as aerial vehicles.
- Mask R-CNN also requires a large amount of data and computational resources to train the model.
There are many other deep learning approaches, such as SSD and Faster R-CNN, but we took into account the ones that were best suited to our needs.
Here are some comparisons from [5] that give an idea of the deep learning performance in terms of accuracy and speed:
Fig. 3 Speed vs. Accuracy for object detection methods
Fig. 4 Accuracy comparison for different sizes of target objects
Despite the outstanding performance of YOLO, we used Mask R-CNN, which is an extension of Faster R-CNN that adds an additional branch for predicting segmentation masks with even better accuracy.
4.1. OpenCV
In the following section we present the reasons, the methodologies and the approach behind why and how we implemented ArUco marker detection with OpenCV:
- OpenCV is a widely used and well-documented computer vision library, making it easy to find resources and guidance on using its various functions.
- The OpenCV Python library provides a range of functions specifically designed for detecting and tracking ArUco markers. These functions make it easy to implement ArUco marker detection and tracking in a Python script.
- The OpenCV Python library is highly efficient, allowing for real-time processing of images and videos. This is important when working with drones, as accurate and timely position estimation is crucial for safe operation.
- OpenCV is compatible with a wide range of programming languages, including Python, which makes it easy to integrate with other libraries and tools.
- OpenCV is optimised for fast and efficient image and video processing, which makes it well-suited for use with ArUco markers, which require fast and accurate detection and tracking.
Despite Python being a high-level programming language, the OpenCV functions are optimised in the sense that they are written in C++ (a low-level language), while Python works only as an easy-to-use wrapper for calling those C++ functions from a simpler language.
In summary, using the OpenCV Python library to implement ArUco marker detection and tracking provides a
reliable, efficient, and customizable solution for accurately estimating the position of a moving target.
4.1.1. Python Libraries
For the implementation of the OpenCV approach we used several libraries. In particular:
- OpenCV2-contrib, an additional library that extends the functionality of the OpenCV library. It contains a collection of algorithms and utilities that are not included in the core OpenCV library, for example the sub-library used to generate, detect, and estimate the pose of ArUco markers.
- The Numpy library, used for scientific computing. It provides a range of functions and data structures for working with numerical data, including support for multi-dimensional arrays, matrices, and mathematical operations.
- The Time library, a built-in Python library that provides functions for working with time and dates. It includes a range of functions for obtaining the current time, formatting times and dates, and performing time-related calculations.
- Matplotlib, a Python library used for data visualisation. It provides a wide range of functions and tools for creating plots, charts, and other types of visualisations.
- The Math library, which provides a range of mathematical functions and constants. It includes functions for performing basic mathematical operations, such as addition, subtraction, multiplication, and division, as well as more advanced functions, such as trigonometric and logarithmic functions.
- The xlsxwriter library, which provides a range of functions and tools for creating and writing to Microsoft Excel files. It allows developers to create spreadsheet files, add data to them, and apply formatting and styling to the data.
4.1.2. Camera Calibration
An important step before approaching the development of the recognition code itself is to calibrate the camera that we want to use for the marker recognition. In fact, when a camera is not calibrated, it can produce distorted or skewed images, which can impact the accuracy of image analysis and object detection algorithms.
There are several benefits to calibrating a camera in openCV:
1) Improved accuracy: Calibrating a camera allows for more accurate measurements and estimates of
object positions and sizes in the images produced by the camera. This is important for applications
that rely on precise measurements, such as robotics or surveying.
2) Enhanced reliability: Calibrating a camera can improve the reliability of image and video processing
algorithms by reducing the impact of distortion and alignment issues. This can lead to more
consistent results and fewer errors.
3) Enhanced performance: Calibrating a camera can also improve the performance of image and video
processing algorithms by reducing the amount of data that needs to be processed. This can lead to
faster processing times and improved efficiency. Indeed, we can correct for distortion and alignment
issues in the images produced by the camera, which can reduce the amount of data that needs to be
processed. This can lead to faster processing times and improved efficiency.
The camera calibration process involves estimating the intrinsic and extrinsic parameters of the camera. The
intrinsic parameters describe the properties of the camera itself, such as the focal length, principal point, and
distortion coefficients. The extrinsic parameters describe the position and orientation of the camera in
relation to the world. In particular, the parameters that we need are:
- camera matrix: a 3x3 matrix that represents the intrinsic parameters of the camera. It includes information such as the focal length and principal point of the camera.
- distortion coefficients: a vector of 4 or 5 parameters that represent the distortion caused by the camera's lenses. These coefficients can be used to correct for distortion in the image.
- rotation vectors: a list of 3x1 vectors that represent the rotational components of the extrinsic parameters of the camera.
- translation vectors: a list of 3x1 vectors that represent the translational components of the extrinsic parameters of the camera.
You can use the camera matrix and distortion coefficients to correct for distortion in the image and project
3D points onto the image plane, and the rotation and translation vectors to compute the pose of the camera
in relation to the calibration pattern.
To calibrate a camera in openCV, we typically need to capture a series of images of a calibration pattern, such
as a chessboard or a set of points. These images are used to estimate the intrinsic and extrinsic parameters
of the camera. In our case, we decided to implement the calibration using a 9x6 chessboard.
To accomplish this task we used three main built-in OpenCV functions:
1) findChessboardCorners, which is used to detect the corners of a chessboard pattern in an image. It takes an image as input and returns the coordinates of the corners of the chessboard pattern using a combination of different detection algorithms, among them Canny edge detection, Sobel edge detection, Laplacian edge detection and pattern matching. In particular, the parameters used are:
- image: the input image, which should be a 2D array of greyscale pixels,
- patternSize: the number of rows and columns in the chessboard pattern.
Its outputs are:
- retval: a Boolean value indicating whether the chessboard pattern was found in the image. If set to True, the chessboard pattern was found and the corners have been detected.
- corners: a list of points representing the corners of the chessboard pattern in the image. These points are returned as a NumPy array of (x, y) coordinates, with the top-left corner first, followed by the remaining corners in row-major order.
2) cornerSubPix, which takes an image and a set of initial corner points as input, and uses optimization algorithms to refine the position of the corner points to more accurately reflect the location of the corners in the image. It takes as input parameters:
- image: the input image in which the corner points are located,
- corners: a list of points representing the initial corner points of the chessboard pattern in the image. These points should be in the same order as the chessboard squares,
- winSize: the size of the search window used to refine the corner points, specified as a tuple of (width, height),
- zeroZone: the size of the "dead zone" in the middle of the search window, specified as a tuple of (width, height),
- criteria: a tuple of termination criteria for the optimization algorithm, including the maximum number of iterations and the required accuracy.
The cornerSubPix function is typically used in conjunction with the findChessboardCorners function. Its output is a vector of the new, optimised corners.
3) drawChessboardCorners, a function used to visualise the corners of a chessboard pattern in an image. It takes an image and a set of corner points as input, and overlays the corner points on the image to create a visual representation of the chessboard pattern. The parameters for this function are:
- image: the input image on which the chessboard corners will be overlaid,
- patternSize: the size of the chessboard pattern in the image,
- corners: a list of points representing the corners of the chessboard pattern in the image, resulting from the cornerSubPix function,
- patternWasFound: a Boolean value indicating whether the chessboard pattern was found in the image. If set to True, the corners will be overlaid on the image. If set to False, no overlays will be drawn.
When we run the script, we visualise the frame and the chessboard with lines delimiting the chessboard itself. An example is reported below.
Example of chessboard recognition
We need to store multiple images in different orientations and positions in order for the algorithm to
understand the camera distortion. Examples are reported below.
Examples of different images used for calculating camera distortion coefficients
After we have taken a sufficient number of images, we need another algorithm that computes all the distortion matrices. To achieve this task we use the cv2.calibrateCamera function. To compute the intrinsic and extrinsic
parameters of the camera, cv2.calibrateCamera uses an optimization algorithm to minimise the error
between the observed 2D points in the images and the projected 3D points on the image plane. The
optimization process involves adjusting the camera matrix and distortion coefficients until the error is
minimised.
Once the optimization process is complete, cv2.calibrateCamera returns the optimised camera matrix and
distortion coefficients, as well as the rotation and translation vectors that describe the pose of the camera in
relation to the calibration pattern. These matrices and vectors can then be used to correct for distortion and
compute the pose of the camera in other images.
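As an illustration, here is a minimal sketch of how these functions fit together in a calibration script; the image folder name, file pattern and output file names are assumptions, not the exact code used in this project:

    import glob
    import cv2
    import numpy as np

    pattern_size = (9, 6)   # inner corners of the 9x6 chessboard used for calibration
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)

    # 3D coordinates of the chessboard corners in the board frame (z = 0 plane)
    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

    obj_points, img_points = [], []
    for fname in glob.glob("calibration_images/*.jpg"):   # assumed folder of captured frames
        img = cv2.imread(fname)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        found, corners = cv2.findChessboardCorners(gray, pattern_size, None)
        if found:
            corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
            obj_points.append(objp)
            img_points.append(corners)
            cv2.drawChessboardCorners(img, pattern_size, corners, found)

    # Levenberg-Marquardt based optimisation of the intrinsic and extrinsic parameters
    ret, camera_matrix, dist_coeffs, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, gray.shape[::-1], None, None)
    np.save("camera_matrix.npy", camera_matrix)
    np.save("dist_coeffs.npy", dist_coeffs)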
In particular, the specific optimization algorithm used by cv2.calibrateCamera is the Levenberg-Marquardt
algorithm, which is a popular choice for non-linear optimization problems. The Levenberg-Marquardt
algorithm is an iterative method that involves linearizing the optimization problem at each iteration and
solving for the optimal update using a combination of the gradient descent and the Gauss-Newton method.
The Levenberg-Marquardt algorithm has several advantages, including fast convergence, good robustness,
and the ability to handle large numbers of variables. It is generally considered to be a reliable and efficient
optimization method for camera calibration and other non-linear optimization problems.
4.1.3. Marker detection with OpenCV
First, we give a general idea of the approach used to detect the marker and its position.
OpenCV provides a number of functions for detecting ArUco markers in images. The main function for
detecting ArUco markers is cv2.aruco.detectMarkers, which takes an image and a dictionary of ArUco
markers as inputs and returns a list of detected markers and their positions in the image. To use
cv2.aruco.detectMarkers, you will first need to get a dictionary of ArUco markers using the
cv2.aruco.Dictionary_get function. The dictionary specifies the properties of the ArUco markers, such as the
number of bits, the number of markers, and the marker layout. Once you have a dictionary of ArUco markers,
you can use the cv2.aruco.detectMarkers function to detect the markers in an image. This function returns a
list of detected markers, along with their positions and orientations in the image.
We now describe the parameters and the outputs of these main functions:
- cv2.aruco.Dictionary_get is a function in the OpenCV ArUco module that returns a predefined ArUco marker dictionary with a given name. ArUco marker dictionaries specify the properties of ArUco markers, such as the number of bits, the number of markers, and the marker layout. To use cv2.aruco.Dictionary_get, you need to provide a string argument specifying the name of the dictionary you want to retrieve. OpenCV includes a number of predefined dictionaries with names such as DICT_4X4_50, DICT_4X4_100, and DICT_6X6_250.
- DetectorParameters_create creates and returns a pointer to an instance of the DetectorParameters struct, which stores various parameters used in feature detection algorithms in OpenCV, such as the maximum number of features to detect, the minimum quality of image features to retain, and the size of the neighbourhood to consider when performing feature matching. The output of this function is used in cv2.aruco.detectMarkers.
- cv2.aruco.detectMarkers is used to detect ArUco markers in the input image. The function takes as input the image, a dictionary of markers to search for, and a DetectorParameters object that specifies various parameters for the detection process. It returns the IDs of the detected markers and the corners of each marker in the image. The detection pipeline includes thresholding, in which the input image is first thresholded to create a binary image where pixels are either black or white; this helps to separate the markers from the background. It is followed by contour detection, in which the image is processed to find the contours (boundaries) of connected regions of white pixels; each contour is assumed to correspond to a marker in the image. Then there is the marker identification phase, where the contours are analysed to determine whether they correspond to valid ArUco markers; this is done by searching for the required number of black and white borders around the contour, and checking that the border pixels are arranged in the correct pattern. Finally, in the corner detection phase, once a marker has been identified, the corners of the marker are detected using the perspective transformation of the marker. This allows the position and orientation of the marker to be determined.
- cv2.aruco.estimatePoseSingleMarkers is used to estimate the pose of each ArUco marker detected in the input image. The function takes as input the corner points of the markers, the dimension in centimetres of the markers in the real world, and the camera matrix and distortion coefficients of the camera. It returns the rotation and translation vectors for each marker, which can be used to determine the 3D pose of the marker in the camera coordinate system. It uses a perspective-n-point (PnP) [Levenberg-Marquardt] algorithm: this function uses the iterative PnP algorithm implemented in the solvePnP() function to estimate the rotation and translation vectors of the marker. This algorithm uses an iterative approach to minimise the error between the observed 2D image points and the projected 3D object points.
- cv2.drawFrameAxes is a function that draws 3D coordinate frame axes on an image or video frame. It takes in a 3D point and a rotation vector, and it draws the X, Y, and Z axes of the coordinate frame centred at the given point, with the X, Y, and Z axes oriented according to the given rotation vector.
After the detection and drawing phases we show an image that looks like this:
The data displayed on the marker are the distance between the marker and the camera, the rotation of the y-axis (green) with respect to the camera, and the x and y positions of the centre of the marker with respect to the centre of the camera.
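The following is a minimal sketch of this detection and drawing loop using the functions described above; the dictionary choice, marker size and calibration file names are assumptions:

    import cv2
    import cv2.aruco as aruco
    import numpy as np

    camera_matrix = np.load("camera_matrix.npy")   # from the calibration step
    dist_coeffs = np.load("dist_coeffs.npy")
    MARKER_SIZE_CM = 10.0                          # assumed physical side length of the marker

    dictionary = aruco.Dictionary_get(aruco.DICT_7X7_50)
    parameters = aruco.DetectorParameters_create()

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        corners, ids, _ = aruco.detectMarkers(gray, dictionary, parameters=parameters)
        if ids is not None:
            # One rotation/translation vector pair per detected marker (camera coordinate system)
            rvecs, tvecs, _ = aruco.estimatePoseSingleMarkers(
                corners, MARKER_SIZE_CM, camera_matrix, dist_coeffs)
            aruco.drawDetectedMarkers(frame, corners, ids)
            for rvec, tvec in zip(rvecs, tvecs):
                cv2.drawFrameAxes(frame, camera_matrix, dist_coeffs, rvec, tvec, MARKER_SIZE_CM / 2)
        cv2.imshow("ArUco detection", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()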
4.1.3.1 Short overview of the Levenberg-Marquardt algorithm
The iterative PnP algorithm, which relies on the Levenberg-Marquardt optimization method, is used to determine the 3D pose of an object from a 2D image. It is implemented in the solvePnP() function in OpenCV.
Here is a high-level overview of how the iterative PnP algorithm works:
1) Initialise the rotation and translation vectors with some initial guess.
2) Project the 3D object points onto the image plane using the current rotation and translation vectors
to obtain the predicted 2D image points.
3) Calculate the error between the observed 2D image points and the predicted 2D image points.
4) Update the rotation and translation vectors to minimise the error.
5) Repeat steps 2-4 until the error is below a certain threshold or the maximum number of iterations is
reached.
The iterative PnP algorithm uses an iterative approach to minimise the error between the observed 2D image
points and the projected 3D object points. At each iteration, it adjusts the rotation and translation vectors to
reduce the error between the observed and predicted points. This process continues until the error is below
a certain threshold or the maximum number of iterations is reached.
One of the advantages of the iterative PnP algorithm is that it can handle a wide range of configurations and
point correspondences, making it a versatile and robust method for estimating the pose of an object in an
image.
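A small self-contained example of the iterative PnP call: we synthesise image points from a known pose with cv2.projectPoints and then recover that pose with cv2.solvePnP, which minimises the reprojection error as described above. The marker size, camera matrix and pose values are arbitrary illustrative numbers:

    import cv2
    import numpy as np

    # Four marker corners in the marker's own frame (side s, lying on the z = 0 plane)
    s = 10.0
    object_points = np.array([[-s/2,  s/2, 0], [ s/2,  s/2, 0],
                              [ s/2, -s/2, 0], [-s/2, -s/2, 0]], dtype=np.float32)

    # Toy pinhole camera and a "true" pose, used only to synthesise the image points
    camera_matrix = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
    dist_coeffs = np.zeros(5)
    rvec_true = np.array([[0.1], [0.2], [0.0]])
    tvec_true = np.array([[5.0], [-3.0], [100.0]])
    image_points, _ = cv2.projectPoints(object_points, rvec_true, tvec_true,
                                        camera_matrix, dist_coeffs)

    # Iterative PnP: starts from an initial guess and minimises the reprojection error
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    print(rvec.ravel(), tvec.ravel())   # should closely match rvec_true and tvec_true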
4.1.4. Results
The main vector that we use for extracting the data that we need is the translation vector (t-vec) that we get at each iteration of the previous algorithm. In fact, the detection part resides inside a while loop, and the translation vector is only generated when a marker is found.
Since in our case we want to handle two nested markers, we developed a function that, based on the ID of the detected ArUco marker, chooses whether it is the bigger marker or the smaller one. Based on this, we have a different t-vec for each estimation, depending on the marker detected at the current iteration.
We define the two vectors as
$t_i^m = [x_i^m \;\; y_i^m \;\; z_i^m]^T$
$t_i^M = [x_i^M \;\; y_i^M \;\; z_i^M]^T$
in which the index i refers to the i-th iteration, m refers to the smaller marker and M refers to the bigger marker.
This approach allows us to determine, depending on the marker (bigger or smaller), the position of the centre of the marker in the image with respect to the centre of the camera, and also the height of the camera with respect to the marker. Given those parameters we can compute the Euclidean distance between the marker and the camera, defined as:
$d = \sqrt{(x_i^m)^2 + (y_i^m)^2 + (z_i^m)^2}$
Here we defined the equation for the smaller marker, but it works analogously for the bigger one.
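A short helper illustrating this computation; the preference for the bigger marker when it is still detected reflects our reading of the strategy described in the following paragraphs, and the IDs are the ones of Figure 2:

    import numpy as np

    OUTER_ID, INNER_ID = 33, 29   # outer (bigger) and inner (smaller) nested marker IDs

    def marker_distance(tvec):
        """Euclidean distance between the camera and the marker centre from a t-vec."""
        x, y, z = np.ravel(tvec)
        return float(np.sqrt(x**2 + y**2 + z**2))

    def select_tvec(ids, tvecs):
        """Pick the t-vec of the bigger marker when available, otherwise the smaller one."""
        id_list = list(np.ravel(ids))
        if OUTER_ID in id_list:
            return tvecs[id_list.index(OUTER_ID)]
        if INNER_ID in id_list:
            return tvecs[id_list.index(INNER_ID)]
        return None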
As discussed before, we tested the approach on two different markers. The first time we used a simple
marker, and after that we decided to improve it by using two nested markers.
4.1.4.1 Results using the laptop camera
For the first tests we used our own camera on the laptop to compare the real data, such as the distance,
measured by hand, with the estimations of the OpenCV algorithm.
Note: every graph in this paragraph has the distance (cm) on the Y-axis and the time (seconds) on the X-axis.
Firstly we present some tests referring to the case in which we only have one marker:
Fixed Marker – Real distance of 98cm
Fixed Marker – Real distance of 6.06M
As can be seen, the algorithm works well at small distances, estimating the real value accurately and with a small standard deviation. We can also see that we do not lose frames, since we have 30 samples/second, which is the framerate of the camera. This changes as the distance between the marker and the camera increases. In this case we have much more noise, the mean of the prediction is farther from the real value, and we start losing detections (we have fewer samples/second).
We then tested the algorithm while approaching the camera with the marker. The main problem that we encountered is that when we approach the camera, the marker becomes too big for it to be detected. The function implemented in OpenCV, in fact, is not able to recognize the marker given only a portion of it; the whole marker needs to be within the acquired image. In the image below on the left we show that the minimum distance that we get is about 23 cm, which is too large a distance to be able to make the drone land safely. Implementing a smaller marker is not a good way to resolve this issue, as it does not allow the marker to be detected at longer distances. These are the reasons why we chose to implement the nested marker. In the image on the right we show an example of approaching with the new marker.
Approaching the marker with simple marker
Approaching the marker with nested marker
As we can see, we now have a much smaller distance at which the drone could be able to land safely.
A thing to notice in the right image is that there is a distance of ~80 cm at which the algorithm starts switching from the detection of the bigger marker to the smaller one. In this case, since the smaller marker is very small, its estimation is not perfect. Since at this distance we can still detect the bigger marker, which is more accurate, we decided to compute the mean of the last 5-10 acquired samples in order to have a more realistic target distance. Below are the results while we vary the distance.
Varying the distance w/ last 10 samples mean
As a final improvement on this part of the detection, we had to handle the case where there are no detections. The simplest idea to implement was that, when we do not have a detection, we compute the mean of the last 5-10 samples. This solution works well when detection of the marker is intermittent, as it still allows its position to be updated; it also gives weight to new detections while taking the previous ones into account. We provide an example below, where the green points are the cases when there is no detection.
Approaching the marker adding random obstructions
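A minimal sketch of this fallback, with the window length as an assumption (the report uses the last 5-10 samples):

    from collections import deque
    import numpy as np

    recent = deque(maxlen=10)   # sliding window of the most recent distance estimates

    def update_distance(new_distance=None):
        """Return a smoothed distance; when the marker is not detected, reuse the recent mean."""
        if new_distance is not None:
            recent.append(new_distance)
        return float(np.mean(recent)) if recent else None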
4.1.4.2 Comparison results using Optitrack system
In this section we explain how we compared our results with those obtained with the Optitrack system. First, we need to define the parameters that we obtain when we use the new sensor.
Of course, the first data that we need is a video recorded from the drone that captures the moving target. The Optitrack system allows us to have multiple data streams for both the target and the drone. The most useful data for us are:
- x = x position (Target)
- y = y position (Target)
- dx = velocity along the x-axis
- dy = velocity along the y-axis
- V = speed
- x_Drone = x position (Drone)
- y_Drone = y position (Drone)
- yaw_Drone = yaw angle (Drone)
We have the data stored in an Excel document in which the first column is the instant in which the sample
was taken.
To get the data that we needed from the camera, we ran our algorithm taking as input the video that was
taken from the camera of the drone.
We stored the data in an Excel document that was then imported into MATLAB together with the Optitrack data. In this case, to better generalise the problem and better understand whether the positions obtained with the Optitrack system and the camera are comparable, we decided to evaluate the x and y variation of the marker with respect to the camera by translating all the signals to the origin (a minimal sketch of this alignment is given below). The results obtained are shown in the following figure.
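The comparison itself was done in MATLAB; the following Python sketch only illustrates the same idea of translating both trajectories to the origin before comparing them (file names and column labels are assumptions):

    import pandas as pd

    cam = pd.read_excel("camera_results.xlsx")      # columns assumed: t, x, y (our algorithm)
    opti = pd.read_excel("optitrack_results.xlsx")  # columns assumed: t, x, y (Optitrack)

    def to_origin(df):
        """Translate a trajectory so it starts at the origin; we compare variations, not absolute positions."""
        out = df.copy()
        out["x"] = df["x"] - df["x"].iloc[0]
        out["y"] = df["y"] - df["y"].iloc[0]
        return out

    cam_var, opti_var = to_origin(cam), to_origin(opti)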
Comparison between Optitrack System and Camera System
As can be seen in the graphs, the sensor approximates the variation of X and Y accurately, in some cases matching the Optitrack data exactly. In the red line it can be seen that in some cases the signal flattens; those are the cases in which the marker is not detected by the camera, while of course we still have the signal from the Optitrack system.
To better understand how well the algorithm works, we plot below the error obtained for the X and Y variation.
As can be seen above, the errors clearly increase when the marker is not detected. Overall, these results are interesting with regard to the ability of the algorithm to estimate the position of the target, but we can see that, even when the marker is correctly detected, in some cases we have an error which is too high considering the relatively low distance between the marker and the camera (~2 m in this case). To have more confidence in what we obtained, we should work with more data and improve some aspects such as the calibration coefficients of the camera. For the sake of completeness, we provide below the distance provided by the Optitrack system.
4.2. YOLO
YOLO[6] is a convolutional neural network (CNN) architecture for object detection in images. The key idea
behind YOLO is to perform object detection in a single pass of the network, rather than using a two-stage
approach like region proposal followed by classification. This allows YOLO to process images in real-time,
making it suitable for use in applications such as video surveillance, self-driving cars, and robotics.
In order to accomplish this, YOLO divides an input image into a grid of cells and each cell is responsible for
predicting a set of bounding boxes, which are a set of rectangles that tightly encloses the objects of interest.
Each bounding box also has an associated class probability, which indicates the likelihood that the box
contains an object of a particular class. For example, it can predict the class "car" with high probability and
the class "person" with lower probability for a box that encloses a car.
YOLO uses a single convolutional neural network architecture that takes an entire image as input, and outputs
a set of bounding boxes and class probabilities for each cell in the grid. This makes it more efficient and faster
than many other object detection algorithms that rely on multiple stages or networks.
YOLO is primarily designed for object detection and classification, rather than for specifically extracting the
position and orientation of an Aruco marker[7]. While it is possible to use YOLO to detect Aruco markers in
an image, it may not be the most efficient or accurate way to extract the marker's position and orientation.
Once YOLO has been trained to detect Aruco markers, it can output bounding boxes around detected markers
in an image. However, the position and orientation information of the markers is not directly encoded in the
bounding boxes themselves, so additional post-processing would be required to extract this information.
Using YOLO to detect ArUco markers has some drawbacks that need to be considered:
- Training data: to use YOLO to detect ArUco markers, a dataset of images with ArUco markers in various poses and backgrounds needs to be collected and used to train the model, which can be a time-consuming and resource-intensive task.
- Detection efficiency: YOLO is primarily designed for general object detection and not for specific shapes such as markers, which means it may not be as efficient or accurate as other algorithms specifically designed for detecting ArUco markers, such as the ArUco library.
- Processing time: YOLO is a complex model and processing an image through it can take a significant amount of time, making it less suitable for real-time applications.
- Marker's information: YOLO's detection output is represented in terms of bounding boxes, which are used to enclose the objects of interest; this may not provide the most accurate or detailed information about the position and orientation of the marker, which may require more specific information like marker corner positions.
- Complexity: using YOLO to detect ArUco markers requires some modification to the architecture of YOLO and a post-processing step to extract the marker's position and orientation information, which is not straightforward.
Segmentation-based methods built on region-based convolutional neural networks may be a better choice for detecting ArUco markers than using YOLO, as they are specifically designed for object segmentation and can provide more accurate and detailed information about the marker's position and shape.
In a segmentation-based approach, the model would be trained to predict a binary mask indicating the pixels
that belong to the marker, rather than predicting a bounding box around the marker, like in YOLO. The
prediction of a binary mask can provide a more fine-grained representation of the marker's shape, and this
is crucial for ArUco marker detection as the markers have specific geometric shapes. In the next section we
will talk about a Mask R-CNN approach that implements segmentation.
4.3. MASK R-CNN
Deep learning is a subfield of machine learning that is concerned with creating artificial neural networks that
can learn from data and make predictions or decisions without being explicitly programmed to do so. Deep
learning algorithms are based on the structure and function of the human brain, using layers of
interconnected nodes (neurons) to analyze and understand data. These networks are trained using large
amounts of data and powerful computing resources, allowing them to learn and improve over time. Deep
learning has been used in a wide range of applications, including image and speech recognition, natural
language processing, and computer vision. The Mask R-CNN (Regional Convolutional Neural Network)
algorithm is a state-of-the-art deep learning model for object detection and instance segmentation. It is built
on top of the Faster R-CNN architecture and extends it by adding a parallel branch for predicting an object
mask in addition to the object bounding box.
The Mask R-CNN algorithm has two main stages: the first stage is a Region Proposal Network (RPN) which
proposes regions of interest (RoIs) that might contain objects. The second stage is the detection and instance
segmentation stage, where the RoIs are fed into a fully convolutional network that generates class-specific
object bounding boxes and object masks. It can be trained on different architectures, including ResNet, FPN,
and others. The model we used is maskrcnn_resnet50_fpn, a Mask R-CNN built on the FPN and ResNet-50 architecture. This model is pre-trained on the COCO dataset, a large-scale object detection, segmentation, and captioning dataset containing 80 object classes and more than 330K images, which makes it a very powerful model for object detection tasks.
4.3.1. Types of Mask R-CNN:
There are several types of Mask R-CNN models, including:
● Mask R-CNN (original)
● FPN-Mask R-CNN
● RetinaNet-Mask R-CNN
● Cascade R-CNN
● Hybrid Task Cascade (HTC)
● Grid R-CNN
● Libra R-CNN
● PANet
● TensorMask
● YOLACT.
Each of these models has a different architecture and is suited for different types of tasks and environments. Some are faster but less accurate, while others are more accurate but slower. The original Mask R-CNN model is a two-stage architecture, while FPN-Mask R-CNN and RetinaNet-Mask R-CNN are one-stage architectures. Cascade R-CNN and HTC are designed to improve the accuracy of the model by adding more stages to the network.
4.3.2. Which one we chose
The Mask R-CNN model that we used in our work is a two-stage object detection and instance segmentation model: a pre-trained Mask R-CNN model with a ResNet-50-FPN backbone. The two stages are:
Region Proposal Network (RPN), which generates object proposals, or regions of interest (RoIs) that may contain objects.
The second stage is the detection and instance segmentation network, which classifies the RoIs and generates masks for each object.
It is based on the ResNet-50 FPN (Feature Pyramid Network) architecture, which is a backbone architecture pre-trained on the COCO (Common Objects in Context) dataset.
4.3.3. Libraries used
We also used several libraries in our work. These include:
● Torchvision: this library is part of PyTorch and contains popular pre-trained models and datasets for computer vision tasks such as object detection and image classification. We used this library to load a pre-trained instance segmentation model (Mask R-CNN with a ResNet-50-FPN backbone) for detecting the marker in the video.
● Numpy: this library is a fundamental package for scientific computing with Python. We used it to perform array operations, such as resizing the frames from the video, and mathematical operations such as calculating the center of the bounding box and the area of the marker.
● Matplotlib: this library is a plotting library for the Python programming language. We used it to create various plots and visualizations of the results obtained from our algorithm, such as the object's position over time, the distance using length, width, and area over time, and the x and y position over time.
● OpenCV: this is a library of programming functions mainly aimed at real-time computer vision. We used it to read and write videos, to resize the frames from the video, and to display the frames with the predicted masks overlaid on them.
● Pickle: this library is used to store and retrieve data in a binary format. We used it to save the results obtained from our algorithm in a pickle file that can be loaded later and used to plot the results.
All these libraries are widely used in computer vision and machine learning tasks, and they provide a lot of
functionality and ease of use, which makes them ideal for our work.
4.3.4. General Explanation:
In this part, we will be detailing the steps taken in our work to detect and track a marker in a video using
instance segmentation.
In the first step, we trained an instance segmentation model using the Mask R-CNN architecture and the
ResNet-50-FPN backbone on a custom dataset of marker images. The goal was to train the model to
accurately detect and segment the marker in new images.
To achieve this, we used the pre-trained Mask R-CNN model from the torchvision library, and replaced the
pre-trained head with a new one that is suitable for our custom dataset. We then loaded our custom dataset
and used it to train the model for a number of iterations, adjusting the hyperparameters as needed to
improve its performance.
The choice to use the Mask R-CNN architecture and the ResNet-50-FPN backbone was based on their proven success in object detection and instance segmentation tasks, and on the availability of pre-trained models that can be fine-tuned on a custom dataset. This allowed us to save time and resources compared to training a model from scratch.
In the second step, we used the pre-trained Mask R-CNN model with a ResNet-50-FPN backbone and fine-tuned the model on a dataset of images of the marker. The goal of this step was to train a model that could accurately detect the marker in the video and provide the bounding box coordinates of the marker.
To fine-tune the model, we had to create a dataset of images of the marker. This dataset was created by manually annotating the images with bounding box coordinates using a tool such as CVAT. We also had to define the number of classes (in this case, 2: marker and background) and the number of training and validation images. After training the model, we saved it as a .torch file and loaded it into the script that would be used to detect the marker in the video. This allowed us to use the trained model to make predictions on new images and get the bounding box coordinates of the marker.
This step was crucial because it allowed us to detect the marker in the video, which is a necessary step to
track the marker's position over time. Without a trained model, it would not be possible to detect the marker
in the video and therefore track its position.
In the third step of our work, we chose to use the area of the bounding box in pixels as a metric for
determining the distance of the marker from the camera. The reason for this choice is that the width and
length of the bounding box can be affected by the tilt of the marker, resulting in inaccurate distance
measurements. However, the area of the bounding box is not affected by the tilt of the marker and provides
a more reliable measurement of the distance. To calculate the distance using the area of the bounding box,
we first obtained the area of the bounding box by multiplying the width and length. We then compared this
value to the known area of the marker at different distances from the camera. Using this information, we
were able to create a relationship between the area of the bounding box and the distance of the marker from
the camera.
Finally, we used this relationship to calculate the distance of the marker in real-time as the video was being
captured. We added this distance information to a list and plotted the results over time to visualise the
motion of the marker.
In the fourth point, we used the results obtained from the previous steps to plot the data and compare it
with the real data obtained from the Optitrack system. This step helped us to visualise the performance of
our algorithm and compare it with the real data, which helped us to evaluate the accuracy of our algorithm
and identify any potential errors or areas of improvement.
4.3.5. Code explanation in details
4.3.5.1. Mask R-CNN Training
● Preparation of the data set
Before training, one crucial operation is the creation of the dataset used to train the model. This is very important for the next part: since the COCO dataset is not enough, by retraining the model we are modifying the outer layers of our model so that it can distinguish the marker from the background. In this part we made 6 attempts to reach a good training; in the last attempt we used 205 pictures, each picture combined with its annotation. We therefore took 205 images of the marker from different angles and distances. These images were then annotated to create the masks. This was done using a tool called CVAT, which allows you to draw bounding boxes around the marker and create a binary mask for each image.
● Creating the masks and reorganising the files
In the first part of the work, a custom dataset was created to train a Mask R-CNN model. This was done by
creating a set of images of a marker and their corresponding masks, which are binary images that indicate
the pixels that belong to the marker.
Once the images and masks were ready, they were split into a training set and a validation set. This is
important to ensure that the model is not overfitting and is able to generalise well to new images.
After splitting the data, the images and masks were transformed to a format that can be used by the model.
This involved converting the images to a format that can be used by PyTorch, and normalising the pixel values.
The next step was to create a custom dataset class that inherits from the PyTorch's Dataset class. This class
was used to load the images and masks, and apply the necessary transformations. The custom dataset class
also had to implement the __getitem__ and __len__ methods, which are used to retrieve a sample from the
dataset and the length of the dataset, respectively.
Finally, the custom dataset was used to train a Mask R-CNN model using transfer learning. This was done by
loading a pre-trained model and fine-tuning it on the custom dataset. The model was trained for a number
of epochs and its performance was monitored using the validation set.
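A minimal sketch of such a custom dataset class, following the standard torchvision fine-tuning pattern; the folder layout (images/ and masks/ sub-folders) and the single "marker" class are assumptions:

    import os
    import numpy as np
    import torch
    from PIL import Image
    from torch.utils.data import Dataset

    class MarkerDataset(Dataset):
        """Loads marker images and the corresponding binary masks produced with CVAT."""

        def __init__(self, root, transforms=None):
            self.root = root
            self.transforms = transforms
            self.imgs = sorted(os.listdir(os.path.join(root, "images")))
            self.masks = sorted(os.listdir(os.path.join(root, "masks")))

        def __getitem__(self, idx):
            img = Image.open(os.path.join(self.root, "images", self.imgs[idx])).convert("RGB")
            mask = np.array(Image.open(os.path.join(self.root, "masks", self.masks[idx])))

            # Every non-zero value in the mask is one object instance (here: the marker)
            obj_ids = np.unique(mask)[1:]
            masks = mask == obj_ids[:, None, None]

            # Bounding box of each instance, derived from its mask
            boxes = []
            for m in masks:
                ys, xs = np.where(m)
                boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])

            target = {
                "boxes": torch.as_tensor(boxes, dtype=torch.float32),
                "labels": torch.ones((len(obj_ids),), dtype=torch.int64),  # class 1 = marker
                "masks": torch.as_tensor(masks, dtype=torch.uint8),
                "image_id": torch.tensor([idx]),
            }
            if self.transforms is not None:
                img = self.transforms(img)
            return img, target

        def __len__(self):
            return len(self.imgs)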
● Training the model
This part of the code is focused on training a Mask R-CNN model using a pre-trained model from torchvision
called maskrcnn_resnet50_fpn. This model is pre-trained on the COCO dataset.
The first step is to load the pre-trained model and set it to run on the device that is available, either the GPU
or the CPU. Then, the model's head is replaced with a new one using the FastRCNNPredictor, which takes in
the number of input features for the classifier and the number of classes in the dataset. This is done because
the pre-trained model is trained on the COCO dataset which has 80 classes, but in this case we only need 2
classes.
After that, the model is loaded with the state dictionary from the previously trained model and moved to the
device. Finally, the model is set to evaluation mode, so that it's in inference mode.
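A minimal sketch of this setup based on the description above; the weight file path is the one mentioned later in the report, and note that for a full Mask R-CNN fine-tune the mask head would usually be replaced as well:

    import torch
    import torchvision
    from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Pre-trained Mask R-CNN with a ResNet-50-FPN backbone (COCO weights)
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

    # Replace the box-prediction head so it outputs our 2 classes (marker + background)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

    # Load the fine-tuned weights and switch to inference mode
    model.load_state_dict(torch.load("correctmodels/10000.torch", map_location=device))
    model.to(device)
    model.eval()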
● Testing the model in real time using the Computer's Webcam
In this script we want to test the model in real time. To do that we use the webcam, and the goal is to extract both the pose (x and y) of the marker and its distance from the camera in real time.
The trained model is loaded with pre-trained weights and its final layers are replaced with a new head called
"FastRCNNPredictor" that can predict two classes. Then a state dict is loaded from a file
"correctmodels/10000.torch" which is used to fine-tune the model.
The script captures frames from the camera, resizes them to 600x600 pixels and converts them to tensors
that can be processed by the model. The frames are passed through the model in inference mode, and the
model's output, which includes object masks, bounding boxes and scores, are used to draw the segmented
objects on the video frames.
A few variables are defined at the beginning of the script, including lists that are used to store the distance
using length, width, area and the x, y coordinates of the bounding boxes over time.
First, it defines the length and width of the marker in the real world in centimeters, as well as the focal length
of the camera in pixels. Then it uses the length and width of the bounding box returned by the Mask R-CNN
model in pixels to calculate the distance from the camera using the marker's length. It uses the formula
(marker_length_in_centimeters * focal_length) / marker_length_in_pixels to calculate the distance and
appends the result to the distance_values_length list. It then displays the calculated distance on the video
frame using the cv2.putText() function.
Next, it calculates the distance from the camera using the marker's width, using the formula
(marker_width_in_centimeters * focal_length) / marker_width_in_pixels. The result is added to the
distance_values_width list and displayed on the video frame using the cv2.putText() function.
Lastly, it calculates the distance from the camera using the area of the marker in pixels. It uses the formula
(400 * focal_length_distance) / (area_px) only if the area of the marker is greater than 1000 pixels. The result
is added to the distance_values_area list and displayed on the video frame using the cv2.putText() function.
Then it creates a range of time_width with the length of the distance_values_width list. If the area of the marker is less than 1000 pixels, it appends the last distance_using_width, distance_using_length and distance_using_area values to the distance_values_width, distance_values_length and distance_values_area lists respectively.
The script uses the OpenCV library to capture frames from the camera and display the output video. The
segmented objects are displayed on the video frames using random colors. The center x, y coordinates of the
bounding boxes are also calculated and stored in the x_list and y_list. The script continues to capture and
process frames until the camera is closed.
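The distance computations described above can be summarised in a small helper, called once per detected bounding box inside the frame loop; the marker dimensions and focal length below are placeholder values, and the constant 400 is the empirically tuned factor mentioned in the text:

    # Assumed real-world marker dimensions (cm) and camera focal length (pixels)
    MARKER_LENGTH_CM = 20.0
    MARKER_WIDTH_CM = 20.0
    FOCAL_LENGTH_PX = 800.0

    def distances_from_box(x1, y1, x2, y2):
        """Return the three distance estimates (length-, width- and area-based) for one box."""
        length_px = y2 - y1
        width_px = x2 - x1
        area_px = length_px * width_px

        d_length = (MARKER_LENGTH_CM * FOCAL_LENGTH_PX) / length_px
        d_width = (MARKER_WIDTH_CM * FOCAL_LENGTH_PX) / width_px
        # Area-based relation, only trusted when the marker covers more than 1000 pixels
        d_area = (400 * FOCAL_LENGTH_PX) / area_px if area_px > 1000 else None
        return d_length, d_width, d_area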
Here are some examples of what we used it for:
● Testing the model on a video representing the real data
The code uses OpenCV's VideoCapture class to read a video file, and in each frame of the video, the frame is
resized to a specific size, converted to a tensor and passed through the model to get the predictions. The
predictions contain the bounding boxes, masks, and scores for each instance segmented in the image.
This code calculates the distance between the camera and the object detected in the video frames using three different methods: the length of the object, the width of the object, and the area of the object.
The same computation as before is then used in order to compute the pose and distance of the marker in
the video.
For each instance, the code extracts the mask, bounding box, and score, and if the score is above a certain
threshold, it is considered a valid instance. Then, the code calculates the center of the bounding box, and
appends the x and y values of the center to the x_list and y_list respectively.
Finally, the code overlays the mask on the original image, and the resulting image is displayed on the screen.
● Plotting the results
Here we are focusing on plotting and analysing the results obtained from the previous steps. The code begins
by importing the necessary libraries such as matplotlib and pickle. Then, it loads the data from the pickle files
that were created in the previous steps. These pickle files include the time, distance values calculated using
length, width, and area, as well as the x and y positions of the marker.
Some of the results of the plotting script
The code then creates a figure with 2 rows and 2 columns of subplots. The first subplot is a scatter plot that
shows the x and y positions of the marker over time. The x-axis represents the x-position and the y-axis
represents the y-position. The colour of the points on the scatter plot represents the time at which the
position of the marker was recorded.
The second subplot is a line plot that shows the distance values calculated using the length, width, and area
of the bounding box of the marker over time. The x-axis represents the time and the y-axis represents the
distance values. The plot has three lines, one for each method of calculating distance.
The third subplot is a line plot that shows the x-position of the marker over time. The x-axis represents the
time and the y-axis represents the x-position.
The fourth subplot is a line plot that shows the y-position of the marker over time. The x-axis represents the
time and the y-axis represents the y-position.
Finally, the code shows the plots and saves them to a file.
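A condensed sketch of this plotting step; the pickle file name and the dictionary layout are assumptions (the actual script stores several separate pickle files):

    import pickle
    import matplotlib.pyplot as plt

    with open("results.pkl", "rb") as f:          # assumed single results file
        data = pickle.load(f)
    t, x, y = data["time"], data["x"], data["y"]
    d_len, d_wid, d_area = data["d_length"], data["d_width"], data["d_area"]

    fig, ax = plt.subplots(2, 2, figsize=(12, 8))

    # (1) Marker position, coloured by the time at which each sample was recorded
    sc = ax[0, 0].scatter(x, y, c=t)
    ax[0, 0].set_xlabel("x position"); ax[0, 0].set_ylabel("y position")
    fig.colorbar(sc, ax=ax[0, 0], label="time (s)")

    # (2) The three distance estimates over time
    ax[0, 1].plot(t, d_len, label="length"); ax[0, 1].plot(t, d_wid, label="width")
    ax[0, 1].plot(t, d_area, label="area")
    ax[0, 1].set_xlabel("time (s)"); ax[0, 1].set_ylabel("distance (cm)"); ax[0, 1].legend()

    # (3) and (4) x and y positions over time
    ax[1, 0].plot(t, x); ax[1, 0].set_xlabel("time (s)"); ax[1, 0].set_ylabel("x position")
    ax[1, 1].plot(t, y); ax[1, 1].set_xlabel("time (s)"); ax[1, 1].set_ylabel("y position")

    plt.tight_layout()
    plt.savefig("plots.png")
    plt.show()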
4.3.6. Results and discussions
4.3.6.1 Real time implementation results
Object’s position over time
Translation of the marker from down-right position to up-left position
Translation of the marker from down-left position to up-right position
Translation of the marker from up position to down position
Translation of the marker from right position to left position
Marker to camera distance over time:
Marker-to-camera distance using the length of the bounding box in computation
Marker-to-camera distance using the width of the bounding box in computation
Marker-to-camera distance using the area of the bounding box in computation
Marker-to-camera distance using the three methods
4.3.6.2 Real time implementation discussion
The experiment aimed to track the position and distance of a marker from the webcam over a period of time.
The first set of figures shows the position estimation of the marker's translation from position A to position B, and the results were quite satisfying. The noise visible in these figures was mainly a result of human error. The goal of these plots was to verify that the position was being accurately captured by the camera, and from them it can be inferred that the code is ready for pose estimation on the drone-captured video.
In the next set of figures, we present the distance estimation from the marker to the webcam over time. The results differ between the distances computed using the length and the width, which are practically identical, and the one computed using the area of the bounding box. The area-based method was found to be much more accurate at far distances and when the marker was rotated or even slightly tilted. This highlights the benefit of using the mask area directly, since it is not affected by the orientation of the marker and gives more accurate results.
4.3.6.3 Drone captured video implementation results
Tracking of the marker’s X position over time
Tracking of the marker’s Y position over time
Tracking of the marker's X and Y positions over time
Marker-to-drone distance using the three methods
4.3.6.4 Drone captured video implementation discussion
The experiment aimed to track the position and distance of a marker from a drone's camera over a period of
time.
The first two figures show the position estimation of the marker's translation in the captured video: the first one tracks the X coordinate over time and the second one the Y coordinate. The results are acceptable but still quite noisy. This noise could be reduced with a filter; however, when we tried one, it slowed down the real-time implementation considerably, so we preferred to prioritise speed over noise filtering.
The third figure shows both X and Y over time; here the noise is more clearly visible as the points at the extremities of the figure.
In the last figure, we present the distance estimation from the marker to the drone over time. The results differ between the distances computed using the length and the width, which are similar, and the one computed using the area of the bounding box; as expected, the area-based method is much more accurate.
4.4 Position Comparisons
In order to understand how well the algorithms perform, we exported the results of the different algorithms to separate Excel files and then imported all the data into MATLAB (a minimal sketch of this export step is shown below). We then compared OpenCV and Mask R-CNN for the X and Y positions, and finally compared all three approaches.
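As an illustration, the export step can be done with pandas as sketched here; the library choice, the example data, the column names, and the output file name are assumptions, not necessarily the ones used in the project.

import pandas as pd

# assumed example data: timestamps and the centre coordinates collected earlier
t = [0.0, 0.1, 0.2]
x_list = [320.5, 322.1, 325.0]
y_list = [240.2, 241.0, 243.7]

pd.DataFrame({"time": t, "x": x_list, "y": y_list}).to_excel(
    "maskrcnn_positions.xlsx", index=False)   # hypothetical output file, read later in MATLAB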
OpenCV vs. Mask R-CNN
OpenCV vs. Mask R-CNN vs Optitrack
As can be seen in the image above, the approach that gives the most accurate results in this case is OpenCV. Mask R-CNN is also a valid approach to take into consideration, since it recognises the marker most of the time, but at this stage it does not provide reliable accuracy for the estimated position. The advantage of the Mask R-CNN approach is that it is able to recognise the marker even when it is partially obstructed: between seconds 11 and 13 the OpenCV algorithm does not detect the marker, while Mask R-CNN still returns some data despite the partial visibility.
5. Speed Target Estimation
Given the promising results obtained for the estimation of the distance between the marker and the camera, we now want to work on the observer algorithm, in particular to estimate the x and y speed of the marker with respect to the camera. As mentioned before, this is needed because, ideally, before starting to land we would like to align the drone (camera) with the platform (marker); once they are aligned (the speed of the platform with respect to the drone is ~0), we start reducing the distance and make the drone land onto the platform. A sketch of this logic is given below.
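As a rough illustration of this alignment-then-descend logic (not the project's tracking controller), assuming hypothetical thresholds and a relative state already expressed in centimetres and cm/s:

SPEED_EPS = 2.0      # cm/s: relative speed below which we consider the drone aligned (assumed value)
POS_EPS = 5.0        # cm: horizontal offset below which we start descending (assumed value)
DESCENT_STEP = 10.0  # cm lowered per control step once aligned (assumed value)

def next_altitude_setpoint(rel_x, rel_y, rel_vx, rel_vy, altitude):
    """Descend only once the drone hovers above the platform with ~zero relative speed."""
    aligned = (abs(rel_vx) < SPEED_EPS and abs(rel_vy) < SPEED_EPS
               and abs(rel_x) < POS_EPS and abs(rel_y) < POS_EPS)
    if aligned:
        return max(altitude - DESCENT_STEP, 0.0)
    return altitude  # hold altitude and let the tracking controller cancel the offset first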
5.1 OpenCV
With the OpenCV approach, as a first step we decided to estimate the speed by computing the mean velocity over the last N position samples. In particular:
.%" =
$& − $&'(
0& − 0&'(
where N is usually between 5 and 10, so that the velocity is not biased by samples that are too old. Below we present some tests. An important thing to note is that the speed is always expressed in cm/s.
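A minimal sketch of this estimator, keeping the last N samples in a deque, could look as follows; it assumes the positions are already expressed in centimetres and the timestamps in seconds.

from collections import deque

class MeanSpeedEstimator:
    def __init__(self, n=10):
        self.samples = deque(maxlen=n + 1)   # keep the last N+1 (x, y, t) samples

    def update(self, x_cm, y_cm, t_s):
        """Add a sample and return (vx, vy) in cm/s; (0, 0) until enough samples arrive."""
        self.samples.append((x_cm, y_cm, t_s))
        if len(self.samples) < 2:
            return 0.0, 0.0
        x0, y0, t0 = self.samples[0]    # oldest sample in the window
        x1, y1, t1 = self.samples[-1]   # newest sample
        dt = t1 - t0
        if dt <= 0:
            return 0.0, 0.0
        return (x1 - x0) / dt, (y1 - y0) / dt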
Still target with N = 10
Moving Target from left to right and vice versa with N = 10
The plot above shows the variation of the position along the x-axis (blue) and the y-axis (orange). The plots below show the variation of the speed along the x-axis (blue) and the y-axis (orange). Analysing the plots on the left, we can see that the marker is still, and in fact its speed oscillates around 0.
The plots on the right are more interesting, as they show an example of the target moving along the x-axis. As can be seen, when the target moves from left to right with respect to the camera, its speed is positive. Also, between seconds 4 and 5 the plot above shows that the marker increases its x-position more slowly than before, and in fact the plot below has a negative peak in this time range, which means the velocity along that axis decreases. Finally, since the marker is essentially still along the y-axis (orange line), its velocity along the y-axis remains constant and close to 0.
We provide below other test examples, in particular when the target moves from bottom to top and vice versa, and when the target moves from the bottom-left corner to the top-right corner and vice versa. In these cases, considerations analogous to the previous example apply.
Moving Target from bottom to top and vice versa with N = 10
Moving Target from bottom-left to top-right and vice versa with N = 10
The results shown so far were obtained using only the laptop's integrated webcam; nevertheless, they give an idea of how the system works and show that the results we obtained are at least theoretically correct.
To really verify these results, we would need to run this algorithm on a camera that records the moving platform, with the coordinates and the speed of the platform along the x-axis and y-axis known from a sensor other than the camera.
6. CONCLUSIONS
In conclusion, the goal of this work was to track the position and distance of a marker from a webcam or
drone camera over a period of time. We implemented a real-time object detection and instance
segmentation model using Mask R-CNN, based on the ResNet50-FPN architecture pre-trained on the COCO dataset. The results of the real-time implementation were quite satisfying, with the position of the marker being accurately captured by the camera. We also found that using the area of the bounding box in the computation of the marker-to-camera distance was much more accurate at far distances and when the orientation of the marker was off or tilted. In the drone-captured video implementation, we faced some noise
issues but managed to track the marker's position and distance over time. Overall, the results of this work
demonstrate the potential of using deep learning models for real-time object detection and instance
segmentation in various applications.
Finally, our experimentation with using an ArUco marker and OpenCV to land a drone has been a valuable learning experience. The results we obtained for the estimation of the x and y position were quite promising and comparable to the OptiTrack system. However, for the velocity estimation we only obtained theoretically good results: during the testing phase we did not achieve the desired outcome. The reason for this could be the limited time we had for the experimentation and the complexity of the task itself. In our opinion, even with more time we would have needed further testing to achieve good results, and we would probably have to apply some filtering (such as a Kalman filter) to improve them.
The results of this project highlight the potential of using ArUco markers and OpenCV for drone navigation. While we did not achieve the desired results for the velocity estimation, the results for the x and y position estimation were comparable to the OptiTrack system, showing that ArUco markers and OpenCV can be effective tools for this task.
ORIGINAL GANTT DIAGRAM
UPDATED GANTT DIAGRAM