Drone-Assisted IoT Data Collection: A Q-Learning Approach

The People's Democratic Republic of Algeria
Ministry of Higher Education and Scientific Research
University of Djelfa
Faculty of Exact Sciences and Computer Science
Computer Science Department
Option : Networks and Distributed Systems
June 2023
Abstract:
This work builds on Q-learning, one of the most common Reinforcement Learning (RL)
methods, to allow Unmanned Aerial Vehicles (UAVs), commonly called drones, to
efficiently collect data from, and fairly transfer energy to, a set of IoT devices with
limited energy capacity. These IoT devices are randomly distributed in a given area.
For this purpose, we simulated the deployment of a single UAV above a set of IoT devices
to wirelessly collect information and provide them with energy simultaneously, based on
the promising technology called Simultaneous Wireless Information and Power Transfer
(SWIPT).
The performance of our protocol is demonstrated through simulation using the Python
language with the AirSim simulator, a recent and widely used tool for UAV simulations
based on Artificial Intelligence (AI) techniques.
Keywords: Data Collection, UAV, SWIPT, AirSim, AI, IoT.
Table of Contents
List of Figures
List of Tables
1 General Introduction
2 First Chapter
2.1 Introduction
2.2 IoT Architecture
2.3 IoT layers
2.3.1 Perception Layer
2.3.2 Network Layer
2.3.3 Middleware Layer
2.3.4 Application Layer
2.3.5 Business Layer
2.4 IoT Applications
2.4.1 Medical and healthcare industry
2.4.2 Precision agriculture and breeding
2.4.3 Industrial Automation
2.4.4 Smart Cities
2.4.5 Energy Management
2.4.6 Retail
2.5 IoT assisted by UAVs
2.5.1 UAV definition
2.5.2 Types of UAVs
2.5.3 UAV applications
2.6 Conclusion
3 Second Chapter
3.1 Introduction
3.2 History of Reinforcement learning
3.3 Reinforcement learning
3.3.1 Elements of reinforcement learning
3.3.2 Q-learning
3.4 Background: Q-learning Based Protocols
3.4.1 Protocol 1: Adaptive UAV-Assisted Geographic Routing with Q-Learning in VANET
3.4.2 Protocol 2: Learning to Rest: A Q-Learning Approach to Flying Base Station Trajectory Design with Landing Spots
3.4.3 Protocol 3: Reinforcement Learning for Decentralized Trajectory Design in Cellular UAV Networks With Sense-and-Send Protocol
3.4.4 Protocol 4: Visual Exploration and Energy-aware Path Planning via Reinforcement Learning
3.4.5 Protocol 5: Minimizing Packet Expiration Loss With Path Planning in UAV-Assisted Data Sensing
3.5 Conclusion
4 Third Chapter
4.1 Introduction
4.2 Protocol Description
4.3 UAV Recharging Architecture
4.4 Simulation Setup
4.4.1 AirSim
4.4.2 Unreal Engine
4.5 Used Algorithm
4.6 Results
4.6.1 Reward discussion
4.6.2 Energy discussion
4.6.3 Transferred energy discussion
4.7 Conclusion
5 General Conclusion
References
List of Figures
Five-Layered Architecture of IoT
Drone components
According to the Number of their Propellers
Multi-rotors UAVs
Q-learning
Used algorithm
Motivating scenario
Motivating scenario
Motivating scenario
Motivating scenario
AirSim [25]
Unreal Engine 4 [26]
The reward result for scenario one
The reward result for scenario two
The consumed energy for each episode in scenario one
The consumed energy for each episode in scenario two
The transferred energy result for scenario one
The transferred energy result for scenario two
List of Tables
Table of comparison
List of simulation setup
Simulation parameters
General Introduction
Forty-three billion is the expected number of connected IoT devices in 2023, according to
[1], which is a vast number that keeps increasing daily. Those IoT devices include simple
and complex objects, such as sensors, smartphones, smartwatches, industrial equipment,
and connected vehicles. Typically, IoT devices exchange information with the cloud and
also between them, depending on the service they are executing.
Like any other paradigm, the IoT faces several challenges that must be addressed for
better performance. One significant challenge that IoT applications still suffer from is
constrained energy capacity. Indeed, IoT devices have limited batteries that will not last
long, especially with the new generations of mobile networks, which require more energy
for devices to complete their missions successfully over the long term.
To fill this gap, a promising technology that has proved itself in academia and industry,
called Simultaneous Wireless Information and Power Transfer (SWIPT) [2], can transfer
information and recharge IoT devices wirelessly at the same time, ensuring proper
functioning for an extended period. However, this is challenging because the positions of
IoT devices in the environment are unknown; therefore, we propose a mobile flying energy
source that acts as a data collector and an energy transmitter for the IoT devices [3].
When it comes to flying objects in technology, we think of Unmanned Aerial Vehicles
(UAVs), commonly called drones. In the last few years, they have been used in many
applications thanks to their popularity and low cost, including civilian services such as
delivery, object tracking, and the monitoring of natural phenomena. They have also been
used in telecommunications and at public events such as sports games and festivals. For
instance, UAVs are used as flying hotspots to improve internet speed, or as the primary
source of Internet access when an electrical failure affects the terrestrial internet
source, such as after an earthquake [4].
UAVs come in two types. The first is High Altitude Platforms (HAPs) [5], whose
altitude can exceed 10 km; one use of this type is to provide internet access over large
zones. The second is Low Altitude Platforms (LAPs) [6], also called drones. In
telecommunications, LAP UAVs can be deployed to maximize the energy transferred to IoT
devices. Unfortunately, UAVs also carry batteries with a limited lifetime, which raises
two main questions: how can UAVs fly over dynamic IoT devices and transfer energy to
them? Moreover, how can UAVs charge the maximum number of IoT devices?
In this work, with the help of the SWIPT technology mentioned above, UAVs act as
mobile flying energy sources that move autonomously to collect information from, and
transfer energy to, IoT devices. Despite this essential role, further problems remain,
which lead us to the following questions: (i) how can we make UAVs move autonomously
and act intelligently without any human intervention? (ii) how can UAVs transfer the
maximum energy to IoT devices? (iii) how can UAVs cover the maximum number of IoT
devices?
As for the contribution of our work, we subject a single UAV to a training and learning
process so that it can control its behavior and its interaction with the surrounding
environment, given some prior knowledge of the environment and a set of available
actions. In this process, the UAV (i) efficiently learns and improves its behavior by trial
and error, (ii) explores the environment in depth, collecting as much reward as
possible, and (iii) receives a punishment whenever it takes a wrong step. Based on this
method (i.e., the Q-learning algorithm [7]), the UAV can therefore develop and exploit a
learning policy. Our work is organized into three main chapters:
— In the first chapter, we present IoT networks and how IoT devices are involved
in our daily lives. We also discuss existing IoT applications and where they are
used. Then we present UAVs, their different types and applications, and how
UAVs can help shape the future of the IoT world.
— In the second chapter, we review existing methods for controlling drones with
Q-learning. To this end, we study five proposed protocols that use Q-learning
to control drones in different ways.
— In the last chapter, we discuss our proposed protocol and how it is used to control
the UAV, compare it to the methods discussed in the previous chapter, and explain
why we chose it.
— Finally, we conclude our work by discussing our results.
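The trial-and-error learning described in the contribution above (rewards for good steps, a punishment for wrong ones, and a learned policy that is then exploited) can be illustrated with a minimal tabular Q-learning sketch. Everything in it is an assumption made for illustration only: the toy environment is a hypothetical one-dimensional corridor of five cells with a single IoT device in the last cell, and the reward values and the hyperparameters `ALPHA`, `GAMMA`, and `EPSILON` are arbitrary. It is not the protocol evaluated in this work and does not use AirSim.

```python
import random

N_STATES = 5          # hypothetical corridor cells 0..4; the IoT device sits in cell 4
ACTIONS = (-1, +1)    # move one cell left / one cell right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # assumed learning rate, discount, exploration


def step(state, action):
    """Return (next_state, reward, done) for the toy environment."""
    nxt = state + action
    if nxt < 0 or nxt >= N_STATES:
        return state, -1.0, False      # punishment: tried to leave the area
    if nxt == N_STATES - 1:
        return nxt, 10.0, True         # reward: reached the IoT device
    return nxt, -0.1, False            # small cost per move (energy spent)


def train(episodes=500, seed=0):
    """Learn a Q-table by epsilon-greedy trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 100:
            # Explore with probability EPSILON, otherwise exploit the best known action.
            if rng.random() < EPSILON:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move Q(s, a) toward reward + discounted best next value.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
            steps += 1
    return q


q_table = train()
# Greedy policy extracted from the learned table (one action per non-terminal cell).
policy = [ACTIONS[max(range(2), key=lambda i: q_table[s][i])] for s in range(N_STATES - 1)]
```

In this toy setting the greedy policy ends up moving right in every cell, that is, straight toward the device. A real UAV scenario would need a much richer state (position, battery level, device energy) and a reward that reflects collected data and transferred energy.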
First Chapter: Internet of Things and UAVs
Introduction
The Internet of Things (IoT) network is defined as a collection of intelligent devices that
are wirelessly connected among themselves or connected to the cloud via the Internet (the
interconnected network). Historically, the concept of the IoT network is relatively
recent, emerging in the early 2000s. However, the roots of IoT can be traced back to the
development of various technologies and networks over the years. The first glimmers of
IoT can be seen in early computer networks and the evolution of the Internet itself.
In the 1960s and 1970s, researchers and organizations started connecting computers
together, creating the foundation for networked communication. This eventually led to
the creation of the Internet in the 1980s, which connected disparate networks and allowed
for global communication and data exchange. As the Internet matured, advancements in
communication protocols, wireless technologies, and miniaturization of computing devices
paved the way for the expansion of IoT networks. The concept of embedding sensors and
actuators into physical objects to collect and exchange data gained prominence.
The early 2000s saw the emergence of wireless sensor networks (WSNs), which were
the precursors to IoT networks. WSNs comprised small, low-power devices equipped
with sensors to monitor and collect data from the physical environment. These devices
communicated wirelessly using protocols such as Zigbee, Bluetooth, or Wi-Fi, forming
ad-hoc networks to relay data to a central hub or gateway. The term "Internet of Things"
itself was coined in 1999 by Kevin Ashton [8], a British technologist who envisioned a
network where physical objects could be connected and communicate with each other
through the Internet. Since then, IoT has evolved rapidly, driven by advancements in
wireless connectivity, cloud computing, big data analytics, and artificial intelligence. The
integration of IoT into various industries has transformed sectors such as manufacturing,
agriculture, transportation, healthcare, and smart homes.
IoT networks are composed of a vast array of interconnected devices, including sensors,
actuators, wearables, industrial equipment, vehicles, and consumer appliances. These
devices collect data, communicate with each other or with central servers, and enable
real-time monitoring, control, and automation of processes. IoT networks rely on a range
of communication technologies, including cellular networks (2G, 3G, 4G, and now 5G),
Wi-Fi, Bluetooth, Zigbee, LoRaWAN, and others.
IoT Architecture
IoT architectures refer to the design and structure of the systems that enable IoT
deployments to function. These architectures provide a framework for connecting and
managing the various components of an IoT ecosystem, including devices, networks, data
processing, and applications. Five commonly used IoT architectures are:
— Centralized Architecture : In this architecture, all IoT devices send data to a central
server or cloud platform for processing and analysis. The server is responsible for
data storage, analysis, and application logic. It allows for centralized control and
management of the IoT network.
— Peer-to-Peer (P2P) Architecture : P2P architectures enable direct communication
and data exchange between IoT devices without relying on a central server. Each
device can act as both a client and a server, allowing for decentralized and distributed
communication. P2P architectures are often used in scenarios where low latency and
high scalability are critical.
— Three-Tier Architecture : This architecture consists of three layers : the device
layer, the gateway layer, and the cloud layer. The device layer comprises IoT sensors
and actuators, the gateway layer handles data preprocessing and local analytics, and
the cloud layer manages storage, advanced analytics, and application services. This
architecture enables a distributed and scalable approach to IoT deployments.
— Edge Computing Architecture : Edge computing architectures involve processing
data at the edge of the network, closer to the IoT devices themselves, rather than
relying on a central server or cloud. This approach reduces latency, optimizes bandwidth usage, and enables real-time analysis and decision-making. Edge computing
architectures are suitable for applications that require quick response times and
operate in resource-constrained environments.
— Hybrid Architecture : Hybrid architectures combine multiple architectural approaches to leverage the strengths of each. They may involve a combination of
centralized, distributed, and edge computing elements, depending on the specific requirements of the IoT application. Hybrid architectures provide flexibility, scalability,
and efficient resource utilization.
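To make the gateway and edge roles in the architectures above more concrete, the following minimal sketch shows a gateway that preprocesses raw sensor readings locally and forwards only a compact summary upstream, instead of sending every reading to a central server. The `Reading` type, the `summarize_at_edge` function, and the alert threshold are hypothetical names and values chosen for illustration; they do not come from any specific IoT platform.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Reading:
    """One raw sample from a device-layer sensor (hypothetical schema)."""
    device_id: str
    temperature_c: float


def summarize_at_edge(readings, alert_threshold_c=50.0):
    """Gateway-side preprocessing: reduce raw readings to one summary message."""
    temps = [r.temperature_c for r in readings]
    return {
        "count": len(temps),
        "mean_c": round(mean(temps), 2),
        "max_c": max(temps),
        # Only devices above the (assumed) threshold are reported individually.
        "alerts": [r.device_id for r in readings if r.temperature_c > alert_threshold_c],
    }


raw = [Reading("s1", 21.5), Reading("s2", 22.5), Reading("s3", 61.0)]
summary = summarize_at_edge(raw)   # this single dict is what would be sent to the cloud
```

Sending one summary instead of three raw readings is the bandwidth and latency saving that edge-oriented designs aim for. A centralized design would forward `raw` unchanged and compute the same summary on the server.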
Note that IoT architectures vary depending on the specific use cases, scalability
needs, security requirements, and network infrastructures. Designers and developers
create and customize architectures to meet the stringent requirements of IoT.
IoT layers
The IoT architecture typically consists of several layers (see Fig. 1) that work together
to enable the functioning of IoT systems. They can be described as follows:
Perception Layer
The perception layer, also known as the physical layer or sensing layer, is the bottommost
layer of the IoT architecture. It comprises various sensors, actuators, and physical devices
that interact with the physical world. These devices collect data from the environment,
such as temperature, humidity, light, motion, or location [9].
Network Layer
The network layer connects the devices in the IoT ecosystem and facilitates data transfer
between them. It includes wired and wireless communication technologies like Wi-Fi,
Bluetooth, Zigbee, cellular networks, or Ethernet. This layer ensures reliable and secure
connectivity for seamless communication among devices [10].
Middleware Layer
The middleware layer acts as an intermediary between the perception and application
layers. It manages the data flow, protocols, and communication between the devices
and the applications. Middleware provides functionalities such as data filtering, protocol
translation, device management, and security [11].
Application Layer
The application layer is the topmost layer of the IoT architecture and represents the
user-facing part. It consists of applications, services, and interfaces that enable users
or other systems to interact with the IoT ecosystem. This layer utilizes the data collected
from the perception layer to provide meaningful insights, analytics, control, and
automation [12].
Business Layer
Some IoT frameworks include a business layer, which focuses on the business logic
and processes related to IoT deployments. It involves aspects such as data monetization,
business models, service management, and integration with existing enterprise systems.
Figure 1. Five-Layered Architecture of IoT (Perception, Network, Middleware, Application, and Business layers)
IoT Applications
IoT networks refer to the network of interconnected physical devices, vehicles, appliances,
and other objects embedded with sensors, software, and network connectivity, enabling
them to collect and exchange data. IoT has numerous applications across various industries
and sectors. Some of these applications are described below.
Medical and healthcare industry
A number of healthcare applications can be enabled by IoT capabilities. Using cell
phones with RFID-sensor capabilities, medical parameters can be monitored and drug
administration can be tracked easily. The benefits of this feature include (i) convenient
illness monitoring, (ii) ad-hoc diagnosis, and (iii) immediate medical assistance in the
event of an accident. Implantable and wireless devices can be used to store and secure
health records. These health data can be used to save a patient's life and to provide
tailored treatment to people in emergency situations, particularly those with heart
disease, cancer, diabetes, stroke, cognitive impairments, and Alzheimer's disease. Guided
action on the body can be taken by introducing biodegradable chips into the human
body [13]. Paraplegics can receive muscular stimulation to help them regain their
mobility; a smart IoT-controlled electrical stimulation device can be implanted for this
purpose. Many more applications, such as (i) remote patient monitoring, (ii) telemedicine,
and (iii) medication management, can be readily realized with IoT:
— Remote patient monitoring : IoT devices, such as wearables and connected medical
devices, collect and transmit patient data to healthcare providers, enabling remote
monitoring of vital signs and conditions.
— Telemedicine : IoT enables virtual consultations and remote healthcare services,
allowing patients to access medical expertise and reducing the need for in-person
visits.
— Medication management : IoT devices can remind patients to take medication,
track adherence, and monitor medication supply, ensuring proper dosage and timely
Precision agriculture and breeding
IoT is revolutionizing the agricultural sector with applications like precision farming.
Farmers can use IoT sensors to monitor soil moisture, temperature, and nutrient levels,
enabling precise irrigation and fertilization, resulting in optimized crop yields. Also,
IoT may be used to control the traceability of animals used in agriculture. This can aid
in the real-time identification of animals, particularly during the outbreak of infectious
illness. Many countries provide subsidies to farmers and shepherds based on the number
of animals they have, such as cattle, goats, and sheep. Many more applications, such as
(i) Precision Farming, (ii) Livestock monitoring, and (iii) Automated irrigation, may be
readily accomplished with the help of IoT’s many characteristics.
— Precision Farming : IoT sensors collect data on soil moisture, temperature, and
nutrient levels, helping farmers optimize irrigation, fertilization, and crop growth.
— Livestock monitoring : IoT devices such as smart collars or ear tags collect data on
the health and behavior of livestock, allowing farmers to detect diseases, manage
feeding patterns, and optimize breeding programs.
— Automated irrigation : Connected irrigation systems can adjust water usage based
on real-time weather conditions and plant needs, conserving water and improving
crop yield.
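The automated irrigation idea above (adjusting water usage from sensed soil moisture and weather conditions) can be sketched as a simple rule. This is only an illustrative assumption: the function name `irrigation_minutes`, the target moisture level, the rain threshold, and the minutes-per-percent factor are all hypothetical and would need calibration for a real crop and irrigation system.

```python
def irrigation_minutes(soil_moisture_pct, rain_forecast_mm,
                       target_pct=35.0, minutes_per_pct=2.0, rain_skip_mm=5.0):
    """Return how long to run irrigation, given sensed moisture and forecast rain."""
    # Skip watering entirely when enough rain is expected (assumed threshold).
    if rain_forecast_mm >= rain_skip_mm:
        return 0.0
    # Water in proportion to how far the soil is below the target moisture level.
    deficit = max(0.0, target_pct - soil_moisture_pct)
    return deficit * minutes_per_pct


dry_day = irrigation_minutes(25.0, 0.0)     # soil is 10 points below target
rainy_day = irrigation_minutes(25.0, 10.0)  # rain expected, so no watering
wet_soil = irrigation_minutes(40.0, 0.0)    # already above target
```

A proportional rule like this is the simplest possible choice; real systems may also account for crop type, evaporation, and time of day.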
Industrial Automation
IoT is widely used in industries for process automation, monitoring, and optimization.
Connected sensors and devices help monitor and control manufacturing equipment, inventory,
and supply chain logistics, and support predictive maintenance, improving operational
efficiency. Many more applications, such as (i) process automation, (ii) asset tracking,
and (iii) predictive maintenance, can be readily realized with IoT:
— Process automation : IoT sensors and actuators are used to automate tasks in
manufacturing processes, reducing manual intervention and improving efficiency
and consistency.
— Asset tracking : IoT enables real-time tracking and monitoring of assets within
factories or warehouses, optimizing inventory management and minimizing loss
or theft. Asset tracking, sometimes called asset management, involves scanning
barcode labels or using GPS or RFID tags to locate physical assets. Asset monitoring
is as critical as inventory management, since an organization needs to know the
location, state, maintenance schedule, and other information about its physical
assets. It is also crucial to a company's bottom line and compliance, since lost or
expired physical assets must be found and replaced [14].
— Predictive maintenance : Connected sensors collect data on equipment performance,
allowing predictive maintenance schedules to prevent breakdowns and minimize
downtime.
— Supply chain optimization : IoT helps track and monitor goods throughout the supply chain, providing real-time visibility and enabling efficient inventory management
and logistics planning.
Smart Cities
IoT is used to create smart cities by integrating various systems, including
transportation, energy, waste management, and public safety. Examples of IoT applications
in urban environments include (i) smart traffic management, (ii) smart parking, and
(iii) intelligent lighting:
— Smart traffic management : IoT sensors and cameras monitor traffic flow and
congestion, optimizing signal timings and suggesting alternate routes to reduce
congestion and improve traffic efficiency.
— Smart parking : IoT-enabled sensors provide real-time information about available
parking spaces, reducing search time and congestion.
— Intelligent lighting : Connected streetlights can adjust the brightness based on ambient light levels and motion detection, reducing energy consumption and improving
public safety.
Energy Management
IoT helps optimize energy consumption and reduce costs in buildings and homes. Connected
smart meters, sensors, and appliances enable real-time energy monitoring, allowing
users to make informed decisions about energy usage and efficiency. Many more applications,
such as (i) smart meters, (ii) energy monitoring, and (iii) demand response, can be readily
realized with IoT:
— Smart meters : IoT-enabled smart meters provide real-time energy consumption
data to consumers and utility companies, enabling better management of energy
usage and billing accuracy.
— Energy monitoring : IoT sensors and devices track energy usage of appliances and
systems, allowing users to identify and reduce energy wastage.
— Demand response : IoT can facilitate load shedding or shifting during peak energy
demand periods, helping to balance the electrical grid and avoid blackouts.
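As a rough illustration of the load shedding mentioned under demand response, the sketch below keeps the most critical loads running within a capacity limit and sheds the rest. The `shed_loads` function, the priority scheme (a lower number means a more critical load), and the example appliances and power values are assumptions made for this example, not a description of any real utility protocol.

```python
def shed_loads(loads, capacity_kw):
    """Split loads into (kept, shed) so that kept loads fit within capacity_kw.

    loads: list of (name, power_kw, priority); lower priority number = more critical.
    """
    kept, shed = [], []
    total_kw = 0.0
    # Admit loads from most to least critical while capacity remains.
    for name, power_kw, _priority in sorted(loads, key=lambda load: load[2]):
        if total_kw + power_kw <= capacity_kw:
            kept.append(name)
            total_kw += power_kw
        else:
            shed.append(name)
    return kept, shed


household = [
    ("heater", 2.0, 3),
    ("fridge", 0.5, 1),
    ("ev_charger", 7.0, 4),
    ("lights", 0.3, 2),
]
kept, shed = shed_loads(household, capacity_kw=3.0)
```

With a 3 kW limit, the fridge, lights, and heater stay on while the electric vehicle charger is shed. Load shifting would instead reschedule the shed load to an off-peak period.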
Retail
IoT is used to enhance the customer experience in retail environments [15]. Smart shelves,
digital signage, and beacons can provide personalized offers, product information, and
indoor navigation, making shopping more convenient and engaging.
— Smart shelves : IoT-enabled shelves can track inventory levels in real-time, automatically triggering reordering or restocking processes.
— Digital signage : Connected displays can deliver personalized advertisements and
promotions based on customer preferences and demographics.
— Beacons : IoT beacons can transmit location-based offers and information to shoppers’ smartphones, enhancing their shopping experience and providing relevant
IoT assisted by UAVs
UAVs have significant telecommunications capabilities, such as the ability to act as
an aerial base station that improves wireless access coverage and replaces mobile
network antennas when they have been damaged. IoT devices also benefit from
capabilities offered by UAVs, such as wireless recharging and prompt data collection.
UAV definition
UAV stands for Unmanned Aerial Vehicle. It refers to an aircraft that operates without
a human pilot on board. UAVs, also known as drones, are typically remotely piloted or
can fly autonomously [16] using pre-programmed flight plans or onboard sensors and
navigation systems. They come in various sizes and configurations, ranging from small
handheld models to large, sophisticated aircraft used for military or commercial purposes.
UAVs are equipped with different types of sensors and payloads, such as cameras, infrared
sensors, LiDAR (Light Detection and Ranging), or, in military use, even weaponry.
These sensors and payloads allow UAVs to perform a wide range of tasks, including aerial
photography and videography, surveillance and reconnaissance, mapping and surveying,
delivery of goods, disaster response, scientific research, and more. UAVs have gained
significant popularity and applications in recent years due to advancements in technology,
including the miniaturization of components, improvements in battery life, and the development of robust control systems. They offer various advantages over manned aircraft,
such as cost-effectiveness, enhanced safety, accessibility to remote or hazardous areas, and
the ability to perform repetitive or dangerous tasks with precision. However, the increasing
use of UAVs also raises concerns about privacy, safety regulations, and potential misuse.
Therefore, governments and aviation authorities have implemented regulations to ensure
responsible and safe UAV operations in public airspace. The standard UAV components
include motors, propellers, speed controllers, batteries, sensors, an antenna, a receiver, a
camera, and an accelerometer to measure acceleration (see Fig. 2). These are the typical
components; a UAV may carry more or fewer depending on the service it provides. The
UAV can be controlled over radio waves through its antennas [17].
Figure 2. Drone components
Types of UAVs
Several types of UAVs are designed for specific purposes and vary in size, capabilities,
and configuration. We classify UAVs according to three main criteria:
1. According to the number of their propellers : There are three main types of UAVs
sorted according to the number of propellers :
— Single rotor UAV : Multirotor designs are the most common construction in UAV
use. A single-rotor model, by contrast, consists of a main rotor and a tail rotor
that helps stabilize the heading (see Fig. 3a). For hovering with heavy payloads,
or when faster flight with longer endurance is required, single-rotor helicopters
could be the best option [4].
— Fixed Wing UAVs : As the name indicates, this type of UAV has a fixed wing and
resembles a conventional airplane (see Fig. 3b); it cannot hover in place in the
air. It is mostly used for carrying packages.
Figure 3. According to the Number of their Propellers: (a) Single rotor UAV, (b) Fixed Wing UAVs
— Multi-rotors UAVs : Mostly used where stabilization and flexible maneuvering are
important, such as in filming and object tracking. They can be categorized
according to the number of rotors they have: (i) quadcopter UAVs with four
rotors (see Fig. 4(c)), (ii) hexacopter UAVs with six rotors (see Fig. 4(d)), and
(iii) octocopter UAVs with eight rotors (see Fig. 4(e)). The more rotors, the more
energy is consumed. Multi-rotor UAVs are driven by electric motors because of
their high precision [4].
2. According to their size : There are three main types of UAVs sorted according to
their size [16].
— Small UAV : It can be used as an effective tool for spying. Its size varies from
that of a large insect to about 50 cm.
Figure 4. Multi-rotors UAVs
— Medium UAVs : They can carry up to 200 kg and have a flight time of up to
15 minutes.
— Large UAVs : This type of UAV is mostly used by the military, and its size
is comparable to that of small airplanes.
3. According to their range : There are three main types of UAVs sorted according to
their range.
— Close range UAVs : Used in surveillance because they can fly up to 50 km on a
battery that can last up to 6 hours.
— Mid-range UAVs : This powerful type can cover up to 650 km and can be used
in surveillance.
— Endurance UAVs : This high-performance type can fly up to about 1 km above
sea level, and its battery can last up to 36 hours.
UAV applications
UAVs play an essential role in a growing number of IoT applications across various
domains, from military missions to civilian life:
— Shipping and delivery : UAVs can deliver products within a city more efficiently
than ground transportation. As flying machines, they can travel almost anywhere
and save a lot of time.
— Security : UAVs can be used in civilian security applications, such as tracking
a criminal or a suspect, or as a flying alarm system. A UAV can also serve as a
security guard: it carries many sensors and, combined with AI, can perform this
role well.
— Natural disasters and rescue : UAVs can deliver supplies and medical aid to
damaged areas and inspect damaged buildings. Where nuclear or dangerous chemical
hazards can cause serious damage, UAVs can be used as sensing or scouting platforms.
— Agriculture : UAVs can take over the role of satellites in monitoring planted areas,
analyzing soil and fields, watering specific areas, and fertilizing soil.
Technology can make human lives simpler and safer. One of these crucial technologies
is the IoT, which assists in many areas. It relies on timely data collection, which
helps IoT devices perform much better. Using UAVs, another powerful and recent
technology, we will design a novel method of collecting information from IoT devices
and transferring energy to them, based on the promising technology called SWIPT.
Because UAVs are used for many diverse purposes, we should also introduce a
self-controlling technique if we want these UAVs to supply energy to IoT devices :
we must allow the UAV to use AI technology to explore and learn about its
surrounding environment.
Second Chapter
UAVs can be controlled manually by humans using remote controllers. Another
method is to give the UAV the exact path to follow together with all the preliminary
information, including directions and distances. In this work, we avoid these
control methods, which require human intervention, and instead rely on self-learning
methods based on AI algorithms. We can subject one UAV to a training and
learning process so that it can control its behavior and its interaction with the
surrounding environment, using some previous knowledge of the environment and a set
of available actions, based on the most common reinforcement learning (RL) algorithm,
called Q-learning.
In this chapter, we provide an overview of the history of RL and its main elements. In
addition, we introduce the Q-learning algorithm and its essential features. Then, we study
some recent applications that show how Q-learning can be employed in the UAV field.
At the end of this chapter, we draw up a comparison table of these applications
containing each case study's tools, advantages, and drawbacks.
History of Reinforcement learning
Reinforcement learning (RL) is a subfield of machine learning that focuses on developing
algorithms and models capable of learning and making decisions through interaction with
an environment. The history of reinforcement learning dates back several decades and has
seen significant advancements and breakthroughs.
In their book Reinforcement Learning : An Introduction (second edition) [18], Richard S. Sutton
and Andrew G. Barto write : The history of reinforcement learning has
two main threads, both long and rich, that were pursued independently before intertwining
in modern reinforcement learning. One thread concerns learning by trial and error, which
started in the psychology of animal learning. This thread runs through some of the earliest
work in artificial intelligence and led to the revival of reinforcement learning in the early
1980s. The other thread concerns the problem of optimal control and its solution using
value functions and dynamic programming. For the most part, this thread did not involve
learning. Although the two threads have been largely independent, the exceptions revolve
Chapter 2
Reinforcement learning and study cases
around a third, less distinct thread concerning temporal-difference methods such as those
used in the tic-tac-toe example. All three threads came together in the late 1980s to produce
the modern field of reinforcement learning.
The concept of trial-and-error learning was a fundamental thread that led to the contemporary area of reinforcement learning. According to the American psychologist R.S. Woodworth,
the concept of trial-and-error learning dates back to Alexander Bain’s discussion of learning by "groping and experiment" in the 1850s and, more explicitly, to Conway Lloyd
Morgan, a British ethologist and psychologist who coined the term in 1894 to describe his
observations of animal behavior [19]. A brief overview of the history of RL follows :
The foundations of RL can be traced back to the 1950s and 1960s, when researchers
started exploring the concepts of dynamic programming and optimal control theory. In
the 1950s, Richard Bellman introduced the principle of optimality, which laid the groundwork
for the later development of RL algorithms, together with dynamic programming methods
such as value iteration, which solve Markov decision processes (MDPs) and find optimal
policies when a model of the environment is available. In the 1970s, work on RL
algorithms continued, although computational limitations at the time limited their
practical applications. In the 1980s, the concept of temporal difference (TD) learning
emerged as a breakthrough in RL.
Pioneered by Richard Sutton, TD learning updates an agent’s value function
based on the difference between successive predictions of future reward. Sutton’s work on
TD learning, particularly the TD(λ) algorithm, set the stage for future advancements in
RL. In 1989, Christopher Watkins introduced Q-learning, an off-policy RL algorithm
that uses a lookup table to estimate the values of state-action pairs. Q-learning was a major
development because it allowed agents to learn optimal policies without explicitly modeling
the environment.
In the late 1990s and early 2000s, researchers began exploring the use of function approximation techniques, such as neural networks, to handle high-dimensional state and
action spaces. In 2013, Deep Q-Networks (DQN), a milestone in RL, was introduced
by Volodymyr Mnih et al. DQN combined Q-learning with deep neural networks and
successfully played Atari 2600 games. This breakthrough demonstrated the potential of
deep RL and paved the way for further advancements in the field.
Recently, a growing focus has been on policy gradient methods and actor-critic architectures. Policy gradient methods directly optimize the policy parameters by estimating
the gradient of the expected return with respect to the policy parameters. Actor-critic
methods combine policy gradient techniques with value function estimation, where the
actor learns the policy, and the critic estimates the value function. These approaches have
shown great promise in a wide range of applications, including robotics, game-playing, and
autonomous systems. Also, RL has witnessed rapid progress and numerous breakthroughs
across various domains. Researchers have explored model-based RL, meta-learning, multiagent RL, and other advanced techniques to address the challenges of sample efficiency,
generalization, and exploration. RL has been applied to complex tasks such as autonomous
driving, healthcare optimization, robotics, and natural language processing.
Reinforcement learning
Reinforcement learning is the process of learning what to do in order to maximize a
numerical reward signal by mapping situations to actions. The learner is not told which
actions to take but instead must try them out to see which ones provide the most reward.
In the most interesting and challenging cases, actions can have an impact on not only the
immediate reward but also the next situation and, by extension, all subsequent rewards. The
two most important distinguishing features of reinforcement learning are trial-and-error
search and delayed reward.
We formalize reinforcement learning as the optimal control of incompletely known Markov
decision processes based on principles from dynamical systems theory. The core concept is
to capture the most significant components of the real problem that a learning agent faces
when interacting with its environment over time in order to reach a goal. A learning agent
must be able to sense the state of its surroundings and take actions that alter it to some
extent. The agent must also have a goal or goals relating to the environment. Markov decision processes are
designed to incorporate only these three characteristics – sensation, action, and goal – in
the most basic form feasible, without trivializing any of them. We consider any method that is well
suited to solving such problems to be a reinforcement learning method.
Reinforcement learning differs from supervised learning, which is the type of learning
studied in the majority of contemporary machine learning research. Learning from a training
set of labeled examples provided by a knowledgeable external supervisor is known
as supervised learning. Each example provides a description of a situation and a
specification – the label – of the correct action the system should take in that case, which
is often to identify a category to which the situation belongs. This type of learning aims
for the system to extrapolate or generalize its answers so that it can function appropriately
in scenarios that aren’t in the training set. Although this is an essential type of learning, it
is insufficient for learning via interaction. It’s difficult to find instances of desired behavior
that are both right and indicative of all the contexts in which the agent must act in interactive
problems. An agent must be able to learn from its own experience in unexplored regions,
where learning would be most beneficial.
Unsupervised learning, which is typically about finding structure hidden in collections of
unlabeled data, is not the same as reinforcement learning. The terms supervised learning
and unsupervised learning appear to categorize machine learning paradigms comprehensively, but they don’t. Although it’s tempting to think of reinforcement learning as a
form of unsupervised learning because it doesn’t rely on examples of correct behavior,
reinforcement learning focuses on maximizing a reward signal rather than searching for
hidden structure. Finding structure in an agent’s experience can be helpful in reinforcement
learning, but it doesn’t solve the problem of maximizing a reward signal on its own. Therefore, researchers consider reinforcement learning to be a third machine learning paradigm,
alongside supervised learning and unsupervised learning, and perhaps other paradigms.
Elements of reinforcement learning
A reinforcement learning system has five main components in addition to the agent and the
environment : a state, a policy, a reward signal, a value function, and exploration and
exploitation.
1. Agent : The agent is the learner or decision-making entity. It might be a software
program, a robot, or any system that can sense its surrounding environment
and take actions depending on its present condition and the input obtained. The agent
maximizes cumulative rewards by choosing the optimal actions depending on their
2. Environment : The environment represents the external system or problem space
with which the agent interacts, and it can be a physical world, a simulated environment, a game, or any other scenario where the agent operates. The surrounding
environment provides feedback to the agent through the obtained rewards or penalties
based on its actions.
3. State : A state is the current condition or representation of the environment ; it captures the relevant information about the agent’s situation at
a given time, which helps the agent make decisions. States can be discrete (e.g., game
board configurations) or continuous (e.g., sensor readings in robotics).
4. Policy : The policy is the agent’s strategy or behavior to select actions based on states
and guides the agent’s decision-making process. The policy can be deterministic
(selecting a single action) or stochastic (selecting actions probabilistically).
5. Reward : The reward is the feedback signal that indicates the desirability or quality
of the agent’s actions. Rewards can be positive or negative and serve as a measure of
immediate or long-term success. Rewards drive the agent’s learning process, as
the agent aims to maximize the cumulative reward over time.
6. Value function : The value function estimates the agent’s expected cumulative
reward from a given state or state-action pair. It guides the agent’s decision-making
by providing a measure of the desirability of different states or actions, helping the
agent prioritize actions that lead to higher rewards.
7. Exploration and Exploitation : The agent needs to strike a balance between
exploring the environment to discover new actions that might lead to higher rewards
and exploiting its current knowledge to take actions that have yielded high rewards
in the past.
Consequently, RL algorithms aim to find an optimal policy that maximizes the expected
cumulative reward over the long term. This is often achieved through the use of iterative
learning algorithms, such as Q-learning, policy gradients, or actor-critic methods. These
algorithms update the agent’s policy and value function based on observed rewards and
Q-learning [7] is a model-free reinforcement learning algorithm in which an agent transitions from one state to another by taking actions, which are chosen at random while it explores. A set of states S[t] and a set of
actions A[t] define the learning space. When the agent performs action A[t] and moves to another state
S[t + 1], a reward function calculates a numeric value for this state-action pair, and the result is
recorded in a Q-table, which is initialized with zero values. By repeatedly taking actions, the agent
eventually reaches a particular goal state.
FIGURE 5. Q-learning.
The Q-table values are updated at each step and, after many iterations, eventually converge.
The goal of Q-learning is to maximize the total reward collected from the
beginning up to the goal state by following the so-called optimal policy π. The optimal policy
π indicates which action is the best to take in each state, which results in a maximized
overall gain. Q-learning has been widely used in UAV-related research recently. It belongs
to the class of Temporal Difference (TD) learning methods and is widely used for solving
Markov decision processes (MDPs). The pseudo-code of the Q-learning algorithm and its
different steps is given in Algorithm 1 :
Algorithm 1 Q-learning
Initialize Q-table with arbitrary initial values or zeros
Repeat
    Choose action A[t] based on the current state S[t]
    (using an exploration-exploitation strategy)
    Take action A[t], observe next state S[t + 1] and immediate reward R[t]
    Update Q-value :
    Q(S[t], A[t]) ← Q(S[t], A[t]) + α · (R[t] + γ · max_a Q(S[t + 1], a) − Q(S[t], A[t]))
Until convergence criterion is met
Extract policy : Select action A[t]∗ with the highest Q-value for each state S[t]
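The update step of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the code used in this work : the function name q_update, the toy table of 3 states and 2 actions, and the numeric values are assumptions chosen only to make the arithmetic easy to follow. The maximum is taken over all actions available in the next state.

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new q-value."""
    best_next = np.max(q_table[next_state])      # best q-value reachable from the next state
    td_target = reward + gamma * best_next       # R[t] + gamma * max_a Q(S[t+1], a)
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table[state, action]

# Hypothetical toy problem: 3 states, 2 actions, Q-table initialized with zeros.
q = np.zeros((3, 2))
q[1] = [0.0, 2.0]                                # pretend state 1 already has a learned value
new_value = q_update(q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=0.9)
```

With these assumed numbers, the target is 1.0 + 0.9 · 2.0 = 2.8, and the entry moves halfway from 0 towards it because α = 0.5.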
Background : Q-learning Based Protocols
Protocol 1 : Adaptive UAV-Assisted Geographic Routing with
Q-Learning in VANET
In this approach, the authors in [20] proposed a Q-learning-based Adaptive Geographic
Routing (QAGR) system for Vehicular Ad hoc Networks (VANETs) assisted by an unmanned
aerial vehicle (UAV), which is divided into two components. In the aerial component, the
global routing path is calculated by fuzzy-logic and depth-first-search (DFS) algorithms
using information collected by the UAV, such as the global road traffic, and is then forwarded to
the requesting ground vehicle. In the ground component, the vehicle maintains a fixed-size
Q-table, converged with a well-designed reward function, and forwards the routing request
to the optimal node by looking up the Q-table filtered according to the global routing path.
The QAGR algorithm is used to improve the convergence speed and resource utilization of
geographic routing approaches in VANETs. UAVs are deployed to guide the transmission
path, and the Q-learning algorithm is used to help each node choose the best next hop in
a specific area. The simulation results show that QAGR performs better than other
approaches in packet delivery and end-to-end delay.
FIGURE 6. Used algorithm.
Protocol 2 : Learning to Rest : A Q-Learning Approach to Flying
Base Station Trajectory Design with Landing Spots
In this approach, the authors in [21] used a Q-learning algorithm to make movement decisions for the UAV, maximizing the data collected from the ground users while minimizing
power consumption by exploiting landing spots (LSs). The UAV movement decisions are
made based on the drone’s current state, i.e., its position and battery level. While landing
spots offer the possibility to conserve energy, the UAV might have to sacrifice some users’
Quality of Service (QoS). The advantages of this application are that (i) the presented system
can utilize LSs efficiently to extend mission duration and (ii) it maximizes the sum rate of the
transmission without using a model or any prior information about the environment (see
Fig.7).
FIGURE 7. Motivating scenario.
Protocol 3 : Reinforcement Learning for Decentralized Trajectory Design in Cellular UAV Networks With Sense-and-Send
In this approach, the authors introduced in [22] a novel RL-based framework for decentralized trajectory design in cellular UAV networks with a sense-and-send protocol. UAVs are
equipped with sensors to collect data from the environment and then send it to the ground
station. The trajectory of each UAV is optimized independently using an RL algorithm,
which considers the current state of the UAV, including its position, velocity, and remaining
battery life. The proposed approach aims to maximize the amount of data
collected and transmitted while minimizing energy consumption. The simulation results
show that the proposed approach outperforms traditional methods regarding data collection
efficiency and energy consumption (see Fig.8).
FIGURE 8. Motivating scenario.
Protocol 4 : Visual Exploration and Energy-aware Path Planning via Reinforcement Learning
In this approach, the authors in [23] proposed a deep RL approach that combines the effects
of energy consumption and the object detection modules to develop a policy for object
detection in large areas with limited battery life. The learning model enables dynamic
learning of the negative rewards of each action based on the drag forces resulting from
the UAV’s motion relative to the wind field. The proposed architecture shows promising
results in detecting more goal objects than traditional coverage path planning algorithms,
especially in moderate and high wind intensities (see Fig.9).
FIGURE 9. Motivating scenario.
Protocol 5 : Minimizing Packet Expiration Loss With Path Planning in UAV-Assisted Data Sensing
In this approach, the authors in [24] proposed a UAV trajectory planning model based on
deep RL for data collection that minimizes the number of expired data packets in the whole sensor
system. Because complex constraints make the original problem hard to solve, the authors
relax it into a min–max Age of Information (AoI) optimal path scheme (see Fig.10).
FIGURE 10. Motivating scenario.
Q-learning is a well-known AI-based algorithm that is used in several fields. UAVs can
control themselves based on the Q-learning algorithm, which can be implemented
in different ways depending on the environment and the mission to accomplish. In this
chapter, we presented the five most relevant Q-learning-based protocols and compare them in
Table 1.
Protocol 1
Protocol 2
Protocol 3
Protocol 4
Protocol 5
NS-3, SUMO Not specified
Not specified
Run Time
Not specified Not specified 100 episodes Not specified
UAV Number
IoT Number
Not specified Not specified Not specified
Not specified
50-150 m
50 m
Not specified
400 Watt
Not specified Not specified Not specified
Not specified
60 km/h
Not specified Not specified
20 m/s
Not defined
Not defined
ST : Simulation Tool, Env : Environment, EC : Energy Consumption.
TABLE 1. Table of comparison.
Third Chapter
The number of IoT devices is increasing, and one of the challenges these devices face
is limited battery life. We consider recharging them because an IoT device
may be located in an isolated place where no electricity source exists, such as on a
mountain, in the middle of a forest, or in the middle of the sea. Therefore, we propose using
a UAV as a data collector and a flying energy source to collect data and recharge these IoT
devices wirelessly, because UAVs are flexible and can fly in any direction with high precision.
A UAV can also move faster than a terrestrial vehicle, and since it flies,
the nature of the terrain is not a challenge, whether the
environment is in the middle of the sea or on a rocky surface.
With the increasing number of IoT devices and the changing environment, controlling
UAVs manually is not a good solution and wastes a lot of time. Therefore, we
implement the Q-learning algorithm in the UAV so that it no longer needs to be
controlled by humans ; with this algorithm, the UAV learns by itself how
to behave and which action to take in every environment.
In this work, the main goal is to collect the maximum amount of data from the IoT devices,
transfer the maximum amount of energy to them, and minimize the number of actions performed
by the UAV to transfer energy. To this end, we implement our protocol based on the
Q-learning algorithm mentioned above to help the UAV control itself without human
involvement. The UAV learns by itself which action to take in each state, and
after several episodes, it has sufficient knowledge about the environment for
its actions to become more efficient.
Protocol Description
The contribution of our protocol is to deploy only one UAV as a data collector and flying
energy source that wirelessly collects data from, and transfers energy to, the IoT
devices, which are located randomly on a 3x3 grid (9 cells). Since our key goal is to
minimize the energy consumed by the UAV, we used a Q-learning algorithm with a
fixed set of four actions (forward, left, right, backward). The UAV runs through a
specific number of episodes, and in each episode, the UAV takes a limited number of steps to move
Chapter 3
Our Protocol
in the environment. The IoT devices remain in their positions. The algorithm and the
calculations run on a base station, which hosts the Q-learning algorithm and gives the UAV
the action to take at each step.
The Q-learning algorithm consists of four main parts : (i) the q-table, where the q-values are
stored ; (ii) the q-function, which is used to calculate the q-values ; (iii) the actions, of which
the UAV is allowed to take one of four at each step ; and (iv) the reward, which the UAV
receives after taking an action in a given state and which is used in the q-function.
In the Q-learning process, the UAV runs through episodes, and in each episode, the
UAV takes a number of steps to move around the environment. The UAV may take
one of the four available actions or, in some cases, remain in its position. We set the
reward function depending on the behavior we want the UAV to follow. The
q-table has a size of (9x4), where 9 is the number of states and 4 is the number of actions. The
q-value of each action at each step is saved in this table at the coordinates (state,
action).
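As a minimal sketch of this data structure, the q-table can be held in a NumPy array with one row per grid cell and one column per action. The row-major numbering of the cells and the helper names cell_to_state and state_to_cell are assumptions made for illustration ; the text only fixes the table size (9x4) and the four actions.

```python
import numpy as np

GRID_ROWS, GRID_COLS = 3, 3                      # 3x3 grid of the protocol (9 cells)
ACTIONS = ["forward", "left", "right", "backward"]

def cell_to_state(row, col):
    """Map a grid cell to a state index (row-major numbering, an assumption)."""
    return row * GRID_COLS + col

def state_to_cell(state):
    """Inverse mapping: state index back to (row, col)."""
    return divmod(state, GRID_COLS)

# One row per state, one column per action, initialized with zeros.
q_table = np.zeros((GRID_ROWS * GRID_COLS, len(ACTIONS)))

# The q-value of action "right" in the centre cell is addressed as (state, action).
centre = cell_to_state(1, 1)
q_centre_right = q_table[centre, ACTIONS.index("right")]
```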
There are two ways for the UAV to choose an action : (i) it may take a random action
from the four actions available, or (ii) it may take an action depending on the q-table, where the
optimal action in each state is the action with the maximum q-value in that state.
These two ways of choosing an action are called Exploration and Exploitation,
as mentioned in the second chapter (item 7 of the elements of reinforcement learning). At the
beginning of the process, we set a variable called Epsilon to 1, and at each episode, Epsilon
decreases by a small decimal number. At the beginning of each step, the algorithm draws a random
decimal number between 0 and 1. If this random number is less than Epsilon, the algorithm
chooses a random action, which is Exploration ; if the number is greater
than Epsilon, the algorithm chooses an action according to the q-table,
which is Exploitation.
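The selection rule described above is commonly called ϵ-greedy and can be sketched as follows. The function name choose_action and the use of NumPy's random generator are assumptions for illustration, not the code of this work.

```python
import numpy as np

def choose_action(q_table, state, epsilon, rng):
    """Return an action index using the epsilon-greedy rule."""
    if rng.random() < epsilon:                       # random number in [0, 1)
        return int(rng.integers(q_table.shape[1]))   # Exploration: random action
    return int(np.argmax(q_table[state]))            # Exploitation: best known action

rng = np.random.default_rng(0)
q_table = np.zeros((9, 4))
q_table[4, 2] = 1.0                                  # pretend action 2 is best in state 4

greedy = choose_action(q_table, state=4, epsilon=0.0, rng=rng)
explored = [choose_action(q_table, 4, 1.0, rng) for _ in range(200)]
```

With Epsilon at 0 the call always returns the action with the highest q-value, while with Epsilon at 1 the returned actions are spread over all four possibilities.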
The q-table is initialized with zeros, and since Epsilon
is equal to 1 in the first episode and the random number is between 0 and 1, at each
step of the first episode, there is a probability of about 99% that the algorithm will pick a random
action and explore the environment. As Epsilon decreases, the probability of Exploration
decreases and the probability of Exploitation increases until Epsilon reaches 0 or less,
where the algorithm is in (100%) Exploitation. Epsilon is called the Exploration rate.
The rate of change from the 99% Exploration phase to the 100% Exploitation phase
depends on the epsilon decay rate. For instance, if we set the decay rate of epsilon ϵ to 0.1,
then after 11 episodes, ϵ will be less than 0, and the algorithm will no longer explore the
environment ; it will only use Exploitation. 11 episodes may not be enough for the
UAV to explore the whole environment, especially if it is vast. Therefore, we try
to balance the number of episodes and the epsilon decay rate to give the UAV enough episodes
to explore the environment.
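The effect of the decay rate can be checked with a short sketch of a linear schedule. The helper names epsilon_at and decay_for are hypothetical, and clipping ϵ at 0 is a simplification : a negative ϵ behaves the same way, since the random number is never below 0.

```python
def epsilon_at(episode, start=1.0, decay=0.1):
    """Exploration rate after `episode` linear decrements, clipped at 0."""
    return max(0.0, start - decay * episode)

def decay_for(explore_episodes, start=1.0):
    """Decay rate that brings epsilon from `start` to 0 in `explore_episodes` episodes."""
    return start / explore_episodes

# Decay 0.1: exploration is over after about ten episodes.
schedule = [round(epsilon_at(e), 2) for e in range(12)]

# To keep exploring for 100 episodes, a decay of 0.01 would be needed.
slow_decay = decay_for(100)
```

Choosing the decay from the number of episodes reserved for exploration, rather than the other way round, is one simple way to strike the balance mentioned above.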
While the UAV is learning and exploring the environment, some actions may take it out of
the environment. In this situation, the UAV remains in its position, and
instead of getting the usual reward, it gets a penalty so that it learns to avoid
this action next time.
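A minimal sketch of this boundary rule on the 3x3 grid is given below. The reward values (-1 for an ordinary move, -10 for trying to leave the grid), the row-major state numbering, and the mapping of the four actions to grid directions are assumptions ; the source does not state them.

```python
GRID_SIZE = 3
# Assumed mapping of action index to (row change, column change):
# 0 = forward, 1 = left, 2 = right, 3 = backward.
MOVES = {0: (-1, 0), 1: (0, -1), 2: (0, 1), 3: (1, 0)}
STEP_REWARD = -1.0          # hypothetical reward for an ordinary move
BOUNDARY_PENALTY = -10.0    # hypothetical penalty for trying to leave the grid

def step(state, action):
    """Return (next_state, reward); the UAV stays in place if the move leaves the grid."""
    row, col = divmod(state, GRID_SIZE)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE):
        return state, BOUNDARY_PENALTY
    return new_row * GRID_SIZE + new_col, STEP_REWARD

outside = step(0, 0)    # forward from the top-left cell leaves the grid
inside = step(0, 2)     # right from the top-left cell is a valid move
```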
The q-values are calculated using the Q-function (see the Q-value update step of Algorithm
2), and the q-table is updated at each step. Q(State_n, Action_k) is the q-value of
action k in state n, and R is the reward value. α is the learning rate, a decimal number
between 0 and 1 that controls how much the q-value changes at each update ; a high
α value means a significant change. γ is the discount factor, which controls how important the
next reward is relative to the current reward ; a high γ value gives the next reward more weight.
UAV Recharging Architecture
Our proposed protocol consists of two main stages : (i) the Charging Process (CP) from the
UAV to the IoT devices and (ii) the Data Collection process (DC) from the IoT devices to
the UAV. In the first stage, the UAV is equipped with directional antennas to simultaneously
transfer information and energy to IoT devices based on SWIPT technology. In the
Data Collection process, the IoT devices transmit their collected data to the UAV using the
uplink signal.
Simulation Setup
Device name :
Processor : Intel(R) Core i3-6000U CPU
RAM : 4 GB
System Type : 64-bit operating system
Linux edition :
Python : Python 3.6
AirSim : AirSim 2017
Unreal Engine : Unreal Engine 4
TABLE 2. List of simulation setup.
AirSim is an open-source simulator developed by Microsoft for autonomous vehicles,
including drones. It provides a realistic virtual environment for testing and developing
algorithms and control systems for aerial vehicles. AirSim offers a variety of features,
such as realistic physics simulations, sensor emulation (e.g., cameras, Lidar), and APIs for
interacting with the simulated environment. The goal of developing AirSim is to provide a
platform for AI research to experiment with deep learning, computer vision, and reinforcement learning algorithms for autonomous vehicles. For this purpose, AirSim also exposes
APIs to retrieve data and control vehicles in a platform-independent way [25].
FIGURE 11. AirSim [25]
Unreal Engine
Unreal Engine 4 (UE4) combined with AirSim forms a robust framework for developing
and testing autonomous systems, including drones. Unreal Engine 4 is a widely used
game engine developed by Epic Games, renowned for its advanced graphics, physics
simulations, and expansive toolset. The integration of Unreal Engine 4 with AirSim allows
us to leverage the benefits of both platforms [26].
FIGURE 12. Unreal Engine 4. [26]
Used Algorithm
Our code is written in Python 3. Python is a popular programming language in many fields,
such as AI, data science, and web development, and it is a scripting language. Our code contains a
main class and a main program. We use the NumPy library to manipulate matrices and
generate random numbers, and the Matplotlib library
for plotting the result graphs. We create our Q-table and initialize
it with zeros. We also set the number of steps, the number of episodes, Epsilon, alpha, and
the discount factor (see Table 3).
Number of episodes
Number of steps
Learning rate
Discount factor
TABLE 3. Simulation parameters.
Algorithm 2 Used algorithm
Initialize the UAV location.
Initialize the environment (four IoT devices)
Initialize Q-table with arbitrary initial values or zeros
Set Q-learning parameters (epsilon ϵ, the learning rate α and the discount factor γ)
Connect to AirSim.
Repeat for each episode
    Initialize the current state S[t]
    Repeat for each step of episode
        Choose A[t] based on the current state S[t] using policy derived from Q (ϵ-greedy)
        (using an exploration-exploitation strategy)
        Take action A[t], observe next state S[t + 1] and immediate reward R[t]
        Update Q-value :
        Q(S[t], A[t]) ← Q(S[t], A[t]) + α · (R[t] + γ · max_a Q(S[t + 1], a) − Q(S[t], A[t]))
Until convergence criterion is met
Extract policy : Select action A[t]∗ with the highest Q-value for each state S[t]
Show the plots
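The structure of Algorithm 2 can be sketched in plain Python and NumPy, leaving out the AirSim connection and the plots. This is a simplified illustration and not the code of this work : it assumes a single IoT device in cell 8 instead of four, a start in cell 0, rewards of +10, -1 and -10, and the hyperparameter values shown as defaults. All of these are hypothetical.

```python
import numpy as np

N_STATES, N_ACTIONS, GRID = 9, 4, 3
MOVES = [(-1, 0), (0, -1), (0, 1), (1, 0)]   # forward, left, right, backward (assumed)
GOAL = 8                                     # hypothetical cell of the single IoT device

def step(state, action):
    """Return (next_state, reward, done) with assumed reward values."""
    row, col = divmod(state, GRID)
    new_row, new_col = row + MOVES[action][0], col + MOVES[action][1]
    if not (0 <= new_row < GRID and 0 <= new_col < GRID):
        return state, -10.0, False           # left the grid: stay in place, penalty
    new_state = new_row * GRID + new_col
    if new_state == GOAL:
        return new_state, 10.0, True         # reached the IoT device: terminal state
    return new_state, -1.0, False

def train(episodes=200, max_steps=20, alpha=0.1, gamma=0.9, decay=0.01, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, N_ACTIONS))      # Q-table initialized with zeros
    epsilon, rewards = 1.0, []
    for _ in range(episodes):
        state, total = 0, 0.0                # each episode starts in cell 0
        for _ in range(max_steps):
            if rng.random() < epsilon:       # exploration
                action = int(rng.integers(N_ACTIONS))
            else:                            # exploitation
                action = int(np.argmax(q[state]))
            next_state, reward, done = step(state, action)
            q[state, action] += alpha * (
                reward + gamma * np.max(q[next_state]) - q[state, action]
            )
            state, total = next_state, total + reward
            if done:
                break
        rewards.append(total)
        epsilon -= decay                     # linear epsilon decay per episode
    return q, rewards

q_table, episode_rewards = train()
policy = np.argmax(q_table, axis=1)          # greedy action for each state
```

The list episode_rewards holds the total reward of each episode and could be passed to Matplotlib to draw a reward curve.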
Reward discussion
Figure 13 and Figure 14 present the reward of each episode in two
different scenarios (with and without obstacles). Each graph starts with
low values, and after several episodes, the reward value increases and converges. This is
because the algorithm shifts from exploration to exploitation ; once it reaches 100%
exploitation, the UAV has learned the environment and
chooses only the best actions it has found. That is why the reward is stable after about 100 episodes.
FIGURE 13. The reward result for scenario one
FIGURE 14. The reward result for scenario two
Energy discussion
In both scenarios, we assumed that the UAV has (100%) of its energy at the beginning of each
episode, and if its battery reaches (20%) or less, it returns to the base station. On the
other hand, we assumed that the IoT devices have a 20% battery level at the beginning of
each episode. We also assumed that every step taken by the UAV decreases its battery
by (0.5%). If the UAV reaches an IoT device, which is a terminal state, its battery
decreases by (10%). Figure 15 and Figure 16 present the energy consumption of
the UAV in each episode. They show that the UAV consumes a lot of
energy at the beginning because it is still learning and wastes energy moving around, but after it
learns enough, the energy consumption converges.
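The battery assumptions above can be written as a small bookkeeping sketch. The function name update_battery is hypothetical, and applying the 10% charging cost on top of the 0.5% step cost is an assumption, since the text does not say whether the two are cumulative.

```python
UAV_START = 100.0          # % of energy at the beginning of each episode
RETURN_THRESHOLD = 20.0    # the UAV returns to the base station at this level or less
STEP_COST = 0.5            # % consumed by every step
CHARGE_COST = 10.0         # % consumed when the UAV reaches an IoT device

def update_battery(level, reached_device):
    """Return (new_level, must_return) after one step of the UAV."""
    level -= STEP_COST
    if reached_device:
        level -= CHARGE_COST   # assumption: added to the step cost
    return level, level <= RETURN_THRESHOLD

# Number of plain steps the UAV can take before it has to return to the base station.
steps_until_return = int((UAV_START - RETURN_THRESHOLD) / STEP_COST)

after_move = update_battery(UAV_START, reached_device=False)
after_charge = update_battery(UAV_START, reached_device=True)
```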
FIGURE 15. The consumed energy for each episode in scenario one
FIGURE 16. The consumed energy for each episode in scenario two
Transferred energy discussion
As we can see in both Figure 17 and Figure 18, the energy transferred to the IoT devices is
low at the beginning and then increases. This is because the UAV is still learning at first and
does not know the locations of the IoT devices, but after it learns enough, it minimizes its
actions and transfers as much energy as possible.
FIGURE 17. The transferred energy result for scenario one
FIGURE 18. The transferred energy result for scenario two
In this chapter, we introduced our case study, which uses a flying UAV as an energy
source to charge IoT devices and deliver the maximum energy to them by
implementing a Q-learning algorithm that optimizes the movement and the energy consumption
of the UAV over a specific number of episodes. We also described the UAV
Recharging Architecture that we implemented in the UAV.
General Conclusion
In this thesis, we introduced the concept of using a UAV in two roles : as an
aerial data collector and as a flying energy source that can wirelessly collect information from,
and transfer power to, terrestrial IoT devices. This concept allows UAVs to navigate
autonomously by providing them with a learning ability to control their behavior and
optimize the energy consumption of their batteries.
With a Q-learning-based algorithm and only some previous knowledge
about the environment and the available actions, UAVs can learn and improve their
behavior by trial and error : the UAV explores the unknown
environment in depth, collecting as many rewards as it can, and whenever it takes a
wrong step, it receives a punishment, all without letting its own battery run out. This
helps the UAV develop a learning policy that it can use in the future. To reduce
complications and illustrate the use of Q-learning in flying energy sources, we observed how
the UAV performed in the scenarios we proposed, and according to the results, we consider its
performance efficient. We now look to add new concepts and address the remaining problems,
which are discussed as future work.
As for future work, this study used only one agent, so in
upcoming work, we plan to use multiple agents at the same time.
Moreover, the energy consumption problem is still not solved, so we expect to adopt new
ways to optimize energy consumption, such as Deep Q-learning (DQN), which combines Q-learning with a neural network. Another aspect that we want to achieve in our upcoming
studies is shifting from a discrete environment to an open environment.
