The People's Democratic Republic of Algeria
Ministry of Higher Education and Scientific Research
University of Djelfa

MASTER'S THESIS
Submitted to the Faculty of Exact Sciences and Computer Science
Computer Science Department
To obtain the Academic MASTER IN COMPUTER SCIENCE
Option: Networks and Distributed Systems

By AROUR Miloud, ABEDESSLAM Meftah

DRONE-ASSISTED DATA COLLECTION AND ENERGY TRANSFER BASED ON Q-LEARNING IN IOT NETWORKS

Jury:
President: Dr.
Reviewer: Mr.
Supervisor: Mr. Kaddour MESSAOUDI

June 2023

Dedication

First of all, we praise Allah the Almighty for helping us to finish this work, and we hope that this humble work can serve as a reference for upcoming studies. We dedicate this work to our parents and our families, AROUR's family and ABDELSSALEM's family. We also want to thank every person who placed their trust in us to finish this work, and we wish to thank our supervisor, Mr. Kaddour MESSAOUDI, for following us at each step. Thank you so much.
- Meftah
- Miloud

Abstract:

In this work, we mainly rely on one of the most common Reinforcement Learning (RL) methods, called Q-learning, which allows Unmanned Aerial Vehicles (UAVs), commonly called drones, to efficiently collect data from and fairly transfer energy to a set of IoT devices with limited energy capacity; these IoT devices are randomly distributed in a given area. For this purpose, we simulated the deployment of a single UAV above a set of IoT devices to wirelessly collect information from them and provide them with energy simultaneously, based on the promising technology called Simultaneous Wireless Information and Power Transfer (SWIPT). The performance of our protocol is demonstrated using the Python language with the AirSim simulator, one of the most recent and widely used tools for UAV simulations based on Artificial Intelligence (AI) techniques.

Keywords: Data Collection, UAV, SWIPT, AirSim, AI, IoT.
Table of Contents

List of Figures
List of Tables
1 General Introduction
2 First Chapter
  2.1 Introduction
  2.2 IoT Architecture
  2.3 IoT layers
    2.3.1 Perception Layer
    2.3.2 Network Layer
    2.3.3 Middleware Layer
    2.3.4 Application Layer
    2.3.5 Business Layer
  2.4 IoT Applications
    2.4.1 Medical and healthcare industry
    2.4.2 Precision agriculture and breeding
    2.4.3 Industrial Automation
    2.4.4 Smart Cities
    2.4.5 Energy Management
    2.4.6 Retail
  2.5 IoT assisted by UAVs
    2.5.1 UAV definition
    2.5.2 Types of UAVs
    2.5.3 UAV applications
  2.6 Conclusion
3 Second Chapter
  3.1 Introduction
  3.2 History of Reinforcement learning
  3.3 Reinforcement learning
    3.3.1 Elements of reinforcement learning
    3.3.2 Q-learning
  3.4 Background: Q-learning Based Protocols
    3.4.1 Protocol 1: Adaptive UAV-Assisted Geographic Routing with Q-Learning in VANET
    3.4.2 Protocol 2: Learning to Rest: A Q-Learning Approach to Flying Base Station Trajectory Design with Landing Spots
    3.4.3 Protocol 3: Reinforcement Learning for Decentralized Trajectory Design in Cellular UAV Networks With Sense-and-Send Protocol
    3.4.4 Protocol 4: Visual Exploration and Energy-aware Path Planning via Reinforcement Learning
    3.4.5 Protocol 5: Minimizing Packet Expiration Loss With Path Planning in UAV-Assisted Data Sensing
  3.5 Conclusion
4 Third Chapter
  4.1 Introduction
  4.2 Protocol Description
  4.3 UAV Recharging Architecture
  4.4 Simulation Setup
    4.4.1 AirSim
    4.4.2 Unreal Engine
  4.5 Used Algorithm
  4.6 Results
    4.6.1 Reward discussion
    4.6.2 Energy discussion
    4.6.3 Transferred energy discussion
  4.7 Conclusion
5 General Conclusion
References

List of Figures

1 Five-Layered Architecture of IoT
2 Drone components
3 According to the Number of their Propellers
4 Multi-rotors UAVs
5 Q-learning
6 Used algorithm
7 Motivating scenario
8 Motivating scenario
9 Motivating scenario
10 Motivating scenario
11 AirSim [25]
12 Unreal Engine 4 [26]
13 The reward result for scenario one
14 The reward result for scenario two
15 The consumed energy for each episode in scenario one
16 The consumed energy for each episode in scenario two
17 The transferred energy result for scenario one
18 The transferred energy result for scenario two

List of Tables

1 Table of comparison
2 List of simulation setup
3 Simulation parameters

1. General Introduction

Forty-three billion is the expected number of connected IoT devices in 2023, according to [1], a vast number that keeps increasing daily. These IoT devices include simple and complex objects, such as sensors, smartphones, smartwatches, industrial equipment, and connected vehicles. Typically, IoT devices exchange information with the cloud and also among themselves, depending on the service they are executing. Like any other paradigm, IoT devices face several challenges that need to be addressed for better performance. One of the significant challenges that IoT applications still suffer from is their constrained energy capacity. Indeed, all IoT devices have limited batteries that will not last for long, especially with the new generations of mobile networks that require more energy to perform their missions successfully over the long term. To fill this gap, a promising technology that has proved itself in academia and industry, called Simultaneous Wireless Information and Power Transfer (SWIPT) [2], can simultaneously transfer information and recharge IoT devices wirelessly, ensuring proper functioning for an extended period. However, this remains a difficult challenge due to the unknown positions of IoT devices in the environment; therefore, we propose a mobile flying energy source that acts as a data collector and energy transmitter from and to the IoT devices [3].
When it comes to flying objects in technology, we think of Unmanned Aerial Vehicles (UAVs), or what we call drones. In the last few years, they have been used in many applications thanks to their popularity and low cost, including civilian services such as delivery, object tracking, and the monitoring of natural phenomena. They have also been used in the telecommunications domain and at public events like sports games and festivals. For instance, UAVs are used as flying hotspots to improve Internet speed or to act as the primary source of Internet access when the terrestrial infrastructure fails, such as after an earthquake [4]. UAVs come in two different types: UAVs with High Altitude Platforms (HAP) [5], whose altitude can exceed 10 km; one use of this type is to provide Internet access over large zones. The second type is UAVs with Low Altitude Platforms (LAP) [6], also called drones. In telecommunications, LAP UAVs are deployed to maximize the energy transferred to IoT devices. Unfortunately, UAVs also have on-board batteries with a limited lifetime; this poses two main questions: how can the UAVs fly over the dynamic IoT devices and transfer energy to them? Moreover, how can UAVs charge the maximum number of IoT devices?
In this work, with the help of the SWIPT technology mentioned above, UAVs act as mobile flying energy sources that can move autonomously to collect information from and transfer energy to IoT devices. Despite the UAVs' essential role, another problem remains that forces us to pose some questions: (i) how can we make UAVs move autonomously and act intelligently without any human intervention? (ii) how can UAVs transfer the maximum amount of energy to IoT devices? (iii) how can UAVs cover the maximum number of IoT devices?
As for the contribution of our work, we subject a single UAV to a training and learning process so that it can control its behavior and its interaction with the surrounding environment through some prior knowledge of the environment and a set of available actions, where the UAV can (i) efficiently learn and improve its behavior by trial and error, (ii) explore the environment in depth by collecting as many rewards as possible, and (iii) receive a punishment whenever it takes a wrong step. Therefore, based on this method (i.e., the Q-learning algorithm [7]), the UAV can develop and exploit a learning policy.
Our work is organized into three main chapters:
— In the first chapter, we talk about IoT networks and how IoT devices are involved in our daily lives. We also present existing IoT applications and where they are used. Then we discuss UAVs, their different types and applications, and how UAVs are helping to shape the future of the IoT world.
— In the second chapter, we talk about the different methods that exist for controlling drones using the Q-learning method. To this end, we have selected five proposed protocols with different methods that have been proposed to control drones using Q-learning.
— In the last chapter, we discuss our proposed protocol and how it is used to control the UAV, compare it to the methods discussed in the previous chapter, and explain why we chose it.
— Finally, we conclude our work by discussing our results.

2. First Chapter

2.1 Introduction

The Internet of Things (IoT) network is defined as a collection of intelligent devices that are wirelessly connected among themselves or connected to the cloud via the interconnected network (the Internet). Historically, the concept of the IoT network is relatively recent, emerging in the early 2000s. However, the roots of IoT can be traced back to the development of various technologies and networks over the years.
The first glimmers of IoT can be seen in early computer networks and the evolution of the Internet itself. In the 1960s and 1970s, researchers and organizations started connecting computers together, creating the foundation for networked communication. This eventually led to the creation of the Internet in the 1980s, which connected disparate networks and allowed for global communication and data exchange. As the Internet matured, advancements in communication protocols, wireless technologies, and the miniaturization of computing devices paved the way for the expansion of IoT networks. The concept of embedding sensors and actuators into physical objects to collect and exchange data gained prominence.
The early 2000s saw the emergence of wireless sensor networks (WSNs), which were the precursors of IoT networks. WSNs comprised small, low-power devices equipped with sensors to monitor and collect data from the physical environment. These devices communicated wirelessly using protocols such as Zigbee, Bluetooth, or Wi-Fi, forming ad-hoc networks to relay data to a central hub or gateway. The term "Internet of Things" itself was coined in 1999 by Kevin Ashton [8], a British technologist who envisioned a network where physical objects could be connected and communicate with each other through the Internet. Since then, IoT has evolved rapidly, driven by advancements in wireless connectivity, cloud computing, big data analytics, and artificial intelligence.
The integration of IoT into various industries has transformed sectors such as manufacturing, agriculture, transportation, healthcare, and smart homes. IoT networks are composed of a vast array of interconnected devices, including sensors, actuators, wearables, industrial equipment, vehicles, and consumer appliances. These devices collect data, communicate with each other or with central servers, and enable real-time monitoring, control, and automation of processes. IoT networks rely on a range of communication technologies, including cellular networks (2G, 3G, 4G, and now 5G), Wi-Fi, Bluetooth, Zigbee, LoRaWAN, and others.

2.2 IoT Architecture

IoT architectures refer to the design and structure of the systems that enable IoT to function. These architectures provide a framework for connecting and managing the various components of an IoT ecosystem, including devices, networks, data processing, and applications. Here are five commonly used IoT architectures:
— Centralized Architecture: In this architecture, all IoT devices send data to a central server or cloud platform for processing and analysis. The server is responsible for data storage, analysis, and application logic. It allows for centralized control and management of the IoT network.
— Peer-to-Peer (P2P) Architecture: P2P architectures enable direct communication and data exchange between IoT devices without relying on a central server. Each device can act as both a client and a server, allowing for decentralized and distributed communication. P2P architectures are often used in scenarios where low latency and high scalability are critical.
— Three-Tier Architecture: This architecture consists of three layers: the device layer, the gateway layer, and the cloud layer. The device layer comprises IoT sensors and actuators, the gateway layer handles data preprocessing and local analytics, and the cloud layer manages storage, advanced analytics, and application services. This architecture enables a distributed and scalable approach to IoT deployments.
— Edge Computing Architecture: Edge computing architectures involve processing data at the edge of the network, closer to the IoT devices themselves, rather than relying on a central server or cloud. This approach reduces latency, optimizes bandwidth usage, and enables real-time analysis and decision-making. Edge computing architectures are suitable for applications that require quick response times and operate in resource-constrained environments.
— Hybrid Architecture: Hybrid architectures combine multiple architectural approaches to leverage the strengths of each. They may involve a combination of centralized, distributed, and edge computing elements, depending on the specific requirements of the IoT application. Hybrid architectures provide flexibility, scalability, and efficient resource utilization.
It is very important to note that IoT architectures can vary depending on the specific use cases, scalability needs, security requirements, and network infrastructures. Designers and developers create and customize architectures to meet the stringent requirements of IoT systems.

2.3 IoT layers

The IoT architecture typically consists of several layers (see Fig. 1) that work together to enable the functioning of IoT systems, which can be described as follows:

2.3.1 Perception Layer

The perception layer, also known as the physical layer or sensing layer, is the bottommost layer of the IoT architecture.
It comprises various sensors, actuators, and physical devices that interact with the physical world. These devices collect data from the environment, such as temperature, humidity, light, motion, or location [9].

2.3.2 Network Layer

The network layer connects the devices in the IoT ecosystem and facilitates data transfer between them. It includes wired and wireless communication technologies like Wi-Fi, Bluetooth, Zigbee, cellular networks, or Ethernet. This layer ensures reliable and secure connectivity for seamless communication among devices [10].

2.3.3 Middleware Layer

The middleware layer acts as an intermediary between the perception and application layers. It manages the data flow, protocols, and communication between the devices and the applications. Middleware provides functionalities such as data filtering, protocol translation, device management, and security [11].

2.3.4 Application Layer

The application layer is the topmost layer of the IoT architecture and represents the user-facing part. It consists of applications, services, and interfaces that enable users or other systems to interact with the IoT ecosystem. This layer utilizes the data collected from the perception layer to provide meaningful insights, analytics, control, and automation [12].

2.3.5 Business Layer

Some IoT frameworks include a business layer, which focuses on the business logic and processes related to IoT deployments. It involves aspects such as data monetization, business models, service management, and integration with existing enterprise systems.

FIGURE 1. Five-Layered Architecture of IoT

2.4 IoT Applications

IoT networks refer to the network of interconnected physical devices, vehicles, appliances, and other objects embedded with sensors, software, and network connectivity, enabling them to collect and exchange data. IoT has numerous applications across various industries and sectors. The following are some applications of IoT:

2.4.1 Medical and healthcare industry

There are several applications in the healthcare field that can be framed using IoT capabilities. Using cell phones with RFID-sensor capabilities, medical parameters may be monitored and drug administration can be tracked simply. The following are some of the benefits that can be obtained by utilizing this feature: (i) convenient illness monitoring, (ii) ad-hoc diagnosis, and (iii) immediate medical assistance in the event of an accident. Implantable and wireless devices can be used to save and secure health records. These health data can be utilized to save a patient's life and provide tailored treatment to people in emergency situations, particularly those with heart disease, cancer, diabetes, stroke, cognitive impairments, and Alzheimer's disease. Guided action on the body can be taken by introducing biodegradable chips into the human body [13]. Paraplegics can receive muscular stimulation to help them regain their mobility; a smart IoT-controlled electrical stimulation device can be implanted to do this. Many more applications, such as (i) remote patient monitoring, (ii) telemedicine, and (iii) medication management, may be readily accomplished with the help of IoT's many characteristics, as detailed below.
— Remote patient monitoring: IoT devices, such as wearables and connected medical devices, collect and transmit patient data to healthcare providers, enabling remote monitoring of vital signs and conditions.
— Telemedicine: IoT enables virtual consultations and remote healthcare services, allowing patients to access medical expertise and reducing the need for in-person visits.
— Medication management: IoT devices can remind patients to take medication, track adherence, and monitor medication supply, ensuring proper dosage and timely refills.

2.4.2 Precision agriculture and breeding

IoT is revolutionizing the agricultural sector with applications like precision farming. Farmers can use IoT sensors to monitor soil moisture, temperature, and nutrient levels, enabling precise irrigation and fertilization and resulting in optimized crop yields. IoT may also be used to manage the traceability of animals used in agriculture. This can aid in the real-time identification of animals, particularly during the outbreak of an infectious illness. Many countries provide subsidies to farmers and shepherds based on the number of animals they have, such as cattle, goats, and sheep. Many more applications, such as (i) precision farming, (ii) livestock monitoring, and (iii) automated irrigation, may be readily accomplished with the help of IoT's many characteristics.
— Precision farming: IoT sensors collect data on soil moisture, temperature, and nutrient levels, helping farmers optimize irrigation, fertilization, and crop growth.
— Livestock monitoring: IoT devices such as smart collars or ear tags collect data on the health and behavior of livestock, allowing farmers to detect diseases, manage feeding patterns, and optimize breeding programs.
— Automated irrigation: Connected irrigation systems can adjust water usage based on real-time weather conditions and plant needs, conserving water and improving crop yield.

2.4.3 Industrial Automation

IoT is widely used in industry for process automation, monitoring, and optimization. Connected sensors and devices help monitor and control manufacturing equipment, inventory, supply chain logistics, and predictive maintenance, improving operational efficiency. Many more applications, such as (i) process automation, (ii) asset tracking, and (iii) predictive maintenance, may be readily accomplished with the help of IoT's many characteristics.
— Process automation: IoT sensors and actuators are used to automate tasks in manufacturing processes, reducing manual intervention and improving efficiency and consistency.
— Asset tracking: IoT enables real-time tracking and monitoring of assets within factories or warehouses, optimizing inventory management and minimizing loss or theft. Asset tracking, sometimes called asset management, involves scanning barcode labels or using GPS or RFID tags to locate physical assets. Asset monitoring is as critical as inventory management, since an organization needs to know its physical assets' location, state, maintenance schedule, and other information. Asset monitoring is crucial to a company's bottom line and compliance, since lost or expired physical assets must be found and replaced [14].
— Predictive maintenance: Connected sensors collect data on equipment performance, allowing predictive maintenance schedules that prevent breakdowns and minimize downtime.
— Supply chain optimization: IoT helps track and monitor goods throughout the supply chain, providing real-time visibility and enabling efficient inventory management and logistics planning.

2.4.4 Smart Cities

IoT is utilized in creating smart cities by integrating various systems, including transportation, energy, waste management, and public safety. Applications such as (i) smart traffic management, (ii) smart parking, and (iii) intelligent lighting are examples of IoT use in urban environments.
— Smart traffic management: IoT sensors and cameras monitor traffic flow and congestion, optimizing signal timings and suggesting alternate routes to reduce congestion and improve traffic efficiency.
— Smart parking: IoT-enabled sensors provide real-time information about available parking spaces, reducing search time and congestion.
— Intelligent lighting: Connected streetlights can adjust their brightness based on ambient light levels and motion detection, reducing energy consumption and improving public safety.

2.4.5 Energy Management

IoT helps optimize energy consumption and reduce costs in buildings and homes. Connected smart meters, sensors, and appliances enable real-time energy monitoring, allowing users to make informed decisions about energy usage and efficiency. Many more applications, such as (i) smart meters, (ii) energy monitoring, and (iii) demand response, can be readily accomplished with the help of IoT's many characteristics.
— Smart meters: IoT-enabled smart meters provide real-time energy consumption data to consumers and utility companies, enabling better management of energy usage and billing accuracy.
— Energy monitoring: IoT sensors and devices track the energy usage of appliances and systems, allowing users to identify and reduce energy wastage.
— Demand response: IoT can facilitate load shedding or shifting during peak energy demand periods, helping to balance the electrical grid and avoid blackouts.

2.4.6 Retail

IoT is used to enhance the customer experience in retail environments [15]. Smart shelves, digital signage, and beacons can provide personalized offers, product information, and indoor navigation, making shopping more convenient and engaging.
— Smart shelves: IoT-enabled shelves can track inventory levels in real time, automatically triggering reordering or restocking processes.
— Digital signage: Connected displays can deliver personalized advertisements and promotions based on customer preferences and demographics.
— Beacons: IoT beacons can transmit location-based offers and information to shoppers' smartphones, enhancing their shopping experience and providing relevant recommendations.

2.5 IoT assisted by UAVs

UAVs have significant telecommunications capabilities, such as the ability to act as an aerial base station that can improve wireless access coverage and replace mobile network antennas when they have been accidentally damaged. IoT devices have also profited from capabilities offered by UAVs, such as wireless recharging of IoT devices and timely data collection from them.

2.5.1 UAV definition

UAV stands for Unmanned Aerial Vehicle. It refers to an aircraft that operates without a human pilot on board. UAVs, also known as drones, are typically remotely piloted or can fly autonomously [16] using pre-programmed flight plans or onboard sensors and navigation systems.
They come in various sizes and configurations, ranging from small handheld models to large, sophisticated aircraft used for military or commercial purposes. UAVs are equipped with different types of sensors and payloads, such as cameras, infrared sensors, LiDAR (Light Detection and Ranging), or even weaponry in the case of military drones. These sensors and payloads allow UAVs to perform a wide range of tasks, including aerial photography and videography, surveillance and reconnaissance, mapping and surveying, delivery of goods, disaster response, scientific research, and more.
UAVs have gained significant popularity and applications in recent years due to advancements in technology, including the miniaturization of components, improvements in battery life, and the development of robust control systems. They offer various advantages over manned aircraft, such as cost-effectiveness, enhanced safety, accessibility to remote or hazardous areas, and the ability to perform repetitive or dangerous tasks with precision. However, the increasing use of UAVs also raises concerns about privacy, safety regulations, and potential misuse. Therefore, governments and aviation authorities have implemented regulations to ensure responsible and safe UAV operations in public airspace.
The standard UAV components include motors, propellers, speed controllers, a battery, sensors, an antenna, a receiver, a camera, and an accelerometer to measure the speed (see Fig. 2). Generally, these are the main technological components; there could be more or fewer components depending on the service the UAV provides. It can be controlled through its antennas using radio waves [17].

FIGURE 2. Drone components

2.5.2 Types of UAVs

Several types of UAVs are designed for specific purposes, varying in size, capabilities, and configurations. We characterize UAVs in three main categories:
1. According to the number of their propellers: There are three main types of UAVs sorted according to the number of propellers:
— Single-rotor UAV: Multirotor designs with multiple rotors are the most common construction in use, but a single-rotor model consists of one main rotor and a tail rotor that helps stabilize the heading (see Fig. 3a). When hovering with heavy payloads while requiring faster flight and longer endurance, single-rotor helicopters can be the best option [4].
— Fixed-wing UAVs: As the name indicates, this type of UAV has a fixed wing and resembles a conventional airplane (see Fig. 3b); it cannot hover in place. It is mostly used for carrying packages.

FIGURE 3. According to the Number of their Propellers: (a) Single rotor UAV, (b) Fixed Wing UAVs

— Multi-rotor UAVs: Mostly used where stabilization and flexible maneuvers are important, such as in filming and object tracking, they can be categorized according to the number of rotors they have: (i) quadcopter UAVs with four rotors (see Fig. 4(c)), (ii) hexacopter UAVs with six rotors (see Fig. 4(d)), and (iii) octocopter UAVs with eight rotors (see Fig. 4(e)). The more rotors, the more energy consumed, and vice versa. Multi-rotor UAVs use electric motors because of their high precision [4].
2. According to their size: There are three main types of UAVs sorted according to their size [16].
— Small UAVs: They can be used as a powerful tool for spying. Their size varies from that of a large insect to about 50 cm.

FIGURE 4. Multi-rotors UAVs
— Medium UAVs: They can carry up to 200 kg and have a flight endurance of up to 15 minutes.
— Large UAVs: This type of UAV is mostly used by the military, and its size can be comparable to that of small airplanes.
3. According to their range: There are three main types of UAVs sorted according to their range.
— Close-range UAVs: Used in surveillance because of their ability to fly up to 50 km with a battery that can last up to 6 hours.
— Mid-range UAVs: This is a powerful type that can cover up to 650 km and can be used in surveillance.
— Endurance UAVs: This is a high-performance type that can fly at about 1 km above sea level, with a battery that can last up to 36 hours.

2.5.3 UAV applications

UAVs play an essential role in countless IoT applications, and these keep increasing daily in various domains, from military missions to the improvement of civilian life.
— Shipping and delivery: UAVs can be used to deliver products in the city, which is more efficient because it is faster; since UAVs are flying machines, they can fly almost everywhere and save much more time than ground transportation.
— Security: UAVs can be used in civilian security applications, such as tracking a criminal or a suspect, or as a flying alarm system. They can also act as security guards, since UAVs carry many sensors and, with the implementation of AI, can perform very well in this role.
— Natural disasters and rescue: UAVs can deliver supplies and medicine to damaged areas and inspect damaged buildings. Nuclear accidents and some dangerous chemicals can cause extensive damage, and UAVs can be used there as mobile sensors or scouts.
— Agriculture: UAVs can take over the role of satellites for monitoring planted areas, soil and field analysis, watering specific areas, and soil fertilization.

2.6 Conclusion

Human lives can be made simpler and safer with the assistance of technology. One of these crucial technologies is the IoT, which provides valuable assistance in many areas; through timely data collection, it helps IoT devices perform much better. Using UAVs, another powerful and recently developed technology, we design a novel method of collecting information from and transferring energy to IoT devices based on the promising technology called SWIPT. Since so many UAVs are being used for diverse purposes, we should also include the concept of self-control in the UAV field if we want these UAVs to supply energy to IoT devices. We must allow the UAV to use AI techniques to explore and learn about its surrounding environment.

3. Second Chapter

3.1 Introduction

UAVs can be controlled manually by humans using remote controllers; another method is to give the UAV the exact path to follow and all the preliminary information, including directions and distances. In this work, we do not use these control methods, which involve human intervention. Instead, we leverage self-learning methods based on AI algorithms. We subject one UAV to a training and learning process so that it can control its behavior and its interaction with the surrounding environment through some prior knowledge of the environment and a set of available actions, based on the most common reinforcement learning (RL) algorithm, called Q-learning.
In this chapter, we provide an overview of the history of RL and its main elements.
In addition, we introduce the Q-learning algorithm and its essential features. Then, we study some recent case applications that show how Q-learning can be employed in the UAV field. At the end of this chapter, we draw up a comparison table of these applications containing each study case's tools, advantages, and drawbacks.

3.2 History of Reinforcement learning

Reinforcement learning (RL) is a subfield of machine learning that focuses on developing algorithms and models capable of learning and making decisions through interaction with an environment. The history of reinforcement learning dates back several decades and has seen significant advancements and breakthroughs. According to Richard S. Sutton and Andrew G. Barto in their book Reinforcement Learning: An Introduction, second edition [18]:
The history of reinforcement learning has two main threads, both long and rich, that were pursued independently before intertwining in modern reinforcement learning. One thread concerns learning by trial and error, which started in the psychology of animal learning. This thread runs through some of the earliest work in artificial intelligence and led to the revival of reinforcement learning in the early 1980s. The other thread concerns the problem of optimal control and its solution using value functions and dynamic programming. For the most part, this thread did not involve learning. Although the two threads have been largely independent, the exceptions revolve around a third, less distinct thread concerning temporal-difference methods such as those used in the tic-tac-toe example. All three threads came together in the late 1980s to produce the modern field of reinforcement learning.
The concept of trial-and-error learning was a fundamental thread that led to the contemporary area of reinforcement learning. According to American psychologist R.S. Woodworth, the concept of trial-and-error learning dates back to Alexander Bain's discussion of learning by "groping and experiment" in the 1850s and, more explicitly, to Conway Lloyd Morgan, a British ethologist and psychologist who coined the term in 1894 to describe his observations of animal behavior [19].
What follows is a brief overview of RL history. The foundations of RL can be traced back to the 1950s and 1960s, when researchers started exploring the concepts of dynamic programming and optimal control theory. In 1951, Richard Bellman introduced the principle of optimality, which laid the groundwork for the later development of RL algorithms. In the 1970s, several researchers, including Christopher Watkins and Michael Littman, began working on RL algorithms, although computational limitations at the time limited their practical applications.
In the early 1980s, the concept of temporal difference (TD) learning emerged as a breakthrough in RL. Pioneered by Richard Sutton, TD learning involves updating an agent's value function based on the difference between predicted and observed rewards. Sutton's work on TD learning, particularly the TD(λ) algorithm, set the stage for future advancements in RL. In 1989, Christopher Watkins introduced Q-learning, an off-policy RL algorithm that uses a lookup table to estimate the values of state-action pairs. Q-learning was a major development because it allowed agents to learn optimal policies without explicitly modeling the environment.
Around the same time, Andrew Barto and Richard Sutton developed the value iteration algorithm, which provided a way to solve Markov decision processes (MDPs) and find optimal policies. In the late 1990s and early 2000s, researchers began exploring the use of function approximation techniques, such as neural networks, to handle high-dimensional state and action spaces. In 2013, Deep Q-Networks (DQN), a milestone in RL, were introduced by Volodymyr Mnih et al. DQN combined Q-learning with deep neural networks and successfully played Atari 2600 games. This breakthrough demonstrated the potential of deep RL and paved the way for further advancements in the field.
Recently, a growing focus has been on policy gradient methods and actor-critic architectures. Policy gradient methods directly optimize the policy parameters by estimating the gradient of the expected return with respect to those parameters. Actor-critic methods combine policy gradient techniques with value function estimation, where the actor learns the policy and the critic estimates the value function. These approaches have shown great promise in a wide range of applications, including robotics, game-playing, and autonomous systems. RL has also witnessed rapid progress and numerous breakthroughs across various domains. Researchers have explored model-based RL, meta-learning, multi-agent RL, and other advanced techniques to address the challenges of sample efficiency, generalization, and exploration. RL has been applied to complex tasks such as autonomous driving, healthcare optimization, robotics, and natural language processing.

3.3 Reinforcement learning

Reinforcement learning is the process of learning what to do, by mapping situations to actions, in order to maximize a numerical reward signal. The learner is not told which actions to take but instead must try them out to discover which ones yield the most reward. In the most interesting and challenging cases, actions can affect not only the immediate reward but also the next situation and, by extension, all subsequent rewards. The two most important distinguishing features of reinforcement learning are trial-and-error search and delayed reward.
We formalize reinforcement learning as the optimal control of incompletely known Markov decision processes, based on principles from dynamical systems theory. The core concept is to capture the most significant components of the real problem that a learning agent faces when interacting with its environment over time in order to reach a goal. A learning agent must be able to sense the state of its surroundings and take actions that alter it to some extent. The agent must also have a goal or goals relating to the environment. Markov decision processes are designed to incorporate only these three characteristics (sensation, action, and goal) in the most basic form feasible, without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method.
Reinforcement learning differs from supervised learning, which is the type of learning studied in the majority of contemporary machine learning research. Learning from a training collection of labeled examples provided by a knowledgeable external supervisor is known as supervised learning.
Each example provides a description of a situation and a specification, the label, of the correct action the system should take in that case, which is often to identify a category to which the situation belongs. This type of learning aims for the system to extrapolate or generalize its answers so that it can function appropriately in scenarios that are not in the training set. Although this is an essential type of learning, it is insufficient for learning via interaction. In interactive problems, it is difficult to find instances of desired behavior that are both correct and representative of all the contexts in which the agent must act. An agent must be able to learn from its own experience in unexplored regions, where learning would be most beneficial.
Unsupervised learning, which is typically about finding structure hidden in collections of unlabeled data, is not the same as reinforcement learning either. The terms supervised learning and unsupervised learning appear to categorize machine learning paradigms comprehensively, but they do not. Although it is tempting to think of reinforcement learning as a form of unsupervised learning because it does not rely on examples of correct behavior, reinforcement learning focuses on maximizing a reward signal rather than searching for hidden structure. Finding structure in an agent's experience can be helpful in reinforcement learning, but it does not solve the problem of maximizing a reward signal on its own. Therefore, researchers consider reinforcement learning to be a third machine learning paradigm, alongside supervised learning and unsupervised learning, and perhaps other paradigms.

3.3.1 Elements of reinforcement learning

A reinforcement learning system has five main components in addition to the agent and the environment: a state space, a policy, a reward signal, a value function, and exploration and exploitation.
1. Agent: The agent is the learner or decision-making entity; it might be a software program, a robot, or any system that can sense and act on its surrounding environment, and it takes actions depending on its present state and the feedback obtained. The agent maximizes cumulative rewards by choosing the optimal actions depending on its state.
2. Environment: The environment represents the external system or problem space with which the agent interacts; it can be a physical world, a simulated environment, a game, or any other scenario where the agent operates. The environment provides feedback to the agent through rewards or penalties based on its actions.
3. State: A state refers to the current condition or representation of the environment, capturing the relevant information about the agent's situation at a given time and helping the agent make decisions. States can be discrete (e.g., game board configurations) or continuous (e.g., sensor readings in robotics).
4. Policy: The policy is the agent's strategy for selecting actions based on states, and it guides the agent's decision-making process. The policy can be deterministic (selecting a single action) or stochastic (selecting actions probabilistically).
5. Reward: The reward is the feedback signal that indicates the desirability or quality of the agent's actions. Rewards can be positive or negative, are used to guide the learning process, and serve as a measure of immediate or long-term success.
Rewards drive the agent's learning process, as the agent aims to maximize the cumulative reward over time.
6. Value function: The value function estimates the agent's expected cumulative reward from a given state or state-action pair. It guides the agent's decision-making by providing a measure of the desirability of different actions or states, helping the agent prioritize actions that lead to higher rewards.
7. Exploration and Exploitation: The agent needs to strike a balance between exploring the environment to discover new actions that might lead to higher rewards and exploiting its current knowledge to take actions that have yielded high rewards in the past.
Consequently, RL algorithms aim to find an optimal policy that maximizes the expected cumulative reward over the long term. This is often achieved through the use of iterative learning algorithms, such as Q-learning, policy gradients, or actor-critic methods. These algorithms update the agent's policy and value function based on observed rewards and experiences.

3.3.2 Q-learning

Q-learning [7] is a model-free reinforcement learning algorithm in which an agent transitions from one state to another by taking actions. A set of states S_t and a set of actions A_t define the learning space. By performing action A_t and moving to another state S_{t+1}, a reward function calculates a numeric value for taking such a state-action pair and records it in a Q-table, which is initialized with zero values. The agent reaches a particular goal state by repeatedly taking actions from its current position.

FIGURE 5. Q-learning.

The Q-table values get updated at each step and, after many iterations, eventually converge. Q-learning's goal is to maximize the total reward over the state-action pairs taken from the beginning until reaching the goal state, following the so-called optimal policy π. The optimal policy π indicates which action is best to take in each state, which results in a maximized overall gain. Q-learning has been widely used in UAV-related research recently. It belongs to the class of Temporal Difference (TD) learning methods and is widely used for solving Markov decision processes (MDPs). The pseudo-code of the Q-learning algorithm and its different steps is given in Algorithm 1:

Algorithm 1 Q-learning
Initialize the Q-table with arbitrary initial values or zeros
Repeat
  Choose action A[t] based on the current state S[t] (using an exploration-exploitation strategy)
  Take action A[t], observe the next state S[t+1] and the immediate reward R[t]
  Update the Q-value: Q(S[t], A[t]) ← Q(S[t], A[t]) + α · (R[t] + γ · max_a Q(S[t+1], a) − Q(S[t], A[t]))
Until a convergence criterion is met
Extract the policy: select the action A[t]* with the highest Q-value for each state S[t]

3.4 Background: Q-learning Based Protocols

3.4.1 Protocol 1: Adaptive UAV-Assisted Geographic Routing with Q-Learning in VANET

In this approach, the authors in [20] proposed a Q-learning-based Adaptive Geographic Routing (QAGR) system in Vehicular Ad hoc Networks (VANETs) assisted by an unmanned aerial vehicle (UAV), which is divided into two components. In the aerial component, the global routing path is calculated by a fuzzy-logic and depth-first-search (DFS) algorithm using the information collected by the UAV, such as the global road traffic, and is then forwarded to the requesting ground vehicle.
In the ground component, each vehicle maintains a fixed-size Q-table, converged with a well-designed reward function, and forwards the routing request to the optimal node by looking up the Q-table filtered according to the global routing path. The QAGR algorithm is used to improve the convergence speed and resource utilization of geographic routing approaches in VANETs. UAVs are deployed to guide the transmission path, and the Q-learning algorithm is used to help each node choose the best next hop in a specific area. The simulation results show that QAGR performs better than other approaches in packet delivery and end-to-end delay.

FIGURE 6. Used algorithm.

3.4.2 Protocol 2: Learning to Rest: A Q-Learning Approach to Flying Base Station Trajectory Design with Landing Spots

In this approach, the authors in [21] used a Q-learning algorithm to make movement decisions for the UAV, maximizing the data collected from the ground users while minimizing power consumption by exploiting landing spots. The UAV movement decisions are made based on the drone's current state, i.e., its position and battery content. While landing spots offer the possibility to conserve energy, the UAV might have to sacrifice some users' Quality of Service (QoS). The advantages of this application are that (i) the presented system can utilize landing spots (LSs) efficiently to extend the mission duration and (ii) it maximizes the sum rate of the transmission without using a model or any prior information about the environment (see Fig. 7).

FIGURE 7. Motivating scenario.

3.4.3 Protocol 3: Reinforcement Learning for Decentralized Trajectory Design in Cellular UAV Networks With Sense-and-Send Protocol

In this approach, the authors in [22] introduced a novel RL-based framework for decentralized trajectory design in cellular UAV networks with a sense-and-send protocol. UAVs are equipped with sensors to collect data from the environment and then send it to the ground station. The trajectory of each UAV is optimized independently using an RL algorithm, which considers the current state of the UAV, including its position, velocity, and remaining battery life. This proposed approach aims to maximize the amount of data collected and transmitted while minimizing energy consumption. The simulation results show that the proposed approach outperforms traditional methods in terms of data collection efficiency and energy consumption (see Fig. 8).

FIGURE 8. Motivating scenario.

3.4.4 Protocol 4: Visual Exploration and Energy-aware Path Planning via Reinforcement Learning

In this approach, the authors in [23] proposed a deep RL approach that combines the effects of energy consumption and the object detection modules to develop a policy for object detection in large areas with limited battery life. The learning model enables dynamic learning of the negative rewards of each action based on the drag forces resulting from the UAV's motion with respect to the wind field. The proposed architecture shows promising results in detecting more goal objects than traditional coverage path planning algorithms, especially at moderate and high wind intensities (see Fig. 9).

FIGURE 9. Motivating scenario.
3.4.5 Protocol 5: Minimizing Packet Expiration Loss With Path Planning in UAV-Assisted Data Sensing

In this approach, the authors in [24] proposed a UAV trajectory planning model based on deep RL for data collection, minimizing the number of expired data packets in the whole sensor system and then, due to the complex constraints, relaxing the intractable original problem into a min-max-AoI optimal path scheme (see Fig. 10).

FIGURE 10. Motivating scenario.

3.5 Conclusion

Q-learning is a well-known AI-based algorithm, and it is used in several fields. UAVs can control their behavior based on the Q-learning algorithm, and Q-learning can be implemented in different ways depending on the environment and the mission to accomplish. In this chapter, we presented a comparison of the five most relevant Q-learning-based protocols in Table 1.

              Protocol 1     Protocol 2     Protocol 3     Protocol 4     Protocol 5
ST            NS-3, SUMO     Not specified  MATLAB         AirSim         MATLAB
Env           4000 m²        1000 m²        Not specified  630 m²         1000 m²
Run Time      300 s          Not specified  Not specified  100 episodes   Not specified
UAV Number    25             Multiple       3              1              1
IoT Number    80-400         Not specified  Not specified  Not specified  5-12
Height        Not specified  40 m           50-150 m       Variable       50 m
EC            Not specified  400 Watt       Not specified  Not specified  Not specified
Speed         Not specified  60 km/h        Not specified  Not specified  20 m/s
Reward        Not defined    Yes            Yes            Yes            Not defined
ST: Simulation Tool, Env: Environment, EC: Energy Consumption.
TABLE 1. Table of comparison.

4. Third Chapter

4.1 Introduction

The number of IoT devices is increasing, and one of the challenges these devices face is limited battery life. We consider recharging them, since an IoT device could be located in an isolated place where electricity sources may not exist, such as on a mountain, in the middle of a forest, or in the middle of the sea. Therefore, we propose using a UAV as a data collector and flying energy source to collect data from and wirelessly recharge these IoT devices, because UAVs are flexible and can fly in any direction with high precision. A UAV can also move faster than a terrestrial vehicle, and since we use a flying vehicle, the terrestrial nature of the environment is not a challenge, whether it is the middle of the sea or a rocky surface.
With the increasing number of IoT devices and the changing environment, controlling UAVs manually is not a good solution and wastes a lot of time. Therefore, we implement an algorithm called Q-learning in the UAV so that it no longer needs to be controlled by humans; with this algorithm, the UAV learns by itself how to behave and which action to take in every environment.
In this work, the main goal is to collect the maximum amount of data, transfer the maximum amount of energy from and to the IoT devices, and minimize the number of actions performed by the UAV to transfer energy. To this end, we implement our protocol based on the Q-learning algorithm mentioned above to help the UAV control itself without human involvement, so that the UAV learns by itself which action to take in each state; after several episodes, the UAV will have complete knowledge of the environment, and its actions will become more efficient.

4.2 Protocol Description

The contribution of our protocol is to deploy a single UAV as a data collector and flying energy source to wirelessly collect data from and transfer energy to the IoT devices, which are located randomly on a 3x3 grid (9 cells).
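As a concrete illustration of this setting, the following minimal sketch shows one way the 3x3 grid and the random placement of IoT devices could be represented in Python; the number of devices and the random seed are illustrative assumptions, not the values used in our simulation.

import numpy as np

GRID_SIZE = 3                 # 3x3 grid, i.e., 9 cells and therefore 9 states
NUM_DEVICES = 5               # illustrative number of IoT devices (assumption)

rng = np.random.default_rng(seed=0)

def cell_to_state(row, col):
    """Map a grid cell (row, col) to a state index in [0, 8]."""
    return row * GRID_SIZE + col

# Place the IoT devices uniformly at random on the grid; several devices may share a cell.
device_cells = rng.integers(0, GRID_SIZE, size=(NUM_DEVICES, 2))
device_states = [cell_to_state(r, c) for r, c in device_cells]
print("IoT device states:", device_states)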
Since our key goal is to minimize the energy consumed by the UAV, we use a Q-learning algorithm with a specific set of actions (forward, left, right, backward). The UAV runs through a specific number of episodes, and in each episode it takes a limited number of steps to move in the environment. The IoT devices remain in their positions; the algorithm and the calculations run on a base station, and this base station tells the UAV which action to take at each step. The Q-learning algorithm consists of four main parts: the q-table, where the q-values are stored; the q-function, which is used to compute the q-values; the actions, since the UAV is allowed to take one of the four actions at each step; and the reward, a value the UAV receives after taking an action in a given state, which is then used in the q-function. In the Q-learning process, the UAV runs through episodes, and in each episode it takes a number of steps to move around the environment. The UAV may take one of the four available actions or, in some cases, remain in its position. We set the reward function according to the behavior we want the UAV to follow. The q-table has a size of 9x4, where 9 is the number of states and 4 is the number of actions. The q-value of each action at each step is saved in this table at the coordinates (state, action). There are two ways the UAV can take an action: (i) it may take a random action from the four available actions, or (ii) it may choose an action according to the q-table, where the optimal action in each state is the one with the maximum q-value for that state. These two ways of choosing an action are called Exploration and Exploitation, which were introduced in the second chapter. At the beginning of the process, we set a variable called Epsilon to 1, and after each episode, Epsilon decreases by a small decimal value. At the beginning of each step, the algorithm draws a random decimal number between 0 and 1. If this random number is less than Epsilon, the algorithm chooses a random action, which is Exploration; if the number is greater than Epsilon, the algorithm chooses an action according to the q-table, which is Exploitation. The q-table is initialized with zeros, and since Epsilon is equal to 1 in the first episode and the random number lies between 0 and 1, at each step of the first episode the algorithm almost always (about 99% of the time) picks a random action and explores the environment. As Epsilon decreases, the probability of Exploration decreases and the probability of Exploitation increases, until Epsilon drops below 0, at which point the algorithm is in 100% Exploitation. Epsilon is called the exploration rate. The speed of the transition from the Exploration phase to the 100% Exploitation phase depends on the epsilon decay rate. For instance, if we set the decay rate of ϵ to 0.1, then after 11 episodes ϵ drops below 0, and the algorithm no longer explores the environment but relies only on Exploitation. Eleven episodes may not be enough for the UAV to explore the whole environment, especially if it is vast.
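To illustrate this exploration-exploitation mechanism, the following minimal Python sketch shows an epsilon-greedy action choice with a decaying Epsilon. The decay value and the structure of the loop are illustrative placeholders, not the exact values or code used in our simulation.

```python
import numpy as np

rng = np.random.default_rng()

def choose_action(q_table, state, epsilon):
    """Epsilon-greedy choice: explore with probability epsilon, otherwise exploit the Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # Exploration: a random action index
    return int(np.argmax(q_table[state]))            # Exploitation: action with the maximum q-value

# Epsilon starts at 1 and shrinks a little after every episode (decay value is illustrative).
epsilon = 1.0
epsilon_decay = 0.001
for episode in range(1000):
    # ... run the steps of this episode, calling choose_action(q_table, state, epsilon) ...
    epsilon = max(0.0, epsilon - epsilon_decay)
```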
Therefore, we try to balance the number of episodes and the epsilon decay rate to give the UAV enough episodes to explore the environment. While the UAV is learning and exploring the environment, some actions may, in some cases, take it out of the environment. In this situation, the UAV remains in its position and, instead of receiving a normal reward, it gets a penalty for this action so that it avoids it next time. The q-values are calculated using the Q-function (see the update step of Algorithm 2), and the q-table is updated at each step. Q(State_n, Action_k) is the q-value of action k in state n, R is the reward value, α is the learning rate, a decimal number between 0 and 1 that controls how much the q-value changes at each update (a high α means a large change), and γ is the discount factor, which controls how important the next reward is compared with the current one (a high γ value gives the next reward more weight).

4.3 UAV Recharging Architecture

Our proposed protocol consists of two main stages: (i) the Charging Process (CP), from the UAV to the IoT devices, and (ii) the Data Collection (DC) process, from the IoT devices to the UAV. In the first stage, the UAV is equipped with directional antennas to simultaneously transfer information and energy to the IoT devices based on SWIPT technology. In the Data Collection process, the IoT devices transmit their collected data to the UAV over the uplink signal.

4.4 Simulation Setup

| Notation      | Description                 |
| Device name   | DESKTOP-PC                  |
| Processor     | Intel(R) Core i3-6000U CPU  |
| RAM           | 4 GB                        |
| System Type   | 64-bit operating system     |
| Linux edition | 18.04                       |
| Python        | Python 3.6                  |
| AirSim        | AirSim 2017                 |
| Unreal Engine | Unreal Engine 4             |

TABLE 2. List of simulation setup.

4.4.1 AirSim

AirSim is an open-source simulator developed by Microsoft for autonomous vehicles, including drones. It provides a realistic virtual environment for testing and developing algorithms and control systems for aerial vehicles. AirSim offers a variety of features, such as realistic physics simulations, sensor emulation (e.g., cameras, Lidar), and APIs for interacting with the simulated environment. The goal of developing AirSim is to provide a platform for AI research, where deep learning, computer vision, and reinforcement learning algorithms for autonomous vehicles can be experimented with. For this purpose, AirSim also exposes APIs to retrieve data and control vehicles in a platform-independent way [25].

FIGURE 11. AirSim [25].

4.4.2 Unreal Engine

Unreal Engine 4 (UE4) combined with AirSim forms a robust framework for developing and testing autonomous systems, including drones. Unreal Engine 4 is a widely used game engine developed by Epic Games, renowned for its advanced graphics, physics simulations, and expansive toolset. The integration of Unreal Engine 4 with AirSim allows us to leverage the benefits of both platforms [26].

FIGURE 12. Unreal Engine 4 [26].
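To give an idea of how the AirSim Python API described above is typically used, here is a minimal connection-and-flight sketch. It assumes the standard airsim Python package from [25]; the waypoint values are illustrative, and the exact calls may differ slightly between AirSim versions.

```python
import airsim

# Connect to the AirSim simulator running inside Unreal Engine 4.
client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

# Take off, then fly to an illustrative waypoint.
# AirSim uses NED coordinates, so a negative z means flying above the ground.
client.takeoffAsync().join()
client.moveToPositionAsync(5, 0, -10, 3).join()   # x, y, z in metres, speed in m/s

# Land and release API control at the end of the episode.
client.landAsync().join()
client.armDisarm(False)
client.enableApiControl(False)
```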
4.5 Used Algorithm

Our code is written in Python 3. Python is a popular programming language in many fields, such as AI, data science, and web development, and it is an interpreted scripting language. Our code contains a main class and a main program. We use the NumPy library to manipulate matrices, generate random numbers, and compute some mathematical functions, and the Matplotlib library to plot the result graphs. We then create our Q-table and initialize it with zeros, and we set the number of steps, the number of episodes, Epsilon, the learning rate alpha, and the discount factor (see Table 3).

| Parameter          | Value |
| Environment        | env   |
| Q-table            | 9×4   |
| Number of episodes | 1000  |
| Number of steps    | 100   |
| Epsilon            | 1     |
| Learning rate      | 0.3   |
| Discount factor    | 0.9   |

TABLE 3. Simulation parameters.

Algorithm 2 Used algorithm
  Initialize the UAV location.
  Initialize the environment (four IoT devices).
  Initialize the Q-table with arbitrary initial values or zeros.
  Set the Q-learning parameters (epsilon ϵ, the learning rate α, and the discount factor γ).
  Connect to AirSim.
  Repeat for each episode:
      Initialize the current state S[t].
      Repeat for each step of the episode:
          Choose A[t] for the current state S[t] using the policy derived from Q (ϵ-greedy, an exploration-exploitation strategy).
          Take action A[t]; observe the next state S[t+1] and the immediate reward R[t].
          Update the Q-value: Q(S[t], A[t]) ← Q(S[t], A[t]) + α · (R[t] + γ · max_a Q(S[t+1], a) − Q(S[t], A[t]))
  Until the convergence criterion is met.
  Extract the policy: select the action A[t]* with the highest Q-value for each state S[t].
  Show the plots.

(A minimal Python sketch of this update rule is given at the end of this chapter.)

4.6 Results

4.6.1 Reward discussion

These two figures, Figure 13 and Figure 14, present the reward of each episode in two different scenarios (with and without obstacles). Each graph starts with low values; after several episodes, the reward increases and converges. This is because the algorithm shifts from exploration to exploitation, and once it reaches 100% exploitation, the UAV has learned everything about the environment and only chooses the optimal actions. That is why the reward becomes stable after about 100 episodes.

FIGURE 13. The reward result for scenario one.
FIGURE 14. The reward result for scenario two.

4.6.2 Energy discussion

In both scenarios, we assume that the UAV has 100% of its energy at the beginning of each episode and that it returns to the base station if its battery reaches 20% or less. On the other hand, we assume that the IoT devices have a 20% battery level at the beginning of each episode. We also assume that the UAV's battery decreases by 0.5% at every step it takes, and by 10% when the UAV reaches an IoT device, which is a terminal state. Figure 15 and Figure 16 present the energy consumed by the UAV in each episode. At the beginning, the UAV consumes a lot of energy because it is still learning and wastes energy moving around, but once it has learned enough, the consumed energy converges.

FIGURE 15. The consumed energy for each episode in scenario one.
FIGURE 16. The consumed energy for each episode in scenario two.

4.6.3 Transferred energy discussion

As we can see in both Figure 17 and Figure 18, the energy transferred to the IoT devices is low at the beginning and then increases. This is because the UAV is still learning and does not yet know the locations of the IoT devices; once it has learned enough, it minimizes its actions and transfers the largest possible amount of energy.

FIGURE 17. The transferred energy result for scenario one.
FIGURE 18. The transferred energy result for scenario two.

4.7 Conclusion

In this chapter, we introduced our case study, which uses a flying UAV as an energy source to charge a set of IoT devices and deliver the maximum energy to them, by implementing a Q-learning algorithm that optimizes the UAV's movement and energy consumption over a specific number of episodes. We also described the UAV recharging architecture that we implemented in the UAV.
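As a companion to Algorithm 2 and to the result plots discussed above, the following minimal Python sketch shows the Q-value update step and a helper for plotting the per-episode reward. The function names are hypothetical and the sketch is not taken from our actual code; only the α and γ values match Table 3.

```python
import numpy as np
import matplotlib.pyplot as plt

ALPHA = 0.3   # learning rate, as in Table 3
GAMMA = 0.9   # discount factor, as in Table 3

def update_q(q_table, state, action, reward, next_state):
    """One application of the update rule of Algorithm 2:
    Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))."""
    td_target = reward + GAMMA * np.max(q_table[next_state])
    q_table[state, action] += ALPHA * (td_target - q_table[state, action])

def plot_rewards(episode_rewards):
    """Plot the total reward collected in each episode, as in Figures 13 and 14."""
    plt.plot(episode_rewards)
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.show()
```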
General Conclusion

In this thesis, we introduced the concept of using a UAV in two roles at once, as an aerial data collector and as a flying energy source that can wirelessly collect information from and transfer power to and from terrestrial IoT devices. This concept allows the UAV to navigate autonomously by giving it the ability to learn, control its behavior, and optimize the energy consumption of its battery. With the help of the Q-learning-based algorithm, and with only limited prior knowledge of the environment and the available actions, the UAV can learn and improve its behavior through a trial-and-error policy: it explores the unknown environment in depth, collecting as many rewards as it can and receiving a punishment whenever it takes a wrong step, without letting its own battery run out. This helps the UAV develop a learning policy that it can reuse in the future. To keep the study simple and illustrate the use of Q-learning in flying energy sources, we observed how the UAV performed in the scenarios we proposed, and according to the results, we consider its performance efficient. We therefore plan to add new concepts and address the remaining problems as future work.

As for future work, this study used only one agent, so in upcoming work we plan to change that and use multiple agents at the same time. Moreover, the energy consumption problem is not yet fully solved, so we expect to adopt new ways to optimize energy consumption, such as Deep Q-learning (DQN), which combines Q-learning with a neural network. Another goal for our upcoming studies is to shift from a discrete environment to an open (continuous) environment.

References

[1] Qualcomm Inc. Intelligently connecting our world in the 5G era. [Online]. 2020. URL: https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/%20intelligently_connecting_our_world_in_the_5g_era_web_1.pdf/.
[2] Tharindu D. Ponnimbaduge Perera et al. "Simultaneous wireless information and power transfer (SWIPT): Recent advances and future challenges." In: IEEE Communications Surveys & Tutorials 20.1 (2017), pp. 264-302.
[3] Sayed Amir Hoseini et al. "Trajectory optimization of flying energy sources using Q-learning to recharge hotspot UAVs." In: IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2020, pp. 683-688.
[4] Javad Shahmoradi et al. "A comprehensive review of applications of drone technology in the mining industry." In: Drones 4.3 (2020), p. 34.
[5] Flavio Araripe d'Oliveira, Francisco Cristovão Lourenço de Melo and Tessaleno Campos Devezas. "High-altitude platforms - Present situation and technology trends." In: Journal of Aerospace Technology and Management 8 (2016), pp. 249-262.
[6] Naser Hossein Motlagh, Tarik Taleb and Osama Arouk. "Low-altitude unmanned aerial vehicles-based internet of things services: Comprehensive survey and future perspectives." In: IEEE Internet of Things Journal 3.6 (2016), pp. 899-922.
[7] Christopher J. C. H. Watkins and Peter Dayan. "Q-learning." In: Machine Learning 8 (1992), pp. 279-292.
[8] Kevin Ashton et al. "That 'internet of things' thing." In: RFID Journal 22.7 (2009), pp. 97-114.
[9] Debasis Bandyopadhyay and Jaydip Sen. "Internet of things: Applications and challenges in technology and standardization." In: Wireless Personal Communications 58 (2011), pp. 49-69.
[10] Ying Zhang.
"Technology framework of the Internet of Things and its application." In: Proceedings of the IEEE International Conference on Electrical and Control Engineering. IEEE, 2011, pp. 4109-4112.
[11] Guicheng Shen and Bingwu Liu. "The visions, technologies, applications and security issues of Internet of Things." In: Proceedings of the IEEE International Conference on E-Business and E-Government (ICEE). IEEE, 2011, pp. 1-4.
[12] Miao Wu et al. "Research on the architecture of Internet of Things." In: Proceedings of the IEEE International Conference on Advanced Computer Theory and Engineering (ICACTE). Vol. 5. IEEE, 2010, pp. V5-484.
[13] Thaisa A. Baldo et al. "Wearable and biodegradable sensors for clinical and environmental applications." In: ACS Applied Electronic Materials 3.1 (2020), pp. 68-100.
[14] CamCode Inc. Durable Asset Tracking Labels and Services. [Online]. 2023. URL: https://www.camcode.com/blog/what-is-asset-tracking/.
[15] Felipe Caro and Ramin Sadr. "The Internet of Things (IoT) in retail: Bridging supply and demand." In: Business Horizons 62.1 (2019), pp. 47-54.
[16] David Hodgkinson and Rebecca Johnston. "The future of drones: Unmanned Aircraft and the Future of Aviation." May 2018, pp. 111-131. ISBN: 9781351332323. DOI: 10.4324/9781351332323-6.
[17] Xiaolin Jia, Quanyuan Feng and Chengzhen Ma. "An efficient anti-collision protocol for RFID tag identification." In: IEEE Communications Letters 14.11 (2010), pp. 1014-1016.
[18] Richard S. Sutton, Andrew G. Barto et al. Introduction to Reinforcement Learning. Vol. 135. MIT Press, Cambridge, 1998.
[19] Robert S. Woodworth and Harold Schlosberg. Experimental Psychology. New York: Henry Holt and Company, 1938.
[20] Shanshan Jiang, Zhitong Huang and Yuefeng Ji. "Adaptive UAV-assisted geographic routing with Q-learning in VANET." In: IEEE Communications Letters 25.4 (2020), pp. 1358-1362.
[21] Harald Bayerlein, Rajeev Gangula and David Gesbert. "Learning to rest: A Q-learning approach to flying base station trajectory design with landing spots." In: 2018 52nd Asilomar Conference on Signals, Systems, and Computers. IEEE, 2018, pp. 724-728.
[22] Jingzhi Hu, Hongliang Zhang and Lingyang Song. "Reinforcement learning for decentralized trajectory design in cellular UAV networks with sense-and-send protocol." In: IEEE Internet of Things Journal 6.4 (2018), pp. 6177-6189.
[23] Amir Niaraki, Jeremy Roghair and Ali Jannesari. "Visual exploration and energy-aware path planning via reinforcement learning." In: arXiv preprint arXiv:1909.12217 (2019).
[24] Wanyi Li, Li Wang and Aiguo Fei. "Minimizing packet expiration loss with path planning in UAV-assisted data sensing." In: IEEE Wireless Communications Letters 8.6 (2019), pp. 1520-1523.
[25] Shital Shah et al. "AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles." In: Field and Service Robotics. 2017. eprint: arXiv:1705.05065. URL: https://arxiv.org/abs/1705.05065.
[26] PV Satheesh. Unreal Engine 4 Game Development Essentials. Packt Publishing Ltd, 2016.