1. Introduction

Since the early days of computer gaming and the first Nintendo and Atari systems, video game production has come a long way, with the primary objective being to amuse both children and adults [1]. The days of pixelated graphics and sparse audio are a thing of the past; modern video games are more lifelike than ever. This has made digital gaming so alluring that it has become a momentary haven from the stresses of the real world for many people. But what precisely contributed to this significant growth in game development? The progression of video game graphics, gameplay, and even narratives may be a simple explanation, but there is something more that every gamer looks for when a new game hits the shelves: how true to life the artificial intelligence in that game is.

The study of Artificial Intelligence (AI) in video games has long been a research topic. It examines ways to employ AI technology to play games at a level of performance comparable to humans and is a key topic in many games. It would be difficult for a game to give the player an immersive experience without these AI-powered interactive experiences, which are typically created by Non-Player Characters (NPCs) or even enemies that act cleverly or creatively, as if they were controlled by a human gamer or were acting with a mind of their own [2].

In the proposed Final Degree Project, a machine learning technique known as Reinforcement Learning is used to create a 3D Maze game with the Unity 3D engine. The approach includes an agent that mimics human player behaviour and continuously improves it based on past game experience until it achieves a fully optimised score that is unbeatable by average play.

Furthermore, the term "crunch" refers to severe overtime that lasts for a long time in the professional video game development industry. This practice may cause developers to make less-than-ideal decisions during development. "Crunch" may push developers to take shortcuts which, if they employ machine learning in a video game, can result in machine learning algorithms that are not properly optimised [3]. For this reason, it is important to investigate how machine learning algorithms perform in contexts for which they have not been tailored.

The six chapters that make up this project report are organised as follows. Chapter 2 covers the background themes needed to understand the subject of this report in greater detail. Chapter 3 reviews the related literature on deep reinforcement learning in game environments. Chapter 4 explains how the environments used for training and evaluation are set up and how the agents are trained, covering two agent variants whose sole difference is how they collect observations from their surroundings. Chapter 5 presents and analyses the training and evaluation results. Chapter 6 gives the concluding remarks on game development using Reinforcement Learning.

2. Background work

This chapter discusses several topics that provide the background knowledge necessary to understand the development of this 3D Maze game. The following subjects pertain to the game development environment: machine learning, deep learning, reinforcement learning, deep reinforcement learning, and Unity 3D.
2.1 Machine Learning

Machine learning is a field that focuses on two problems, according to M. I. Jordan and T. M. Mitchell [4]: "How can one create computer systems that automatically develop via experience?" and "What basic statistical laws of computational information theory underpin all learning systems, including those in computers, people, and organisations?" These days, machine learning is applied in data processing, speech recognition, natural language processing, robot control, and other fields. The abundance of diverse use cases for the technology has fuelled the development of various forms of machine learning. Many machine learning algorithms are employed to solve function approximation problems; their goal is to improve a function in a way that increases its accuracy.

Three main machine learning paradigms exist: supervised, unsupervised, and reinforcement learning. Of these three paradigms, reinforcement learning is the focus of this work because it is the paradigm most frequently employed in the study of machine learning through video games. Figure 1 shows the different types of machine learning paradigms used in AI.

Figure 1: Machine learning types [5]

Machine learning algorithms can be provided with a large amount of training data from which to learn. When this data is provided as labelled input-output pairs of training samples, the paradigm that applies is supervised learning [6]. Unsupervised learning is the paradigm that applies when the data is not labelled. Reinforcement learning is discussed in more detail later in this chapter.

2.2 Deep Learning

Deep learning is a type of representation learning. Representation learning techniques are techniques that do not require their training data to be in input-output pairs: when the training data is provided in an unprocessed format, the algorithm can classify it automatically. Deep learning arises when such representation learning techniques are layered on top of one another. A significant and fascinating feature of deep learning is that these representation layers are not designed by humans; they emerge from the data through the learning process. IBM provides an additional explanation of deep learning [7]: "Deep learning is a subset of machine learning, which is essentially a neural network with three or more layers." These neural networks are designed to mimic the functioning of the human brain, which enables the algorithm to learn from the data that is fed to it. From these descriptions it can be deduced that, while deep learning has also been combined with semi-supervised learning, traditional machine learning algorithms developed using deep learning fall within the unsupervised paradigm.

2.3 Reinforcement Learning

Reinforcement learning is a machine learning paradigm alongside supervised and unsupervised learning, according to the book Reinforcement Learning: An Introduction [8]. It can be summed up as a computational method of interaction-based learning. Because it lacks prior knowledge of the task at hand, the learning system must rely on trial and error to determine the course of action that optimises its rewards.
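To make this trial-and-error loop concrete, the following is a minimal, illustrative sketch of tabular Q-learning in Python. It is not the approach used in this project; the environment interface (reset, step, actions) and the hyperparameter values are assumptions chosen only for illustration.

```python
import random
from collections import defaultdict

def train_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Illustrative tabular Q-learning: the agent learns action values purely
    from trial and error, using the reward signal returned by the environment.
    `env` is assumed to expose reset() -> state, step(action) ->
    (next_state, reward, done), and a list of discrete actions `env.actions`."""
    q = defaultdict(float)  # Q[(state, action)] -> estimated return

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Exploration vs. exploitation: sometimes try a random action,
            # otherwise pick the action currently believed to be best.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)

            # Move the estimate towards reward + discounted future value.
            best_next = max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

The learned table then acts as a simple policy: in each state, the agent picks the action with the highest estimated value.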
Furthermore, the objectives of reinforcement learning differ from those of unsupervised learning. Unsupervised learning aims to detect patterns and similarities between data points, whereas reinforcement learning seeks to find an optimal action model that maximises the agent's cumulative reward over time. Figure 2 depicts the action-reward feedback loop of a generic reinforcement learning model.

The trade-off between trying out novel action sequences and reusing an action sequence the agent has previously learnt is one of the most difficult aspects of implementing reinforcement learning. The agent cannot discover novel and improved solutions if it does not experiment with different sequences, but it also cannot achieve a steady outcome if it tries something new every time. This conflict between exploiting existing action sequences and exploring new ones is a key feature that sets reinforcement learning apart from the other two main machine learning paradigms.

Figure 2: Reinforcement process in games

A policy defines the agent's behaviour: it maps environmental inputs (states) to actions. The core of the reinforcement learning problem is the reward signal. The environment sends the agent a single number at each step of training, and the agent seeks to maximise this number as its reward. For this reason, the agent should not be able to influence how the environment generates the reward signal. Since a game acts as the agent's environment and provides a fixed set of actions for it to perform, reinforcement learning is a good fit for machine learning studies involving video games. The Arcade Learning Environment [9] and the Unity ML-Agents toolkit [10] are two instances of reinforcement learning being applied to video games. Both systems can also use auxiliary learning techniques, such as imitation learning, in addition to reinforcement learning.

2.4 Unity

Unity, formerly known as Unity3D, is a popular game engine used by both professional and beginner developers. One factor contributing to its popularity is Unity's licence: anyone can use Unity under the free Personal licence as long as their funding or revenue is below $100,000 US, and only above that threshold must users upgrade from the Personal licence [11]. Unity has become increasingly well known thanks to this, as well as the vast gallery of assets available on the Asset Store, the abundance of tutorials, and other support resources. Although its primary application is as a game engine, Unity is widely used in other fields, ranging from architecture to film, animation, and industry. In addition, Unity is used as a research environment; it is particularly well known for its work in machine learning, augmented reality, and virtual reality. Research work with reinforcement learning is conducted through Unity Labs.

2.5 ML-Agents Toolkit

The Unity Machine Learning Agents (ML-Agents) toolkit is open-source functionality developed by Unity [10]. The toolkit includes a Python package consisting of two parts. The first is a low-level API that allows direct interaction with the Unity environment. The second is an entry point that makes it possible to train agents in the Unity environment [12].
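As an illustration of the low-level Python API, the sketch below connects to a built Unity environment and steps it with random actions. It follows the API as documented for the ML-Agents release used in this work (release 18); the environment file name and the use of random actions are assumptions made purely for illustration and do not reflect the training setup of this project.

```python
from mlagents_envs.environment import UnityEnvironment

# Connect to a built Unity environment (the file name is a placeholder).
env = UnityEnvironment(file_name="MazeEnvironment")
env.reset()

# Each Agent in the scene registers itself under a behaviour name.
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(100):
    # Agents currently requesting a decision vs. agents whose episode just ended.
    decision_steps, terminal_steps = env.get_steps(behavior_name)

    # For illustration only: send random actions drawn from the behaviour's action spec.
    action = spec.action_spec.random_action(len(decision_steps))
    env.set_actions(behavior_name, action)
    env.step()

env.close()
```

In practice, training is typically launched through the toolkit's entry point (the mlagents-learn command) rather than by driving the environment manually in this way.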
There are three main components in the ML-Agents toolkit:

1. Academy: manages and coordinates the different parts of the learning game environment. It also provides the interface between the game environment and the Python API for training the agent in that environment.

2. Agent: the main component of the toolkit, attached to a GameObject in Unity, and the entity that is trained in the game environment. The agents in a Unity build are assigned responsibilities according to the environment scenario.

3. Sensor: a component that collects observations from the game for the Agent. These observations are fed to a policy, which is identified by a behaviour name in the game environment.

The hyperparameters used in this work combine the universal hyperparameters with those specific to the Proximal Policy Optimization (PPO) method. The main hyperparameters used in this work are the learning rate, batch size, buffer size, beta, and number of epochs. These hyperparameters are present in the ML-Agents toolkit when either the SAC or the PPO algorithm is used; beta, epsilon, and the number of epochs are specific to the PPO algorithm.

3. Literature Review

I reviewed recent literature on deep reinforcement learning for training agents in game environments.

By using a Dynamic Difficulty Adjustment (DDA) methodology and approaching the balance problem as a meta-game within the context of Reinforcement Learning (RL), the research article [13] presents a novel approach to automatic video game balancing. The study expands standard reinforcement learning by using an agent as a game master and a meta-environment in which state transitions depend on the progression of a base environment. In addition, the suggested multi-agent training model has the game master interacting with several agent opponents, each of whom has a different playing style and skill level in the base game. The experiments, conducted in adaptive grid-world environments for single-player and multiplayer games, produce thorough findings that assess how the game master makes decisions in line with the balancing objectives. The research presents a framework for autonomous game balancing, with a focus on modulating reward functions, action spaces, and balance-space states. The conclusion expands the scope of DDA applications, especially to serious games, where the smooth incorporation of dynamic difficulty adjustment may improve motivation and engagement in therapy sessions, tackling issues with patient recovery and opening up new directions for serious game design.

The majority of the research on visual navigation using natural language instructions has been conducted in situations where easily recognised items in the surrounding environment act as points of reference. The work described in [14], however, tackles the difficult field of three-dimensional maze-like environments, where the lack of such reference points raises a unique set of challenges. Specialised procedures are required because conventional approaches, which have been successful in more traditional contexts, become unsuitable in this setting. The research presents a unique architecture that distinguishes itself by simultaneously learning instruction-following and visual navigation policies. In simulated experiments conducted in various maze environments, the proposed approach outperforms conventional techniques, proving its capacity to comprehend instructions and navigate across complex terrain that lacks reference points.
This research contributes to the understanding of visual navigation challenges and offers promising solutions adapted to maze-like scenarios.

Another research study focuses on implicit interaction at the motor-adaptation level and tackles the crucial goal of creating intuitive collaboration between people and intelligent robots in practical applications. Although there has been a lot of research on explicit communication, co-learning via shared motor tasks is the main focus here. The authors introduce a collaborative maze game in which a human player and a robotic agent must work together well because each can only act along orthogonal axes [14]. Deep reinforcement learning is used to control the robotic agent and, without prior training, it shows significant results after a brief amount of real-world play. The systematic experiments conducted on the co-learning process between humans and robots highlight the formation of a cooperative strategy that consistently wins games. Based on data from a pilot study and a larger set of experiments, the results emphasise the viability of human-robot collaboration, especially in physically demanding jobs such as collaborative manufacturing robots and assistive robots for paralysed users. In the end, the study highlights the advantages of human-in-the-loop systems and shows how working together with intelligent robots may be used in real-world applications to enhance human capabilities rather than replace them.

Research paper [15] explores the field of reinforcement learning with a particular focus on image-output video games, an environment in which classical Q-Learning struggles to handle huge and complex state spaces. With the introduction of the Deep Q-Network (DQN), an artificial-neural-network-based technique built on Q-Learning, the study investigates the use of deep reinforcement learning to model video games based on visual input. The research suggests neural network architectures and image-processing optimisation strategies to improve the effectiveness of DQN in handling video game scenarios, addressing the shortcomings of conventional methods. The experimental results show that human game data speeds up and enhances model training, confirming DQN's effectiveness in helping players achieve high scores in video games. The study also presents and validates the use of experience replay in training DQN models, demonstrating how effectively it improves the agents' stability and rate of learning. In the end, this establishes DQN as a powerful method for teaching agents to perform well in video games.

With an emphasis on delayed-reward systems, the work [16] investigates the features that influence reinforcement learning performance. The suggested multi-agent Deep Q-Network (N-DQN) model uses ping-pong and maze-finding games as examples of scenarios with delayed rewards that are challenging for traditional Q-learning. The N-DQN model exhibits notable improvements, outperforming Q-Learning by 3.5 times and achieving goal attainment 1.1 times faster than DQN in reward-sparse scenarios. To address issues with positive bias and learning efficiency, the work integrates segmentation of the reward-acquisition periods and prioritised experience replay. The article highlights these improvements, but also notes that more research is necessary to lighten the system because of growing hardware demands.
The conclusion highlights potential uses in smart factories and outlines future research directions, including the investigation of complex algorithms for distributed agents and the optimisation of neural network topologies for real-world use. By providing insight into reinforcement learning in delayed-reward systems, this study opens up new avenues for future research in the area.

Reinforcement learning (RL) has been widely used to solve sequential decision-making problems, and advances in deep learning have made it possible to construct optimal policies for agents in high-dimensional environments. The combination of reinforcement learning and deep learning is the main topic of the survey [17], which addresses difficulties in multi-agent settings where cooperation and communication are crucial for resolving challenging problems. The survey covers a range of multi-agent deep reinforcement learning (MADRL) techniques, including non-stationarity, partial observability, multi-agent training schemes, continuous state and action spaces, and transfer learning. The paper highlights effective applications of MADRL to real-world problems, including urban traffic control and swarm robotics, by analysing the advantages and disadvantages of these techniques. In the conclusion, potential research directions are identified, including learning from demonstration in multi-agent environments, human-in-the-loop architectures, and model-based approaches for scalability and efficiency. Challenges and solutions are also defined, and the integration of deep learning into traditional multi-agent RL methods is discussed. The paper emphasises how crucial empirical research is to applying MADRL effectively to complex real-world problems, and that RL techniques are still evolving.

The critical function of path planning in autonomous mobile robots is discussed in [18], which offers a novel solution to the drawbacks of the conventional best-first search (BFS) and rapidly exploring random trees (RRT) methods. The suggested approach creates a path graph by incorporating reinforcement learning, specifically Q-learning, into the path-planning procedure. This improves navigation efficiency by eliminating paths that collide with obstacles. Experimental results show that, compared to BFS and RRT, the method provides smoother and shorter paths. The conclusion emphasises how effective Q-Learning is at generating better global paths and suggests future updates to the update equation, centred on directing the path away from areas with a high concentration of obstacles. By highlighting the potential of reinforcement learning to overcome the difficulties presented by conventional algorithms, this research advances global path planning for autonomous robots and has implications for improving robot navigation safety.

Game testing, a historically difficult task in the gaming business that mostly relies on manual playtesting and scripted testing, has seen only modest automation, and recent attempts have begun to investigate the possibilities of deep reinforcement learning (DRL). While existing DRL work has concentrated on winning games, the research [19] tackles the niche of automated game testing by providing four oracles drawn from an extensive analysis of 1349 real bugs in commercial games.
Introduced in the paper, the Wuji framework increases the probability of bug detection by dynamically balancing space exploration and game progression through the use of evolutionary algorithms, DRL, and multi-objective optimisation. Wuji's efficacy in exploring game states and identifying previously unknown bugs is demonstrated by an extensive assessment conducted on a basic game and two well-known commercial games. The conclusion highlights the framework's contribution to automated game testing, demonstrating how it combines evolutionary multi-objective optimisation and DRL to successfully find bugs in real-world commercial games, classify bugs, and suggest test oracles.

While reinforcement learning (RL) has demonstrated impressive performance in controlling agents in Markov decision processes, enduring issues such as algorithmic divergence and suboptimal policies still exist. To improve exploration in settings with sparse feedback, [20] presents the Dreaming Variational Autoencoder (DVAE) and its extensions, such as DVAE-SWAGAN, a neural-network-based generative modelling architecture. The work assesses the performance of DVAE-SWAGAN in a variety of state-space situations, long-horizon tasks, and deterministic and stochastic challenges, using deep maze as a novel and versatile game engine. The authors compare the performance of several variants of the technique and find gains when simulating environment models with continuous state spaces. Still, limitations remain, especially when it comes to correctly forecasting unexplored territory in the state space, which calls for further model improvements to include reasoning and a more thorough understanding of the dynamics of the environment. The paper recognises the open issues in reinforcement learning and suggests avenues for further research, including investigating non-parametric variations and adding an inverse reinforcement learning component for deep maze and deep line battles. The current objective is to improve the sample efficiency of RL algorithms by utilising generative exploration, as demonstrated by DVAE-SWAGAN.

Inspired by the brain's segmented approach to tasks, Modular Reinforcement Learning (MRL) breaks complicated problems down into smaller, concurrently learned sub-goals. The effectiveness of dichotomy-based decomposition architectures, which prioritise safety and learning efficiency over conventional Q-learning, has been proven by MaxPain and its deep variant, Deep MaxPain. However, the modular potential of both techniques was limited because the mixing weights were set manually. To tackle this, the study investigates signal scaling in relation to the discounting factor γ and presents a state-value-dependent weighting method that incorporates softmax and hardmax, derived from a Boltzmann distribution case study [21]. With an emphasis on maze-solving navigation challenges, the paper presents a lidar- and image-based sensor-fusion network structure that improves outcomes in both simulation and real-world Turtlebot3 Waffle Pi experiments. The suggested techniques highlight developments in weighting schemes and signal design, contributing to the field of Modular Reinforcement Learning for labyrinth-solving navigation challenges.

First-person shooter (FPS) games present significant obstacles for reinforcement learning (RL) in terms of learning logical actions, because of their large action space and challenging exploration.
StarNet, a hierarchical reinforcement learning network optimised for first-person shooter games, is the innovative answer that the study [22] offers. With a manager-worker structure, StarNet operates on two hierarchical levels: high-level managers learn policies over options, and low-level workers carry out these options, motivated by internal rewards. The model has the added benefits of action-space division and use of environmental signals, and it effectively handles the problems of sparse rewards and reward delays. Experimental assessments in first-person shooter games, such as the VDAIC 2018 Track, show that StarNet outperforms other RL-based models, demonstrating its potential to greatly improve performance in tasks such as combat skills and labyrinth solving.

To compare two interactive reinforcement learning techniques, another study looks at regular reward assignment and a newly proposed method called "action assignment", which is similar to giving demonstrations. Inspired by human teaching techniques, the study investigates how well these strategies work when applied to robot learning. The concept of action assignment is introduced in the article [23]: the suggested action is used to calculate a reward based on the agent's subsequent behaviour. According to the interaction logs of a user study in a two-dimensional maze game, action assignment greatly improved users' capacity to indicate the right behaviour. Survey results showed that assigning actions and rewards was seen as very natural and practical; in fact, action assignment was thought to be just as natural as reward assignment. According to the study, users expressed frustration when the agent ignored their commands or when rewards had to be assigned more than once, and awarding rewards required more mental effort. The results point to action assignment as a potentially less cognitively taxing and more user-friendly approach to interactive reinforcement learning, offering important new information for improving user satisfaction in robot-teaching applications.

In Reinforcement Learning (RL), the study [24] presents a unique method called Potentialized Experience Replay (PotER) to improve experience replay, especially for hard exploration tasks with scant rewards. Conventional approaches frequently ignore the potential learning value in both favourable and negative circumstances by relying on basic heuristics for sampling events. Inspired by physics, PotER imbues each state with a potential energy function by introducing an artificial potential field into experience replay. This dynamic technique efficiently addresses the difficulties of hard exploration tasks by enabling the agent to learn from a variety of experiences. PotER's inherent state supervision adds to the agent's flexibility, enabling it to function in difficult situations. The research demonstrates PotER's compatibility with multiple RL algorithms and with self-imitation learning, highlighting its adaptability. Through extensive experimental evaluations in intricate exploration settings, such as mazes and difficult games, the research presents PotER as a competitive and effective baseline for handling hard exploration tasks in reinforcement learning.

In deep reinforcement learning (DRL), the study [25] tackles the issues of sluggish convergence and sensitivity to locally optimal solutions. The Magnify Saltatory Reward (MSR) algorithm introduces dynamic modifications to experience rewards in the pool, focusing on situations with reward saltation.
By optimising the agent's use of reward-saltation experiences, MSR seeks to accelerate network convergence and enable the achievement of globally optimal solutions. The paper compares the performance of DRL algorithms, including the deep Q-network (DQN), double DQN, and dueling DQN, before and after adding MSR. The experiments are carried out in a simulated obstacle-avoidance search environment for an unmanned aerial vehicle. The experimental results show that adding MSR to these algorithms greatly speeds up network convergence, providing a possible solution to the enduring problems of locally optimal solutions and slow convergence in DRL. The potential of MSR to enhance training results across multiple DRL algorithms is further highlighted by its adjustable parameters, which can be tuned to suit various contexts with reward saltation.

Agent exploration presents difficulties for deep reinforcement learning, especially when rewards are few, which results in low sample efficiency. To overcome these drawbacks, the suggested technique, known as E2S2RND [26], uses an auxiliary agent-training paradigm and a precisely designed feature extraction module to harness prediction errors and produce intrinsic rewards. This novel method considerably enhances the agent's ability to conduct thorough exploration in situations with sparse reward distribution by mitigating the exploration dilemma induced by separation derailment and white-noise interference. Analyses and comparative tests conducted in the Atari 2600 experimental environment confirm the effectiveness of the refined feature extraction module and demonstrate notable performance improvements in six chosen scenarios. The advent of E2S2RND represents a significant paradigm change, offering a fresh approach to exploration challenges. To ensure robust performance across different reward structures, more study is necessary to extend the applicability of the approach to diverse contexts, as indicated by the performance difference observed in situations with dense rewards. Subsequent research endeavours to expand the approach's range, augmenting its efficacy in scenarios featuring sparse as well as dense incentives.

Another study investigates the use of deep reinforcement learning (DRL) in humanoid robot soccer, a game in which a robot uses images from its onboard camera to interact with its surroundings and learn various skills [27]. The work shows effective learning in a robotic simulator for activities including walking towards the ball, taking penalties, and goalkeeping, using the Dueling Double DQN algorithm. Notably, the knowledge acquired in the simulator is effectively applied to an actual humanoid robot, demonstrating how flexible the DRL methodology is across settings. The outcomes demonstrate the robot's proficiency in soccer-related activities and illustrate the potential of DRL in teaching sophisticated abilities to robot soccer players. The research highlights the significance of simulation in the learning process, enabling the robot to perform tasks thousands of times. The effective transfer of learned models from simulation to an actual robot shows how DRL may be used practically to build robotic capabilities for soccer-playing scenarios.

In deep reinforcement learning (DRL), as in many other machine learning approaches, running many learning agents in parallel is a common strategy for enhancing learning performance.
However, in the development of these algorithms, the question of how best to organise the learning agents participating in the distributed search has been overlooked. The authors of [28] leverage insights from the literature on networked optimisation to demonstrate that arranging learning agents in communication networks other than fully connected topologies (the implicit way agents are frequently arranged) may improve learning. They compare the performance of four well-known graph families and discover that one of them (Erdos-Renyi random graphs) outperforms the de facto fully connected communication architecture in a variety of DRL test workloads. They also discover that 1000 learning agents arranged in an Erdos-Renyi graph can outperform 3000 learning agents organised in the typical fully connected topology, highlighting the large learning advantages that may be realised by carefully structuring the topology over which agents interact. Theoretical analysis of why the alternative topologies work better is supplemented by empirical data. Overall, their results suggest that optimising the communication architecture between learning agents might help distributed machine learning algorithms perform better.

Deep reinforcement learning has proved to be an effective method for a range of artificial intelligence challenges over the last several years, and recent research has focused on deep reinforcement learning in multi-agent environments rather than single-agent settings. The main goal of the survey [29] is to provide a thorough and systematic overview of multi-agent deep reinforcement learning methods, including issues and applications. Basic information is supplied first in order to give a better understanding of the subject. The relevant structures and illustrative techniques are then described, along with a taxonomy of challenges. Finally, the future potential of multi-agent deep reinforcement learning applications is highlighted.

Exploration is critical for effective deep reinforcement learning and has received a lot of attention. Existing multi-agent deep reinforcement learning techniques, however, continue to depend on noise-based approaches. Exploration strategies that take the cooperation of several agents into consideration have only recently been developed, and existing techniques share a common issue: agents struggle to identify states worth investigating and seldom coordinate their exploration efforts in that direction. To address this issue, the authors of [30] suggest cooperative multi-agent exploration (CMAE), in which agents collaborate while exploring to pursue a common goal. The target is chosen from multiple projected state spaces using a normalised entropy-based technique, and the agents are then taught to collaborate in order to attain the goal. They demonstrate that CMAE consistently outperforms baselines on a range of problems, including a sparse-reward variation of the multiple-particle environment (MPE) and the StarCraft multi-agent challenge (SMAC).

The outcomes of the literature review are as follows. Many efficient agent collision-avoidance mechanisms are discussed in the literature, covering the cooperative relations between the agent and the enemy. Reinforcement learning has used reward-exchange mechanisms for navigation between agents, which have also proved effective in other domains.
Although many collision-avoidance and reward-mechanism methods have been proposed, they have not considered environmental factors such as collision with an enemy and the navigation of the agent in the given environment. The overall description, methods, tools used, and limitations identified in the literature review are summarised in Table 3.1.

Table 3.1: Summary of the reviewed literature (Ref.; Description; Method; Tools used; Limitations)
[13]: Proposed method for collision avoidance in a complex environment; Dynamic Difficulty Adjustment; Unity 3D; did not consider the environmental factors.
[14]: Proposed a Deep Q-Network algorithm for settings in which classical Q-Learning struggles to handle huge and complex state spaces; DQN; Unity 3D; failed to tackle complex environments.
[27]: Proposed a collective motion strategy for agents; DQN; Unity 3D; centralised control of the agents.
[28]: Proposed a distributed DRL approach for agent formation control; DDPG; Unity ML toolkit; did not consider the enemy-tackle factor during navigation.
[29]: Proposed an approach for collision-avoidance formation in multiplayer games; PKURLOA; Unity ML-Agents; did not consider the multiplayer winning factors.
[30]: Proposed an approach for the reward strategy in games; DQN; Unity 3D; did not consider the environmental factors in AI-based games.

4. Methodology

The project aims to investigate how well curriculum learning can enhance several elements of a simple agent that learns to find its way through a maze in the fewest steps possible. Curriculum learning could improve both the training and the evaluation outcomes. During development, I encountered several issues with the Raycast Perception Sensor that comes with the ML-Agents toolkit; consequently, I created an alternative version of the Agent, which is also evaluated in this report. To build the learning environment, I used a tutorial by Joseph Hocking [31]. With the aid of this tutorial, I constructed a setting in which the Agent completes a predetermined number of training episodes, after which a new maze is generated. The number of episodes was fixed at five for this investigation. Five was chosen because it is a manageable number that allows the Agent to retry each maze while still providing a variety of mazes for the Agent to explore during training. An episode ends when the agent touches the goal in the upper-right corner of the maze, or after 1,000 steps have been taken. This maximum number of steps was selected because it gives the Agent ample freedom to explore each episode and enables it to successfully navigate the maze several times. There are eight Agent instances in the learning environment, each with its own generated maze.

4.1 Agent

There are two Agent versions in the environment. The first collects observations of the environment solely through a Raycast Perception Sensor. This sensor is configured to detect the goal object and the maze's walls along straight lines parallel to the X and Z axes. Figure 3 provides an example of how this is configured as a vector sensor on the agent.

Figure 3: Understanding of agent in 3D Maze game

After combining all of the observations, a single variable is created and divided by 255 to normalise it. The use of this bitmask was motivated by a study by Goulart, Paes, and Clua [32], in which the authors experimented with several approaches to gathering data for an agent in an environment modelled after the video game Bomberman. This is the only information that the Agent using the Raycast Perception Sensor gathers.
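To make the bitmask observation concrete, the sketch below shows one way such an encoding could be computed. The actual Agent is implemented as a Unity component, so this Python version, the specific bit layout, and the helper name are assumptions made only for illustration, not the project's implementation.

```python
def bitmask_observation(wall_north, wall_east, wall_south, wall_west,
                        goal_north, goal_east, goal_south, goal_west):
    """Pack eight boolean ray results (wall/goal detected along the X and Z
    axes) into a single byte, then scale it into [0, 1] by dividing by 255.
    The bit layout here is an assumption chosen purely for illustration."""
    flags = [wall_north, wall_east, wall_south, wall_west,
             goal_north, goal_east, goal_south, goal_west]
    packed = 0
    for i, flag in enumerate(flags):
        if flag:
            packed |= 1 << i       # set bit i when that ray hit something
    return packed / 255.0          # one normalised observation value

# Example: walls to the north and west, goal visible to the east.
obs = bitmask_observation(True, False, False, True,
                          False, True, False, False)   # -> 41/255, roughly 0.16
```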
Additionally, the Agent using the bitmask records its current position relative to the surroundings. This positional data is not normalised. Although the results might have been better with normalisation, I concluded that the benefits would not outweigh the difficulty of normalising the position data, given that the maze size changes throughout curriculum learning.

5. Results and discussion

The aim of this chapter is to present the experimental results. It describes the graphs obtained from training the ML-Agents, illustrates how our policy performs better than the other policies and how it achieves a high degree of control over the agent in scenarios with enemies, and describes how the agent navigates the best path to the destination while avoiding collisions with obstacles and other agents.

5.1 Evaluated Parameters

This section explains the parameters investigated in order to evaluate the effectiveness of the proposed model. The parameters assessed in this research are described below.

5.1.1 Mean Reward

The first parameter considered is the mean reward, the average of all rewards given to the agents during each episode. During the training phase, this parameter is used to measure how well the proposed model performs. The reward values that make up the reward structure ultimately determine the mean reward. A large reward value indicates that the agents have been trained extensively.

5.1.2 Policy Loss

The policy loss is used as a metric to evaluate how well the discriminator performs. Our agents' ability to replicate the actions of the expert agents in the demonstrations is evidence of how well they did so. A low policy loss value shows that the agents carry out their responsibilities successfully and imitate the demonstrated behaviour properly. The efficiency with which agents operate in their environment is one of the primary factors that determines the policy loss.

5.1.3 Value Estimate

In addition, the value estimate is used to illustrate how well the proposed model operates in settings involving several agents. During training, the agents visit many different states, and the value estimate represents the estimated value of each state. A high value estimate is evidence that the agents are well trained.

5.1.4 Policy Estimate

A base model is also implemented under the same conditions as the proposed model so that the outcomes of both models can be compared. After training the agent using the base technique, the results are compared with those produced by the proposed model in order to evaluate its effectiveness. The findings indicate that the proposed model performs better than the base model in terms of the mean reward, the extrinsic reward, the policy loss, and the value loss. This comparison between the base model and the proposed model is presented in this section.

5.1.5 Extrinsic Reward

This parameter is used to evaluate the agent's policy. The extrinsic reward is given to the agent under its policy; a high value indicates that the agent performed well. This parameter is a numerical value applied at each time step.
The value of an extrinsic reward is determined by the reward value assigned to each event, such as colliding with a barrier in the room.

5.1.6 Episode Length

This metric is used to assess the agent's operational performance. The episode length indicates how many time steps an agent can function without colliding with anything. A larger value can indicate good performance in many scenarios.

5.1.7 Number of Collisions

This metric is used to evaluate the agent's performance in the controlled room environment. It helps determine how much control is achieved under various obstacle situations.

5.2 Results

To validate the proposed model's outcomes, the base model is also applied under identical conditions. The agents are trained using the base technique, and the results are compared to the proposed model's performance. According to the results, the proposed model outperforms the base model in terms of mean reward, extrinsic reward, policy loss, value loss, and the number of collisions. This section compares the proposed model to the baseline model.

The value estimate is the first parameter used to measure the performance of the proposed model. The graph depicts, for each state that an agent passes through during training, the value that was predicted for that state. A better-trained agent can be identified by its higher predicted value. In the value estimate graph, the proposed approach provides more accurate estimates of the value of the states, and the proposed model has a more optimistic assessment of its value than the base model does. For this parameter, the proposed model performs better than the base model. The value estimate graph is shown in figure 3.

Figure 3: Value estimate

The second parameter used to evaluate the proposed model is the episode length, shown in figure 4. The graph compares the duration of each episode with the base model and illustrates that the proposed model is superior to the base model in terms of performance. Agents trained with the proposed model had longer episodes than agents trained with the base model. The maximum episode length for the proposed model is 350, while the maximum episode length for the base model is 10 to 15. This parameter demonstrates that the value of the proposed model is greater than that of the base model, which indicates that the proposed model performs well. The longer duration indicates that the agents operate effectively for a longer period of time, demonstrating their high level of performance.

Figure 4: Episode length

The third parameter used to evaluate the proposed model is the policy loss. The policy variations that occurred throughout the training phase are shown in figure 5. The value on the graph should decrease over time, since the agent should spend less time modifying its policy as training progresses. Our results show that, up to 1.1 million timesteps, the proposed model takes less time to modify the policy, and the graph then decreases as the number of steps increases. After 1.1 million timesteps, the performance of the proposed model and the base model is the same. This parameter represents the time required to change the policy: our model performs better at lower timesteps, whereas both models take almost the same time to modify the policy and discover the optimal policy at higher timesteps.
Figure 5: Policy loss

The value loss is another parameter used to assess the performance of the proposed model, as shown in figure 6. This graph shows how accurately the policy value is estimated for each policy implemented by the agent throughout the training phase. The loss should decrease as the timesteps increase. The graph illustrates that the loss was larger at the start, but dropped after 250k timesteps, and the whole training process then proceeded smoothly. The value loss of the proposed model is greater than that of the base model.

Figure 6: Value loss

According to the cumulative reward graph, the proposed model outperforms the base model. The reward for agents trained with the proposed model was larger than for agents trained with the base model. The greatest value of the mean reward obtained by the proposed model is -0.42, whereas the maximum value obtained by the base model is -0.5. This parameter indicates that the proposed model's value is greater than the base model's. A comparison of the cumulative reward between the proposed model and the base model is shown in figure 7.

Figure 7: Cumulative reward

The reward given to the agent that follows the optimal policy is referred to as the extrinsic reward, shown in figure 8. For the proposed model this reward begins at -0.42 and gradually increases until it reaches -0.32 as the number of time steps increases, whereas the reward obtained by the base model increases only from -0.5 to -0.48 as the timesteps rise. This parameter shows that the value of the proposed model is higher than that of the base model, which means that the proposed model performed better.

Figure 8: Extrinsic reward

6. Conclusion

Deep reinforcement learning has numerous applications in the real world, such as robot locomotion, healthcare systems, self-driving cars, and video games. DRL is used in many games to make them more user-friendly and interesting through human-like behaviour, and it is being utilised to build multiple games. Previous research offered strategies for agent navigation, collaboration, and cooperation in famous real-life games, but existing research does not take into consideration many environmental elements used in games. Environmental factors such as agent behaviour and the enemy have a negative impact on agent control. This work provided a model to improve and enhance the behaviour of the agent in the game environment. This report takes three environmental circumstances into account: agent behaviour, reward policy, and the enemy. The proposed model is implemented in Unity 3D, all the mentioned environmental factors are rendered through Unity's built-in components, and the model is evaluated in these settings. The proposed model trained the 3D Maze game agent in the presence of these environmental factors using the Generative Adversarial Imitation Learning and Proximal Policy Optimization algorithms. All the demonstrations have been recorded; these demonstrations are composed of state-action pairs and are then passed to Generative Adversarial Imitation Learning (GAIL) through its hyperparameters. The agent uses the expert demonstrations to boost control in various environmental settings. A convolutional neural network with three layers is used to implement Proximal Policy Optimization (PPO) as the trainer. Moreover, with slightly different and additional hyperparameters, a GAN is used for the implementation of GAIL.
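For reference, a PPO trainer combined with extrinsic and GAIL reward signals of the kind described above is normally expressed in the ML-Agents training configuration file. The sketch below mirrors that structure as a Python dictionary; the behaviour name, demonstration path, and all numeric values are illustrative assumptions and not the exact configuration used in this project.

```python
# Hedged sketch: a dict mirroring the fields of an ML-Agents (release 18)
# trainer configuration for PPO with extrinsic and GAIL reward signals.
# In the toolkit these settings live in a YAML file passed to mlagents-learn.
maze_agent_config = {
    "behaviors": {
        "MazeAgent": {                      # assumed behaviour name
            "trainer_type": "ppo",
            "hyperparameters": {
                "learning_rate": 3.0e-4,
                "batch_size": 128,
                "buffer_size": 2048,
                "beta": 5.0e-3,             # entropy regularisation (PPO-specific)
                "epsilon": 0.2,             # PPO clipping parameter (PPO-specific)
                "num_epoch": 3,             # number of epochs (PPO-specific)
            },
            "network_settings": {"num_layers": 3, "hidden_units": 128},
            "reward_signals": {
                "extrinsic": {"gamma": 0.99, "strength": 1.0},
                "gail": {
                    "strength": 0.5,
                    # Recorded expert demonstrations (assumed path).
                    "demo_path": "Demos/MazeExpert.demo",
                },
            },
            "max_steps": 1_000_000,
        }
    }
}
```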
The performance metrics of mean reward, extrinsic reward, episode length, policy loss, value loss, value estimate, and number of collisions with obstacles are used to compare the performance of the proposed model with the base model. In the mean reward, extrinsic reward, value estimate, episode length, and number-of-collisions metrics, the proposed model performed well and outperformed the base model. This demonstrates how the proposed approach provides the agent with good control. The results show that learning from expert behaviour is very useful for increasing control under different environmental conditions.

7. References

[1] Szita, István. "Reinforcement learning in games." Reinforcement Learning: State-of-the-Art. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. 539-577.
[2] Hu, Zhipeng, et al. "Promoting human-AI interaction makes a better adoption of deep reinforcement learning: a real-world application in game industry." Multimedia Tools and Applications (2023): 1-22.
[3] Teja, Koyya Vishnu, and Mallikarjun M. Kodabagi. "Intelligent Path-Finding Agent in Video Games Using Artificial Intelligence." International Journal of Research in Engineering, Science and Management 6.6 (2023): 69-73.
[4] M. I. Jordan and T. M. Mitchell, "Machine learning: Trends, perspectives, and prospects", Science, vol. 349, no. 6245, pp. 255–260, 2015.
[5] Sarker, Iqbal. "Machine Learning: Algorithms, Real-World Applications and Research Directions." SN Computer Science 2 (2021). doi: 10.1007/s42979-021-00592-x.
[6] B. Liu, "Supervised learning", in Web Data Mining, Springer, 2011, pp. 63–132.
[7] IBM Cloud Education. "Deep learning". (2020), [Online]. Available: https://www.ibm.com/cloud/learn/deep-learning (visited on 05/13/2022).
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[9] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents", Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.
[10] A. Juliani, V.-P. Berges, E. Teng, et al., "Unity: A general platform for intelligent agents", arXiv preprint arXiv:1809.02627, 2018.
[11] "Unity's terms of service". (2022), [Online]. Available: https://unity3d.com/legal/terms-of-service/software (visited on 03/23/2022).
[12] Unity Technologies. "Unity ML-Agents Python low level API". (2021), [Online]. Available: https://github.com/Unity-Technologies/ml-agents/blob/release_18_docs/docs/Python-API.md (visited on 05/09/2022).
[13] Reis, S., Reis, L. P., & Lau, N. (2021). Game adaptation by using reinforcement learning over meta games. Group Decision and Negotiation, 30(2), 321-340.
[14] Devo, A., Costante, G., & Valigi, P. (2020). Deep reinforcement learning for instruction following visual navigation in 3D maze-like environments. IEEE Robotics and Automation Letters, 5(2), 1175-1182.
[15] Shafti, A., Tjomsland, J., Dudley, W., & Faisal, A. A. (2020, October). Real-world human-robot collaborative reinforcement learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 11161-11166). IEEE.
[16] Tan, R., Zhou, J., Du, H., Shang, S., & Dai, L. (2019, May). An modeling processing method for video games based on deep reinforcement learning. In 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC) (pp. 939-942). IEEE.
[17] Kim, K. (2022). Multi-agent deep Q network to enhance the reinforcement learning for delayed reward system. Applied Sciences, 12(7), 3520.
[18] Coppens, Y., Bargiacchi, E., & Nowé, A. (2019, August). Reinforcement learning 101 with a virtual reality game. In Proceedings of the 1st International Workshop on Education in Artificial Intelligence K-12.
[19] Gao, P., Liu, Z., Wu, Z., & Wang, D. (2019, December). A global path planning algorithm for robots using reinforcement learning. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO) (pp. 1693-1698). IEEE.
[20] Zheng, Y., Xie, X., Su, T., Ma, L., Hao, J., Meng, Z., ... & Fan, C. (2019, November). Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) (pp. 772-784). IEEE.
[21] Andersen, P. A., Goodwin, M., & Granmo, O. C. (2021). Increasing sample efficiency in deep reinforcement learning using generative environment modelling. Expert Systems, 38(7), e12537.
[22] Wang, J., Elfwing, S., & Uchibe, E. (2021). Modular deep reinforcement learning from reward and punishment for robot navigation. Neural Networks, 135, 115-126.
[23] Song, S., Weng, J., Su, H., Yan, D., Zou, H., & Zhu, J. (2019, August). Playing FPS games with environment-aware hierarchical reinforcement learning. In IJCAI (pp. 3475-3482).
[24] Raza, S. A., & Williams, M. A. (2020). Human feedback as action assignment in interactive reinforcement learning. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 14(4), 1-24.
[25] Zhao, E., Deng, S., Zang, Y., Kang, Y., Li, K., & Xing, J. (2021, January). Potential driven reinforcement learning for hard exploration tasks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (pp. 2096-2102).
[26] Hu, Z., Wan, K., Gao, X., & Zhai, Y. (2019). A dynamic adjusting reward function method for deep reinforcement learning with adjustable parameters. Mathematical Problems in Engineering, 2019, 1-10.
[27] Li, Y., Guo, T., Li, Q., & Liu, X. (2023). Optimized feature extraction for sample efficient deep reinforcement learning. Electronics, 12(16), 3508.
[28] D. Adjodah, D. Calacci, A. Dubey, A. Goyal, P. Krafft, E. Moro, and A. Pentland, "Leveraging communication topologies between learning agents in deep reinforcement learning," in AAMAS, 2020, pp. 1738–1740.
[29] W. Du and S. Ding, "A survey on multi-agent deep reinforcement learning: From the perspective of challenges and applications," Artificial Intelligence Review, vol. 54, no. 5, pp. 3215–3238, 2021.
[30] I.-J. Liu, U. Jain, R. A. Yeh, and A. Schwing, "Cooperative exploration for multi-agent deep reinforcement learning," in International Conference on Machine Learning. PMLR, 2021, pp. 6826–6836.
[31] "Training configuration file". (2021), [Online]. Available: https://github.com/Unity-Technologies/ml-agents/blob/release_18_docs/docs/Training-Configuration-File.md (visited on 05/08/2022).
[32] "Using TensorBoard to observe training". (2022), [Online]. Available: https://github.com/Unity-Technologies/ml-agents/blob/release_18_docs/docs/Using-Tensorboard.md.