3D maze game project

1. Introduction
Since the early days of computer gaming and the first Nintendo and Atari systems, video game
production has come a long way, with the primary objective being to amuse both kids and adults
[1]. The days of pixelated graphics and sparse audio are a thing of the past; modern video games
are more lifelike than ever. This made digital gaming so alluring that it became a temporary
haven from the stresses of the real world for many people, but what precisely contributed to this
significant growth in game development?
The progression of video game graphics, gameplay, and even narratives may be a simple
explanation, but there is something more that every gamer looks for when a new game hits the
shelves: how closely the artificial intelligence in that game resembles real life.
The use of Artificial Intelligence (AI) in video games has long been a research topic. It examines
ways to employ AI technology to play games at a level of performance comparable to humans and
is a key topic in many games. It would be difficult for a game to give the player an immersive
experience without these AI-powered interactive experiences, which are typically created by Non-Player
Characters (NPCs) or even enemies that act cleverly or creatively, as if they were controlled
by a human gamer or were acting with a mind of their own [2].
In the proposed Final Degree Project, a machine learning technique known as Reinforcement
Learning is used to create a 3D Maze game with the Unity 3D engine. The approach includes an
agent that mimics human player behaviour and continuously improves it based on past game
experiences until it achieves a fully optimized score that is unbeatable by average play.
Furthermore, the term "crunch" refers to severe, prolonged overtime in the professional video
game development industry. This practice may cause developers to make less-than-ideal decisions
when developing. "Crunch" may push these developers to take shortcuts, which, if they employ
machine learning in a video game, may result in poorly optimised machine learning algorithms [3].
For this reason, it is critical to investigate the performance of machine learning algorithms in
contexts they were not tailored for.
The six chapters that make up the framework of this project report are described as follows:

• Chapter 2 covers the background themes needed to comprehend the subject of this project report in greater detail.
• Chapter 3 explains how the environments used for training and evaluation are set up and how the agents are trained.
• Chapter 4 comprises two case studies with data from the training and the evaluation. The sole difference between these case studies is how each machine learning agent collects data from its surroundings.
• Chapter 5 is separated into the two case studies and provides an analysis of the training and evaluation findings.
• Chapter 6 covers the final remarks about game development using Reinforcement Learning.
2. Background work
I discuss several topics in this chapter that help to give the background knowledge necessary
to comprehend this 3D Maze game development. The following subjects pertain to the game
development environment: machine learning, deep learning, reinforcement learning, deep
reinforcement learning, and Unity 3D.
2.1 Machine Learning
Machine learning is a field that focuses on two questions, according to M. I. Jordan and T.
M. Mitchell [4]: "How can one create computer systems that automatically improve through
experience?" and "What basic statistical-computational-information-theoretic laws underpin
all learning systems, including those in computers, people, and organisations?" These days,
machine learning is applied in data processing, speech recognition, natural language
processing, robot control, and other fields.
The development of various forms of machine learning has been fueled by the abundance
of diverse use cases for the technology. Many machine learning algorithms aim to improve
themselves in a way that increases the accuracy of the function they learn; these algorithms
are mostly employed to solve function approximation problems.
Three main machine learning paradigms exist: reinforcement, unsupervised, and supervised
learning. Of these three paradigms, reinforcement learning is the focus of this work because
it is the paradigm most frequently employed in the study of machine learning through the use
of video games. Furthermore, figure 1 shows the different types of machine learning paradigms
used in AI.
Figure 1: Machine learning types [5]
The machine learning algorithms can be provided with a large amount of training data so
they can learn. This data is provided as labelled input-output pairs of training samples.
Supervised learning is the paradigm that is applied when the data is in this format [6].
Unsupervised learning is the machine learning paradigm that applies when the data is not
labelled. A more thorough discussion of reinforcement learning will be covered later in
this chapter.
2.2 Deep Learning
Deep learning is a type of representation learning. Techniques that do not require
their training data to be in input-output pairs are referred to as representation learning
techniques; the machine learning algorithm can derive these input-output pairs automatically
even when the training data is provided in an unprocessed format. Deep learning
arises when these representation learning techniques are layered on top of one another.
A significant and fascinating feature is that these representation learning layers in deep
learning are not created by humans: the layers are revealed from the data through the
process of learning.
IBM provides an additional explanation of deep learning as used in reinforcement learning
[7]. As stated by IBM: "Deep learning is a subset of machine learning, which is essentially
a neural network with three or more layers." These neural networks are designed to mimic
the functioning of the human brain, which enables the algorithm to learn from the data that
is fed to it. From these explanations it is simple to deduce that, while deep
learning has been mixed with semi-supervised learning, traditional machine learning
algorithms developed using deep learning fall within the unsupervised paradigm.
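To make the "three or more layers" description concrete, the following minimal sketch (written purely for illustration and not taken from this project's implementation; all layer sizes and weights are arbitrary placeholders) builds a tiny fully connected network with an input layer, two hidden layers, and an output layer using plain NumPy:

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# A tiny fully connected network: 8 inputs -> 16 -> 16 -> 4 outputs.
W1, b1 = rng.normal(size=(8, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)) * 0.1, np.zeros(16)
W3, b3 = rng.normal(size=(16, 4)) * 0.1, np.zeros(4)

def forward(x):
    h1 = relu(x @ W1 + b1)   # first hidden layer
    h2 = relu(h1 @ W2 + b2)  # second hidden layer
    return h2 @ W3 + b3      # output layer, e.g. one value per action

print(forward(rng.normal(size=8)))

In a deep reinforcement learning setting, a network of this shape would map an observation to action values or action probabilities, with the hidden layers learning their own representations of the input.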
2.3 Reinforcement Learning
Reinforcement learning is a paradigm of machine learning alongside supervised and
unsupervised learning, according to the book Reinforcement Learning: An Introduction
[8]. It can be summed up as a computational method of interaction-based learning. Because
it lacks knowledge of the tasks at hand, the machine learning system must rely on trial and
error to determine the course of action that maximises its rewards.
Furthermore, the objectives of reinforcement learning differ from those of unsupervised
Learning. Unsupervised learning aims to detect patterns and similarities between data
points, whereas reinforcement learning seeks to find an optimal action model that
maximises the agent's cumulative reward over time. Figure 2 depicts the action-reward
feedback loop of a generic reinforcement learning model.
The trade-off between the agent trying out novel action sequences and using an action
sequence that it has previously learnt is one of the most difficult aspects of implementing
reinforcement learning. The agent cannot discover novel and improved solutions if it does
not experiment with different sequences, but it also cannot achieve a steady outcome if it
tries something new every time. The conflict between using pre-existing action sequences
and creating new ones is a key feature that sets reinforcement learning apart from the other
two main machine learning paradigms.
Figure 2: Reinforcement process in games
A policy defines an agent's actions; it maps environmental inputs (states) to actions. The
reward signal is essentially the main objective of the reinforcement learning problem. For
each step of training, the environment sends the agent a single number, and the agent seeks
to maximise that figure as its reward. Because of this, the agent should not be able to
influence how the environment generates the reward signal.
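As a concrete illustration of this action-reward loop, the sketch below implements tabular Q-learning with an epsilon-greedy policy. It is not the training code used in this project; the env object and its reset/step/actions interface are assumptions standing in for whatever environment is used, and the hyperparameter values are placeholders.

import random
from collections import defaultdict

def train(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy trade-off between trying
    novel actions and reusing the action sequence learnt so far."""
    q = defaultdict(float)                        # (state, action) -> estimated return

    for _ in range(episodes):
        state = env.reset()                       # assumed: returns a hashable state
        done = False
        while not done:
            if random.random() < epsilon:         # explore: try a novel action
                action = random.choice(env.actions)
            else:                                 # exploit: use what has been learnt
                action = max(env.actions, key=lambda a: q[(state, a)])

            next_state, reward, done = env.step(action)   # assumed interface

            # Move Q(s, a) towards the reward plus the discounted best future value.
            best_next = max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

The epsilon parameter makes the exploration-exploitation trade-off described above explicit: with probability epsilon the agent tries something new; otherwise it exploits its current estimate of the reward it can obtain.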
Since a game acts as the agent's environment and provides a consistent set of behaviours for it
to perform, reinforcement learning is a good fit for machine learning studies
involving video games. The Arcade Learning Environment [9] and the Unity ML-Agents
toolkit [10] are two instances of how reinforcement learning is applied to video games. Both
systems can use other auxiliary learning techniques, such as imitation
learning, in addition to reinforcement learning.
2.4 Unity
Unity, formerly known as Unity3D, is a popular game engine used by both professional and
beginner developers. Unity's licence is one factor contributing to its popularity: anyone can
use Unity under the personal licence as long as their funding or revenue is below $100,000 US,
and only above that threshold must users upgrade from the personal licence [11]. Unity has
become more and more well known because of this, as well as the vast gallery of assets
available on the Asset Store, the abundance of tutorials, and other support resources.
Although its primary application is as a game engine, Unity is widely used in other fields,
ranging from architecture to film, animation, and industry. In addition to these fields, Unity
is utilised as a research environment; it is particularly well known for work in machine
learning, augmented reality, and virtual reality, and Unity Labs conducts the company's
research work with reinforcement learning.
2.5 ML Agent ToolKit
The Unity Machine Learning Agents (ML-Agents) toolkit is open-source software developed by Unity
[10]. A Python package is included with the ML-Agents toolkit, consisting of two
parts. The first is a low-level API that allows direct interaction with the Unity
environment. The second is an entry point that makes it possible to train agents in the
Unity environment [12].
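As a rough sketch of how the low-level Python API can drive a Unity environment, the snippet below connects to a build, looks up its behaviour, and steps it with random actions. The build name "MazeBuild" is a placeholder, and the call names reflect the ML-Agents release 18 Python package; they may differ in other releases.

from mlagents_envs.environment import UnityEnvironment

# "MazeBuild" is a placeholder path to a Unity player build;
# passing file_name=None instead connects to the Editor in play mode.
env = UnityEnvironment(file_name="MazeBuild", seed=1)
env.reset()

behavior_name = list(env.behavior_specs)[0]      # the behaviour defined in Unity
spec = env.behavior_specs[behavior_name]

for _ in range(100):                             # drive the simulation for 100 steps
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    if len(decision_steps) > 0:
        # Send a random action to every agent that requested a decision.
        action = spec.action_spec.random_action(len(decision_steps))
        env.set_actions(behavior_name, action)
    env.step()

env.close()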
There are three main components in the ML-Agents toolkit, which are as follows:
1. Academy: The main functionality of this component is to manage and coordinate the
different parts of the learning game environment. It also provides the interface between
the game environment and the Python API for training the agent in the given environment.
2. Agent: This is the main component of the ML-Agents toolkit; it is the entity that is
trained in the given game environment and is attached to a GameObject in Unity. The agents
trained in a Unity build environment are assigned responsibilities according to the
environment scenario.
3. Sensor: This component helps the agent gather information from the game. The sensors
feed the policy, which is referred to by a behavior name in the game environment.
The hyperparameters used in this work combine the universal hyperparameters with the
Proximal Policy Optimization (PPO) hyperparameters. The main hyperparameters used in this
work are as follows:

• Learning rate
• Batch size
• Buffer size
• Beta
• Number of epochs

These hyperparameters are present in the ML-Agents toolkit when using either the SAC or PPO
algorithm. The PPO-specific hyperparameters are beta, epsilon, and the number of epochs.
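For reference, these hyperparameters correspond to entries in the trainer configuration file that the ML-Agents toolkit reads when training starts. The sketch below reproduces that structure as a Python dictionary; the behaviour name and all numeric values are placeholders for illustration, not the exact settings used in this work (which would normally be written in YAML).

# Placeholder PPO trainer settings in the shape of an ML-Agents trainer configuration.
trainer_config = {
    "MazeAgent": {                     # assumed behaviour name
        "trainer_type": "ppo",
        "hyperparameters": {
            "learning_rate": 3.0e-4,   # universal: gradient step size
            "batch_size": 128,         # universal: experiences per gradient update
            "buffer_size": 2048,       # universal: experiences gathered before an update
            "beta": 5.0e-3,            # PPO-specific: entropy regularisation strength
            "epsilon": 0.2,            # PPO-specific: policy update clipping range
            "num_epoch": 3,            # PPO-specific: passes over the buffer per update
        },
        "max_steps": 500000,
    },
}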
3. Literature Work
I conducted a literature review of recent work on deep reinforcement learning for
training game environments. By using a Dynamic Difficulty Adjustment (DDA) methodology
and approaching the balance problem as a meta-game within the context of Reinforcement
Learning (RL), the research article presents a novel approach to automatic video game
balancing [13]. The study expands on standard reinforcement learning by using an agent as a
game master and a meta-environment in which state transitions depend on the development of
a base environment. In addition, the suggested Multi-Agent System training model has the
game master interacting with several agent opponents, each of whom has different playing
styles and skill levels for the base game. The experiments, which are conducted in adaptive
grid-world environments for single-player and multiplayer games, produce thorough findings
that assess how the game master makes decisions that are in line with the balanced objective
sets. A framework for autonomous game balance is presented in the research, with a focus on
modulating reward functions, action spaces, and balance space states. The conclusion expands
on the scope of DDA applications, especially in serious games, where the smooth incorporation
of dynamic difficulty adjustments may improve therapy session motivation and engagement,
tackling issues with patient recovery and opening up new directions for serious game design.
The majority of the research on visual navigation using natural language instructions has been
conducted in situations in which easily recognized items in the surrounding environment act
as points of reference. However, the work described in [14] tackles the difficult field of three-dimensional maze-like environments, where a unique set of challenges arises from the lack of
such reference points. Specialized procedures are required because conventional approaches,
which have been successful in more traditional contexts, become unsuitable in this particular
environment. The research presents a unique architecture that distinguishes itself by
simultaneously learning instruction and visual navigation policies. The proposed approach
outperforms conventional techniques in simulated experiments conducted in various maze
environments, proving its capacity to comprehend instructions and navigate across complex,
deficient reference point terrains. This research contributes to the understanding of visual
navigation challenges and offers promising solutions tailored to maze-like scenarios.
This research study focuses on implicit interaction at the motor adaptation level and tackles
the crucial goal of creating intuitive collaboration between people and intelligent robots in
practical applications. Although there has been a lot of research on explicit communication,
co-learning via shared motor tasks is the main focus here. The authors introduce a collaborative
maze game in which a human player and a robotic agent must work together well because there
are only so many actions that may be made on orthogonal axes [14]. For robotic agent control,
deep reinforcement learning is used, and without prior training, it shows significant results in
a brief amount of real-world play. The systematic experiments conducted on the co-learning
process that occurs between humans and robots emphasize the formation of a cooperative
strategy that consistently wins games. Based on data from pilot research and a larger set of
experiments, the result emphasizes the viability of human-robot collaboration, especially in
physically demanding jobs like collaborative manufacturing robots and assistive robots for
users who are paralyzed. In the end, by working together with intelligent robots, the study
highlights the advantages of human-in-the-loop systems and shows how they may be used in
real-world applications to enhance human capabilities rather than replace them.
Research paper [15] explores the field of reinforcement learning with a particular focus on
image-output video games, an environment in which classical Q-Learning struggles to handle
huge and complex state spaces. With the introduction of Deep Q-Network (DQN), an artificial
neural network-based technique based on Q-Learning, the study investigates the use of deep
reinforcement learning to model video games based on visual input. The research suggests
neural network architectures and image processing optimization strategies to improve the
effectiveness of DQN in handling video game scenarios, addressing the shortcomings of
conventional methods. The experimental results show that human game data expedites and
enhances model training, confirming DQN's effectiveness in helping players get high scores
in video games. The study also presents and confirms the use of experience replay in training
DQN models, demonstrating how effectively it works to improve the agents' stability and rate
of learning. In the end, this establishes DQN as a powerful method for teaching agents to
perform well in video games.
With an emphasis on delayed reward systems, this work [16] investigates reinforcement
learning's performance-influencing features. The suggested multi-agent Deep Q-Network (N-DQN) model uses ping-pong and maze-finding games as examples of scenarios with delayed
rewards that are challenging for traditional Q-learning. The N-DQN model exhibits notable
improvements, outperforming Q-Learning by 3.5 times and achieving goal attainment 1.1
times faster than DQN in reward-sparse scenarios. In order to solve issues with positive bias
and learning efficiency, the work integrates segmentation algorithms for incentive acquisition
periods and prioritized experience replay. The article highlights improvements, but it also notes
that more research is necessary to lighten the system because of growing hardware demands.
The conclusion highlights the potential uses of smart factories and outlines future directions
for research, including the investigation of complex algorithms for dispersed agents and the
optimization of neural network topologies for real-world uses. By providing insightful
information about reinforcement learning in delayed reward systems, this study opens up new
avenues for future research in the area.
Advances in deep learning have made it possible to construct optimum policies for agents in
high-dimensional environments. Reinforcement learning (RL) has been widely used to solve
sequential decision-making problems. The function of reinforcement learning and deep
learning is the main topic of this research, which addresses difficulties in multi-agent settings
where cooperation and communication are crucial for resolving challenging problems. The
survey [17] covers a range of multi-agent deep reinforcement learning (MADRL) techniques,
including non-stationarity, partial observability, multi-agent training schemes, continuous state
and action spaces, and transfer learning. The paper highlights the effective implementation of
MADRL in real-world problems, including urban traffic control and swarm robotics, by
analyzing the advantages and disadvantages of these techniques. In the conclusion, potential
research directions are identified, including learning from demonstration in multi-agent
environments, human-in-the-loop architectures, and model-based approaches for scalability
and efficiency. Challenges and solutions are also defined, and the integration of deep learning
into traditional multi-agent RL methods is discussed. It is highlighted how crucial empirical
research is to applying MADRL to complex real-world problems in an effective manner, as
well as how RL techniques are still evolving.
The critical function of path planning in autonomous mobile robots is discussed in this research
[18], which also offers a novel solution to the drawbacks of the conventional best-first search
(BFS) and rapidly exploring random trees (RRT) methods. The suggested approach creates a
path graph by incorporating reinforcement learning, especially Q-learning, into the path
planning procedure. This improves navigation efficiency by eliminating paths that collide with
barriers. Comparing the method to BFS and RRT, experimental results show that it provides
smoother and shorter pathways. The conclusion emphasizes how effective Q-Learning is in
generating better global paths and makes suggestions for future updates to the update equation
that center on directing the path away from places with a high concentration of obstacles. By
highlighting the potential of reinforcement learning to overcome the difficulties presented by
conventional algorithms, this research advances global path planning for autonomous robots
and has implications for improving robot navigation safety.
While modest automation has been observed in game testing—a historically difficult task in the
gaming business that mostly relies on manual playtesting and scripted testing—recent attempts
have begun to investigate the possibilities of deep reinforcement learning (DRL). The research
[19] presented here tackles the niche of automated game testing by providing four oracles
drawn from an extensive analysis of 1349 real bugs in commercial games, while existing DRLs
have concentrated on winning games. Introduced in the paper, the Wuji framework increases
the probability of bug detection by dynamically balancing space exploration and game
progression through the use of evolutionary algorithms, DRL, and multi-objective
optimization. Wuji's efficacy in examining game states and identifying unidentified bugs is
demonstrated by the extensive assessment conducted on a basic game and two well-known
commercial games. The framework's contribution to automated game testing is highlighted in
the conclusion, which demonstrates how it combines evolutionary multi-object optimization
and DRL to successfully find bugs in real-world commercial games, classify bugs, and suggest
test oracles.
While reinforcement learning (RL) has demonstrated impressive performance in managing
agents in Markov decision processes, enduring issues such as algorithmic divergence and
suboptimal policies still exist. In order to improve exploration in settings with sparse feedback,
[20] presents the Dreaming Variational Autoencoder (DVAE) and its extensions, such as
DVAE-SWAGAN, a neural network-based generative modeling architecture. The work
assesses the performance of DVAE-SWAGAN in a variety of state-space situations, long-horizon
tasks, and deterministic and stochastic challenges using deep maze as a novel and
versatile game engine. The authors compare the performance of various variants of the technique and
find gains while simulating environment models with continuous state spaces. Still, there
remain limitations, especially when it comes to correctly forecasting uncharted territory in the
state space, which calls for more model improvements to include reasoning and a more
thorough comprehension of the dynamics of the environment. The paper recognizes the open
issues in reinforcement learning and suggests avenues for further research, including
investigating non-parametric variations and adding an inverse reinforcement learning
component for deep maze and deep line battles. The current objective is to improve the sample
efficiency of RL algorithms by utilizing generative exploration, as demonstrated by DVAE-SWAGAN.
Inspired by the brain's segmented approach to tasks, Modular Reinforcement Learning (MRL)
breaks down complicated problems into smaller, concurrently learned sub-goals. The
effectiveness of dichotomy-based decomposition architectures, which prioritize safety and
learning efficiency above conventional Q-learning, has been proven by MaxPain and its deep
variation, Deep MaxPain. However, the modular potential of both techniques was limited
because mixing weights were established manually. To tackle this, the study investigates signal
scaling in relation to the discounting factor γ and presents a state-value-dependent weighting
method that incorporates softmax and hardmax, derived from a Boltzmann distribution case
study [21]. With an emphasis on maze-solving navigation challenges, the paper presents a lidar
and image-based sensor fusion network structure that improves simulation and real-world
Turtlebot3 Waffle Pi experimentation outcomes. The suggested techniques highlight
developments in weighting schemes and signal design, contributing to the field of Modular
Reinforcement Learning for navigation challenges involving labyrinth solutions.
First-person shooter (FPS) games present significant obstacles for reinforcement learning (RL)
in terms of learning logical actions because of the large action space and challenging
exploration. StarNet, a hierarchical reinforcement learning network optimized for first-person
shooter games, is the innovative answer that this [22] study offers. With a manager-worker
structure, StarNet functions on two hierarchical levels. High-level managers learn rules over
possibilities, and low-level workers carry out these options because they are motivated by
internal rewards. The model has the added benefit of action space division and environmental
signal use, and it effectively handles the problems of sparse rewards and reward delays. In
first-person shooter games, experimental assessments like the VDAIC 2018 Track show that
StarNet outperforms other RL-based models, demonstrating its potential to greatly improve
performance in tasks like combat skills and labyrinth solving.
In order to compare two interactive reinforcement learning techniques, this study looks at
regular reward assignment and a newly proposed method called "action assignment," which is
similar to giving demonstrations. Inspired by human teaching techniques, the study
investigates how well these strategies work when applied to robot learning. In the article [23],
the concept of action assignment is introduced. According to the agent's future behavior, the
suggested action is utilized to calculate a reward. Based on a user study in a two-dimensional
maze game, action assignment greatly improved users' capacity to indicate the right behavior,
according to the interaction logs. According to survey results, assigning actions and rewards
was seen as very natural and practical; in fact, action assignment was thought to be just as
natural as reward assignment. According to the study, users expressed frustration when the
agent ignored their orders or assigned prizes more than once, and awarding rewards required
more mental work. The results point to action assignment as a potentially less cognitively
taxing and more user-friendly approach to interactive reinforcement learning, offering
important new information for improving user satisfaction in robot teaching applications.
In Reinforcement Learning (RL), the study [24] presents a unique method called Potentialized
Experience Replay (PotER) to improve experience replay, especially for hard exploratory tasks
with scant rewards. Conventional approaches frequently ignore the potential learning value in
both favorable and negative circumstances by relying on basic heuristics for sampling events.
Inspired by physics, PotER imbues each state with a potential energy function by introducing
an artificial potential field into experience playback. This dynamic technique efficiently
addresses the difficulties of difficult exploration tasks by enabling the agent to learn from a
variety of experiences. PotER's inherent state supervision adds to the agent's flexibility,
enabling it to function in difficult situations. The research demonstrates PotER's compatibility
with multiple RL algorithms and self-imitation learning, highlighting its adaptability. Through
extensive experimental evaluations of intricate exploration settings, such as mazes and difficult
games, the research presents PotER as a competitive and effective baseline for handling
difficult exploration tasks in reinforcement learning.
In deep reinforcement learning (DRL), this study [25] tackles the issues of sluggish
convergence and sensitivity to local optimal solutions. The Magnify Saltatory Reward (MSR)
algorithm introduces dynamic modifications to experience rewards in the pool, focusing on
situations with reward saltation. Through optimizing the agent's use of reward saltation
experiences, MSR seeks to accelerate network convergence and enable the achievement of
globally optimal solutions. The paper compares the performance of DRL algorithms, including
deep Q-network (DQN), double DQN, and dueling DQN, before and after adding MSR. The
experiments are carried out in a simulated obstacle avoidance search environment for an
unmanned aerial vehicle. The experimental results show that the addition of MSR to these
algorithms greatly speeds up network convergence, providing a possible solution to the
enduring problems of local optimal solutions and slow convergence in DRL. The potential of
MSR to enhance training results across multiple DRL algorithms is further highlighted by its
changeable parameters, which may be adjusted to suit varied contexts with reward saltation.
Agent exploration presents difficulties for deep reinforcement learning, especially when there
are few rewards, which results in low sample efficiency. In order to overcome these drawbacks,
the suggested technique, known as E2S2RND [26], uses an auxiliary agent training paradigm
and a feature extraction module that has been precisely designed to harness predicted errors
and produce intrinsic incentives. This novel method considerably enhances the agent's ability
to conduct thorough exploration in situations with sparse reward distribution by mitigating the
exploration dilemma induced by separation derailment and white noise interference. Analyses
and comparative tests conducted in the Atari 2600 experimental environment confirm the
effectiveness of the refined feature extraction module and demonstrate notable improvements
in performance in six chosen scenarios. The advent of E2S2RND represents a significant
paradigm change, offering a fresh approach to improving exploration challenges. To ensure
robust performance across different reward structures, more study is necessary to extend the
applicability of the approach to diverse contexts, as indicated by the performance difference
observed in situations with dense rewards. Subsequent research endeavors to expand the
approach's range, augmenting its efficacy in scenarios featuring sparse as well as dense
incentives.
The study investigates the use of deep reinforcement learning (DRL) in humanoid robot soccer,
a game in which a robot uses images from its onboard camera to interact with its surroundings
and learn various skills [27]. The work shows effective learning in a robotic simulator for
activities including walking towards the ball, taking penalties, and goalkeeping by utilizing the
Dueling Double DQN algorithm. Notably, the knowledge acquired in the simulator is
effectively applied to an actual humanoid robot, demonstrating how flexible the DRL
methodology is in many settings. The outcomes demonstrate the robot's proficiency in soccer-related activities and illustrate the potential of deep reinforcement learning (DRL) in teaching
sophisticated abilities to robot soccer players. The research highlights the significance of
simulation in the process of learning, enabling the robot to perform tasks thousands of times.
The effective transfer of learned models from simulation to an actual robot shows how DRL
may be used practically to build robotic capabilities for soccer-playing scenarios.
In deep reinforcement learning (DRL) and many other machine learning approaches, running
many learning agents in parallel is a common strategy for enhancing learning performance. In
the creation of these algorithms, the best strategy to organise the learning agents participating
in distributed search has been overlooked. The authors leverage insights from the literature on
networked optimization to demonstrate that arranging learning agents in communication
networks other than entirely connected topologies (the implicit way agents are frequently
arranged) may improve learning [28]. They compare the performance of four well-known
graph families and discover that one of them (Erdos-Renyi random graphs) outperforms the de
facto fully-connected communication architecture in a variety of DRL test workloads. They
also discover that 1000 learning agents grouped in an Erdos-Renyi graph can outperform 3000
learning agents organised in the typical fully-connected topology, highlighting the large
learning advantages that may be realised by carefully structuring the topology over which
agents interact. Theoretical research on why their alternative topologies work better is
supplemented by actual data.
Overall, their results suggest that optimising the communication architecture between learning
agents might help distributed machine learning algorithms perform better.
Deep reinforcement learning has shown to be an effective method in a range of artificial
intelligence challenges during the last several years. Deep reinforcement learning in multi-agent environments, rather than single-agent settings, has been the focus of recent research.
The main goal of this research is to provide a thorough and systematic overview of multi-agent
deep reinforcement learning methods, including issues and applications [29]. Basic
information is supplied first in order to have a better understanding of this subject. The relevant
structures and illustrative techniques are then described, along with a challenge taxonomy.
Finally, the future potential of multi-agent deep reinforcement learning applications is
highlighted.
Exploration is critical for effective deep reinforcement learning and has received a lot of
attention. Existing multi-agent deep reinforcement learning techniques, on the other hand,
continue to depend on noise-based approaches. Exploration strategies that take into
consideration the cooperation of several agents have just recently been developed. Existing
techniques, on the other hand, have a similar issue: agents struggle to identify states worth
investigating, and they seldom coordinate their exploration efforts in this direction. To address
this issue, the authors suggest cooperative multi-agent exploration (CMAE) [30], in which
agents collaborate while exploring to pursue a common goal. The target is chosen from many
projected state spaces using a normalised entropy-based technique. The agents are then taught
how to collaborate in order to attain the goal. They demonstrate that CMAE consistently
outperforms baselines on a range of problems, including a sparse-reward variation of the
multiple-particle environment (MPE) and the Starcraft multi-agent challenge (SMAC).
The outcomes of the literature review are as follows:

• Many efficient agent collision avoidance mechanisms are discussed in the literature, which provide cooperative relations between the agent and the enemy.
• Reinforcement learning uses a reward exchange mechanism to navigate between agents, which is also effective in other domains.
• Although many collision avoidance and reward mechanism methods have been proposed, they did not consider environmental factors such as collision with the enemy and the navigation of the agent in the given environment.
The overall description, methods, tools used and limitations in the literature review are shown
in Table 3.1.
Ref. | Description | Methods | Tools used | Limitations
[13] | Proposed method for collision avoidance in a complex environment | Dynamic Difficulty Adjustment | Unity 3D | Did not consider the environmental factors
[14] | Proposed a Deep Q-Network algorithm for the huge and complex state spaces that classical Q-Learning struggles to handle | DQN | Unity 3D | Failed to tackle a complex environment
[27] | Proposed a collective motion strategy for agents | DQN | Unity 3D | Centralized control of the agents
[28] | Proposed a distributed DRL approach for agent formation control | DDPG | Unity ML toolkit | Did not consider the enemy tackle factor during navigation
[29] | Proposed an approach for collision avoidance formation in multi-player games | PKURLOA | Unity ML-Agents | Did not consider the multi-player winning factors
[30] | Proposed an approach for the reward strategy in games | DQN | Unity 3D | Did not consider the environmental factors in AI-based games
4. Methodology
The project aims to investigate how well curriculum learning can enhance several elements of
a simple agent that learns to find its way through a maze in the fewest steps possible.
Curriculum learning could improve both the training and the evaluation outcomes.
During development, I encountered several issues with the Raycast Perception Sensor that
came with the ML-Agents toolbox. Consequently, I created an alternate version of the Agent,
which is also tested in this report. I used a tutorial by Joseph Hocking [31] to help me design
a learning environment. With the aid of this tutorial, I was able to construct a setting in
which the Agent finishes a predetermined number of training episodes, at which point a new
maze is generated. The number of episodes for this investigation was fixed at five, because it
is a manageable number that enables the Agent to retry each episode while still providing a
variety of mazes for the Agent to explore during training. An episode ends when the agent
touches the goal at the upper-right corner of the maze, or after 1,000 steps have been taken.
This maximum number of steps was selected because it gives the Agent ample freedom to explore
each episode and enables it to successfully navigate the maze several times. There are eight
Agent instances in the learning environment, each with its own generated maze.
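The episode rules above can be summarised in the following sketch; the agent and maze objects and their methods are hypothetical stand-ins for the Unity components used in practice, written only to make the termination conditions explicit.

MAX_STEPS = 1000         # step budget after which an episode is cut off
EPISODES_PER_MAZE = 5    # a new maze is generated after this many episodes
AGENT_INSTANCES = 8      # parallel agent instances, each with its own maze

def run_episode(agent, maze):
    """Run one training episode under the termination rules described above."""
    agent.reset(maze)                            # hypothetical reset method
    for step in range(MAX_STEPS):
        action = agent.act(maze.observe(agent))  # hypothetical observe/act interface
        maze.apply(agent, action)
        if maze.at_goal(agent):                  # goal sits at the upper-right corner
            return step + 1                      # the episode ends on reaching the goal
    return MAX_STEPS                             # otherwise it ends after 1,000 steps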
4.1 Agent
There are two Agent versions in the environment. The first one collects environmental
observations solely through the use of a Raycast Perception Sensor. This sensor is
configured to identify the goal object and the maze's walls along straight lines parallel
to the X and Z axes. Figure 3 provides an example of how this is configured as a vector
sensor on the agent.
Figure 3: Understanding of agent in 3D Maze game
After combining all of the observations, a single variable is created and divided by 255 to
normalise it. The use of the bitmask was motivated by a study conducted by Goulart, Paes,
and Clua [32], in which the authors experimented with several approaches to gather data
for an agent modelled after the video game Bomberman. This is the only information that
the Raycast-Perception-Sensor-based Agent gathers. Additionally, the bitmask-based Agent
records its current location relative to the surroundings. This positional data is not
normalised. Although the results might have been better otherwise, I concluded that the
benefits would not outweigh the difficulty of normalising the location data, given the
maze's changing size throughout curriculum learning.
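A small sketch of this observation scheme is given below; the specific bit assignments are hypothetical and only illustrate how axis-aligned wall and goal hits could be packed into one bitmask and normalised by 255, as described above. It is not the project's actual C# Agent code.

WALL_NORTH, WALL_SOUTH, WALL_EAST, WALL_WEST = 1, 2, 4, 8        # hypothetical flags
GOAL_NORTH, GOAL_SOUTH, GOAL_EAST, GOAL_WEST = 16, 32, 64, 128

def encode_observation(wall_hits, goal_hits):
    """wall_hits / goal_hits map 'north'/'south'/'east'/'west' to booleans."""
    wall_bits = {"north": WALL_NORTH, "south": WALL_SOUTH, "east": WALL_EAST, "west": WALL_WEST}
    goal_bits = {"north": GOAL_NORTH, "south": GOAL_SOUTH, "east": GOAL_EAST, "west": GOAL_WEST}
    flags = 0
    for direction, hit in wall_hits.items():
        if hit:
            flags |= wall_bits[direction]
    for direction, hit in goal_hits.items():
        if hit:
            flags |= goal_bits[direction]
    return flags / 255.0                          # single normalised observation in [0, 1]

# Example: walls detected to the north and east, goal visible to the east.
obs = encode_observation(
    {"north": True, "south": False, "east": True, "west": False},
    {"north": False, "south": False, "east": True, "west": False},
)
print(obs)  # 69 / 255, roughly 0.271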
5. Results and discussion
The aim of this chapter is to describe all the experimental results. In this chapter, we describe
all the graphs obtained from training the ML-Agents. This chapter also illustrates how our policy
performs better than other policies and how it achieves a high degree of control over the agent
in scenarios with enemies, as well as how the agent navigates the best path to the destination
while avoiding collisions with different obstacles and other agents.
5.1 Evaluated Parameters
This section provides a full explanation of the parameters that are investigated in order to
evaluate the effectiveness of the suggested model. In this research, we are going to be
assessing a number of different parameters, some of which are listed below.
5.1.1 Mean Reward
The first parameter that we examine is the mean reward. The mean reward is the average of
all of the rewards given to agents during each episode. During the training phase, the
performance of the proposed model is evaluated using this parameter as a measure of how well
it works. The reward values that make up the reward structure ultimately determine the mean
reward. A large reward value indicates that the agents have been trained to a significant
degree.
5.1.2 Policy Loss
The policy loss is used as a metric to evaluate how well the discriminator is doing. It
reflects how successfully our agents were able to replicate the actions of the expert
agents in the demonstrations. A low policy loss value indicates that agents carry out
their responsibilities successfully and properly imitate the expert behaviour. The
efficiency with which agents function in their environments is one of the primary factors
that determine the value of the policy loss.
5.1.3 Value Estimate
In addition to this, we make use of the value estimate to illustrate how well our
proposed model operates in settings involving several agents. During training, agents
visit several different states, and these value estimates represent the value of each
state. A high value estimate is evidence that the agents are well trained.
5.1.4 Policy Estimate
A base model is also implemented in the same conditions as the proposed model so that
the outcomes of both models may be verified. After training the agent using the basic
technique, the results are compared with those produced by the proposed model in order
to evaluate its effectiveness. The findings indicate that the proposed model performs
better than the base model in terms of the mean reward, the extrinsic reward, the policy
loss, and the value loss. In this section, the comparison between the base model and the
proposed model is shown.
5.1.5 Extrinsic Reward
This parameter is used to evaluate the agent's policy. This reward is given to the agent
by the policy; a high value indicates that the agent performed well. The parameter is a
numerical value applied at each time step, and its value is determined by the reward
assigned to each occurrence, such as colliding with a barrier in the room.
5.1.6 Episode Length
This metric is used to assess the agent's operational performance. The episode length
indicates how many time steps an agent can function without colliding with anything.
A larger value may indicate good performance in many scenarios.
5.1.7 Number of Collisions
This metric is used to evaluate the agent's performance in terms of the controlled room
environment. It helps determine how much control is achieved under various obstacle
situations.
5.2 Results
To validate the proposed model's outcomes, the base model is also applied in identical
conditions. The agents are trained using the base technique, and the results are compared
to the suggested model's performance. According to the results, the suggested model
outperforms the base model in terms of mean reward, extrinsic reward, policy loss, value
loss and the number of collisions. This section compares the proposed model to the baseline
model.
The value estimate is the first parameter used to measure the performance of the proposed
model. The graph depicts, for each state that an agent travels through during training,
the value predicted for that state. A better-trained agent can be identified by a greater
predicted value. The proposed approach provides more accurate estimates of the value of
the states in the value estimate graph, and has a more optimistic assessment of its value
than the base model does. On this particular parameter, the proposed model performs better
than the base model.
The value estimate graph is shown in figure 3.
Figure 3: Value estimate
The second parameter used to evaluate our proposed model is the episode length, shown in
figure 4. The duration of each episode is compared to the base model using the graph below,
which illustrates that the proposed model is superior to the base model in terms of
performance. Agents trained with the proposed model had longer episodes than agents trained
with the base model. The maximum episode length for the proposed model is 350, while the
maximum episode length for the base model is 10-15. This parameter demonstrates that the
value of the proposed model is greater than that of the base model, which indicates good
performance. The longer duration shows that the agents operate effectively for a longer
period of time, demonstrating their high level of performance.
Figure 4: Episode length
The third parameter utilized to evaluate our proposed model is the policy loss. The policy
variations that occurred throughout the training phase are shown in figure 5. The value in
the graph should decrease over time, since the agent should spend less time changing the
policy as training progresses. Our results demonstrate that until 1.1 million timesteps, the
proposed model takes less time to modify the policy, and the graph then decreases as the
number of steps increases. After 1.1 million timesteps, the performance of the proposed model
and the base model is the same. This parameter represents the time required to change the
policy; our model performs better at lower timesteps, whereas both models take almost the
same time to modify the policy and discover the optimal policy at higher timesteps.
Figure 5: Policy loss
The value loss is another parameter used to assess the performance of the proposed
model, as shown in figure 6. This graph shows how accurately the value is estimated
for each policy implemented by the agent throughout the training phase. The loss should
decrease as the timesteps increase. The graph illustrates that the loss was larger at the
start, but it dropped after 250k timesteps, and the whole training process proceeded
smoothly. The value loss of the proposed model is greater than that of the base model.
Figure 6: Value loss
According to this graph, the proposed model outperforms the base model. The reward for
agents trained with the proposed model was larger than for agents trained with the base
model. The greatest value of the mean reward obtained by the proposed model is -0.42,
whereas the maximum value obtained by the base model is -0.5. This parameter indicates
that the proposed model's value is greater than the base model's.
Figure 7: Cumulative reward
A comparison of the cumulative reward between the proposed model and the base model is
shown in figure 7.
The reward that is given to the agent that chooses the optimal policy is referred to as an
extrinsic reward, as shown in figure 8. This reward begins at -0.42 and gradually
increases until it reaches -0.32 as the number of time steps increases.
Figure 8: Extrinsic reward
In contrast, the reward obtained by the base model increases from -0.5 to -0.48 as the number
of timesteps rises. This parameter shows that the value of the proposed model is higher
than that of the base model, which means that the proposed model performed better
than the base model.
6. Conclusion
Deep reinforcement learning has numerous applications in the real world, such as robotic
locomotion, healthcare systems, self-driving cars, and video games. DRL is used in many
games to make them more user-friendly and interesting, with human-like behaviour, and it is
being utilised to build multiple games. Previous research offered strategies for agent
navigation, collaboration, and cooperation in famous real-life games, but existing research
does not take into consideration many environmental elements used in games. Environmental
factors such as agent behaviour and the enemy have a negative impact on agent control. This
work provided a model to improve and enhance the behaviour of the agent in the game
environment. This report takes three environmental circumstances into account: agent
behaviour, reward policy, and the enemy. The proposed model is implemented in Unity 3D,
renders all the mentioned environmental factors through Unity's built-in components, and is
evaluated in these settings. The proposed model trained the 3D Maze game agent in the
presence of these environmental factors using the Generative Adversarial Imitation Learning
and Proximal Policy Optimization algorithms. All the demonstrations were recorded; these
demonstrations are comprised of state-action pairs and are passed to Generative Adversarial
Imitation Learning (GAIL) as a hyper-parameter. The agent employs the expert demonstrations
to boost control in various environmental settings. A convolutional neural network with three
layers is used to implement Proximal Policy Optimization (PPO) as the trainer. Moreover, with
slightly different and additional hyperparameters, a GAN is used for the implementation of
GAIL. The performance metrics of mean reward, extrinsic reward, episode duration, policy
loss, value loss, value estimate, and number of collisions with obstacles are utilised to
evaluate the performance of the proposed model against the base model. In the mean reward,
extrinsic reward, value estimate, episode duration, and number of collisions metrics, the
proposed model performed well and beat the base model. This demonstrates how the proposed
approach provides the agent with good control. The results show that learning through expert
behaviour is very useful for increasing control in different environmental conditions.
7. References
[1] Szita, István. "Reinforcement learning in games." Reinforcement Learning: State-of-the-art. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2012. 539-577.
[2] Hu, Zhipeng, et al. "Promoting human-AI interaction makes a better adoption of deep reinforcement learning: a
real-world application in game industry." Multimedia Tools and Applications (2023): 1-22.
[3] Teja, Koyya Vishnu, and Mallikarjun M. Kodabagi. "Intelligent Path-Finding Agent in Video Games Using
Artificial Intelligence." International Journal of Research in Engineering, Science and Management 6.6 (2023): 69-73.
[4] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects”, Science, vol. 349, no.
6245, pp. 255–260, 2015.
[5] Sarker, Iqbal. (2021). Machine Learning: Algorithms, Real-World Applications and Research Directions. SN
Computer Science. 2. 10.1007/s42979-021-00592-x.
[6] B. Liu, “Supervised learning”, in Web data mining, Springer, 2011, pp. 63–132
[7] I. C. Education. “Deep learning”. (2020), [Online]. Available: https://www.ibm.com/cloud/learn/deep-learning
(visited on 05/13/2022).
[8] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[9] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation
platform for general agents”, Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.
[10] A. Juliani, V.-P. Berges, E. Teng, et al., “Unity: A general platform for intelligent agents”, arXiv preprint
arXiv:1809.02627, 2018.
[11] “Unity’s terms of service”. (2022), [Online]. Available: https://unity3d.com/legal/terms-of-service/software
(visited on 03/23/2022).
[12] Unity Technologies. “Unity ml-agents python low level api”. (2021), [Online]. Available: https://github.com/Unity-Technologies/ml-agents/blob/release_18_docs/docs/Python-API.md (visited on 05/09/2022).
[13]. Reis, S., Reis, L. P., & Lau, N. (2021). Game adaptation by using reinforcement learning over meta
games. Group Decision and Negotiation, 30(2), 321-340.
[14]. Devo, A., Costante, G., & Valigi, P. (2020). Deep reinforcement learning for instruction following visual
navigation in 3D maze-like environments. IEEE Robotics and Automation Letters, 5(2), 1175-1182.
[15]. Shafti, A., Tjomsland, J., Dudley, W., & Faisal, A. A. (2020, October). Real-world human-robot collaborative
reinforcement learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp.
11161-11166). IEEE.
[16]. Tan, R., Zhou, J., Du, H., Shang, S., & Dai, L. (2019, May). An modeling processing method for video games
based on deep reinforcement learning. In 2019 IEEE 8th joint international information technology and artificial
intelligence conference (ITAIC) (pp. 939-942). IEEE.
[17]. Kim, K. (2022). Multi-agent deep Q network to enhance the reinforcement learning for delayed reward
system. Applied Sciences, 12(7), 3520.
[18]. Coppens, Y., Bargiacchi, E., & Nowé, A. (2019, August). Reinforcement learning 101 with a virtual reality
game. In Proceedings of the 1st international workshop on education in artificial intelligence K-12.
[19]. Gao, P., Liu, Z., Wu, Z., & Wang, D. (2019, December). A global path planning algorithm for robots using
reinforcement learning. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO) (pp. 1693-1698). IEEE.
[20]. Zheng, Y., Xie, X., Su, T., Ma, L., Hao, J., Meng, Z., ... & Fan, C. (2019, November). Wuji: Automatic online
combat game testing using evolutionary deep reinforcement learning. In 2019 34th IEEE/ACM International
Conference on Automated Software Engineering (ASE) (pp. 772-784). IEEE.
[21]. Andersen, P. A., Goodwin, M., & Granmo, O. C. (2021). Increasing sample efficiency in deep reinforcement
learning using generative environment modelling. Expert Systems, 38(7), e12537.
[22]. Wang, J., Elfwing, S., & Uchibe, E. (2021). Modular deep reinforcement learning from reward and punishment
for robot navigation. Neural Networks, 135, 115-126.
[23]. Song, S., Weng, J., Su, H., Yan, D., Zou, H., & Zhu, J. (2019, August). Playing FPS Games With Environment-Aware Hierarchical Reinforcement Learning. In IJCAI (pp. 3475-3482).
[24]. Raza, S. A., & Williams, M. A. (2020). Human feedback as action assignment in interactive reinforcement
learning. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 14(4), 1-24.
[25]. Zhao, E., Deng, S., Zang, Y., Kang, Y., Li, K., & Xing, J. (2021, January). Potential driven reinforcement
learning for hard exploration tasks. In Proceedings of the Twenty-Ninth International Conference on International
Joint Conferences on Artificial Intelligence (pp. 2096-2102).
[26]. Hu, Z., Wan, K., Gao, X., & Zhai, Y. (2019). A dynamic adjusting reward function method for deep
reinforcement learning with adjustable parameters. Mathematical Problems in Engineering, 2019, 1-10.
[27]. Li, Y., Guo, T., Li, Q., & Liu, X. (2023). Optimized Feature Extraction for Sample Efficient Deep Reinforcement
Learning. Electronics, 12(16), 3508.
[28]. D. Adjodah, D. Calacci, A. Dubey, A. Goyal, P. Krafft, E. Moro, and A. Pentland, “Leveraging communication topologies between learning agents in deep reinforcement learning.” in AAMAS, 2020, pp. 1738–1740.
[29]. W. Du and S. Ding, “A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications,” Artificial Intelligence Review, vol. 54, no. 5, pp. 3215–3238, 2021.
[30]. I.-J. Liu, U. Jain, R. A. Yeh, and A. Schwing, “Cooperative exploration for multi-agent deep reinforcement learning,” in International Conference on Machine Learning. PMLR, 2021, pp. 6826–6836.
[31] “Training configuration file”. (2021), [Online]. Available: https://github.com/Unity-Technologies/ml-agents/blob/release_18_docs/docs/Training-Configuration-File.md (visited on 05/08/2022).
[32] “Using tensorboard to observe training”. (2022), [Online]. Available: https://github.com/Unity-Technologies/ml-agents/blob/release_18_docs/docs/Using-Tensorboard.md.