Achieving Pattern Stabilisation in Conway’s Game of Life Through Reinforcement Learning

Nnadi Praise
20519328
Abstract—To explore the fascinating intersection between
artificial intelligence and cellular automata, this study applies
reinforcement learning to Conway’s Game of Life. The goal of
this study is to design an intelligent agent capable of achieving
deterministic pattern stabilisation within this complex system.
We developed a custom environment and employed a Deep
Q-Learning model to enable our agent to transition a starting
‘block’ pattern into a ‘blinker’. Despite the simplicity of its
operational rules, the Game of Life displays complex emergent
behaviours that present significant challenges in achieving
consistent pattern stabilisation. Our findings indicate that
while the agent is capable of achieving the target pattern, it
does so with very limited success, achieving a maximum
recorded accuracy of 1.2% in 500 episodes. Future efforts will
focus on refining model parameters and exploring alternative
algorithms to enhance the agent’s learning capabilities.
Keywords—Cellular Automata, Conway’s Game of Life, Reinforcement Learning, Deep Q-Learning, Pattern Stabilization, Neural Networks, Artificial Intelligence in Games.
I. INTRODUCTION (AND RESEARCH QUESTIONS)
Conway’s Game of Life is a zero-player game developed by
British mathematician John Conway in 1970. It is a
profound example of cellular automata that continues to
fascinate researchers across various fields like computer
science, physics, and artificial intelligence[1][2][3]. The
game operates on a set of simple rules that are applied to a
grid of cells. Each cell in the grid can be in one of two
states: alive or dead. Once applied, the rules simulate the
birth, survival, and death of cells and lead to complex
patterns and behaviours emerging from straightforward
initial conditions[2][3].
A. OVERVIEW OF CONWAY’S GAME OF LIFE
● Cellular Automata and Grid: The Game of Life
is played on an infinite two-dimensional grid of
square cells. Each cell interacts with its eight
neighbours (horizontal, vertical, and diagonal)
using a set of three rules based on the number of
live neighbours. Figure 1 shows the neighbourhood
radius of a single cell. The central black cell
represents the focal cell. The eight surrounding white cells are its neighbours, i.e. the cells considered to determine the focal cell’s next state.
● Rules: The rules of the game are listed below:
- A live cell with fewer than two or more than three live neighbours dies in the next generation[2]. These rules are a toy imitation of underpopulation and overpopulation, respectively.
- A live cell with two or three live neighbours survives to the next generation[2].
- A dead cell with exactly three live neighbours becomes a live cell[2]. This rule simulates reproduction.
● Emergent Patterns: The simple rules outlined above result in emergent behaviour and high complexity. This allows various phenomena to occur such as stable ‘blocks’, oscillating ‘blinkers’, and moving ‘gliders’ which can simulate logical operations and data transport within the grid[1][3]. In fact, the resulting game is Turing complete[4].
Fig 1: Illustration of the Neighbourhood Radius in Conway’s Game of Life.
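To make the three rules concrete, the short sketch below applies them for a single generation on a finite grid using NumPy. It is an illustrative implementation written for this paper rather than the project's exact code; cells beyond the grid edge are simply treated as dead.

import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """Apply Conway's three rules for one generation; cells outside the grid count as dead."""
    padded = np.pad(grid, 1)  # zero-pad so edge cells still have 8 neighbour positions
    neighbours = sum(
        padded[1 + dy : padded.shape[0] - 1 + dy, 1 + dx : padded.shape[1] - 1 + dx]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)
    )
    survive = (grid == 1) & ((neighbours == 2) | (neighbours == 3))  # survival rule
    born = (grid == 0) & (neighbours == 3)                           # reproduction rule
    return (survive | born).astype(int)                              # everything else dies

# A vertical blinker becomes a horizontal one after a single generation.
blinker = np.zeros((5, 5), dtype=int)
blinker[1:4, 2] = 1
print(life_step(blinker))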
B. INITIAL PROJECT IDEA
The initial concept of the project was to perform a
comparative analysis of two types of agents within the game
of life environment: a rule-based agent and a reinforcement
learning agent. Both agents would be tasked with the goal of
stabilising a desired pattern in the game, specifically
transforming a ‘block’ pattern into a ‘blinker’. A block
pattern, as seen in figure 2, is called a still life pattern in
Conway’s Game of Life. This is because applying the
game’s rules results in no change in the block pattern on any
generation. On the other hand, the rules of the game cause
the blinker pattern to oscillate between horizontal and
vertical forms. The transition from block to blinker was to
be accomplished by allowing each agent to make a change
to a single cell per generation before the game’s rules
processed the resulting grid. Unlike traditional
implementations of Conway’s Game of Life that often use
an infinite grid, our environment would use a finite grid
with dimensions specially tailored for this research.
Fig 2: Illustration of block pattern (left) and blinker pattern
(right).
C. FINAL PROJECT IDEA
As the project was developed, the core idea evolved to focus
on enhancing and analysing the reinforcement learning
agent within the environment. This adjustment was
influenced by insights gained during the initial development
phase that revealed the outcomes of the rule based agent
could be predetermined once effective rules were
established. Figure 3 below shows one potential solution for
transforming a block pattern into a blinker pattern across
four generations.
Fig 3: One approach a rule-based agent might take to transform a block to a blinker pattern.
In figure 3, the grid state represents the initial state of the grid at the start of each generation. The action represents the agent’s decision for that generation. Green cells signify a cell being revived, red cells signify a cell being killed, and ‘nothing’ means the agent neither kills nor revives a cell in that generation. Finally, the new state represents the resulting grid after the game's rules have been applied to the pattern left behind by the agent’s action.
This solution demonstrates that a rule-based agent can successfully achieve the desired blinker pattern in four generations. Thus it seemed more meaningful to explore how the reinforcement learning agent adapted to the environment and pursued its goal of pattern stabilisation.
The revised project objective, then, was to evaluate the agent’s performance in achieving the blinker pattern, comparing its effectiveness against the theoretical optimal results which a rule-based agent could achieve under similar conditions. In other words, could a reinforcement learning agent consistently achieve the desired pattern in four generations or less?
II. LITERATURE REVIEW
Conway’s Game of Life, which is more simulation than
game, has been a fertile ground for exploring the
capabilities of computational models, particularly neural
networks, because of its simple rules that lead to complex
emergent behaviours.
A significant body of research has utilised neural networks
to explore pattern generation and evolution in the Game of
Life. For instance, an evolutionary approach combined with
neural networks has been employed to develop populations
of networks that evolve towards generating complex,
symmetrical patterns using a complexity function as a
measure of fitness[1]. Unlike the broader approach taken in
the reviewed literature, which primarily focuses on evolving
networks to discover or generate patterns, our project
uniquely targets the precise stabilisation of a specific
pattern—transforming a 'block' into a 'blinker' in Conway's
Game of Life using a reinforcement learning agent. While
we employ neural networks as part of our methodology, our
objective diverges from the general exploratory nature of
previous studies. We do not make use of evolutionary
algorithms to generate variability and novelty but instead
optimise our neural network to consistently reach a
predefined and desired state.
Despite their predictive power, however, research also
shows that neural networks face challenges in learning
deterministic rules in the Game of Life. Dickson et al.
(2020) document an experiment where researchers initially
crafted a small convolutional neural network and manually
tuned it to predict the sequence of cell state changes within
the game's grid. By achieving this, the researchers showed
that there is a minimal neural network that could indeed
represent the rules of the Game of Life effectively[2][5].
However, when these networks were re-trained from scratch
with randomly initialised parameters, achieving optimal
settings proved impossible. The researchers embarked on
training with 1 million randomly generated game states.
Their goal was for the neural network to learn and
parameterize the game's rules independently. Theoretically,
perfect learning would mean the network's parameters
would converge to the manually tuned values. In practice,
this rarely occurred. More often, the network failed to find
the optimal parameter settings, and its performance
degraded as the complexity of the task increased with
additional steps[2][5]. Consequently, the initial weights and
the selection of training examples substantially impact the
learning outcomes when training a neural network for
deterministic systems like the Game of Life. The "Lottery
Ticket Hypothesis" proposes that successful training of
neural networks often depends on fortuitous initial weight
configurations, which might explain the seemingly random
success in networks learning the Game of Life's rules[2][5].
The aforementioned challenge faced by neural networks
when it comes to learning deterministic rules in Conway’s
Game of Life directly impacts our goal of achieving a
desired pattern under a certain number of generations.
Research indicates that the success of such neural networks
heavily depends on the initial parameter settings. Optimal
configurations are similar to lottery tickets- desirable but
rare. Hence, achieving pattern stabilisation perfectly may
require meticulous tuning of network parameters and
possibly a more complex network architecture so the model
masters the precise sequences of actions needed for the
desired pattern transformation.
III. METHODOLOGY
This section details the framework used in our project to
explore the stabilisation of the desired blinker pattern.
A. ENVIRONMENT SET UP
The environment for our project is a custom-built class
called Environment, designed to simulate Conway’s Game
of Life. It was developed largely from scratch, building upon a tutorial which is credited in the environment script.
Significant modifications were made to the basic tutorial
code to create an environment better suited for our project.
Here are the key features and functions of the Environment:
- Initialise Grid: The environment initialises a finite grid of
dimensions specified via customizable width and height
variables. This grid holds and is used to display the various
patterns resulting from agent actions and the effects of the
game’s rules. The grid is initialised with a block pattern at
its centre.
- Initialise Target Pattern: The environment stores the
desired pattern in a separate grid and contains functions to
initialise the desired pattern. In the case of a blinker which
can either be horizontal or vertical, there are two target
patterns. Either pattern is valid.
- Enable Agent: The environment contains a method that
effects the agent's desired change. It takes the agent’s
decision to kill a cell, revive a cell or do nothing and
performs it, allowing the grid to be influenced by the agent.
- Reward Agent: The environment contains a reward
system to guide the reinforcement learning process. The
environment assesses the agent's performance based on the
accuracy of achieving the desired patterns, the efficiency of
its actions, and proximity to the goal within a set number of
steps. It not only rewards the correct formation of patterns
but also penalises unnecessary or counterproductive actions
like killing a cell that is already dead.
- Update Grid: The environment is also responsible for applying the rules of the game to calculate the next state of each cell based on its neighbours. This feature also aids the visualisation of the environment.
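The sketch below illustrates how such an environment might be organised. It is a simplified outline, not the project's exact API: the method names, the default grid dimensions, and the specific reward values are assumptions, and it reuses the life_step function from the earlier Game of Life sketch to apply the rules.

import numpy as np

class Environment:
    """Finite-grid Game of Life environment that starts from a centred block pattern."""

    def __init__(self, width: int = 4, height: int = 4):
        self.width, self.height = width, height
        self.grid = np.zeros((height, width), dtype=int)
        r, c = height // 2 - 1, width // 2 - 1
        self.grid[r:r + 2, c:c + 2] = 1                      # initial 2x2 block
        self.targets = self._blinker_targets()               # horizontal or vertical blinker

    def _blinker_targets(self):
        h = np.zeros((self.height, self.width), dtype=int)
        v = np.zeros((self.height, self.width), dtype=int)
        h[self.height // 2, self.width // 2 - 1 : self.width // 2 + 2] = 1
        v[self.height // 2 - 1 : self.height // 2 + 2, self.width // 2] = 1
        return [h, v]

    def apply_action(self, action: int):
        """Apply the agent's edit (kill / revive / do nothing), then the game's rules."""
        n = self.width * self.height
        reward = 0.0
        if action < n:                                       # first segment: kill cell `action`
            if self.grid.flat[action] == 0:
                reward -= 1.0                                # penalise killing an already-dead cell
            self.grid.flat[action] = 0
        elif action < 2 * n:                                 # second segment: revive a cell
            if self.grid.flat[action - n] == 1:
                reward -= 1.0                                # penalise reviving a live cell
            self.grid.flat[action - n] = 1
        # index 2 * n is the neutral "do nothing" action
        self.grid = life_step(self.grid)                     # rules from the earlier sketch
        done = any(np.array_equal(self.grid, t) for t in self.targets)
        reward += 10.0 if done else -0.1                     # illustrative reward shaping
        return self.grid.copy(), reward, done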
B. DEEP Q-LEARNING MODEL
Our deep learning model was a class, DeepQNetwork,
inheriting from PyTorch’s neural network module. We
selected Deep Q-Learning as the reinforcement learning
method for this project because we needed a learning
approach that enabled decision-making in a complex and
dynamic environment. Given the goal of stabilising a
specific pattern in Conway's Game of Life, the agent needs
to accurately predict the outcomes of a wide range of
possible actions from any given state. Deep Q-Learning
addresses this requirement by enabling the agent to estimate
the potential reward for each possible action, thus
facilitating informed decision-making that aims to maximise
long-term rewards. We discuss our model architecture
below:
- Input Layer: The input to the model is the grid from
Conway’s Game of Life, converted into a flattened list
where each element represents the state of a cell: 1 for alive
and 0 for dead. The size of the input layer, therefore, is the
grid width multiplied by the grid height.
- Output Layer: The output layer has size (2 × input size) + 1. This structure allows the network to decide between 2 types of actions for each cell and 1 neutral action, ‘do nothing’, that affects all cells. The first segment of the output layer, indices [0, input size − 1], corresponds to actions where the model suggests killing a particular cell. The second segment, indices [input size, (2 × input size) − 1], corresponds to actions where the model suggests reviving a particular cell. The last neuron represents the “do nothing” action. The output layer uses no activation function, so the model’s decision is derived from the neuron with the highest value.
- Hidden Layers: The model has two hidden layers which make use of a ReLU activation function and have 256 neurons each.
- Training: The model has a function to train the agent by
making use of the Bellman Equation to iteratively update its
network weights. First it predicts Q-values for all possible
actions from the current state using its forward propagation
function. The agent then selects an action based on the
highest predicted Q-value. Next, target Q-values are calculated using the Bellman Equation:
Q_target(s, a) = r + γ max_a' Q(s', a')
Where:
● s is the current state,
● a is the action taken,
● r is the reward received,
● s’ is the new state after taking action a,
● a’ represents all possible actions from the new state, s’, and
● γ is the discount factor that weighs the importance of future rewards.
- Loss Function and Optimization: The model uses a mean squared error loss function and the Adam optimiser. These were chosen for their ubiquity rather than because they are necessarily the best fit for our model. To update the network weights, we minimise the loss between the predicted Q-values and the target Q-values:
Loss = MSE(Q_predicted(s, a), Q_target(s, a))
C. REINFORCEMENT LEARNING AGENT
The reinforcement learning agent in our project was
designed to interact with and learn from the environment of
Conway's Game of Life, with the ultimate goal of stabilising
specific patterns by manipulating grid cells. It was a custom
class, RL_Agent, built largely from scratch by extending another tutorial which is credited in the source code. The agent made use of the deep Q-network to make
decisions and select actions. Below are the key features of
the agent:
- Exploration vs. Exploitation: The agent employs an
epsilon-greedy strategy to balance exploration with
exploitation.
- Long Term Memory: The agent maintains a memory of
previous state transitions, up to a specified maximum
capacity, to enable it to learn from past experiences. This
memory stores tuples of states, actions, rewards, and
subsequent states.
- Decision Making: The agent makes decisions based on the state of the grid. By calling the forward propagation function of its deep Q-network, the agent gets a vector of Q-values. The agent then uses the index of the maximum Q-value to determine whether to kill a cell, revive a cell, or do nothing.
- Training: After each action, the agent immediately learns from the outcome to adjust its strategy. Each time the goal is
achieved, however, the agent revisits all past experiences in
batches - up to a certain limit - to further refine its strategy.
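The sketch below illustrates how these features might fit together: epsilon-greedy selection, a bounded replay memory, and batch replay that reuses the train_step function from the previous sketch. The epsilon schedule, memory capacity, and batch size are assumptions, not the project's tuned values.

import random
from collections import deque
import torch

class RL_Agent:
    """Epsilon-greedy agent with a bounded replay memory over a DeepQNetwork."""

    def __init__(self, model, max_memory: int = 100_000, epsilon: float = 1.0, epsilon_decay: float = 0.995):
        self.model = model
        self.memory = deque(maxlen=max_memory)        # (state, action, reward, next_state, done)
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

    def select_action(self, state: torch.Tensor, n_actions: int) -> int:
        """Explore with probability epsilon, otherwise exploit the highest predicted Q-value."""
        if random.random() < self.epsilon:
            return random.randrange(n_actions)
        with torch.no_grad():
            return int(self.model(state).argmax())

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, optimiser, batch_size: int = 64):
        """Revisit a batch of past experiences, e.g. each time the goal is achieved."""
        batch = random.sample(list(self.memory), min(batch_size, len(self.memory)))
        for state, action, reward, next_state, done in batch:
            train_step(self.model, optimiser, state, action, reward, next_state, done)
        self.epsilon *= self.epsilon_decay            # gradually shift from exploration to exploitation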
D. SYSTEM INTERACTION
Figure 4 shows a high level overview of the interactions
within the reinforcement learning system. It shows the roles
and sequential operations of the agent, the environment, and
the deep Q-learning model. The agent decides on actions
based on the current state of the grid, the model predicts the
outcomes of actions, and the environment updates the grid
and calculates rewards. These interactions facilitate the
training process so that the model continually adjusts its
parameters based on feedback from the environment to
improve the agent’s decision making.
Fig 4: Interaction Flowchart of the Reinforcement Learning System.
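As a rough illustration of this interaction loop, the sketch below runs a single episode by chaining the Environment, RL_Agent, and train_step sketches from the previous sections. The four-generation cap anticipates the experiment design described next; all other details are assumptions.

import torch

def run_episode(env, agent, optimiser, max_generations: int = 4):
    """Run one episode: the agent edits a cell, the rules update the grid, the model learns."""
    state = torch.tensor(env.grid.flatten(), dtype=torch.float32)
    n_actions = 2 * env.width * env.height + 1
    for _ in range(max_generations):
        action = agent.select_action(state, n_actions)
        grid, reward, done = env.apply_action(action)        # agent edit followed by the game's rules
        next_state = torch.tensor(grid.flatten(), dtype=torch.float32)
        agent.remember(state, action, reward, next_state, done)
        train_step(agent.model, optimiser, state, action, reward, next_state, done)  # immediate learning
        state = next_state
        if done:
            agent.replay(optimiser)                          # batch replay once the blinker is reached
            return True
    return False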
IV. EXPERIMENT DESIGN
The agent operates within a structured environment and its
aim is to achieve the desired pattern within four generations,
based on the theoretical possibility identified earlier. To
rigorously test the agent's learning and performance, we
have set up a systematic experiment where we vary the
number of episodes and measure the agent's accuracy in
achieving the target pattern.
A. EXPERIMENT CONFIGURATION
Below, we outline some key parts of the experiment
configuration:
- Environment Setup: The experiments are conducted on a
simplified 4 x 4 grid. This environment is a significant
reduction from the initial 40 x 40 grid. However, this
reduction was necessary due to the poor initial performance
on the larger grid where the sheer complexity of the action
space, we believe, overwhelmed the agent’s ability to learn
effective strategies. The smaller grid size also increases the
chance of random successful actions, which can then be
reinforced through learning.
- Experiment Structure: An Experiment is made up of a
series of episodes, each consisting of up to four generations.
Episodes do not exceed four generations because we know
the desired pattern can potentially be achieved within four
generations. This ‘game over’ condition means the agent is
constrained to not just achieving the pattern but doing so
efficiently within the acceptable number of generations.
- Variable Episode Sizes: The experiment loops over a
range of episode sizes to also examine how the number of
episodes affects the agent’s accuracy or learning curve.
Sizes tested were [10, 25, 50, 100, 250, 500, 1000, 2500,
5000, 10000, 25000, 50000, 100000] episodes. Larger
episode sizes could not be tested due to the intense
computation times involved. However, this range felt
extensive enough to identify the optimal number of episodes
for efficient learning.
- Accuracy Measurement: Accuracy is measured as the percentage of episodes in which the agent successfully stabilises the desired pattern out of the total number of episodes allocated to that experiment. For instance, the agent has an accuracy of 10% if it achieved the desired pattern once in ten episodes.
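The sketch below ties the configuration together: it loops over episode sizes, runs the episode loop from the previous sketch, and reports accuracy as the percentage of successful episodes. The grid size, learning rate, and the shortened list of episode sizes shown are illustrative assumptions.

import torch

def run_experiment(n_episodes: int, width: int = 4, height: int = 4, lr: float = 0.001) -> float:
    """Train a fresh agent for n_episodes and return its accuracy as a percentage."""
    model = DeepQNetwork(width, height)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    agent = RL_Agent(model)
    successes = 0
    for _ in range(n_episodes):
        env = Environment(width, height)                 # each episode restarts from the block pattern
        successes += run_episode(env, agent, optimiser)
    return 100.0 * successes / n_episodes                # % of episodes that stabilised a blinker

# A shortened version of the episode-size sweep described above.
for size in [10, 25, 50, 100, 250, 500]:
    print(size, run_experiment(size))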
V. RESULTS
Fig 5: Agent Training Accuracy by Episode Size.
Figure 5 is a plot of the accuracy score achieved by the agent over the varying episode window sizes. From the plot we can determine two things. First, our model is not very accurate. This is not the exciting answer we set out to find; the truth is that the current model rarely achieves the desired pattern in four generations or less. What to do about this is covered in the discussion and conclusion sections. The second observation has to do with a seemingly ‘optimal’ episode size.
A. OPTIMAL EPISODE SIZE
Among the tested episode sizes, 500 episodes consistently
yielded the highest accuracy of 1.2% in two consecutive
experiments.
B. EPISODE SIZE AND MODEL PERFORMANCE
From figure 6 we can see that while 50 and 100 episodes seem to give good results, those values are outliers given that the majority of the episode scores in those windows are zero. Larger episode sizes, however, generally tend to be more consistent and yield higher scores, although this improvement is not linear, as seen in figure 5: accuracy tended to plateau or even decrease slightly beyond a certain point.
Fig 6: Box Plot of Model Accuracy per Episode Size over Five Experiment Iterations.
C. HYPERPARAMETER TUNING EXPERIMENTS
Focusing on an episode size of 500, we ran further experiments, tweaking some aspects of the model configuration in the hope of improving its accuracy. Adjustments to the neural network configuration, such as reducing hidden layer sizes from 256 to 128, did not improve performance; rather, they resulted in a decrease in accuracy comparable to what was achieved at 250 episodes with larger hidden layers. Increasing the hidden layer size to 512 did not yield any substantial improvement in accuracy.
In addition, altering the learning rate had mixed effects. A
significant increase in the learning rate to 0.1 mirrored the
effect of reducing the hidden layer size, while finer
adjustments (to 0.01 and 0.001) did not produce a noticeable
change in outcomes.
VI. DISCUSSION
The results suggest that while increasing the number of
episodes generally provides the agent with more learning
opportunities, which can lead to higher accuracies, there
appears to be an optimal point (in this case, 500 episodes)
beyond which additional episodes do not necessarily equate
to better performance. This plateau might hint at the agent
having reached the limits of what it can learn from the
environment with its current configuration.
The results also reaffirmed the sensitivity of models to their
architectural and learning rate configurations. Reducing the
complexity of the model (by decreasing hidden layer sizes)
or dramatically increasing the learning rate resulted in
reduced performance. This shows that there is a delicate
balance in the network design and learning parameters that
optimises learning outcomes.
Interestingly, larger hidden layers did not improve performance, which might mean that 256 units per layer already provided sufficient model capacity for the task.
Finally, the consistent lack of improvement across various adjustments to learning rates and model sizes suggests there remain inefficiencies in the learning setup to be corrected, such as further tuning of the exploration-exploitation balance or the reward structure.
VII. CONCLUSION AND FUTURE WORK
This study on the application of reinforcement learning to
Conway’s Game of Life demonstrates the potential and
limitations of neural networks in achieving specific pattern
stabilizations within a finite grid environment.
Despite the adaptation of a deep Q-learning approach to
transform a 'block' pattern into a 'blinker', the results
indicate that achieving high accuracy within a desired
number of generations poses a few challenges.
Our experiments reveal that a reinforcement learning agent,
when trained over 500 episodes, reached an optimal
performance threshold. Beyond this, increases in episode
count did not significantly enhance the agent’s accuracy.
This suggests a saturation in learning potential under the current model configuration.
Furthermore, modifications in the network architecture and
learning rates showed that there is more precise fine tuning
required to optimise learning, with neither increased
complexity (via larger hidden layers) nor higher learning
rates immediately leading to improved outcomes.
For future work, there is a clear need to perform more fine
tuning on the model, including adjustments to the reward
structure and exploration-exploitation balance. In addition,
genetic algorithms could be integrated, which might help to overcome the learning plateaus observed in this study.
VIII. REFERENCES
[1] (No date) Watch AI evolve patterns in Conway's Game of Life. Available at: https://www.toolify.ai/ai-news/watch-ai-evolve-patterns-in-conways-game-of-life-27843 (Accessed: 01 May 2024).
[2] Dickson, B. (2020) Why neural networks struggle with the Game of Life, TechTalks. Available at: https://bdtechtalks.com/2020/09/16/deep-learning-game-of-life/ (Accessed: 01 May 2024).
[3] Vayadande, K. et al. (2022) ‘Simulation of Conway’s Game of Life using cellular automata’, International Research Journal of Engineering and Technology (IRJET), 9(1).
[4] Rendell, P. (2015) ‘Game of Life Universal Turing Machine’, Turing Machine Universality of the Game of Life, pp. 71–89. doi:10.1007/978-3-319-19842-2_5.
[5] Springer, J.M. and Kenyon, G.T. (2021) ‘It’s hard for neural networks to learn the game of life’, 2021 International Joint Conference on Neural Networks (IJCNN) [Preprint]. doi:10.1109/ijcnn52387.2021.9534060.