Achieving Pattern Stabilisation in Conway's Game of Life Through Reinforcement Learning

Nnadi Praise 20519328

Abstract—To explore the fascinating intersection between artificial intelligence and cellular automata, this study applies reinforcement learning to Conway's Game of Life. The goal of this study is to design an intelligent agent capable of achieving deterministic pattern stabilisation within this complex system. We developed a custom environment and employed a Deep Q-Learning model to enable our agent to transition a starting 'block' pattern into a 'blinker'. Despite the simplicity of its operational rules, the Game of Life displays complex emergent behaviours that present significant challenges in achieving consistent pattern stabilisation. Our findings indicate that while the agent is capable of achieving the target pattern, it does so with very limited success, achieving a maximum recorded accuracy of 1.2% over 500 episodes. Future efforts will focus on refining model parameters and exploring alternative algorithms to enhance the agent's learning capabilities.

Keywords—Cellular Automata, Conway's Game of Life, Reinforcement Learning, Deep Q-Learning, Pattern Stabilization, Neural Networks, Artificial Intelligence in Games.

I. INTRODUCTION (AND RESEARCH QUESTIONS)

Conway's Game of Life is a zero-player game developed by British mathematician John Conway in 1970. It is a profound example of cellular automata that continues to fascinate researchers across fields such as computer science, physics, and artificial intelligence[1][2][3]. The game operates on a set of simple rules applied to a grid of cells. Each cell in the grid can be in one of two states: alive or dead. Once applied, the rules simulate the birth, survival, and death of cells, and complex patterns and behaviours emerge from straightforward initial conditions[2][3].

A. OVERVIEW OF CONWAY'S GAME OF LIFE

• Cellular Automata and Grid: The Game of Life is played on an infinite two-dimensional grid of square cells. Each cell interacts with its eight neighbours (horizontal, vertical, and diagonal) using a set of three rules based on the number of live neighbours. Figure 1 shows the neighbourhood radius of a single cell. The central black cell represents the focal cell; the eight surrounding white cells are its neighbours, i.e. the cells considered when determining the focal cell's next state.

• Rules: The rules of the game are listed below:
- A live cell with fewer than two or more than three live neighbours dies in the next generation[2]. These rules are a toy imitation of underpopulation and overpopulation, respectively.
- A live cell with two or three live neighbours survives to the next generation[2].
- A dead cell with exactly three live neighbours becomes a live cell[2]. This rule simulates reproduction.

• Emergent Patterns: The simple rules outlined above result in emergent behaviour and high complexity. This allows various phenomena to occur, such as stable 'blocks', oscillating 'blinkers', and moving 'gliders', which can simulate logical operations and data transport within the grid[1][3]. In fact, the resulting game is Turing complete[4].

Fig 1: Illustration of the Neighbourhood Radius in Conway's Game of Life.
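To make these rules concrete, the following is a minimal sketch of how a single generation could be computed on a finite grid of the kind used later in this project. The use of NumPy and the function name next_generation are illustrative assumptions rather than the project's actual code:

```python
import numpy as np

def next_generation(grid: np.ndarray) -> np.ndarray:
    """Apply one step of Conway's rules on a finite grid (cells outside the grid count as dead)."""
    h, w = grid.shape
    new_grid = np.zeros_like(grid)
    for y in range(h):
        for x in range(w):
            # Count live neighbours in the eight surrounding cells.
            neighbours = grid[max(0, y - 1):y + 2, max(0, x - 1):x + 2].sum() - grid[y, x]
            if grid[y, x] == 1:
                # Survival: two or three live neighbours; otherwise under- or overpopulation.
                new_grid[y, x] = 1 if neighbours in (2, 3) else 0
            else:
                # Reproduction: exactly three live neighbours revives a dead cell.
                new_grid[y, x] = 1 if neighbours == 3 else 0
    return new_grid
```

Under these rules a 2 × 2 block is unchanged from one generation to the next, while a three-cell blinker alternates between its horizontal and vertical orientations.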
B. INITIAL PROJECT IDEA

The initial concept of the project was to perform a comparative analysis of two types of agents within the Game of Life environment: a rule-based agent and a reinforcement learning agent. Both agents would be tasked with the goal of stabilising a desired pattern in the game, specifically transforming a 'block' pattern into a 'blinker'. A block pattern, as seen in figure 2, is called a still life in Conway's Game of Life because applying the game's rules leaves it unchanged in every generation. The rules of the game, on the other hand, cause the blinker pattern to oscillate between horizontal and vertical forms. The transition from block to blinker was to be accomplished by allowing each agent to change a single cell per generation before the game's rules processed the resulting grid. Unlike traditional implementations of Conway's Game of Life, which often use an infinite grid, our environment would use a finite grid with dimensions tailored to this research.

Fig 2: Illustration of block pattern (left) and blinker pattern (right).

C. FINAL PROJECT IDEA

As the project developed, the core idea evolved to focus on enhancing and analysing the reinforcement learning agent within the environment. This adjustment was influenced by insights gained during the initial development phase, which revealed that the outcomes of the rule-based agent could be predetermined once effective rules were established. Figure 3 below shows one potential solution for transforming a block pattern into a blinker pattern across four generations.

Fig 3: One approach a rule-based agent might take to transform a block to a blinker pattern.

In figure 3, the grid state represents the state of the grid at the start of each generation, and the action represents the agent's decision for that generation. Green cells signify a cell being revived, red cells signify a cell being killed, and 'nothing' means the agent neither kills nor revives a cell in that generation. Finally, the new state represents the resulting grid after the game's rules have been applied to the pattern left by the agent's action. This solution clearly demonstrates that a rule-based agent can achieve the desired blinker pattern in as few as four generations. Thus it seemed more meaningful to explore how the reinforcement learning agent adapted to the environment and pursued its goal of pattern stabilisation. The revised project objective, then, was to evaluate the agent's performance in achieving the blinker pattern, comparing its effectiveness against the theoretical optimum that a rule-based agent could achieve under similar conditions. In other words, could a reinforcement learning agent consistently achieve the desired pattern in four generations or less?

II. LITERATURE REVIEW

Conway's Game of Life, which is more simulation than game, has been fertile ground for exploring the capabilities of computational models, particularly neural networks, because its simple rules lead to complex emergent behaviours. A significant body of research has used neural networks to explore pattern generation and evolution in the Game of Life. For instance, an evolutionary approach combined with neural networks has been employed to develop populations of networks that evolve towards generating complex, symmetrical patterns, using a complexity function as a measure of fitness[1]. Unlike the broader approach taken in the reviewed literature, which primarily focuses on evolving networks to discover or generate patterns, our project targets the precise stabilisation of a specific pattern: transforming a 'block' into a 'blinker' in Conway's Game of Life using a reinforcement learning agent. While we employ neural networks as part of our methodology, our objective diverges from the general exploratory nature of previous studies. We do not use evolutionary algorithms to generate variability and novelty but instead optimise our neural network to consistently reach a predefined, desired state.

Despite their predictive power, however, research also shows that neural networks face challenges in learning deterministic rules in the Game of Life. Dickson (2020) documents an experiment in which researchers first crafted a small convolutional neural network and manually tuned it to predict the sequence of cell state changes within the game's grid. By achieving this, the researchers showed that a minimal neural network can indeed represent the rules of the Game of Life effectively[2][5]. However, when these networks were re-trained from scratch with randomly initialised parameters, recovering the optimal settings proved elusive. The researchers trained the network on 1 million randomly generated game states, with the goal that it would learn and parameterise the game's rules independently. In theory, perfect learning would mean the network's parameters converging to the manually tuned values. In practice, this rarely occurred: more often, the network failed to find the optimal parameter settings, and its performance degraded as the complexity of the task increased with additional steps[2][5]. Consequently, the initial weights and the selection of training examples substantially impact the learning outcomes when training a neural network for deterministic systems like the Game of Life. The "Lottery Ticket Hypothesis" proposes that successful training of neural networks often depends on fortuitous initial weight configurations, which might explain the seemingly random success of networks learning the Game of Life's rules[2][5].

The challenge neural networks face in learning deterministic rules in Conway's Game of Life directly impacts our goal of achieving a desired pattern within a certain number of generations. Research indicates that the success of such networks heavily depends on the initial parameter settings; optimal configurations are like lottery tickets, desirable but rare. Hence, achieving pattern stabilisation perfectly may require meticulous tuning of network parameters and possibly a more complex network architecture so that the model masters the precise sequence of actions needed for the desired pattern transformation.

III. METHODOLOGY

This section details the framework used in our project to explore the stabilisation of the desired blinker pattern.

A. ENVIRONMENT SET UP

The environment for our project is a custom-built class called Environment, designed to simulate Conway's Game of Life. It was developed from scratch, albeit built upon a tutorial which is credited in the environment script; significant modifications were made to the basic tutorial code to create an environment better suited to our project. Here are the key features and functions of the Environment:
- Initialise Grid: The environment initialises a finite grid of dimensions specified via customizable width and height variables. This grid holds and is used to display the various patterns resulting from agent actions and the effects of the game's rules. The grid is initialised with a block pattern at its centre.
- Initialise Target Pattern: The environment stores the desired pattern in a separate grid and contains functions to initialise it. In the case of a blinker, which can be either horizontal or vertical, there are two target patterns; either is valid.
- Enable Agent: The environment contains a method that applies the agent's desired change. It takes the agent's decision to kill a cell, revive a cell or do nothing and performs it, allowing the grid to be influenced by the agent.
- Reward Agent: The environment contains a reward system to guide the reinforcement learning process. It assesses the agent's performance based on the accuracy of achieving the desired patterns, the efficiency of its actions, and proximity to the goal within a set number of steps. It not only rewards the correct formation of patterns but also penalises unnecessary or counterproductive actions, such as killing a cell that is already dead.
- Update Grid: The environment is also responsible for applying the rules of the game to calculate the next state of each cell based on its neighbours. This feature also aids the visualisation of the environment.

B. DEEP Q-LEARNING MODEL

Our deep learning model was a class, DeepQNetwork, inheriting from PyTorch's neural network module. We selected Deep Q-Learning as the reinforcement learning method for this project because we needed a learning approach that enables decision-making in a complex and dynamic environment. Given the goal of stabilising a specific pattern in Conway's Game of Life, the agent needs to accurately predict the outcomes of a wide range of possible actions from any given state. Deep Q-Learning addresses this requirement by enabling the agent to estimate the potential reward for each possible action, thus facilitating informed decision-making that aims to maximise long-term rewards. We discuss our model architecture below:
- Input Layer: The input to the model is the grid from Conway's Game of Life converted into a flattened list where each element represents the state of a cell: 1 for alive and 0 for dead. The size of the input layer, therefore, is the grid width multiplied by the grid height.
- Output Layer: The output layer has size (2 × input_size) + 1. This structure allows the network to choose between two types of actions for each cell and one neutral action, 'do nothing', that applies to the grid as a whole. The first segment of the output layer, indices [0, input_size − 1], corresponds to actions where the model suggests killing a particular cell. The second segment, indices [input_size, (2 × input_size) − 1], corresponds to actions where the model suggests reviving a particular cell. The last neuron represents the 'do nothing' action. The output layer uses no activation function, so the model's decision is derived from the neuron with the highest value.
- Hidden Layers: The model has two hidden layers, each with 256 neurons and a ReLU activation function.
- Training: The model has a function to train the agent using the Bellman equation to iteratively update its network weights. First, it predicts Q-values for all possible actions from the current state using its forward propagation function. The agent then selects an action based on the highest predicted Q-value.
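As a concrete illustration of the architecture and action encoding just described, here is a minimal PyTorch sketch. The class name DeepQNetwork, the two 256-unit hidden layers, and the output size of 2 × input_size + 1 follow the description above; the exact code and the decode_action helper are our own illustrative reconstruction, not the project's implementation:

```python
import torch
import torch.nn as nn

class DeepQNetwork(nn.Module):
    """Two hidden layers of 256 units with ReLU; linear output of size 2 * input_size + 1."""
    def __init__(self, input_size: int, hidden_size: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        # One "kill" output per cell, one "revive" output per cell, plus "do nothing".
        self.out = nn.Linear(hidden_size, 2 * input_size + 1)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.out(x)  # no activation: raw Q-value estimates

def decode_action(action_index: int, input_size: int):
    """Map a flat output index back to a (verb, cell) pair, following the segments described in the text."""
    if action_index < input_size:
        return "kill", action_index
    if action_index < 2 * input_size:
        return "revive", action_index - input_size
    return "nothing", None
```

For a 4 × 4 grid, input_size is 16, so the network outputs 33 Q-values and the greedy action is simply the argmax over that vector.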
Next, target Q-values are calculated using the Bellman equation:

Q_target(s, a) = r + γ · max_a' Q(s', a')

where:
- s is the current state,
- a is the action taken,
- r is the reward received,
- s' is the new state after taking action a,
- a' ranges over all possible actions from the new state s', and
- γ is the discount factor that weighs the importance of future rewards.

- Loss Function and Optimization: The model uses a mean squared error loss function and an Adam optimiser. These were chosen for their ubiquity rather than because they are necessarily the best fit for our model. To update the network weights, we minimise the loss between the predicted and target Q-values:

Loss = MSE(Q_predicted(s, a), Q_target(s, a))

C. REINFORCEMENT LEARNING AGENT

The reinforcement learning agent in our project was designed to interact with and learn from the environment of Conway's Game of Life, with the ultimate goal of stabilising specific patterns by manipulating grid cells. It was a custom class, RL_Agent, built from scratch and heavily extending another tutorial that is credited in the source code. The agent used the Deep Q-Network to make decisions and select actions. Below are the key features of the agent:
- Exploration vs. Exploitation: The agent employs an epsilon-greedy strategy to balance exploration with exploitation.
- Long-Term Memory: The agent maintains a memory of previous state transitions, up to a specified maximum capacity, enabling it to learn from past experiences. This memory stores tuples of states, actions, rewards, and subsequent states.
- Decision Making: The agent makes decisions based on the state of the grid. By calling the forward propagation function of its Deep Q-Network, the agent obtains a vector of Q-values and uses the index of the maximum Q-value to determine whether to kill a cell, revive a cell or do nothing.
- Training: After each action, the agent immediately learns from the outcome to adjust its strategy. Each time the goal is achieved, however, the agent revisits all past experiences in batches, up to a certain limit, to further refine its strategy.

D. SYSTEM INTERACTION

Figure 4 shows a high-level overview of the interactions within the reinforcement learning system. It shows the roles and sequential operations of the agent, the environment, and the Deep Q-Learning model. The agent decides on actions based on the current state of the grid, the model predicts the outcomes of actions, and the environment updates the grid and calculates rewards. These interactions facilitate the training process so that the model continually adjusts its parameters based on feedback from the environment to improve the agent's decision making.

Fig 4: Interaction Flowchart of the Reinforcement Learning System.

IV. EXPERIMENT DESIGN

The agent operates within a structured environment, and its aim is to achieve the desired pattern within four generations, based on the theoretical possibility identified earlier. To rigorously test the agent's learning and performance, we set up a systematic experiment in which we vary the number of episodes and measure the agent's accuracy in achieving the target pattern.
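The sketch below shows how such an experiment loop might be driven: each episode runs for at most four generations, and accuracy is the percentage of episodes in which the target pattern is reached, as defined in the configuration that follows. The method names on Environment and RL_Agent (reset, get_action, step, train_short_memory, train_long_memory) are assumed for illustration and may differ from the project's actual interfaces:

```python
def run_experiment(env, agent, num_episodes: int, max_generations: int = 4) -> float:
    """Train the agent for num_episodes and return accuracy: the percentage of
    episodes in which the target blinker pattern was reached within four generations."""
    successes = 0
    for _ in range(num_episodes):
        state = env.reset()  # 4 x 4 grid with a block pattern at its centre
        for _ in range(max_generations):
            action = agent.get_action(state)             # epsilon-greedy over the 33 Q-values
            next_state, reward, done = env.step(action)  # apply the action, then the game's rules
            agent.train_short_memory(state, action, reward, next_state, done)
            state = next_state
            if done:  # target pattern reached
                successes += 1
                agent.train_long_memory()  # replay past experiences in batches
                break
    return 100.0 * successes / num_episodes
```

In the experiments reported below, a loop of this kind is repeated for each episode size in the tested range.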
A. EXPERIMENT CONFIGURATION

Below, we outline some key parts of the experiment configuration:
- Environment Setup: The experiments are conducted on a simplified 4 x 4 grid, a significant reduction from the initial 40 x 40 grid. This reduction was necessary because of the poor initial performance on the larger grid, where the sheer complexity of the action space, we believe, overwhelmed the agent's ability to learn effective strategies. The smaller grid size also increases the chance of random successful actions, which can then be reinforced through learning.
- Experiment Structure: An experiment is made up of a series of episodes, each consisting of up to four generations. Episodes do not exceed four generations because we know the desired pattern can be achieved within four generations. This 'game over' condition means the agent is constrained not just to achieving the pattern but to doing so efficiently, within the acceptable number of generations.
- Variable Episode Sizes: The experiment loops over a range of episode sizes to examine how the number of episodes affects the agent's accuracy, or learning curve. The sizes tested were [10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000, 50000, 100000] episodes. Larger episode sizes could not be tested due to the intense computation times involved; however, this range felt extensive enough to identify the optimal number of episodes for efficient learning.
- Accuracy Measurement: Accuracy is measured as the percentage of episodes in which the agent successfully stabilises the desired pattern, out of the total number of episodes allocated to that experiment. For instance, the agent has an accuracy of 10% if it achieved the desired pattern once in ten episodes.

V. RESULTS

Fig 5: Agent Training Accuracy by Episode Size.

Figure 5 is a plot of the accuracy score achieved by the agent over the varying episode window sizes. From the plot we can determine two things. First, our model is not very accurate. This is not the exciting answer we set out to find; the truth is that the current model does not consistently achieve the desired pattern in four generations or less. What to do about this is covered in the discussion and conclusion sections. The second observation concerns a seemingly 'optimal' episode size.

A. OPTIMAL EPISODE SIZE

Among the tested episode sizes, 500 episodes consistently yielded the highest accuracy of 1.2% in two consecutive experiments.

B. EPISODE SIZE AND MODEL PERFORMANCE

From figure 6 we can see that while 50 and 100 episodes seem to give good results, those values are outliers, given that the majority of the episode scores in those windows are zero. Larger episode sizes, however, generally tend to be more consistent and yield consistently 'higher' scores, although this improvement is not linear, as seen in figure 5: accuracy tended to plateau or even decrease slightly beyond a certain point.

C. HYPERPARAMETER TUNING EXPERIMENTS

Focusing on an episode size of 500, we ran further experiments, tweaking aspects of the model configuration in the hope of improving its accuracy. Adjustments to the neural network configuration, such as reducing the hidden layer size from 256 to 128, did not improve performance; rather, they resulted in a decrease in accuracy comparable to what was achieved at 250 episodes with the larger hidden layers. Increasing the hidden layer size to 512 did not yield any substantial improvement in accuracy. In addition, altering the learning rate had mixed effects: a significant increase in the learning rate to 0.1 mirrored the effect of reducing the hidden layer size, while finer adjustments (to 0.01 and 0.001) did not produce a noticeable change in outcomes.
VI. DISCUSSION

The results suggest that while increasing the number of episodes generally provides the agent with more learning opportunities, which can lead to higher accuracies, there appears to be an optimal point (in this case, 500 episodes) beyond which additional episodes do not necessarily equate to better performance. This plateau might hint at the agent having reached the limits of what it can learn from the environment with its current configuration.

The results also reaffirmed the sensitivity of models to their architectural and learning rate configurations. Reducing the complexity of the model (by decreasing the hidden layer size) or dramatically increasing the learning rate resulted in reduced performance. This shows that there is a delicate balance in the network design and learning parameters that optimises learning outcomes. Interestingly, larger hidden layers did not improve performance either, which might mean that 256 units per layer already provided sufficient model capacity for the complexity of the task.

Finally, the consistent lack of improvement across the various adjustments to learning rates and model sizes likely means there remain inefficiencies in the learning algorithm to be corrected, such as further tuning of the exploration-exploitation balance or the reward structure.

Fig 6: Box Plot of Model Accuracy per Episode Size over Five Experiment Iterations.

VII. CONCLUSION AND FUTURE WORK

This study on the application of reinforcement learning to Conway's Game of Life demonstrates the potential and limitations of neural networks in achieving specific pattern stabilisations within a finite grid environment. Despite adapting a Deep Q-Learning approach to transform a 'block' pattern into a 'blinker', the results indicate that achieving high accuracy within a desired number of generations poses several challenges. Our experiments reveal that the reinforcement learning agent, when trained over 500 episodes, reached an optimal performance threshold; beyond this, increases in episode count did not significantly enhance the agent's accuracy, indicating a saturation in learning potential under the current model configuration. Furthermore, modifications to the network architecture and learning rate showed that more precise fine-tuning is required to optimise learning, with neither increased complexity (via larger hidden layers) nor higher learning rates immediately leading to improved outcomes.

For future work, there is a clear need for more fine-tuning of the model, including adjustments to the reward structure and the exploration-exploitation balance. In addition, genetic algorithms could be integrated, which might help to overcome the learning plateaus observed in this study.

VIII. REFERENCES

[1] Watch AI evolve patterns in Conway's Game of Life (no date). Available at: https://www.toolify.ai/ai-news/watch-ai-evolve-patterns-in-conways-game-of-life-27843 (Accessed: 01 May 2024).
[2] Dickson, B. (2020) 'Why neural networks struggle with the Game of Life', TechTalks. Available at: https://bdtechtalks.com/2020/09/16/deep-learning-game-of-life/ (Accessed: 01 May 2024).
[3] Vayadande, K. et al. (2022) 'Simulation of Conway's Game of Life using cellular automata', International Research Journal of Engineering and Technology (IRJET), 9(1).
[4] Rendell, P. (2015) 'Game of Life Universal Turing Machine', Turing Machine Universality of the Game of Life, pp. 71–89. doi:10.1007/978-3-319-19842-2_5.
[5] Springer, J.M. and Kenyon, G.T.
(2021) ‘It’s hard for neural networks to learn the game of life’, 2021 International Joint Conference on Neural Networks (IJCNN) [Preprint]. doi:10.1109/ijcnn52387.2021.9534060.