Temporal Difference Learning in Score-Four

Matthew Hlavacek
Student, Northwestern University
2420 Campus Drive, Evanston, IL 60201
(847) 650-3876
Matthewhlavacek2014@u.northwestern.edu

Benedict Lim
Student, Northwestern University
615 Garrett Place W34, Evanston, IL 62021
(224) 730-2026
Benedict@u.northwestern.edu

John Ngo
Student, Northwestern University
912 Noyes Street Apt. D, Evanston, IL 60201
(712) 898-8148
Johnngo2014@u.northwestern.edu

ABSTRACT
We intend to build a computer player for the board game score-four. The player will use the temporal difference model of reinforcement learning, with artificial neural networks approximating the value function.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning – knowledge acquisition, connectionism and neural nets.

General Terms
Algorithms, Design, Experimentation, Human Factors

Keywords
Temporal Difference, Reinforcement Learning, Abstract Strategy

1. INTRODUCTION
Score-four is a popular abstract strategy board game, essentially a three-dimensional version of connect-four played on a 4x4x4 grid. It is particularly interesting as a problem for temporal difference learning because the state space is extremely large, with 3^64 as a crude upper bound. Neither modeling the underlying Markov decision process nor brute-force enumeration of policies for every state is feasible at this scale. A delayed reinforcement learning model with an approximation of the value function was therefore judged the most practical way to teach a computer to play score-four.

2. INTEREST AND UTILITY
While temporal difference learning has been applied to board games such as chess, backgammon, and tic-tac-toe, it has not yet been examined for score-four.[1] Furthermore, while games like connect-four and checkers are "solved" games with published best strategies, score-four has yet to have a proven winning strategy published. We are interested in the artificial neural network approach's ability to respond to variations in game rules and feature representations. Particularly interesting is the ability of artificial neural networks to learn a value function approximation without an expert-defined feature set, as demonstrated by the first version of TD-Gammon, a temporal difference-based backgammon player.[2] By simply inputting the board state, the networks can be trained under a variety of rules for wins and moves.

3. LEARNING

3.1 Task to learn
Our project will learn an effective strategy for playing one side of a score-four game. On each of its turns, the learner will take the state of the board as input and output a particular move, attempting eventually to reach a winning state.

3.2 What will be optimized
Our project is primarily concerned with optimizing the estimation of the value function over the possible moves. The challenge lies in determining which of the legal next moves gives the greatest likelihood of reaching a winning state.[3] A sketch of this greedy after-state selection is given below.
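The following C# sketch illustrates the move-selection step only; the board indexing, the method names, and the Func<int[], double> evaluator standing in for the trained MLP are our own illustrative assumptions, not final design decisions. It enumerates the legal drops on the 4x4x4 board, asks the value function for the worth of each resulting after-state, and returns the best-scoring move.

    using System;
    using System.Collections.Generic;

    // Illustrative sketch only: greedy after-state move selection for score-four.
    // The 4x4x4 board is flattened to a 64-cell array indexed [x + 4*y + 16*z],
    // where z is the vertical level; 0 = empty, +1 = the learner, -1 = opponent.
    static class MoveSelector
    {
        const int N = 4;

        // A piece dropped on column (x, y) occupies the lowest empty level z.
        public static IEnumerable<int> LegalMoves(int[] board)
        {
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    for (int z = 0; z < N; z++)
                    {
                        int cell = x + N * y + N * N * z;
                        if (board[cell] == 0) { yield return cell; break; }
                    }
        }

        // Evaluate every legal after-state with the value function (the trained
        // MLP in the real system) and return the cell index of the best move.
        public static int GreedyMove(int[] board, Func<int[], double> evaluate)
        {
            int bestCell = -1;
            double bestValue = double.NegativeInfinity;
            foreach (int cell in LegalMoves(board))
            {
                var after = (int[])board.Clone();
                after[cell] = +1;                 // tentatively place our piece
                double value = evaluate(after);   // estimated chance of winning
                if (value > bestValue) { bestValue = value; bestCell = cell; }
            }
            return bestCell;
        }

        public static void Main()
        {
            var empty = new int[N * N * N];
            // Dummy constant evaluator; the real evaluator is the MLP's forward pass.
            int move = GreedyMove(empty, s => 0.5);
            Console.WriteLine("Chosen cell index: " + move);
        }
    }

During training, this greedy choice would be wrapped in the ε-greedy policy described in Section 3.3 below, substituting a random legal move with probability ε.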
3.3 Training
Our learner will be trained through self-play against a snapshot of a previous model. Though play against an expert would be ideal, it would be slow and time-consuming; self-play has been shown in several temporal difference game-learning applications to be an effective means of training a model.[4] Furthermore, we will use an ε-greedy policy during training to keep the learner from getting stuck in a local maximum and to ensure that it explores the state space. We may also experiment with various values of λ in the TD(λ) algorithm to analyze the impact on the learner's performance.[5] The learner will play on alternating sides, and once a sufficiently good win/loss ratio is achieved against the snapshot, the learner will be copied onto a new snapshot against which self-play continues. This cycle will repeat until a sufficient number of games have been completed (between 1,000 and 10,000, time permitting) and the current snapshot reaches a performance level capable of playing competently against human players. Should this fail to produce a satisfactory player, the neural network's input will be expanded beyond the raw board encoding with a number of game heuristics, such as a count of the four-cell board lines containing three of the player's pieces and one empty space.

3.4 Evaluating Performance
The performance of our model can be evaluated by its win/loss ratio against a specified opponent, with a random player serving as the baseline. It can also be measured against ideal performance by its win/loss ratio against a rule-based player that follows the generalized strategy from the connect-four strategy guide. Another indicator of performance is the ability to recognize guaranteed wins and losses: given a set of boards with a guaranteed one-move win, the learner should always rank the winning move as best, and given a set of boards where all but one move leads to a loss, it should always rank the single safe move as best.

4. BUILDING
Our project will be written in C# and will have two main components. The first is the training module, which can be configured and then left to learn the set of weights for the MLP neural network used by our player. The second is the human-computer game itself, in which one can play against the trained computer player. The graphical user interface has already been developed by a third party, so the learner's environment is ready for play. We will be drawing on the TD-Gammon source and pseudo-code provided by IBM researcher Gerald Tesauro and by the computer science department at Rice University.[6,7] Other code analyzed during project development includes the C++ class EvalNet, which has been used successfully in learning to play checkers and the machine learning classic "Gridworld".[8] A simplified sketch of the TD(λ) weight update we intend to implement is given below.
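The following C# sketch follows the general shape of the nonlinear TD/backprop pseudo-code [6]: a one-hidden-layer sigmoid MLP estimates the value of a board state, and per-weight eligibility traces are decayed by λ and incremented with the gradient of the current estimate, so that each temporal-difference error adjusts the weights along recently visited states. The layer sizes, method names, and parameter handling are our own illustrative assumptions, not the finalized implementation.

    using System;

    // Illustrative sketch of a TD(lambda) update for a one-hidden-layer MLP.
    class TdNet
    {
        const int Inputs = 64, Hidden = 40;   // 64 raw cells; hidden size is a guess
        readonly double[,] w1 = new double[Inputs, Hidden];
        readonly double[] b1 = new double[Hidden];
        readonly double[] w2 = new double[Hidden];
        double b2;

        // Eligibility traces, one per weight.
        readonly double[,] e1 = new double[Inputs, Hidden];
        readonly double[] eb1 = new double[Hidden];
        readonly double[] e2 = new double[Hidden];
        double eb2;

        readonly double[] hidden = new double[Hidden];
        double output;

        static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

        public TdNet(int seed = 0)
        {
            var rng = new Random(seed);
            for (int i = 0; i < Inputs; i++)
                for (int j = 0; j < Hidden; j++)
                    w1[i, j] = rng.NextDouble() * 0.2 - 0.1;
            for (int j = 0; j < Hidden; j++)
                w2[j] = rng.NextDouble() * 0.2 - 0.1;
        }

        // Forward pass: estimated probability of winning from this state.
        public double Evaluate(double[] x)
        {
            for (int j = 0; j < Hidden; j++)
            {
                double z = b1[j];
                for (int i = 0; i < Inputs; i++) z += w1[i, j] * x[i];
                hidden[j] = Sigmoid(z);
            }
            double zo = b2;
            for (int j = 0; j < Hidden; j++) zo += w2[j] * hidden[j];
            return output = Sigmoid(zo);
        }

        // Decay the traces by lambda and add the gradient of the current output
        // with respect to each weight (call right after Evaluate on the state).
        public void UpdateTraces(double[] x, double lambda)
        {
            double dOut = output * (1 - output);            // sigmoid derivative at output
            for (int j = 0; j < Hidden; j++)
            {
                e2[j] = lambda * e2[j] + dOut * hidden[j];
                double dHid = dOut * w2[j] * hidden[j] * (1 - hidden[j]);
                eb1[j] = lambda * eb1[j] + dHid;
                for (int i = 0; i < Inputs; i++)
                    e1[i, j] = lambda * e1[i, j] + dHid * x[i];
            }
            eb2 = lambda * eb2 + dOut;
        }

        // Apply the TD error (next estimate or reward minus current estimate).
        public void ApplyTdError(double tdError, double alpha)
        {
            for (int j = 0; j < Hidden; j++)
            {
                w2[j] += alpha * tdError * e2[j];
                b1[j] += alpha * tdError * eb1[j];
                for (int i = 0; i < Inputs; i++)
                    w1[i, j] += alpha * tdError * e1[i, j];
            }
            b2 += alpha * tdError * eb2;
        }

        static void Main()
        {
            var net = new TdNet();
            var x = new double[Inputs];                 // empty board
            double v0 = net.Evaluate(x);
            net.UpdateTraces(x, lambda: 0.7);
            net.ApplyTdError(1.0 - v0, alpha: 0.1);     // pretend the game was won
            Console.WriteLine("Value before/after: " + v0 + " / " + net.Evaluate(x));
        }
    }

In the self-play loop, the training module would call Evaluate on the chosen after-state, call UpdateTraces, and, once the next estimate (or the terminal reward, e.g. 1 for a win and 0 for a loss) is known, pass the difference to ApplyTdError; traces would be reset to zero at the start of each game.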
5. ACKNOWLEDGMENTS
Our thanks to ACM SIGCHI for allowing us to modify templates they had developed.

6. MILESTONES
The table below outlines our milestones and expected completion dates.

Milestone                                                   Group Member(s)   Date
Finalization of methodology and proposal                    Matthew           Nov 14
Submission of revised proposal                              John              Nov 16
Basic score-four learner programmed                         Matthew*          Nov 22
First round of training and testing of learner completed    Benedict, John    Nov 26
Submission of status report                                 All 3             Nov 30
Refined learner model completed                             All 3             Dec 5
Poster printed                                              Benedict          Dec 7
Website uploaded                                            John              Dec 10

* All three group members will contribute to the coding of the learner, with Matthew taking the lead.

7. REFERENCES
[1] Baxter, J., A. Tridgell, and L. Weaver. 1998. KnightCap: A chess program that learns by combining TD(λ) with game-tree search. Proceedings of the 15th International Conference on Machine Learning. http://citeseerx.ist.psu.edu/viewdoc/similar;jsessionid=FA7F9F6770E3AA1688D0B3F6B65231C4?doi=10.1.1.140.2003&type=ab.
[2] Tesauro, G. 1995. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38, 3, 58-68 (March 1995). DOI= http://dx.doi.org/10.1145/203330.203343.
[3] Schraudolph, N., P. Dayan, and T. Sejnowski. 1994. Temporal Difference Learning of Position Evaluation in the Game of Go. Advances in Neural Information Processing Systems, 6. http://www.variational-bayes.org/~dayan/papers/sds94.pdf.
[4] Wiering, M., J.P. Patist, and H. Mannen. 2007. Learning to Play Board Games using Temporal Difference Methods. Technical Report. Utrecht University, Institute of Information and Computer Sciences. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.80.6366.
[5] Tesauro, G. 1992. Practical Issues in Temporal Difference Learning. Machine Learning, 8, 257-277 (1992). Kluwer Academic Publishers. http://incompleteideas.net/sutton/tesauro-92.pdf.
[6] Bonde Jr., A., and R. Sutton. 1993. Nonlinear TD/Backprop pseudo C-code. GTE Laboratories Incorporated (April 1992). http://webdocs.cs.ualberta.ca/~sutton/td-backprop-pseudo-code.text.
[7] TD-Gammon Revisited. Programming Assignment 4. Comp440 (Fall 2007). Rice University. http://www.clear.rice.edu/comp440/handouts/pa4.pdf.
[8] Carter, D. 2007. GridWorld Case Study. CollegeBoard Professional Development Workshop Materials, 2007-2008. http://apcentral.collegeboard.com/apc/public/repository/apsf-computer-science-gridworld.pdf.
[9] Pollack, J., and A. Blair. 1998. Co-Evolution in the Successful Learning of Backgammon Strategy. Machine Learning, 32, 3, 225-240. http://www.springerlink.com/content/h1329772135j102l/?MUD=MP.