
Temporal Difference Learning in Score-Four
Matthew Hlavacek
Student, Northwestern University
2420 Campus Drive
Evanston, IL 60201
(847) 650-3876
Matthewhlavacek2014@u.northwestern.edu

Benedict Lim
Student, Northwestern University
615 Garrett Place W34
Evanston, IL 60201
(224) 730-2026
Benedict@u.northwestern.edu

John Ngo
Student, Northwestern University
912 Noyes Street Apt. D
Evanston, IL 60201
(712) 898-8148
Johnngo2014@u.northwestern.edu
ABSTRACT
We intend to build a computer player for the board game score-four. The player will use the temporal difference model of reinforcement learning, with artificial neural networks approximating the value function.
Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning – knowledge acquisition,
connectionism and neural nets.
General Terms
Algorithms, Design, Experimentation, Human Factors
Keywords
Temporal Difference, Reinforcement Learning, Abstract Strategy
1. INTRODUCTION
Score-four is a popular abstract strategy board game that is essentially a three-dimensional version of connect-four, played on a 4x4x4 grid. It is particularly interesting as a problem for temporal difference learning because the state space is extremely large: each of the 64 cells is empty or occupied by one of the two players, giving 3^64 as a crude upper bound. At that scale, neither an explicit model of the underlying Markov decision process nor a brute-force tabulation of a policy over all states is feasible. A delayed reinforcement learning model with an approximation of the value function is therefore the most practical approach to teaching a computer to play score-four.
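A minimal board representation makes the scale argument concrete: 64 cells, each empty or owned by one of two players, with at most 16 rod choices per turn. The C# sketch below is a hypothetical illustration of this raw encoding, not our final design.

using System;
using System.Collections.Generic;

// Hypothetical minimal score-four board: 4x4x4 cells, each empty (0) or
// owned by player 1 or 2 -- the source of the crude 3^64 state-space bound.
public class Board
{
    private readonly int[,,] cells = new int[4, 4, 4]; // [x, y, z]; z is height

    // A move is a choice of one of the 16 vertical rods (x, y);
    // the piece falls to the lowest empty cell, as in connect-four.
    public IEnumerable<(int x, int y)> LegalMoves()
    {
        for (int x = 0; x < 4; x++)
            for (int y = 0; y < 4; y++)
                if (cells[x, y, 3] == 0) // rod not yet full
                    yield return (x, y);
    }

    public void Drop(int x, int y, int player)
    {
        for (int z = 0; z < 4; z++)
            if (cells[x, y, z] == 0) { cells[x, y, z] = player; return; }
        throw new InvalidOperationException("Rod is full.");
    }
}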
2. INTEREST AND UTILITY
While there have been several applications of temporal difference learning to board games such as chess, backgammon, and tic-tac-toe, the use of the learning method for score-four has not been examined.[1] Furthermore, while games like connect-four and checkers are “solved” games with published best strategies, score-four has yet to have a proven winning strategy published. We are interested in the artificial neural net approach’s ability to respond to variations in game rules and feature representations. Particularly interesting is the ability of artificial neural nets to learn a value function approximation without an expert-defined feature set, as shown in the first version of TD-Gammon, a temporal difference-based backgammon player.[2] By simply
inputting the board state, the nets can be trained on a variety of
rules for wins and moves.
3. LEARNING
3.1 Task to learn
Our project will learn an effective strategy for playing one side of a score-four game. On each of its turns, the player will take the current state of the board as input and output a particular move, attempting eventually to reach a winning state.
3.2 What will be optimized
Our project is primarily concerned with optimizing the estimate of the value function over the possible moves. The challenge lies in determining which of the legal next moves gives the greatest likelihood of reaching a winning state.[3]
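Concretely, this means scoring every legal afterstate with the current value estimate and playing the move whose resulting state scores highest. The generic helper below is a minimal sketch of that selection step; the state and move types and the value-function delegate are placeholders rather than our final interfaces.

using System;
using System.Linq;

// Hypothetical greedy move selection: evaluate each legal afterstate with
// the learned value function v and play the best-scoring move.
public static class MoveSelection
{
    public static TMove Greedy<TState, TMove>(
        TState state,
        Func<TState, TMove[]> legalMoves,  // enumerate legal moves
        Func<TState, TMove, TState> apply, // afterstate reached by a move
        Func<TState, double> v)            // learned value estimate
    {
        return legalMoves(state)
            .OrderByDescending(m => v(apply(state, m)))
            .First();
    }
}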
3.3 Training
Our learner will be trained through self-play against a snapshot of a previous model. Though play against an expert would be ideal, that would be a slow, time-consuming process, and self-play has been shown in several temporal difference game-learning applications to be an effective means of training a model.[4] We will augment the learner with an ε-greedy policy during training to keep it from getting stuck in a local maximum and to ensure it explores the state space, and we may experiment with various values of λ in the TD(λ) algorithm to analyze their impact on the learner’s performance.[5] The learner will play on alternating sides, and once a satisfactory win/loss ratio is achieved against the snapshot, the learner will be copied into a new snapshot against which self-play continues. This process will repeat until a sufficient number of games are completed, between 1,000 and 10,000 time permitting, and the current snapshot reaches a performance level capable of playing competently against human players. Should this fail to produce a satisfactory player, the neural net’s input will be expanded beyond the raw board encoding with a number of game heuristics, such as a count of the 4-cell winning lines containing three of one player’s pieces and one empty cell.
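The two training ingredients named above, eligibility traces and ε-greedy exploration, each reduce to a few lines. The sketch below assumes a value function with a weight vector and per-weight gradients (a linearized view; an MLP updates every layer the same way), and all names are hypothetical.

using System;
using System.Linq;

// Hypothetical TD(lambda) weight update with eligibility traces:
//   e <- gamma * lambda * e + grad V(s);  w <- w + alpha * delta * e
public static class TdLambda
{
    public static void Update(
        double[] w, double[] trace, double[] gradient,
        double tdError, double alpha, double gamma, double lambda)
    {
        for (int i = 0; i < w.Length; i++)
        {
            trace[i] = gamma * lambda * trace[i] + gradient[i];
            w[i] += alpha * tdError * trace[i];
        }
    }

    // Epsilon-greedy move choice: explore a random move with probability
    // epsilon, otherwise exploit the greedy move under the score function.
    public static T EpsilonGreedy<T>(T[] moves, Func<T, double> score,
                                     double epsilon, Random rng)
    {
        return rng.NextDouble() < epsilon
            ? moves[rng.Next(moves.Length)]
            : moves.OrderByDescending(score).First();
    }
}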
3.4 Evaluating Performance
The performance of our model can be evaluated by its win/loss ratio against a specified opponent. A random player will serve as our baseline performance indicator. The model can also be measured against ideal performance via its win/loss ratio against a rule-based player that follows the generalized strategy stated in the connect-four strategy guide.
Another indicator of performance is the learner’s ability to recognize guaranteed wins and losses. Given a set of boards with a guaranteed one-move win, it should always rank that winning move highest; given a set of boards where all but one move results in a loss, it should always rank the one safe move highest.
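A small harness suffices for the baseline comparison; the sketch below is hypothetical, with the playGame delegate standing in for our eventual game loop against the chosen opponent.

using System;

// Hypothetical evaluation harness: win/loss ratio of the learner against a
// fixed opponent. playGame takes "learner moves first?" and returns +1 for
// a learner win, -1 for a loss, 0 for a draw.
public static class Evaluation
{
    public static double WinLossRatio(Func<bool, int> playGame, int games)
    {
        int wins = 0, losses = 0;
        for (int i = 0; i < games; i++)
        {
            int result = playGame(i % 2 == 0); // alternate who moves first
            if (result > 0) wins++;
            else if (result < 0) losses++;
        }
        return losses == 0 ? double.PositiveInfinity : (double)wins / losses;
    }
}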
4. BUILDING
Our project will be written in C# and will have two main components. The first is the training module, which can be configured and then left to run unattended, learning the set of weights for the multilayer perceptron (MLP) neural network used by our player. The second is the human-computer game itself, in which one can play against the trained computer player. The graphical user interface has already been developed by a third party, so the learner’s environment is ready for play. We will be using TD-Gammon source and pseudocode provided by IBM researcher Gerald Tesauro and the computer science department at Rice University.[6, 7] Other code analyzed during project development includes the C++ class EvalNet, used successfully in learning to play checkers, and the machine learning classic “Gridworld”.[8]
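The network shape we have in mind follows TD-Gammon: the raw board encoding feeds one hidden sigmoid layer whose output estimates the value of the position. The class below is a minimal, hypothetical sketch of the forward pass only; training (the TD(λ) update of Section 3.3, backpropagated through both layers) is omitted.

using System;

// Hypothetical MLP value network in the TD-Gammon style: one hidden
// sigmoid layer and a single sigmoid output read as the position's value.
public class ValueNet
{
    private readonly double[,] w1; // input -> hidden weights
    private readonly double[] w2;  // hidden -> output weights

    public ValueNet(int inputs, int hiddenUnits, Random rng)
    {
        w1 = new double[hiddenUnits, inputs];
        w2 = new double[hiddenUnits];
        for (int h = 0; h < hiddenUnits; h++)
        {
            w2[h] = rng.NextDouble() - 0.5;
            for (int i = 0; i < inputs; i++)
                w1[h, i] = rng.NextDouble() - 0.5;
        }
    }

    private static double Sigmoid(double x) => 1.0 / (1.0 + Math.Exp(-x));

    // Forward pass: raw board encoding in, estimated value out.
    public double Evaluate(double[] input)
    {
        double output = 0.0;
        for (int h = 0; h < w2.Length; h++)
        {
            double sum = 0.0;
            for (int i = 0; i < input.Length; i++)
                sum += w1[h, i] * input[i];
            output += w2[h] * Sigmoid(sum);
        }
        return Sigmoid(output);
    }
}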
5. ACKNOWLEDGMENTS
Our thanks to ACM SIGCHI for allowing us to modify templates
they had developed.
6. MILESTONES
The table below outlines our milestones and expected completion dates.
Milestone                                                  Group Member(s)   Date
Finalization of methodology and proposal                   Matthew           Nov 14
Submission of revised proposal                             John              Nov 16
Basic score-four learner programmed                        Matthew*          Nov 22
First round of training and testing of learner completed  Benedict, John    Nov 26
Submission of status report                                All 3             Nov 30
Refined learner model completed                            All 3             Dec 5
Poster printed                                             Benedict          Dec 7
Website uploaded                                           John              Dec 10

* All three group members will contribute to the coding of the learner, with Matthew taking the lead.
7. REFERENCES
[1] Baxter, J., A. Tridgell, and L. Weaver. 1998. Knightcap: A
chess program that learns by combining td with game-tree
search. Proceedings of the 15th International Conference on
Machine Learning.
http://citeseerx.ist.psu.edu/viewdoc/similar;jsessionid=FA7F
9F6770E3AA1688D0B3F6B65231C4?doi=10.1.1.140.2003
&type=ab.
[2] Tesauro, G. 1995. Temporal Difference Learning and TDGammon. Communications of the ACM, 38, 3, 58-68 (March.
1995). DOI= http://dx.doi.org/10.1145/203330.203343.
[3] Scharudoph, N., P. Dayan, T. Sejnewski. 1994. Temporal
Difference Learning of Position Evaluation in the Game of
Go. Advances in Neuural Informational Processing, 6.
http://www.variational-bayes.org/~dayan/papers/sds94.pdf.
[4] Wiering, M.., J.P. Patist, and H. Mannen. 2007. Learning to
Play Board Games using Temporal Difference Methods.
Technical Report. Utrecht University, Institute of
Information and Computer Sciences. DOI=
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.80.
6366.
[5] Tesauro, G. 1992. Practical Issues in Temporal Difference
Learning. Machine Learning, 8, 257-277 (1992). Klewer
Academic Publishers.
http://incompleteideas.net/sutton/tesauro-92.pdf.
[6] Bonde Jr., A., and R. Sutton. 1993. Nonlinear TD/Backprop
pseudo C-code. GTE Laboratories Incorporated. (April
1992). http://webdocs.cs.ualberta.ca/~sutton/td-backproppseudo-code.text.
[7] TD-Gammon Revisited. Programming Assignment 4.
Comp440 (Fall 2007). Rice University.
http://www.clear.rice.edu/comp440/handouts/pa4.pdf.
[8] Carter, D. 2007. GridWorld Case Study. CollegeBoard.
Professional Development Workshop Materials. 2007-2008.
http://apcentral.collegeboard.com/apc/public/repository/apsf-computer-science-gridworld.pdf.
[9] Pollack, J., and A. Blair. 1998. Co-Evolution in the
Successful Learning of Backgammon Strategy. Machine
Learning, 32, 3, 225-240.
http://www.springerlink.com/content/h1329772135j102l/?M
UD=MP.