Thomas LEMAIRE
ID: 110237824
E-mail: tlemaire59@yahoo.fr
Exchange Student Project
CS 526 - Probabilistic Reasoning applied to Draughts
Winter 2003

Contents
I. Introduction
II. Interests
III. Machine Learning in Games
IV. History of Machine Learning in Draughts
V. A View of the Probabilistic Algorithms applied to Draughts
   A. Temporal Difference Learning
   B. Artificial Neural Network
   C. Functions applied to units
   D. Reinforcement Learning and TD Learning
VI. NeuroDraughts
   A. Architecture
   B. Training with NeuroDraughts
   C. Training through self play
VII. Performances from the paper
   A. Human playing against the network
   B. Tournament
VIII. Further Topics
   A. Development of NeuroDraughts
   B. Application
IX. Conclusion
X. Research Papers ordered by relevance

I. Introduction

The temporal difference (TD) family of reinforcement learning procedures has been applied successfully to a number of domains, including game playing. Thanks to TD learning, a system can select actions according to how they contribute to some desired future outcome. TD learning is the basis of several algorithms that allow a program to learn by itself, and it belongs to the family of probabilistic algorithms.

This project presents an overview of a probabilistic algorithm applied to the game of Draughts: NeuroDraughts, which combines an artificial neural network with TD learning. Draughts is difficult enough for probabilistic algorithms to be applied with interesting results. Indeed, programming draughts software has always been hard, because the apparent simplicity of the game hides great complexity: a good player is able to change the pattern of a game as he wishes and thereby disturb the program. Probabilistic reasoning arises because we need to choose the best move without knowing how the opponent will play on his turn. We should therefore estimate the probabilities of the opponent's next moves and compute our best move knowing these probabilities. This is the basic reasoning behind the problem. The project illustrates this kind of algorithm through the work of Mark Lynch.
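As a toy illustration of this basic reasoning (written for this report, not taken from NeuroDraughts), the following Python sketch scores each of our candidate moves by the expected evaluation of the positions the opponent can reach in reply; the helpers legal_moves, apply_move, reply_probabilities and evaluate are hypothetical placeholders supplied by the caller.

# Toy sketch: pick our move by the expected value of the opponent's probable replies.
# All helper functions are hypothetical placeholders, not NeuroDraughts code.

def expected_value(board, reply_probabilities, evaluate):
    """Average the evaluation over the opponent's possible replies, weighted by probability."""
    return sum(p * evaluate(next_board)
               for next_board, p in reply_probabilities(board))

def choose_move(board, legal_moves, apply_move, reply_probabilities, evaluate):
    """Pick the move whose resulting position has the best expected value for us."""
    return max(legal_moves(board),
               key=lambda m: expected_value(apply_move(board, m),
                                            reply_probabilities, evaluate))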
II. Interests

I would like to explain why I chose this project on probabilistic algorithms applied to the game of draughts (or checkers). I am an experienced draughts player (I took part in four World Championships for young players) and I wanted to see how software can be built so that it wins against confirmed players. Nevertheless, for the moment no program is truly superior to the best players (even if, up to a certain level, a program seems to be better), and I wanted to review one of the best algorithms that tries to make a program stronger. Moreover, my own level allows me to analyse the performance of these algorithms.

III. Machine Learning in Games

Strategy games have always been a privileged domain for research in artificial intelligence. Since the first algorithms, at the end of the 1940s, a lot of progress has been made. In Othello, a highly calculational game, programs can beat any human player. In chess, research has not ended even though IBM's Deep Blue beat the world champion, Garry Kasparov, in 1997. Indeed, in 2002 the world champion, Vladimir Kramnik, played against the software Deep Fritz, which could analyse about 3 million positions per second against some 200 million for Deep Blue, although Deep Fritz ran on far fewer processors; this confrontation ended in a draw. The game of Go, an Asian game, remains for the moment extremely difficult for computers. The kinds of intelligence it requires (pattern recognition, understanding of global objectives, and prioritisation of objectives) are difficult to program, so the best programs are still beaten by rather weak players. Finally, in Backgammon, Poker and Bridge, which involve a notion of probability, programs play at the level of the best players.

IV. History of Machine Learning in Draughts

I will begin this report with a brief overview of research in machine learning applied to draughts. The real beginning for both chess and draughts programs may be dated to March 9, 1949, when Claude Shannon, a researcher at Bell Telephone Laboratories, presented a seminar paper on how to program a chess computer. Computers then appeared and made it possible to apply Shannon's theories. Within a few years computers improved, and in 1992 Chinook, an American-checkers program developed since 1988 at the University of Alberta, beat all the other players at the World Championship but lost its match against Marion Tinsley (2 wins, 4 losses and 33 draws). In 1995 Chinook became world champion by beating Don Lafferty (1 win and 31 draws). At the same time, other programs were being developed for other variants of draughts. Nowadays a program called Buggy, developed by French and Dutch players, has managed to win against the 11th-ranked player in the world. It is currently the strongest program (world champion among draughts programs) and is continually being developed with the aim of beating every human player; among other things it uses full and lazy evaluation functions and pattern (shot) recognition. The matches between humans and machines show how difficult it is to program this game and to find efficient algorithms.
V. A View of the Probabilistic Algorithms applied to Draughts

A. Temporal Difference Learning

First of all, the TD(λ) family of learning procedures has been applied widely to game playing. TD is well suited to games because it updates its prediction at each time step towards the prediction at the next step; in other words, it estimates the relative value of moves by evaluating the state of the game after it plays. Moreover, no record of the whole game needs to be kept, because only the current state and the previous evaluation are needed to calculate the error and the new evaluation at any particular stage. Most importantly, TD can improve the program's play through self play: it can run for thousands of games, learn, and so become a strong player. TD learning thus provides a computational mechanism that allows the positive or negative reward given at the end of a game to be passed back in time to previous moves.

B. Artificial Neural Network

Samuel pioneered the idea of updating evaluations based on successive predictions in a checkers program. He used a polynomial evaluation function (adjusting its coefficients), as opposed to the artificial neural network (ANN) used in NeuroDraughts (adjusting its weights). Samuel's work nevertheless helped greatly in creating a good feature set for NeuroDraughts and produced many results that remain references in machine learning. The ANN used here is a multi-layer perceptron (MLP), which consists of a layer of inputs, a layer of hidden units and a layer of outputs. Each input is connected to each hidden unit and each hidden unit is connected to each output, so the network can be used to approximate functions. Training such an MLP is more complex because of the addition of the hidden layer.

C. Functions applied to units

Several functions are used to process the units of the ANN. A squashing function controls the range of values coming from each hidden unit, keeping them within a fixed interval. The hyperbolic tangent was chosen for NeuroDraughts, so values are kept between -1 and 1: tanh is a bijective function from (-infinity, +infinity) to (-1, +1), with -1 standing for a loss and +1 for a win. Moreover, to update the weights we use backpropagation, calculating the error signal of the network; a linear update rule then changes the weights between the layers.

D. Reinforcement Learning and TD Learning

In reinforcement learning the learner is rewarded for performing well (winning) and given negative reinforcement for performing badly (losing). Between the starting board and the final board, when no specific reward is available, the TD mechanism tries to update the prediction for the current state towards that of the next state. For all non-terminal board states the program forms a training pair between the current board state and the prediction of a win for the next state. TD can thus be viewed as an extension of backpropagation. An important advantage of TD is that we do not need to wait until the final outcome to train, which means that only one board state must be kept in memory. This is also beneficial on multi-processor machines: one thread could be working out the next move while another trains on the current move. TD considers a state good if the most likely state following it is good, and vice versa. So, after playing several games, one can expect the TD-trained network to be an accurate predictor of winning or losing. A small sketch of this evaluation-and-update scheme is given below.
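The following Python/NumPy sketch, written for this report rather than taken from NeuroDraughts, illustrates the ideas of sections V.A to V.D: a small tanh network serves as the board evaluator, and a TD(λ)-style update with eligibility traces nudges each prediction towards the next one. The layer sizes, the learning rate alpha and the trace decay lam are assumed values.

import numpy as np

class TDNet:
    """Tiny tanh evaluator trained with a TD(lambda)-style update (illustrative only)."""

    def __init__(self, n_inputs, n_hidden, alpha=0.1, lam=0.7, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.uniform(-0.2, 0.2, (n_hidden, n_inputs))  # input -> hidden weights
        self.W2 = rng.uniform(-0.2, 0.2, n_hidden)               # hidden -> output weights
        self.alpha, self.lam = alpha, lam                         # learning rate, trace decay
        self.e1 = np.zeros_like(self.W1)                          # eligibility traces
        self.e2 = np.zeros_like(self.W2)

    def predict(self, x):
        """Forward pass; tanh keeps the evaluation in (-1, 1): -1 ~ loss, +1 ~ win."""
        h = np.tanh(self.W1 @ x)
        return np.tanh(self.W2 @ h), h

    def td_step(self, x, target):
        """Nudge the prediction for board encoding x towards 'target'
        (the prediction for the next position, or the final +1/-1 reward)."""
        x = np.asarray(x, dtype=float)
        y, h = self.predict(x)
        dy = 1.0 - y ** 2                                       # derivative of tanh at the output
        grad_W2 = dy * h                                        # gradient of y w.r.t. W2
        grad_W1 = np.outer(dy * self.W2 * (1.0 - h ** 2), x)   # gradient of y w.r.t. W1
        self.e2 = self.lam * self.e2 + grad_W2                  # decay traces, add new gradients
        self.e1 = self.lam * self.e1 + grad_W1
        delta = target - y                                      # TD error
        self.W2 += self.alpha * delta * self.e2
        self.W1 += self.alpha * delta * self.e1
        return y

During self play one would call net.td_step(x_t, target) for the encoding x_t of each position the learner reached, where target is the network's prediction for the following position, or the final reward of +1 (win) or -1 (loss) at the end of the game.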
VI. NeuroDraughts

A. Architecture

The evaluation network and the temporal difference procedure are incorporated into a single structure, which constitutes the learner. The mapping of boards to network inputs is handled so that there is a direct relation between the board state and the neural network. The network consists of a layer of inputs, a hidden layer and a single output unit; it is trained using backpropagation to evaluate boards after the player has moved. To represent the board, note that each playable square can be in one of five states, so the board can be encoded as a set of squares where 0 means empty, 0.25 a black man, 0.5 a red man, 0.75 a black king and 1 a red king. There are other ways to represent the board: each square encoded with 3 inputs, or the board described by a given number of features.

B. Training with NeuroDraughts

Several training methods can be applied to NeuroDraughts. The first is the straight-play strategy, which lets two opponents play against each other, both learning, for a set number of games. With this method it is impossible to measure improvement, because no benchmark is available at any stage. The second is cloning: a training network plays against a non-training network, and if the training network wins a certain percentage of games (80%-90%), the non-training network copies its weights. This method is not reliable, because if the training net wins the last 5 games after losing the first 45 it may still be cloned; even though it looks good at that moment, a much poorer network is in fact being cloned. Another strategy is training on expert play, but expert games always finish with six or more pieces left, so on its own it is a poor choice for NeuroDraughts. Self play is probably the best training method, as it is fully automated and should improve consistently from generation to generation. Applying look-ahead (seeing two moves ahead) can reveal a lot, even though for confirmed players a two-move look-ahead is very shallow (on average they can see nine or ten moves ahead). Finally, training against a human player could be beneficial if that player is of a high standard.

C. Training through self play

This part presents some points about self-play training. The basis of self play is the tournament: a set of games during which training occurs is followed by two test games, which show whether or not the level of play has improved enough to beat its clone. If it has, the weights of the training net are copied to the non-training opponent. The neural net replaces hand-coded rules for move selection: each board reachable from the current position is generated and evaluated, and the move with the highest evaluation is passed back to the player. At the end of the game, depending on who has won, a reward is given to the network being trained, and its behaviour is either positively or negatively reinforced (a sketch of this selection-and-reward loop is given below). Nevertheless, self play has a big drawback: if the network plays absurd moves and "thinks" they are good, these bad habits will not be removed by the network itself, so some manual human intervention may be necessary.
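Here is a minimal sketch of this loop. Only the piece codes (0, 0.25, 0.5, 0.75, 1) come from the report itself; the helpers legal_moves, apply_move and evaluate are assumptions introduced for illustration.

# Sketch of the board encoding and the selection/reward loop of section VI.
PIECE_CODE = {None: 0.0, "black_man": 0.25, "red_man": 0.5,
              "black_king": 0.75, "red_king": 1.0}

def encode(board):
    """Map each of the 32 playable squares to a single input value."""
    return [PIECE_CODE[piece] for piece in board]

def select_move(board, legal_moves, apply_move, evaluate):
    """Generate every board reachable in one move, evaluate it, keep the best move."""
    return max(legal_moves(board),
               key=lambda move: evaluate(encode(apply_move(board, move))))

def final_reward(winner, trained_side):
    """Terminal TD target: +1 if the trained network won, -1 if it lost."""
    return 1.0 if winner == trained_side else -1.0

With the TDNet of the previous sketch, evaluate could simply be lambda x: net.predict(np.asarray(x, dtype=float))[0].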
VII. Performances from the paper

A. Human playing against the network

At the beginning, I thought it would be nice to run the experiments myself. I wanted to use a Turbo Pascal program provided by a French programmer, but it turned out to be too difficult to use (no comments, a single file), so I decided to use the program of M. Lynch provided with his paper. Before playing against the program, I had to create a network with the Expert Trainer and copy it in order to evaluate the program's performance. I noticed that the program uses the rules of checkers rather than international draughts; even if the main difference is the number of pieces, the strategy is not really the same. So I trained a network extensively, using the many functions provided by the program, and I must admit that it works very well tactically (it easily sees every gain of pieces). I intentionally made some mistakes and the program spotted them quickly. On the other hand, like all other programs it has some problems with strategy: in a certain kind of game I was always able to win. Still, because the program can look ahead easily, its strategy is quite interesting, even if some moves were absurd; playing many more games would improve this point. Moreover, we could inspect the network to see where the fault lies.

B. Tournament

From a more objective point of view, the paper runs sequences of 2000 training games and evaluates them as follows. First, a tournament is held between networks: all the cloned networks and the final nets produced within each training sequence are played against each other, and the best net and the final net of each training sequence are chosen, so each training sequence is represented by two nets. These then play in another tournament in which each network plays 2 matches against every other network trained with the same representation (self play, dual, expert). Finally, the best 10 nets for each representation are chosen and played off in a cross-representational tournament in which each net plays 58 matches (29 x 2). The result is that the networks trained with expert games produced the strongest set: they won or drew 88% of their games, as opposed to 67% for dual play and 66% for self play. This shows that the best training remains expert games, even though such games end while six or seven pieces remain on the board. Nevertheless, continuing with self play afterwards can improve the network until it is able to beat very good players. Even so, I do not think that applying NeuroDraughts alone would yield a network efficient enough to beat any player; by adding it to an established program, however, it might become possible. It is a promising direction.

VIII. Further Topics

A. Development of NeuroDraughts

The aim of such a program is to beat its creator, and this the program achieves easily. The program is accomplished from both a technical and a theoretical point of view. It will nevertheless remain in development, with the emphasis on reducing the domain-specific knowledge while trying to keep the level of play constant. In the future, genetic algorithm technology could be used to try to discover features automatically. Moreover, a search algorithm such as MiniMax could be added: a search strategy that provides a simple model of the probable immediate outcomes of a move (a small sketch follows below).
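As an illustration of how such a search could sit on top of the learned evaluation, here is a minimal fixed-depth MiniMax sketch; the helpers legal_moves, apply_move and evaluate are again assumptions, and a real engine would add alpha-beta pruning.

def minimax(board, depth, maximizing, legal_moves, apply_move, evaluate):
    """Fixed-depth minimax: back up the evaluation function from the leaves."""
    moves = legal_moves(board)
    if depth == 0 or not moves:
        return evaluate(board)            # leaf or terminal position: use the evaluator
    children = (minimax(apply_move(board, m), depth - 1, not maximizing,
                        legal_moves, apply_move, evaluate) for m in moves)
    return max(children) if maximizing else min(children)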
B. Application

This method, which is quite new, could be applied to some existing draughts programs. Indeed, an international project (mainly French) named "Buggy" aims to beat the very best players in the world. Of course this is still out of reach because of the complexity of the game, but Buggy attempted in March to beat the 11th-ranked player, and it does not use NeuroDraughts for the moment, which could improve it. For now, Buggy uses a move generator, a search algorithm and an evaluation function. With the first two components alone, the program can already beat 99% of players: its tactics are very strong thanks to its calculating power. On the other hand, to reach the highest level the program must understand the game; it must acquire knowledge linked to the game and its theory, that is, strategy, so the programmer must build an evaluation function that is as accurate as possible. But draughts is so rich and complex that it is extremely difficult to transmit even simple notions to the program, and it is always possible to find positions that the software cannot play correctly. By incorporating NeuroDraughts into Buggy, we could improve the evaluation function of this software. For the moment the program relies on a kind of database of the different systems of play and evaluates positions with this database together with complete and incomplete evaluation functions; depending on the system of play, it can thus evaluate the value of the board. NeuroDraughts could intervene at this board-evaluation stage and so improve the program. In a word, applying NeuroDraughts to an existing program could give it the ability to learn by itself, which is not common in other programs: current programs use databases of positions to evaluate a position rather than genuinely probabilistic notions.

IX. Conclusion

Although I had planned to implement the project myself, I was not able to, for lack of the source code of an existing draughts program; moreover, Lynch's program did not come with its source. I was nevertheless able to test NeuroDraughts, which appears as strong tactically as most existing draughts software and better than other programs in strategy. This project gives an overview of techniques that use reinforcement learning to improve the artificial intelligence of strategy games, and especially of draughts. Finally, this technique, even if it seems quite complicated at first, is simple enough to be understood. It could therefore be proposed to the "Buggy" project in order to improve that program, and it could also be used in other strategy games. Draughts remains one of the most advanced domains in research on learning in games, because draughts is among the most complicated games to compute (after Go), which makes research on it all the more interesting. We may add that the more a program plays, the more it learns.

X. Research Papers ordered by relevance

"An Application of Temporal Difference Learning to Draughts", Mark Lynch
"NeuroDraughts: the role of representation, search, training regime and architecture in a TD draughts player", N. J. L. Griffith
"Inductive Inference of Chess and Draughts Player Strategy", Anthony R. Jansen, David L. Dowe and Graham E. Farr
"Experiments with a Bayesian Game Player", Warren D. Smith and Eric B. Baum
"Exact Probabilistic Analysis of Error Detection for Parity Checkers", V. A. Vardanian