Learning BlackJack with Artificial Neural Network

Learning BlackJack with ANN
(Artificial Neural Network)
ECE 539
Final Project
Ip Kei Sam
sam@cae.wisc.edu
ID: 9012828100
Abstract
Blackjack is a card game in which the player attempts to beat the dealer by holding a hand with a total value higher than the dealer's but not exceeding 21. The probabilistic nature of Blackjack makes it an illustrative application for learning algorithms: the learning system must explore different strategies in order to find those with a higher probability of winning. This project explores the use of an Artificial Neural Network (ANN) in Blackjack for learning strategies, with a reinforcement learning algorithm used to learn them. Reinforcement learning is a process of mapping situations to actions when the learner is not told which actions to take, but must discover which actions yield the highest reward by trial and error. The trained ANN is used to play Blackjack without being explicitly taught the rules of the game. Furthermore, the efficiency of the ANN and the results of the learning algorithm are investigated and interpreted as different strategies for playing Blackjack and other games of chance.
Background
Initially, two cards are dealt to each player; the object of Blackjack is to draw cards from a deck of 52 cards toward a total value of 21. The player can choose from the following actions:
- Stand: stay with the current hand and take no more cards.
- Hit: add a card to the hand to bring the total card value closer to 21.
- Double Down: when holding 2 cards, the player can double his bet by hitting exactly one more card and standing after that.
- Split: holding a pair of cards with the same value, the player can split his hand into two hands. The player may split up to 3 times in a game.
For simplicity, Double Down and Split are not considered in this project. The value of a hand is the sum of the values of its cards, where each card from 2 to 10 is valued at its face value, J, Q, and K are valued at 10, and Aces can be either 1 or 11. Each player plays against the dealer, and the goal is to obtain a hand with a greater value than the dealer's hand but less than or equal to 21.
A player may hit as many times as he wishes as long as his hand does not go over 21. The player can also win by holding 5 cards in hand with a total of less than 21. The dealer hits while his hand is less than 17 and stands once it reaches 17 or more. When the player is dealt 21 points in the first 2 cards, he automatically wins his bet if the dealer is not also dealt 21. If the dealer has blackjack (21 points), the game is over and the dealer wins all bets, or ties with any player who also has blackjack (21 points). Figures 1 and 2 show a demo of the Matlab Blackjack (blackjack.m) program. The player presses the hit or stand button each turn, and the dealer's and player's total points are calculated and shown at the bottom. The display also shows the total balance remaining in the player's account and the amount the player has bet for this game. In this example, the player bet $0. The program exits when the player's balance is less than zero.
Figure 1: The initial state of the game; the first 2 cards are dealt to the player and the dealer.
To measure the efficiency of the rules in Blackjack, I simulated the program playing against the dealer for 1000 games, where the dealer follows the 17-point rule. The efficiency can be observed from the percentage of games won and the percentage of draw games. The comparison of the player's random moves versus the dealer's 17-point rule is shown in Figure 2a.
Figure 2: As the player chooses to stand, the dealer chooses to hit but gets busted. The player wins.
Strategy                    Win %    Tie %
Player's random moves       31%      8%
Dealer's 17 points rule     61%      8%
Figure 2a: Efficiency of different strategies in Blackjack. Each strategy is played for 1000 games.
One of the goals of this project is to develop a better strategy with an ANN that beats the dealer's 17-point rule, that is, a new strategy with a higher winning percentage. Different configurations of the MLP and different preprocessing of the input training data sets are also investigated later in this paper. Finally, some of the Blackjack strategies interpreted from the experiment results are explored.
Applying Reinforcement Learning to Blackjack
Reinforcement learning is a process of mapping situations to actions such that the reward value is maximized. The learning algorithm decides which actions to take by finding the actions that yield the highest reward through trial and error. The actions taken affect not only the immediate rewards but the subsequent rewards as well. In Blackjack, given a set of dealer's and player's cards, if the probability of winning from each outcome is known, the player can always make the best decision (hit or stand) by taking the action that leads to the state with the highest winning probability. For each final state, the probability of winning is either 1 (if the player wins or draws) or 0 (if the player loses). In this project, the initial winning probability of each intermediate state is set to 0.5 and the learning parameter α is also initialized to 0.5. The winning probability of each state is updated for the dealer and the player after each game. The winning probability of the previous state is moved closer to that of the current state based on this equation: P(s) = P(s) + α*[P(s') - P(s)], where α is the learning parameter, s is the current state and s' is the next state. For example, Figure 3 shows the first 3 rows taken from the result table output when the Matlab program genlookuptable.m plays against the dealer based on random decisions.
Dealer's cards (columns 1-5)    Player's cards (columns 6-10)    P(win)    Action
2   5   0   0   0               6   6   0   0   0                0.3750    1  0
2   5   0   0   0               4   6   6   0   0                0.2500    1  0
2   5  10   0   0               4   6   6   7   0                0         1  1
Figure 3: The result table (lpxx) from the Matlab program output after one game.
The first 5 columns represent the dealer's cards and the next 5 columns represent the player's cards. The dealer and the player can each have a maximum of 5 cards by the game rule. The card values in each hand are sorted in ascending order before they are inserted into the table. Column 11 is the winning probability of each state. Columns 12 and 13 represent the action taken by the player, where [1 0] denotes a "hit", [0 1] denotes a "stand", and [1 1] denotes an end state where no action is required. In the first row, the dealer has 2 and 5 (with a total score of 7) and the player has 6 and 6 (with a total score of 12). Based on a random decision, the player decides to hit. In the next round, the player gets a 4 for a total of 16, while the dealer gets a 10 for a total of 17. The player decides to hit again. This time the player is busted with a total score of 23 in the final state (row 3). Since the player lost, this state is rewarded with P(3) = 0. The learning parameter α = 0.5. The winning probability of taking a "hit" in the previous state is:
P(2) = P(2) + α*[P(3) - P(2)] = 0.5 + 0.5 * (0 - 0.5) = 0.2500
Similarly, P(1) = P(1) + α*[P(2) - P(1)] = 0.5 + 0.5 * (0.25 - 0.5) = 0.3750
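As a minimal sketch of this backward update (the function name and calling convention here are illustrative, not taken from the project's Matlab files), the probabilities of the states visited in one game can be updated as follows:

    % Sketch of the winning-probability update along one game trajectory.
    % P is the vector of winning probabilities for the states visited in
    % this game, ordered from the first state to the final state, with the
    % last entry already set to 1 (win/draw) or 0 (loss).  alpha is the
    % learning parameter.  Names are illustrative, not from the project files.
    function P = updateTrajectory(P, alpha)
        for s = length(P) - 1 : -1 : 1
            % move P(s) toward the (already updated) value of the next state
            P(s) = P(s) + alpha * (P(s+1) - P(s));
        end
    end

For the example above, updateTrajectory([0.5 0.5 0], 0.5) returns [0.375 0.25 0], matching the hand computation.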
After playing a sequence of games, the result table contains the winning probability of all intermediate states, the dealer's and player's cards, and the actions taken by the player.
When playing a new game, if the same card pattern is encountered, for example in row 2,
2   5   0   0   0               4   6   6   0   0                0.2500    1  0
The player will choose to "stand" this time, since taking a "hit" from this state has a low winning probability. After the new game, the winning probability is again updated based on the game result. If the player loses again this time by hitting an additional card, the winning probability becomes: P(2) = P(2) + α*[P(3) - P(2)] = 0.25 + 0.5 * (0 - 0.25) = 0.125
Otherwise, if the player wins this time, the probability is:
P(2) = P(2) + α*[P(3) - P(2)] = 0.25 + 0.5 * (1 - 0.25) = 0.625
The winning probability of each state converges after a large number of games; the more games are played, the more accurate the winning probabilities become.
Originally, the dealer follows the 17-point rule of Blackjack. To make the dealer more intelligent, reinforcement learning can be applied to the dealer as well. Similarly, a result table can be obtained for the dealer after each game. Let's take a look at the following dealer result table:
Dealer's cards (columns 1-5)    Player's cards (columns 6-10)    P(win)    Action
2   5   0   0   0               6   6   0   0   0                0.7500    1  0
2   5  10   0   0               4   6   6   0   0                1.0000    1  1
In this game, the dealer was initially dealt 2 and 5 with a total of 7, while the player was dealt 6 and 6 with a total of 12. The dealer chose to hit, resulting in a total of 17 points in the final state. The player also decided to hit, resulting in a total of 16 points. For the dealer, P(2) = 1, so P(1) = P(1) + α*[P(2) - P(1)] = 0.5 + 0.5 * (1 - 0.5) = 0.7500
When the same card pattern (in row 1) is encountered again in a new game, the dealer will choose to hit again, since a "hit" leads to a state with a higher probability of winning. Again, the probability value is updated based on the result after each new game.
The learning parameter α is initially set to 0.5 and it decreases after each game based on this equation: α = α * (# of games remaining / total # of games). Therefore, once all test games are finished, α becomes 0. If the player or the dealer always takes the action leading to the state with the highest winning probability, some intermediate states, still at the initial winning probability of 0.5, will always compare unfavorably to other states and will never be encountered. Therefore, random moves are necessary in between in order to explore these states. The winning probabilities of states reached by random moves are not updated at the end of the game. The total number of games simulated in this program is 1000. In each game, 20% of the moves are randomly generated. Since α decreases after each game, the winning probability value of each state converges as the number of games played increases.
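A minimal sketch of this schedule is shown below; the game logic itself is omitted, and the variable names are illustrative rather than taken from the project files.

    % Sketch of the training schedule: the learning parameter decays to zero
    % over the simulated games (using the decay equation quoted above), and
    % roughly 20% of the moves in each game are chosen at random so that
    % rarely visited states are still explored.
    totalGames  = 1000;
    exploreRate = 0.2;      % fraction of random (exploratory) moves per game
    alpha       = 0.5;      % initial learning parameter

    for g = 1:totalGames
        % alpha = alpha * (# of games remaining) / (total # of games)
        alpha = alpha * (totalGames - g) / totalGames;

        % inside a game, each move would explore with probability exploreRate;
        % states reached by such random moves are not updated afterwards
        isRandomMove = rand() < exploreRate;
    end
    % after the final game, alpha is exactly 0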
Let’s look at the previous example one more time:
Dealer's cards (columns 1-5)    Player's cards (columns 6-10)    P(win)    Action
2   5   0   0   0               6   6   0   0   0                0.7500    1  0
2   5  10   0   0               4   6   6   0   0                1.0000    1  1
The P(1) value starts at the default value of 0.5; the first time the above pattern is encountered, if the hand is won, P(1) is updated to 0.7259. After running 5000 games in the simulation, this pattern is encountered 9 times, and the P(1) value converges to 0.76, as shown in Figure 4 below. This suggests that if the player has two 6's, taking a "hit" gives a better chance of winning.
Figure 4: The convergence of the winning probability over 5000 games.
Let's look at another example. This time the player holds 8 and 10:
Dealer's cards (columns 1-5)    Player's cards (columns 6-10)    P(win)    Action
7   8   0   0   0               8  10   0   0   0                0.2500    1  0
3   7   8   0   0               5   8  10   0   0                0         1  1
The P(1) value starts at the default value of 0.5; the first time the above pattern is encountered, P(1) is updated to 0.25 when the hand is lost. After running 5000 games in the simulation, this pattern is encountered 11 times, and the P(1) value converges to 0.35, as shown in Figure 5. This suggests that if the player holds 8 and 10, taking a "hit" is very likely to lose. In this case, the player should only "hit" when the dealer's hand is higher than the player's hand.
Figure 5: The convergence of the winning probability over 5000 games.
Both experiments show the convergence of the winning probability values of each state as the learning parameter decreases after each game.
P(s) = P(s) + α*[P(s') - P(s)]                (Equation 1)
The update method based on Equation (1) performs quite well in this experiment. As α is reduced over time, the method converges to the true winning probability of each state given optimal actions by the player. Except for the random moves, each action taken (hit or stand) is in fact the optimal move against the dealer, since the method converges to an optimal policy for playing Blackjack. If α does not decrease to 0 at the end of training, the player also continues to play well against a dealer that changes its way of playing slowly over time. Applying reinforcement learning to the dealer will certainly beat the dealer's original 17-point rule.
Matlab Implementation
As the Matlab program (blackjackN.m) plays against the dealer, each time a new card pattern is encountered it is given a default probability value of 0.5. Its winning probability is updated after each game based on the final result and is added to the result table at the end. States that have not been encountered by the program are not available in the result table; when the program encounters a card pattern that is not found in the result table, it makes a random move. There are approximately 10^6 different states for the dealer's and player's card patterns. Many of these possible states do not need to be considered, as they go well beyond 21 points. For example, the state of having five 10's in hand is not possible, as the hand is already busted at three 10's. Moreover, it is not feasible to simulate all of these states because of the slow iterative operations in Matlab.
Matlab Simulation
All cards in one hand are sorted before they are inserted into the table. The dealer and the player each have a table of their own. Each table, either the dealer's or the player's, is used as the input data for training the ANN. Given a card pattern, the corresponding winning probability is returned.
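As an illustration of how such a table could be queried (the zero-padding convention and the function names are assumptions made for this sketch, not the exact format used by the project files):

    % Sketch of matching a (dealer hand, player hand) pair against the lookup
    % table described above: cards are sorted, zero-padded to five columns,
    % and the matching row's winning probability and action are returned.
    function [p, action, found] = lookupState(lookupTab, dealerHand, playerHand)
        key = [padHand(dealerHand), padHand(playerHand)];      % 1x10 pattern
        idx = find(ismember(lookupTab(:, 1:10), key, 'rows'), 1);
        if isempty(idx)
            p = 0.5;  action = [];  found = false;   % unseen state: default 0.5
        else
            p = lookupTab(idx, 11);  action = lookupTab(idx, 12:13);  found = true;
        end
    end

    function h = padHand(hand)
        h = zeros(1, 5);
        h(1:numel(hand)) = sort(hand);   % ascending order, zero-padded
    end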
The Matlab program (genlookuptable.m) plays against the dealer, starting with an initially empty dealer table and player table, to explore the different combinations of the dealer's and the player's cards. There were 15000 games simulated, which generated 28273 entries for the dealer's table and 32372 entries for the player's table. In the end, the program had encountered less than 10% of all possible states. However, the states that were encountered are the most common ones; in fact, less than 10% of all the different states commonly occur in playing Blackjack. For example, states with 4 or 5 cards in hand occur much less frequently, and having 4 or 5 tens in hand is impossible in Blackjack.
Both the dealer and the player learn how to play better as the number of games played increases. To test the accuracy of both the dealer table and the player table, I set up another sequence of games in which the dealer plays against the player using the generated lookup tables.
The following tables show the summary of the experiments:
When applying reinforcement learning to the player to play against a dealer using the 17-point rule, the player shows a significant improvement in winning percentage. Out of 1000 games, the player won 512 times.
Strategy                   Win %    Tie %
Player (with learning)     51.2%    7%
When applying Reinforcement learning to the dealer (where the dealer does not follow
the 17 points rule) to play against a player with random moves, the dealer has improved
its winning percentage to 67%.
Strategy                   Win %    Tie %
Dealer (with learning)     67%      5%
When applying reinforcement learning to both the dealer and the player to play against each other, the percentage of tie games increases, because they are more likely to be using the same set of strategies based on reinforcement learning.
Strategy                    Win %    Tie %
Player's random moves       43%      12%
Dealer's 17 points rule     45%      12%
Finally, I played against the dealer, with the dealer using its lookup table. I lost 11 games out of 20 and had 1 tie game.
Strategy         Win %    Tie %
Human Player     45%      5%
Using the dealer and player tables greatly increased the number of games won by both the dealer and the player. Therefore, the tables generated for the dealer and the player are good enough to use in the subsequent experiments.
Just by looking at the values in the lookup tables, the game strategy can be interpreted. For the player, if the total is higher than 14, in most cases the table suggests not to "hit". Similarly, for the dealer it suggests not to "hit" above 15. Without knowing the rules of Blackjack, reinforcement learning is able to determine, through the updates at the end of each game, a threshold of 14 for the player and 15 for the dealer, even though this is a little more conservative than the dealer's 17-point rule. Watching the player play against the dealer when both use lookup tables, the dealer has a slightly higher winning percentage. This suggests that the dealer table, which suggests not to "hit" at 15, plays a little better than the player table, which suggests not to "hit" at 14.
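One simple way to read such a threshold out of a table in the format described earlier is sketched below; the function name and the grouping by hand total are my own illustration under that assumed column layout, not code from the project files.

    % Sketch: for each player hand total, compare the average winning
    % probability of rows where the chosen action was "hit" against rows
    % where it was "stand", using the column layout described earlier
    % (columns 6-10: player cards, 11: P(win), 12-13: action).
    function summarizeThreshold(playerTable)
        totals  = sum(playerTable(:, 6:10), 2);     % player hand totals
        isHit   = playerTable(:, 12) == 1 & playerTable(:, 13) == 0;
        isStand = playerTable(:, 12) == 0 & playerTable(:, 13) == 1;
        for t = 12:20
            pHit   = mean(playerTable(totals == t & isHit,   11));
            pStand = mean(playerTable(totals == t & isStand, 11));
            fprintf('total %2d:  P(win|hit) = %.2f   P(win|stand) = %.2f\n', ...
                    t, pHit, pStand);
        end
    end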
Applying ANN to Blackjack
The next step is to prepare the input data for the ANN by using the result lookup tables. Once the dealer and player lookup tables are ready, they can be converted into training data for the ANN. The dealer's cards and the player's cards from the lookup tables are the inputs to the ANN. The output contains two neurons, each corresponding to one action (either hit or stand); the action taken depends on which output neuron has the higher value. This is a classification problem. To simplify the problem a little, I set up several MLP structures for different stages of the game. The different MLPs needed for the dealer and the player, as well as the overall game flow chart, are shown in Figure 6 below. Note that the rules of Blackjack (and the 17-point dealer rule) are not implemented in the project; all decisions are based on MLP outputs.
Figure 6: The overall flow chart of the different MLPs used in this project.
The first level MLP contains 4 inputs (2 cards from the dealer and 2 cards from the player
in the initial state) and 2 outputs (either hit or stand). Then the second level MLP contains
5 inputs (2 from the dealer and 3 from player) and 2 outputs (hit/stand). The third level
MLP contains 5 inputs (3 from the dealer and 2 from player) and 2 outputs (hit/stand).
The fourth level MLP contains 6 inputs (3 from the dealer and 3 from player) and 2
outputs (hit/stand). Other MLP levels (4 or 5 cards for the player or dealer) are not considered, as the probability of reaching a stage with 4 or 5 cards in hand is small and the structures become more complicated, which leads to more difficulty in classification. When either the player or the dealer has 4 or 5 cards in one hand, the action is determined either by a random move or by checking whether the dealer's total is less than the player's total, and vice versa. Accordingly, the input lookup table is separated into 5 sub-tables (findMLP.m implements this function): the first sub-table picks up all rows that contain 2 dealer cards and 2 player cards (the initial state only), the second sub-table contains the states with 2 dealer cards and 3 player cards, the third sub-table has the states with 3 dealer cards and 2 player cards, and the fourth sub-table contains the states with 3 dealer and 3 player cards. The remaining rows are put into the fifth sub-table, which holds the states with 4 or 5 dealer or player cards. Each sub-table is saved as the input to one of the 4 different MLP levels defined above. During the simulation, depending on how many cards the dealer and the player have, a different MLP structure is called to decide which action (hit/stand) to take, as sketched below. The "End" state represents the end of the game, where either the player won, the dealer won, or the game is a draw.
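A minimal sketch of this routing step, with illustrative function and variable names:

    % Sketch of choosing which trained MLP to call, based on how many cards
    % the dealer and the player currently hold.  Level 0 means "no MLP":
    % fall back to a random move or a comparison of total points.
    function level = selectMLPLevel(numDealerCards, numPlayerCards)
        if     numDealerCards == 2 && numPlayerCards == 2
            level = 1;     % MLP level 1: 4 inputs
        elseif numDealerCards == 2 && numPlayerCards == 3
            level = 2;     % MLP level 2: 5 inputs
        elseif numDealerCards == 3 && numPlayerCards == 2
            level = 3;     % MLP level 3: 5 inputs
        elseif numDealerCards == 3 && numPlayerCards == 3
            level = 4;     % MLP level 4: 6 inputs
        else
            level = 0;     % 4 or 5 cards in a hand: handled outside the MLPs
        end
    end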
The next step is to analyze and preprocess the input data for each feature vector before training the different MLPs. Each column of the lookup table contains a card value, so the range of each value is from 1 to 10. First, let's look at the MLP level 1 experiment.
MLP level 1: 4 inputs (2 from the dealer and 2 from the player) and 2 outputs.
The first and second feature vectors are the dealer's and the third and fourth are the player's. Since the cards are sorted before they are inserted into the table, the mean of the first (or third) feature vector is always less than the mean of the second (or fourth) feature vector. The values are shown in the following table:
Feature vector #      1 (dealer)   2 (dealer)   3 (player)   4 (player)
Mean                  4.5435       7.2961       3.8168       6.7322
Standard Deviation    1.2907       2.3353       1.7196       1.8596
Variance              1.194        1.282        1.674        1.374
Minimum               1            1            1            1
Maximum               10           10           10           10
Inspecting the feature vectors of the input data, the table shows that the mean of each feature differs from the others: the means of features 1 and 3 are lower than those of features 2 and 4. The minimum and maximum of the 4 feature vectors are the same. The preprocessing starts with feature vector normalization and then scales all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 2 and 4 are more important with respect to the class (since their standard deviation values are the highest). The system could pay more attention to these feature vectors by putting more weight on them. This data set is nonlinear, and the classification rate is not satisfactory without preprocessing the data set. With preprocessing of the input data, it is possible to identify exactly which feature vectors play more important roles in the classification phase, so that these feature vectors receive more weight.
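The report does not spell out the exact normalization performed by scale.m; the following is one plausible sketch, assuming a simple min-max normalization of each feature vector followed by linear scaling into the -5 to 5 range (the function name is hypothetical):

    % Sketch: map every feature column of X into the range [lo, hi],
    % e.g. Xs = scaleFeatures(trainData(:, 1:4), -5, 5) for the level 1 MLP.
    % (Uses implicit expansion; requires MATLAB R2016b or later.)
    function Xs = scaleFeatures(X, lo, hi)
        % X: one row per training sample, one column per feature (card value)
        mn = min(X, [], 1);
        mx = max(X, [], 1);
        Xs = (X - mn) ./ max(mx - mn, eps);   % each column mapped to [0, 1]
        Xs = lo + (hi - lo) * Xs;             % then linearly to [lo, hi]
    end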
After preprocessing, the next step is to train the MLP with this set of training data. The Matlab program bp2.m implements the back-propagation algorithm that trains the MLP. Several MLP configurations were tested. Based on the training data, the program produces the weight matrix. The hyperbolic tangent activation function is used in the hidden layers and the sigmoidal function is used in the output neurons. Different values of α and µ, epoch sizes, numbers of hidden layers and numbers of neurons in the hidden layers were also tested.
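For reference, a minimal sketch of how a trained network of this form could be evaluated at play time is shown below; the weight-matrix format (a cell array with a bias column appended to each layer's weights) is an assumption made for this sketch, and bp2.m may store its weights differently.

    % Sketch of evaluating a trained MLP with tanh hidden layers and a
    % sigmoidal output layer, and turning the two outputs into an action.
    % x: column vector of preprocessed inputs (2-3 dealer + 2-3 player cards)
    % W: cell array of weight matrices, each with a bias column appended
    function action = mlpDecide(x, W)
        a = x(:);
        for k = 1:numel(W) - 1
            a = tanh(W{k} * [a; 1]);                % hidden layers: tanh
        end
        y = 1 ./ (1 + exp(-(W{end} * [a; 1])));     % output layer: sigmoid
        if y(1) >= y(2)
            action = 'hit';                         % first output neuron: hit
        else
            action = 'stand';                       % second output neuron: stand
        end
    end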
MLP (level 1) Configurations used in the experiment:
With normalization in feature vectors
The input feature vectors are scaled to range of -5 to 5,
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding input layer, total # of layers = 2,3,4
Max. Training Epochs: 1000
Number of neurons in hidden layer = 5, 10, 50
MLP Configuration: 4-5-2, 4-5-5-2, 4-5-5-5-2, 4-10-10-10-2, 4-50-50-50-2
The activation function for the hidden layers: hyperbolic tangent
The activation function for the output layer: sigmoidal
Partition training set dynamically into training and tuning sets;
Where 20% of training data reserved for tuning
# of epochs between convergence check = 10
epoch size = 64
α      µ      MLP Configuration    Classification Rate
0.1    0      4-5-2                80.2%
0.1    0.8    4-5-2                89.5%
0.8    0      4-5-2                79.1%
0.8    0.8    4-5-2                78.9%
0.1    0      4-5-5-2              85.8%
0.1    0.8    4-5-5-2              83.2%
0.8    0      4-5-5-2              84.1%
0.8    0.8    4-5-5-2              84.0%
0.1    0      4-5-5-5-2            87.2%
0.1    0.8    4-5-5-5-2            86.5%
0.8    0      4-5-5-5-2            85.3%
0.8    0.8    4-5-5-5-2            82.8%
0.1    0      4-10-10-10-2         89.5%
0.1    0.8    4-10-10-10-2         87.0%
0.8    0      4-10-10-10-2         86.8%
0.8    0.8    4-10-10-10-2         84.8%
0.1    0      4-50-50-50-2         88.0%
0.1    0.8    4-50-50-50-2         86.8%
0.8    0      4-50-50-50-2         86.2%
0.8    0.8    4-50-50-50-2         85.5%
Table 1: This table shows the classification rates of different MLP (level 1) configurations, with the α and µ values shown in columns 1 and 2.
The highest classification rate is 89.5%. If the MLP is too large, it may get trapped at a certain classification rate and stop improving during the training process. The best configuration is α = 0.1, µ = 0, with MLP configuration 4-10-10-10-2.
MLP level 2: 5 inputs (2 from the dealer and 3 from player) and 2 outputs.
This experiment setup is similar to the setup for MLP level 1. First the input feature vectors are preprocessed and analyzed. The characteristics of the 5 feature vectors are shown in the following table:
Feature vector #      1 (dealer)   2 (dealer)   3 (player)   4 (player)   5 (player)
Mean                  5.2532       7.7195       4.8351       5.2465       7.1328
Standard Deviation    1.1824       1.3353       1.2261       1.5152       1.9893
Variance              1.214        1.552        1.365        1.477        1.791
Minimum               1            1            1            1            1
Maximum               10           10           10           10           10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The preprocessing starts with feature vector normalization and then scales all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 2 and 5 are more important with respect to the class (since their standard deviation values are the highest). The system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 2) Configurations used in the experiment:
With normalization in feature vectors
The input feature vectors are scaled to range of -5 to 5,
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding input layer, total # of layers = 2,3,4
Max. Training Epochs: 1000
Number of neurons in hidden layer = 5, 10, 50
MLP Configuration: 5-5-2, 5-5-5-2, 5-5-5-5-2, 5-10-10-10-2, 5-50-50-50-2
The activation function for the hidden layers: hyperbolic tangent
The activation function for the output layers: sigmoidal
Partition training set dynamically into training and tuning sets;
Where 20% of training data reserved for tuning
# of epochs between convergence check = 10
epoch size = 64
α      µ      MLP Configuration    Classification Rate
0.1    0      5-5-2                78.3%
0.1    0.8    5-5-2                79.9%
0.8    0      5-5-2                79.5%
0.8    0.8    5-5-2                78.8%
0.1    0      5-5-5-2              84.2%
0.1    0.8    5-5-5-2              84.5%
0.8    0      5-5-5-2              83.8%
0.8    0.8    5-5-5-2              82.9%
0.1    0      5-5-5-5-2            86.6%
0.1    0.8    5-5-5-5-2            85.4%
0.8    0      5-5-5-5-2            84.9%
0.8    0.8    5-5-5-5-2            83.7%
0.1    0      5-10-10-10-2         90.1%
0.1    0.8    5-10-10-10-2         91.1%
0.8    0      5-10-10-10-2         90.7%
0.8    0.8    5-10-10-10-2         87.8%
0.1    0      5-50-50-50-2         90.7%
0.1    0.8    5-50-50-50-2         89.4%
0.8    0      5-50-50-50-2         89.6%
0.8    0.8    5-50-50-50-2         90.9%
Table 2: This table shows the classification rates of different MLP (level 2) configurations, with the α and µ values shown in columns 1 and 2.
The highest classification rate is 91.1%. If the MLP is too large, it may get trapped at a certain classification rate and stop improving during the training process. The best configuration is α = 0.1, µ = 0.8, with MLP configuration 5-10-10-10-2.
MLP level 3: 5 inputs (3 from the dealer and 2 from player) and 2 outputs.
This experiment setup is similar to that of MLP levels 1 and 2. First the input feature vectors are preprocessed and analyzed. The characteristics of the 5 feature vectors are shown in the following table:
Feature vector #      1 (dealer)   2 (dealer)   3 (dealer)   4 (player)   5 (player)
Mean                  5.0133       6.2891       6.8344       6.7214       7.17221
Standard Deviation    1.2214       1.3174       1.5291       1.0127       1.4319
Variance              1.1003       1.3212       1.2984       1.0137       1.2101
Minimum               1            1            1            1            1
Maximum               10           10           10           10           10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The preprocessing starts with feature vector normalization and then scales all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 3 and 5 are more important with respect to the class (since their standard deviation values are the highest). The system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 3) Configurations used in the experiment:
With normalization in feature vectors
The input feature vectors are scaled to range of -5 to 5,
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding input layer, total # of layers = 2,3,4
Max. Training Epochs: 1000
Number of neurons in hidden layer = 5, 10, 50
MLP Configuration: 5-5-2, 5-5-5-2, 5-5-5-5-2, 5-10-10-10-2, 5-50-50-50-2
The activation function for the hidden layers: hyperbolic tangent
The activation function for the output layers: sigmoidal
Partition training set dynamically into training and tuning sets;
Where 20% of training data reserved for tuning
# of epochs between convergence check = 10, epoch size = 64
α      µ      MLP Configuration    Classification Rate
0.1    0      5-5-2                78.2%
0.1    0.8    5-5-2                78.8%
0.8    0      5-5-2                80.1%
0.8    0.8    5-5-2                79.4%
0.1    0      5-5-5-2              82.0%
0.1    0.8    5-5-5-2              84.5%
0.8    0      5-5-5-2              84.1%
0.8    0.8    5-5-5-2              83.9%
0.1    0      5-5-5-5-2            85.2%
0.1    0.8    5-5-5-5-2            83.9%
0.8    0      5-5-5-5-2            85.7%
0.8    0.8    5-5-5-5-2            84.5%
0.1    0      5-10-10-10-2         89.9%
0.1    0.8    5-10-10-10-2         91.7%
0.8    0      5-10-10-10-2         92.5%
0.8    0.8    5-10-10-10-2         90.8%
0.1    0      5-50-50-50-2         90.7%
0.1    0.8    5-50-50-50-2         90.1%
0.8    0      5-50-50-50-2         91.1%
0.8    0.8    5-50-50-50-2         89.8%
Table 3: This table shows the classification rates of different MLP (level 3) configurations, with the α and µ values shown in columns 1 and 2.
The highest classification rate is 92.5%. The best configuration is α = 0.8, µ = 0, MLP
Configuration 5-10-10-10-2.
MLP level 4: 6 inputs (3 from the dealer and 3 from player) and 2 outputs.
This experiment setup is similar to the previous experiments. First the input feature vectors are preprocessed and analyzed. The characteristics of the 6 feature vectors are shown in the following table:
Feature vector #      1 (dealer)   2 (dealer)   3 (dealer)   4 (player)   5 (player)   6 (player)
Mean                  5.2710       6.5097       7.1321       6.0211       6.6872       7.0728
Standard Deviation    1.0015       1.1232       1.2156       1.0471       1.2016       1.3549
Variance              1.0121       1.1324       1.2533       1.0877       1.2001       1.3531
Minimum               1            1            1            1            1            1
Maximum               10           10           10           10           10           10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The preprocessing starts with feature vector normalization and then scales all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 3 and 6 are more important with respect to the class (since their standard deviation values are the highest). The system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 4) Configurations used in the experiment:
With normalization in feature vectors
The input feature vectors are scaled to range of -5 to 5,
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding input layer, total # of layers = 2,3,4
Max. Training Epochs: 1000
Number of neurons in hidden layer = 8, 12, 50
MLP Configuration: 6-8-2, 6-8-8-2, 6-8-8-8-2, 6-12-12-12-2, 6-50-50-50-2
The activation function for the hidden layers: hyperbolic tangent
The activation function for the output layers: sigmoidal
Partition training set dynamically into training and tuning sets;
Where 20% of training data reserved for tuning
# of epochs between convergence check = 10, epoch size = 64
α      µ      MLP Configuration    Classification Rate
0.1    0      6-8-2                79.1%
0.1    0.8    6-8-2                80.9%
0.8    0      6-8-2                80.5%
0.8    0.8    6-8-2                80.1%
0.1    0      6-8-8-2              81.5%
0.1    0.8    6-8-8-2              82.2%
0.8    0      6-8-8-2              82.8%
0.8    0.8    6-8-8-2              81.4%
0.1    0      6-8-8-8-2            84.7%
0.1    0.8    6-8-8-8-2            85.2%
0.8    0      6-8-8-8-2            85.7%
0.8    0.8    6-8-8-8-2            84.6%
0.1    0      6-12-12-12-2         90.2%
0.1    0.8    6-12-12-12-2         89.3%
0.8    0      6-12-12-12-2         88.6%
0.8    0.8    6-12-12-12-2         86.9%
0.1    0      6-50-50-50-2         87.6%
0.1    0.8    6-50-50-50-2         88.3%
0.8    0      6-50-50-50-2         88.5%
0.8    0.8    6-50-50-50-2         87.8%
Table 4: This table shows the classification rates of different MLP (level 4) configurations, with the α and µ values shown in columns 1 and 2.
The highest classification rate is 90.2%. The best configuration is α = 0.1, µ = 0, MLP
Configuration 6-12-12-12-2.
The previous experiments have tested the 4 MLP structures that will be used in this
project.
After the best configuration for each MLP was determined, in order to improve the classification performance I collected the cases that are most easily misclassified by the MLP and set up a separate misclassification table to store them. The MLP structures are implemented in the Blackjack game so that the program calls an MLP whenever it is time to make a decision. Before the MLP performs its classification task, the program checks whether the specific dealer and player card pattern exists in the misclassification table. If it does, the program uses the action from the misclassification table instead of using the MLP to make the decision. There are 610 entries in the misclassification table. For the MLP configuration 4-10-10-10-2, this improves the classification rate to 92.9%.
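A minimal sketch of this decision path follows; the column layout of the misclassification table and the function name are assumptions made for illustration only:

    % Sketch: consult the misclassification table first; only when the
    % current card pattern is not listed there does the caller fall back
    % to the trained MLP.
    % pattern:  1x10 sorted, zero-padded dealer+player card vector
    % misTable: rows assumed to be [pattern(1:10), hit, stand]
    function [action, fromTable] = checkMisclassTable(pattern, misTable)
        idx = find(ismember(misTable(:, 1:10), pattern, 'rows'), 1);
        if ~isempty(idx)
            action    = misTable(idx, 11:12);   % stored [hit stand] pair
            fromTable = true;
        else
            action    = [];                     % caller falls back to the MLP
            fromTable = false;
        end
    end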
Here are summaries of the 4 different MLP configurations:
- Normalization in feature vectors, scaled to the range of -5 to 5.
- Max. Training Epochs: 1000, epoch size = 64
- Activation function (hidden layer) = hyperbolic tangent
- Activation function (output layer) = sigmoidal
        α      µ      Configuration     Classification Rate
MLP1    0.1    0      4-10-10-10-2      89.5%
MLP2    0.1    0.8    5-10-10-10-2      91.1%
MLP3    0.8    0      5-10-10-10-2      92.5%
MLP4    0.1    0      6-12-12-12-2      90.2%
Matlab Simulation using MLP
The Matlab program BlackjackANN.m is implemented using the MLPs defined in the previous section. There were 1000 games simulated in this experiment. Both the dealer and the player use MLPs to make decisions in their turns. The winning percentages of the dealer and the player are as follows:
When applying the MLP to the player to play against a dealer with the 17-point rule, the player has a winning percentage of 56.5%. Out of 1000 games, the player won 565 times.
Strategy            Win %    Tie %
Player with MLP     56.5%    9%
When applying the MLP to the dealer (where the dealer does not follow the 17-point rule) to play against a player making random moves, the dealer achieves an outstanding winning percentage of 68.2%.
Strategy            Win %    Tie %
Dealer with MLP     68.2%    3%
When applying MLPs to both the dealer and the player to play against each other, the percentage of tie games increases, as they are more likely to be using the same strategies based on the dealer's and the player's input lookup tables. The result also depends on the number of entries in the misclassification tables. In this experiment, the player performed a little better than the dealer, probably because the player's misclassification table identifies more misclassified patterns than the dealer's table does.
Strategy            Win %    Tie %
Player with MLP     54%      3%
Dealer with MLP     43%      3%
Finally, I had the opportunity to play against the system that I developed. I played against the dealer, with the dealer using the original dealer lookup table. I was able to win 12 games out of 20, with 2 tie games. Therefore, I decided to let the dealer use a better table (the player's original table) as the input to its MLP networks and played against the dealer again. This time, the dealer improved its performance by 1 game (and 1 tie game this time).
Strategy            Win %    Tie %
Human Player        60%      10%
Dealer with MLP     30%      10%
Using the MLP networks for the dealer and the player greatly increased the number of games won by both the dealer and the player. Of course, the human still plays better than the MLP in the end, as the training set for the MLP is very small compared to all the possible states a Blackjack game can have. If the number of entries in both the dealer's and the player's tables increases (so that they cover most of the states in the game), I believe the MLP can certainly play better than a human being. Building such a large network would require additional numerical analysis techniques (for example, making the lookup table smaller while containing more information), so that the MLP can produce a reasonable response in each turn.
Human beings are well adapted to using their own experience in playing games: the more games you play, the more familiar you become with the game through practice. This suggests an interesting learning experiment, letting both the program and the human learn a game together at the same time and observing who actually learns faster. Therefore, I decided to implement a variant of Blackjack as an experiment: instead of having 21 points as the default goal, the user can define the target points of the game. The next section shows the details of this experiment.
Applying ANN to play user-defined Blackjack
In the normal Blackjack game, the goal is to draw cards from a deck of 52 cards to a total value as close to 21 as possible. To make applying the ANN to Blackjack more interesting, I created a version of Blackjack where the user can define his own target points, for example 30 points, so that the program and the human can learn how to play a new game at the same time. To make the implementation easier, I still let both the dealer and the player have a maximum of 5 cards in hand. If both the dealer and the player reach 5 cards without busting, the total points determine who the winner is. Of course, a new set of input lookup tables is needed to train the dealer to play this new game. Since I am playing as the only player, I did not train the player input lookup tables. When the target points change, all the existing game rules and strategies change as well. Without knowing the new rules and strategies of the game in advance, does the MLP network learn better and faster than the human player?
The Matlab program BlackjackANNx.m is implemented for the user-defined Blackjack game. There were 20 games set up between the human and the dealer. To avoid cheating, I did not calculate the probabilities of my plays in advance. After the 20 games, the experiment results are as follows:
30-point target game    Win %    Tie %
Human Player            40%      5%
Dealer with MLP         55%      5%
As this was my first time playing 30-point Blackjack, I won only 8 games out of 20. The winning percentages of the MLP and the human are close. Though the MLP has a slightly larger input table, this only slightly increases the fraction of possible states it covers in this new game. The table above shows the result of the first time I played the new 30-point game. As I played more, I started to beat the dealer more often. As the number of games played increases, the learning rate of the human is definitely a lot higher than that of the program.
On the other hand, the percentage of tie games decreases as the target total increases. There was only 1 tie game out of 20 because, as the number of states increases, the probability of both players reaching the same state becomes smaller. As the target points increase, the number of states also increases dramatically. Moreover, it is impossible for the program to capture all possible states, as the number of lookup table entries would grow dramatically, and the winning probability of each state would only converge after millions of games and millions of entries in the lookup tables; this can be shown in an experiment similar to the ones before. With this tremendous number of states in the game, much depends on how lucky the simulation is during the first 10000 games played to collect the input data in the lookup table, and on how lucky it is in capturing most of the common states. After simulating more games, with the lookup table as the input of the dealer's MLP, the MLP reached at best a 65% classification rate in the new game.
I then ran the simulation one more time, this time letting the simulated player use the previously trained dealer's lookup table as the input to the player's MLP network, while the dealer uses a 25-point rule in this new game. There were 1000 games simulated, and the experiment results are as follows:
30-point target game         Win %    Tie %
Player with MLP              50.2%    2.0%
Dealer with 25-point rule    47.6%    2.0%
Even though the player is using a small lookup table as input to its MLP, it can actually beat the dealer, who is using a fixed game rule, by 2.6%. The MLP network works best for highly random and dynamic games, where the game rules and strategies are hard to define and the game outcomes are hard to predict exactly. Of course, it also depends on how lucky the system is when it first collects its input data.
Conclusion
This project explored the use of an Artificial Neural Network in Blackjack for learning strategies without explicitly teaching the rules of the game, with a reinforcement learning algorithm used to learn the strategies. From the above experiments, it can be concluded that the reinforcement learning algorithm does learn better strategies than the fixed-rule strategies in playing Blackjack, as it improves the winning probability of each game dramatically. The reinforcement learning algorithm is able to suggest the best actions when the situation changes over time; it finds a mapping from different states to probability distributions over the various actions. Without knowing any rules of Blackjack, the ANN is able to perform well in playing the game, given that the training data set is large and accurate enough to cover most of the game states. The misclassification table is also useful in improving the classification rate by reducing the number of misclassifications. However, it requires some manual work to pick entries from the lookup table, recognize the misclassified patterns, and put them into the misclassification table.
The highest classification rate obtained from the MLP is 81%. Of course, the input training table can also be repeatedly trained with the correctly classified test data to improve the classification rate. The performance of playing with the ANN averages 51% winning, with a minimum of 39% and a maximum of 65%, when simulated over 1000 games. However, as the number of games increases, the number of unencountered states increases, including states that are not in the lookup tables; the average winning percentage then drops to 42%. Continued learning is important to produce good results: for example, the input training data for the MLP should be updated after 10000 games played, or the test data from each game can be fed back into the training input table to increase the training set. As the number of games increases, the game strategies will change over time. The larger the input training data, the better the performance will be.
Since the cards are dealt from a 52-card deck, the current hand actually depends on the previous hands. Taking advantage of this memory in playing Blackjack makes the search space much smaller and hence increases the winning probability of each state; including a card counting algorithm in the program is therefore a desirable feature for future work. Other ideas include: adding additional players to the game; training the ANN with a teacher to eliminate duplicate patterns (for example, 4 + 7 = 7 + 4 = 5 + 6 = ...) and to identify misclassified patterns; training the ANN to play against different experts so that it can pick up various game strategies; storing game tricks and strategies in a separate table for the ANN to look up when different states are encountered; exploring other learning methods such as Q-learning; growing the training input data over time to build a better ANN; and extending the ANN to play other similar games such as Poker. The MLP network works best for highly random and dynamic games, where the game rules and the strategies are hard to define and the game outcomes are hard to predict exactly.
Appendix 1
Matlab Files
Blackjack.m: This program allows the user to play blackjack with the dealer.
BlackjackANN.m: Based on the MLP and input data, the ANN can make decisions to play against the dealer.
Genlookuptable.m: This program will simulate 10000 games between the dealer and the player, where the player makes random moves. The winning probability of each state will be updated after each game. The move pattern and winning probability value will be stored in the dealer and player lookup tables, which will be used as the input to the MLP.
BlackjackN.m: This program is called by genlookuptable.m to play against the dealer based on random moves and to update the lookup tables.
Bp2.m, bpconfig2.m, Bpdisplay.m: These programs take the lookup tables as training data. Using the back propagation algorithm, they output the weight matrix for the MLP and the classification rate of each setting.
Blackjackx.m: This program implements regular Blackjack with user-defined target points, for example 30 points.
BlackjackANNx.m: The general form of Blackjack, which takes a user-defined target value (by default, the target value is 21). The learning algorithm is also applied in this program.
Cardhit.m: Updates the screen when the player or dealer hits a card.
Cardplot.m: Plots the dealt cards on the user screen.
Checkmoney.m: Checks the user balance; if it is less than zero, the program quits.
Dealerwin.m: Performs updates to the lookup tables if the dealer won.
findMLP.m: Separates the input lookup table into different sub-tables for the 4 MLPs.
Scale.m, Rsample.m, Randomize.m, Cvgtest.m, Actfun.m, Actfunp.m: Utility Matlab functions from the class website.
Shuffle.m: Shuffles the cards before they are dealt to the player and dealer.
Youwin.m: Performs updates to the lookup tables if the player won.
Appendix 2
Online Presentation and Matlab Files
The project report, online presentation and Matlab files are available for download at:
http://www.cae.wisc.edu/~sam/ece539