Learning BlackJack with Artificial Neural Network

Learning BlackJack with ANN
(Artificial Neural Network)
ECE 539
Final Project
Ip Kei Sam
sam@cae.wisc.edu
ID: 9012828100
Abstract
Blackjack is a card game in which the player attempts to beat the dealer by holding a hand with a total value higher than the dealer's but not exceeding 21. The probabilistic nature of Blackjack makes it an illustrative application for learning algorithms: the learning system must explore different strategies in order to find those with a higher probability of winning. This project explores the use of an Artificial Neural Network (ANN) in Blackjack for learning strategies, with a reinforcement learning algorithm used to learn them. Reinforcement learning is a process of mapping situations to actions when the learner is not told which actions to take, but must discover which actions yield the highest reward by trial and error. The trained ANN is used to play Blackjack without being explicitly taught the rules of the game. Furthermore, the efficiency of the ANN and the results of the learning algorithm are investigated and interpreted as different strategies for playing Blackjack and other games of chance.
Background
Initially, two cards are dealt to each player; the object of Blackjack is to draw cards from a deck of 52 cards toward a total value of 21. The player can choose from the following actions:
- Stand: stay with the current hand and take no more cards.
- Hit: add a card to the hand to bring the total card value closer to 21.
- Double Down: when holding 2 cards, the player can double his bet by hitting exactly one more card and standing after that.
- Split: holding a pair of cards with the same value, the player can split his hand into two hands. The player may split up to 3 times in a game.
For simplicity, Double Down and Split are not considered in this project. The value of a hand is the sum of the values of its cards, where each card from 2 to 10 is valued at its face value, J, Q, and K are valued at 10, and Aces can be either 1 or 11. Each player plays against the dealer, and the goal is to obtain a hand with a greater value than the dealer's hand but less than or equal to 21.
A player may hit as many times as he wishes as long as his hand does not go over 21. The player can also win by holding 5 cards in hand with a total of less than 21. The dealer hits while his hand is less than 17 and stands once it reaches 17 or more. When the player is dealt 21 points in the first 2 cards, he automatically wins his bet if the dealer is not also dealt 21. If the dealer has blackjack (21 points), the game is over and the dealer wins all bets, or ties with any player who also has blackjack (21 points). Figures 1 and 2 show a demo of the Matlab Blackjack (blackjack.m) program. The player presses the hit or stand button each turn, and the dealer's and player's total points are calculated and shown at the bottom. The display also shows the total balance remaining in the player's account and the amount the player has bet for this game. In this example, the player bet $0. The program exits when the player's balance is less than zero.
Figure 1: The initial state of the game; the first 2 cards are dealt to the player and the dealer.
To measure the efficiency of the rules in Blackjack, I simulated the program playing against the dealer for 1000 games, where the dealer follows the 17-point rule. The efficiency can be observed from the percentage of games won and the percentage of draw games. The comparison of the player's random moves versus the dealer's 17-point rule is shown in Figure 2a.
Figure 2: As the player chooses to stand, the dealer chooses to hit but gets busted. The player wins.
Strategy                    Win %    Tie %
Player's random moves       31%      8%
Dealer's 17 points rule     61%      8%
Figure 2a: Efficiency of different strategies in Blackjack. Each strategy is played for 1000 games.
One of the goals of this project is to develop a better strategy with an ANN that beats the dealer's 17-point rule, that is, a new strategy with a higher winning percentage. Different configurations of the MLP and different preprocessing of the input training data sets are also investigated later in this paper. Finally, some of the Blackjack strategies interpreted from the experiment results are explored.
Applying Reinforcement Learning to Blackjack
Reinforcement learning is a process of mapping situations to actions such that the reward value is maximized. The learning algorithm decides which actions to take by finding the actions that yield the highest reward through trial and error. The actions taken affect not only the immediate rewards but the subsequent rewards as well. In Blackjack, given a set of dealer's and player's cards, if the probability of winning from each outcome is known, the player can always make the best decision (hit or stand) by taking the action that leads to the state with the highest winning probability. For each final state, the probability of winning is either 1 (if the player wins or draws) or 0 (if the player loses). In this project, the initial winning probability of each intermediate state is set to 0.5 and the learning parameter α is also initialized to 0.5. The winning probability of each state is updated for the dealer and the player after each game. The winning probability of the previous state is moved closer to that of the current state based on this equation: P(s) = P(s) + α*[P(s') - P(s)], where α is the learning parameter, s is the current state and s' is the next state. For example, Figure 3 shows the first 3 rows taken from the result table output when the Matlab program genlookuptable.m plays against the dealer based on random decisions.
Dealer's cards (columns 1-5)    Player's cards (columns 6-10)    P(win)    Action
2   5   0   0   0               6   6   0   0   0                0.3750    1  0
2   5   0   0   0               4   6   6   0   0                0.2500    1  0
2   5  10   0   0               4   6   6   7   0                0         1  1
Figure 3: The result table (lpxx) from the Matlab program output after one game.
The first 5 columns represent the dealer's cards and the next 5 columns represent the player's cards. The dealer and the player can each have a maximum of 5 cards by the game rule. The card values in each hand are sorted in ascending order before they are inserted into the table. Column 11 is the winning probability of each state. Columns 12 and 13 represent the action taken by the player, where [1 0] denotes a "hit", [0 1] denotes a "stand", and [1 1] denotes an end state where no action is required. In the first row, the dealer has 2 and 5 (with a total score of 7) and the player has 6 and 6 (with a total score of 12). Based on a random decision, the player decides to hit. In the next round, the player gets a 4 for a total of 16, while the dealer gets a 10 for a total of 17. The player decides to hit again. This time the player is busted with a total score of 23 in the final state (row 3). Since the player lost, this state is rewarded with P(3) = 0. The learning parameter α = 0.5. The winning probability of taking a "hit" in the previous state is:
P(2) = P(2) + α*[P(3) - P(2)] = 0.5 + 0.5 * (0 - 0.5) = 0.2500
Similarly, P(1) = P(1) + α*[P(2) - P(1)] = 0.5 + 0.5 * (0.25 - 0.5) = 0.3750
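As a minimal sketch of this backward update (the function name and calling convention here are illustrative, not taken from the project's Matlab files), the probabilities of the states visited in one game can be updated as follows:

    % Sketch of the winning-probability update along one game trajectory.
    % P is the vector of winning probabilities for the states visited in
    % this game, ordered from the first state to the final state, with the
    % last entry already set to 1 (win/draw) or 0 (loss).  alpha is the
    % learning parameter.  Names are illustrative, not from the project files.
    function P = updateTrajectory(P, alpha)
        for s = length(P) - 1 : -1 : 1
            % move P(s) toward the (already updated) value of the next state
            P(s) = P(s) + alpha * (P(s+1) - P(s));
        end
    end

For the example above, updateTrajectory([0.5 0.5 0], 0.5) returns [0.375 0.25 0], matching the hand computation.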
After playing a sequence of games, the result table contains the winning probability of all intermediate states, the dealer's and player's cards, and the actions taken by the player.
When playing a new game, if the same card pattern is encountered, for example in row 2,
2   5   0   0   0               4   6   6   0   0                0.2500    1  0
The player will choose to "stand" this time, since taking a "hit" from this state has a low winning probability. After the new game, the winning probability is again updated based on the game result. If the player loses again this time by hitting an additional card, the winning probability becomes: P(2) = P(2) + α*[P(3) - P(2)] = 0.25 + 0.5 * (0 - 0.25) = 0.125
Otherwise, if the player wins this time, the probability is:
P(2) = P(2) + α*[P(3) - P(2)] = 0.25 + 0.5 * (1 - 0.25) = 0.625
The winning probability of each state converges after a large number of games; the more games are played, the more accurate the winning probabilities become.
Originally, the dealer follows the 17-point rule of Blackjack. To make the dealer more intelligent, reinforcement learning can be applied to the dealer as well. Similarly, a result table can be obtained for the dealer after each game. Let's take a look at the following dealer result table:
Dealer's cards (columns 1-5)    Player's cards (columns 6-10)    P(win)    Action
2   5   0   0   0               6   6   0   0   0                0.7500    1  0
2   5  10   0   0               4   6   6   0   0                1.0000    1  1
In this game, the dealer was initially dealt 2 and 5 with a total of 7, while the player was dealt 6 and 6 with a total of 12. The dealer chose to hit, resulting in a total of 17 points in the final state. The player also decided to hit, resulting in a total of 16 points. For the dealer, P(2) = 1, so P(1) = P(1) + α*[P(2) - P(1)] = 0.5 + 0.5 * (1 - 0.5) = 0.7500
When the same card pattern (in row 1) is encountered again in a new game, the dealer will choose to hit again, since a "hit" leads to a state with a higher probability of winning. Again, the probability value is updated based on the result after each new game.
The learning parameter α is initially set to 0.5 and it decreases after each game based on this equation: α = α * (# of games remaining / total # of games). Therefore, once all test games are finished, α becomes 0. If the player or the dealer always takes the action leading to the state with the highest winning probability, some intermediate states, still at the initial winning probability of 0.5, will always compare unfavorably to other states and will never be encountered. Therefore, random moves are necessary in between in order to explore these states. The winning probabilities of states reached by random moves are not updated at the end of the game. The total number of games simulated in this program is 1000. In each game, 20% of the moves are randomly generated. Since α decreases after each game, the winning probability value of each state converges as the number of games played increases.
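A minimal sketch of this schedule is shown below; the game logic itself is omitted, and the variable names are illustrative rather than taken from the project files.

    % Sketch of the training schedule: the learning parameter decays to zero
    % over the simulated games (using the decay equation quoted above), and
    % roughly 20% of the moves in each game are chosen at random so that
    % rarely visited states are still explored.
    totalGames  = 1000;
    exploreRate = 0.2;      % fraction of random (exploratory) moves per game
    alpha       = 0.5;      % initial learning parameter

    for g = 1:totalGames
        % alpha = alpha * (# of games remaining) / (total # of games)
        alpha = alpha * (totalGames - g) / totalGames;

        % inside a game, each move would explore with probability exploreRate;
        % states reached by such random moves are not updated afterwards
        isRandomMove = rand() < exploreRate;
    end
    % after the final game, alpha is exactly 0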
Let’s look at the previous example one more time:
Dealer's cards (columns 1-5)    Player's cards (columns 6-10)    P(win)    Action
2   5   0   0   0               6   6   0   0   0                0.7500    1  0
2   5  10   0   0               4   6   6   0   0                1.0000    1  1
The P(1) value starts at the default value of 0.5; the first time the above pattern is encountered, if the hand is won, P(1) is updated to 0.7259. After running 5000 games in the simulation, this pattern is encountered 9 times, and the P(1) value converges to 0.76, as shown in Figure 4 below. This suggests that if the player has two 6's, taking a "hit" gives a better chance of winning.
Figure 4: The convergence of the winning probability over 5000 games.
Let's look at another example. This time the player holds 8 and 10:
Dealer's cards (columns 1-5)    Player's cards (columns 6-10)    P(win)    Action
7   8   0   0   0               8  10   0   0   0                0.2500    1  0
3   7   8   0   0               5   8  10   0   0                0         1  1
The P(1) value starts at the default value of 0.5; the first time the above pattern is encountered, P(1) is updated to 0.25 when the hand is lost. After running 5000 games in the simulation, this pattern is encountered 11 times, and the P(1) value converges to 0.35, as shown in Figure 5. This suggests that if the player holds 8 and 10, taking a "hit" is very likely to lose. In this case, the player should only "hit" when the dealer's hand is higher than the player's hand.
Figure 5: The convergence of the winning probability over 5000 games.
Both experiments show the convergence of the winning probability values of each state as the learning parameter decreases after each game.
P(s) = P(s) + α*[P(s') - P(s)]                (Equation 1)
The update method based on Equation (1) performs quite well in this experiment. As α is reduced over time, the method converges to the true winning probability of each state given optimal actions by the player. Except for the random moves, each action taken (hit or stand) is in fact the optimal move against the dealer, since the method converges to an optimal policy for playing Blackjack. If α does not decrease to 0 at the end of training, the player also continues to play well against a dealer that changes its way of playing slowly over time. Applying reinforcement learning to the dealer will certainly beat the dealer's original 17-point rule.
Matlab Implementation
As the Matlab program (blackjackN.m) plays against the dealer, each time a new card pattern is encountered it is given a default probability value of 0.5. Its winning probability is updated after each game based on the final result and is added to the result table at the end. States that have not been encountered by the program are not available in the result table; when the program encounters a card pattern that is not found in the result table, it makes a random move. There are approximately 10^6 different states for the dealer's and player's card patterns. Many of these possible states do not need to be considered, as they go well beyond 21 points. For example, the state of having five 10's in hand is not possible, as the hand is already busted at three 10's. Moreover, it is not feasible to simulate all of these states because of the slow iterative operations in Matlab.
Matlab Simulation
All cards in one hand are sorted before they are inserted into the table. The dealer and the player each have a table of their own. Each table, either the dealer's or the player's, is used as the input data for training the ANN. Given a card pattern, the corresponding winning probability is returned.
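As an illustration of how such a table could be queried (the zero-padding convention and the function names are assumptions made for this sketch, not the exact format used by the project files):

    % Sketch of matching a (dealer hand, player hand) pair against the lookup
    % table described above: cards are sorted, zero-padded to five columns,
    % and the matching row's winning probability and action are returned.
    function [p, action, found] = lookupState(lookupTab, dealerHand, playerHand)
        key = [padHand(dealerHand), padHand(playerHand)];      % 1x10 pattern
        idx = find(ismember(lookupTab(:, 1:10), key, 'rows'), 1);
        if isempty(idx)
            p = 0.5;  action = [];  found = false;   % unseen state: default 0.5
        else
            p = lookupTab(idx, 11);  action = lookupTab(idx, 12:13);  found = true;
        end
    end

    function h = padHand(hand)
        h = zeros(1, 5);
        h(1:numel(hand)) = sort(hand);   % ascending order, zero-padded
    end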
The Matlab program (genlookuptable.m) plays against the dealer, starting with an initially empty dealer table and player table, to explore the different combinations of the dealer's and the player's cards. There were 15000 games simulated, which generated 28273 entries for the dealer's table and 32372 entries for the player's table. In the end, the program had encountered less than 10% of all possible states. However, the states that were encountered are the most common ones; in fact, less than 10% of all the different states commonly occur in playing Blackjack. For example, states with 4 or 5 cards in hand occur much less frequently, and having 4 or 5 tens in hand is impossible in Blackjack.
Both the dealer and the player learn how to play better as the number of games played increases. To test the accuracy of both the dealer table and the player table, I set up another sequence of games in which the dealer plays against the player using the generated lookup tables.
The following tables show the summary of the experiments:
When applying reinforcement learning to the player to play against a dealer using the 17-point rule, the player shows a significant improvement in winning percentage. Out of 1000 games, the player won 512 times.
Strategy                   Win %    Tie %
Player (with learning)     51.2%    7%
When applying Reinforcement learning to the dealer (where the dealer does not follow
the 17 points rule) to play against a player with random moves, the dealer has improved
its winning percentage to 67%.
Strategy                   Win %    Tie %
Dealer (with learning)     67%      5%
When applying reinforcement learning to both the dealer and the player to play against each other, the percentage of tie games increases, because they are more likely to be using the same set of strategies based on reinforcement learning.
Strategy                    Win %    Tie %
Player's random moves       43%      12%
Dealer's 17 points rule     45%      12%
Finally, I played against the dealer, with the dealer using its lookup table. I lost 11 games out of 20 and had 1 tie game.
Strategy         Win %    Tie %
Human Player     45%      5%
Using the dealer and player tables greatly increased the number of games won by both the dealer and the player. Therefore, the tables generated for the dealer and the player are good enough to use in the subsequent experiments.
Just by looking at the values in the lookup tables, the game strategy can be interpreted. For the player, if the total is higher than 14, in most cases the table suggests not to "hit". Similarly, for the dealer it suggests not to "hit" above 15. Without knowing the rules of Blackjack, reinforcement learning is able to determine, through the updates at the end of each game, a threshold of 14 for the player and 15 for the dealer, even though this is a little more conservative than the dealer's 17-point rule. Watching the player play against the dealer when both use lookup tables, the dealer has a slightly higher winning percentage. This suggests that the dealer table, which suggests not to "hit" at 15, plays a little better than the player table, which suggests not to "hit" at 14.
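One simple way to read such a threshold out of a table in the format described earlier is sketched below; the function name and the grouping by hand total are my own illustration under that assumed column layout, not code from the project files.

    % Sketch: for each player hand total, compare the average winning
    % probability of rows where the chosen action was "hit" against rows
    % where it was "stand", using the column layout described earlier
    % (columns 6-10: player cards, 11: P(win), 12-13: action).
    function summarizeThreshold(playerTable)
        totals  = sum(playerTable(:, 6:10), 2);     % player hand totals
        isHit   = playerTable(:, 12) == 1 & playerTable(:, 13) == 0;
        isStand = playerTable(:, 12) == 0 & playerTable(:, 13) == 1;
        for t = 12:20
            pHit   = mean(playerTable(totals == t & isHit,   11));
            pStand = mean(playerTable(totals == t & isStand, 11));
            fprintf('total %2d:  P(win|hit) = %.2f   P(win|stand) = %.2f\n', ...
                    t, pHit, pStand);
        end
    end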
Applying ANN to Blackjack
The next step is to prepare the input data for the ANN by using the result lookup tables. Once the dealer and player lookup tables are ready, they can be converted into training data for the ANN. The dealer's cards and the player's cards from the lookup tables are the inputs to the ANN. The output contains two neurons, each corresponding to one action (either hit or stand); the action taken depends on which output neuron has the higher value. This is a classification problem. To simplify the problem a little, I set up several MLP structures for different stages of the game. The different MLPs needed for the dealer and the player, as well as the overall game flow chart, are shown in Figure 6 below. Note that the rules of Blackjack (and the 17-point dealer rule) are not implemented in the project; all decisions are based on MLP outputs.
Figure 6: The overall flow chart of the different MLPs used in this project.
The first level MLP contains 4 inputs (2 cards from the dealer and 2 cards from the player
in the initial state) and 2 outputs (either hit or stand). Then the second level MLP contains
5 inputs (2 from the dealer and 3 from player) and 2 outputs (hit/stand). The third level
MLP contains 5 inputs (3 from the dealer and 2 from player) and 2 outputs (hit/stand).
The fourth level MLP contains 6 inputs (3 from the dealer and 3 from player) and 2
outputs (hit/stand). Other MLP levels (4 or 5 cards for the player or dealer) are not considered, as the probability of reaching a stage with 4 or 5 cards in hand is small and the structures become more complicated, which leads to more difficulty in classification. When either the player or the dealer has 4 or 5 cards in one hand, the action is determined either by a random move or by checking whether the dealer's total is less than the player's total, and vice versa. Accordingly, the input lookup table is separated into 5 sub-tables (findMLP.m implements this function): the first sub-table picks up all rows that contain 2 dealer cards and 2 player cards (the initial state only), the second sub-table contains the states with 2 dealer cards and 3 player cards, the third sub-table has the states with 3 dealer cards and 2 player cards, and the fourth sub-table contains the states with 3 dealer and 3 player cards. The remaining rows are put into the fifth sub-table, which holds the states with 4 or 5 dealer or player cards. Each sub-table is saved as the input to one of the 4 different MLP levels defined above. During the simulation, depending on how many cards the dealer and the player have, a different MLP structure is called to decide which action (hit/stand) to take, as sketched below. The "End" state represents the end of the game, where either the player won, the dealer won, or the game is a draw.
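A minimal sketch of this routing step, with illustrative function and variable names:

    % Sketch of choosing which trained MLP to call, based on how many cards
    % the dealer and the player currently hold.  Level 0 means "no MLP":
    % fall back to a random move or a comparison of total points.
    function level = selectMLPLevel(numDealerCards, numPlayerCards)
        if     numDealerCards == 2 && numPlayerCards == 2
            level = 1;     % MLP level 1: 4 inputs
        elseif numDealerCards == 2 && numPlayerCards == 3
            level = 2;     % MLP level 2: 5 inputs
        elseif numDealerCards == 3 && numPlayerCards == 2
            level = 3;     % MLP level 3: 5 inputs
        elseif numDealerCards == 3 && numPlayerCards == 3
            level = 4;     % MLP level 4: 6 inputs
        else
            level = 0;     % 4 or 5 cards in a hand: handled outside the MLPs
        end
    end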
The next step is to analyze and preprocess the input data for each feature vector before training the different MLPs. Each column of the lookup table contains a card value, so the range of each value is from 1 to 10. First, let's look at the MLP level 1 experiment.
MLP level 1: 4 inputs (2 from the dealer and 2 from the player) and 2 outputs.
The first and second feature vectors are the dealer's and the third and fourth are the player's. Since the cards are sorted before they are inserted into the table, the mean of the first (or third) feature vector is always less than the mean of the second (or fourth) feature vector. The values are shown in the following table:
Feature vector #      1 (dealer)   2 (dealer)   3 (player)   4 (player)
Mean                  4.5435       7.2961       3.8168       6.7322
Standard Deviation    1.2907       2.3353       1.7196       1.8596
Variance              1.194        1.282        1.674        1.374
Minimum               1            1            1            1
Maximum               10           10           10           10
Inspecting the feature vectors of the input data, the table shows that the mean of each feature differs from the others: the means of features 1 and 3 are lower than those of features 2 and 4. The minimum and maximum of the 4 feature vectors are the same. The preprocessing starts with feature vector normalization and then scales all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 2 and 4 are more important with respect to the class (since their standard deviation values are the highest). The system could pay more attention to these feature vectors by putting more weight on them. This data set is nonlinear, and the classification rate is not satisfactory without preprocessing the data set. With preprocessing of the input data, it is possible to identify exactly which feature vectors play more important roles in the classification phase, so that these feature vectors receive more weight.
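The report does not spell out the exact normalization performed by scale.m; the following is one plausible sketch, assuming a simple min-max normalization of each feature vector followed by linear scaling into the -5 to 5 range (the function name is hypothetical):

    % Sketch: map every feature column of X into the range [lo, hi],
    % e.g. Xs = scaleFeatures(trainData(:, 1:4), -5, 5) for the level 1 MLP.
    % (Uses implicit expansion; requires MATLAB R2016b or later.)
    function Xs = scaleFeatures(X, lo, hi)
        % X: one row per training sample, one column per feature (card value)
        mn = min(X, [], 1);
        mx = max(X, [], 1);
        Xs = (X - mn) ./ max(mx - mn, eps);   % each column mapped to [0, 1]
        Xs = lo + (hi - lo) * Xs;             % then linearly to [lo, hi]
    end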
After preprocessing, the next step is to train the MLP with this set of training data. The Matlab program bp2.m implements the back-propagation algorithm that trains the MLP. Several MLP configurations were tested. Based on the training data, the program produces the weight matrix. The hyperbolic tangent activation function is used in the hidden layers and the sigmoidal function is used in the output neurons. Different values of α and µ, epoch sizes, numbers of hidden layers and numbers of neurons in the hidden layers were also tested.
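For reference, a minimal sketch of how a trained network of this form could be evaluated at play time is shown below; the weight-matrix format (a cell array with a bias column appended to each layer's weights) is an assumption made for this sketch, and bp2.m may store its weights differently.

    % Sketch of evaluating a trained MLP with tanh hidden layers and a
    % sigmoidal output layer, and turning the two outputs into an action.
    % x: column vector of preprocessed inputs (2-3 dealer + 2-3 player cards)
    % W: cell array of weight matrices, each with a bias column appended
    function action = mlpDecide(x, W)
        a = x(:);
        for k = 1:numel(W) - 1
            a = tanh(W{k} * [a; 1]);                % hidden layers: tanh
        end
        y = 1 ./ (1 + exp(-(W{end} * [a; 1])));     % output layer: sigmoid
        if y(1) >= y(2)
            action = 'hit';                         % first output neuron: hit
        else
            action = 'stand';                       % second output neuron: stand
        end
    end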
MLP (level 1) Configurations used in the experiment:
With normalization in feature vectors
The input feature vectors are scaled to range of -5 to 5,
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding input layer, total # of layers = 2,3,4
Max. Training Epochs: 1000
Number of neurons in hidden layer = 5, 10, 50
MLP Configuration: 4-5-2, 4-5-5-2, 4-5-5-5-2, 4-10-10-10-2, 4-50-50-50-2
The activation function for the hidden layers: hyperbolic tangent
The activation function for the output layer: sigmoidal
Partition training set dynamically into training and tuning sets;
Where 20% of training data reserved for tuning
# of epochs between convergence check = 10
epoch size = 64
α      µ      MLP Configuration    Classification Rate
0.1    0      4-5-2                80.2%
0.1    0.8    4-5-2                89.5%
0.8    0      4-5-2                79.1%
0.8    0.8    4-5-2                78.9%
0.1    0      4-5-5-2              85.8%
0.1    0.8    4-5-5-2              83.2%
0.8    0      4-5-5-2              84.1%
0.8    0.8    4-5-5-2              84.0%
0.1    0      4-5-5-5-2            87.2%
0.1    0.8    4-5-5-5-2            86.5%
0.8    0      4-5-5-5-2            85.3%
0.8    0.8    4-5-5-5-2            82.8%
0.1    0      4-10-10-10-2         89.5%
0.1    0.8    4-10-10-10-2         87.0%
0.8    0      4-10-10-10-2         86.8%
0.8    0.8    4-10-10-10-2         84.8%
0.1    0      4-50-50-50-2         88.0%
0.1    0.8    4-50-50-50-2         86.8%
0.8    0      4-50-50-50-2         86.2%
0.8    0.8    4-50-50-50-2         85.5%
Table 1: This table shows the classification rates of different MLP (level 1) configurations, with the α and µ values shown in columns 1 and 2.
The highest classification rate is 89.5%. If the MLP is too large, it may get trapped at a certain classification rate and stop improving during the training process. The best configuration is α = 0.1, µ = 0, with MLP configuration 4-10-10-10-2.
MLP level 2: 5 inputs (2 from the dealer and 3 from player) and 2 outputs.
This experiment setup is similar to the setup for MLP level 1. First the input feature vectors are preprocessed and analyzed. The characteristics of the 5 feature vectors are shown in the following table:
Feature vector #      1 (dealer)   2 (dealer)   3 (player)   4 (player)   5 (player)
Mean                  5.2532       7.7195       4.8351       5.2465       7.1328
Standard Deviation    1.1824       1.3353       1.2261       1.5152       1.9893
Variance              1.214        1.552        1.365        1.477        1.791
Minimum               1            1            1            1            1
Maximum               10           10           10           10           10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The preprocessing starts with feature vector normalization and then scales all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 2 and 5 are more important with respect to the class (since their standard deviation values are the highest). The system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 2) Configurations used in the experiment:
With normalization in feature vectors
The input feature vectors are scaled to range of -5 to 5,
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding input layer, total # of layers = 2,3,4
Max. Training Epochs: 1000
Number of neurons in hidden layer = 5, 10, 50
MLP Configuration: 5-5-2, 5-5-5-2, 5-5-5-5-2, 5-10-10-10-2, 5-50-50-50-2
The activation function for the hidden layers: hyperbolic tangent
The activation function for the output layers: sigmoidal
Partition training set dynamically into training and tuning sets;
Where 20% of training data reserved for tuning
# of epochs between convergence check = 10
epoch size = 64
α      µ      MLP Configuration    Classification Rate
0.1    0      5-5-2                78.3%
0.1    0.8    5-5-2                79.9%
0.8    0      5-5-2                79.5%
0.8    0.8    5-5-2                78.8%
0.1    0      5-5-5-2              84.2%
0.1    0.8    5-5-5-2              84.5%
0.8    0      5-5-5-2              83.8%
0.8    0.8    5-5-5-2              82.9%
0.1    0      5-5-5-5-2            86.6%
0.1    0.8    5-5-5-5-2            85.4%
0.8    0      5-5-5-5-2            84.9%
0.8    0.8    5-5-5-5-2            83.7%
0.1    0      5-10-10-10-2         90.1%
0.1    0.8    5-10-10-10-2         91.1%
0.8    0      5-10-10-10-2         90.7%
0.8    0.8    5-10-10-10-2         87.8%
0.1    0      5-50-50-50-2         90.7%
0.1    0.8    5-50-50-50-2         89.4%
0.8    0      5-50-50-50-2         89.6%
0.8    0.8    5-50-50-50-2         90.9%
Table 2: This table shows the classification rates of different MLP (level 2) configurations, with the α and µ values shown in columns 1 and 2.
The highest classification rate is 91.1%. If the MLP is too large, it may get trapped at a certain classification rate and stop improving during the training process. The best configuration is α = 0.1, µ = 0.8, with MLP configuration 5-10-10-10-2.
MLP level 3: 5 inputs (3 from the dealer and 2 from player) and 2 outputs.
This experiment setup is similar to that of MLP levels 1 and 2. First the input feature vectors are preprocessed and analyzed. The characteristics of the 5 feature vectors are shown in the following table:
Feature vector #      1 (dealer)   2 (dealer)   3 (dealer)   4 (player)   5 (player)
Mean                  5.0133       6.2891       6.8344       6.7214       7.17221
Standard Deviation    1.2214       1.3174       1.5291       1.0127       1.4319
Variance              1.1003       1.3212       1.2984       1.0137       1.2101
Minimum               1            1            1            1            1
Maximum               10           10           10           10           10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The preprocessing starts with feature vector normalization and then scales all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 3 and 5 are more important with respect to the class (since their standard deviation values are the highest). The system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 3) Configurations used in the experiment:
With normalization in feature vectors
The input feature vectors are scaled to range of -5 to 5,
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding input layer, total # of layers = 2,3,4
Max. Training Epochs: 1000
Number of neurons in hidden layer = 5, 10, 50
MLP Configuration: 5-5-2, 5-5-5-2, 5-5-5-5-2, 5-10-10-10-2, 5-50-50-50-2
The activation function for the hidden layers: hyperbolic tangent
The activation function for the output layers: sigmoidal
Partition training set dynamically into training and tuning sets;
Where 20% of training data reserved for tuning
# of epochs between convergence check = 10, epoch size = 64
α      µ      MLP Configuration    Classification Rate
0.1    0      5-5-2                78.2%
0.1    0.8    5-5-2                78.8%
0.8    0      5-5-2                80.1%
0.8    0.8    5-5-2                79.4%
0.1    0      5-5-5-2              82.0%
0.1    0.8    5-5-5-2              84.5%
0.8    0      5-5-5-2              84.1%
0.8    0.8    5-5-5-2              83.9%
0.1    0      5-5-5-5-2            85.2%
0.1    0.8    5-5-5-5-2            83.9%
0.8    0      5-5-5-5-2            85.7%
0.8    0.8    5-5-5-5-2            84.5%
0.1    0      5-10-10-10-2         89.9%
0.1    0.8    5-10-10-10-2         91.7%
0.8    0      5-10-10-10-2         92.5%
0.8    0.8    5-10-10-10-2         90.8%
0.1    0      5-50-50-50-2         90.7%
0.1    0.8    5-50-50-50-2         90.1%
0.8    0      5-50-50-50-2         91.1%
0.8    0.8    5-50-50-50-2         89.8%
Table 3: This table shows the classification rates of different MLP (level 3) configurations, with the α and µ values shown in columns 1 and 2.
The highest classification rate is 92.5%. The best configuration is α = 0.8, µ = 0, MLP
Configuration 5-10-10-10-2.
MLP level 4: 6 inputs (3 from the dealer and 3 from player) and 2 outputs.
This experiment setup is similar to the previous experiments. First the input feature vectors are preprocessed and analyzed. The characteristics of the 6 feature vectors are shown in the following table:
Feature vector #      1 (dealer)   2 (dealer)   3 (dealer)   4 (player)   5 (player)   6 (player)
Mean                  5.2710       6.5097       7.1321       6.0211       6.6872       7.0728
Standard Deviation    1.0015       1.1232       1.2156       1.0471       1.2016       1.3549
Variance              1.0121       1.1324       1.2533       1.0877       1.2001       1.3531
Minimum               1            1            1            1            1            1
Maximum               10           10           10           10           10           10
Looking at the feature vectors of the input data, the table shows that the mean of each feature differs from the others. The preprocessing starts with feature vector normalization and then scales all the feature vectors into the same range (-5 to 5). Inspecting the standard deviation and variance of each feature shows that feature vectors 3 and 6 are more important with respect to the class (since their standard deviation values are the highest). The system could pay more attention to these feature vectors by putting more weight on them.
MLP (level 4) Configurations used in the experiment:
With normalization in feature vectors
The input feature vectors are scaled to range of -5 to 5,
Learning rate Alpha = 0.1 (or 0.8)
Momentum = 0 (or 0.8)
Excluding input layer, total # of layers = 2,3,4
Max. Training Epochs: 1000
Number of neurons in hidden layer = 8, 12, 50
MLP Configuration: 6-8-2, 6-8-8-2, 6-8-8-8-2, 6-12-12-12-2, 6-50-50-50-2
The activation function for the hidden layers: hyperbolic tangent
The activation function for the output layers: sigmoidal
Partition training set dynamically into training and tuning sets;
Where 20% of training data reserved for tuning
# of epochs between convergence check = 10, epoch size = 64
α      µ      MLP Configuration    Classification Rate
0.1    0      6-8-2                79.1%
0.1    0.8    6-8-2                80.9%
0.8    0      6-8-2                80.5%
0.8    0.8    6-8-2                80.1%
0.1    0      6-8-8-2              81.5%
0.1    0.8    6-8-8-2              82.2%
0.8    0      6-8-8-2              82.8%
0.8    0.8    6-8-8-2              81.4%
0.1    0      6-8-8-8-2            84.7%
0.1    0.8    6-8-8-8-2            85.2%
0.8    0      6-8-8-8-2            85.7%
0.8    0.8    6-8-8-8-2            84.6%
0.1    0      6-12-12-12-2         90.2%
0.1    0.8    6-12-12-12-2         89.3%
0.8    0      6-12-12-12-2         88.6%
0.8    0.8    6-12-12-12-2         86.9%
0.1    0      6-50-50-50-2         87.6%
0.1    0.8    6-50-50-50-2         88.3%
0.8    0      6-50-50-50-2         88.5%
0.8    0.8    6-50-50-50-2         87.8%
Table 4: This table shows the classification rates of different MLP (level 4) configurations, with the α and µ values shown in columns 1 and 2.
The highest classification rate is 90.2%. The best configuration is α = 0.1, µ = 0, MLP
Configuration 6-12-12-12-2.
The previous experiments have tested the 4 MLP structures that will be used in this
project.
After the best configuration for each MLP was determined, in order to improve the classification performance I collected the cases that are most easily misclassified by the MLP and set up a separate misclassification table to store them. The MLP structures are implemented in the Blackjack game so that the program calls an MLP whenever it is time to make a decision. Before the MLP performs its classification task, the program checks whether the specific dealer and player card pattern exists in the misclassification table. If it does, the program uses the action from the misclassification table instead of using the MLP to make the decision. There are 610 entries in the misclassification table. For the MLP configuration 4-10-10-10-2, this improves the classification rate to 92.9%.
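A minimal sketch of this decision path follows; the column layout of the misclassification table and the function name are assumptions made for illustration only:

    % Sketch: consult the misclassification table first; only when the
    % current card pattern is not listed there does the caller fall back
    % to the trained MLP.
    % pattern:  1x10 sorted, zero-padded dealer+player card vector
    % misTable: rows assumed to be [pattern(1:10), hit, stand]
    function [action, fromTable] = checkMisclassTable(pattern, misTable)
        idx = find(ismember(misTable(:, 1:10), pattern, 'rows'), 1);
        if ~isempty(idx)
            action    = misTable(idx, 11:12);   % stored [hit stand] pair
            fromTable = true;
        else
            action    = [];                     % caller falls back to the MLP
            fromTable = false;
        end
    end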
Here are summaries of the 4 different MLP configurations:
- Normalization in feature vectors, scaled to the range of -5 to 5.
- Max. Training Epochs: 1000, epoch size = 64
- Activation function (hidden layer) = hyperbolic tangent
- Activation function (output layer) = sigmoidal
        α      µ      Configuration     Classification Rate
MLP1    0.1    0      4-10-10-10-2      89.5%
MLP2    0.1    0.8    5-10-10-10-2      91.1%
MLP3    0.8    0      5-10-10-10-2      92.5%
MLP4    0.1    0      6-12-12-12-2      90.2%
Matlab Simulation using MLP
The Matlab program BlackjackANN.m is implemented using the MLPs defined in the previous section. There were 1000 games simulated in this experiment. Both the dealer and the player use MLPs to make decisions in their turns. The winning percentages of the dealer and the player are as follows:
When applying the MLP to the player to play against a dealer with the 17-point rule, the player has a winning percentage of 56.5%. Out of 1000 games, the player won 565 times.
Strategy            Win %    Tie %
Player with MLP     56.5%    9%
When applying the MLP to the dealer (where the dealer does not follow the 17-point rule) to play against a player making random moves, the dealer achieves an outstanding winning percentage of 68.2%.
Strategy            Win %    Tie %
Dealer with MLP     68.2%    3%
When applying MLPs to both the dealer and the player to play against each other, the percentage of tie games increases, as they are more likely to be using the same strategies based on the dealer's and the player's input lookup tables. The result also depends on the number of entries in the misclassification tables. In this experiment, the player performed a little better than the dealer, probably because the player's misclassification table identifies more misclassified patterns than the dealer's table does.
Strategy            Win %    Tie %
Player with MLP     54%      3%
Dealer with MLP     43%      3%
Finally, I had the opportunity to play against the system that I developed. I played against the dealer, with the dealer using the original dealer lookup table. I was able to win 12 games out of 20, with 2 tie games. Therefore, I decided to let the dealer use a better table (the player's original table) as the input to its MLP networks and played against the dealer again. This time, the dealer improved its performance by 1 game (and 1 tie game this time).
Strategy            Win %    Tie %
Human Player        60%      10%
Dealer with MLP     30%      10%
Using the MLP networks for the dealer and the player greatly increased the number of games won by both the dealer and the player. Of course, the human still plays better than the MLP in the end, as the training set for the MLP is very small compared to all the possible states a Blackjack game can have. If the number of entries in both the dealer's and the player's tables increases (so that they cover most of the states in the game), I believe the MLP can certainly play better than a human being. Building such a large network would require additional numerical analysis techniques (for example, making the lookup table smaller while containing more information), so that the MLP can produce a reasonable response in each turn.
Human beings are well adapted to using their own experience in playing games: the more games you play, the more familiar you become with the game through practice. This suggests an interesting learning experiment, letting both the program and the human learn a game together at the same time and observing who actually learns faster. Therefore, I decided to implement a variant of Blackjack as an experiment: instead of having 21 points as the default goal, the user can define the target points of the game. The next section shows the details of this experiment.
Applying ANN to play user-defined Blackjack
In the normal Blackjack game, the goal is to draw cards from a deck of 52 cards to a total value as close to 21 as possible. To make applying the ANN to Blackjack more interesting, I created a version of Blackjack where the user can define his own target points, for example 30 points, so that the program and the human can learn how to play a new game at the same time. To make the implementation easier, I still let both the dealer and the player have a maximum of 5 cards in hand. If both the dealer and the player reach 5 cards without busting, the total points determine who the winner is. Of course, a new set of input lookup tables is needed to train the dealer to play this new game. Since I am playing as the only player, I did not train the player input lookup tables. When the target points change, all the existing game rules and strategies change as well. Without knowing the new rules and strategies of the game in advance, does the MLP network learn better and faster than the human player?
The Matlab program BlackjackANNx.m is implemented for the user-defined Blackjack game. There were 20 games set up between the human and the dealer. To avoid cheating, I did not calculate the probabilities of my plays in advance. After the 20 games, the experiment results are as follows:
30-point target game    Win %    Tie %
Human Player            40%      5%
Dealer with MLP         55%      5%
As this was my first time playing 30-point Blackjack, I won only 8 games out of 20. The winning percentages of the MLP and the human are close. Though the MLP has a slightly larger input table, this only slightly increases the fraction of possible states it covers in this new game. The table above shows the result of the first time I played the new 30-point game. As I played more, I started to beat the dealer more often. As the number of games played increases, the learning rate of the human is definitely a lot higher than that of the program.
On the other hand, the percentage of tie games decreases as the target total increases. There was only 1 tie game out of 20 because, as the number of states increases, the probability of both players reaching the same state becomes smaller. As the target points increase, the number of states also increases dramatically. Moreover, it is impossible for the program to capture all possible states, as the number of lookup table entries would grow dramatically, and the winning probability of each state would only converge after millions of games and millions of entries in the lookup tables; this can be shown in an experiment similar to the ones before. With this tremendous number of states in the game, much depends on how lucky the simulation is during the first 10000 games played to collect the input data in the lookup table, and on how lucky it is in capturing most of the common states. After simulating more games, with the lookup table as the input of the dealer's MLP, the MLP reached at best a 65% classification rate in the new game.
I then ran the simulation one more time, this time letting the simulated player use the previously trained dealer's lookup table as the input to the player's MLP network, while the dealer uses a 25-point rule in this new game. There were 1000 games simulated, and the experiment results are as follows:
30-point target game         Win %    Tie %
Player with MLP              50.2%    2.0%
Dealer with 25-point rule    47.6%    2.0%
Even though the player is using a small lookup table as input to its MLP, it can actually beat the dealer, who is using a fixed game rule, by 2.6%. The MLP network works best for highly random and dynamic games, where the game rules and strategies are hard to define and the game outcomes are hard to predict exactly. Of course, it also depends on how lucky the system is when it first collects its input data.
Conclusion
This project explored the use of an Artificial Neural Network in Blackjack for learning strategies without explicitly teaching the rules of the game, with a reinforcement learning algorithm used to learn the strategies. From the above experiments, it can be concluded that the reinforcement learning algorithm does learn better strategies than the fixed-rule strategies in playing Blackjack, as it improves the winning probability of each game dramatically. The reinforcement learning algorithm is able to suggest the best actions when the situation changes over time; it finds a mapping from different states to probability distributions over the various actions. Without knowing any rules of Blackjack, the ANN is able to perform well in playing the game, given that the training data set is large and accurate enough to cover most of the game states. The misclassification table is also useful in improving the classification rate by reducing the number of misclassifications. However, it requires some manual work to pick entries from the lookup table, recognize the misclassified patterns, and put them into the misclassification table.
The highest classification rate obtained from the MLP is 81%. Of course, the input training table can also be repeatedly trained with the correctly classified test data to improve the classification rate. The performance of playing with the ANN averages 51% winning, with a minimum of 39% and a maximum of 65%, when simulated over 1000 games. However, as the number of games increases, the number of unencountered states increases, including states that are not in the lookup tables; the average winning percentage then drops to 42%. Continued learning is important to produce good results: for example, the input training data for the MLP should be updated after 10000 games played, or the test data from each game can be fed back into the training input table to increase the training set. As the number of games increases, the game strategies will change over time. The larger the input training data, the better the performance will be.
Since the cards are dealt from a 52-card deck, the current hand actually depends on the previous hands. Taking advantage of this memory in playing Blackjack makes the search space much smaller and hence increases the winning probability of each state; including a card counting algorithm in the program is therefore a desirable feature for future work. Other ideas include: adding additional players to the game; training the ANN with a teacher to eliminate duplicate patterns (for example, 4 + 7 = 7 + 4 = 5 + 6 = ...) and to identify misclassified patterns; training the ANN to play against different experts so that it can pick up various game strategies; storing game tricks and strategies in a separate table for the ANN to look up when different states are encountered; exploring other learning methods such as Q-learning; growing the training input data over time to build a better ANN; and extending the ANN to play other similar games such as Poker. The MLP network works best for highly random and dynamic games, where the game rules and the strategies are hard to define and the game outcomes are hard to predict exactly.
Appendix 1
Matlab Files
Blackjack.m: This program allows the user to play blackjack with the dealer.
BlackjackANN.m: Based on the MLP and input data, the ANN can make decisions to play against the dealer.
Genlookuptable.m: This program will simulate 10000 games between the dealer and the player, where the player makes random moves. The winning probability of each state will be updated after each game. The move pattern and winning probability value will be stored in the dealer and player lookup tables, which will be used as the input to the MLP.
BlackjackN.m: This program is called by genlookuptable.m to play against the dealer based on random moves and to update the lookup tables.
Bp2.m, bpconfig2.m, Bpdisplay.m: These programs take the lookup tables as training data. Using the back propagation algorithm, they output the weight matrix for the MLP and the classification rate of each setting.
Blackjackx.m: This program implements regular Blackjack with user-defined target points, for example 30 points.
BlackjackANNx.m: The general form of Blackjack, which takes a user-defined target value (by default, the target value is 21). The learning algorithm is also applied in this program.
Cardhit.m: Updates the screen when the player or dealer hits a card.
Cardplot.m: Plots the dealt cards on the user screen.
Checkmoney.m: Checks the user balance; if it is less than zero, the program quits.
Dealerwin.m: Performs updates to the lookup tables if the dealer won.
findMLP.m: Separates the input lookup table into different sub-tables for the 4 MLPs.
Scale.m, Rsample.m, Randomize.m, Cvgtest.m, Actfun.m, Actfunp.m: Utility Matlab functions from the class website.
Shuffle.m: Shuffles the cards before they are dealt to the player and dealer.
Youwin.m: Performs updates to the lookup tables if the player won.
Appendix 2
Online Presentation and Matlab Files
The project report, online presentation and Matlab files are available for download at:
http://www.cae.wisc.edu/~sam/ece539