Pac-Man Optimization

Will Britt and Bryan Silinski

Abstract
Multiple decision-making algorithms were implemented to play a game of Pac-Man as effectively as possible. These algorithms included: reflex agent, minimax, expectimax, and Q-learning. This project explored the effectiveness of each of these algorithms in addition to the time complexities involved in their use.

Keywords: Pac-Man, Reflex Agent, Minimax, Expectimax, Q-Learning

1. Introduction
Pac-Man is a game developed by Namco in 1980. The player is given a game board full of dots, known as pellets, which the player, controlling the Pac-Man character, attempts to consume. In order to consume all of the pellets, Pac-Man must avoid the ghosts, which can end the game by causing a game over. There are also larger dots on the game board known as power pellets, which grant Pac-Man a short period of immunity during which he is able to consume the ghosts for extra points; the consumed ghosts then respawn in their non-consumable state. For the purposes of this project, the Pac-Man game developed for the Intro to Artificial Intelligence class (Berkeley CS 188) by Berkeley professors John DeNero and Dan Klein is used for testing the algorithms. There are a couple of differences between the original game and the Berkeley implementation we are using for our project. The game board used to compare the algorithms is smaller and contains only two ghosts, which move around the board randomly. These changes are necessary due to computer limitations: we need to run enough trials to make statistically significant statements when comparing the algorithms.

2. Formal Problem Statement
Given an agent with a set of N moves to choose from, the agent should choose the move which best maximizes utility. Formulaically, this is represented by max_i U(N_i), where the result is the move with maximum utility, N_i represents a particular move from the set of moves, and U() represents the utility function.

Utility for each move is determined by our performance evaluation function, which provides an objective, quantifiable criterion for the level of success of an agent's behavior. This utility function looks at factors such as how far the Pac-Man agent is from ghosts, dots, or power pellets. Our utility function pseudocode is listed below:

def utilityFunction(gamestate):
    if gamestate is a win state:
        return 10000                       # winning is the best result
    moveScore = 0
    moveScore -= ghostDistance[0] * 5      # how much Pac-Man wants to avoid ghosts
    if Pac-Man eats a pellet in gamestate:
        moveScore += 100                   # pellet reward
    if Pac-Man eats a power pellet in gamestate:
        moveScore += 100                   # power pellet reward
    moveScore += random.random() * 10      # break ties so Pac-Man does not flip-flop
                                           # between moves with equal scores
    return moveScore

These mathematical manipulations exist inside a function that will be called utilityFunction for the rest of this paper. utilityFunction requires a game state parameter. A game state contains all the information from the game at that moment. Because this project is concerned with having Pac-Man choose a move, this function is passed potential successor game states in order to choose the best one. The moveScore variable is the calculated score of a move and is adjusted based on data from the game state, such as the distance from Pac-Man to the ghosts or whether Pac-Man eats a pellet in that state. The moveScore that is returned is the utility for that particular move.
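To make the pseudocode above concrete, the following is a minimal runnable sketch of such an evaluation function in Python, written against accessor names in the style of the Berkeley Pac-Man framework (isWin, isLose, getPacmanPosition, getGhostPositions, getNumFood, getCapsules, and the manhattanDistance helper); these names, and the specific weights, are assumptions for illustration rather than the exact function used in our tests. The pellet and power-pellet rewards are expressed here by penalizing what is left on the board, which ranks successor states the same way as rewarding a state in which something was just eaten.

import random
from util import manhattanDistance  # distance helper shipped with the Berkeley framework (assumed)

def utilityFunction(gamestate):
    # Terminal states dominate everything else.
    if gamestate.isWin():
        return 10000
    if gamestate.isLose():
        return -10000

    pacman = gamestate.getPacmanPosition()
    ghostDistances = [manhattanDistance(pacman, g) for g in gamestate.getGhostPositions()]

    moveScore = 0.0
    # Ghost-avoidance term: the closer the nearest ghost, the larger the penalty.
    moveScore -= 5.0 / (min(ghostDistances) + 1)
    # Fewer remaining pellets / power pellets means one was just eaten on the way here.
    moveScore -= 10 * gamestate.getNumFood()
    moveScore -= 100 * len(gamestate.getCapsules())
    # Small random tiebreaker so Pac-Man does not flip-flop between equally scored moves.
    moveScore += random.random() * 10
    return moveScore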
3. Methods

3.1 Reflex Agent
A reflex agent makes decisions based only on the current perception of a situation, not on past or future perceptions. In order for a reflex agent to come to a decision, it must be given rules or criteria with which to evaluate the potential decisions. In a general sense, a reflex agent works in three stages:
1. The potential states of the current environment are perceived.
2. The states are evaluated by the rules or criteria supplied for the decision process.
3. Based on the evaluation, a decision is made.
When applying this general model to the Pac-Man problem, the decision to be made is the move (North, South, East, West, or Stop) to be taken by Pac-Man. In order to decide between these moves, the utilityFunction described previously was used. In general, the reflex agent runs in O(n) time, where n is the number of moves possible from a game state; each move is run through the evaluation function before one is selected. For the Pac-Man problem, the algorithm would run at most five times, when Pac-Man is in the middle of an intersection. Pseudocode for the reflex agent for the Pac-Man problem:

max = -infinity
for move in PacManLegalMoves:
    score = utilityFunction(game state after move)
    if score > max:
        max = score
        currentMoveChosen = move

From the pseudocode above, one can see that the loop responsible for choosing the move runs once for every move Pac-Man has available, so the time complexity is O(n).
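As a concrete version of this loop, here is a minimal sketch of a reflex agent in the style of the Berkeley framework's agent classes, reusing the utilityFunction sketch from Section 2; the class and method names (Agent, getAction, getLegalPacmanActions, generatePacmanSuccessor) follow that framework and should be treated as assumptions rather than our exact code.

from game import Agent  # base agent class provided by the Berkeley framework (assumed)

class ReflexAgent(Agent):
    """Chooses the legal move whose successor state scores highest under utilityFunction."""

    def getAction(self, gamestate):
        bestScore = float("-inf")
        bestMove = None
        for move in gamestate.getLegalPacmanActions():
            successor = gamestate.generatePacmanSuccessor(move)  # state after Pac-Man makes this move
            score = utilityFunction(successor)
            if score > bestScore:
                bestScore, bestMove = score, move
        return bestMove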
3.2 Minimax Agent
Minimax is an algorithm often used in two-player games where the two opponents work toward opposite goals. Minimax takes into account future moves by both the player and the opponent in order to choose the best move. Minimax is used in "full information games": games in which the player can calculate all possible moves for the adversary. For example, in a game of chess a player is able to list out all possible outcomes of the opponent's next move based on the pieces on the board. The minimax algorithm also operates under the assumption that the opponent will always make the optimal choice to make the player lose the game; that is, it looks at the worst-case scenario for the player. Implementing a minimax algorithm involves looking at the game on a turn-by-turn basis. Generally, applying minimax to a two-player situation looks like this:
1. If the game is over, return the score from the player's point of view.
2. Else, get game states for every possible move for whichever player's turn it is.
3. Create a list of scores from those states using the utilityFunction.
4. If the turn is the opponent's, return the minimum score from the score list.
5. If the turn is the player's, return the maximum score from the score list.
For our problem, a maximum of five moves would be considered (North, South, East, West, and Stop) if Pac-Man were in the middle of a four-way intersection. These moves' utilities would be calculated and the move with the maximum value selected. For every ghost turn, a maximum of four moves would be considered (North, South, East, and West) if the ghost were in the middle of a four-way intersection, since ghosts do not have the ability to stop. These moves would be evaluated by their utility and, since the ghosts are Pac-Man's adversaries, the minimum utility would be taken. This algorithm is usually written recursively, as we implemented it, and branches out as deeper future game states are explored. The algorithm chooses the move which, according to the game tree, best benefits the player. Ideally, the minimax algorithm would run through an entire game's worth of moves before choosing one for the player. However, exploring enough depth to completely search a game only works on small games like tic-tac-toe, due to the space and time constraints of the algorithm. For larger games, such as Pac-Man, it is more practical to limit the depth. The time complexity of the minimax algorithm is O(b^d), where b^d represents the number of game states sent to the utility function. b, referred to as the branching factor, represents the number of game states per depth; in Pac-Man this would be 3-5 Pac-Man successor states (depending on location). The d in this time complexity refers to the number of depths explored. Minimax pseudocode for Pac-Man:

def value(state):
    if the state is a terminal state or the depth limit is reached:
        return utilityFunction(state)
    if the next agent is max (Pac-Man):
        return max-value(state)
    if the next agent is min (a ghost):
        return min-value(state)

def max-value(state):
    initialize v = -infinity
    for each successor of state:
        v = max(v, value(successor))
    return v

def min-value(state):
    initialize v = infinity
    for each successor of state:
        v = min(v, value(successor))
    return v

Looking at the pseudocode for minimax, one can see that the branching factor, b, from the big-O notation comes from the successors of each game state, which are the potential next moves for either Pac-Man or the ghosts depending on the agent. Depth, d, is handled by the recursion, which exits upon reaching a terminal state or the required depth. One interesting thing to note, as previously mentioned, is that the minimax algorithm assumes the adversaries, the ghosts, will always make the optimal move, which would be to come after Pac-Man. However, in our tests the ghosts did not move optimally; instead they selected a random move from the legal moves available at their position. This produces some interesting scenarios. When enough depth is explored that Pac-Man believes its death is inevitable due to the proximity of the ghosts, Pac-Man (while operating under the minimax algorithm) will commit suicide as quickly as possible in order to maximize the final score. This is obviously not the ideal decision, as there is a chance that the ghosts will not make optimal decisions and will not converge upon Pac-Man.
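The recursion above can be sketched as a depth-limited search, again assuming Berkeley-style game state methods (getNumAgents, getLegalActions, generateSuccessor, isWin, isLose); this is an illustrative outline rather than our exact implementation. One "depth" here is a full round in which Pac-Man and every ghost move once.

def minimaxValue(state, agentIndex, depth):
    # Stop at terminal states or once the depth limit is exhausted.
    if state.isWin() or state.isLose() or depth == 0:
        return utilityFunction(state)

    nextAgent = (agentIndex + 1) % state.getNumAgents()
    # Count down one depth only after the last ghost has moved (a full round).
    nextDepth = depth - 1 if nextAgent == 0 else depth

    values = [minimaxValue(state.generateSuccessor(agentIndex, a), nextAgent, nextDepth)
              for a in state.getLegalActions(agentIndex)]
    return max(values) if agentIndex == 0 else min(values)  # agent 0 is Pac-Man (max), ghosts minimize

def minimaxAction(state, depth=2):
    # Pick the root move whose subtree has the highest minimax value.
    return max(state.getLegalActions(0),
               key=lambda a: minimaxValue(state.generateSuccessor(0, a), 1, depth))

Calling minimaxAction with depth 2, 3, or 4 corresponds to the "minimax two" through "minimax four" configurations evaluated in the Results section.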
3.3 Expectimax Agent
In order to solve the problem of Pac-Man not taking into account the random nature of the ghosts, the minimax function had to be altered so that it no longer looks at the worst case and instead looks at the average case. When applying this algorithm to Pac-Man, the utilityFunction from the reflex agent is once again used to evaluate the game states. For the player's turn, expectimax functions similarly to minimax: the maximum utility is returned for the player. For the non-player turn, however, we are no longer looking for the worst case for Pac-Man (best case for the ghosts), but for the average case. To do this, the probability of each move occurring needs to be taken into account. In the minimax algorithm, the levels of the game tree alternated between maximizing the player's moves and minimizing the ghosts' moves until the depth limit was reached. With expectimax, chance nodes take the place of the minimum nodes, so that instead of taking the minimum of the utility values as was done in minimax, we take the expected utility weighted by the probabilities. The ghosts are no longer thought of as agents trying to minimize Pac-Man's score; rather, they are thought of as part of the environment.

The time complexity of O(b^d) is the same as minimax, where b^d represents the number of game states evaluated by the utility function. As in minimax, b (the branching factor) represents the number of game states per depth (in Pac-Man this would be 3-5 Pac-Man successor states multiplied by 4-16 ghost successor game states), and d represents depth. Expectimax pseudocode for Pac-Man:

def value(state):
    if the state is a terminal state or the depth limit is reached:
        return utilityFunction(state)
    if the next agent is max (Pac-Man):
        return max-value(state)
    if the next agent is a chance node (a ghost):
        return exp-value(state)

def max-value(state):
    initialize v = -infinity
    for each successor of state:
        v = max(v, value(successor))
    return v

def exp-value(state):
    initialize v = 0
    for each successor of state:
        p = probability(successor)   # 1 / number of available moves
        v += p * value(successor)
    return v

Upon inspection of this pseudocode, one can see the similarities between minimax and expectimax. The only clear difference is that instead of the minimizing function previously used for the adversary agents (the ghosts), there is an expected-value function which weighs each of a ghost's potential moves by its probability. Like minimax, the branching factor in the time complexity comes from the for loops that explore each agent's possible moves. The depth comes into play through the recursion, whose exit condition is once again the required depth or a terminal state.
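Under the same assumed interface as the minimax sketch above, the only change needed for expectimax is to replace the minimum over ghost successors with a probability-weighted average, which is uniform here because the test ghosts choose among their legal moves at random:

def expectimaxValue(state, agentIndex, depth):
    if state.isWin() or state.isLose() or depth == 0:
        return utilityFunction(state)

    nextAgent = (agentIndex + 1) % state.getNumAgents()
    nextDepth = depth - 1 if nextAgent == 0 else depth

    actions = state.getLegalActions(agentIndex)
    values = [expectimaxValue(state.generateSuccessor(agentIndex, a), nextAgent, nextDepth)
              for a in actions]
    if agentIndex == 0:                     # Pac-Man still maximizes
        return max(values)
    return sum(values) / len(actions)       # ghosts: expected value under a uniform random policy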
3.4 Q-Learning
Q-learning is a model-free reinforcement learning technique. Reinforcement learning in computing is inspired by behavioral psychology and the idea of reinforcement, where behavior is changed by rewards or punishment. In computing, the idea is to have an agent choose an action that best maximizes its reward. Q-learning uses this idea to find an optimal decision-making policy for any given Markov decision process (MDP). An MDP is a mathematical framework for making decisions, where outcomes are determined by both randomness and the decision maker's choices. An MDP can be further explained as follows: at a specific time in the decision-making process, the process is in some state, s, and the decision maker may choose any action, a, available in s. Once an action is chosen, the process moves to the next time step and randomly reaches a new state, s', and the decision maker is given a corresponding reward, defined by the reward function R_a(s, s'). The probability of reaching a particular s' is influenced by the action that was chosen, which is represented by the transition probability function P_a(s, s').

With standard Q-learning, each state is a snapshot of the game which takes into consideration the absolute positions and features of the game state (i.e., the ghosts are at position [3,4], not one move away from Pac-Man). This works for smaller problem sizes, because every state can be visited enough times to properly determine which move from that state has high utility. This decision takes into account the absolute positions of food, ghosts, and Pac-Man. However, this is inefficient for larger problem sizes due to the large number of potential states that Pac-Man has to visit in order to train properly for every scenario. Another consideration is that Pac-Man might train itself to avoid certain areas because visiting them is inefficient, leaving it with little or no training for those areas when it does end up there. To combat this flaw, we modified standard Q-learning into an approximate Q-learning algorithm. In approximate Q-learning, instead of learning from each coordinate state alone, the algorithm generalizes the scenario. Some factors considered for Pac-Man are the vicinity of dots, ghosts, and relative position. The relative position is used to create a more efficient route while pursuing dots and avoiding ghosts, in addition to the actual coordinates, so as to optimize the traversal of the game board. Approximate Q-learning accounts for state factors similar to those of the reflex agent. Q-learning is more sophisticated than the basic reflex agent because the algorithm will, through trial and error, converge upon the optimal weight values for the different state factors. The Q-learning algorithm must be trained in order for it to figure out the optimal way to play Pac-Man. After the training sessions are done, the learned values are stored in a table. This allows Pac-Man to look up the state and move, consider the features of the game state (i.e., there is a ghost directly to Pac-Man's east), and then perform a move. Very little on-the-fly calculation is performed compared to the other algorithms. The time complexity is also different from the other algorithms. For our evaluations we only compare the performance of the algorithms with regard to playing time, not the time taken to train. With that in mind, Q-learning has a time complexity of O(1) because it uses a lookup table to determine the optimal move.
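A minimal sketch of the approximate Q-learning update described above is given below. The Q-value of a state-action pair is a weighted sum of features, and after each observed transition the weights are nudged toward the reward plus the discounted value of the best next action. The feature extractor, the learning rate alpha, and the discount gamma shown here are illustrative assumptions, not the exact features or parameters used in our training runs.

# Approximate Q-learning: Q(s, a) is a linear function of hand-picked features.
# featureExtractor(state, action) -> dict of featureName -> value; the feature names
# (e.g. "distance-to-nearest-ghost", "eats-pellet") are illustrative assumptions.

def qValue(weights, state, action, featureExtractor):
    return sum(weights.get(f, 0.0) * v for f, v in featureExtractor(state, action).items())

def updateWeights(weights, state, action, nextState, reward,
                  featureExtractor, legalNextActions, alpha=0.2, gamma=0.8):
    # Value of the best action available from the next state (0 if terminal).
    nextValue = max((qValue(weights, nextState, a, featureExtractor)
                     for a in legalNextActions), default=0.0)
    # Temporal-difference error between observed and predicted value.
    difference = (reward + gamma * nextValue) - qValue(weights, state, action, featureExtractor)
    for f, v in featureExtractor(state, action).items():
        weights[f] = weights.get(f, 0.0) + alpha * difference * v
    return weights

Because the learned weights generalize across positions, a state never seen during training can still be evaluated from its features, which is the property that lets approximate Q-learning cope with the large Pac-Man state space.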
4. Results
(See the Appendix for the pertinent graphs and tables.)

In our tests we ran each algorithm for a total of one thousand test trials. We ran the minimax and expectimax algorithms at depths two, three, and four in order to determine how performance changed as more depth was evaluated. From this point forward, minimax depth two, expectimax depth three, etc. will be referred to as minimax two, expectimax three, and so on. For Q-learning we ran the same number of test trials; however, we also varied the training by running the algorithm four separate times with fifty, one hundred, five hundred, and one thousand training trials in order to see how the results differ.

Shockingly, the reflex agent lost every game. We found that the reflex agent failed as much as it did because the utility evaluation function was not as well suited to the reflex agent as it was to the other algorithms. Ultimately, we decided to live with this because, for the purposes of this project, the particular evaluation function is not important; what is important is to use the same evaluation function for all the algorithms, so that the algorithms themselves are compared rather than the evaluation function.

Within the set of minimax tests, the minimax two algorithm had the highest score; however, it had a win percentage 0.01 lower than that of minimax four. Because minimax depends upon the adversary making the optimal move, it is logical that minimax two had such a high score: as the depth increases, the potential for the ghosts to move non-optimally, and thus disrupt the accuracy of the evaluation, also increases. It is a surprise that minimax four had the highest win percentage. As the depth increases, the win percentage does not show a pattern of increase, and when going from depth two to depth three it actually decreases, which is most likely due to randomness; with more trials we would expect this trend not to continue. It is also interesting to note that as the depth increases, the average number of moves per game increases. As the depth increases, the in-game time stays the same, because one unit of in-game time consists of Pac-Man and all of the ghosts moving once; however, the real time it takes the algorithm to evaluate the possible moves increases. Depth three takes approximately three times longer than depth two, and depth four takes approximately five times longer than depth three, which reflects the exponential increase in time with depth suggested by our theoretical time complexity (Appendix Item D).

Within the set of expectimax tests, unlike minimax, expectimax four had both the highest score and the highest win percentage of the group. Expectimax four also had the lowest standard deviation in the expectimax group, showing that it performed more consistently than expectimax at other depths. When comparing the different depths of expectimax, the greatest change was from depth two to depth three. The same holds for the standard deviation of the move count, since expectimax two had the greatest standard deviation. The transition from depth three to depth four showed smaller changes in the win percentage, average score, and move standard deviation. While in-game performance increases with depth, it must be noted that the real time (seconds, minutes, etc.) needed to calculate a move also increases. Going from depth two to depth three takes approximately three times as long on average per move; going from depth three to depth four takes approximately five times as long on average. This matches the time complexity stated previously, where an increase in depth causes an exponential increase in the number of game states explored and the time taken. With additional depth, more time is required to complete the algorithm (Appendix Item E).

In the Q-learning tests, we varied the number of training games: 50, 100, 500, and 1000. Q-learning with 500 training games performed best, with the highest score, the highest win percentage, and the lowest standard deviation. However, it did not have the lowest average number of moves; Q-learning with 100 training games took the fewest moves on average. Because the lookup does not depend on a search depth, Q-learning requires roughly the same time per move regardless of the number of training games, which is as expected. Given the nature of Q-learning, it is expected that as the number of training sessions increases, the agent plays more optimally, because the weights of the state evaluators converge upon their optimal values. However, when Q-learning was run with 1000 training sessions, it did not perform better than it did with 500 training games and actually performed worse.
When comparing the different algorithms against each other, rather than among their sister algorithms (i.e., expectimax depth 2 vs. expectimax depth 4, or Q-learning with 50 training games vs. Q-learning with 100 training games), the strengths of each can be compared more directly. Among the best algorithms (and depth settings, where appropriate), expectimax depth 4 performed the best when judged by win percentage and average score. This illustrates how brute force can be leveraged to compute every combination to a certain depth and evaluate how the future might unfold. This depth evaluation comes at a cost, because expectimax depth 4 takes the longest time per move. If the move times of the best algorithm in each subset are summed, excluding the expectimax subset, expectimax depth 4 still takes over five times as long. Overall game performance comes at a cost; however, if one is willing to sacrifice 0.006 of win percentage and a slightly lower score, Q-learning performs the best of the remaining subsets. Considering the in-game trade-off against time taken, Q-learning calculates a move almost 500 times faster than expectimax depth 4. Q-learning even performs faster than the reflex agent, because Q-learning looks up utilities in a table and does not perform any calculations during a test game. Another thing to consider is that Q-learning with 500 training sessions had the smallest standard deviation in the moves-per-game count. This illustrates how consistent Q-learning is at reaching the desired goal.

5. Conclusions and Future Considerations
Based on our results and the time complexities involved in acquiring them, we can draw a few overall conclusions. First, one can see how much randomness mattered within the context of Pac-Man, as there is a large discrepancy between the results of the minimax and expectimax algorithms. If we were to do this experiment again, it would be beneficial to run the same algorithms on a game board where the ghosts did, in fact, move optimally, to see what differences could be found. Another conclusion drawn from this project is the noticeable power of a properly trained Q-learning algorithm. When considering the time taken to make decisions alongside the results, Q-learning was far and away the best algorithm. Q-learning was able to achieve results similar to expectimax depth four in a fraction of the time. For future problems similar in nature to the one explored in this paper, it would be wise to apply the Q-learning algorithm.

There are a couple of concerns worth mentioning in regard to this project. First, the reflex agent had the potential to do much better than it did with a different utility function, but we wanted consistency across algorithms so that the algorithms themselves could be compared. If we were to do this project again, we would try to find an evaluation function that works well for all algorithms, even the reflex agent. In addition, coming up with a better utility function in general would have been a good idea. A utility function is fairly subjective, because the programmer decides which states are desirable, and these subjective choices may not have been ideal for winning at Pac-Man. Many of the decisions involving the utility function were made by guessing and checking until Pac-Man's behavior coincided with our preferences for how to play the game. Looking back on this project, there is probably a better way to come up with a utility function.
If we were to do this project again, it would also be worth exploring greater depths of minimax and expectimax. Due to computer limitations and the number of trials needed to draw conclusions, we could only reasonably explore up to depth four for both algorithms. It would be interesting to see how close to a perfect 100% win rate we could get by investigating larger depths. The same goes for Q-learning: by adding more games to the training count, it would be interesting to see how close to perfection we could get.

Appendix

(A)
Algorithm              Avg Move Time   Avg # Moves    Avg Score     Win %        Move STD
Reflex                 0.001919824     68.144         -445.024      0            52.22509669
Minimax 2              0.004229125     189.043        975.877       0.672        81.1781562
Minimax 3              0.012895651     322.443        801.657       0.648        151.2438741
Minimax 4              0.061221232     411.292        818.098       0.682        193.9309773
ExpectiMax 2           0.016378297     168.175        1223.215      0.836        60.59668583
ExpectiMax 3           0.048551765     192.308        1283.672      0.915        54.26267201
ExpectiMax 4           0.234599232     200.735        1305.865      0.927        52.37050166
Qlearn 50              0.000527238     132.411        1191.929      0.9          29.7713354
Qlearn 100             0.000494471     129.223        1214.287      0.91         25.68413802
Qlearn 500             0.000504648     130.534        1230.316      0.921        24.23758041
Qlearn 1000            0.000488254     130.6677632    1205.631579   0.90296053   25.28541812
Qlearn 50 w/ Train*    0.001037794     -              1191.929      0.9          0
Qlearn 100 w/ Train*   0.000993991     -              1214.287      0.91         0
Qlearn 500 w/ Train*   0.000958897     -              1230.316      0.921        0
Qlearn 1000 w/ Train*  0.000942771     -              1205.631579   0.90296053   0

(B) Win % by algorithm (bar chart)

(C) Avg Move Time by algorithm (bar chart)

(D) Avg Move Time for MiniMax: Minimax 2, Minimax 3, Minimax 4 (bar chart)

(E) Avg Move Time for Expectimax: ExpectiMax 2, ExpectiMax 3, ExpectiMax 4 (bar chart)

Pseudocode

Pseudocode for utility function:

def utilityFunction(gamestate):
    if gamestate is a win state:
        return 10000                       # winning is the best result
    moveScore = 0
    moveScore -= ghostDistance[0] * 5      # how much Pac-Man wants to avoid ghosts
    if Pac-Man eats a pellet in gamestate:
        moveScore += 100                   # pellet reward
    if Pac-Man eats a power pellet in gamestate:
        moveScore += 100                   # power pellet reward
    moveScore += random.random() * 10      # break ties so Pac-Man does not flip-flop
                                           # between moves with equal scores
    return moveScore

Pseudocode for reflex agent for the Pac-Man problem:

max = -infinity
for move in PacManLegalMoves:
    score = utilityFunction(game state after move)
    if score > max:
        max = score
        currentMoveChosen = move

Minimax pseudocode for Pac-Man:

def value(state):
    if the state is a terminal state or the depth limit is reached:
        return utilityFunction(state)
    if the next agent is max (Pac-Man):
        return max-value(state)
    if the next agent is min (a ghost):
        return min-value(state)

def max-value(state):
    initialize v = -infinity
    for each successor of state:
        v = max(v, value(successor))
    return v

def min-value(state):
    initialize v = infinity
    for each successor of state:
        v = min(v, value(successor))
    return v

Expectimax pseudocode for Pac-Man:

def value(state):
    if the state is a terminal state or the depth limit is reached:
        return utilityFunction(state)
    if the next agent is max (Pac-Man):
        return max-value(state)
    if the next agent is a chance node (a ghost):
        return exp-value(state)

def max-value(state):
    initialize v = -infinity
    for each successor of state:
        v = max(v, value(successor))
    return v

def exp-value(state):
    initialize v = 0
    for each successor of state:
        p = probability(successor)   # 1 / number of available moves
        v += p * value(successor)
    return v

References

Sutton, R., & Barto, A. Reinforcement Learning: An Introduction. Retrieved November 2, 2015, from https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html
Tic Tac Toe: Understanding the Minimax Algorithm. Retrieved October 2, 2015, from http://neverstopbuilding.com/minimax

UC Berkeley CS188 Intro to AI -- Course Materials. Retrieved September 1, 2015, from http://ai.berkeley.edu/home.html