Offensive Strategies in Baseball
A Discrete Event Simulation Project

Kirby Hunt
DSES6620-Prof Gutierrez-Miravete
10/5/02

Table of Contents

Introduction
Objective and Scope
Collection and Analysis of Data
Construction of the Model
Model Verification and Validation
Experimentation and Results
Conclusions
Appendix

Introduction

The strategy of the game of Baseball is a matter of intense interest to fans and coaches at all levels of the game. This project investigates the dynamics of a baseball team's offensive strategy using discrete event simulation. This problem is ideally suited to discrete event simulation as it involves dynamic stochastic processes which influence each other in ways that are difficult, if not impossible, to predict with other modeling techniques. This project sheds some light on the traditional strategy of batting order (the order in which the players bat) by modeling the offense of a real major league baseball team and then exploring the relative merits of several batting order strategies.

Objective and Scope

The objective and scope of this project were laid out in the project proposal written before work began (see the file “proposal.doc”). In brief, the main goal for this project was for the author to gain experience in setting up a simulation that works properly and represents the offensive side of the game of Baseball with acceptable fidelity. This simulation would then be exercised to produce some useful insight into the most productive strategies for playing the game. All of these objectives were accomplished.

The scope of the project was limited to the investigation of the merits of various strategies of batting order. For this reason, only the batting, basic base running, and scoring aspects of the game were modeled. No defensive interaction (neither pitching nor fielding) was modeled. Base stealing, bunting, and hit and run plays were also omitted. Finally, pressure effects such as game winning or losing situations or batting with runners in scoring position were ignored. Limiting the scope of the project in this way permitted focus to be maintained on getting the basic offensive functionality of the model to work correctly in the time allotted for the project.

Collection and Analysis of Data

The data needed for this project was obtained from the Major League Baseball official website (www.mlb.com). It consists of batting statistics recorded for each player in the league for the entire 2002 baseball season. The result of each at-bat for the year is represented in this data so it is a good statistical sample (300-600 at-bats per player). The batting numbers for one team, the Seattle Mariners (not chosen at random), are shown in the following table.

Seattle Mariners Offensive Statistics Summary (2002 Regular Season)

Statistic                               I Suzuki    B Boone     J Olerud    M Cameron   J Cirillo   C Guillen   R Sierra    D Wilson    M McLemore  D Relaford
Position                                OF          2B          1B          OF          3B          SS          OF          C           OF          SS
G - Games Played                        157         155         154         158         146         134         122         115         104         112
AB - At Bats                            647         608         553         545         485         475         419         359         337         329
R - Runs Scored                         111         88          85          84          51          73          47          35          54          55
H - Hits                                208         169         166         130         121         124         113         106         91          88
1B - Singles                            165         108         105         74          95          85          77          83          65          67
2B - Doubles                            27          34          39          26          20          24          23          16          17          13
3B - Triples                            8           3           0           5           0           6           0           1           2           2
HR - Home Runs                          8           24          22          25          6           9           13          6           7           6
RBI - Runs Batted In                    51          107         102         80          54          56          60          44          41          43
TB - Total Bases                        275         281         271         241         159         187         175         142         133         123
BB - Bases on Balls (Walks)             68          53          98          79          31          46          31          18          61          33
SO - Strikeouts                         62          102         66          176         67          91          66          81          63          51
SB - Stolen Bases                       31          12          0           31          8           4           4           1           18          10
CS - Caught Stealing                    15          5           0           8           4           5           0           0           10          3
OBP - On-base Percentage                0.388       0.339       0.403       0.340       0.301       0.326       0.319       0.326       0.380       0.339
SLG - Slugging Percentage               0.425       0.462       0.490       0.442       0.328       0.394       0.418       0.396       0.395       0.374
AVG - Batting Average                   0.321       0.278       0.300       0.239       0.249       0.261       0.270       0.295       0.270       0.267
SF - Sacrifice Flies                    5           6           12          5           9           3           2           8           4           7
SH - Sacrifice Hits                     3           2           0           4           13          3           0           7           4           1
HBP - Hit by Pitch                      5           6           5           7           9           1           0           2           1           6
IBB - Intentional Walks                 27          4           6           3           0           4           5           1           1           2
GIDP - Ground into Double Plays         8           11          19          8           12          8           17          8           3           6
TPA - Total Plate Appearances           728         675         668         640         547         528         452         394         407         376
Plate Appearances per Game              4.6         4.4         4.3         4.1         3.7         3.9         3.7         3.4         3.9         3.4
RBI Per Plate Appearance                0.07        0.16        0.15        0.13        0.10        0.11        0.13        0.11        0.10        0.11
NP - Number of Pitches                  2627        2502        2633        2612        1928        2063        1572        1442        1642        1337
XBH - Extra Base Hits                   43          61          61          56          26          39          36          23          26          21
SB% - Stolen Base Percentage            67.4        70.6        0           79.5        66.7        44.4        100         100         64.3        76.9
GO - Ground Outs                        238         204         155         100         136         120         125         88          92          88
AO - Fly Outs                           140         134         169         141         171         138         112         99          96          108
GO/AO - Ground Outs/Fly Outs            1.76        1.60        1.03        0.77        0.87        0.93        1.27        0.97        0.99        0.87
OPS - On-base Plus Slugging Percentage  0.813       0.801       0.893       0.782       0.629       0.719       0.736       0.721       0.774       0.713

In order to use these numbers in the discrete event simulation it was necessary to categorize and tally the outcome for each at-bat in terms of a few possible outcomes. Accordingly, each at-bat was put into one of the following categories:

1. Single
2. Double
3. Triple
4. Home Run
5. Walk or Hit by Pitch
6. On base on error
7. Strike Out
8. Fieldable Grounder
9. Pop Fly

The batting statistics above for each player then boiled down to the following probabilities for each of the possible outcomes:

Seattle Mariners Hitting Probabilities (based on 2002 regular season statistics)

Player        Single    Double    Triple    Home Run  Walk&HBP  On Base on Error  Strikeout  Grounder  Pop Fly
I Suzuki      0.227     0.037     0.011     0.011     0.100     0.002             0.085      0.327     0.192
B Boone       0.160     0.050     0.004     0.036     0.087     0.001             0.151      0.302     0.199
J Olerud      0.157     0.058     0.000     0.033     0.154     0.000             0.099      0.232     0.253
M Cameron     0.116     0.041     0.008     0.039     0.134     0.003             0.275      0.156     0.220
J Cirillo     0.174     0.037     0.000     0.011     0.073     0.007             0.122      0.249     0.313
C Guillen     0.161     0.045     0.011     0.017     0.089     0.002             0.172      0.227     0.261
R Sierra      0.170     0.051     0.000     0.029     0.069     0.000             0.146      0.277     0.248
D Wilson      0.211     0.041     0.003     0.015     0.051     0.006             0.206      0.223     0.251
M McLemore    0.160     0.042     0.005     0.017     0.152     0.004             0.155      0.226     0.236
D Relaford    0.178     0.035     0.005     0.016     0.104     0.001             0.136      0.234     0.287

Conceptual Model

Once the data had been gathered and analyzed, the conceptual model that would form the framework or outline for the detailed model was generated. The conceptual model for batting itself is as simple as generating a random variate based on the set of probabilities just described that determines the outcome of each at-bat. These probabilities were put directly into Pro-Model as "User Defined Distributions" for each batter. They were then used to produce a random variate for each at-bat during the simulation.
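
As an illustration of the batting model just described, the following Python sketch draws an at-bat outcome from one player's probabilities. It is not the ProModel logic used in the project: the names OUTCOMES, SUZUKI_PROBS and draw_bat_result are hypothetical, and the probabilities are I Suzuki's values from the hitting probabilities table. Because the tabulated values are rounded and do not sum to exactly one, the sketch treats them as relative weights.

```python
import random

# Outcome codes 1-9 follow the nine at-bat categories listed in the report.
OUTCOMES = {
    1: "Single", 2: "Double", 3: "Triple", 4: "Home Run",
    5: "Walk or Hit by Pitch", 6: "On base on error",
    7: "Strike Out", 8: "Fieldable Grounder", 9: "Pop Fly",
}

# I Suzuki's probabilities, copied from the hitting probabilities table.
# They sum to 0.992 because of rounding, so they are used as relative weights.
SUZUKI_PROBS = [0.227, 0.037, 0.011, 0.011, 0.100, 0.002, 0.085, 0.327, 0.192]


def draw_bat_result(probs, rng):
    """Return an outcome code (1-9) drawn in proportion to the nine weights."""
    codes = list(OUTCOMES)  # [1, 2, ..., 9]
    return rng.choices(codes, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(2002)  # fixed seed so the run is repeatable
    for _ in range(5):
        code = draw_bat_result(SUZUKI_PROBS, rng)
        print(code, OUTCOMES[code])
```

Passing a different player's nine probabilities in the same column order gives that player's at-bat outcomes.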

Although the batting order directly affects only the quality of the hits being produced, it would not be sufficient to model the hitting alone to be able to evaluate batting order strategies. This is because scoring runs is the final product of the batting order and this is crucially linked to the relationship between base running and batting. In other words, not only the number of hits but also the timing of hits is important: if there are no runners in scoring position when a hit is made, the hit may produce no runs. Thus, a modeling scheme had to be generated for base running as well as hitting.

The conceptual model for base runners proved to be significantly more complicated than for batters. The following rules include many simplifications but nevertheless capture the essence of how base-runners behave in the real game based on the outcome of the at-bat.

Type of Hit             Runner response
A single base hit       The batter gets to first and each base runner advances two bases
A double base hit       The batter gets to second and all other base runners score
A triple base hit       The batter gets to third and all other base runners score
A home run              The batter and all base runners score
Walk or Hit by Pitch    The batter gets to first and any runner in a force position advances one base (because only one runner can occupy a base, any runner who must advance to make room for another runner during a play is in a force position)
On Base on Error        The batter gets to first and any runner in a force position advances one base
Strike Out              The batter is out and no runners advance
A fieldable grounder    Any two lead runners in force situations, including the batter, are out
A pop fly               The batter is out and no runners advance, except that any runner on third tags for home if there are fewer than 3 outs

Construction of the Model

The model for this project was created with the student version of a commercially available discrete event simulation software package called ProModel.
The simulation space consists of the 4 bases of a baseball field, a batter’s box, and some other holding areas that have no physical parallel but are convenient for model functionality. The defensive team is not modeled but defensive effects are reflected in the offensive batting probabilities mentioned above.

Because of the limitation on the number of different kinds of entities imposed by the software, generic batters and base-runners are used in the simulation. Logic in the processing of the batters keeps track of the batting order by counting the batters and then using the appropriate player’s hitting probability distribution in turn to determine the outcome of each at-bat. The batter becomes a generic runner upon leaving the batter’s box and all runners behave exactly the same. In this way, each player’s batting skill is represented in the desired order for batting but only two entity types are used.

The outcome of the at-bat depends on a random variate generated from the player's batting distribution. The random variate is an integer between one and nine corresponding to the possible at-bat results described above. As each at-bat occurs, the random variate is generated and then assigned to a variable called “BatResult”. This variable is then used in various places in decision logic to control the flow of the players. A successful at-bat (1-6) results in the batter ending up on base, while a strike-out, grounder, or pop-fly (7, 8, 9) results in one or more outs.

As each base-runner arrives at a base, the statement “wait until BatResult > 0” is used near the top of the processing logic to delay processing of the runner until another batter has batted and the outcome of the at-bat is known. When the batter does hit, “if … then” statements in the base’s processing logic route the base-runners according to the BatResult variable. These “if … then” statements enforce the base-running rules described in the conceptual model above.
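
The base-running rules can also be sketched outside ProModel as a single function. The Python below is an illustration under stated assumptions, not a transcription of the model file: the function name apply_at_bat and the tuple representation of the bases are hypothetical. The report does not say where surviving runners end up on a fieldable grounder, so the sketch assumes that any forced runner who is not put out moves up one base and that unforced runners hold.

```python
def apply_at_bat(bases, outs, result):
    """Apply one at-bat outcome (code 1-9) to the base state.

    bases is a (first, second, third) tuple of booleans.
    Returns (new_bases, new_outs, runs_scored); outs are capped at 3.
    """
    f, s, t = bases
    runs = 0
    if result == 1:        # single: batter to first, runners advance two bases
        runs = int(s) + int(t)
        f, s, t = True, False, f
    elif result == 2:      # double: batter to second, all runners score
        runs = int(f) + int(s) + int(t)
        f, s, t = False, True, False
    elif result == 3:      # triple: batter to third, all runners score
        runs = int(f) + int(s) + int(t)
        f, s, t = False, False, True
    elif result == 4:      # home run: batter and all runners score
        runs = int(f) + int(s) + int(t) + 1
        f, s, t = False, False, False
    elif result in (5, 6):  # walk, HBP or error: only forced runners advance
        if f and s and t:
            runs = 1        # bases loaded, runner on third is forced home
        t = t or (f and s)
        s = s or f
        f = True
    elif result == 7:      # strike out: batter out, nobody advances
        outs += 1
    elif result == 8:      # fieldable grounder: two lead forced runners out
        forced = 0 if not f else 1 if not s else 2 if not t else 3
        if forced == 0:
            outs += 1                      # only the batter is forced
        elif forced == 1:
            f = False                      # runner from first and batter out
            outs += 2
        elif forced == 2:
            f, s, t = True, False, False   # ASSUMPTION: batter safe at first
            outs += 2
        else:
            f, s, t = True, True, False    # ASSUMPTION: survivors move up one
            outs += 2
    elif result == 9:      # pop fly: batter out, runner on third may tag up
        outs += 1
        if t and outs < 3:
            runs = 1
            t = False
    else:
        raise ValueError("result must be an integer from 1 to 9")
    return (f, s, t), min(outs, 3), runs
```

Returning the new state rather than modifying shared variables keeps each rule easy to check in isolation, which is the kind of check the trace and animation provide in the ProModel version.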

Timing is controlled via a downtime at the batter’s box location. This location experiences a downtime immediately after each at-bat. When the downtime is over, the location sets the BatResult back to zero and all base runners are stopped at the next base by the “wait until BatResult > 0” statements. The downtime is scheduled according to the outcome of the at-bat to allow runners to advance the appropriate number of bases. If a double is hit, for example, the batter’s box is still down when the batter reaches first base so BatResult is still equal to two and the base processes the batter and sends him on to second. Once all the base-runners have settled, the next batter bats and his fate and the fate of the base runners are once again determined by the at-bat outcome. Any runs scored are tallied and, after three outs, the bases are cleared and a new inning starts. The model keeps track of innings played and terminates after nine.

There may be a better way to construct the base running routing and logic, but part of the complexity was driven by the desire to send each runner to each base in sequence for the sake of the animation. It would have been easier to route all runners directly to their final location after each at-bat, but the animation would have looked wrong if runners were skipping bases. In any case, for all its complexity, the model worked correctly, as shown in the next section.

Model Verification and Validation

The model was verified and validated to ensure good results. The first verification checks involved simply watching the animation of the simulation and checking the model parameters. Parameter displays were added to the animation to facilitate this process. This allowed the verification that hitters were arriving to bat in the right order, the outcome of their at-bat was varying randomly, and the hitters and base-runners were being routed correctly.
This also facilitated the verification that innings were ending after three outs and the score was being tallied correctly. Because the model involved complex synchronization of events, the trace facility proved to be invaluable during some debugging involving the timing of code execution at multiple locations.

The final validation checks of the simulation involved comparing 2002 season statistics from the real players and the team as a whole with results for the simulation. For example, during 157 games in real life, Ichiro Suzuki made 728 plate appearances. During 157 replicated games, the simulation recorded 743 plate appearances for Ichiro. With a 95% confidence band of 726-743 appearances, it is clear that the simulation is closely matching reality in this respect. This indicates that the simulation is correctly capturing the real life dynamics of the game in general (not just for this one batter) because the number of plate appearances depends on the number of outs vs. hits for all players. The more hits per inning, the more batters will bat, and the more times each player will bat during a game. If the ratio between outs and base hits was grossly wrong for any player, all other players' plate appearances would likely be affected.

Overall team statistics from the simulation were also compared against real life data. The most important measure, runs per game, did not match as well as might be desired with the initial model. The 2.9-3.6 runs per game (90% Confidence) from the simulation fell short of the 5 runs per game achieved by the real team. This prompted reexamination of the simplifications that were used in the model for base running (namely the omission of base stealing and the limitation of base runner advancement to one base for a single and two bases for a double). This re-examination revealed that base stealing is an infrequent enough occurrence that including it would not materially affect the results of the model.
The assumption that each base runner would advance only one base when a single was hit and two bases when a double was hit was found to be in error. In reality, runners more often advance two bases on a single and three on a double. The model was adjusted accordingly and the average runs per game for the simulation changed to 4.2-5.0 (90% Confidence), which matches the actual 5 runs per game of the real team well.

Experimentation and Results

The universally embraced strategy for arranging a line-up or batting order, from little league to the major leagues, is to place the three batters with the highest on base percentages (but not necessarily home run hitting ability) first in the batting order with the best power/home-run hitter in the fourth spot as the "clean-up" hitter. The rationale behind this traditional strategy is that you can get at least one of your high percentage batters on base and then your "clean-up" batter brings them in to score with a home run or a double or triple. The objective of the experimentation in this project was to challenge this notion.

An alternative strategy was proposed. The cleanup hitter, Mike Cameron, was simply moved to the first spot and the lead off hitter, Ichiro Suzuki, was moved to the fourth spot. In the case of the team being analyzed here, the second, third, and fourth batters are all above average home run hitters while Suzuki is purely a base-hit batter. Thus, moving Suzuki to the fourth spot puts the home-run hitters up front in general. No other changes were made to the order. This scheme gives the power hitters more opportunities at-bat as well as changing the dynamics of the game in other ways that are difficult to predict.

The base case and the alternative were run with 200 replications (each replication is a 9 inning game) and then compared against each other with a paired-t confidence interval analysis (see the included spreadsheet “compare strategies.xls” for calculations).
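
The paired-t confidence interval used for this comparison can be sketched as follows. This Python fragment is not the spreadsheet calculation itself: the function name paired_t_interval and its arguments are hypothetical, and it assumes that replication i of the base case is paired with replication i of the alternative. The critical value is passed in by the caller (from a t table) so that the sketch needs only the standard library.

```python
import math
import statistics


def paired_t_interval(base_runs, alt_runs, t_crit):
    """Confidence interval for the mean of (alternative - base) runs per game.

    base_runs, alt_runs: equal-length lists, one entry per replication.
    t_crit: two-sided Student-t critical value for n - 1 degrees of freedom.
    """
    if len(base_runs) != len(alt_runs):
        raise ValueError("the two runs must have the same number of replications")
    # Work on the per-replication differences, as a paired-t analysis requires.
    diffs = [alt - base for base, alt in zip(base_runs, alt_runs)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    half_width = t_crit * statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff - half_width, mean_diff + half_width
```

With 200 paired replications there are 199 degrees of freedom, for which the two-sided 90% critical value is approximately 1.653. An interval that lies entirely above zero indicates that the alternative order scored more runs at that confidence level.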
The result of this comparison was that the alternative batting order produced 0.03-0.76 (90% confidence interval) more runs per game than the traditional order. In other words, the team scored slightly more runs with the slugger batting first than with the traditional lead-off man batting first. It is important to note that the difference fell slightly short of being statistically significant at a confidence level of 95%.

In order to investigate this question further, a trade was made for a player from another ball-club. It was theorized that because the clean-up hitter used in the study was sub-typical for a real power hitter, the difference in the two batting order strategies might be highlighted by using a more successful player. Thus, statistics were gathered for another player, Shawn Green of the Los Angeles Dodgers, who hit 42 home runs last year compared to Mike Cameron’s 25. The simulation was run first with Shawn Green batting in the traditional clean-up spot. Not surprisingly, the team’s runs per game increased by 0.03-0.85 runs (90% confidence interval) compared to the standard Seattle lineup. This is an interesting result in itself and is the kind of information that would be useful to baseball managers contemplating player trades.

Next the simulation was run with the alternate batting order strategy (Shawn Green batting first and Suzuki batting fourth). The results of this run were somewhat surprising. The team averaged almost exactly the same number of runs with Green batting first as they did with him in the fourth position. Thus, the conclusion that switching the clean-up and lead-off hitters leads to more scoring was not strengthened but weakened by making this comparison. It also indicates that generalizations about batting order strategies might be error prone because the merit of different strategies might depend on the players involved.

One other batting strategy was investigated with the simulation.
It explored the relative merits of the Mariners' clean-up hitter, Mike Cameron, and the Mariners' lead-off batter, Suzuki. To make the comparison, Cameron was removed from the order and Suzuki was put into the order in his normal spot and in Cameron’s spot. The performance of the team was then compared against the baseline performance. This comparison would show the relative merits of having two Suzuki-like batters vs one Suzuki and one Cameron. The results were that the team scored approximately the same number of runs (the difference was not statistically significant at the 90% confidence level). This indicates that Suzuki’s superior batting average is about equal in scoring value to Cameron’s combination of low average but good power hitting ability.

Conclusions

With the results of these comparisons in mind there are a couple of conclusions that can be drawn. First, the limited combinations explored here indicate that there is no advantage to having the traditional clean-up batter bat in the clean-up position rather than the first position. In the case of the Mariners 2002 lineup, the simulation showed that there is actually a slight advantage to having the cleanup batter bat first and the lead-off batter bat fourth. However, making the substitution of one player, Shawn Green, into the lineup changed the dynamics so that there was no advantage to having the cleanup batter bat first. This demonstrated that it is difficult to make general rules about batting order strategies as their effectiveness is largely dependent on the individual skills of the players involved and the dynamics of the team.

It would be folly to take the results generated with these major league players and apply them universally to major league teams. It would be even more erroneous to apply these results to amateur teams with vastly different talent distribution. Rather, in order to get good results for this kind of analysis it is necessary to model the players involved individually.

Appendix

The following files are included with this report for reference:

baseball.mod                    The Pro-Model model file
baseball.TXT                    The model text file
marinerstats.xls                The raw data and data analysis used for the model
compare strategies.xls          The results comparison analysis for the various configurations explored
various results output files    Output files generated for each batting configuration studied