Optimization of Batting Order
Frank R. Zheng

A Quick Introduction to Baseball
- Two teams alternate batting and fielding.
- The batting team tries to score runs.
- Runners must advance through first, second, and third base, in order, to reach home.
- Runners advance when players get hits, draw walks, or steal bases, or when the opposing team’s defense commits errors.
- The team with the most runs at the end of the game wins.

Batting Order
- Before each game, the team’s coach must submit the team’s batting order.
- The batting order dictates the order in which players step up to the plate.
- Substitutions such as pinch hitters or pinch runners are allowed, but are relatively rare.
- The optimal batting order maximizes the expected run production.

Batting Order Optimization as a Scheduling Problem
- Finding the optimal batting order for a team can be thought of as a single-machine scheduling problem.
- Each batter is modeled as a job, and the batting order is a sequence of 9 such jobs.
- The objective is to maximize the run production of the lineup.
- This is a complicated function that requires simulation to evaluate.

Approach to Optimize Batting Order
- Each baseball team has a roster of ~15 batters, of which only 9 make up the batting order.
- Brute-forcing all possible lineups is impractical: it would require evaluating 15!/6! ordered lineups (over 1.8 billion unique lineups).
- The solution is to combine a qualitative “conventional wisdom” approach with a data-driven quantitative methodology.

Batting Order Conventional Wisdom
- Over the many decades baseball has been played, coaches have dedicated much thought to finding the best lineup.
- Traditional lineups follow this general order:
  - 1-2: batters who get on base a lot
  - 3-5: batters who get a lot of extra-base hits
  - 6-8: weak batters
  - 9: pitcher/weak batter/batter who gets on base a lot
- The key is to have players with a high realization value (lots of runs batted in) follow those with a high potential value (getting on base a lot), i.e., get runners on base so your power hitters can drive them home.

Underlying Causes of Run Production
- Only a limited set of events has the potential to score runs.
- We refer to these as “Run-Producing Events” or RPEs.
- RPEs include:
  - Singles (1B)
  - Doubles (2B)
  - Triples (3B)
  - Home Runs (HR)
  - Bases on Balls/Hit by Pitch (BB+HBP)
  - Errors (ERR)

Batting Performance
- Does the model fully capture differences among player batting characteristics?

  Event    Regression Value
  OUT      -0.1040
  1B        0.4659
  BB+HBP    0.3255
  2B        0.7613
  3B        1.0456
  HR        1.4031
  ERR       0.4340

- How do we distinguish between ‘table setters’ and ‘sluggers/cleanup hitters’?

Realization Value vs. Potential Value
- Realization Value is the expected number of runs each RPE actually scores.
- Potential Value is the effect each RPE has on the team’s chances of scoring additional runs in the same inning.
- Differentiating between these two metrics allows us to quantitatively determine which players create the potential for scoring runs and which ones are good at bringing those runners home.

  Event    Realization Value   Potential Value   Total Value
  OUT       0.0000             -0.1040           -0.1040
  1B        0.2314              0.2345            0.4659
  BB+HBP    0.0328              0.2927            0.3255
  2B        0.5120              0.2493            0.7613
  3B        0.7411              0.3045            1.0456
  HR        1.7387             -0.3356            1.4031
  ERR       0.1000              0.3340            0.4340

Differentiating Players
- By comparing each individual’s realization value and potential value to the team’s overall averages, we can group players into one of four categories:
  - (R+, P+) Strong Hitters: players who bat in a lot of runs but also create the potential for more runs
  - (R+, P-) Run Producers: players who bat in a lot of runs
  - (R-, P+) Table Setters: players who create a lot of potential for more runs
  - (R-, P-) Weak Hitters: the team’s worst players
- This gives us the quantitative data we need to apply the conventional wisdom discussed earlier.

Overview of Heuristic
- Now we have the tools we need to combine the holistic conventional wisdom with quantitative data.
- We adapted this heuristic from the work of Sokol.
- After determining which players fall into which set, we attempt to follow the conventional wisdom of placing batters with high realization values after a group of batters with high potential values.
- We want to build up potential value and then release it with realization value.
- The optimal order of the four sets is (R-, P+), (R+, P+), (R+, P-), (R-, P-).

Heuristic Steps
1. Select the two batters with the highest P in the (R-, P+) set and assign them to the top two slots in the batting order, in order of increasing P.
2. Place all batters in the (R+, P+) group in the next slots, ordered by decreasing P.
3. Fill as many remaining slots as possible with batters from the (R+, P-) group, ordered by decreasing P.
4. If there are any remaining slots, fill them with batters in the (R-, P-) group, ordered by increasing P.
5. For each player left in the (R-, P+) group, replace a (R-, P-) player if possible, ordering the new (R-, P+) players by increasing P.

Application to 2011 New York Yankees
- To see the effects of our heuristic, we applied it to the 2011 New York Yankees.
- First, we placed each player into the appropriate category.

[Figure: NYY 2011 - Realization Value vs. Potential Value (Difference from Team Average). Axes: Realization Value (RV) and Potential Value (PV). Quadrants: (R-, P+) Table Setters, (R+, P+) Strong Hitters, (R-, P-) Weak Hitters, (R+, P-) Run Producers. Players plotted: Brett Gardner, Derek Jeter, Nick Swisher, Eric Chavez, Jesus Montero, Alex Rodriguez, Eduardo Nunez, Francisco Cervelli, Andruw Jones, Chris Dickerson, Russell Martin, Jorge Posada, Curtis Granderson, Mark Teixeira, Robinson Cano.]

Simulation
- To determine the value of our objective function (the expected number of runs scored per game), we need to simulate a game of baseball using the designated lineup.
- Our simulation follows the structure of a normal game of baseball.
- At each point in time, the next batter steps up to the plate and either generates an RPE or gets out, depending on that player’s distribution.
- RPEs advance runners according to the rules of baseball or by probabilistic outcomes determined using data from the 2011 season.
- The number of outs and runs is recorded for each of 16,200 games.

Results of Analysis

Standard Lineup

  Order  Player             Set
  1      Derek Jeter        (R-, P+)
  2      Curtis Granderson  (R+, P-)
  3      Robinson Cano      (R+, P-)
  4      Alex Rodriguez     (R+, P+)
  5      Mark Teixeira      (R+, P-)
  6      Nick Swisher       (R-, P+)
  7      Jorge Posada       (R-, P-)
  8      Russell Martin     (R-, P-)
  9      Brett Gardner      (R-, P+)

- This lineup generated an average of 5.68 runs per game, and is expected to have a 61.3% chance of winning a 5-game series against the Detroit Tigers.

Heuristic Lineup

  Order  Player             Set
  1      Brett Gardner      (R-, P+)
  2      Derek Jeter        (R-, P+)
  3      Alex Rodriguez     (R+, P+)
  4      Robinson Cano      (R+, P-)
  5      Curtis Granderson  (R+, P-)
  6      Andruw Jones       (R+, P-)
  7      Mark Teixeira      (R+, P-)
  8      Russell Martin     (R-, P-)
  9      Nick Swisher       (R-, P+)

- This lineup generated an average of 5.84 runs per game, with a 64.7% chance of winning a 5-game series against the Detroit Tigers.

Conclusions and Other Applications
- The heuristic lineup increased expected runs by only about 3% (5.68 to 5.84 runs per game).
- Because statistical analysis is well established in baseball, the Yankees may have already studied this problem in great detail.
- Even though the gains in expected run production were small, there are other applications for our methodology:
  - Potential trades or acquisitions of new players can be evaluated by the effect they would have on the team’s expected run production.
  - A game-theoretic approach could adjust the distribution of the team’s run production to maximize the probability of winning against a specific opponent.
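
The lineup count quoted under “Approach to Optimize Batting Order” can be checked directly. This is a minimal sketch that assumes exactly 15 batters and 9 ordered slots, as the slides state; the constant and variable names are illustrative only.

```python
import math

ROSTER_SIZE = 15   # approximate roster size quoted in the slides
LINEUP_SLOTS = 9   # positions in the batting order

# Ordered selections of 9 batters out of 15: 15! / (15 - 9)! = 15! / 6!
ordered_lineups = math.perm(ROSTER_SIZE, LINEUP_SLOTS)
assert ordered_lineups == math.factorial(15) // math.factorial(6)

print(f"{ordered_lineups:,}")  # 1,816,214,400
```

Because the order of the batters matters, these are permutations rather than combinations; the number of unordered 9-player subsets, `math.comb(15, 9)`, is only 5,005.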
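
The per-event table under “Realization Value vs. Potential Value” can be turned into per-player numbers. The sketch below assumes, which the slides do not state, that a player's realization and potential values are the event values weighted by that player's event frequencies per plate appearance. The function name `player_values` and the example frequencies are hypothetical and are not 2011 data.

```python
# Per-event values copied from the slide's table.
REALIZATION = {"OUT": 0.0000, "1B": 0.2314, "BB+HBP": 0.0328, "2B": 0.5120,
               "3B": 0.7411, "HR": 1.7387, "ERR": 0.1000}
POTENTIAL = {"OUT": -0.1040, "1B": 0.2345, "BB+HBP": 0.2927, "2B": 0.2493,
             "3B": 0.3045, "HR": -0.3356, "ERR": 0.3340}
TOTAL = {"OUT": -0.1040, "1B": 0.4659, "BB+HBP": 0.3255, "2B": 0.7613,
         "3B": 1.0456, "HR": 1.4031, "ERR": 0.4340}

# The Total Value column is the sum of the other two columns.
for event, total in TOTAL.items():
    assert abs(REALIZATION[event] + POTENTIAL[event] - total) < 1e-9


def player_values(freqs):
    """Return (realization, potential) per plate appearance.

    freqs maps each event to its probability for one player. Using a
    frequency-weighted average is an assumption of this sketch.
    """
    if abs(sum(freqs.values()) - 1.0) > 1e-9:
        raise ValueError("event frequencies must sum to 1")
    rv = sum(p * REALIZATION[e] for e, p in freqs.items())
    pv = sum(p * POTENTIAL[e] for e, p in freqs.items())
    return rv, pv


# Hypothetical batter; the frequencies are made up for illustration.
example = {"OUT": 0.66, "1B": 0.15, "BB+HBP": 0.09, "2B": 0.05,
           "3B": 0.005, "HR": 0.035, "ERR": 0.01}
rv, pv = player_values(example)
print(f"realization={rv:.4f} potential={pv:.4f}")
```

Subtracting the team averages from each player's pair would then give the signs used to place the player in one of the four (R, P) groups.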
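
The five heuristic steps can be sketched in code. The slides leave several details open, so this sketch makes its own assumptions: a value exactly at the team average counts as “+”; when there are more weak hitters than open slots, the highest-P ones are kept; and a leftover table setter can always replace a weak hitter, lowest P first, with the new table setters placed at the end of the order. The `Batter` class, the function names, and the roster values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batter:
    name: str
    r: float  # realization value minus team average (made-up numbers)
    p: float  # potential value minus team average (made-up numbers)


def group(b):
    # Assumption: a value exactly at the team average counts as "+".
    return ("R+" if b.r >= 0 else "R-", "P+" if b.p >= 0 else "P-")


def heuristic_lineup(roster, slots=9):
    groups = {}
    for b in roster:
        groups.setdefault(group(b), []).append(b)
    setters = sorted(groups.get(("R-", "P+"), []), key=lambda b: b.p)
    strong = sorted(groups.get(("R+", "P+"), []), key=lambda b: -b.p)
    producers = sorted(groups.get(("R+", "P-"), []), key=lambda b: -b.p)
    weak = sorted(groups.get(("R-", "P-"), []), key=lambda b: b.p)

    # Step 1: two highest-P table setters lead off, in increasing P.
    top, leftover = setters[-2:], setters[:-2]
    # Steps 2-3: strong hitters, then run producers, by decreasing P.
    lineup = (top + strong + producers)[:slots]
    # Step 4: fill what is left with weak hitters, in increasing P.
    free = slots - len(lineup)
    tail = weak[-free:] if free else []
    # Step 5: leftover table setters replace weak hitters (assumed: the
    # lowest-P weak hitters go first) and bat last, in increasing P.
    swaps = min(len(leftover), len(tail))
    new_setters = leftover[-swaps:] if swaps else []
    return lineup + tail[swaps:] + new_setters


# Hypothetical 11-player roster, not the Yankees' actual values.
roster = [
    Batter("A", -0.02, 0.05), Batter("B", -0.01, 0.03),
    Batter("C", -0.03, 0.01), Batter("D", 0.04, 0.02),
    Batter("E", 0.05, -0.01), Batter("F", 0.03, -0.02),
    Batter("G", 0.02, -0.03), Batter("H", 0.06, -0.04),
    Batter("I", -0.02, -0.01), Batter("J", -0.04, -0.03),
    Batter("K", -0.05, -0.05),
]
print([b.name for b in heuristic_lineup(roster)])
```

The sketch ignores any constraint the phrase “if possible” may refer to, for example fielding positions, so it can swap out more weak hitters than the lineup reported in the slides does.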
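
The simulation loop described under “Simulation” can be sketched as a small Monte Carlo program. This is a simplified stand-in, not the authors' simulator: it assumes every runner advances exactly as many bases as the batter on a hit, that walks only move forced runners, and it omits errors, steals, and the data-driven advancement probabilities from the 2011 season. All names and the event probabilities are hypothetical.

```python
import random

EVENTS = ["OUT", "1B", "BB+HBP", "2B", "3B", "HR"]
HIT_BASES = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}


def advance(bases, event):
    """Apply one non-out event; return (new_bases, runs_scored).

    bases is [first, second, third] occupancy as booleans.
    """
    if event == "BB+HBP":
        new, runs = list(bases), 0
        if all(new):                 # bases loaded: runner on third is forced home
            runs = 1
        elif new[0] and new[1]:      # first and second: lead runner forced to third
            new[2] = True
        elif new[0]:                 # first only: runner forced to second
            new[1] = True
        new[0] = True
        return new, runs
    n = HIT_BASES[event]
    new, runs = [False, False, False], 0
    for i, occupied in enumerate(bases):
        if occupied:
            if i + n >= 3:           # runner reaches home
                runs += 1
            else:
                new[i + n] = True
    if n == 4:                       # batter scores on a home run
        runs += 1
    else:
        new[n - 1] = True
    return new, runs


def simulate_game(lineup, rng, innings=9):
    """lineup: one list of event weights (aligned with EVENTS) per batter."""
    runs, up = 0, 0
    for _ in range(innings):
        outs, bases = 0, [False, False, False]
        while outs < 3:
            event = rng.choices(EVENTS, weights=lineup[up % len(lineup)])[0]
            up += 1
            if event == "OUT":
                outs += 1
            else:
                bases, scored = advance(bases, event)
                runs += scored
    return runs


def mean_runs(lineup, games=16_200, seed=0):
    rng = random.Random(seed)
    return sum(simulate_game(lineup, rng) for _ in range(games)) / games


# Hypothetical lineup of nine identical batters; weights are made up.
typical = [0.68, 0.15, 0.09, 0.05, 0.005, 0.025]
lineup = [typical] * 9
print(mean_runs(lineup, games=1_000))
```

Calling `mean_runs` on two candidate orders with the same seed gives a like-for-like comparison of expected runs per game, which is the objective the slides describe.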