Baseball Fundamentals: Pitching, Hitting, Running and Fielding Your Way to Success

Statistics & Data Analysis
Data Analysis Project
Professor Jeffrey Simonoff

Overview of Analysis

Baseball is a “simple” game: you pitch, field, hit and run. For our project, we evaluate how each of those components contributes to the success of a team. Observations are taken for all Major League Baseball teams. The 6 variables analyzed for each team are:

Hitting
  On Base Percentage (OBP): A measure of a batter’s contribution to his team’s offense, based on the rate at which he reaches base.
  Slugging Percentage (SLG): A measure of a batter’s contribution to his team’s offense, based on the total number of bases per at bat.

Running
  Stolen Bases (SB): A measure of a team’s ability to use its speed and base-running skill for offensive gain.

Fielding
  Fielding Percentage (FP): The proportion of fielding chances handled without an error. High fielding percentages indicate quality fielding and throwing.

Pitching
  K/9IP: The average number of strikeouts per standard 9-inning game. This statistic is one way to evaluate a team’s pitching and is commonly known as a measure of “power pitching”, as strikeouts are indicative of a pitcher’s ability to overpower a batter.
  Walks Plus Hits Per Inning Pitched (WHIP): This statistic is useful when evaluating the effectiveness of a team’s pitching. It indicates how successful the pitchers are at keeping opposing batters off base.

This analysis evaluates how these 6 variables affect a team’s winning percentage. They were chosen because they best represent the fundamental aspects of the game listed above. In theory, the team that executes the fundamentals of baseball best will win the most games. The analysis lets us see which aspect has the greatest effect on a team’s ability to win.
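The six measures described above can be written out as a short sketch. The formulas are the standard definitions of these statistics, which this report does not state explicitly, and the function and argument names are hypothetical labels chosen for illustration. Stolen bases need no formula, since SB is a simple count.

```python
# Standard definitions of the rate statistics used in this report.
# ASSUMPTION: the report does not give formulas; these are the conventional ones.
# All argument names are hypothetical labels for season counting totals.

def obp(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """On Base Percentage: times on base per qualifying plate appearance."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)

def slg(total_bases, at_bats):
    """Slugging Percentage: total bases per at bat."""
    return total_bases / at_bats

def fielding_pct(putouts, assists, errors):
    """Fielding Percentage: share of fielding chances handled without an error."""
    return (putouts + assists) / (putouts + assists + errors)

def whip(walks_allowed, hits_allowed, innings_pitched):
    """Walks Plus Hits Per Inning Pitched."""
    return (walks_allowed + hits_allowed) / innings_pitched

def k_per_9(strikeouts, innings_pitched):
    """Strikeouts per nine innings pitched."""
    return 9 * strikeouts / innings_pitched
```

Each function takes raw season totals and returns the rate statistic on the same scale used in the tables of this report (for example, OBP near 0.33 and WHIP near 1.4).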
Motivations for Analysis

This analysis is of interest because it considers the various aspects of the game and evaluates the impact each has on the overall performance of a team. Each team is effectively a company, competing in the industry of baseball. Ideally, each team determines and pursues a strategy that makes the most of its resources (e.g. financial support) and capabilities (e.g. scouting and player development). By analyzing data for specific teams, our objective is to understand the aspects of the game that winning teams have emphasized. Based upon this historical data, we will see in which areas of the game winning teams have excelled. We expect this analysis to provide insight into winning strategies, which would enable us to forecast the winning percentage of future teams based upon basic measures of their hitting, running, fielding and pitching performance.

Overview of Data

The data analyzed in this project covers five years’ worth of season statistics for all 30 teams in Major League Baseball. All of our variables are numerical. Winning percentage, our response variable, and several predictors (OBP, SLG, K/9IP, WHIP and fielding percentage) are continuous variables. Stolen bases is a discrete numerical variable. All data was obtained from www.ESPN.com.

In determining what data to use, we wanted to select statistical measures that covered the traditional “5 tools” of baseball: hitting for average, hitting for power, speed, fielding and throwing. These tools are the skills on which individual players have traditionally been evaluated. Although our analysis looks at the overall team rather than the individual player, we use the traditional list of statistical measures to provide a view of team performance. For example, to gain insight into batting for average and batting for power we used the OBP and SLG batting statistics. Additionally, we used fielding percentage as a measure of both fielding and throwing for non-pitchers.
While these statistical values are useful in measuring teams’ performance in batting, running, fielding and pitching, there are some limits on their ability to tell the whole story:

Fielding performance depends on both the ability to execute a play without making an error and a player’s ability to reach a ball put in play, commonly known as “range”. Fielding percentage does not account for range, and range is a very subjective statistic. Teams composed of players with exceptional range may affect the game by taking away hits that otherwise would have occurred. The downstream impact of exceptional range could therefore show up in WHIP. Additionally, fielders with limited range make no attempt on plays that fielders with exceptional range could attempt and be charged with an error on. Therefore, since the impact of range may vary, we deemed it acceptable to exclude it.

A team’s proficiency at executing its running game is measured in many statistical and non-statistical ways. In addition to stolen bases, the percentage of successful stolen base attempts and the ability to take an additional base on a play are key components. The ability of a team to take the extra base or break up a double play is a good indication of its ability to use good base running to its advantage. However, these plays are not recorded in any statistical numbers.
First Look at the Data:

Descriptive Statistics:

Variable      N    Mean     Median  SE Mean   StDev    Minimum  Q1       Q3       Maximum
OBP           150  0.33388  0.333   0.000983  0.01205  0.3      0.325    0.342    0.366
SLG           150  0.42446  0.423   0.0019    0.02329  0.368    0.40775  0.44325  0.491
SB            150  89.42    85      2.49      30.53    31       66       109      200
Fielding %    150  0.98311  0.983   0.000206  0.00253  0.977    0.981    0.985    0.989
WHIP          150  1.3938   1.39    0.00699   0.0856   1.22     1.33     1.45     1.62
K/9           150  6.5297   6.44    0.0481    0.5887   5.41     6.1175   6.8625   8.68
Winning %     150  0.49633  0.5015  0.00573   0.07023  0.264    0.438    0.549    0.644

The initial analysis of our data highlights no unusual distributions. For the most part the mean and the median of each variable are very similar, which suggests roughly symmetric distributions and is consistent with normality. We then attempted to verify our findings by plotting a histogram for each of our variables. Looking at the histogram for the K/9 variable, we saw a slight right tail. However, after we took the log of K/9, there was no significant improvement.

[Figure: Histogram of K/9 (x-axis: K/9, y-axis: Frequency)]
[Figure: Histogram of Log K/9 (x-axis: Log K/9, y-axis: Frequency)]

Next we graphed our response (winning %) in a box plot to highlight any outlying data points. The only outlier observed was the winning % of the 2003 Detroit Tigers, which was one of the ten lowest winning percentages in baseball history. We will take this into consideration as we evaluate the quality of our model.

[Figure: Boxplot of Winning %]

We then looked at the fitted line plot of each variable against winning % to get a better understanding of the relationship between each predictor and the response. Each plot isolates a single predictor and does not take into account the combined effect of all variables on the response.
[Figures: Fitted Line Plots of Winning % against each predictor. The fitted equations, fit statistics and labeled points from the six plots are summarized here.]

Predictor    Fitted line                              S          R-Sq   R-Sq(adj)  Labeled points
OBP          Winning % = -0.6627 + 3.471 OBP          0.0566129  35.4%  35.0%      LA Dodgers ‘03, Detroit ‘03
SLG          Winning % = -0.1656 + 1.559 SLG          0.0603062  26.8%  26.3%      LA Dodgers ‘03, Detroit ‘03
SB           Winning % = 0.4825 + 0.000155 SB         0.0703042  0.5%   0.0%       NY Mets ‘03
Fielding %   Winning % = -11.90 + 12.61 Fielding %    0.0628067  20.6%  20.0%      Detroit ‘03
WHIP         Winning % = 1.297 - 0.5744 WHIP          0.0502820  49.1%  48.7%      Detroit ‘03
K/9          Winning % = 0.2704 + 0.03460 K/9         0.0674335  8.4%   7.8%       Arizona ‘04, Detroit ‘03

From these plots we do not see an overwhelmingly strong relationship between any individual variable and a team’s winning %, as each R-Sq is below 50%. The variables with the highest R-Sq are WHIP and OBP, and the variable with the lowest R-Sq is SB. That no individual variable explains most of the variation in winning % is not surprising, since a team’s success depends on execution of all the fundamentals of the game of baseball. There appear to be potential outliers and/or leverage points in the fitted line plots above; the impact of these points will be evaluated further after analyzing the best subsets regression.

Preliminary Multiple Regression Model:

Regression Analysis: Winning % versus OBP, SLG, ...
The regression equation is
Winning % = - 3.02 + 1.67 OBP + 0.891 SLG + 0.000098 SB + 3.35 Fielding % - 0.510 WHIP - 0.00089 K/9

Predictor    Coef        SE Coef     T       P
Constant     -3.020      1.107       -2.73   0.007
OBP          1.6664      0.3117      5.35    0.000
SLG          0.8914      0.1569      5.68    0.000
SB           0.00009752  0.00008467  1.15    0.251
Fielding %   3.345       1.123       2.98    0.003
WHIP         -0.50953    0.03450     -14.77  0.000
K/9          -0.00089    0.004759    -0.19   0.851

S = 0.0309459   R-Sq = 81.4%   R-Sq(adj) = 80.6%

Analysis of Variance

Source          DF   SS        MS        F       P
Regression      6    0.597891  0.099649  104.06  0.000
Residual Error  143  0.136944  0.000958
Total           149  0.734835

The multi-variable model highlights the importance of considering several fundamentals, as it now accounts for approximately 81% of the variability in team winning percentage. As expected, increases in OBP, SLG and Fielding % are associated with higher winning percentages. Holding all other variables constant, the model indicates that a team which gives up one additional hit or walk per game (an increase in WHIP of 1/9, or 0.1111) can be expected to have a winning percentage that is lower by 0.057 (0.1111 x 0.50953), or nearly one standard deviation of winning percentage. This result underscores the baseball adage that “pitching wins games.”

In contrast, our model reveals that the impact of stolen bases on team winning percentage is negligible. Even across the full range (169), the difference between the team with the most stolen bases and the team with the fewest, the predicted difference in winning percentage is only .017 (169 x 0.000098). This is consistent with the high P value for stolen bases of 0.251, which indicates insufficient evidence to reject the null hypothesis that stolen bases are unrelated to team winning percentage given the other predictors.

One point of interest in the model is that when comparing two teams with identical statistics other than K/9, the model predicts that the team with the lower K/9 will actually have a slightly higher winning percentage.
However, K/9 appears to have a minimal impact on team winning percentage. The difference between the teams with the highest and lowest K/9 results in a predicted difference in winning percentage of only .003 (3.27 x 0.00089). This is again consistent with the extremely high P value for K/9 of 0.851, which indicates insufficient evidence to reject the null hypothesis that K/9 is unrelated to team winning percentage given the other predictors.

The P values for stolen bases and K/9 indicate that the inclusion of these variables in our model adds little to its predictive power. This will be analyzed further in the “Model Improvement” section below.

The standard error of the estimate of approximately 0.031 implies the model can predict winning percentage to within roughly ±0.062 (2 x 0.031) about 95% of the time. To put this into perspective, over the course of a 162 game season, this translates into an error of approximately ±10 wins (±0.062 x 162).

Checking Assumptions

In order to evaluate the validity of the model assumptions, we must analyze the model errors through the use of several residual plots. We begin with the plot of residuals versus the fitted values as well as residuals versus each predictor. These plots are examined for any structure which may indicate that the model assumptions are invalid.
[Figure: Residuals Versus the Fitted Values (response is Winning %)]
[Figure: Residuals Versus OBP (response is Winning %)]
[Figure: Residuals Versus SLG (response is Winning %)]
[Figure: Residuals Versus SB (response is Winning %)]
[Figure: Residuals Versus Fielding % (response is Winning %)]
[Figure: Residuals Versus WHIP (response is Winning %)]
[Figure: Residuals Versus K/9 (response is Winning %)]

The above plots reveal no apparent structure in the residuals, indicating that our assumptions regarding the distribution of the errors are reasonable. That is, there are no well-defined subgroups and the variance of the errors appears constant (homoscedastic). Next we evaluate the normal probability plot of the residuals to check that the errors are normally distributed.

[Figure: Normal Probability Plot of the Residuals (response is Winning %)]

This plot indicates that the residuals roughly follow a normal distribution. As a further step to ensure our assumptions hold, we run a time-series plot of the residuals, which would reveal any autocorrelation of results across seasons.

[Figure: Residuals Versus the Order of the Data (response is Winning %)]

Our data was ordered from 2007 down to 2003, with 30 observations from each season. The time-series plot of residuals does not reveal any apparent patterns that would indicate autocorrelation.
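The visual check for autocorrelation described above could be complemented with a numeric one. The sketch below computes the Durbin-Watson statistic, which is not used in this report and is offered only as an assumption about how such a check might be done. It takes the residuals in data order; values near 2 suggest little first-order autocorrelation, while values toward 0 or 4 suggest positive or negative autocorrelation respectively. Because our rows are grouped by season with 30 teams each, neighboring rows are mostly different teams in the same year, so the result should be read loosely.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in observation order.

    Computed as the sum of squared successive differences divided by the
    sum of squared residuals. The result always lies between 0 and 4.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator
```

In practice the input would be the 150 residuals from the regression above, in the same 2007-to-2003 order as the data.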
Model Improvement

Our multiple-regression model provided significantly greater ability to predict the winning percentage of a baseball team than any simple regression model with an individual predictor. However, we will now evaluate opportunities to improve upon our model.

Simplifying the Model

As previously indicated, we believe that stolen bases and K/9 are the weakest predictors of winning percentage. We use best subsets regression to determine which combination of predictors provides the strongest ability to predict winning percentage.

Best Subsets Regression: Winning % versus OBP, SLG, ...

Response is Winning %

Vars  R-Sq  R-Sq(adj)  Mallows C-p  S         Predictors (where identified in the text)
1     49.1  48.7       244.7        0.050282  WHIP
1     35.4  35.0       349.3        0.056613  OBP
2     76.3  76.0       37.7         0.034401
2     74.8  74.5       49.1         0.035467
3     80.1  79.7       10.8         0.031661
3     77.3  76.9       31.8         0.033765
4     81.2  80.7       4.3          0.030874  OBP, SLG, F%, WHIP
4     80.2  79.6       12.0         0.031683
5     81.4  80.7       5.0          0.030842  OBP, SLG, SB, F%, WHIP
5     81.2  80.5       6.3          0.030981
6     81.4  80.6       7.0          0.030946  OBP, SLG, SB, F%, WHIP, K/9

These results support our initial conclusion that stolen bases and K/9 have negligible impact on the model’s ability to predict team winning percentage. By eliminating these two variables from the model, we reduce its complexity without giving up predictive ability, as shown by the slight increase in adjusted R² (80.6 to 80.7). While the model that eliminates only K/9 provides a slightly higher R² and a lower standard error of the estimate, the benefits (R² higher by 0.2 percentage points and S lower by 0.000032) are nearly inconsequential compared to the simplicity of modeling based upon fewer variables.

The output of this simplified model is shown below. As expected the model has produced a slightly lower R² of 81.2% with a standard error of the estimate of 0.030874. We also see slight changes to the coefficients for our remaining variables. The coefficients for SLG and Fielding % each dropped slightly, while those for OBP and WHIP increased slightly in magnitude.
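The adjusted R-Sq comparison used in this section can be checked by hand. The sketch below applies the standard adjustment formula, 1 - (1 - R-Sq)(n - 1)/(n - p - 1), with n = 150 observations and p predictors. The function name is hypothetical, and because the inputs are the R-Sq values as rounded in the output, the results are approximate.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Full six-predictor model and simplified four-predictor model.
# R-Sq inputs are the rounded values from the regression output.
full_model = adjusted_r_squared(0.814, n_obs=150, n_predictors=6)
simplified_model = adjusted_r_squared(0.812, n_obs=150, n_predictors=4)

print(f"full model:       {100 * full_model:.1f}%")
print(f"simplified model: {100 * simplified_model:.1f}%")
```

The simplified model has the lower raw R-Sq but the higher adjusted value, because dropping two predictors frees up two residual degrees of freedom.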
Regression Analysis: Winning % versus OBP, SLG, Fielding %, WHIP

The regression equation is
Winning % = - 2.93 + 1.69 OBP + 0.876 SLG + 3.26 Fielding % - 0.510 WHIP

Predictor    Coef      SE Coef  T       P
Constant     -2.932    1.096    -2.68   0.008
OBP          1.6877    0.3101   5.44    0.000
SLG          0.8763    0.1552   5.64    0.000
Fielding %   3.26      1.116    2.92    0.004
WHIP         -0.51041  0.03172  -16.09  0.000

S = 0.0308740   R-Sq = 81.2%   R-Sq(adj) = 80.7%

Analysis of Variance

Source          DF   SS       MS       F       P
Regression      4    0.59662  0.14916  156.48  0.000
Residual Error  145  0.13821  0.00095
Total           149  0.73483

[Figure: Residuals Versus the Fitted Values (response is Winning %)]
[Figure: Normal Probability Plot of the Residuals (response is Winning %)]

We will now return to the outliers and leverage points we previously identified. The outliers are the following:

2007 New York Mets and 2004 Arizona Diamondbacks: These two teams were identified as outliers in the Stolen Base and K/9 fitted line plots respectively. Since these two variables have been removed from our simplified model and the teams were not identified as outliers for any of the other variables, we can presume that they no longer act as unusual observations.

2003 Detroit Tigers: With their .264 winning percentage, the 2003 Detroit Tigers were one of the worst teams of all time. While their predictive indicators were consistently below the mean, the team’s performance fell far short of their expected winning percentage of .319. Because it is such an extreme data point, we will remove it from our analysis.

2003 Los Angeles Dodgers: The Dodgers’ actual winning percentage of .521 significantly exceeded their expected winning percentage of .472. With a WHIP approximately 2 standard deviations below the mean, the Dodgers compensated for their relatively pedestrian offensive and fielding statistics.
As such we will exclude them from our analysis.

We will next remove these outliers from our regression analysis to determine their impact on our regression. We first revisit the best subsets regression to determine whether the simplified model is still superior.

Best Subsets Regression: Winning % versus OBP, SLG, ...

Response is Winning %

Vars  R-Sq  R-Sq(adj)  Mallows C-p  S         Predictors (where identified in the text)
1     49.7  49.3       205.7        0.047430
1     32.7  32.3       322.8        0.054838
2     74.4  74.0       37.0         0.033960
2     73.7  73.3       41.8         0.034420
3     78.7  78.2       9.3          0.031093
3     75.8  75.3       29.3         0.033130
4     79.7  79.1       4.3          0.030445  OBP, SLG, F%, WHIP
4     78.8  78.2       10.7         0.031139
5     79.8  79.1       5.2          0.030440
5     79.7  79.0       6.2          0.030543
6     79.9  79.0       7.0          0.030523  OBP, SLG, SB, F%, WHIP, K/9

Once again it appears that the simplified model based upon OBP, SLG, Fielding % and WHIP is the superior model. Running the full regression output for this model yields the following results:

Regression Analysis: Winning % versus OBP, SLG, Fielding %, WHIP

The regression equation is
Winning % = - 2.67 + 1.62 OBP + 0.878 SLG + 2.99 Fielding % - 0.496 WHIP

Predictor    Coef      SE Coef  T       P
Constant     -2.670    1.109    -2.41   0.017
OBP          1.6215    0.3112   5.21    0.000
SLG          0.8783    0.1536   5.72    0.000
Fielding %   2.994     1.123    2.67    0.009
WHIP         -0.49588  0.03204  -15.47  0.000

S = 0.0304448   R-Sq = 79.7%   R-Sq(adj) = 79.1%

Analysis of Variance

Source          DF   SS       MS       F       P
Regression      4    0.51296  0.12824  138.36  0.000
Residual Error  141  0.13069  0.00093
Total           145  0.64365

The removal of these data points has resulted in a slight decrease of R² to approximately 79.7%. Additionally, the standard error of the estimate has decreased slightly to approximately 0.0305, which implies the model can predict winning percentage to within roughly ±0.061 (2 x 0.0305) about 95% of the time. To put this into perspective, over the course of a 162 game season, this translates into an error of approximately ±9.88 wins (±0.061 x 162), thus tightening our interval by approximately 0.15 games on either side.
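As a rough illustration of how the final equation above could be used for forecasting, the sketch below plugs team statistics into the fitted model and converts the ±2S rule of thumb into wins. The coefficients and S are copied from the regression output; the function names are hypothetical, and the ±2S band is the same approximation used in the text, which ignores uncertainty in the estimated coefficients.

```python
# Coefficients and S copied from the four-predictor regression output
# (outliers removed). Names below are hypothetical, for illustration only.
COEFS = {"const": -2.670, "OBP": 1.6215, "SLG": 0.8783, "FP": 2.994, "WHIP": -0.49588}
STD_ERROR = 0.0304448
GAMES_PER_SEASON = 162

def predict_win_pct(obp, slg, fielding_pct, whip):
    """Predicted winning percentage from the fitted regression equation."""
    return (
        COEFS["const"]
        + COEFS["OBP"] * obp
        + COEFS["SLG"] * slg
        + COEFS["FP"] * fielding_pct
        + COEFS["WHIP"] * whip
    )

def rough_band_in_wins(std_error=STD_ERROR, games=GAMES_PER_SEASON):
    """Half-width of the rough +/- 2S band, expressed in wins over a season."""
    return 2 * std_error * games

def effect_of_change(predictor, delta):
    """Predicted change in winning percentage for a change in one predictor,
    holding the others constant."""
    return COEFS[predictor] * delta

# Example: a team at the sample means from the descriptive statistics table
# (means computed on all 150 observations, so this is only illustrative).
average_team = predict_win_pct(obp=0.33388, slg=0.42446, fielding_pct=0.98311, whip=1.3938)
print(f"predicted winning %: {average_team:.3f}")
print(f"rough band: +/- {rough_band_in_wins():.2f} wins")
print(f"one extra baserunner per game: {effect_of_change('WHIP', 1 / 9):+.3f}")
```

The band computed from the unrounded S is slightly narrower than the ±9.88 wins quoted above, which uses the rounded value 0.061.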
Most of our predictor coefficients have decreased slightly in magnitude too, with SLG essentially unchanged. Holding all other variables constant, the model indicates that a team which gives up one additional hit or walk per game (an increase in WHIP of 1/9, or 0.1111) can be expected to have a winning percentage that is lower by 0.055 (0.1111 x 0.49588), which translates into nearly 9 fewer wins over the course of a 162 game season. To affect a team’s winning percentage by the same amount, OBP, SLG and Fielding % would have to decrease by approximately 0.034, 0.063 and 0.018 respectively (0.055 divided by each coefficient). The very low P values for each predictor as well as for the overall regression allow us to reject the null hypothesis that the predictors are unrelated to the response.

We must also recheck our assumptions to ensure they still hold. We can analyze the model errors through the use of the residual plots below. The plots of residuals versus the fitted values as well as residuals versus each predictor still do not appear to show any structure that would indicate that the model assumptions are invalid.

[Figure: Residuals Versus the Fitted Values (response is Winning %)]
[Figure: Residuals Versus SLG (response is Winning %)]
[Figure: Residuals Versus OBP (response is Winning %)]
[Figure: Residuals Versus WHIP (response is Winning %)]
[Figure: Residuals Versus Fielding % (response is Winning %)]

The elimination of several outliers is evident in the new normal probability plot, as the data points now appear to follow a normal distribution slightly more closely.
[Figure: Normal Probability Plot of the Residuals (response is Winning %)]

Conclusion

This analysis evaluated how the fundamental aspects of the game (pitching, fielding, hitting and running) affect a team’s ability to win. We hypothesized that teams that successfully perform all of these fundamentals will achieve a winning record. The results of our analysis provided helpful insight into the relative significance of each of these aspects, and in some cases identified preferred statistics for measuring performance in the fundamentals.

Our analysis revealed that team performance in pitching, hitting and fielding is strongly related to winning percentage. For pitching and hitting, we analyzed two performance measures each: WHIP and K/9IP, and OBP and SLG respectively. Per our analysis, WHIP, a measure of a team’s ability to prevent batters from reaching base, is far more highly correlated with team winning percentage than K/9, a power-pitching measurement. For hitting, our analysis indicated that both OBP and SLG are positively correlated with team winning percentage.

Overall, WHIP, OBP, SLG and fielding percentage have relatively strong predictive ability for winning percentage. A team with a high OBP and SLG will usually score more runs, which leads to wins. A team with a high fielding percentage and low WHIP will usually give up fewer runs. From the regression, WHIP is the strongest predictor of winning percentage. Our model seems to support the saying that “good pitching will always beat good hitting.”

Running best subsets regression revealed that our regression model could be simplified by removing SB and K/9, each of which has very little predictive ability for winning percentage.
The insignificance of SB can be explained by two factors: the relative unimportance of stolen bases in the modern game of baseball, and the imperfection of stolen bases as a measure of running proficiency, as discussed in the Overview of Data. As for K/9, a pitcher doesn’t have to strike out a lot of batters to be successful. A wild pitcher might have a high K/9 but can also surrender many walks and runs.

The next question should be, “what should we do with this finding?” Over the past decade the science of analyzing baseball through objective evidence, called “sabermetrics”, has evolved and reached significant prominence. General Managers (GMs) of baseball teams use data such as ours to understand the relevance of the fundamentals, as well as their key measurements, when building their teams. Likewise, fans can use performance metrics to evaluate the strength of management decisions in “upgrading” their favorite team for the upcoming season.

Our regression analysis suggests that GMs and fans should emphasize quality pitching, though not necessarily “power pitching”, before focusing on a balanced hitting attack that delivers both consistent base-runners (as measured by OBP) and power (as measured by SLG). Additionally, solid defense is important to keep opposing runners off the base-paths. While a team should put less emphasis on stolen bases, the running game should not be ignored. The ability of a team to take the extra base or break up a double play is a good indication of its ability to use good base running to its advantage, but these plays have not been specifically accounted for in our analysis because they are not recorded in any statistical numbers. These intricacies make the game fun and hard to predict.