Bayesian Statistics Project
By: Nick Luerkens and Josh Gunderson

Research Question: Which statistic in Major League Baseball has more of an impact on a team's success, batting average or ERA?

Variables: We obtained data for all 30 Major League Baseball teams from the 2004 season.
- Predictor variables: batting average (x) and ERA (y)
- Response variable: season winning percentage (z)

Bayesian Model: We specified our model by assuming normal distributions for all three of our variables. We analyze our data using multiple linear regression: mu (z) = alpha + beta (x) + gamma (y).

WinBUGS Code:

    model {
        for (i in 1:N) {
            # likelihood: dnorm takes a mean and a precision (1 / variance)
            wins[i] ~ dnorm(mu[i], tau)
            mu[i] <- alpha + beta * batting[i] + gamma * era[i]
        }
        # priors
        tau ~ dgamma(.01, .01)
        alpha ~ dnorm(0, .01)
        beta ~ dnorm(.954, .01)
        gamma ~ dnorm(-.057, .01)
    }

    list(N = 30)

Prior parameters:
* We must guess what E (x) and E (y) will be non-informatively: E (x) = 0.262 and E (y) = 4.39.
- win%: mu [i] for win% will be 0.5, because we assume the average team wins half of its games.
- alpha: alpha is the intercept of our multiple regression equation. We set its prior mean to zero.
- beta and gamma: 0.5 = alpha + beta (0.262) + -gamma (4.39). Assume that batting average and ERA have the same impact on wins, so each term contributes 0.25:
    0.25 = beta (0.262); beta = 0.954
    0.25 = -gamma (4.39); gamma = -0.057
  (gamma needs to be negative because the slope of ERA vs. wins is negative in simple regression.)

WinBUGS Output and Interpretation: The first plot is a scatter plot of ERA (x-axis) vs. winning % (y-axis). The second is a scatter plot of batting average (x-axis) vs. winning % (y-axis).
[Figures: two scatter plots of winning % (y-axis, 0.3 to 0.7), one against batting average (x-axis, 0.24 to 0.29) and one against ERA (x-axis, 3.5 to 6.0)]

[Figure: bivariate posterior scatter plot of beta vs. gamma]

[Figure: time series of beta, chains 1:3, by iteration]

[Figure: time series of gamma, chains 1:3, by iteration]

[Figures: two box plots of mu]
Top: batting average (boxes) vs. winning percentage (y-axis)
Bottom: ERA (boxes) vs. winning % (y-axis)

Node statistics

node    mean    sd       MC error  median  start  sample
mu[1]   0.5844  0.0216   2.508E-4  0.5845  17001  9000
mu[2]   0.3887  0.02172  2.225E-4  0.3888  17001  9000
mu[3]   0.5901  0.0208   2.129E-4  0.5898  17001  9000
mu[4]   0.5363  0.02211  2.542E-4  0.5364  17001  9000
mu[5]   0.5949  0.02197  2.534E-4  0.595   17001  9000
mu[6]   0.5743  0.01924  1.952E-4  0.574   17001  9000
mu[7]   0.4594  0.01609  1.679E-4  0.4595  17001  9000
mu[8]   0.3541  0.02675  2.753E-4  0.354   17001  9000
mu[9]   0.5037  0.01893  2.121E-4  0.5038  17001  9000
mu[10]  0.4232  0.03184  3.331E-4  0.4232  17001  9000
mu[11]  0.4742  0.01809  1.948E-4  0.4743  17001  9000
mu[12]  0.5271  0.01447  1.448E-4  0.5271  17001  9000
mu[13]  0.545   0.01483  1.512E-4  0.545   17001  9000
mu[14]  0.3963  0.02115  2.122E-4  0.3965  17001  9000
mu[15]  0.5281  0.01665  1.659E-4  0.5281  17001  9000
mu[16]  0.4449  0.02493  2.628E-4  0.445   17001  9000
mu[17]  0.5429  0.01522  1.538E-4  0.5429  17001  9000
mu[18]  0.4397  0.0233   2.451E-4  0.4397  17001  9000
mu[19]  0.4648  0.02537  2.653E-4  0.4648  17001  9000
mu[20]  0.4824  0.01281  1.357E-4  0.4826  17001  9000
mu[21]  0.5452  0.01358  1.449E-4  0.5451  17001  9000
mu[22]  0.5032  0.01111  1.165E-4  0.5032  17001  9000
mu[23]  0.4904  0.01399  1.397E-4  0.4904  17001  9000
mu[24]  0.5725  0.01661  1.796E-4  0.5724  17001  9000
mu[25]  0.5326  0.01246  1.35E-4   0.5326  17001  9000
mu[26]  0.4835  0.01443  1.555E-4  0.4837  17001  9000
mu[27]  0.6229  0.0236   2.551E-4  0.6226  17001  9000
mu[28]  0.4276  0.01632  1.643E-4  0.4278  17001  9000
mu[29]  0.4907  0.0112   1.16E-4   0.4907  17001  9000
mu[30]  0.4256  0.01666  1.671E-4  0.4258  17001  9000
Totals: E[mu] = 0.4936

Conclusion: Our output shows that both predictor variables have a substantial impact on the response variable, and the chains converged. Determining which variable has more of an impact is not as clear. Our approach is as follows. First, take the individual node statistics for our slopes beta and gamma:

Node statistics

node   mean     sd       MC error  median   start  sample
beta   4.221    1.163    0.005086  4.23     1001   57000
gamma  -0.1043  0.02434  9.471E-5  -0.1043  1001   57000

• To determine which variable has more impact, we multiply each posterior mean slope by the standard deviation of the corresponding predictor in our actual dataset, as computed by SAS.
• For any given value of ERA, each one standard deviation increase in batting average increases Win % by 4.221 (0.01) = 0.04221 (0.01 is the standard deviation from SAS).
• For any given value of batting average, each one standard deviation increase in ERA decreases Win % by 0.1043 (0.466) = 0.0486 (0.466 is the standard deviation from SAS).
• In conclusion, a one standard deviation change in ERA has more of an impact on Win % than a one standard deviation change in batting average, by 0.00639. Over a 162-game season, this is approximately 1.04 wins per season.
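
The comparison in the conclusion can be reproduced with a few lines of arithmetic. The sketch below is in Python rather than WinBUGS or SAS, and it uses only the posterior means and standard deviations reported above. The variable and function names are illustrative and are not part of the original analysis.

```python
# Posterior mean slopes, taken from the node statistics for beta and gamma.
BETA_MEAN = 4.221      # slope for batting average
GAMMA_MEAN = -0.1043   # slope for ERA

# Sample standard deviations of the predictors, as reported from SAS.
SD_BATTING = 0.01
SD_ERA = 0.466

GAMES_PER_SEASON = 162


def effect_per_sd(slope, sd):
    """Change in winning percentage for a one standard deviation increase."""
    return slope * sd


batting_effect = effect_per_sd(BETA_MEAN, SD_BATTING)  # positive: more wins
era_effect = effect_per_sd(GAMMA_MEAN, SD_ERA)         # negative: fewer wins

# Compare magnitudes, since the two slopes have opposite signs.
gap = abs(era_effect) - abs(batting_effect)
wins_gap = gap * GAMES_PER_SEASON

print(f"Batting average, per SD: {batting_effect:+.5f}")
print(f"ERA, per SD:             {era_effect:+.5f}")
print(f"Gap in winning %:        {gap:.5f}")
print(f"Gap in wins per season:  {wins_gap:.2f}")
```

The comparison uses absolute values because the question is which predictor moves winning percentage more, regardless of direction. The sign of each effect is kept in the printed output so that the direction stays visible.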