Bayesian Statistics Project

By: Nick Luerkens and Josh Gunderson
Research Question: Which statistic in Major League Baseball has more of an impact on a
team's success: batting average or earned run average (ERA)?
Variables: We obtained data for all 30 Major League Baseball teams from the 2004
season.
- Predictor variables: batting average (x) and ERA (y)
- Response variable: season winning percentage (z)
Bayesian Model: We specified our model by assuming normal distributions for all three
of our variables. We analyze our data using multiple linear regression, where the mean
of the response is mu(z) = alpha + beta * x + gamma * y.
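For reference, the model can be written out in full as below. This is a restatement of the WinBUGS code that follows, not an additional assumption by the authors. It relies on two WinBUGS conventions: the second argument of dnorm is a precision (so the variance is its reciprocal), and dgamma takes a shape and a rate. The subscript i indexes the 30 teams.

```latex
\begin{aligned}
z_i &\sim \mathcal{N}\!\left(\mu_i,\ \tau^{-1}\right), \qquad i = 1, \dots, 30,\\
\mu_i &= \alpha + \beta\, x_i + \gamma\, y_i,\\
\tau &\sim \mathrm{Gamma}(0.01,\ 0.01),\\
\alpha &\sim \mathcal{N}\!\left(0,\ 1/0.01\right),\\
\beta &\sim \mathcal{N}\!\left(0.954,\ 1/0.01\right),\\
\gamma &\sim \mathcal{N}\!\left(-0.057,\ 1/0.01\right).
\end{aligned}
```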
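WinBUGS Code:
model
{
   # Likelihood: winning percentage is normal around the regression mean
   for (i in 1:N) {
      wins[i] ~ dnorm(mu[i], tau)
      mu[i] <- alpha + beta * batting[i] + gamma * era[i]
   }
   # Priors (the second argument of dnorm is a precision, not a variance)
   tau ~ dgamma(0.01, 0.01)
   alpha ~ dnorm(0, 0.01)
   beta ~ dnorm(0.954, 0.01)
   gamma ~ dnorm(-0.057, 0.01)
}
# Data (only N is shown; the wins, batting and era vectors are not listed)
list(N = 30)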
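Because WinBUGS parameterizes dnorm by precision rather than variance, it can help to see how wide these priors are on the standard deviation scale. The sketch below is in Python and is only an illustration; the dictionary and function names are hypothetical, and the numbers are copied from the priors in the model.

```python
import math

# (prior mean, prior precision) pairs copied from the WinBUGS model.
priors = {
    "alpha": (0.0, 0.01),
    "beta": (0.954, 0.01),
    "gamma": (-0.057, 0.01),
}


def precision_to_sd(precision: float) -> float:
    """Convert a normal precision (1 / variance) to a standard deviation."""
    return 1.0 / math.sqrt(precision)


# Standard deviation implied by each prior precision.
prior_sd = {name: precision_to_sd(prec) for name, (_, prec) in priors.items()}

for name, (mean, _) in priors.items():
    print(f"{name}: prior mean = {mean}, prior sd = {prior_sd[name]:.1f}")
```

A precision of 0.01 corresponds to a prior standard deviation of 10, which is very wide compared with winning percentages that lie between 0 and 1.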
Prior parameters:
- We must guess E(x) and E(y) without using strong prior information. We take
E(x) = 0.262 and E(y) = 4.39.
- win%: mu[i] for win% is set to 0.5, because we assume the average team wins half
of its games.
- alpha: alpha is the intercept of our multiple regression equation. We set its
prior mean to zero.
- beta and gamma: 0.5 = alpha + beta (0.262) + (-gamma) (4.39). Assume that batting
average and ERA have the same impact on wins, so each term contributes 0.25:
0.25 = beta (0.262), so beta = 0.954
0.25 = -gamma (4.39), so gamma = -0.057
(gamma must be negative because the slope of ERA vs. wins is negative in simple
regression.)
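The prior-mean arithmetic above can be reproduced with a few lines of Python. This is a sketch of the calculation as described in the text, not code the authors used; the variable names are hypothetical.

```python
# Guessed league averages from the text.
mean_batting = 0.262
mean_era = 4.39

# Target winning percentage for an average team, and the intercept's prior mean.
target_win_pct = 0.5
alpha_prior = 0.0

# Split the target equally between the two predictors (0.25 each).
share = (target_win_pct - alpha_prior) / 2

beta_prior = share / mean_batting   # 0.25 = beta * 0.262
gamma_prior = -share / mean_era     # 0.25 = -gamma * 4.39

# Prior mean of mu at the league averages under the model equation
# mu = alpha + beta * batting + gamma * era.
mu_at_means = alpha_prior + beta_prior * mean_batting + gamma_prior * mean_era

print(round(beta_prior, 3), round(gamma_prior, 3))
print(mu_at_means)
```

The first line of output matches the prior means 0.954 and -0.057 used in the model. The last value shows that, because gamma enters the model equation with a plus sign, the two 0.25 contributions cancel at the league averages rather than summing to 0.5. With priors as wide as the ones used here this may matter little for the posterior, but that is an assumption the sketch does not check.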
WinBUGS Output and Interpretation:

The first figure is a scatter plot of ERA (x-axis) vs. winning % (y-axis).

The second figure is a scatter plot of batting average (x-axis) vs. winning % (y-axis).
[Figure: two scatter plots with winning % on the y-axis (0.3 to 0.7). The x-axis is ERA (3.5 to 6.0) in one panel and batting average (0.24 to 0.29) in the other.]
[Figure: bivariate posterior scatter plot of beta (y-axis, -5.0 to 10.0) against gamma (x-axis, ticks at -0.3 and -0.1).]
[Figure: time series plots for beta, chains 1:3 (y-axis -5.0 to 10.0), and gamma, chains 1:3 (y-axis -0.3 to 0.1), against iteration (axis ticks from 1001 to 15000).]
[Figure: two box plots of mu[1] to mu[30], with winning percentage on the y-axis (0.3 to 0.7). Top: batting average (boxes) vs. winning percentage (y-axis). Bottom: ERA (boxes) vs. winning % (y-axis).]
Node statistics
node    mean    sd       MC error  median  start  sample
mu[1]   0.5844  0.0216   2.508E-4  0.5845  17001  9000
mu[2]   0.3887  0.02172  2.225E-4  0.3888  17001  9000
mu[3]   0.5901  0.0208   2.129E-4  0.5898  17001  9000
mu[4]   0.5363  0.02211  2.542E-4  0.5364  17001  9000
mu[5]   0.5949  0.02197  2.534E-4  0.595   17001  9000
mu[6]   0.5743  0.01924  1.952E-4  0.574   17001  9000
mu[7]   0.4594  0.01609  1.679E-4  0.4595  17001  9000
mu[8]   0.3541  0.02675  2.753E-4  0.354   17001  9000
mu[9]   0.5037  0.01893  2.121E-4  0.5038  17001  9000
mu[10]  0.4232  0.03184  3.331E-4  0.4232  17001  9000
mu[11]  0.4742  0.01809  1.948E-4  0.4743  17001  9000
mu[12]  0.5271  0.01447  1.448E-4  0.5271  17001  9000
mu[13]  0.545   0.01483  1.512E-4  0.545   17001  9000
mu[14]  0.3963  0.02115  2.122E-4  0.3965  17001  9000
mu[15]  0.5281  0.01665  1.659E-4  0.5281  17001  9000
mu[16]  0.4449  0.02493  2.628E-4  0.445   17001  9000
mu[17]  0.5429  0.01522  1.538E-4  0.5429  17001  9000
mu[18]  0.4397  0.0233   2.451E-4  0.4397  17001  9000
mu[19]  0.4648  0.02537  2.653E-4  0.4648  17001  9000
mu[20]  0.4824  0.01281  1.357E-4  0.4826  17001  9000
mu[21]  0.5452  0.01358  1.449E-4  0.5451  17001  9000
mu[22]  0.5032  0.01111  1.165E-4  0.5032  17001  9000
mu[23]  0.4904  0.01399  1.397E-4  0.4904  17001  9000
mu[24]  0.5725  0.01661  1.796E-4  0.5724  17001  9000
mu[25]  0.5326  0.01246  1.35E-4   0.5326  17001  9000
mu[26]  0.4835  0.01443  1.555E-4  0.4837  17001  9000
mu[27]  0.6229  0.0236   2.551E-4  0.6226  17001  9000
mu[28]  0.4276  0.01632  1.643E-4  0.4278  17001  9000
mu[29]  0.4907  0.0112   1.16E-4   0.4907  17001  9000
mu[30]  0.4256  0.01666  1.671E-4  0.4258  17001  9000
Totals: E[mu], mean 0.4936
Conclusion: Our output indicates that both predictor variables have a substantial impact
on our response variable, and the chains converged. Determining which variable has more
of an impact is not as clear. To compare them, we first take the individual node
statistics for our slopes beta and gamma:
Node statistics
node   mean     sd       MC error  median   start  sample
beta   4.221    1.163    0.005086  4.23     1001   57000
gamma  -0.1043  0.02434  9.471E-5  -0.1043  1001   57000
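A quick way to read the beta and gamma node statistics is to compare each posterior mean with its posterior standard deviation, and each MC error with the same standard deviation. The Python sketch below does this using only the numbers reported above. The mean plus or minus two standard deviations interval is a rough normal approximation chosen for illustration, not the 2.5% and 97.5% quantiles from WinBUGS, and the 5% threshold for MC error is a common rule of thumb rather than something stated by the authors.

```python
# Posterior summaries copied from the node statistics for the two slopes.
posterior = {
    "beta": {"mean": 4.221, "sd": 1.163, "mc_error": 0.005086},
    "gamma": {"mean": -0.1043, "sd": 0.02434, "mc_error": 9.471e-5},
}

summary = {}
for name, s in posterior.items():
    summary[name] = {
        # How many posterior sds the mean lies from zero.
        "mean_over_sd": s["mean"] / s["sd"],
        # MC error as a fraction of the posterior sd (rule of thumb: below 0.05).
        "mc_fraction": s["mc_error"] / s["sd"],
        # Rough interval from a normal approximation: mean +/- 2 sd.
        "approx_interval": (s["mean"] - 2 * s["sd"], s["mean"] + 2 * s["sd"]),
    }

for name, row in summary.items():
    lo, hi = row["approx_interval"]
    print(
        f"{name}: mean/sd = {row['mean_over_sd']:.2f}, "
        f"MC error/sd = {row['mc_fraction']:.4f}, "
        f"approx. interval = ({lo:.4f}, {hi:.4f})"
    )
```

Under this approximation both intervals exclude zero, which is consistent with the claim that both predictors matter.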
• To determine which variable has more impact, we multiply each posterior mean by the
standard deviation of the corresponding variable in our actual dataset, as computed by SAS.
• For any given value of ERA, each one standard deviation increase in batting
average increases Win % by 4.221 (0.01) = 0.04221 (0.01 is the standard deviation from SAS).
• For any given value of batting average, each one standard deviation increase in
ERA decreases Win % by 0.1043 (0.466) = 0.0486 (0.466 is the standard deviation from SAS).
• In conclusion, a one standard deviation change in ERA has more of an impact on
Win % than a one standard deviation change in batting average, by
0.0486 - 0.04221 = 0.00639. Over a 162-game season this translates to approx. 1.04 wins/season.
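The standardized comparison above can be checked with a short calculation. This Python sketch uses the posterior means and the SAS standard deviations quoted in the text; the variable names are hypothetical.

```python
# Posterior means of the slopes (from the WinBUGS node statistics).
beta_mean = 4.221      # slope for batting average
gamma_mean = -0.1043   # slope for ERA

# Sample standard deviations of the predictors (reported from SAS).
sd_batting = 0.01
sd_era = 0.466

games_per_season = 162

# Change in winning percentage for a one standard deviation increase
# in each predictor, holding the other fixed.
effect_batting = beta_mean * sd_batting
effect_era = gamma_mean * sd_era

# Compare magnitudes, since the ERA effect is negative.
difference = abs(effect_era) - abs(effect_batting)
wins_difference = difference * games_per_season

print(f"batting average: {effect_batting:+.5f} win % per sd")
print(f"ERA:             {effect_era:+.5f} win % per sd")
print(f"difference:      {difference:.5f} win %, "
      f"about {wins_difference:.2f} wins per season")
```

The signs make the direction explicit: a higher batting average raises winning percentage, while a higher ERA lowers it, and the comparison is between the sizes of the two effects.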