nb nascar

advertisement
Case Study – Negative Binomial Regression
NASCAR Lead Changes 1975-1979
Dependent Variable: Lead Changes
Independent Variables: #Drivers, Laps, Track Length
Model 1: Poisson Regression
Random Component: Poisson Distribution for the # of lead changes
Mass Function: P(Y  y | X 1 , X 2 , X 3 ) 
e   ( X )  ( X )
y!
y
y  0,1,2,...
Link Function: g() = log()
Systematic Component:
g (  )  log(  )     1 X 1   2 X 2   3 X 3
   e   1 X 1   2 X 2   3 X 3
Parameter
Intercept
Laps
Drivers
Trklength
Estimate
-0.4903
0.0021
0.0516
0.6104
Std Error
0.2178
0.0004
0.0057
0.0829
Z
-2.25
5.15
9.09
7.36
P-value
.0244
<.0001
<.0001
<.0001
Testing Goodness-of-Fit
Breaking races down into 10 groups of approximately equal size based on
their fitted values:
Range
0-9.4
9.4-10.5
10.5-11.6
11.6-20
20-21
21-23
23-26
26-32
32-36
36+
Total
#Races
15
14
14
17
19
15
16
16
11
13
151
obs
fit
113
138
178
321
485
191
353
491
349
574
131.3
150.6
157.1
274.4
390.3
328.7
397.1
452.9
374.2
536.4
Pearson
-1.60
-1.03
1.67
2.81
4.79
-7.60
-2.21
1.79
-1.30
1.62
2
X =107.4
Mean
7.53
9.20
12.71
18.88
25.53
12.73
22.06
30.69
31.73
44.15
Variance
23.41
34.46
41.30
56.36
89.93
48.21
74.33
183.70
201.82
229.47
The Pearson residuals are obtained by computing:
^
ei 
Yi   i
V (Yi )
^

Yi   i
^
i

observed - fitted
fitted
X 2   ei2
Under the hypothesis that the model is adequate, X2 is approximately chisquare with 10-4=6 degrees of freedom. The critical value for a =0.05 level
test is 12.59. The data clearly are not consistent with the model. Also note
that the variances within each group are orders of magnitude larger than the
mean.
Negative Binomial Model:
Random Component: Negative Binomial Distribution for the # of lead
changes
Mass Function:
( y  k )  k 


P(Y  y | X 1 , X 2 , X 3 , k ) 
(k )( y  1)  k   
k
  


k 
y
y  0,1,2,...
E(Y) =  V(Y) = 2/k)
Link Function: g() = log()
Systematic Component:
g (  )  log(  )    1 X1   2 X 2  3 X 3
   e  1 X 1   2 X 2   3 X 3  e x ' 
x'  1
X1
X2
X 3 
Note that SAS and STATA estimate (1/k).
Parameter
Intercept
Laps
Drivers
Trklength
1/k
Estimate
-0.5038
0.0017
0.0597
0.5153
0.1905
Std Error
0.4616
0.0009
0.0143
0.1636
0.0294
Z
-1.09
2.01
4.17
2.87
P-value
.2752
.0447
<.0001
.0041
Goodness-of-Fit based on Pearson Residuals for this model is obtained
below with:
^
ei 
Yi   i
V (Yi )
^

Yi   i
^
^ 2
^
i  i / k
X 2   ei2
Range
0-9.4
9.4-10.5
10.5-11.6
11.6-20
20-21
21-23
23-26
26-32
32-36
36+
Total
#Races
13
17
20
11
21
12
18
16
14
9
151
obs
fit
96
155
248
251
523
141
442
470
445
422
111.4
170.2
223.3
202.4
431.5
261.8
452.0
464.3
485.3
397.5
Pearson
-0.30952
-0.20153
0.250504
0.543148
0.482911
-1.04674
-0.0504
0.02797
-0.18924
0.140292
X2=1.88
Mean
7.38
9.12
12.40
22.82
24.90
11.75
24.56
29.38
31.79
46.89
S.D.
4.22
6.10
5.87
5.53
9.55
6.44
10.98
14.83
14.12
13.82
Clearly this model fits better than Poisson Regression Model. For the
negative binomial model, the standard deviation divided by the mean is
1 / k , which is estimated to be 0.436. For these 10 cells, the ratios range
from 0.24 to 0.67, which is consistent with that.
Sources:
Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley
Agresti, A. (2002). Categorical Data Analysis. Wiley
www.nascar.com
Computational Aspects:
k is restricted to be positive, so we estimate k*=log(k), which can take on any value. Note
that SAS and STATA are estimating 1/k in this parameterization, so that thet are
estimating –k* in the current notation.
Likelihood Function:
( y i  k )  k

li 
(k )( y i )  k   i



k
 i

 k  i
( y  k  1)  (k )  k

 i
( y i  1)!
 k  i



k
yi

( y  k  1)  (k )(k )  k
  i


(
k
)

(
y
)
i

 k  i
 i

 k  i
yi

 




k
 i

 k  i
( y i  e k *  1)  e k *  e k *
 k*
e 
( y i  1)!
i





ek *



yi
 i
 k*
e 
i





yi
Log-Likelihood Function:
yi 1
Li   ln( e k *  j )  ln ( yi  1)!  e k * ln( e k * )  yi ln(  i )  (e k *  yi ) ln(  i  e k * )
j 0
Derivatives wrt k* and :
 yi 1 1

Li
e k*  yi
 e k *  k *
 1  ln( e k * )  k *
 ln( e k *   i )
k *
e  i
 j 0 e  j

yi 1
 yi 1 1
  y
 2 Li
e k*  yi
1
k* 
k*
k*
k*
k* 
i
i

e

1

ln(
e
)


ln(
e


)

e

1

e



i
2
k*
k*
k*
2
k
*

 (k *)
e  i
 j)
j  0 (e
 j 0 e  j
 i  e

 y  
 2 Li
i
 xi e k *  i  i

2
k
*
k * 

  i  e


 y  i 
Li
 xi e k *  i
k* 

 i  e 
 e k*  y 
 2 Li
i
  xi xi ' e k *  i 

k* 2
 '
  i  e




2
k*


 e
  i  e k * 


Newton-Raphson Algorithm Steps:
L
g k   i
k *
L
g    i

g 
g k    
 gk 
 2 Li
Gk  
k *2
 2 Li
G   
 '
G k

G




 2 Li
  
k * 

 2 Li 

k *  

Gk *






'
Step 1: Set k*=0 (k=1/k=1) or the Poisson model:
~ (i )
Iterate to obtain estimate of : 
Step 2: Set  ’ = [ 1 0 0 0 ]
~ ( i 1)

 
 G
1
g
(No regression relation):
~ (i )
Iterate to obtain estimate of k* : k *
~ ( i 1)
k*
 Gk  g k
1
Step 3: Use results from steps 1 and 2 as starting values (software seems to use different
value for intercept), to obtain simultaneous estimates of  and k*
~ (i )
Iterate to obtain estimate of : 
~ ( i 1)

 
 G k
1
g k
 
  
k *
Step 4: Back-transform k* estimate to get estimate of k (k=ek*).
Download