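Case Study – Negative Binomial Regression

NASCAR Lead Changes 1975-1979

Dependent Variable: Lead Changes
Independent Variables: #Drivers, Laps, Track Length

Model 1: Poisson Regression

Random Component: Poisson distribution for the # of lead changes

Mass Function:

$$P(Y=y \mid X_1,X_2,X_3) = \frac{e^{-\mu(X)}\,[\mu(X)]^{y}}{y!}, \qquad y=0,1,2,\ldots$$

Link Function: $g(\mu) = \log(\mu)$

Systematic Component:

$$g(\mu) = \log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \quad\Rightarrow\quad \mu = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}$$

| Parameter | Estimate | Std Error | Z | P-value |
|---|---|---|---|---|
| Intercept | -0.4903 | 0.2178 | -2.25 | .0244 |
| Laps | 0.0021 | 0.0004 | 5.15 | <.0001 |
| Drivers | 0.0516 | 0.0057 | 9.09 | <.0001 |
| Trklength | 0.6104 | 0.0829 | 7.36 | <.0001 |

Testing Goodness-of-Fit

Breaking the races down into 10 groups of approximately equal size based on their fitted values:

| Range (fitted) | #Races | Observed | Fitted | Pearson | Mean | Variance |
|---|---|---|---|---|---|---|
| 0-9.4 | 15 | 113 | 131.3 | -1.60 | 7.53 | 23.41 |
| 9.4-10.5 | 14 | 138 | 150.6 | -1.03 | 9.20 | 34.46 |
| 10.5-11.6 | 14 | 178 | 157.1 | 1.67 | 12.71 | 41.30 |
| 11.6-20 | 17 | 321 | 274.4 | 2.81 | 18.88 | 56.36 |
| 20-21 | 19 | 485 | 390.3 | 4.79 | 25.53 | 89.93 |
| 21-23 | 15 | 191 | 328.7 | -7.60 | 12.73 | 48.21 |
| 23-26 | 16 | 353 | 397.1 | -2.21 | 22.06 | 74.33 |
| 26-32 | 16 | 491 | 452.9 | 1.79 | 30.69 | 183.70 |
| 32-36 | 11 | 349 | 374.2 | -1.30 | 31.73 | 201.82 |
| 36+ | 13 | 574 | 536.4 | 1.62 | 44.15 | 229.47 |
| Total | 151 | | | $X^2 = 107.4$ | | |

The Pearson residuals are obtained by computing:

$$e_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{V}(Y_i)}} = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}} = \frac{\text{observed} - \text{fitted}}{\sqrt{\text{fitted}}}, \qquad X^2 = \sum_i e_i^2$$

Under the hypothesis that the model is adequate, $X^2$ is approximately chi-square with 10-4=6 degrees of freedom. The critical value for an $\alpha = 0.05$ level test is 12.59. The data clearly are not consistent with the model. Also note that the variances within each group are roughly three to six times the corresponding means, whereas the Poisson model requires the variance to equal the mean.

Negative Binomial Model:

Random Component: Negative Binomial distribution for the # of lead changes

Mass Function:

$$P(Y=y \mid X_1,X_2,X_3,k) = \frac{\Gamma(y+k)}{\Gamma(k)\,\Gamma(y+1)} \left(\frac{k}{\mu+k}\right)^{k} \left(\frac{\mu}{\mu+k}\right)^{y}, \qquad y=0,1,2,\ldots$$

$$E(Y) = \mu, \qquad V(Y) = \mu + \frac{\mu^2}{k}$$

Link Function: $g(\mu) = \log(\mu)$

Systematic Component:

$$g(\mu) = \log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \quad\Rightarrow\quad \mu = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3} = e^{x'\beta}, \qquad x' = [\,1 \;\; X_1 \;\; X_2 \;\; X_3\,]$$

Note that SAS and Stata estimate 1/k.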
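The grouped Pearson statistic for the Poisson model can be checked directly from the observed and fitted totals in its goodness-of-fit table. The following is a minimal sketch, assuming NumPy is available; the variable names (`obs`, `fit`, `pearson`, `x2`, `ratio`) are illustrative and not part of the original analysis, and the critical value 12.59 is simply copied from the text.

```python
import numpy as np

# Group totals copied from the Poisson goodness-of-fit table
obs = np.array([113, 138, 178, 321, 485, 191, 353, 491, 349, 574], dtype=float)
fit = np.array([131.3, 150.6, 157.1, 274.4, 390.3, 328.7, 397.1, 452.9, 374.2, 536.4])

# Within-group sample means and variances of lead changes (same table)
mean = np.array([7.53, 9.20, 12.71, 18.88, 25.53, 12.73, 22.06, 30.69, 31.73, 44.15])
var = np.array([23.41, 34.46, 41.30, 56.36, 89.93, 48.21, 74.33, 183.70, 201.82, 229.47])

# Poisson Pearson residual: (observed - fitted) / sqrt(fitted)
pearson = (obs - fit) / np.sqrt(fit)
x2 = np.sum(pearson ** 2)

# Critical value for chi-square with 6 df at alpha = 0.05, as quoted in the text
crit = 12.59

# Variance-to-mean ratio per group; a Poisson model would put these near 1
ratio = var / mean

print("Pearson residuals:", np.round(pearson, 2))
print("X^2 =", round(x2, 1), "| exceeds critical value:", x2 > crit)
print("variance/mean ratios:", np.round(ratio, 2))
```

The residuals depend only on the group totals, so the check does not need the race-level data.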
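| Parameter | Estimate | Std Error | Z | P-value |
|---|---|---|---|---|
| Intercept | -0.5038 | 0.4616 | -1.09 | .2752 |
| Laps | 0.0017 | 0.0009 | 2.01 | .0447 |
| Drivers | 0.0597 | 0.0143 | 4.17 | <.0001 |
| Trklength | 0.5153 | 0.1636 | 2.87 | .0041 |
| 1/k | 0.1905 | 0.0294 | | |

Goodness-of-Fit based on Pearson Residuals for this model is obtained below with:

$$e_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{V}(Y_i)}} = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i + \hat{\mu}_i^2/\hat{k}}}, \qquad X^2 = \sum_i e_i^2$$

| Range (fitted) | #Races | Observed | Fitted | Pearson | Mean | S.D. |
|---|---|---|---|---|---|---|
| 0-9.4 | 13 | 96 | 111.4 | -0.30952 | 7.38 | 4.22 |
| 9.4-10.5 | 17 | 155 | 170.2 | -0.20153 | 9.12 | 6.10 |
| 10.5-11.6 | 20 | 248 | 223.3 | 0.250504 | 12.40 | 5.87 |
| 11.6-20 | 11 | 251 | 202.4 | 0.543148 | 22.82 | 5.53 |
| 20-21 | 21 | 523 | 431.5 | 0.482911 | 24.90 | 9.55 |
| 21-23 | 12 | 141 | 261.8 | -1.04674 | 11.75 | 6.44 |
| 23-26 | 18 | 442 | 452.0 | -0.0504 | 24.56 | 10.98 |
| 26-32 | 16 | 470 | 464.3 | 0.02797 | 29.38 | 14.83 |
| 32-36 | 14 | 445 | 485.3 | -0.18924 | 31.79 | 14.12 |
| 36+ | 9 | 422 | 397.5 | 0.140292 | 46.89 | 13.82 |
| Total | 151 | | | $X^2 = 1.88$ | | |

Clearly this model fits better than the Poisson regression model. For the negative binomial model, the standard deviation divided by the mean is $\sqrt{1/\mu + 1/k}$, which is approximately $\sqrt{1/k}$ when the mean is large; this is estimated to be $\sqrt{0.1905} = 0.436$. For these 10 cells, the ratios range from 0.24 to 0.67, which is consistent with that.

Sources:
Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley.
Agresti, A. (2002). Categorical Data Analysis. Wiley.
www.nascar.com

Computational Aspects:

k is restricted to be positive, so we estimate k* = log(k), which can take on any value. Note that SAS and Stata estimate 1/k in this parameterization, so on the log scale they are estimating -k* in the current notation.

Likelihood Function:

$$\begin{aligned}
l_i &= \frac{\Gamma(y_i+k)}{\Gamma(k)\,\Gamma(y_i+1)} \left(\frac{k}{\mu_i+k}\right)^{k} \left(\frac{\mu_i}{\mu_i+k}\right)^{y_i} \\
&= \frac{(y_i+k-1)(y_i+k-2)\cdots(k)\,\Gamma(k)}{\Gamma(k)\; y_i!} \left(\frac{k}{\mu_i+k}\right)^{k} \left(\frac{\mu_i}{\mu_i+k}\right)^{y_i} \\
&= \frac{(y_i+e^{k^*}-1)\cdots(e^{k^*})}{y_i!} \left(\frac{e^{k^*}}{\mu_i+e^{k^*}}\right)^{e^{k^*}} \left(\frac{\mu_i}{\mu_i+e^{k^*}}\right)^{y_i}
\end{aligned}$$

Log-Likelihood Function:

$$L_i = \sum_{j=0}^{y_i-1} \ln\!\left(e^{k^*}+j\right) - \ln(y_i!) + e^{k^*}\ln\!\left(e^{k^*}\right) + y_i \ln(\mu_i) - \left(e^{k^*}+y_i\right)\ln\!\left(\mu_i+e^{k^*}\right)$$

Derivatives wrt k* and $\beta$ (with $\mu_i = e^{x_i'\beta}$):

$$\frac{\partial L_i}{\partial k^*} = e^{k^*}\left[\sum_{j=0}^{y_i-1}\frac{1}{e^{k^*}+j} + 1 + \ln\!\left(e^{k^*}\right) - \ln\!\left(e^{k^*}+\mu_i\right) - \frac{e^{k^*}+y_i}{e^{k^*}+\mu_i}\right]$$

$$\frac{\partial^2 L_i}{\partial (k^*)^2} = \frac{\partial L_i}{\partial k^*} + e^{k^*}\left[-e^{k^*}\sum_{j=0}^{y_i-1}\frac{1}{\left(e^{k^*}+j\right)^2} + 1 - \frac{e^{k^*}}{e^{k^*}+\mu_i} - \frac{e^{k^*}(\mu_i-y_i)}{\left(e^{k^*}+\mu_i\right)^2}\right]$$

$$\frac{\partial^2 L_i}{\partial \beta\,\partial k^*} = x_i\, e^{k^*}\mu_i\, \frac{y_i-\mu_i}{\left(e^{k^*}+\mu_i\right)^2}$$

$$\frac{\partial L_i}{\partial \beta} = x_i\, e^{k^*}\, \frac{y_i-\mu_i}{e^{k^*}+\mu_i}$$

$$\frac{\partial^2 L_i}{\partial \beta\,\partial \beta'} = -\,x_i x_i'\, e^{k^*}\mu_i\, \frac{e^{k^*}+y_i}{\left(e^{k^*}+\mu_i\right)^2}$$

Newton-Raphson Algorithm Steps:

$$g_{k^*} = \sum_i \frac{\partial L_i}{\partial k^*}, \qquad g_{\beta} = \sum_i \frac{\partial L_i}{\partial \beta}, \qquad g_{\beta k^*} = \begin{bmatrix} g_{\beta} \\ g_{k^*} \end{bmatrix}$$

$$G_{k^*} = \sum_i \frac{\partial^2 L_i}{\partial (k^*)^2}, \qquad G_{\beta} = \sum_i \frac{\partial^2 L_i}{\partial \beta\,\partial \beta'}, \qquad G_{\beta k^*} = \begin{bmatrix} G_{\beta} & \sum_i \dfrac{\partial^2 L_i}{\partial \beta\,\partial k^*} \\ \sum_i \dfrac{\partial^2 L_i}{\partial k^*\,\partial \beta'} & G_{k^*} \end{bmatrix}$$

Step 1: Set k* = 0 (so that k = 1/k = 1), or use the Poisson model. Iterate to obtain an estimate of $\beta$:

$$\tilde{\beta}^{(i)} = \tilde{\beta}^{(i-1)} - G_{\beta}^{-1}\, g_{\beta}$$

Step 2: Set $\beta' = [\,\beta_0 \;\; 0 \;\; 0 \;\; 0\,]$ (no regression relation). Iterate to obtain an estimate of k*:

$$\tilde{k}^{*(i)} = \tilde{k}^{*(i-1)} - G_{k^*}^{-1}\, g_{k^*}$$

Step 3: Use the results from steps 1 and 2 as starting values (software seems to use a different value for the intercept) to obtain simultaneous estimates of $\beta$ and k*. Iterate:

$$\begin{bmatrix} \tilde{\beta} \\ \tilde{k}^* \end{bmatrix}^{(i)} = \begin{bmatrix} \tilde{\beta} \\ \tilde{k}^* \end{bmatrix}^{(i-1)} - G_{\beta k^*}^{-1}\, g_{\beta k^*}$$

Step 4: Back-transform the k* estimate to get the estimate of k ($k = e^{k^*}$).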
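The four steps above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the routine used by SAS or Stata: the function names `pieces` and `newton` are hypothetical, the data are simulated (a gamma-Poisson mixture with one made-up covariate, not the NASCAR races), the intercept in Step 2 is held at log of the sample mean, and a step-halving safeguard is added to plain Newton-Raphson so that the log-likelihood never decreases.

```python
import numpy as np
from math import lgamma


def pieces(theta, X, y):
    """Log-likelihood, score and Hessian for theta = (beta, k*), with k = exp(k*)."""
    beta, k = theta[:-1], np.exp(theta[-1])
    mu = np.exp(X @ beta)
    # Sums over j = 0..y_i-1 of ln(k+j), 1/(k+j), 1/(k+j)^2 (empty when y_i = 0)
    s0 = np.array([np.log(k + np.arange(v)).sum() for v in y])
    s1 = np.array([(1.0 / (k + np.arange(v))).sum() for v in y])
    s2 = np.array([(1.0 / (k + np.arange(v)) ** 2).sum() for v in y])
    lfact = np.array([lgamma(v + 1.0) for v in y])  # ln(y_i!)
    ll = np.sum(s0 - lfact + k * np.log(k) + y * np.log(mu) - (k + y) * np.log(mu + k))
    a = s1 + 1.0 + np.log(k) - np.log(k + mu) - (k + y) / (k + mu)
    g_beta = X.T @ (k * (y - mu) / (k + mu))
    g_k = np.sum(k * a)
    da = -k * s2 + 1.0 - k / (k + mu) - k * (mu - y) / (k + mu) ** 2
    H_bb = -(X * (k * mu * (k + y) / (k + mu) ** 2)[:, None]).T @ X
    H_bk = X.T @ (k * mu * (y - mu) / (k + mu) ** 2)
    H_kk = g_k + np.sum(k * da)
    g = np.append(g_beta, g_k)
    H = np.block([[H_bb, H_bk[:, None]], [H_bk[None, :], np.array([[H_kk]])]])
    return ll, g, H


def newton(theta, X, y, free, max_iter=100, tol=1e-8):
    """Newton-Raphson on the parameters indexed by `free`; other entries stay fixed."""
    theta = theta.copy()
    ll, g, H = pieces(theta, X, y)
    for _ in range(max_iter):
        if np.max(np.abs(g[free])) < tol:
            break
        delta = np.linalg.solve(H[np.ix_(free, free)], g[free])
        step = 1.0
        while step > 1e-6:  # step halving: an added safeguard, not in the text
            trial = theta.copy()
            trial[free] = theta[free] - step * delta
            with np.errstate(all="ignore"):
                ll_new, g_new, H_new = pieces(trial, X, y)
            if ll_new > ll:
                break
            step /= 2.0
        else:
            break  # no improving step found; stop iterating
        theta, ll, g, H = trial, ll_new, g_new, H_new
    return theta


# Simulated data (assumption): negative binomial counts via a gamma-Poisson mixture
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([np.ones(n), rng.uniform(0.0, 2.0, n)])
beta_true, k_true = np.array([1.0, 0.5]), 2.0
mu_true = np.exp(X @ beta_true)
y = rng.poisson(rng.gamma(shape=k_true, scale=mu_true / k_true))

p = X.shape[1]
b_idx, k_idx, all_idx = np.arange(p), np.array([p]), np.arange(p + 1)

# Intercept-only starting point: beta = [log(mean y), 0, ...], k* = 0
start = np.append(np.r_[np.log(y.mean()), np.zeros(p - 1)], 0.0)
step1 = newton(start, X, y, b_idx)          # Step 1: k* fixed at 0, iterate on beta
step2 = newton(start, X, y, k_idx)          # Step 2: slopes zero, iterate on k*
theta0 = np.append(step1[:p], step2[p])     # Step 3: joint iteration from both
theta_hat = newton(theta0, X, y, all_idx)
k_hat = np.exp(theta_hat[p])                # Step 4: back-transform to k

print("beta estimate:", np.round(theta_hat[:p], 3))
print("k estimate:", round(float(k_hat), 3), "| 1/k:", round(float(1.0 / k_hat), 3))
```

Passing an index set `free` lets one routine serve all three iterations: Step 1 updates only the $\beta$ block, Step 2 only k*, and Step 3 the full vector using $G_{\beta k^*}$. Reporting 1/k alongside k matches the parameter that SAS and Stata print.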