Poisson Regression - NASCAR Crash Data (1975-1979)

Poisson Regression

Caution Flags (Crashes) in NASCAR

Winston Cup Races 1975-1979

L. Winner (2006). “NASCAR Winston Cup Race Results for 1975-2003,” Journal of Statistics Education, Vol. 14, #3, www.amstat.org/publications/jse/v14n3/datasets.winner.html

Data Description

• Units: NASCAR Winston Cup Races (1975-1979) n=151 Races

• Dependent Variable:

 – $Y$ = # of Caution Flags/Crashes (CAUTIONS)

• Independent Variables:

 – $X_1$ = # of Drivers in race (DRIVERS)

 – $X_2$ = Circumference of Track (TRKLENGTH)

 – $X_3$ = # of Laps in Race (LAPS)

Generalized Linear Model

• Random Component:

 Poisson Distribution for # of Caution Flags

• Density Function:

$$P(Y = y \mid X_1, X_2, X_3) = \frac{e^{-\mu(X_1,X_2,X_3)}\,[\mu(X_1,X_2,X_3)]^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$

• Link Function: $g(\mu) = \log(\mu)$

• Systematic Component:

$$g(\mu) = \log(\mu(X_1,X_2,X_3)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \quad\Rightarrow\quad \mu(X_1,X_2,X_3) = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}$$
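As a minimal sketch of the three GLM components, the following Python computes the mean implied by the log link and evaluates the Poisson probability function. The function names, coefficients, and predictor values are illustrative assumptions, not estimates from the NASCAR data.

```python
import math

def poisson_mean(beta, x):
    """Inverse log link: mu = exp(beta0 + beta1*x1 + beta2*x2 + ...)."""
    eta = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return math.exp(eta)

def poisson_pmf(y, mu):
    """P(Y = y) = exp(-mu) * mu**y / y!, evaluated on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

# Hypothetical coefficients and predictors (drivers, track length, laps)
beta = [-1.0, 0.04, 0.1, 0.002]
x = [36, 1.5, 300]

mu = poisson_mean(beta, x)
probs = [poisson_pmf(y, mu) for y in range(60)]
print(round(mu, 3), round(sum(probs), 6))
```

Working on the log scale with `math.lgamma` avoids overflow in `y!` for larger counts.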

Testing For Overall Model

• $H_0: \beta_1 = \beta_2 = \beta_3 = 0$ (# Cautions independent of all predictors)

• $H_A$: Not all $\beta_j = 0$ (# Cautions associated with at least 1 predictor)

• Test Statistic: $X^2_{obs} = -2(L_0 - L_1)$

• P-Value: $P(\chi^2_3 \ge X^2_{obs})$

• Rejection Region: $X^2_{obs} \ge \chi^2_{\alpha,3}$

• Where:

 – $L_0$ is maximized log likelihood under model $H_0$

 – $L_1$ is maximized log likelihood under model $H_A$

NASCAR Caution Flag Example

Model: $g(\mu) = \beta_0$

| Criterion | DF | Value | Value/DF |
|---|---|---|---|
| Deviance | 150 | 215.4915 | 1.4366 |
| Scaled Deviance | 150 | 215.4915 | 1.4366 |
| Pearson Chi-Square | 150 | 201.6050 | 1.3440 |
| Scaled Pearson X2 | 150 | 201.6050 | 1.3440 |
| Log Likelihood | | 410.8784 | |

Model: $g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$

| Criterion | DF | Value | Value/DF |
|---|---|---|---|
| Deviance | 147 | 171.2162 | 1.1647 |
| Scaled Deviance | 147 | 171.2162 | 1.1647 |
| Pearson Chi-Square | 147 | 158.8281 | 1.0805 |
| Scaled Pearson X2 | 147 | 158.8281 | 1.0805 |
| Log Likelihood | | 433.0160 | |

Test Statistic: $X^2_{obs} = -2(L_0 - L_1) = -2(410.8784 - 433.0160) = 44.2752$

Rejection Region ($\alpha = 0.05$): $X^2_{obs} \ge \chi^2_{.05,3} = 7.815$

P-value: $P(\chi^2_3 \ge 44.2752) \approx 0$

Statistical output obtained from SAS PROC GENMOD
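The likelihood-ratio arithmetic can be reproduced in a few lines. This is a sketch using only the two log likelihoods reported in the SAS output; the helper name `chi2_sf_3df` is an assumption, and it relies on the closed-form upper tail of a chi-square distribution with exactly 3 degrees of freedom.

```python
import math

def chi2_sf_3df(x):
    """P(chi-square with 3 df >= x); closed form valid for df = 3 only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

L0 = 410.8784   # log likelihood, intercept-only model
L1 = 433.0160   # log likelihood, model with Drivers, TrkLength, Laps

x2_obs = -2 * (L0 - L1)
p_value = chi2_sf_3df(x2_obs)
reject = x2_obs >= 7.815   # critical value chi-square(.05, 3)
print(round(x2_obs, 4), p_value, reject)
```

The P-value is far below any conventional significance level, which is why the output reports it as essentially zero.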

Testing for Individual (Partial) Regression Coefficients

$H_0: \beta_j = 0 \qquad H_A: \beta_j \ne 0$

• Test Statistic ($Z$): $z_{obs} = \dfrac{\hat\beta_j}{SE(\hat\beta_j)}$

• P-value: $2P(Z \ge |z_{obs}|)$

• Test Statistic ($\chi^2$): $X^2_{obs} = \left(\dfrac{\hat\beta_j}{SE(\hat\beta_j)}\right)^2$

• P-value: $P(\chi^2_1 \ge X^2_{obs})$

• 1-Sided Tests: Confirm sign of $\hat\beta_j$ is correct, then "cut" P-value in half.
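A short sketch of these tests in Python follows. The function name `wald_test` and the estimate and standard error passed to it are hypothetical and chosen only for illustration. Because the chi-square statistic is the square of $z_{obs}$, its 1-df P-value equals the two-sided Z P-value.

```python
import math

def wald_test(estimate, std_error):
    """Return (z_obs, two-sided P-value, chi-square statistic)."""
    z = estimate / std_error
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * P(Z >= |z|)
    return z, p_two_sided, z ** 2

# Hypothetical estimate and standard error, for illustration only
z, p2, x2 = wald_test(0.50, 0.20)

# One-sided test: halve the P-value only if the sign agrees with H_A
p_one_sided = p2 / 2
print(round(z, 2), round(p2, 4), round(x2, 2), round(p_one_sided, 4))
```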

NASCAR Caution Flag Example

| Parameter | DF | Estimate | Std Error | Chi-Square | Pr>ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -0.7963 | 0.4117 | 3.74 | 0.0531 |
| Drivers | 1 | 0.0365 | 0.0125 | 8.55 | 0.0035 |
| TrkLength | 1 | 0.1145 | 0.1684 | 0.46 | 0.4966 |
| Laps | 1 | 0.0026 | 0.0008 | 10.82 | 0.0010 |

Conclude the following:

• Controlling for Track Length and Laps, as Drivers ↑, Cautions ↑

• Controlling for Drivers and Laps, No association between Cautions and Track Length

• Controlling for Drivers and Track Length, as Laps ↑, Cautions ↑

Reduced Model: log(Crashes) = -0.6876+0.0428*Drivers+0.0021*Laps
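Because the model is linear on the log scale, each coefficient acts multiplicatively on the expected number of cautions. The sketch below uses the reduced-model coefficients reported above; the race with 36 drivers and 300 laps is a hypothetical input, not a row from the data set.

```python
import math

# Reduced-model coefficients as reported above
b0, b_drivers, b_laps = -0.6876, 0.0428, 0.0021

def expected_cautions(drivers, laps):
    """Fitted mean: exp(b0 + b_drivers*drivers + b_laps*laps)."""
    return math.exp(b0 + b_drivers * drivers + b_laps * laps)

# Multiplicative change in the mean per one-unit increase in a predictor
rate_ratio_driver = math.exp(b_drivers)
rate_ratio_lap = math.exp(b_laps)

# Hypothetical race: 36 drivers, 300 laps
mu_hat = expected_cautions(36, 300)
print(round(rate_ratio_driver, 4), round(rate_ratio_lap, 4), round(mu_hat, 2))
```

Under this fit, adding one driver multiplies the expected count by `exp(0.0428)`, roughly a 4% increase, holding laps fixed.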

Testing Model Goodness-of-Fit

• Two Common Measures of Goodness of Fit:

– Pearson’s Chi-Square

– Deviance

• Both measures have approximate Chi-Square distributions under the hypothesis that the current model is appropriate, for a fixed number of combinations of independent variables and large counts

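Pearson's Chi-Square:

$$X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat\mu_i)^2}{\hat V(\hat\mu_i)}, \qquad \text{where } \hat V(\hat\mu_i) = \hat\mu_i \text{ for the Poisson Distribution}$$

Deviance:

$$G^2 = 2\sum_{i=1}^{n} y_i \log\left(\frac{y_i}{\hat\mu_i}\right)$$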
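Both statistics are simple sums over observations, as the sketch below illustrates. The counts and fitted means are hypothetical. The fitted means are chosen to sum to the observed total, as they do for any Poisson model with an intercept, which is the condition under which this short form of the deviance applies. Terms with $y_i = 0$ contribute zero to the deviance.

```python
import math

def pearson_chi2(y, mu):
    """X^2 = sum (y_i - mu_i)^2 / mu_i (Poisson variance equals the mean)."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))

def deviance(y, mu):
    """G^2 = 2 * sum y_i * log(y_i / mu_i), with 0 * log(0) taken as 0."""
    return 2 * sum(yi * math.log(yi / mi) for yi, mi in zip(y, mu) if yi > 0)

# Hypothetical counts and fitted means (both sum to 20)
y = [3, 0, 5, 4, 8]
mu_hat = [2.5, 1.0, 4.5, 5.0, 7.0]

x2 = pearson_chi2(y, mu_hat)
g2 = deviance(y, mu_hat)
print(round(x2, 4), round(g2, 4))
```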

NASCAR Caution Flags Example

Null Model

| Criterion | DF | Value | Value/DF | P-Value |
|---|---|---|---|---|
| Pearson X2 | 150 | 201.6050 | 1.3440 | 0.0032 |
| Deviance | 150 | 215.4915 | 1.4366 | 0.0004 |

Full Model

| Criterion | DF | Value | Value/DF | P-Value |
|---|---|---|---|---|
| Pearson X2 | 147 | 158.8281 | 1.0805 | 0.2386 |
| Deviance | 147 | 171.2162 | 1.1647 | 0.0838 |

Note that the null model clearly does not fit well, while for the full model we fail to reject the null hypothesis that the model is appropriate (however, we have many combinations of Laps, Track Length, and Drivers, so the Chi-Square approximation is questionable here)
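The P-values in these tables are upper-tail chi-square probabilities of the fit statistics. The following sketch recomputes them with the standard library only; the helper `chi2_sf` is an assumption of this example and uses the finite-sum closed forms that hold for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """P(chi-square with integer df >= x), via finite-sum closed forms."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        term = math.exp(-half)
        total = term
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum built by recurrence
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

# (statistic, df) pairs taken from the tables above
fit_stats = {
    ("Null", "Pearson X2"): (201.6050, 150),
    ("Null", "Deviance"): (215.4915, 150),
    ("Full", "Pearson X2"): (158.8281, 147),
    ("Full", "Deviance"): (171.2162, 147),
}
p_values = {key: chi2_sf(value, df) for key, (value, df) in fit_stats.items()}
for key, p in p_values.items():
    print(key, round(p, 4))
```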

SAS Program

options ps=54 ls=76;

data one;
input serrace 6-8 year 13-16 searace 23-24 drivers 31-32 trklength 34-40
      laps 46-48 road 56 cautions 63-64 leadchng 71-72;
cards;
1 1975 1 35 2.54 191 1 5 13
...
151 1979 31 37 2.5 200 0 6 35
;
run;

/* Data set one contains the data for analysis. Variable names and column
   specs are given in the INPUT statement. I have included only the first
   and last observations */

/* The following fits a Generalized Linear Model with a Poisson random
   component and a constant mean: g(mu)=alpha is the systematic component,
   g(mu)=log(mu) is the link function: mu=e**alpha */
proc genmod;
  model Cautions = / dist=poi link=log;
run;

/* The following fits a Generalized Linear Model with a Poisson random
   component; g(mu)=alpha + beta1*drivers + beta2*trklength + beta3*laps is
   the systematic component, g(mu)=log(mu) is the link function:
   mu=e**(alpha + beta1*drivers + beta2*trklength + beta3*laps) */
proc genmod;
  model Cautions = drivers trklength laps / dist=poi link=log;
run;
quit;

SPSS Output

Hosmer-Lemeshow Test

• Used when there are “many” distinct levels of explanatory variables

• Based on “lumping” together cases based on their predicted values into J (often 10 is used) groups

• Compares observed and expected counts by group based on Deviance and Pearson residuals. For Poisson model (where obs is observed, exp is expected):

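 – Pearson: $r_i = (\mathrm{obs}_i - \mathrm{exp}_i)/\sqrt{\mathrm{exp}_i}$, with $X^2 = \sum r_i^2$

 – Deviance: $d_i = \sqrt{\mathrm{obs}_i \cdot \log(\mathrm{obs}_i/\mathrm{exp}_i)}$, with $G^2 = 2\sum d_i^2$

• Degrees of Freedom: $J - p - 1$ where $p$ = # Predictor Variables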

NASCAR Caution Flags Example

$$\hat\mu_i = e^{-0.6876 + 0.0428 D_i + 0.0021 L_i}$$

| Group | Fitted | #Races | #Crashes | Expected | Pearson |
|---|---|---|---|---|---|
| 1 | <3.50 | 15 | 37 | 46.05 | -1.33 |
| 2 | 3.50-3.80 | 14 | 60 | 50.37 | 1.36 |
| 3 | 3.80-4.08 | 18 | 72 | 71.24 | 0.09 |
| 4 | 4.08-4.25 | 20 | 68 | 84.03 | -1.75 |
| 5 | 4.25-4.42 | 12 | 51 | 52.35 | -0.19 |
| 6 | 4.42-5.15 | 17 | 100 | 81.39 | 2.06 |
| 7 | 5.15-5.50 | 15 | 88 | 78.19 | 1.11 |
| 8 | 5.50-6.25 | 15 | 91 | 87.40 | 0.38 |
| 9 | 6.25-6.70 | 14 | 94 | 90.81 | 0.33 |
| 10 | >6.70 | 11 | 63 | 78.46 | -1.75 |

Pearson X2 = 15.5119

P-value = 0.0300

Note that there is evidence that the Poisson model does not provide a good fit
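The grouped Pearson statistic can be recomputed from the observed and expected crash counts by group. This sketch assumes $p = 2$ predictors (Drivers and Laps, as in the reduced model), so that $J - p - 1 = 7$. The helper `chi2_sf_7df` is a closed form that applies only to 7 degrees of freedom. Small differences from the reported value come from rounding in the expected counts.

```python
import math

# (observed crashes, expected crashes) for the 10 fitted-value groups
groups = [(37, 46.05), (60, 50.37), (72, 71.24), (68, 84.03), (51, 52.35),
          (100, 81.39), (88, 78.19), (91, 87.40), (94, 90.81), (63, 78.46)]

pearson_resid = [(obs - expected) / math.sqrt(expected) for obs, expected in groups]
x2 = sum(r ** 2 for r in pearson_resid)

df = len(groups) - 2 - 1   # J - p - 1 with p = 2 predictors

def chi2_sf_7df(x):
    """P(chi-square with 7 df >= x); closed form valid for df = 7 only."""
    tail = math.erfc(math.sqrt(x / 2))
    poly = math.sqrt(x) * (1 + x / 3 + x ** 2 / 15)
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * poly

p_value = chi2_sf_7df(x2)
print([round(r, 2) for r in pearson_resid])
print(round(x2, 4), df, round(p_value, 4))
```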

Computational Approach

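Poisson Probability Mass Function:

$$P(Y = y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$

Link Function: $g(\mu) = \log(\mu)$

Systematic Component:

$$g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \quad\Rightarrow\quad \mu = e^{g(\mu)} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}$$

For Subject $i$:

$$g(\mu_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} = \mathbf{x}_i'\boldsymbol\beta \qquad \text{where: } \mathbf{x}_i = \begin{bmatrix} 1 \\ X_{1i} \\ X_{2i} \\ X_{3i} \end{bmatrix}, \quad \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}$$

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix}, \qquad \mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad \boldsymbol\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}, \qquad \mu_i = e^{\mathbf{x}_i'\boldsymbol\beta}$$

Likelihood Function:

$$L(\boldsymbol\beta; y_1, \ldots, y_n) = \prod_{i=1}^{n} \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} = \prod_{i=1}^{n} \frac{\exp\left(-e^{\mathbf{x}_i'\boldsymbol\beta}\right)\left(e^{\mathbf{x}_i'\boldsymbol\beta}\right)^{y_i}}{y_i!}$$

$$l = \ln(L) = -\sum_{i=1}^{n} e^{\mathbf{x}_i'\boldsymbol\beta} + \sum_{i=1}^{n} y_i \mathbf{x}_i'\boldsymbol\beta - \sum_{i=1}^{n} \ln(y_i!)$$

$$\frac{\partial l}{\partial \boldsymbol\beta} = -\sum_{i=1}^{n} \mathbf{x}_i e^{\mathbf{x}_i'\boldsymbol\beta} + \sum_{i=1}^{n} y_i \mathbf{x}_i = \sum_{i=1}^{n} (y_i - \mu_i)\mathbf{x}_i = \sum_{i=1}^{n} (y_i - \mu_i) \begin{bmatrix} 1 \\ X_{1i} \\ X_{2i} \\ X_{3i} \end{bmatrix}$$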
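The log likelihood and its gradient translate directly into code. The sketch below is a plain-Python illustration with a hypothetical one-predictor data set; the function names `log_likelihood` and `score` are assumptions of this example.

```python
import math

def log_likelihood(beta, X, y):
    """l(beta) = -sum exp(x_i'b) + sum y_i * x_i'b - sum ln(y_i!)."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * xij for b, xij in zip(beta, xi))
        total += -math.exp(eta) + yi * eta - math.lgamma(yi + 1)
    return total

def score(beta, X, y):
    """Gradient dl/dbeta = sum (y_i - mu_i) * x_i."""
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        mu = math.exp(sum(b * xij for b, xij in zip(beta, xi)))
        for j, xij in enumerate(xi):
            grad[j] += (yi - mu) * xij
    return grad

# Hypothetical data: each row of X is [1, x1]; y holds the counts
X = [[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]]
y = [1, 2, 4, 7]
beta = [0.0, 0.5]

print(round(log_likelihood(beta, X, y), 4))
print([round(g, 4) for g in score(beta, X, y)])
```

A quick sanity check on such code is to compare `score` with a finite-difference approximation of `log_likelihood`.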

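Computational Approach

$$\frac{\partial l}{\partial \boldsymbol\beta} = -\sum_{i=1}^{n} \mathbf{x}_i e^{\mathbf{x}_i'\boldsymbol\beta} + \sum_{i=1}^{n} y_i \mathbf{x}_i = \sum_{i=1}^{n} (y_i - \mu_i)\mathbf{x}_i = \mathbf{X}'(\mathbf{Y} - \boldsymbol\mu)$$

$$\frac{\partial^2 l}{\partial \boldsymbol\beta\, \partial \boldsymbol\beta'} = -\sum_{i=1}^{n} \mathbf{x}_i e^{\mathbf{x}_i'\boldsymbol\beta} \mathbf{x}_i' = -\mathbf{X}'\mathbf{W}\mathbf{X} \qquad \text{where } \mathbf{W} = \mathrm{diag}(\mu_i) = \mathrm{diag}\left(e^{\mathbf{x}_i'\boldsymbol\beta}\right)$$

Setting $\dfrac{\partial l}{\partial \boldsymbol\beta} = \mathbf{0}$: $\quad \mathbf{X}'(\mathbf{Y} - \boldsymbol\mu) = \mathbf{0}$

Setting $\mathbf{G} = \mathbf{X}'\mathbf{W}\mathbf{X}$ and $\mathbf{g} = \mathbf{X}'(\mathbf{Y} - \boldsymbol\mu)$ leads to the estimate of $\boldsymbol\beta$ via the Newton-Raphson algorithm:

$$\hat{\boldsymbol\beta}^{New} = \hat{\boldsymbol\beta}^{Old} + \left[\mathbf{G}_{\hat{\boldsymbol\beta}^{Old}}\right]^{-1} \mathbf{g}_{\hat{\boldsymbol\beta}^{Old}} \qquad \text{with a reasonable starting vector of } \hat{\boldsymbol\beta}^{0} = \begin{bmatrix} \ln(\bar{Y}) \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

Approximate large-sample estimated variance-covariance matrix:

$$\hat V(\hat{\boldsymbol\beta}) = \mathbf{G}^{-1} = \left(\mathbf{X}'\hat{\mathbf{W}}\mathbf{X}\right)^{-1}$$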
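The Newton-Raphson iteration can be sketched in plain Python as follows. The data are hypothetical, and `solve` and `fit_poisson` are names introduced for this example. The iteration starts from the log of the mean count with zero slopes, which is one reasonable choice. This is an illustration of the update rule, not a reproduction of what PROC GENMOD does internally.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def fit_poisson(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson update: beta_new = beta_old + G^{-1} g."""
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)  # [ln(ybar), 0, ...]
    for _ in range(max_iter):
        mu = [math.exp(sum(b * xij for b, xij in zip(beta, xi))) for xi in X]
        # g = X'(Y - mu), G = X'WX with W = diag(mu)
        g = [sum((yi - mi) * xi[j] for xi, yi, mi in zip(X, y, mu)) for j in range(p)]
        G = [[sum(mi * xi[j] * xi[k] for xi, mi in zip(X, mu)) for k in range(p)]
             for j in range(p)]
        step = solve(G, g)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta, G  # G is from the last iterate before the final (tiny) step

# Hypothetical data: each row of X is [1, x1]
X = [[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0]]
y = [1, 2, 4, 7, 11]
beta_hat, G = fit_poisson(X, y)

# Standard errors: square roots of the diagonal of G^{-1}
p = len(beta_hat)
cov_cols = [solve(G, [1.0 if i == j else 0.0 for i in range(p)]) for j in range(p)]
std_err = [math.sqrt(cov_cols[j][j]) for j in range(p)]
print([round(b, 4) for b in beta_hat], [round(s, 4) for s in std_err])
```

At convergence the score $\mathbf{X}'(\mathbf{Y} - \boldsymbol\mu)$ is numerically zero, which gives a direct check on the fit.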
