Poisson Rate, Intro to Matched Pair Data
Stat 557
Heike Hofmann
Final Project
• Choose topic for final project
• Send an email with a one-paragraph write-up including
• explanation of the data
• suggestion for analysis
• sample of the data (or web-link to data)
• If you are working in a team (1-3 people), send me the
team information
Outline
• Poisson Rate
• Matched Pair Data
Poisson Regression
• Event occurrences are proportional to observed time,
space, or some other index of size
• Want: model rate of occurrence
• e.g. study of homicides in a given year in a sample of
cities:
• model rate (= homicides/population size) for a city.
• Explanatory factors might be unemployment rate,
residents’ median income, percentage of high
school graduates, ...
Poisson Rates
• For size index t_i, model the rate at which events
occur:
  log(µ_i / t_i) = α + β x_i
• equivalent to
  log µ_i = log t_i + α + β x_i
  use offset() to make sure that the coefficient of log t_i is
  not estimated, but fixed at 1
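A minimal sketch of such a fit in R, assuming a data frame df with a count column events, an exposure column t, and a predictor x (all hypothetical names):

  # Poisson rate model: log(mu/t) = alpha + beta*x, via an offset on log(t)
  fit <- glm(events ~ x + offset(log(t)),
             family = poisson(link = "log"), data = df)
  summary(fit)
  exp(coef(fit))   # multiplicative effects on the rate per unit of exposure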
Example: Crimes across US
• FBI publishes data on number of crimes by
type and state for each year.
• For 2009:
Summary of the 2009 crime data (one row per state, n = 50):

Variable                                     Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
Population                                 544270    1802408    4403094    6128138    6647091   36961664
Violent.crime                                 817       5456      15968      26207      30481     174459
Murder.and.nonnegligent.manslaughter            7      37.75      176.5     301.94     424.25       1972
Forcible.rape                                 124      562.8     1263.5     1758.9     2080.8       8713
Robbery                                        77       1201       3810       8077       9260      64093
Aggravated.assault                            575       3610      10297      16069      20017      99681
Property.crime                              12502      47968     132868     185850     226611    1009614
Burglary                                     2230       9871      29432      43909      51821     240233
Larceny.theft                                9296      34424      89563     126160     153502     678353
Motor.vehicle.theft                           448       3583      10136      15782      17736     164021

Region:   Midwest 12,  Northeast 9,  South 16,  West 13
Division: Mountain 8, South Atlantic 8, West North Central 7, New England 6,
          East North Central 5, Pacific 5, (Other) 11
State / Abbr: one level per state
[Map: Violent.crime/Population * 1e+05 by state, plotted over long/lat (legend: 200 to 700)]
[Map: Murder/Population * 1e+05 by state, plotted over long/lat (legend: 2 to 10)]
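A minimal sketch of the rate computations these maps display (assuming the data frame crime used in the glm() calls below, with Murder as a shorthand column for Murder.and.nonnegligent.manslaughter; the new column names are hypothetical):

  crime$violent.rate <- crime$Violent.crime / crime$Population * 1e5   # per 100,000
  crime$murder.rate  <- crime$Murder / crime$Population * 1e5          # per 100,000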
glm(formula = Violent.crime ~ Region + offset(log(Population/1e+05)),
    family = poisson(link = log), data = crime)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -118.04   -36.92   -22.78    23.71    72.60

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      5.955204   0.001969 3023.85   <2e-16 ***
RegionNortheast -0.073860   0.002988  -24.72   <2e-16 ***
RegionSouth      0.238827   0.002385  100.12   <2e-16 ***
RegionWest       0.090777   0.002681   33.86   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 109966  on 49  degrees of freedom
Residual deviance:  90772  on 46  degrees of freedom
AIC: 91346

Number of Fisher Scoring iterations: 4
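Since Midwest is the baseline region absorbed into the intercept, exponentiating the coefficients above turns them into rates and rate ratios; a quick reading of the output (approximate values):

  exp(5.955204)                               # ~ 386 violent crimes per 100,000 in the Midwest (baseline)
  exp(c(-0.073860, 0.238827, 0.090777))       # ~ 0.93, 1.27, 1.10:
                                              # Northeast, South, West rates relative to the Midwest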
glm(formula = Murder ~ Region + offset(log(Population/1e+05)),
    family = poisson(link = log), data = crime)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -12.599   -4.917   -2.109    1.142   14.193

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      1.519367   0.018095  83.965  < 2e-16 ***
RegionNortheast -0.179513   0.028305  -6.342 2.27e-10 ***
RegionSouth      0.261676   0.021838  11.983  < 2e-16 ***
RegionWest      -0.008964   0.025219  -0.355    0.722
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1914.1  on 49  degrees of freedom
Residual deviance: 1504.5  on 46  degrees of freedom
AIC: 1850.8

Number of Fisher Scoring iterations: 4
Example: Heart Attacks
• 109 patients:
  • heart valve (aortic/mitral)
  • age (<55, ≥55)
  • survival (in months)

Time at risk (in months)          Deaths
        aortic  mitral                    aortic  mitral
 <55      1259    2082             <55         4       1
 ≥55      1417    1647             ≥55         7       9
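A minimal sketch of how the four-row data set behind the following fit might be assembled (the names heart, Valve, Age, Exposure, Deaths are taken from the output below; the factor codings are assumptions):

  heart <- data.frame(
    Valve    = factor(c("Aortic", "Mitral", "Aortic", "Mitral")),
    Age      = factor(c("<55", "<55", "55+", "55+"), levels = c("<55", "55+")),
    Exposure = c(1259, 2082, 1417, 1647),   # months at risk
    Deaths   = c(4, 1, 7, 9))
  fit <- glm(Deaths ~ Valve + Age + offset(log(Exposure)),
             family = poisson, data = heart)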
glm(formula = Deaths ~ Valve + Age + offset(log(Exposure)), family = poisson,
    data = heart)

Deviance Residuals:
      1       2       3       4
  1.025  -1.197  -0.602   0.613

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -6.3121     0.5066 -12.460   <2e-16 ***
ValveMitral  -0.3299     0.4382  -0.753   0.4515
Age55+        1.2209     0.5138   2.376   0.0175 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 10.8405  on 3  degrees of freedom
Residual deviance:  3.2225  on 1  degrees of freedom
AIC: 22.349

Number of Fisher Scoring iterations: 5
Poisson Rates: Identity Link
• For size index t_i, model the rate at which events
occur:
  µ_i / t_i = α + β x_i
• equivalent to
  µ_i = α t_i + β x_i t_i
  i.e. a linear model without an intercept, in which each
  predictor is multiplied by the size index.
Identity Link
glm(formula = Deaths ~ I(as.integer(Valve) * Exposure) + I(as.integer(Age) *
    Exposure) + Exposure - 1, family = poisson(link = identity),
    data = heart)

Deviance Residuals:
       1        2        3        4
  0.4550  -0.1812  -0.7494   0.5400

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)
I(as.integer(Valve) * Exposure) -0.0019354  0.0013158  -1.471  0.14132
I(as.integer(Age) * Exposure)    0.0039663  0.0014399   2.755  0.00588 **
Exposure                         0.0004772  0.0032507   0.147  0.88329
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:    Inf  on 4  degrees of freedom
Residual deviance: 1.0931  on 1  degrees of freedom
AIC: 20.22

Number of Fisher Scoring iterations: 7
Matched Pair Data
Example: Approval Ratings
The same 1600 subjects were asked twice to rate the
British Prime Minister (Tony Blair):

                        2nd Rating
1st Rating       Approve   Disapprove
Approve              794          150
Disapprove            86          570
Matched Pair Data
• Observations are taken repeatedly from
the same subject, or
• Individuals with similar demographics are
paired
Matched Pair Data
                        2nd Rating
1st Rating       Approve   Disapprove
Approve              794          150
Disapprove            86          570

Assumptions
• Diagonal heavily loaded
• Association usually strongly positive (most
  people don’t change their opinion)
• Distinguish between movers & stayers
Marginal Homogeneity
• Did as many people move out of category a as
  moved into category a?
• H_0: π_a+ = π_+a
• For binary response:
  McNemar: (n_21 - n_12)² / (n_12 + n_21) ~ χ²_1

> mcnemar.test(matrix(c(794,150,86,570), byrow=T, ncol=2), correct=F)

        McNemar's Chi-squared test

data:  matrix(c(794, 150, 86, 570), byrow = T, ncol = 2)
McNemar's chi-squared = 17.3559, df = 1, p-value = 3.099e-05
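A quick check of the statistic by hand (only the two discordant cells enter):

  (150 - 86)^2 / (150 + 86)   # = 64^2 / 236 ≈ 17.36, matching the output above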
Subject-specific Tables
• For binary responses Y_1, Y_2, we can think of each
  record as one of four subject-specific 2x2 tables
  (rows: 1st/2nd rating, columns: yes/no):

  stayers:
    794 subjects            570 subjects
          yes  no                 yes  no
    1st     1   0           1st     0   1
    2nd     1   0           2nd     0   1

  movers:
    150 subjects             86 subjects
          yes  no                 yes  no
    1st     1   0           1st     0   1
    2nd     0   1           2nd     1   0
Subject-specific Tables
• Adding all 1600 of these tables gives the margins

         yes   no
  1st    944  656
  2nd    880  720

• Marginal Homogeneity then translates to whether the
  probability of approval is the same for the 1st and the
  2nd rating.
Marginal Homogeneity
Use a model to test for conditional independence:
• P(Y_t = 1 | x_t) = α + β x_t
• x_t is a dummy variable for the time points:
  x_1 = 0, x_2 = 1
• Marginal Homogeneity / Conditional Independence: β = 0
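A minimal sketch of the 4-row, count-weighted data set that the following fit could be based on (the names pm, Approval, Rating, and count are taken from the call below; the 0/1 and 1/2 codings are assumptions, and they determine the signs of the estimates):

  # one row per margin cell; count enters as a case weight
  pm <- data.frame(
    Rating   = c(1, 1, 2, 2),              # occasion of the rating (assumed coding)
    Approval = c(1, 0, 1, 0),              # 1 = approve, 0 = disapprove (assumed coding)
    count    = c(944, 656, 880, 720))      # yes/no margins from the summed table above
  fit <- glm(Approval ~ Rating, family = binomial(link = logit),
             data = pm, weights = count)
  summary(fit)                             # beta = 0 corresponds to marginal homogeneity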
glm(formula = Approval ~ Rating, family = binomial(link = logit),
    data = pm, weights = count)

Deviance Residuals:
      1       2       3       4
 -31.56   34.20  -32.44   33.91

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.52726    0.11340  -4.649 3.33e-06 ***
Rating       0.16329    0.07148   2.285   0.0223 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4373.2  on 3  degrees of freedom
Residual deviance: 4368.0  on 2  degrees of freedom
AIC: 4372

Number of Fisher Scoring iterations: 4
Marginal Homogeneity
Alternatively, use the logit link:
• logit P(Y_t = 1 | x_t) = α + β x_t
• x_t is a dummy variable for the time points:
  x_1 = 0, x_2 = 1
Then β is the log odds ratio based on the overall population.
Subject-Specific Model
• link P(Y_it = 1) = α_i + β x_t
• x_t is a dummy variable for the time points:
  x_1 = 0, x_2 = 1
• then
  α_i = link P(Y_i1 = 1)
  β   = link P(Y_i2 = 1) - link P(Y_i1 = 1)
Marginal vs Subject-Specific Model
Estimates for β
• are identical for the marginal model and the
  subject-specific model in case of the identity link
• are different for the logit link
  • marginal model:
    β = logit P(Y_2 = 1 | x_2) - logit P(Y_1 = 1 | x_1)
  • subject-specific, for all i:
    β = logit P(Y_i2 = 1 | x_2) - logit P(Y_i1 = 1 | x_1)
Subject-Specific Model
• logit P(Y_it = 1) = α_i + β x_t
• Assumptions generally:
  • responses from different subjects are
    independent (for all i)
  • responses at different time points are
    independent
Subject-Specific Model
• The violation of independence is taken care of by the
  model structure:
  • Generally, |α_i| >> |β|
  • When |α_i| is small, we have the most variability
    between responses of the same individual, i.e. the
    least dependence. Those are the records on which
    the estimation of β is based.
  • For large |α_i|, the probability P(Y_it = 1) is either
    close to 0 or close to 1 (the largest dependence in
    the data).
Fitting the Subject-Specific Model
• link P(Y_it = 1) = α_i + β x_t
• for a large number of subjects i, fitting the α_i becomes
  problematic: condition them out
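Conditioning on the discordant pairs (Y_i1 + Y_i2 = 1) removes the α_i; a standard result for this model is that the conditional ML estimate of β depends only on the movers. A minimal sketch on the approval data:

  # conditional ML estimate of beta: log of the ratio of the two mover counts
  log(86 / 150)   # beta_hat ≈ -0.56: odds of approval are lower at the 2nd rating
  86 / 150        # ≈ 0.57: corresponding subject-specific odds ratio (2nd vs 1st)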