Poisson Rate, Intro to Matched Pair Data
Stat 557
Heike Hofmann

Final Project
• Choose a topic for the final project
• Send an email with a one-paragraph write-up containing
  • an explanation of the data
  • a suggestion for the analysis
  • a sample of the data (or a web link to the data)
• If you are working in a team (1-3 people), send me the team information

Outline
• Poisson Rate
• Matched Pair Data

Poisson Regression
• Event occurrences are proportional to observed time, space, or another index of size
• Want: a model for the rate of occurrence
• e.g. a study of homicides in a given year in a sample of cities:
  • model the rate (= homicides / population size) for a city
  • explanatory factors might be unemployment rate, residents’ median income, percentage of high school graduates, ...

Poisson Rates
• For size index $t$, model the rate at which events occur: $\log(\mu_i / t_i) = \alpha + \beta x_i$
• equivalent to $\log \mu_i = \log t_i + \alpha + \beta x_i$
• use offset() so that the coefficient of $\log t_i$ is not estimated but fixed at 1

Example: Crimes across US
• FBI publishes data on number of crimes by type and state for each year.
• For 2009 (summary of 50 states):

    Variable                                   Min.   1st Qu.    Median      Mean   3rd Qu.       Max.
    Population                               544270   1802408   4403094   6128138   6647091   36961664
    Violent.crime                               817      5456     15968     26207     30481     174459
    Murder.and.nonnegligent.manslaughter       7.00     37.75    176.50    301.94    424.25    1972.00
    Forcible.rape                             124.0     562.8    1263.5    1758.9    2080.8     8713.0
    Robbery                                      77      1201      3810      8077      9260      64093
    Aggravated.assault                          575      3610     10297     16069     20017      99681
    Property.crime                            12502     47968    132868    185850    226611    1009614
    Burglary                                   2230      9871     29432     43909     51821     240233
    Larceny.theft                              9296     34424     89563    126160    153502     678353
    Motor.vehicle.theft                         448      3583     10136     15782     17736     164021

    State:    Alabama 1, Alaska 1, Arizona 1, Arkansas 1, California 1, Colorado 1, (Other) 44
    Abbr:     AK 1, AL 1, AR 1, AZ 1, CA 1, CO 1, (Other) 44
    Region:   Midwest 12, Northeast 9, South 16, West 13
    Division: Mountain 8, South Atlantic 8, West North Central 7, New England 6,
              East North Central 5, Pacific 5, (Other) 11

[Figure: map of US states (long vs. lat), shaded by Violent.crime/Population * 1e+05]

[Figure: map of US states (long vs. lat), shaded by Murder/Population * 1e+05]

    glm(formula = Violent.crime ~ Region + offset(log(Population/1e+05)),
        family = poisson(link = log), data = crime)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -118.04   -36.92   -22.78    23.71    72.60

    Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
    (Intercept)      5.955204   0.001969 3023.85   <2e-16 ***
    RegionNortheast -0.073860   0.002988  -24.72   <2e-16 ***
    RegionSouth      0.238827   0.002385  100.12   <2e-16 ***
    RegionWest       0.090777   0.002681   33.86   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 109966  on 49  degrees of freedom
    Residual deviance:  90772  on 46  degrees of freedom
    AIC: 91346

    Number of Fisher Scoring iterations: 4

    glm(formula = Murder ~ Region + offset(log(Population/1e+05)),
        family = poisson(link = log), data = crime)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -12.599   -4.917   -2.109    1.142   14.193

    Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
    (Intercept)      1.519367   0.018095  83.965  < 2e-16 ***
    RegionNortheast -0.179513   0.028305  -6.342 2.27e-10 ***
    RegionSouth      0.261676   0.021838  11.983  < 2e-16 ***
    RegionWest      -0.008964   0.025219  -0.355    0.722
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 1914.1  on 49  degrees of freedom
    Residual deviance: 1504.5  on 46  degrees of freedom
    AIC: 1850.8

    Number of Fisher Scoring iterations: 4

Example: Heart Attacks
• 109 patients:
  • heart valve (aortic/mitral)
  • age (<55, ≥55)
  • survival (in months)

    Time at risk (in months)          Deaths
           aortic   mitral                   aortic   mitral
    <55      1259     2082            <55         4        1
    ≥55      1417     1647            ≥55         7        9

    glm(formula = Deaths ~ Valve + Age + offset(log(Exposure)),
        family = poisson, data = heart)

    Deviance Residuals:
         1       2       3       4
     1.025  -1.197  -0.602   0.613

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -6.3121     0.5066 -12.460   <2e-16 ***
    ValveMitral  -0.3299     0.4382  -0.753   0.4515
    Age55+        1.2209     0.5138   2.376   0.0175 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 10.8405  on 3  degrees of freedom
    Residual deviance:  3.2225  on 1  degrees of freedom
    AIC: 22.349

    Number of Fisher Scoring iterations: 5

Poisson Rates
Identity Link
• For size index $t$, model the rate at which events occur: $\mu_i / t_i = \alpha + \beta x_i$
• equivalent to $\mu_i = \alpha t_i + \beta x_i t_i$: a linear model without the intercept. Each predictor is multiplied by the index.
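As a sketch of the two rate formulations above, the heart valve counts from the example can be entered by hand and fitted both ways. The data frame layout, the row order, the helper columns `Mitral` and `Older`, and the 0/1 dummy coding are assumptions made for this illustration, not taken from the slides; with a different coding of the predictors the identity-link coefficients are parameterized differently, although the fitted rates should agree.

```r
# Heart valve data from the example, entered by hand.
# Row order and explicit factor levels are assumptions for this sketch.
heart <- data.frame(
  Valve    = factor(c("Aortic", "Mitral", "Aortic", "Mitral"),
                    levels = c("Aortic", "Mitral")),
  Age      = factor(c("<55", "<55", "55+", "55+"),
                    levels = c("<55", "55+")),
  Deaths   = c(4, 1, 7, 9),
  Exposure = c(1259, 2082, 1417, 1647)   # time at risk in months
)

# Log link: log(mu_i) = log(t_i) + alpha + beta * x_i.
# offset() fixes the coefficient of log(Exposure) at 1.
fit_log <- glm(Deaths ~ Valve + Age + offset(log(Exposure)),
               family = poisson(link = "log"), data = heart)

# Identity link: mu_i = alpha * t_i + beta * x_i * t_i.
# No intercept; every predictor is multiplied by the exposure.
# Mitral and Older are hypothetical 0/1 dummy columns.
heart$Mitral <- as.integer(heart$Valve == "Mitral")
heart$Older  <- as.integer(heart$Age == "55+")
fit_id <- glm(Deaths ~ 0 + Exposure + I(Mitral * Exposure) + I(Older * Exposure),
              family = poisson(link = "identity"), data = heart)

# Estimated death rates per month at risk under each model
rate_log <- fitted(fit_log) / heart$Exposure
rate_id  <- fitted(fit_id)  / heart$Exposure
print(cbind(heart[, c("Valve", "Age")], rate_log, rate_id))
```

Under the log link the coefficients act multiplicatively on the rate, while under the identity link they are additive differences in deaths per month at risk, which is why the two sets of estimates are on very different scales.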
Identity Link

    glm(formula = Deaths ~ I(as.integer(Valve) * Exposure) +
        I(as.integer(Age) * Exposure) + Exposure - 1,
        family = poisson(link = identity), data = heart)

    Deviance Residuals:
          1        2        3        4
     0.4550  -0.1812  -0.7494   0.5400

    Coefficients:
                                      Estimate Std. Error z value Pr(>|z|)
    I(as.integer(Valve) * Exposure) -0.0019354  0.0013158  -1.471  0.14132
    I(as.integer(Age) * Exposure)    0.0039663  0.0014399   2.755  0.00588 **
    Exposure                         0.0004772  0.0032507   0.147  0.88329
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance:    Inf  on 4  degrees of freedom
    Residual deviance: 1.0931  on 1  degrees of freedom
    AIC: 20.22

    Number of Fisher Scoring iterations: 7

Matched Pair Data
Example: Approval Ratings
Same 1600 subjects asked to rate British Prime Minister (Tony Blair)

                       2nd Rating
    1st Rating     Approve   Disapprove
    Approve            794          150
    Disapprove          86          570

Matched Pair Data
• Observations are taken repeatedly from the same subject, or
• Individuals with similar demographics are paired

Matched Pair Data
Assumptions
• Diagonal heavily loaded
• Association usually strongly positive (most people don’t change their opinion)
• Distinguish between movers & stayers

Marginal Homogeneity
• Did as many people move from category a as to category a?
• $H_0: \pi_{a+} = \pi_{+a}$
• For binary response: McNemar: $(n_{21} - n_{12})^2 / (n_{12} + n_{21}) \sim \chi^2_1$

    > mcnemar.test(matrix(c(794,150,86,570),byrow=T,ncol=2),correct=F)

        McNemar's Chi-squared test

    data:  matrix(c(794, 150, 86, 570), byrow = T, ncol = 2)
    McNemar's chi-squared = 17.3559, df = 1, p-value = 3.099e-05

Subject-specific Tables
• For binary responses $Y_1, Y_2$, we can think of a record as one of the four instances:

    stayers (794 subjects)        stayers (570 subjects)
           yes   no                      yes   no
    1st      1    0               1st      0    1
    2nd      1    0               2nd      0    1

    movers (150 subjects)         movers (86 subjects)
           yes   no                      yes   no
    1st      1    0               1st      0    1
    2nd      0    1               2nd      1    0

Subject-specific Tables
• Adding all 1600 of these tables:

           yes    no
    1st    944   656
    2nd    880   720

• Marginal homogeneity then translates to whether the probability of approval is the same for the 1st and 2nd rating

Marginal Homogeneity
Use model to test for conditional independence
• $P(Y_t = 1 \mid x_t) = \alpha + \beta x_t$
• $x_t$ is a dummy variable for time points: $x_1 = 0$, $x_2 = 1$
• Marginal Homogeneity/Conditional Independence: $\beta = 0$

    glm(formula = Approval ~ Rating, family = binomial(link = logit),
        data = pm, weights = count)

    Deviance Residuals:
         1       2       3       4
    -31.56   34.20  -32.44   33.91

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.52726    0.11340  -4.649 3.33e-06 ***
    Rating       0.16329    0.07148   2.285   0.0223 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 4373.2  on 3  degrees of freedom
    Residual deviance: 4368.0  on 2  degrees of freedom
    AIC: 4372

    Number of Fisher Scoring iterations: 4

Marginal Homogeneity
Alternatively, use logit link:
• $\operatorname{logit} P(Y_t = 1 \mid x_t) = \alpha + \beta x_t$
• $x_t$ is a dummy variable for time points: $x_1 = 0$, $x_2 = 1$
• Then $\beta$ is the log odds ratio based on the overall population

Subject Specific Model
• $\operatorname{link} P(Y_{it} = 1) = \alpha_i + \beta x_t$
• $x_t$ is a dummy variable for time points: $x_1 = 0$, $x_2 = 1$
• then $\alpha_i = \operatorname{link} P(Y_{i1} = 1)$ and $\beta = \operatorname{link} P(Y_{i2} = 1) - \operatorname{link} P(Y_{i1} = 1)$

Marginal vs Subject-Specific Model
Estimates for $\beta$
• are identical for the marginal model and the subject-specific model in case of identity link
• are different for logit link
  • marginal model: $\beta = \operatorname{logit} P(Y_2 = 1 \mid x_2) - \operatorname{logit} P(Y_1 = 1 \mid x_1)$
  • subject specific, for all $i$: $\beta = \operatorname{logit} P(Y_{i2} = 1 \mid x_2) - \operatorname{logit} P(Y_{i1} = 1 \mid x_1)$

Subject-Specific Model
• $\operatorname{logit} P(Y_{it} = 1) = \alpha_i + \beta x_t$
• Assumptions generally:
  • responses from different subjects are independent (for all $i$)
  • responses for different time points are independent

Subject-Specific Model
• Violation of independence is taken care of by the model structure:
  • Generally, $|\alpha_i| \gg |\beta|$
  • When $|\alpha_i|$ is small, we have the most variability between responses of the same individual, i.e. the least dependence. Those are the records on which the estimation of $\beta$ is based.
  • For large $|\alpha_i|$, $P(Y_{it} = 1)$ is either close to 0 or close to 1 (largest dependence in the data)

Fitting the Subject Specific Model
• $\operatorname{link} P(Y_{it} = 1) = \alpha_i + \beta x_t$
• for large $i$, fitting $\alpha_i$ becomes problematic: condition out
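
To tie the matched-pair quantities back to the approval table, the following sketch recomputes them directly from the four cell counts. The object names are illustrative. The last estimate uses the standard conditional maximum likelihood result for the subject-specific logit model (only the movers contribute); it is stated here as an assumption, since the slides break off before giving it.

```r
# Approval table from the example: rows = 1st rating, columns = 2nd rating
tab <- matrix(c(794, 150,
                 86, 570),
              byrow = TRUE, ncol = 2,
              dimnames = list(first  = c("Approve", "Disapprove"),
                              second = c("Approve", "Disapprove")))
n <- sum(tab)

# McNemar statistic by hand: only the off-diagonal cells (movers) enter
n12 <- tab[1, 2]   # approve at 1st rating, disapprove at 2nd
n21 <- tab[2, 1]   # disapprove at 1st rating, approve at 2nd
mcnemar_stat <- (n21 - n12)^2 / (n12 + n21)
p_value <- pchisq(mcnemar_stat, df = 1, lower.tail = FALSE)

# Marginal approval proportions at the two time points
p1 <- sum(tab[1, ]) / n   # 1st rating
p2 <- sum(tab[, 1]) / n   # 2nd rating

# Marginal model, identity link: beta = P(Y2 = 1) - P(Y1 = 1)
beta_identity <- p2 - p1
# Marginal model, logit link: beta = logit P(Y2 = 1) - logit P(Y1 = 1)
beta_marginal_logit <- qlogis(p2) - qlogis(p1)

# Subject-specific logit model with the alpha_i conditioned out
# (standard result, assumed here, not taken from the slides)
beta_conditional <- log(n21 / n12)

print(c(mcnemar = mcnemar_stat, p = p_value,
        identity = beta_identity,
        marginal_logit = beta_marginal_logit,
        conditional = beta_conditional))
```

The marginal logit estimate (about -0.163) has the same magnitude as the Rating coefficient in the glm output; its sign depends on which response level a fit treats as the success. The conditional estimate (about -0.556) is considerably larger in magnitude, in line with the statement that marginal and subject-specific estimates differ under the logit link.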