Yoon G Kim, ygk1@humboldt.edu
Colloquium Talk

New Ways of Looking at Binary Data Fitting in R

Appetizer

Can we "stabilize" this?

[Figure: two oscillating series plotted against x (1 to 20); y axis from 0 to 250.]

After taking LOG …

> y1 <- rep(c(100,200),times=10)
> y2 <- rep(c(10,20),times=10)
> x <- c(1:20)
> data <- cbind(x,y1,y2)
> data[1:3,]
     x  y1 y2
[1,] 1 100 10
[2,] 2 200 20
[3,] 3 100 10
> par(mfrow=c(1,2))
> plot(y1~x,type="l",ylim=c(0,250),col="blue",ylab="")
> lines(y2~x,type="l",col="red")
> plot(log(y1)~x,type="l",ylim=c(0,6),col="blue",ylab="")
> lines(log(y2)~x,type="l",col="red")

[Figure: two panels against x. Left: y1 (blue) and y2 (red) on the raw scale, y axis 0 to 250. Right: "Log transformed", y axis 0 to 6.]

Outline

Exploring the options available when the assumptions of classical linear models are untenable.
In this talk: what can we do when observations are not continuous and the residuals are neither normally nor identically distributed?

Classical Linear Models

Defined by three assumptions:
(1) the response variable is continuous;
(2) the residuals (ε) are normally distributed; and
(3) the residuals are independently (3a) and identically (3b) distributed.
Today, we will consider a range of options available when assumptions (1), (2) and/or (3b) do not hold.

Non-continuous response variable

Many situations exist. The response variable could be:
(1) a count (number of individuals in a population; number of species in a community)
(2) a proportion (proportion "cured" after treatment; proportion of threatened species)
(3) a categorical variable (breeding/non-breeding; different phenotypes)
(4) a strictly positive value (especially time to success, or time to failure)
( ... ) and so forth

Added difficulties

These types of non-continuous variables also tend to deviate from the assumptions of Normality (assumption #2) and Homoscedasticity (assumption #3b).
(1) A count variable often follows a Poisson distribution (where the variance increases linearly with the mean).
(2) A proportion often follows a Binomial distribution (where the variance reaches a maximum for intermediate values and a minimum at either end: 0% or 100%).
(3) A categorical variable tends to follow a Binomial distribution (when the variable has only two levels) or a Multinomial distribution (when the variable has more than two levels).
(4) Time to success/failure can follow an exponential distribution or an inverse Gaussian distribution (the latter having a variance that increases much more quickly than the mean).

Fortunately

Many of these situations can be unified under a central framework, since all these distributions (and a few more) belong to the exponential family of distributions.

Canonical form:
$$f(y;\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi)\right\}$$
Here $\theta$ is the canonical (location) parameter and $\phi$ is the dispersion parameter. $f$ is a probability density function if y is continuous and a probability mass function if y is discrete.
Mean: $E[Y] = b'(\theta)$
Variance: $\mathrm{var}[Y] = b''(\theta)\,a(\phi)$

The Normal distribution

Probability density function:
$$f(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}$$
Canonical form:
$$f(y;\mu,\sigma^2) = \exp\left\{\frac{y\mu - \mu^2/2}{\sigma^2} - \frac{1}{2}\left[\frac{y^2}{\sigma^2} + \log(2\pi\sigma^2)\right]\right\}$$
Canonical (location) parameter: $\theta = \mu$, with $b(\theta) = \theta^2/2$
Dispersion parameter: $\phi = \sigma^2$
$E[Y] = b'(\theta) = \mu$
$\mathrm{var}[Y] = b''(\theta)\,a(\phi) = \sigma^2$

The Poisson distribution

Probability mass function:
$$f(y;\mu) = \frac{\mu^{y}e^{-\mu}}{y!}$$
Canonical form:
$$f(y;\mu) = \exp\left\{y\ln\mu - \mu - \ln y!\right\}$$
Canonical (location) parameter: $\theta = \ln\mu$, with $b(\theta) = \exp(\theta) = \mu$
Dispersion parameter: $\phi = 1$
$E[Y] = b'(\theta) = \mu$
$\mathrm{var}[Y] = b''(\theta)\,a(\phi) = \mu$

The Binomial distribution

Probability mass function:
$$f(y;n,p) = \binom{n}{y}p^{y}(1-p)^{n-y} = \exp\left\{y\ln p + (n-y)\ln(1-p) + \ln\binom{n}{y}\right\}$$
Canonical form:
$$f(y;n,p) = \exp\left\{y\ln\frac{p}{1-p} + n\ln(1-p) + \ln\binom{n}{y}\right\}$$
Canonical (location) parameter: $\theta = \ln\frac{p}{1-p}$, with $b(\theta) = -n\ln(1-p) = n\log(1+\exp\theta)$
Dispersion parameter: $\phi = 1$
$E[Y] = b'(\theta) = np$
$\mathrm{var}[Y] = b''(\theta)\,a(\phi) = np(1-p)$

Why is that remotely useful?

1) A single algorithm (maximum likelihood) will cope with all these situations.
2) Different types of variance can be accommodated:
   When Var is constant -> Normal (Gaussian)
   When Var increases linearly with the mean -> Poisson
   When Var has a humped shape -> Binomial
   When Var increases as the square of the mean -> Gamma (meaning the coefficient of variation remains constant)
   When Var increases as the cube of the mean -> inverse Gaussian
3) Most types of data are thus effectively covered.

Non-independent Observations

There are two ways to cope with non-independent observations.
When the design is balanced ("equal sample size"): we can use factors to partition our observations into different "groups" and analyze them as an ANOVA or ANCOVA … when factors are "crossed" or when they are "nested".
When the design is unbalanced ("uneven sample size"): mixed effect models are then called for.

How does it work?

1) You need to specify the family of distribution to use.
2) You need to specify the link function:
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}, \qquad \mu_i = E[y_i]$$
The left-hand side is the link function and the right-hand side is the linear predictor.
For each type of variable, the "natural" link function to use is indicated by the canonical parameter:

Family          Link
Normal          Identity
Poisson         Log
Binomial        Logit, ln(p / (1 - p))
Gamma           Inverse
Inv. Gaussian   Inverse square

Binary variable

The response variable contains only 0's and 1's. The probability that a place is "occupied" is p, and we write
$$P(y) = p^{y}(1-p)^{1-y}$$
The objective is to determine how the explanatory variables influence p. The family to use is Binomial and the canonical link is logit.
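As a quick check of the family, variance and link pairings listed above, the following sketch reads them from R's built-in family objects. It is only an illustration; the names `families`, `mu`, `link_table`, `p`, `eta` and `back` are introduced here and are not part of the talk.

```r
# Canonical link and variance function for each family discussed above,
# read from R's built-in family objects (stats package).
families <- list(
  Normal           = gaussian(),
  Poisson          = poisson(),
  Binomial         = binomial(),
  Gamma            = Gamma(),
  inverse.Gaussian = inverse.gaussian()
)

mu <- 0.25  # an arbitrary mean that is valid for every family here

link_table <- data.frame(
  family   = names(families),
  # default (canonical) link of each family
  link     = vapply(families, function(f) f$link, character(1), USE.NAMES = FALSE),
  # variance function V(mu) evaluated at mu
  variance = vapply(families, function(f) f$variance(mu), numeric(1), USE.NAMES = FALSE)
)
print(link_table)

# For a binary response the canonical link is the logit, ln(p / (1 - p)).
p    <- c(0.1, 0.5, 0.9)
eta  <- binomial()$linkfun(p)    # probability -> linear predictor scale
back <- binomial()$linkinv(eta)  # linear predictor -> probability
```

The `variance` column follows the pattern described earlier: constant for the Normal, mu for the Poisson, mu(1 - mu) for the Binomial, mu squared for the Gamma and mu cubed for the inverse Gaussian.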
Example: The response is occupation of territories and the explanatory variable is the resource availability in each territory (Crawley, M.J. (2007) The R Book: 597-598).

> occupy <- read.table("D:\\STAT999\\RBook\\occupation.txt",header=T)
> dim(occupy)
[1] 150   2
> occupy[1:3,]
  resources occupied
1  14.18154        0
2  18.68306        0
3  20.22156        0
> attach(occupy)

Binary variable

By default the link for a Binomial is the logit.

> table(occupied)
occupied
 0  1
58 92
> modell <- glm(occupied~resources, family=binomial)
>
> plot(resources, occupied, type="n")
> rug(jitter(resources[occupied==0]))
> rug(jitter(resources[occupied==1]),side=3)
> xv <- 0:1000
> yv <- predict(modell, list(resources=xv),type="response")
> lines(xv, yv)   # draw the fitted curve (xv and yv are otherwise unused)

[Figure: occupied (0.0 to 1.0) against resources (0 to 1000), with rugs at 0 and 1 and the fitted logistic curve.]

> cutr <- cut(resources,5)
> tapply(occupied,cutr,sum)
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992]
         0         10         25         26         31
> table(cutr)
cutr
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992]
        31         29         30         29         31
> probs <- tapply(occupied,cutr,sum)/table(cutr)
> probs
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992]
 0.0000000  0.3448276  0.8333333  0.8965517  1.0000000
attr(,"class")
[1] "table"
> probs <- as.vector(probs)
> resmeans <- tapply(resources,cutr,mean)
> resmeans <- as.vector(resmeans)
> points(resmeans,probs,pch=16,cex=2)
> se <- sqrt(probs*(1-probs)/table(cutr))
> up <- probs + as.vector(se)
> down <- probs - as.vector(se)
> for(i in 1:5) {
+   lines(c(resmeans[i],resmeans[i]),c(up[i],down[i]))}

[Figure: the same plot with the five binned proportions added as filled points, each with a bar of plus or minus one standard error.]

Various Link Functions

> grid_x <- seq(10,990,by=0.5)
> modell_p <- predict(modell,newdata=data.frame(resources=grid_x),type="response")
> modelp <- glm(occupied~resources, family=binomial(link=probit))
> modelp_p <- predict(modelp,newdata=data.frame(resources=grid_x),type="response")
> modelcl <- glm(occupied~resources, family=binomial(link=cloglog))
> modelcl_p <- predict(modelcl,newdata=data.frame(resources=grid_x),type="response")
> modelca <- glm(occupied~resources, family=binomial(link=cauchit))
> modelca_p <- predict(modelca,newdata=data.frame(resources=grid_x),type="response")

To draw …

> newdata <- data.frame(grid_x,modell_p,modelp_p,modelcl_p,modelca_p)
> library(lattice)
> # The binned proportions and their error bars are drawn inside the panel
> # function and the legend is supplied through key=, because base-graphics
> # calls (legend, points, lines) do not draw onto a lattice plot.
> print(xyplot(modell_p+modelp_p+modelcl_p+modelca_p ~ grid_x,
+   data=newdata, type="l", xlab="resources",
+   ylab="p", lwd=1.5, lty=c(1,2,3,4), col=c(1:4),
+   key=list(corner=c(0,1),
+     text=list(c("logit","probit","cloglog","cauchit")),
+     lines=list(lty=c(1:4), col=c(1:4), lwd=1.5)),
+   panel = function(x, y, ...) {
+     panel.xyplot(x, y, ...)
+     panel.points(resmeans, probs, pch=16, cex=2, col=1)
+     panel.segments(resmeans, down, resmeans, up, col=1)
+   }))

[Figure: fitted probability p against resources for the logit, probit, cloglog and cauchit links, with the binned proportions and their error bars.]

Binary variable

Note: this summary is only valid if the response variable is indeed binomial.

> summary(modell)

Call:
glm(formula = occupied ~ resources, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.744592   0.669923  -5.590 2.28e-08 ***
resources    0.009762   0.001568   6.227 4.77e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.170  on 149  degrees of freedom
Residual deviance:  97.152  on 148  degrees of freedom
AIC: 101.15

Number of Fisher Scoring iterations: 6

The deviance, also called the G-statistic:
$$D = 2\sum_{i=1}^{n}\left[y_i\ln\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i)\right]$$

Binary variable

The dispersion parameter ($\phi$) must be calculated. It is the sum of the squared Pearson residuals divided by the residual degrees of freedom:
$$\hat{\phi} = \frac{\chi^2}{n-p} = \frac{1}{n-p}\sum_{i=1}^{n}\frac{(y_i-\hat{\mu}_i)^2}{V(\hat{\mu}_i)}$$
where $V(\mu)$ is the variance function of the family ($\mu$ for a Poisson, $\mu(1-\mu)$ for a binary response) and $n-p$ is the residual degrees of freedom.

> (dp <- sum(residuals(modell, type="pearson")^2)/modell$df.res)
[1] 0.8472199

This suggests that the variance is about 0.85 times the value implied by the binomial model. In statistical terms there is no overdispersion. In biological terms, it suggests that the observations are independent of each other and are not aggregated (i.e. clumped).
Overdispersed count data typically follow a Negative Binomial distribution, which (with an unknown shape parameter) is not part of the exponential family of distributions.
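The dispersion estimate above can be reproduced on simulated data, which may help when the occupation file is not at hand. This is a sketch under stated assumptions: `resources_sim` and `occupied_sim` are hypothetical simulated stand-ins, not the Crawley data, and the coefficients used to generate them are chosen only to resemble the fitted values on the slide.

```r
# Simulated stand-in for the occupation data (hypothetical values):
# 150 territories whose occupancy probability rises with resources.
set.seed(1)
resources_sim <- runif(150, 0, 1000)
occupied_sim  <- rbinom(150, size = 1,
                        prob = plogis(-3.7 + 0.01 * resources_sim))

fit <- glm(occupied_sim ~ resources_sim, family = binomial)

# Pearson chi-square divided by the residual degrees of freedom.
dp_sim <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# A quasibinomial fit estimates the same quantity internally.
fit_q    <- glm(occupied_sim ~ resources_sim, family = quasibinomial)
dp_quasi <- summary(fit_q)$dispersion

# Passing the dispersion to summary() rescales the standard errors
# by sqrt(dp_sim); the coefficient estimates are unchanged.
se_binomial <- coef(summary(fit))[, "Std. Error"]
se_adjusted <- coef(summary(fit, dispersion = dp_sim))[, "Std. Error"]
```

Here `dp_sim` and `dp_quasi` agree, and `se_adjusted` equals `se_binomial * sqrt(dp_sim)`. A quasi-binomial fit of this kind handles moderate departures from the binomial variance; strongly overdispersed counts are the case for the negative binomial distribution mentioned above.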
The negative binomial will not be covered here, but it can be approximated by a quasi-binomial fit (family="quasibinomial"). If you need it in your future work, you can also try glm.nb (in the MASS package).

Binary variable

The summary table can be adjusted with the dispersion parameter. The resulting values can now be taken at face value.

> summary(modell, dispersion=dp)

Call:
glm(formula = occupied ~ resources, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.744592   0.616628  -6.073 1.26e-09 ***
resources    0.009762   0.001443   6.765 1.33e-11 ***
---
(Dispersion parameter for binomial family taken to be 0.8472199)

    Null deviance: 200.170  on 149  degrees of freedom
Residual deviance:  97.152  on 148  degrees of freedom
AIC: 101.15

Number of Fisher Scoring iterations: 6

How good is the model?
1 – (Res. Dev. / Null Dev.) = 1 – (97.152 / 200.170) = 51.47 %

> summary(modell)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.744592   0.669923  -5.590 2.28e-08 ***
resources    0.009762   0.001568   6.227 4.77e-10 ***
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 200.170  on 149  degrees of freedom
Residual deviance:  97.152  on 148  degrees of freedom
AIC: 101.15

> summary(modelp)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.1437759  0.3448511  -6.217 5.08e-10 ***
resources    0.0055046  0.0007811   7.047 1.82e-12 ***
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 200.170  on 149  degrees of freedom
Residual deviance:  97.024  on 148  degrees of freedom
AIC: 101.02

> summary(modelcl)
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.5902574  0.4293153  -6.033 1.60e-09 ***
resources    0.0053519  0.0008337   6.419 1.37e-10 ***
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 200.17  on 149  degrees of freedom
Residual deviance: 102.30  on 148  degrees of freedom
AIC: 106.30

> summary(modelca)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.540198   1.644250  -3.369 0.000753 ***
resources    0.014612   0.004205   3.475 0.000510 ***
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 200.17  on 149  degrees of freedom
Residual deviance:  99.69  on 148  degrees of freedom
AIC: 103.69

Bootstrapping

> modell <- glm(occupied~resources,family=binomial)
> bcoef <- matrix(0,1000,2)
> for (i in 1:1000){
+   indices <- sample(1:150,replace=T)    # resample rows with replacement
+   x <- resources[indices]
+   y <- occupied[indices]
+   modelb <- glm(y~x, family=binomial)   # separate name, so modell is not overwritten
+   bcoef[i,] <- modelb$coef }
> par(mfrow=c(1,2))
> plot(density(bcoef[,2]),xlab="Coefficient of x",main="")
> abline(v=quantile(bcoef[,2],c(0.025,0.975)),lty=2, col=4)
> plot(density(bcoef[,1]),xlab="Intercept",main="")
> abline(v=quantile(bcoef[,1],c(0.025,0.975)),lty=2, col=4)

[Figure: bootstrap densities of the coefficient of x (about 0.005 to 0.020) and of the intercept (about -8 to -2), with dashed lines at the 2.5% and 97.5% quantiles.]

Jackknifing

> jcoef <- matrix(0,150,2)
> for (i in 1:150) {
+   modelj <- glm(occupied[-i]~resources[-i], family=binomial)   # leave observation i out
+   jcoef[i,] <- modelj$coef
+ }
> par(mfrow=c(1,2))
> plot(density(jcoef[,2]),xlab="Coefficient of x",main="")
> abline(v=quantile(jcoef[,2],c(0.025,0.975)),lty=2, col=4)
> plot(density(jcoef[,1]),xlab="Intercept",main="")
> abline(v=quantile(jcoef[,1],c(0.025,0.975)),lty=2, col=4)

[Figure: jackknife densities of the coefficient of x (about 0.0098 to 0.0106) and of the intercept (about -4.00 to -3.70), with dashed lines at the 2.5% and 97.5% quantiles.]

C.I.'s

> library(boot)
> reg.boot <- function(regdat, index){
+   x <- regdat$resources[index]   # use the data passed in by boot()
+   y <- regdat$occupied[index]
+   modelb <- glm(y~x, family=binomial)
+   coef(modelb) }
> reg.model <- boot(occupy,reg.boot,R=10000)
> boot.ci(reg.model,index=2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

Intervals :
Level      Normal              Basic
95%   ( 0.0059,  0.0128 )   ( 0.0051,  0.0120 )

Level     Percentile            BCa
95%   ( 0.0075,  0.0144 )   ( 0.0070,  0.0132 )
Calculations and Intervals on Original Scale

> jack.after.boot(reg.model,index=2)

[Figure: jackknife-after-bootstrap plot. x axis: standardized jackknife value (about -6 to 1); y axis: 5, 10, 16, 50, 84, 90, 95 %-iles of (T*-t) (about -0.004 to 0.004). Points are labelled by observation number; observation 108 lies at the far left, followed by 34 and 102.]

108th observation?

> occupy[105:110,]
    resources occupied
105  703.1783        1
106  710.1274        1
107  716.7298        1
108  717.1994        0
109  733.3538        1
110  736.3060        1
> plot(resources, occupied)
> text(resources[108],occupied[108],"Here",cex = 1.5,col="blue",pos=3)

OR

> fat.arrow <- function(size.x=0.5,size.y=0.5,ar.col="red"){
+   size.x <- size.x*(par("usr")[2]-par("usr")[1])*0.1
+   size.y <- size.y*(par("usr")[4]-par("usr")[3])*0.1
+   pos <- locator(1)   # click on the plot where the arrow tip should go
+   xc <- c(0,1,0.5,0.5,-0.5,-0.5,-1,0)
+   yc <- c(0,1,1,6,6,1,1,0)
+   polygon(pos$x+size.x*xc,pos$y+size.y*yc,col=ar.col) }
> fat.arrow()

[Figure: occupied (0.0 to 1.0) against resources (0 to 1000), with the 108th observation marked.]

Thank You!