New Ways of Looking at Binary Data Fitting in R
Colloquium Talk
Yoon G Kim, ygk1@humboldt.edu
Appetizer
[Figure: y1 and y2 plotted against x on the raw scale (y axis 0 to 250). Can we "stabilize" this?]
After taking LOG …
> y1 <- rep(c(100,200),times=10)
> y2 <- rep(c(10,20),times=10)
> x <- c(1:20)
> data <- cbind(x,y1,y2)
> data[1:3,]
     x  y1 y2
[1,] 1 100 10
[2,] 2 200 20
[3,] 3 100 10
> par(mfrow=c(1,2))
> plot(y1~x,type="l",ylim=c(0,250),col="blue",ylab="")
> lines(y2~x,type="l",col="red")
> plot(log(y1)~x,type="l",ylim=c(0,6),col="blue",ylab="")
> lines(log(y2)~x,type="l",col="red")
[Figure: left panel, y1 and y2 vs x on the raw scale (0 to 250); right panel ("Log transformed"), log(y1) and log(y2) vs x (0 to 6).]
Outline
Exploring the options available when the
assumptions of classical linear models are untenable.
In this talk:
What can we do when the observations are not continuous
and the residuals are neither normally distributed nor
identically distributed?
Classical Linear Models
Defined by three assumptions:
(1) the response variable is continuous;
(2) the residuals (ε) are normally distributed; and
(3) the residuals are independently (3a) and identically distributed (3b).
Today, we will consider the range of options available
when assumptions (1), (2) and/or (3b) do not hold.
Non-continuous response variable
Many situations exist:
The response variable could be
(1) a count (number of individuals in a population)
(number of species in a community)
(2) a proportion (proportion "cured" after treatment)
(proportion of threatened species)
(3) a categorical variable (breeding/non-breeding)
(different phenotypes)
(4) a strictly positive value (esp. time to success)
(or time to failure)
( ... ) and so forth
Added difficulties
These types of non-continuous variables
also tend to deviate from the assumptions of
Normality (assumption #2) and
Homoscedasticity (assumption #3b)
(1) A count variable often follows a Poisson distribution
(where the variance increases linearly with the mean)
(2) A proportion often follows a Binomial distribution
(where the variance reaches a maximum for intermediate values
and a minimum at either end: 0% or 100%)
(3) A categorical variable tends to follow a Binomial distribution
(when the variable has only two levels) or a Multinomial distribution
(when the variable has more than two levels)
(4) Time to success/failure can follow an exponential distribution or
an inverse Gaussian distribution (the latter having a variance
increasing much more quickly than the mean).
Fortunately
Many of these situations can be unified
under a central framework,
since all of these distributions (and a few more)
belong to the exponential family of distributions.
Canonical form

$$f(y \mid \theta, \phi) = \exp\!\left( \frac{y\,\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right)$$

θ: canonical (location) parameter
φ: dispersion parameter
f: probability density function (if y is continuous)
   or probability mass function (if y is discrete)

mean:      $E[Y] = \mu = b'(\theta)$
variance:  $\mathrm{var}[Y] = b''(\theta)\,a(\phi)$
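A short derivation of these two identities from the canonical form, using the standard score-function results (a sketch, assuming the dispersion $a(\phi)$ does not depend on $\theta$):

$$\ell(\theta) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi), \qquad
  \frac{\partial \ell}{\partial \theta} = \frac{y - b'(\theta)}{a(\phi)}, \qquad
  \frac{\partial^2 \ell}{\partial \theta^2} = -\frac{b''(\theta)}{a(\phi)}$$

$$E\!\left[\frac{\partial \ell}{\partial \theta}\right] = 0
  \;\Rightarrow\; E[Y] = b'(\theta), \qquad
  E\!\left[\frac{\partial^2 \ell}{\partial \theta^2}\right]
  + E\!\left[\left(\frac{\partial \ell}{\partial \theta}\right)^{2}\right] = 0
  \;\Rightarrow\; \mathrm{var}[Y] = b''(\theta)\,a(\phi)$$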
The Normal distribution

Probability density function

$$f(y \mid \theta, \phi) = \frac{1}{\sqrt{2\pi\sigma^2}}\,
  \exp\!\left( -\frac{(y-\mu)^2}{2\sigma^2} \right)$$

Canonical form

$$= \exp\!\left( \frac{y\mu - \mu^2/2}{\sigma^2}
   - \frac{1}{2}\left[ \frac{y^2}{\sigma^2} + \log(2\pi\sigma^2) \right] \right)$$

Canonical (location) parameter:  $\theta = \mu$
Dispersion parameter:            $\phi = \sigma^2$

$E[Y] = b'(\theta) = \mu$
$\mathrm{var}[Y] = b''(\theta)\,a(\phi) = \sigma^2$
The Poisson distribution

Probability mass function

$$f(y \mid \theta, \phi) = \frac{\mu^{y} e^{-\mu}}{y!}$$

Canonical form

$$= \exp\!\left( y \ln\mu - \mu - \ln y! \right), \qquad
  b(\theta) = \exp(\theta), \qquad a(\phi) = 1$$

Canonical (location) parameter:  $\theta = \ln\mu$
Dispersion parameter:            $\phi = 1$

$E[Y] = b'(\theta) = \mu$
$\mathrm{var}[Y] = b''(\theta)\,a(\phi) = \mu$
The Binomial distribution

Probability mass function

$$f(y \mid \theta, \phi) = \binom{n}{y} p^{y} (1-p)^{n-y}$$

$$= \exp\!\left( y \ln p + (n-y)\ln(1-p) + \ln\binom{n}{y} \right)$$

Canonical form

$$= \exp\!\left( y \ln\frac{p}{1-p} + n\ln(1-p) + \ln\binom{n}{y} \right), \qquad a(\phi) = 1$$

Canonical (location) parameter:  $\theta = \ln\dfrac{p}{1-p}$
Dispersion parameter:            $\phi = 1$

$b(\theta) = -n\ln(1-p) = n\log(1 + \exp\theta)$

$E[Y] = b'(\theta) = np$
$\mathrm{var}[Y] = b''(\theta)\,a(\phi) = np(1-p)$
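As a worked check of the last two lines: differentiating the cumulant function $b(\theta) = n\log(1+e^{\theta})$ and using $p = e^{\theta}/(1+e^{\theta})$,

$$b'(\theta) = n\,\frac{e^{\theta}}{1+e^{\theta}} = np, \qquad
  b''(\theta) = n\,\frac{e^{\theta}}{(1+e^{\theta})^{2}} = np(1-p)$$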
Why is that remotely useful?
1) A single algorithm (maximum likelihood)
   will cope with all these situations.
2) Different types of variance can be accommodated
   (see the sketch after this list):
   When Var is constant -> Normal (Gaussian)
   When Var increases linearly with the mean -> Poisson
   When Var has a hump-backed shape -> Binomial
   When Var increases as the square of the mean -> Gamma
   (meaning the coefficient of variation remains constant)
   When Var increases as the cube of the mean -> inverse Gaussian
3) Most types of data are thus effectively covered.
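These variance-to-mean relationships are built into R's family objects, which gives a quick way to see the pattern (a minimal sketch using base R's family constructors):

# Each family object carries its variance function V(mu);
# var(Y) = phi * V(mu), so the shape of V(mu) is what differs between families.
mu <- c(0.5, 1, 2, 5)
gaussian()$variance(mu)                 # constant:           1 1 1 1
poisson()$variance(mu)                  # linear in the mean: 0.5 1 2 5
binomial()$variance(c(0.1, 0.5, 0.9))   # humped (mu is a probability here): 0.09 0.25 0.09
Gamma()$variance(mu)                    # mean squared:       0.25 1 4 25
inverse.gaussian()$variance(mu)         # mean cubed:         0.125 1 8 125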
Non-independent Observations
Two ways to cope with non-independent observations:
When the design is balanced ("equal sample sizes"),
we can use factors to partition our observations into different
"groups" and analyze them as an ANOVA or ANCOVA,
whether the factors are "crossed" or "nested".
When the design is unbalanced ("uneven sample sizes"),
mixed effect models are called for.
How does it work?
1) You need to specify the family of distribution to use.
2) You need to specify the link function:

$$g(\mu_i) = \eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

where g is the link function and η is the linear predictor.

For each type of variable, the "natural" link function to use
is indicated by the canonical parameter:

Family             Canonical link
Normal             Identity
Poisson            Log
Binomial           Logit:  ln( p / (1 - p) )
Gamma              Inverse
Inverse Gaussian   Inverse square
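In R, both choices are made through the family argument of glm(); here is a minimal sketch on simulated toy data (the variable names are placeholders, not part of the example that follows):

# Toy data just to make the calls runnable
set.seed(1)
x1 <- runif(50); x2 <- runif(50)
y  <- rpois(50, exp(1 + 2 * x1 - x2))

# Canonical links are the defaults, so these two calls are equivalent:
fit1 <- glm(y ~ x1 + x2, family = poisson)
fit2 <- glm(y ~ x1 + x2, family = poisson(link = "log"))

# A non-canonical link can be requested explicitly, e.g. a probit
# instead of the default logit for binomial data:
yb   <- rbinom(50, 1, plogis(-1 + 3 * x1))
fit3 <- glm(yb ~ x1 + x2, family = binomial(link = "probit"))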
Binary variable
The response variable contains only 0's and 1's. The
probability that a place is "occupied" is p, and we write

$$P(y) = p^{\,y}\,(1-p)^{1-y}$$

The objective is to determine how the explanatory variable influences p.
The family to use is binomial and the canonical link is logit.

Example: the response is occupation of territories and the
explanatory variable is the resource availability in each territory.

> occupy <- read.table("D:\\STAT999\\RBook\\occupation.txt",header=T)
> dim(occupy)
[1] 150   2
> occupy[1:3,]
  resources occupied
1  14.18154        0
2  18.68306        0
3  20.22156        0
> attach(occupy)

Crawley, M.J. (2007) The R Book: 597-598
Binary variable
> table(occupied)
occupied
 0  1 
58 92 
> modell <- glm(occupied~resources, family=binomial)   # by default the link for a binomial is logit
>
> plot(resources, occupied, type="n")
> rug(jitter(resources[occupied==0]))
> rug(jitter(resources[occupied==1]),side=3)
> xv <- 0:1000
> yv <- predict(modell, list(resources=xv),type="response")
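To overlay the fitted curve on the plot just set up, the predicted values would be drawn against xv; a one-line sketch of that drawing step (the fitted curve appears in the figure below):

lines(xv, yv)   # add the fitted logistic curve to the rug plot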
[Figure: occupied (0 to 1) against resources (0 to 1000), with rug marks for the raw 0/1 observations and the fitted curve.]
> cutr <- cut(resources,5)
> tapply(occupied,cutr,sum)
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992] 
         0         10         25         26         31 
> table(cutr)
cutr
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992] 
        31         29         30         29         31 
> probs <- tapply(occupied,cutr,sum)/table(cutr)
> probs
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992] 
 0.0000000  0.3448276  0.8333333  0.8965517  1.0000000 
attr(,"class")
[1] "table"
> probs <- as.vector(probs)
> resmeans <- tapply(resources,cutr,mean)
> resmeans <- as.vector(resmeans)
> points(resmeans,probs,pch=16,cex=2)
> se <- sqrt(probs*(1-probs)/table(cutr))
> up <- probs + as.vector(se)
> down <- probs - as.vector(se)
> for(i in 1:5) {
+   lines(c(resmeans[i],resmeans[i]),c(up[i],down[i]))}
[Figure: the same plot of occupied against resources (0 to 1000), now with the binned empirical probabilities (filled points) and their standard-error bars added.]
Various Link Functions
> grid_x <- seq(10,990,by=0.5)
> modell_p <- predict(modell, new=data.frame(resources=grid_x), type="response")
> modelp <- glm(occupied~resources, family=binomial(link=probit))
> modelp_p <- predict(modelp, new=data.frame(resources=grid_x), type="response")
> modelcl <- glm(occupied~resources, family=binomial(link=cloglog))
> modelcl_p <- predict(modelcl, new=data.frame(resources=grid_x), type="response")
> modelca <- glm(occupied~resources, family=binomial(link=cauchit))
> modelca_p <- predict(modelca, new=data.frame(resources=grid_x), type="response")
To draw …
> newdata <- data.frame(grid_x,modell_p,modelp_p,modelcl_p,modelca_p)
> library(lattice)
> print(xyplot(modell_p+modelp_p+modelcl_p+modelca_p ~ grid_x,
+        data=newdata, type="l", xlab="resources",
+        ylab="p", lwd=1.5, lty=c(1,2,3,4), col=c(1:4),
+        panel = function(x, y, ...) {
+          panel.xyplot(x, y, ...)
+          panel.text(occupy$resources, occupy$probs, "x", cex=1.5, type="p", ...)
+        }))
> legend("topleft",
+        legend=c("logit","probit","cloglog","cauchit"), lty=c(1:4),
+        col=c(1:4), lwd=1.5)
>
> par(new=F)
> points(resmeans,probs,pch=16,cex=2)
> for (i in 1:5){
+   lines(c(resmeans[i],resmeans[i]),c(up[i],down[i]))}
Binary variable
> summary(modell)

Call:
glm(formula = occupied ~ resources, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.744592   0.669923  -5.590 2.28e-08 ***
resources    0.009762   0.001568   6.227 4.77e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.170  on 149  degrees of freedom
Residual deviance:  97.152  on 148  degrees of freedom
AIC: 101.15

Number of Fisher Scoring iterations: 6

These values are only valid if the response variable is indeed binomial.

The residual deviance (also called the G-statistic) is

$$D = 2\sum_{i=1}^{n}\left[\, y_i \ln\!\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i) \right]$$
Binary variable
This dispersion parameter (φ) must be calculated:

$$\hat{\phi} = \frac{\displaystyle\sum_i r_i^{\,2}}{n-p}$$

where the $r_i$ are Pearson's residuals and $n - p$ is the residual degrees of freedom.

> (dp <- sum(residuals(modell, type="pearson")^2)/modell$df.res)
[1] 0.8472199

This suggests that the variance is about 0.85 times what the binomial model assumes.
In statistical terms, there is no overdispersion.
In biological terms, it suggests that the observations are
independent from each other and are not aggregated (i.e. clumped).

Typically, overdispersed count data follow a negative binomial distribution,
which is not part of the exponential family of distributions.
It won't be covered here, but overdispersion can be accommodated with a
quasi-likelihood fit (family="quasibinomial").
If you need it in your future work, you can also try glm.nb() in the MASS package
(see the sketch below).
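A minimal sketch of those two options: the quasi-binomial refit reuses the occupancy data, while glm.nb() is shown on simulated counts (hypothetical data, for illustration only), since it targets overdispersed counts rather than 0/1 responses:

# Quasi-binomial: same coefficient estimates as the binomial fit, but the
# standard errors are rescaled by the estimated dispersion parameter.
modelq <- glm(occupied ~ resources, family = quasibinomial)
summary(modelq)

# Negative binomial for overdispersed counts (illustrative simulated response):
library(MASS)
set.seed(1)
counts <- rnbinom(150, mu = exp(1 + 0.002 * resources), size = 1.5)
modnb  <- glm.nb(counts ~ resources)
summary(modnb)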
Binary variable
The summary table can be adjusted with the dispersion parameter.
These values can now be taken at face value.

> summary(modell, dispersion=dp)

Call:
glm(formula = occupied ~ resources, family = binomial)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.744592   0.616628  -6.073 1.26e-09 ***
resources    0.009762   0.001443   6.765 1.33e-11 ***
---
(Dispersion parameter for binomial family taken to be 0.8472199)

    Null deviance: 200.170  on 149  degrees of freedom
Residual deviance:  97.152  on 148  degrees of freedom
AIC: 101.15

Number of Fisher Scoring iterations: 6

How good is the model?
1 - (Res. Dev. / Null Dev.) = 1 - (97.152 / 200.170) = 51.47 %
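That proportional reduction in deviance can be computed directly from the fitted object (a minimal sketch; this is one informal "pseudo R-squared" among several):

# proportion of the null deviance explained by the model
1 - modell$deviance / modell$null.deviance   # approximately 0.5147 here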
> summary(modell)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.744592   0.669923  -5.590 2.28e-08 ***
resources    0.009762   0.001568   6.227 4.77e-10 ***

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.170  on 149  degrees of freedom
Residual deviance:  97.152  on 148  degrees of freedom
AIC: 101.15

> summary(modelp)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.1437759  0.3448511  -6.217 5.08e-10 ***
resources    0.0055046  0.0007811   7.047 1.82e-12 ***

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.170  on 149  degrees of freedom
Residual deviance:  97.024  on 148  degrees of freedom
AIC: 101.02
> summary(modelcl)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.5902574  0.4293153  -6.033 1.60e-09 ***
resources    0.0053519  0.0008337   6.419 1.37e-10 ***

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.17  on 149  degrees of freedom
Residual deviance: 102.30  on 148  degrees of freedom
AIC: 106.30

> summary(modelca)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.540198   1.644250  -3.369 0.000753 ***
resources    0.014612   0.004205   3.475 0.000510 ***

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.17  on 149  degrees of freedom
Residual deviance:  99.69  on 148  degrees of freedom
AIC: 103.69
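A compact way to line up the four link functions (a minimal sketch; it simply collects the AIC values already shown above):

# Smaller AIC is better; here probit and logit fit almost identically,
# while cloglog and cauchit lag slightly behind.
AIC(modell, modelp, modelcl, modelca)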
Bootstrapping
> modell <- glm(occupied~resources, family=binomial)
> bcoef <- matrix(0,1000,2)
> for (i in 1:1000){
+   indices <- sample(1:150,replace=T)
+   x <- resources[indices]
+   y <- occupied[indices]
+   modell <- glm(y~x, family=binomial)
+   bcoef[i,] <- modell$coef }
>
> par(mfrow=c(1,2))
> plot(density(bcoef[,2]),xlab="Coefficient of x",main="")
> abline(v=quantile(bcoef[,2],c(0.025,0.975)),lty=2, col=4)
> plot(density(bcoef[,1]),xlab="Intercept",main="")
> abline(v=quantile(bcoef[,1],c(0.025,0.975)),lty=2, col=4)
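The 2.5% and 97.5% quantiles drawn as dashed lines can also be printed directly (a minimal sketch using the bcoef matrix built above):

# 95% percentile bootstrap intervals for the slope and the intercept
quantile(bcoef[, 2], c(0.025, 0.975))   # coefficient of x
quantile(bcoef[, 1], c(0.025, 0.975))   # intercept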
[Figure: kernel density of the 1000 bootstrap estimates; left panel, the coefficient of x (roughly 0.005 to 0.020); right panel, the intercept (roughly -8 to -2). Dashed lines mark the 2.5% and 97.5% quantiles.]
Jackknifing
> jcoef <- matrix(0,150,2)
> for (i in 1:150) {
+   modelj <- glm(occupied[-i]~resources[-i], family=binomial)
+   jcoef[i,] <- modelj$coef
+ }
>
> par(mfrow=c(1,2))
> plot(density(jcoef[,2]),xlab="Coefficient of x",main="")
> abline(v=quantile(jcoef[,2],c(0.025,0.975)),lty=2, col=4)
> plot(density(jcoef[,1]),xlab="Intercept",main="")
> abline(v=quantile(jcoef[,1],c(0.025,0.975)),lty=2, col=4)
[Figure: kernel density of the 150 leave-one-out (jackknife) estimates; left panel, the coefficient of x (roughly 0.0098 to 0.0106); right panel, the intercept (roughly -4.00 to -3.70). Dashed lines mark the 2.5% and 97.5% quantiles.]
C.I.’s
> library(boot)
> reg.boot <- function(regdat, index){
+   x <- resources[index]
+   y <- occupied[index]
+   modell <- glm(y~x, family=binomial)
+   coef(modell) }
> reg.model <- boot(occupy, reg.boot, R=10000)
> boot.ci(reg.model, index=2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

Intervals :
Level      Normal              Basic
95%   ( 0.0059,  0.0128 )   ( 0.0051,  0.0120 )

Level     Percentile            BCa
95%   ( 0.0075,  0.0144 )   ( 0.0070,  0.0132 )
Calculations and Intervals on Original Scale
> jack.after.boot(reg.model, index=2)

[Figure: jackknife-after-bootstrap plot for the coefficient of x; the 5, 10, 16, 50, 84, 90 and 95 %-iles of (T*-t) are traced against the standardized jackknife values, with observation numbers along the bottom. Observation 108 stands out from the rest.]
108th observation?
> occupy[105:110,]
    resources occupied
105  703.1783        1
106  710.1274        1
107  716.7298        1
108  717.1994        0
109  733.3538        1
110  736.3060        1
> plot(resources, occupied)
> text(resources[108],occupied[108],"Here",cex=1.5,col="blue",pos=3)

OR

> fat.arrow <- function(size.x=0.5,size.y=0.5,ar.col="red"){
+   size.x <- size.x*(par("usr")[2]-par("usr")[1])*0.1
+   size.y <- size.y*(par("usr")[4]-par("usr")[3])*0.1
+   pos <- locator(1)
+   xc <- c(0,1,0.5,0.5,-0.5,-0.5,-1,0)
+   yc <- c(0,1,1,6,6,1,1,0)
+   polygon(pos$x+size.x*xc, pos$y+size.y*yc, col=ar.col) }
> fat.arrow()
[Figure: occupied (0 to 1) against resources (0 to 1000), with the 108th observation flagged ("Here") on the plot.]
Yoon G Kim, ygk1@humboldt.edu
Thank You!