L4_GLM

advertisement
Genetic Association and Generalised Linear Models
Gil McVean, WTCHG
Weds 2nd November 2011
IMSGC, WTCCC2 (2011)
Questions to ask…
•
What is a linear model?
•
What is a generalised linear model?
•
How do you estimate parameters and test hypotheses with GLMs?
•
How are GLMs used in the study of genetic association?
3
What is a covariate?
•
A covariate is a quantity that may influence the outcome of interest
– Genotype at a SNP
– Age of mice when measurement was taken
– Batch of chips from which gene expression was measured
•
Previously, you have looked at using likelihood to estimate parameters of
underlying distributions
•
We want to generalise this idea to ask how covariates might influence the
underlying parameters
•
Much statistical modelling is concerned with considering linear effects of
covariates on underlying parameters
4
What is a linear model?
•
In a linear model, the expectation of the response variable is defined as a linear
combination of explanatory variables
Yi  0  1 X i  2 Zi  3 X i Zi  ... 
Response
variable
•
Intercept
Linear relationships
with explanatory
variables
Gaussian
error
Explanatory variables can include any function of the original data
Yi  0  1 X   2e
2
i
•
Interaction
term
 Zi
 3 X i  ...  
Zi
But the link between E(Y) and X (or some function of X) is ALWAYS linear and the
error is ALWAYS Gaussian
5
Quick test: which of these is NOT a linear model
yi   0  1 xi   2 zi
yi  1 exp( xi )   2 / zi
yi   0   x   2 xi zi
2
1 i
yi   0   x
zi
1 i
y i   0   1 xi
2
6
What is a GLM?
•
There are many settings where the error is non-Gaussian and/or the link between
E(Y) and X is not necessarily linear
– Discrete data (e.g. counts in multinomial or Poisson experiments)
– Categorical data (e.g. Disease status)
– Highly-skewed data (e.g. Income, ratios)
•
Generalised linear models keep the notion of linearity, but enable the use of nonGaussian error models
E(Yi )  i  g 1 0  1 X i  2 Zi  3 X i Zi  ...
•
g is called the link function
– In linear models, the link function is the identity
•
The response variable can be drawn from any distribution of interest (the
distribution function)
– In linear models this is Gaussian
7
Poisson regression
•
In Poisson regression the expected value of the response variable is given by the
exponent of the linear term
E(Yi )  i  exp0  1 X i  2 Zi  3 X i Zi  ...
•
The link function is the log
•
Note that several distribution functions are possible (normal, Poisson, binomial
counts), though in practice Poisson regression is typically used to model count
data (particularly when counts are low)
8
Example: Caesarean sections in public and private hospitals
9
Boxplots of rates of C sections
10
Fitting a model without covariates
> analysis<-glm(d$Caes ~ d$Births, family = "poisson")
> summary(analysis)
Call:
glm(formula = d$Caes ~ d$Births, family = "poisson")
Deviance Residuals:
Min
1Q
Median
-2.81481 -0.73305 -0.08718
3Q
0.74444
Max
2.19103
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Implies an average of
(Intercept) 2.132e+00 1.018e-01 20.949 < 2e-16 ***
12.7 per 1000 births
d$Births
4.406e-04 5.395e-05
8.165 3.21e-16 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 99.990
Residual deviance: 36.415
AIC: 127.18
on 19
on 18
degrees of freedom
degrees of freedom
Number of Fisher Scoring iterations: 4
11
Fitting a model with covariates
glm(formula = d$Caes ~ d$Births + d$Hospital, family = "poisson")
Deviance Residuals:
Min
1Q
Median
-2.3270 -0.6121 -0.0899
3Q
0.5398
Coefficients:
Estimate Std. Error z
(Intercept) 1.351e+00 2.501e-01
d$Births
3.261e-04 6.032e-05
d$Hospital 1.045e+00 2.729e-01
--Signif. codes: 0 ‘***’ 0.001 ‘**’
•
Max
1.6626
value
5.402
5.406
3.830
Pr(>|z|)
6.58e-08 ***
6.45e-08 ***
0.000128 ***
0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Unexpectedly, this indicates that public hospitals actually have a higher rate of
Caesarean sections than private ones
Implies an average of 15.2 per 1000 births in public hospitals
and 5.4 per 1000 births in private ones
12
Checking model fit
•
Look a distribution of residuals and how well observed values are predicted
13
What’s going on?
•
Initial summary suggested opposite result to GLM analysis. Why?
Normal
Caesarean
Private
412
16
Public
21121
285
Relative risk Private (compared to public) = 2.8
•
Relationship between no. Births and no. C sections does not appear to be linear
•
Accounting for this removes most (but not all) of the apparent differences
between hospital types
•
There is also one quite influential outlier
14
Hospitals with fewer births tend to have more Caesarean sections
15
Simpson’s paradox
•
Be careful about adding together observations – this can be misleading
•
E.g. Berkeley sex-bias case
Applicants
% admitted
Men
8442
44%
Women
4321
35%
Major
Men
Appears that women have lower success
Women
Applicants
% admitted
Applicants
% admitted
A
825
62%
108
82%
B
560
63%
25
68%
C
325
37%
593
34%
D
417
33%
375
35%
E
191
28%
393
24%
F
272
6%
341
7%
But actually women are
typically more successful
at the Departmental
level, just apply to more
competitive subjects
16
Finding MLEs in GLM
•
In linear modelling we can use the beautiful compactness of linear algebra to find
MLEs and estimates of the variance for parameters
•
Consider an n by k+1 data matrix, X, where n is the number of observations and k
is the number of explanatory variables, and a response vector Y
– the first column is ‘1’ for the intercept term
•
The MLEs for the coefficients () can be estimated using


ˆβ  XT X 1 XT Y
•
βˆ ~ N (β, (XT X)1 2 )
In GLMs, there is usually no such compact analytical expression for the MLEs
– Use numerical methods to maximise the likelihood
17
Testing hypotheses in GLMs
•
For the parameters we are interested in we typically want to ask how much
evidence there is that these are different from zero
•
For this we need to construct confidence intervals for the parameter estimates
•
We could estimate the confidence interval by finding all parameter values with
log-likelihood no greater than 1.94 units worse than the MLE
•
Alternatively, we might use bootstrap resampling techniques to estimate the
distribution of parameter estimates
•
However, we can also appeal to theoretical considerations of likelihood (based on
the CLT) that show that parameter estimates are asymptotically normal with
variance described by the Fisher information matrix
•
Informally, the information matrix describes the sharpness of the likelihood curve
around the MLE and the extent to which parameter estimates are correlated
18
Logistic regression
•
When only two types of outcome are possible (e.g. disease/not-disease) we can
model counts by the binomial
•
If we want to perform inference about the factors that influence the probability of
‘success’ it is usual to use the logistic model
exp 0  1 X i   2 Z i  ...
E (Yi ) 
1  exp 0  1 X i   2 Z i  ...
•
The link function here is the logit
  

g ( )  log
1  
19
Example: testing for genotype association
•
In a cohort study, we observe the number of individuals in a population that get a
particular disease
•
We want to ask whether a particular genotype is associated with increased risk
•
The simplest test is one in which we consider a single coefficient for the genotypic
value
Genotype
AA
Aa
AA
Genotypic
value
0
1
2
Frequency in
population
p2
2p1p
1p2
Probability of
disease
p0
2
0 = -4
1 = 2
1
pi 
p1
p2
1
1 e
[  0  1Gi ]
0
20
A note on the model
•
Note that each copy of the risk allele contribute in an additive way to the
exponent
•
This does not mean that each allele ‘adds’ a fixed amount to the probability of
disease
•
Rather, each allele contributes a fixed amount to the log-odds
•
This has the effect of maintaining Hardy-Weinberg equilibrium within both the
cases and controls
 P(disease) 
   0  1Gi
log odds  log
 P( Not disease) 
21
Concepts in disease genetics
•
Relative risk describes the risk to a person in an exposed group compared to the
unexposed group
RR 
•
P(disease | exposed)
P(disease | not exposed)
The odds ratio compares the odds of disease occurring in one group relative to
another
 P(disease | exposed) 


P
(
not
disease
|
exposed
)

OR  
 P(disease | not exposed) 


 P(not disease | not exposed) 
•
If the absolute risk of disease is low the two will be very similar
22
Cont.
•
Suppose in a given study we observe the following counts
Genotype
0
1
2
Counts with
disease
26
39
21
1298
567
49
Counts without
disease
•
We can fit a GLM using the logit link function and binomial probabilities
•
We have genotype data stored in the vector gt and disease status in the vector
status
•
Using R, this is specified by the command
– glm(formula = status ~ gt, family = binomial)
23
An important note
•
In case-control designs, it is actually the genotype that is the random variable, not
the outcome
•
There is some theory that says that estimates of coefficients are equivalent under
prospective or retrospective approaches in CC designs
– Prentice & Pyke (1979) Biometrika 66:403.
•
However, the two approaches are not fully equivalent as the CC design creates
artificial association between causal factors that are independent in the
population at large
24
Interpreting results
Measure of contribution of
individual observations to overall
goodness of fit (for MLE model)
MLE for coefficients
Call:
glm(formula = status ~ gt, family = binomial)
Standard error for estimates
Deviance Residuals:
Min
1Q
Median
-0.8554 -0.4806 -0.2583
Estimate/std. error
3Q
-0.2583
Max
2.6141
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.6667
0.2652 -17.598
<2e-16 ***
gt
1.2833
0.1407
9.123
<2e-16 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
P-value from normal distribution
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 967.36
Residual deviance: 886.28
AIC: 890.28
on 1999
on 1998
degrees of freedom
degrees of freedom
Number of Fisher Scoring iterations: 6
Measure of goodness of fit of null
(compared to saturated model)
Measure of goodness of fit of fitted
model
Penalised likelihood used in model
choice
Number of iterations used to find
MLE
25
Adding in extra covariates
•
Adding in additional explanatory variables in GLM is essentially the same as in
linear model analysis
•
Likewise, we can look at interactions
•
In the disease study we might want to consider age as a potentially important
covariate
glm(formula = status ~ gt + age, family = binomial)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.00510
0.79209 -8.844
<2e-16 ***
gt
1.46729
0.17257
8.503
<2e-16 ***
age
0.04157
0.01903
2.185
0.0289 *
26
Adding model complexity
•
In the disease status analysis we might want to generalise the fitted model to one
in which each genotype is allowed its own risk
pi 
1
1 e
[  0  1I Gi 1  1I Gi 2 ]
glm(formula = status ~ g1 + g2 + age, family = binomial)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.46870
0.73155 -7.476 7.69e-14 ***
g1TRUE
1.25694
0.26278
4.783 1.72e-06 ***
g2TRUE
3.00816
0.33871
8.881 < 2e-16 ***
age
0.04224
0.01915
2.206
0.0274 *
--Null deviance: 684.44 on 1999 degrees of freedom
Residual deviance: 609.09 on 1996 degrees of freedom
AIC: 617.09
27
Model choice
•
It is worth remembering that we cannot simply identify the ‘significant’
parameters and put them in our chosen model
•
Significance for a parameter tests the marginal null hypothesis that the coefficient
associated with that parameter is zero
•
If two explanatory variables are highly correlated then marginally neither may be
significant, yet a linear model contains only one would be highly significant
•
There are several ways to choose appropriate models from data. These typically
involve adding in parameters one at a time and adding some penalty to avoid
over-fitting.
28
The WTCCC
•
One of the first large-scale genome-wide association studies, carried out on 7
diseases using c. 2,000 cases from a variety of diseases and shared controls
–
–
–
–
–
–
–
Type 1 diabetes
Type 2 diabetes
Rheumatoid arthritis
Hypertension
Bipolar disorder
Crohn’s disease
Coronary artery disease
•
Results published in Nature (2007)
•
This was an important study because it established a series of common statistical
principles used in the analysis of GWAS studies
– As well as identifying many novel regions associated with common complex diseases
29
Checking for differentiation between control groups
Manhattan plots
Sampling
distributions for
QQplots
30
A rationale for setting a threshold for ‘genome-wide
significance’
P < 5 x 10-7
31
A method for ‘imputing’ data from additional references
resources
32
Coping with population structure
•
Genomic control
•
PCA
•
Mixed models
33
Many open questions
•
Given the data collected, where do we think the causal mutation is?
•
To what extent can we ever ‘fine-map’ the causal variant in
– The population we start with?
– Other populations?
•
Can we use resequencing or imputation from (e.g. 1000 Genomes Project) to finemap without additional worK?
34
Download