A Statistical Method for Adjusting Covariates in Linkage Analysis With Sib Pairs

advertisement
Presenter’s Name: Colin O. Wu
National Heart, Lung, and Blood Institute
National Institutes of Health
Department of Health and Human Services
A Statistical Method for Adjusting
Covariates in Linkage Analysis
With Sib Pairs
Colin O. Wu, Gang Zheng, JingPing Lin,
Eric Leifer and Dean Follmann
Office of Biostatistics Research, DECA, NHLBI
1. Framingham Heart Study (GAW13)
1.1 First Generation:
 5209 subjects (2336 men & 2873 women);
 29 to 62 years old when recruited;
 1644 spouse pairs;
 Continuously examined every 2 years since
1948
– medical history
– physical exams
– laboratory tests.
1.2 Second Generation (full dataset):
 5124 of the original participants’ adult children
& spouses of these adult children;
 2616 subjects are offspring of original spouse
pairs;
 34 are stepchildren;
 898 offspring are children with only one
parent in the study;
 1576 are spouses of the offspring;
 Offspring cohort followed every 4 years;
 Interval between Exams 1&2 is 8 years.
1.3 Second Generation (sib-pair subset)
 482 multi-sib families
– from 330 pedigrees;
 Observed trait:
systolic blood pressure;
 Covariates:
1. age (in years),
2. gender (0=female, 1=male),
3. drinking
(average daily alcohol consumption in ml).
 Genotype data:
398 random markers with an average of
10cM apart.
2. Methods for Linkage Analysis
2.1 Methods based on identity by descent (IBD)
 Association in pedigrees between phenotype
and IBD sharing at loci linked to trait loci;
 Linkage for qualitative traits
– IBD sharing conditional on phenotypes:
e.g. affected sib-pair methods
(Hauser & Boehnke, 1998).
 Linkage for quantitative trait loci (QTL)
– phenotypes conditional on IBD sharing,
e.g. Haseman & Elston (1972), Amos (1994);
– extremely discordant sib-pairs,
e.g. Risch & Zhang (1995, 1996).
2.2 The Haseman-Elston method
X 1 j , X 2 j : observed traits for 1st and 2nd sibs;
Yj   X1 j  X 2 j 
2
 squared trait difference in jth pair;
X ij    gij  eij ,
 : the overall mean trait value,
gij : the genetic effect on the  i, j  -th sib,
eij : the environmental effect on the  i, j  -th sib.
HE Model without Covariate Adjustment:
Assumptions: (i) one locus determines gij , (ii) two
alleles, B and b, (iii) gene frequencies p and q.
Genotypic values:
a for a BB individual;
gij  d for a Bb individual;
a for a bb individual.
E Y j |  j      j ,
 j  proportion of genes IBD for the j -th pair,
 ,  : unknown parameters.
Linkage:
Negative   potential linkage between
QTL & marker locus.
Hypothesis Testing Problem:
H 0 :   0 (no linkage)
H1 :   0 (linkage)
Limitations:
 Covariate effects are not included.
 Genetic and environmental effects are
additive.
 Method may not have sufficient power.
2.3 HE Method with Linear Covariate
Adjustment (SAGE SIBPAL)
 Involve families with more than 2 sibs.
 Can use other measures of trait difference
– e.g. the mean-corrected cross-product.
 Include covariate effects in linear regression
– e.g. Elston, Buxbaum, Jacobs and Olson
(2000)
“Haseman and Elston Revisted”.
Linear generalization:
Z ij ,..., Z ij : covariates for  i, j  -th sib,
1
 p

1
 p
Z ij   j , Z ij ,..., Z ij

T
: covariate vector,
l 
Z j : covariate for the j -th sib pair,

l 
l 
l 
e.g. Z j  Z1 j  Z 2 j

 0
 p
Z j  Z j ,..., Z j

1

T
l 
Z1 j  Z 2 j ,
 0
, Zj   j.
 p
E Y j |  j , Z j ,..., Z j
 or
l 
       Z .
p
l 0
l 
l
j
0  0  linkage.
Linkage:
Covariate effects:
l  0, l  1,..., p  effect of the l -th covariate.
Limitation:
Only the information in Z j is used

Z
l 
1j
l 
, Z2 j
 is reduced to Z
l 
j
.
2
For example, use Z j  age1 j  age 2 j .
3. The Proposed Method
3.1. Modeling the covariates
Goal: To generalize the HE regression model


that includes the covariates Z1 lj , Z 2 lj .
Assumptions:
(i) The covariates are not affected by the gene
and the environment.
(ii) The effects of gene and environment are
additive.
 Cross-sectional data:

1
 p
X ij   Z ij ,..., Z ij

1
 p
 Z ij ,..., Z ij
 g
ij
 eij ,
  mean of X
Equivalent form:

1

 p
given Z ij ,..., Z ij
ij
 p
X  X ij   Z ij ,..., Z ij
*
ij
1

 covariate adjusted trait
 gij  eij .
.
Regression models for covariates:
Linear model:

1
 p
 Zij ,..., Zij
      Z .
p
0
l 
l
l 1
ij
Equivalently,
p

l  
*
X ij  X ij   0    l Z ij 
l 1


 gij  eij ;


   0 ,..., p  : linear coefficients.
T
General parametric models
 e.g. nonlinear models :

1
 p
 Z ij ,..., Z ij
    Z
1
ij
 p
,..., Z ij
 ;  .
Nonparametric models  Hardle, 1991 :

1
 p
 Z ij ,..., Z ij
  smooth function of Z
l 
ij
.
Semiparametric models (Bickel et al., 2000).
 Longitudinal data:
(Repeated measurements over time)
For j -th sib pair:
n1 j  n2 j  n j  number of repeated measurements,
Tijk  time of the k -th measurement,
k  1,..., nij .
X
*
ijk

1
 p
 X ijk   Tijk , Z ijk ,..., Z ijk
= gij  eij .

Notation:
X ijk  observed trait,
1

 p
Z ijk ,..., Z ijk  covariates at time Tijk ,
1
 p
 Tijk , Z ijk ,..., Z ijk
  conditional mean of X
ijk
X ijk*  covariate adjusted trait.
Linear model (Verbeke & Molenberghs, 2000):
X
*
ijk
 X ijk

l  
  0   00Tijk    l Zijk  ,
l 1


p


   0 , 00 ,1 ,..., p  : linear coefficients.
,
3.2 Covariate adjusted linkage detection
 General Procedure:
 Select a regression model for the covariates.
 Estimate the covariate adjusted trait based on
the above regression model.
 Apply the linkage procedures, such as the HE
model or the variance-components model,
using the estimated adjusted trait values and
genotypic values.
3.3 Cross-sectional data
Covariate adjusted HE model:
Y  X  X
*
j
*
1j

* 2
2j
: adjusted squared trait difference.
Same derivation in HE (1972) 
E Y j* |  j      j .
 ,  : unknown parameters.
Testing problem:
H 0 :   0 (no linkage),
H1 :   0 (linkage).
Estimation of adjusted trait values:
Data from sib pairs are correlated
 Existing estimation methods for
independent data can not be directly
applied.
Two approaches:
1) Use methods for correlated data, such as
GEE – treat each family as a subject, each
member as a single observation.
2) Resample independent observations:
i.
Randomly sample one member from each
family.
ii. Estimate the parameters and adjusted trait
values using the re-sampled data and
procedures for independent data, such as
LSE, MLE, etc.
iii. Repeat the previous steps many times and
compute the estimates using the average of
the estimates from the re-sampled data.
This leads to consistent estimates when the
sample size (number of families) is large
(Hoffman et al., 2001).
Procedure for linear adjustment model:
Step 1: Estimate  by ˆ, a consistent estimator.
*
ij
*
j
Step 2: Estimate X and Y by

 
1
 p
*
ˆ
X ij  X ij   Z ij ,..., Z ij ;ˆ ,

Yˆ  Xˆ  Xˆ
*
j
*
1j
*
2j
.
2
Step 3: Fit the HE model using Yˆj* and test
 =0 vs.  <0.
3.5 Longitudinal data
Y  X
*
jk
*
1 jk
X
*
2 jk

2
 adjusted squared difference in j -th pair
at k -th measurement;
nj
Y j*   Y jk* / n j  : mean adjusted difference;
k 1
E Y j* |  j      j .
Linkage detection:
  0 (no linkage);   0 (linkage).
Two sources of potential correlations in the
estimation of adjusted trait values:
i. Correlation within a sib
 intra-subject correlation.
ii. Correlation between sibs within a family
 intra-family correlation.
 Nested longitudinal data.
 Methods for longitudinal estimation can not
be directly applied
(Morris, Vannucci, Brown and Carroll, 2003,
JASA).
Resampling approach:
i. Randomly select one sib from each family
 Resampled data contain repeated
measurements of independent sibs.
ii. Estimate the covariate adjusted trait values
from the above resampled data based on
longitudinal estimation methods (GEE,
MLE, REMLE, etc.).
iii. Repeat the above steps many times and
estimate the parameters using the
averages of the estimates from the
resampled data.
iv. Fit the HE model using existing procedure.
4. Framingham Heart Study
Features of the data:
 Clustered data from families;
 Repeated measurements;
 Multi-sib families;
 Continuous and categorical covariates.
Variables:
 Quantitative trait: SBP;
 Covariates: age, gender (0=female, 1=male),
drinking (average daily consumption).
SAGE HE Model:
nj
Use Y j    X 1 jk  X 2 jk  / n j in place of Y j ;
2
k 1
1
2
Z j  age1 j  age 2 j : age difference between sibs;
Z j 2  gender1 j  gender2 j : 0 if same sex, 1 if different sex;
 3
2
Z j  drinking1 j  drinking 2 j ;
Fit regression:
E Y j |  j , Z j      j  1 Z j   2 Z j   3 Z j .
1
No linkage:   0; Linkage:   0.
 2
 3
New HE Model:
Use linear model with 1000 resampling replications;
  age, gender, drinking  ;    0  1  age
  2  gender + 3  drinking;


*
X ijk
 X ijk    age, gender, drinking  ;ˆ ;
nj
2
Y j*    X 1*jk  X 2* jk  / n j : average squared trait difference.
k 1
Fit regression:
E Y j* |  j , Z j      j .
No linkage:  =0; Linkage:  <0.
5. Discussion
 Advantages for covariate adjustment:
 small variation for the estimates;
 better interpretation for the model.
 Directions of further research:
 Non-additive models, e.g. covariate-gene and
covariate-environment interactions;
 Covariate adjustment with other measures of
the trait difference;
 Methods of model selection;
 Models with general pedigrees.
Download