Associating Genomic Variations with Phenotypes

Model comparison, rare variants,
and analysis pipeline
Qunyuan Zhang
Division of Statistical Genomics & Genome Institute
Washington University School of Medicine
Data & Question
i     Y     X
1     y1    x11   x12   ...   x1m
2     y2    x21   x22   ...   x2m
...   ...   ...   ...   ...   ...
n     yn    xn1   xn2   ...   xnm

Y: phenotypes (quantitative, categorical)
X: genotypes (SNP, insertion, deletion, duplication, inversion, translocation, …)
Question: what is the relationship between X and Y?
Linkage & Association
i     Y     Genotypes
1     y1    x11   q12   x12   ...
2     y2    x21   q22   x22   ...
...   ...   ...   ...   ...   ...
n     yn    xn1   qn2   xn2   ...

Y: phenotype
X: observed marker genotypes
Q: putative QTL (r1 Q r2); Q is unobservable
Association: (Y, X)
Linkage: (Y, Q)
A Fixed-effect Mixture Model For Linkage
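Commonly used in plant genetics
Cross: P1 × P2 => F1 => F2
Markers SNP A and SNP B, with the putative QTL Q between them (r1 Q r2)

$$
f(y_i) = \sum_{j=1}^{3} P(Q_j \mid X_i, r)\, \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu_j}{\sigma_j}\right)^2\right]
$$

$$
L(Y) = \prod_{i=1}^{n} f(y_i)
$$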
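The mixture likelihood can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the conditional probabilities P(Q_j | X_i, r) are taken as already computed from the flanking markers, and the function names and toy numbers are hypothetical.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def mixture_density(y, probs, mus, sigmas):
    """f(y_i) = sum_j P(Q_j | X_i, r) * N(y_i; mu_j, sigma_j^2)."""
    return sum(p * normal_pdf(y, m, s) for p, m, s in zip(probs, mus, sigmas))

def log_likelihood(ys, prob_rows, mus, sigmas):
    """log L(Y) = sum_i log f(y_i); prob_rows[i] holds P(Q_j | X_i, r) for subject i."""
    return sum(math.log(mixture_density(y, probs, mus, sigmas))
               for y, probs in zip(ys, prob_rows))

# Hypothetical toy data: three subjects, three QTL genotype classes.
ys = [9.8, 12.1, 14.2]
prob_rows = [[0.9, 0.1, 0.0],
             [0.1, 0.8, 0.1],
             [0.0, 0.1, 0.9]]
mus = [10.0, 12.0, 14.0]      # assumed genotype means
sigmas = [1.0, 1.0, 1.0]      # assumed genotype standard deviations
loglik = log_likelihood(ys, prob_rows, mus, sigmas)
```

In an actual analysis the means, variances and QTL position would be estimated by maximizing this likelihood, for example with an EM algorithm; the sketch only evaluates it at fixed values.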
A Variance-component Model For Linkage
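Commonly used in human genetics
Markers SNP A and SNP B, with the putative QTL Q between them (r1 Q r2)

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y-\mu)^T V^{-1} (Y-\mu)\right]
$$

$$
V = \mathrm{Cov}(Y) = \Delta_Q \sigma_Q^2 + \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$\Delta_Q$: QTL IBD matrix
$\Delta_g$: background IBD matrix
$I$: diagonal unit (identity) matrix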
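As a rough sketch of how this likelihood can be evaluated, the Python below builds V from its three components and computes log L(Y) through a Cholesky factorization. The function names, the toy IBD matrices and the variance values are assumptions made for illustration; programs such as SOLAR use their own implementations.

```python
import math

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def covariance(delta_q, delta_g, var_q, var_g, var_e):
    """V = Delta_Q * sigma_Q^2 + Delta_g * sigma_g^2 + I * sigma_e^2."""
    n = len(delta_g)
    return [[delta_q[i][j] * var_q + delta_g[i][j] * var_g + (var_e if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def mvn_loglik(y, mu, v):
    """log L(Y) = -1/2 * (n log(2 pi) + log|V| + (Y - mu)^T V^{-1} (Y - mu))."""
    n = len(y)
    low = cholesky(v)
    resid = [yi - mu for yi in y]
    # Forward substitution: solve L z = resid, so z^T z equals the quadratic form.
    z = []
    for i in range(n):
        z.append((resid[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(zi * zi for zi in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

# Hypothetical sib pair: both IBD matrices have 0.5 off the diagonal.
delta_q = [[1.0, 0.5], [0.5, 1.0]]
delta_g = [[1.0, 0.5], [0.5, 1.0]]
v = covariance(delta_q, delta_g, var_q=1.0, var_g=2.0, var_e=3.0)
loglik = mvn_loglik([1.2, 0.8], 1.0, v)
```

The Cholesky route avoids forming the inverse of V explicitly, which is the usual choice when V is large.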
Variance-component Model
= Random-effect Linear Model
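$$
Y = \mu + Z_Q \gamma_Q + Z_g \gamma_g + e
$$

Random effects:

$$
\gamma_Q \sim MVN(0, \Delta_Q \sigma_Q^2), \qquad
\gamma_g \sim MVN(0, \Delta_g \sigma_g^2), \qquad
e \sim N(0, \sigma_e^2)
$$

$$
V = \Delta_Q \sigma_Q^2 + \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y-\mu)^T V^{-1} (Y-\mu)\right]
$$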
From Linkage to Association
Linkage model, with QTL effect(s) $\gamma_Q$:

$$
Y = \mu + Z_Q \gamma_Q + Z_g \gamma_g + e
$$

Family-based association model, with marker fixed effect(s) $\beta$:

$$
Y = \mu + X\beta + Z_g \gamma_g + e
$$

$$
V = \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y-\mu-X\beta)^T V^{-1} (Y-\mu-X\beta)\right]
$$
A Simple Association Model
For Unrelated Subjects
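$$
Y = \mu + X\beta + e, \qquad V = I \sigma_e^2
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y-\mu-X\beta)^T V^{-1} (Y-\mu-X\beta)\right]
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu - X_i\beta}{\sigma_e}\right)^2\right]
$$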
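With independent normal errors, maximizing the likelihood of the simple association model over μ and β is equivalent to ordinary least squares. A minimal Python sketch of a single-marker test follows; the function name and the toy genotype and phenotype values are hypothetical, and genotypes are assumed to be coded as 0/1/2 allele counts.

```python
import math

def single_marker_test(y, x):
    """Fit y = mu + x * beta + e by least squares.

    Returns (mu_hat, beta_hat, t), where t is the t statistic for H0: beta = 0
    with n - 2 degrees of freedom.
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    mu = y_bar - beta * x_bar
    rss = sum((yi - mu - beta * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)               # unbiased estimate of sigma_e^2
    se_beta = math.sqrt(sigma2 / sxx)    # standard error of beta_hat
    return mu, beta, beta / se_beta

# Hypothetical data: six unrelated subjects, one marker.
x = [0, 1, 2, 0, 1, 2]
y = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
mu_hat, beta_hat, t_stat = single_marker_test(y, x)
```

In a genome-wide scan this fit would be repeated for each marker, giving one test statistic per variant.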
Covariate(s):
Adjusting For Confounder(s)
$$
Y = \mu + X\beta + X_C \beta_C + e
$$
Observed confounders: age, sex etc.
Hidden confounders: population structure
Population structure can be estimated by:
-PCA
-Clustering
-Admixture/ancestry
Modeling Hidden Genetic Correlation
Between Subjects
$$
Y = \mu + X\beta + X_C \beta_C + Z_g \gamma_g + e
$$

$X\beta$: marker fixed effect(s)
$X_C \beta_C$: covariate fixed effect(s)
$Z_g \gamma_g$: genetic background random effects

$$
V = \Delta_g \sigma_g^2 + I \sigma_e^2
$$
Family data, pedigree => IBD matrix
Population data, hidden, marker data => IBS matrix
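For population data, one simple way to build an IBS matrix from marker data is to average, over markers, the proportion of alleles two subjects share identical by state. The Python sketch below assumes genotypes coded as 0/1/2 allele counts with no missing values; the function name and toy data are hypothetical, and real tools may use standardized or frequency-weighted variants of this measure.

```python
def ibs_matrix(genotypes):
    """Mean IBS proportion between every pair of subjects.

    genotypes[i][m] is the allele count (0, 1 or 2) of subject i at marker m.
    At one marker, two subjects share 2 - |g_i - g_k| alleles identical by state.
    """
    n = len(genotypes)
    n_markers = len(genotypes[0])
    ibs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            shared = sum(2 - abs(a - b) for a, b in zip(genotypes[i], genotypes[k]))
            ibs[i][k] = ibs[k][i] = shared / (2.0 * n_markers)
    return ibs

# Hypothetical toy data: three subjects, three markers.
toy_genotypes = [[0, 1, 2],
                 [0, 1, 2],
                 [2, 1, 0]]
toy_ibs = ibs_matrix(toy_genotypes)
```

The resulting matrix can play the role of Δ_g when no pedigree is available.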
Modeling Rare Variants
$$
Y = \mu + X\beta + X_C \beta_C + Z_g \gamma_g + e
$$

Common variants are tested individually, H0: β1 = 0, giving one p-value per variant:

$$
Y = \mu + \beta_1 X_1 + \dots
$$

Rare variants are tested as an entire group (burden test), usually by gene, H0: β1 = β2 = … = βk = 0, giving one p-value per group of variants:

$$
Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots
$$

The group test can incorporate variable selection with loose criteria.
β can be treated as random effects (variance-component test) and can be weighted by prior information.
Collapsing Model
$$
Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots
$$

$$
Y = \mu + \beta X + \dots
$$

Collapsing multiple variables into one:

subject   X1   X2   X3   X
1         0    0    0    0
2         0    1    1    1
3         1    0    0    1
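A collapsing step of this kind can be sketched in Python as follows. The function name and the toy genotypes (three subjects, three rare variants, coded 0/1 for carrier status) are illustrative assumptions.

```python
def collapse(variant_rows):
    """Collapsed indicator X per subject: 1 if the subject carries any variant, else 0."""
    return [1 if any(row) else 0 for row in variant_rows]

# Toy carrier indicators: rows are subjects, columns are X1, X2, X3.
genotypes = [[0, 0, 0],
             [0, 1, 1],
             [1, 0, 0]]
collapsed = collapse(genotypes)   # [0, 1, 1]
```

The collapsed indicator then enters the regression as a single predictor, so the group of variants is tested with one degree of freedom.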
Weighted Sum Model
$$
Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots
$$

$$
Y = \mu + \beta \sum_{j=1}^{k} (w_j X_j) + \dots
$$

$$
Y = \mu + \beta S + \dots
$$

subject   X1 (w1 = 0.2)   X2 (w2 = 0.5)   X3 (w3 = 0.3)   S
1         0               0               0               0.0
2         0               1               1               0.8
3         1               0               0               0.2
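The weighted sum S can be computed per subject as in the short Python sketch below. The function name is hypothetical; the toy genotypes and weights follow the small example on this slide.

```python
def weighted_sum(variant_rows, weights):
    """S_i = sum_j w_j * X_ij for each subject i."""
    return [sum(w * x for w, x in zip(weights, row)) for row in variant_rows]

# Toy data: rows are subjects, columns are X1, X2, X3.
genotypes = [[0, 0, 0],
             [0, 1, 1],
             [1, 0, 0]]
weights = [0.2, 0.5, 0.3]
scores = weighted_sum(genotypes, weights)
# Rounded for display; raw sums are floating-point values.
rounded_scores = [round(s, 10) for s in scores]
```

Setting every weight to 1 gives a plain count of variants carried, which is one way to see collapsing and weighted sums as members of the same family.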
Weighting Variants
- Based on allele frequency: continuous or binary (0,1) weight, variable threshold;
- Based on function annotation/prediction;
- Based on sequencing quality (coverage, mapping quality, genotyping quality, validated or not, etc.);
- Data-driven: using both genotype and phenotype data, learning weights (including effect directions) from data, requiring a permutation test;
- Any combination …
Grouping Variants
- By gene
- By transcript
- By exon
- By protein domain
- By gene set / pathway
- …
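The allele-frequency-based weights in the list above can be illustrated with a small Python sketch. Both weight functions are assumptions chosen for illustration: the binary weight uses a hypothetical MAF threshold of 0.01, and the continuous weight uses the form 1/sqrt(q(1-q)), which up-weights rarer variants; the slides do not specify which forms were used.

```python
import math

def maf(genotype_column):
    """Minor allele frequency of one variant from 0/1/2 allele counts."""
    freq = sum(genotype_column) / (2.0 * len(genotype_column))
    return min(freq, 1.0 - freq)

def binary_weight(q, threshold=0.01):
    """Weight 1 for variants rarer than the threshold, 0 otherwise (threshold is an assumption)."""
    return 1.0 if q < threshold else 0.0

def continuous_weight(q):
    """Weight 1 / sqrt(q (1 - q)); monomorphic variants get weight 0."""
    return 1.0 / math.sqrt(q * (1.0 - q)) if 0.0 < q < 1.0 else 0.0

# Hypothetical variant observed once among four subjects.
q = maf([0, 1, 0, 0])
w_binary = binary_weight(q)
w_continuous = continuous_weight(q)
```

A variable-threshold analysis would repeat the binary weighting over a range of thresholds and correct for the repeated testing, typically by permutation.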
Modeling More Data Types
Generalized Linear (Mixed) Model
$$
g(Y) = \mu + X\beta + \dots + e
$$

where g is the link function.
For binary Y, logistic model:

$$
g(Y) = \mathrm{logit}(Y) = \log\left[\frac{P(Y=1)}{1 - P(Y=1)}\right]
$$

$$
P(Y=1) = \frac{\exp(\mu + X\beta + \dots + e)}{\exp(\mu + X\beta + \dots + e) + 1}
$$
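The logit link and its inverse can be written directly in Python. This is a small sketch with hypothetical function names; `eta` stands for the linear predictor on the logit scale.

```python
import math

def logit(p):
    """log(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """P(Y = 1) = exp(eta) / (exp(eta) + 1), arranged to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (e + 1.0)

p_half = inv_logit(0.0)              # linear predictor 0 corresponds to probability 0.5
round_trip = inv_logit(logit(0.3))   # recovers 0.3 up to rounding error
```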
Longitudinal Data (quantitative)
[Figure; axis label: Time]
Fixed effect, time as covariate
Repeated measures, random effect, correlation within subjects
Longitudinal Data (binary)
[Figure; axis label: Time]
Linear model, time as covariate
Survival analysis, CoxPH model etc.
Tools
SAS procedures: REG, LOGISTIC, GENMOD, MIXED, HPMIXED, GLIMMIX, PHREG/LIFETEST
R functions/packages: lm(), glm(), gee, nlme, kinship2/coxme, lme4, survival
Other programs: SOLAR, MMAP, EMMA, EMMAX, SKAT
Pipeline
Input (data + options)
=> Job generating/submitting module, with job number controlling module
=> LSF bsub: job 1, job 2, …, job N
     Options.jobi => self-programmed modules (SAS, R, …)
     Options.jobi => external program modules (MMAP, SKAT, …)
=> Result 1, Result 2, …, Result N
=> Job status monitoring module (all done?)
     no  => wait …
     yes => Result summarizing module
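The control flow of the pipeline (submit jobs, cap the number running at once, wait until all are done, then summarize) can be mimicked locally in Python. This sketch does not call LSF or bsub; `run_job` is a hypothetical stand-in for one analysis job, and a thread pool plays the role of the job number controlling module.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_job(job_id, options):
    """Hypothetical stand-in for one analysis job (for example one chunk of markers)."""
    return {"job": job_id, "program": options["program"], "status": "done"}

def run_pipeline(n_jobs, options, max_jobs=3):
    """Submit n_jobs jobs, run at most max_jobs at a time, then summarize the results."""
    results = {}
    # max_workers caps concurrency, like the job number controlling module.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = {pool.submit(run_job, i, options): i for i in range(1, n_jobs + 1)}
        # as_completed yields each job as it finishes, like the status monitoring module.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    # Result summarizing step: collect the per-job results in job order.
    return [results[i] for i in sorted(results)]

summary = run_pipeline(5, {"program": "SASGLM"}, max_jobs=2)
```

On a cluster, the submit step would instead hand each job to the scheduler and the monitoring step would poll the scheduler for job status.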
gwas.sh options.gwa

#!/bin/sh
OPFILE=$1
...

Pheno   type   covar     program   analysis   run
Bmi     qt     age,sex   SASGLM    mixed      YES
Obes    ql     NA        SASGLM    gee        YES
HD      ql     age       SASGLM    gee        NO
Age
Sex
…       …      …

Program   language   location                  Maintainer
SASGLM    SAS        /dsg1/code/sas/glm.sas    Q.Zhang
GSTAT     R          /dsg1/code/R/gstat.R      Q.Zhang
MMAP      C          /dsg1/code/sas/mmap.sh    J. Czajkowski
…
[DATA]
database=SAS
genotype_dir=/dsg1/gwas/fhsgeno
genotype_file=
phenotype_file=fhs100
markerinfo_file=mapall
marker_selection=MAF>0.01
pedigree_file=pediall
subjectID=subject
pedgreeID=famid
markername=snp
…
[ANALYSIS]
phenolist_file=
pheno_list=bmi/qt
covariates=
program=SASGLM
analysis=mixed
[OUTPUT]
output_dir=/dsguser/qunyuan/fhs/bmi
output_file=
output_replace=no
[RUN]
clusterjobname=bmimixed
memsize=1000M
maxjobn=300
…
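An options file with this sectioned key=value layout can be read with Python's standard `configparser` module. The sketch below is an illustration only: the original pipeline reads the file from a shell script, the embedded text copies just a few of the keys shown above, and the reading of `pheno_list=bmi/qt` as phenotype/type is an assumption.

```python
import configparser

# A few keys copied from the options file shown above (not the complete file).
OPTIONS_TEXT = """
[DATA]
database=SAS
genotype_dir=/dsg1/gwas/fhsgeno
phenotype_file=fhs100
marker_selection=MAF>0.01
subjectID=subject

[ANALYSIS]
pheno_list=bmi/qt
covariates=
program=SASGLM
analysis=mixed

[RUN]
clusterjobname=bmimixed
memsize=1000M
maxjobn=300
"""

def read_options(text):
    """Parse the options text into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str          # keep key case, e.g. subjectID
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}

options = read_options(OPTIONS_TEXT)
max_jobs = int(options["RUN"]["maxjobn"])
# Assumption: pheno_list entries have the form phenotype/type.
pheno, pheno_type = options["ANALYSIS"]["pheno_list"].split("/")
```

By default `configparser` lowercases keys, which is why `optionxform` is overridden here to keep names such as `subjectID` unchanged.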
Thanks !