Case-control association techniques
in genetic studies
March 10, 2011
Karen Curtin, Ph.D.
Division of Genetic Epidemiology and
HCI Pedigree & Population Resource (PPR)
‹#›
Presentation outline
Background (genetics concepts)
Basic case-control association
Complex case-control association
Genome-wide association
‹#›
The Human Genome: 6 billion DNA bases
(Adenine, Cytosine, Guanine, or Thymine)
‹#›
License: Creative Commons Attribution 2.0
Genotype and Haplotype
Chromosome 1: … A G C C A A A [C] T G [A] A T T C …
Chromosome 2: … A G C C A A A [T] T G [G] A T T C …
At any locus (position on a chromosome):
Read across both chromosomes: genotype CT (at the first marked locus)
Read along a chromosome: haplotypes C-A and T-G
If allele T can predict allele G, the two alleles are in
Linkage Disequilibrium (LD)
‹#›
Single Nucleotide Polymorphism (SNP)
Two alternate forms (alleles) that differ
in sequence at one point in a DNA
segment
90% of genomic
variants are SNPs
Source: David Hall, Creative Commons Attribution 2.5 license
Genetic variants: Germline v Somatic
• Germline variant/mutations
– Inherited/In-born mutation
– In all cells
– In particular, in germline haploid cells
• Heritable
– Cell division - meiosis
• Somatic variants/mutations
– Acquired mutation
– Only in an isolated number of cells (tumor site)
• Generally not heritable
– Cell division - mitosis
‹#›
Hereditary mutation - meiosis
[Figure: parent germ cells divide into HAPLOID daughter cells; daughter cells combine (X) to form new DIPLOID zygotes]
‹#›
Genetic variants in association studies
Association:
two characteristics (disease & genetic variant) occur more
often together than expected by chance
• Direct Association / Causal
Functional variant → Disease
– Functional variant is involved in disease
– Functional variant is associated with the disease
• Indirect Association
Genetic variant ↔ Functional variant → Disease
– Genetic variant (SNP) is associated/correlated with the underlying functional variant
– Functional variant is involved in disease
– Genetic variant (marker) is associated with disease (an initial step; the ultimate goal is to discover the causal variant)
‹#›
Genetic association study Designs
• Observational
– Exposure variables
• Genetic variants
• Environmental factors
• Classical association study designs
– Unit of interest is an individual
– Cohort study (cross-sectional or longitudinal)
– Case-control study
• Family-based association study
– Unit of interest is a family unit
‹#›
Case-Control Study
• Sample individuals based on disease status and without
knowledge of exposure status (e.g. genotype)
– CASES (with disease)
– CONTROLS (no disease)
• Usually balanced design (#cases = #controls)
• Retrospective
• Neither prevalence nor incidence can be estimated
‹#›
Types of Case-Control Study
• Population-based
– Risk estimates can be extrapolated to the
source population
– Could be nested in a cohort study
• Selected sampling
– Increases power to detect associations
• Antoniou & Easton (2003)
– Tests of independence are valid
– True positive risks are exaggerated
• Can not be extrapolated
‹#›
Case-Control: Population-based
• Source population
– All individuals satisfying predefined criteria
• Source cohort
– A group that is ‘representative’ of the source population
– CASES and CONTROLS occur in relation to population prevalence
• CASES
– Cases selected are ‘representative’ of cases in the source cohort
– In particular, in terms of the exposure variables
• CONTROLS
– Controls selected are ‘representative’ of controls in the source cohort
– In particular, in terms of the exposure variables
• Odds Ratio (estimate of the relative risk) can be extrapolated back
to the source population
– Population Attributable Risk (PAR)
‹#›
Case-Control: Selected Sampling
• Source population
– All individuals satisfying predefined criteria
• Source cohort
– A group that is ‘representative’ of the source population
– CASES and CONTROLS occur in relation to population prevalence
• CASES
– Cases selected are in effect selectively sampled from cases in source cohort
– Family history of disease, severe disease, early onset,…
• CONTROLS
– Controls selected are in effect selectively sampled from controls in source cohort
– Screened negative, no family history,…
• Association analyses are still valid and power may be increased, BUT…
– Odds Ratio (estimate of the relative risk) cannot be extrapolated back to the
source population
‹#›
Case-Control Study: Odds Ratio
                 Exposure
Disease          Yes    No
Cases (Yes)      a      b
Controls (No)    c      d
$OR = \frac{a / b}{c / d} = \frac{a \times d}{b \times c}$
H0: OR = 1 same risk (no association)
OR > 1 indicates increased risk
OR < 1 indicates decreased risk (protective)
‹#›
95% confidence intervals for the Odds Ratio
Lower and Upper bounds for the risk estimates.
Two common methods:
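1) $\left( e^{\ln(OR) - 1.96\,se(\ln(OR))},\ e^{\ln(OR) + 1.96\,se(\ln(OR))} \right)$
where $se(\ln(OR)) = \sqrt{1/a + 1/b + 1/c + 1/d}$
2) $\left( OR^{\,1 - 1.96/\chi},\ OR^{\,1 + 1.96/\chi} \right)$
where $\chi$ is the square root of the chi-square statistic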
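The first (log-based) interval can be sketched as follows. This is a minimal illustration in Python, not part of the original slides; the function name `odds_ratio_ci` is hypothetical, and the counts are the dominant-model cells used later in this deck, arranged as exposed/unexposed cases (100, 120) and exposed/unexposed controls (60, 120).

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a log-based confidence interval.

    a, b: exposed / unexposed cases; c, d: exposed / unexposed controls.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Cases: 100 exposed (CT or TT), 120 unexposed (CC); controls: 60 and 120
odds_ratio, lower, upper = odds_ratio_ci(100, 120, 60, 120)
print(f"OR = {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 corresponds to rejecting H0: OR = 1 at roughly the 5% level.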
‹#›
chi-square test
Compares observed values (O) with those expected
under independence between rows and columns
Expected (E) = (row total × column total) / N
chi-square statistic, with (rows−1) × (columns−1) degrees of freedom:
$\chi^2 = \sum \frac{(O - E)^2}{E} \sim \chi^2_{(rows-1)(columns-1)}$
‹#›
Test for Non-independence
H0: Disease and exposure (genotype)
are independent
chi-square tests: contingency tables
2×3 genotype table (2 df)
2×2 grouped genotype table (1 df)
• Dominant or recessive
2×3 ‘dose-dependent’ table
• Armitage test for trend (1 df)
2×2 allele table (1 df)
‹#›
Modeling genetic exposures
• Exposure = genotype
• Single variant with 2 alleles (SNP)
• Three genotypes: CC, CT, TT
• 2×3 contingency table
– Chi-sq 2df
– Chi-sq 1df (impose a linear dependency between columns)
            CC    CT    TT
Controls    a     b     c
Cases       d     e     f
‹#›
Mode of Expression / Inheritance
• Let allele C be disease causing
• Examples of modes of expression are:
1. Dominant (TT vs. TC, CC)
• Individuals heterozygous or homozygous for the C allele develop the disease
2. Recessive (TT, TC vs. CC)
• Only individuals homozygous for the C allele develop the disease
3. Codominant (TT vs. TC vs. CC)
• All three genotypes can be distinguished phenotypically
• ‘Additive’ model – TC has r-fold risk, CC has 2r-fold risk
‹#›
chi-square test
Observed (expected)   CC          CT        TT        Totals
Controls              120 (120)   40 (50)   20 (30)   200
Cases                 120 (120)   60 (50)   40 (30)   200
Totals                240         100       60        400
$\chi^2 = \frac{(120-120)^2}{120} + \frac{(40-50)^2}{50} + \frac{(20-30)^2}{30} + \frac{(120-120)^2}{120} + \frac{(60-50)^2}{50} + \frac{(40-30)^2}{30}$
Chi-statistic = 10.67
p-value=0.0048 (for a chi-square distribution with 2 df)
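The arithmetic above can be checked with a short sketch (Python, standard library only; the function names are hypothetical). One caveat: the cell counts as listed sum to 180 controls and 220 cases, whereas the expected values on the slide assume 200 per row. The sketch therefore evaluates the statistic both with the slide's expected values and with expected values recomputed from the cell margins.

```python
import math

def pearson_chi2(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(observed, expected)
               for o, e in zip(row_o, row_e))

def expected_counts(observed):
    """Expected counts under independence: row total x column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

def p_value_2df(chi2):
    """Upper tail of a chi-square with 2 df, which has the closed form exp(-x/2)."""
    return math.exp(-chi2 / 2)

observed = [[120, 40, 20],   # controls: CC, CT, TT
            [120, 60, 40]]   # cases:    CC, CT, TT
slide_expected = [[120, 50, 30], [120, 50, 30]]

chi2_slide = pearson_chi2(observed, slide_expected)
chi2_margins = pearson_chi2(observed, expected_counts(observed))
print(round(chi2_slide, 2), round(p_value_2df(chi2_slide), 4))
print(round(chi2_margins, 2), round(p_value_2df(chi2_margins), 4))
```

With the slide's expected values the statistic reproduces 10.67; with margins taken from the cells it comes out smaller, so the conclusion depends on which row totals are intended.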
‹#›
Genotypic relative risk
• Assess risk (OR) for each genotype relative to the
homozygous common genotype
ORhet (CT vs. CC) = (a × e) / (b × d)
ORhzv (TT vs. CC) = (a × f) / (c × d)
Genotype (exposure)
            CC    CT    TT
Controls    a     b     c
Cases       d     e     f
‹#›
chi-square test / genotypic relative risk
Observed (expected)   CC          CT        TT        Totals
Controls              120 (120)   40 (50)   20 (30)   200
Cases                 120 (120)   60 (50)   40 (30)   200
Totals                240         100       60        400
$\chi^2 = \frac{(120-120)^2}{120} + \frac{(40-50)^2}{50} + \frac{(20-30)^2}{30} + \frac{(120-120)^2}{120} + \frac{(60-50)^2}{50} + \frac{(40-30)^2}{30}$
Chi-statistic = 10.67
p-value=0.0048 (for a chi-square distribution with 2 df)
ORhet (CT vs. CC) = 1.5
ORhzv (TT vs. CC) = 2.0
‹#›
Dominant model for exposure
Exposure = CT&TT genotypes - 2×2 test with 1 df
ORdom = a × (e+f) / (d × (b+c)) = 1.67
Genotype    CC       CT or TT
Controls    a=120    (b+c)=60
Cases       d=120    (e+f)=100
‹#›
Recessive model for exposure
Exposure = TT genotype (vs. CC&CT) - 2×2 test w/1 df
ORrec = (a+b) × f / ((d+e) × c) = 1.78
Genotype    CC or CT      TT
Controls    (a+b)=160     c=20
Cases       (d+e)=180     f=40
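The genotype-based odds ratios from the last few slides can be computed together. A minimal sketch in Python, assuming the cell labels a–f from the 2×3 table (controls a, b, c; cases d, e, f); the function name is hypothetical.

```python
def genotype_odds_ratios(controls, cases):
    """Odds ratios for a biallelic SNP with genotype counts ordered (CC, CT, TT)."""
    a, b, c = controls
    d, e, f = cases
    return {
        "het (CT vs. CC)": (a * e) / (b * d),
        "hzv (TT vs. CC)": (a * f) / (c * d),
        "dominant (CT/TT vs. CC)": (a * (e + f)) / (d * (b + c)),
        "recessive (TT vs. CC/CT)": ((a + b) * f) / ((d + e) * c),
    }

ors = genotype_odds_ratios(controls=(120, 40, 20), cases=(120, 60, 40))
for model, value in ors.items():
    print(f"{model}: {value:.2f}")
```

All four estimates use the same six cells; only the grouping of genotypes into exposed and unexposed changes.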
‹#›
Armitage Trend Test (2×3 with 1 df)
Assess departures from a fitted trend
$A = \sum_i r_i x_i - \frac{R}{N} \sum_i n_i x_i$
$B = \frac{R(N-R)}{N^2(N-1)} \left[ N \sum_i n_i x_i^2 - \left( \sum_i n_i x_i \right)^2 \right]$
$\chi^2_{trend} = \frac{A^2}{B}$
            CC (x1=0)   CT (x2=1)   TT (x3=2)   Totals
Controls    a (r1)      b (r2)      c (r3)      R
Cases       d           e           f
Totals      n1          n2          n3          N
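A small sketch of the trend statistic, written directly from the A and B terms of the Armitage test with genotype scores x = (0, 1, 2). It is an illustration in Python (function name hypothetical), applied to the example genotype counts used throughout this deck, with R and N computed from the cells themselves.

```python
import math

def armitage_trend(r, n, x=(0, 1, 2)):
    """Armitage trend chi-square (1 df).

    r: counts in one row (here controls) per genotype; n: column totals.
    """
    R, N = sum(r), sum(n)
    sum_nx = sum(ni * xi for ni, xi in zip(n, x))
    sum_nx2 = sum(ni * xi ** 2 for ni, xi in zip(n, x))
    A = sum(ri * xi for ri, xi in zip(r, x)) - R / N * sum_nx
    B = R * (N - R) / (N ** 2 * (N - 1)) * (N * sum_nx2 - sum_nx ** 2)
    chi2 = A ** 2 / B
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value

controls = (120, 40, 20)
cases = (120, 60, 40)
column_totals = tuple(c + k for c, k in zip(controls, cases))
chi2_trend, p_trend = armitage_trend(controls, column_totals)
print(round(chi2_trend, 2), round(p_trend, 4))
```

The statistic is unchanged if the case row is used for r instead of the control row, because A only changes sign and B is symmetric in R and N − R.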
‹#›
Example – genotypic relative risk and trend test
Shephard et al. Cancer Res 2009
‹#›
Allelic Test
• Exposure = Allele (T vs. C)
• 2 x 2 table (1 df) for a single SNP
• Count every allele (2 per person)
– Doubles the sample size
ORallele = (2a+b) × (2f+e) / ((2c+b) × (2d+e))
Allele      C            T
Controls    2a+b=280     2c+b=80
Cases       2d+e=300     2f+e=140
OR = 1.633, T vs. C allele
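The allele counting step can be sketched as below (Python; the function name is hypothetical). Each individual contributes two alleles, so a CC genotype adds two C alleles, a CT genotype one of each, and a TT genotype two T alleles.

```python
def allelic_odds_ratio(controls, cases):
    """Allelic OR (T vs. C) from genotype counts ordered (CC, CT, TT)."""
    a, b, c = controls
    d, e, f = cases
    control_c, control_t = 2 * a + b, 2 * c + b
    case_c, case_t = 2 * d + e, 2 * f + e
    allele_table = {"controls": (control_c, control_t), "cases": (case_c, case_t)}
    return (control_c * case_t) / (control_t * case_c), allele_table

or_allele, allele_table = allelic_odds_ratio((120, 40, 20), (120, 60, 40))
print(allele_table, round(or_allele, 3))
```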
‹#›
Example – allelic association
Xue et al. Arch Oral Bio 2009
‹#›
More flexible techniques
• If other factors may have an effect on disease status
(affected/unaffected, case/control)
– We want to account for these as covariates
– We want to adjust for matching variables (age, sex, etc.)
• Logistic regression
– Logistic transformation (logit)
– $\ln(p/(1-p)) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots$
– Coefficients $\alpha$ and $\beta$’s are estimated using maximum likelihood
estimation (MLE)
– Test H0: $\beta = 0$ against H1: $\beta = \hat{\beta}$
using a likelihood ratio test (LRT)
• Must decide on how to model the genetic exposure
– genotype categories (i.e. CC, CT, TT), dominant, recessive, additive (allele dose)..
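To make the estimation step concrete, here is a minimal sketch of a one-covariate logistic regression with an additive (allele-dose) coding, fitted by Newton-Raphson on grouped genotype counts. It is an illustration only: the function names are hypothetical, no covariates are included, and in practice a statistics package would be used. The data are the example genotype counts from earlier slides.

```python
import math

def fit_additive_logistic(x, n, y, iters=25):
    """MLE for logit(p) = alpha + beta * x on grouped data.

    x: allele dose per group, n: individuals per group, y: cases per group.
    """
    alpha = beta = 0.0
    for _ in range(iters):
        p = [1 / (1 + math.exp(-(alpha + beta * xi))) for xi in x]
        # Score vector (first derivatives of the log-likelihood)
        g0 = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        g1 = sum(xi * (yi - ni * pi) for xi, yi, ni, pi in zip(x, y, n, p))
        # Information matrix entries
        w = [ni * pi * (1 - pi) for ni, pi in zip(n, p)]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

def log_likelihood(alpha, beta, x, n, y):
    total = 0.0
    for xi, ni, yi in zip(x, n, y):
        p = 1 / (1 + math.exp(-(alpha + beta * xi)))
        total += yi * math.log(p) + (ni - yi) * math.log(1 - p)
    return total

x = [0, 1, 2]          # number of T alleles: CC, CT, TT
n = [240, 100, 60]     # cases + controls per genotype
y = [120, 60, 40]      # cases per genotype
alpha, beta = fit_additive_logistic(x, n, y)
alpha_null = math.log(sum(y) / (sum(n) - sum(y)))  # intercept-only MLE
lrt = 2 * (log_likelihood(alpha, beta, x, n, y)
           - log_likelihood(alpha_null, 0.0, x, n, y))
per_allele_or = math.exp(beta)
print(round(per_allele_or, 2), round(lrt, 2))
```

Here `exp(beta)` is the per-allele odds ratio, and the LRT statistic is compared against a chi-square distribution with 1 df.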
‹#›
Example of logistic regression model with genetic exposure
and covariates
Slattery et al. IJC 2010
‹#›
Assumptions for Validity
• Independence of all individuals
– Independent and identically distributed (iid)
• Reasonable sample sizes
– Contingency tables
• Expected values all > 1 and 80% > 5
– Logistic regression
• Minimum of 15-20 individuals per group
• If violated
– Simulate the null distribution for testing
• Permutation test
– e.g. Fisher’s exact test is an exhaustive permutation test
• Monte Carlo simulation
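A permutation test of the kind listed above can be sketched as follows: genotypes are shuffled across individuals, which breaks any association with case-control status, and the chi-square statistic is recomputed each time. This is a minimal Python illustration with hypothetical function names; the number of permutations and the seed are arbitrary choices.

```python
import random

def chi2_stat(table):
    """Pearson chi-square with expected counts computed from the margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, rt in zip(table, row_totals):
        for obs, ct in zip(row, col_totals):
            exp = rt * ct / n
            stat += (obs - exp) ** 2 / exp
    return stat

def permutation_p_value(controls, cases, n_perm=2000, seed=1):
    rng = random.Random(seed)
    k = len(controls)
    # One genotype label per individual; controls first, then cases
    genotypes = [g for row in (controls, cases) for g in range(k) for _ in range(row[g])]
    n_controls = sum(controls)
    observed = chi2_stat([list(controls), list(cases)])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(genotypes)
        perm = [[0] * k, [0] * k]
        for i, g in enumerate(genotypes):
            perm[0 if i < n_controls else 1][g] += 1
        if chi2_stat(perm) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

p_perm = permutation_p_value((120, 40, 20), (120, 60, 40))
print(p_perm)
```

The `+1` in numerator and denominator counts the observed table as one of the permutations, a common convention that prevents an empirical p-value of exactly zero.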
‹#›
Performing haplotype analyses
• Single locus
– We observe genotypes, so testing is
straight-forward counting into a
contingency table
            CC    CT    TT
Controls    a     b     c
Cases       d     e     f
‹#›
Performing haplotype analyses
• Multi-locus
– Haplotypes are not directly observed
– But can be estimated (EM/Bayesian…)
– For some individuals, their haplotype pair can
be inferred unambiguously
– For many individuals they can not
• “Phase uncertainty”
– All analyses of haplotypes must take into
account the phase uncertainty in the data
• Otherwise, increase in type 1 errors
‹#›
Haplotypes / Genotypes
[Figure: two chromosomes, both carrying allele C at the first locus and allele G at the second locus]
Two-locus Haplotypes:
The haplotype pair must be:
C-G and C-G
UNAMBIGUOUS
‹#›
Estimating haplotypes
Genotypes
Locus 1    Locus 2    Haplotypes
CC         GG         C-G & C-G
CC         GA         C-G & C-A
CC         AA         C-A & C-A
CT         GG         C-G & T-G
CT         GA         ?(C-G & T-A) or (C-A & T-G)?
CT         AA         C-A & T-A
TT         GG         T-G & T-G
TT         GA         T-G & T-A
TT         AA         T-A & T-A
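The table can be reproduced by enumerating, for each locus, which allele sits on which chromosome. A minimal Python sketch (function name hypothetical) that returns every haplotype pair consistent with a set of unphased genotypes:

```python
from itertools import product

def compatible_haplotype_pairs(genotypes):
    """genotypes: one 2-character unphased genotype per locus, e.g. ["CT", "GA"]."""
    pairs = set()
    # For each locus, choose which of the two alleles goes on the first chromosome
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "-".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "-".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)

print(compatible_haplotype_pairs(["CC", "GA"]))  # one pair: unambiguous
print(compatible_haplotype_pairs(["CT", "GA"]))  # two pairs: phase is ambiguous
```

Only individuals heterozygous at two or more loci yield more than one compatible pair, which is why the double heterozygote is the single ambiguous row in the two-locus case.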
‹#›
Estimating haplotypes
• Expectation-maximization (EM) algorithm
– SNPHAP (Johnson et al 2001)
– GCHap (Thomas 2003)
• Bayesian MCMC approach
– PHASE (Stephens et al 2001)
• Both approaches assume independent individuals
• Use to estimate
– Population haplotype frequencies estimated from a set
of individuals
– Most likely haplotype pair for each individual
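To illustrate the EM idea for the two-locus case, here is a simplified sketch. It is not the algorithm of any of the packages named above: it assumes two biallelic SNPs (C/T and G/A), unrelated individuals, and Hardy-Weinberg proportions, and the function name and example counts are hypothetical. Only double heterozygotes are ambiguous, so the E-step splits them between the two possible phases in proportion to the current haplotype frequencies.

```python
def em_two_locus(genotype_counts, iters=50):
    """Estimate haplotype frequencies from unphased two-locus genotype counts.

    genotype_counts: {(g1, g2): count}, g1 in {"CC","CT","TT"}, g2 in {"GG","GA","AA"}.
    """
    haps = ["C-G", "C-A", "T-G", "T-A"]
    freq = {h: 0.25 for h in haps}
    for _ in range(iters):
        counts = {h: 0.0 for h in haps}
        for (g1, g2), n in genotype_counts.items():
            if g1 == "CT" and g2 == "GA":
                # E-step: weight the two phases by current frequencies
                phase1 = freq["C-G"] * freq["T-A"]
                phase2 = freq["C-A"] * freq["T-G"]
                w = phase1 / (phase1 + phase2) if phase1 + phase2 > 0 else 0.5
                counts["C-G"] += n * w
                counts["T-A"] += n * w
                counts["C-A"] += n * (1 - w)
                counts["T-G"] += n * (1 - w)
            else:
                # Unambiguous: at least one locus is homozygous
                counts[f"{g1[0]}-{g2[0]}"] += n
                counts[f"{g1[1]}-{g2[1]}"] += n
        # M-step: update frequencies from expected haplotype counts
        total = sum(counts.values())
        freq = {h: c / total for h, c in counts.items()}
    return freq

# Hypothetical sample in which only C-G and T-A are seen unambiguously
freq = em_two_locus({("CC", "GG"): 40, ("CT", "GA"): 20, ("TT", "AA"): 40})
print({h: round(f, 3) for h, f in freq.items()})
```

In this toy sample the double heterozygotes are resolved almost entirely as C-G & T-A, because the alternative haplotypes are never observed in unambiguous individuals.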
‹#›
Traditional methods for phase uncertainty
• Likelihood based approach
– Each individual can have multiple
different haplotype pairs that are
consistent with the genotype data
• Some pairs of haplotypes are more or less
likely than others
• Each pair is given a weight
• All possible haplotype pairs are considered
in the case-control analysis
– weighted by their probabilities
‹#›
Slattery et al. CEBP 2007
Simulation methods for phase uncertainty
• Sample over the observed data
• Instead of weighting all the possible
haplotype pairs for every individual and
incorporating all at once into the analysis
– Sample one pair for each individual
• Randomly and in proportion to the weights, select a
haplotype pair for each individual
• Perform the analysis as if those were observed
• Repeat 1,000 times…
• Average
• SIMHAP (McCaskie et al.)
‹#›
Simulation methods for phase uncertainty
• Monte Carlo testing
– Simulate the null –matched to the real data
• Instead of weighting all the possible haplotype
pairs for every individual and incorporating all at
once into the analysis
– Assign each individual their most likely haplotype pair
• Cases and controls separately
– Simulate null haplotype data
• Null: Convert haplotypes to genotypes
• Null: Estimate haplotypes
• Null: Assign each individual their most likely haplotype pair
– Real and null are matched
– Test real data (with most likely haplotype pairs
assigned) against the simulated null
• hapMC (Thomas et al.)
‹#›
Exponential explosion…
high dimensional data
• 1 SNP
– 2 alleles: 1 test
– 3 genotypes: 1+ tests
• 2 SNP loci
– 4 haplotypes
• 3 SNP loci
– 8 haplotypes
• 10 SNP loci
– 1024 haplotypes: many tests..
‹#›
Multi-locus… but how many, and which loci to test?
• For example…20 tSNPs
– Only perform single SNP analyses?
– Perform tests on all 20-locus haplotypes?
• Group all ‘rare’ haplotypes together
• Cluster to reduce dimension
– Multi-locus tests with subsets of 20 SNPs?
• Subsets of which SNPs?
‹#›
Data mining approach to
haplotype construction – hapConstructor (Abo et al.)
• Automatically builds haplotypes (or composite genotypes)
– Non-contiguous SNPs
– In a case-control framework
– All SNP haplotypes are phased during 1st stage and used in all
subset analyses
– Starts with each single SNP locus
• Forward-backward process driven by significance thresholds
• Significance and false discovery rates (p-values and q-values)
reported for the building process
• Computationally challenging, potentially time intensive
‹#›
Multilocus model building example using
hapConstructor
16 SNPs
Curtin et al. BMC Med Genet 2010
Multilocus haplotype association
using hapConstructor
Curtin et al. BMC Med Genet 2010
‹#›
Meta-association in case-control studies
• Association: two characteristics occur more often together than
expected by chance
– Disease
– Genetic variants
• Meta-Association: study of association across case-control data
collected by multiple study sites (collaborative effort)
– NARAC: North American Rheumatoid Arthritis Consortium
– BCAC: Breast Cancer Association Consortium
VS. “Meta-analysis of individual level data from
participants in a systematically ascertained
group of studies” (Petitti definition)
‹#›
Meta-analysis of multi-study case-control data:
general concepts
• simple pooling – combine individual level data
from multiple studies and compute association
statistics
• fixed effects models – inference is conditional
on the studies actually done
– in genetic association, assumes same genetic effect size
across studies
• random effects models – inference is based
on assuming studies in the analysis are a
‘random sample’ of hypothetical population of
studies
‹#›
Fixed effects models
• Methods and effect measures
– Mantel-Haenszel: Odds ratio; also rate, risk ratio
• well-known method for calculating summary estimate of
effect across strata (i.e. multiple studies)
– Peto: Ratio (can approximate odds ratio)
• modification of M-H method
– General variance-based: Ratio (all types) and rate differences
‹#›
Mantel-Haenszel method (fixed effects)
                Exposed    Not Exposed    Total
Diseased        ai         bi             ai + bi
Not Diseased    ci         di             ci + di
Total           ai + ci    bi + di        ni
where i is the ith stratum (study)
‹#›
Mantel-Haenszel method (fixed effects)
summary odds ratio
$OR_{mh} = \frac{\sum_i (weight_i \times OR_i)}{\sum_i weight_i}$
where:
$OR_i = \frac{a_i \times d_i}{b_i \times c_i}$
$weight_i = 1 / variance_i$ (variance component of effect size within studies only)
$variance_i = \frac{n_i}{b_i \times c_i}$
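A short sketch of the summary odds ratio as defined above, with each study supplied as its four cells (a, b, c, d). The function name and the three study tables are hypothetical and serve only to show the weighting.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary OR from a list of (a, b, c, d) study tables."""
    weighted_sum = weight_total = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        weight = (b * c) / n            # 1 / variance_i, with variance_i = n_i / (b_i * c_i)
        study_or = (a * d) / (b * c)
        weighted_sum += weight * study_or
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical studies with odds ratios of 1.67, 1.67 and 1.50
studies = [(100, 120, 60, 120), (50, 60, 30, 60), (30, 50, 20, 50)]
or_mh = mantel_haenszel_or(studies)
print(round(or_mh, 3))
```

Because weight_i × OR_i simplifies to a_i d_i / n_i, the estimate is equivalent to the familiar form Σ(a_i d_i / n_i) / Σ(b_i c_i / n_i).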
‹#›
Mantel-Haenszel method (fixed effects)
summary odds ratio
• Strengths
– Optimal statistical properties (uniformly most powerful test)
– When the M-H estimate of OR = 1, the M-H Chi-square = 0
(mathematical connection of effect with summary statistic)
– Widely available in statistical software
• Limitations
– Requires data to complete 2x2 table for all studies (potential
exclusion bias)
– ignores confounding not taken into account by study design
(i.e. age, sex-matched controls)
• could use logistic regression estimate of OR to simultaneously
model confounding variables and to adjust for study site
‹#›
CMH chi-square general association test of
independence (fixed-effect method)
• Extension of Cochran-Mantel-Haenszel (CMH) test to
sets of (X by Y) contingency tables (i.e. studies)
• Formulas for the CMH statistics are more easily
defined in terms of matrices (Landis and Koch 1978)
• Assumes study strata are independent, and that the
marginal totals of each stratum are fixed
– H0 : there is no association between X (disease status) and
Y (genotype) in any of the strata
– corresponding model is the multiple hypergeometric
‹#›
Heterogeneity
• If Ho: homogeneity is rejected, studies are not
measuring effect of the same size
• Tests of Heterogeneity
– Q test ~Chisq. with d.f.= #studies – 1
• Mantel-Haenszel method:
$Q = \sum_i \left[ weight_i \times (\ln OR_{mh} - \ln OR_i)^2 \right]$
– Logistic regression: add a term for interaction between study and
genotypes in model (test using Wald or Likelihood Ratio)
• When heterogeneity is not extreme, fixed- and
random- effects models yield similar results
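The Q statistic can be sketched as below. One assumption is made loudly here: the weights are taken to be the inverse of the usual large-sample variance of ln(OR_i), that is 1 / (1/a + 1/b + 1/c + 1/d); the slides do not spell out which variance is intended at this step. The function name and study tables are hypothetical.

```python
import math

def heterogeneity_q(tables):
    """Q statistic and its degrees of freedom for a list of (a, b, c, d) tables."""
    # Mantel-Haenszel summary odds ratio
    or_mh = (sum(a * d / (a + b + c + d) for a, b, c, d in tables)
             / sum(b * c / (a + b + c + d) for a, b, c, d in tables))
    q = 0.0
    for a, b, c, d in tables:
        weight = 1 / (1 / a + 1 / b + 1 / c + 1 / d)  # assumed: inverse variance of ln(OR_i)
        q += weight * (math.log(or_mh) - math.log((a * d) / (b * c))) ** 2
    return q, len(tables) - 1

# Hypothetical studies with similar odds ratios
q, df = heterogeneity_q([(100, 120, 60, 120), (50, 60, 30, 60), (30, 50, 20, 50)])
print(round(q, 3), df)
```

A small Q relative to a chi-square with (#studies − 1) df gives no evidence against homogeneity, as in this toy example.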
‹#›
Random effects models
• Methods and effect measures
– DerSimonian-Laird (1986): Ratio (all types)
and difference
– Bagos and Nikolopoulos (2007): Odds ratio
• study-specific coefficient in logistic regression model
representing deviation of study i’s true genotypic
effect to overall mean effect
• incorporates between-study component of
variance, CI’s at least as wide (wider) than fixed
effects
‹#›
Fixed- vs. Random- Assumptions
• analysis under fixed model addresses the
question:
Was there a genotype-phenotype
association in the consortium of case-control studies used in the meta-analysis?
• under the random model, question:
Will there be a genotype-phenotype
association “on average?”
‹#›
Independent individuals
• If study cases and controls are
independent (unrelated) individuals,
meta-association is
straightforward...
‹#›
Straightforward...
• Adjust for ‘study site’ in a logistic
regression
• Use Cochran Mantel Haenszel (CMH)
techniques, controlling for study
– CMH test of association
– CMH test of trend
– meta odds ratio estimate
‹#›
Cox et al, Nature Genetics (2007)
• Test of Ho: no association included terms for genotype
and BCAC study
• Trend test included 1 parameter for allele dose and a
term for BCAC study
• Genotype-specific risks estimated as ORs using logistic
regression with BCAC study as a covariate (fixed-effects)
• Tested heterogeneity between studies by comparing
logistic regression models with and without a genotype x
study interaction term
• Data also analyzed using a random-effects model, test for
heterogeneity
‹#›
Cox et al, Nature Genetics (2007)
Table 1. Summary odds ratios and 95% confidence intervals for nine polymorphisms and breast cancer risk
SNP                             No. of studies   No. of ctrls   No. of cases   MAF    Between-study heterogeneity   Test for assoc.   Trend test
ADH1B 3' UTR A>G (rs1042026)    9                15,570         11,391         0.29   0.35                          0.044             0.54
CASP8 D302H (rs1045485)         14               17,109         16,423         0.13   0.97                          5.7 × 10^-7       1.1 × 10^-7
CDKN1A S31R (rs1801270)         15               22,670         18,290         0.07   0.009                         0.55              0.28
ICAM5 V301I (rs1056538)         15               22,229         17,687         0.39   0.58                          0.58              0.78
IGFBP3 202C>A (rs2854744)       10               17,926         13,101         0.45   0.72                          0.051             0.046

SNP                             Analysis model    Heterozygote OR (95% c.i.)   Rare homozygote OR (95% c.i.)
ADH1B 3' UTR A>G (rs1042026)    Fixed effects     0.94 (0.89, 1.00)            1.04 (0.95, 1.14)
                                Random effects    0.99 (0.90, 1.10)            1.04 (0.95, 1.14)
CASP8 D302H (rs1045485)         Fixed effects     0.89 (0.85, 0.94)            0.74 (0.62, 0.87)
                                Random effects    0.89 (0.85, 0.94)            0.73 (0.60, 0.90)
CDKN1A S31R (rs1801270)         Fixed effects     1.03 (0.97, 1.10)            1.07 (0.86, 1.33)
                                Random effects    1.04 (0.93, 1.09)            1.20 (0.82, 1.76)
ICAM5 V301I (rs1056538)         Fixed effects     1.02 (0.98, 1.07)            1.00 (0.94, 1.06)
                                Random effects    1.02 (0.97, 1.08)            0.99 (0.93, 1.06)
IGFBP3 202C>A (rs2854744)       Fixed effects     1.00 (0.94, 1.05)            0.93 (0.87, 0.99)
                                Random effects    1.00 (0.94, 1.05)            0.92 (0.86, 0.99)
‹#›
Meta Association – Related individuals
• But what if some study individuals (cases
or controls) are related in multi-study
collaborations? ..sibships, trios, pedigrees, or mixed, in families
Meta-analysis of data from
multiple sites is more difficult..
Genie to the rescue..
‹#›
Genie overview
• Allen-Brady et al. (2006), Curtin et al. (2007)
• Simulation-based technique
– Monte Carlo approach
– Null distribution is simulated for the statistic of interest matching
the pedigree structure
• Equivalent to an empirical version of the variance correction
method with prior probabilities
• Flexible in type of statistic that can be analyzed
– Classical association statistics and effect measure (OR)
– Meta association statistics (fixed-effects approach)
• Dichotomous and quantitative traits
http://www-genepi.med.utah.edu/Genie/index.html
‹#›
Genie: Empirical null
• Generate the empirical null
• Using appropriate allele frequencies perform a gene-drop
through the pedigree
– Null genotypic configuration
• Calculate the statistic of interest using the null data
ignoring relatedness
– Null statistic
• Repeat thousands of times
– Empirical estimate of the null distribution
• Assess the significance of the observed statistic by
assessing where it lies in the null distribution
‹#›
Creating the Simulated Null Distribution
Population allele frequencies
Assign alleles randomly to pedigree founders
Gene drop: simulated Mendelian inheritance
Repeat
Null Genotype Configuration
Calculate NULL statistic
Empirical Null Distribution
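The gene-drop loop in the flow above can be sketched in a few lines. This is a simplified illustration, not the Genie implementation: it assumes a single biallelic marker (alleles C and T), a pedigree listed with parents before children, and a toy statistic; all names and the example family are hypothetical.

```python
import random

def gene_drop(pedigree, freq_t, rng):
    """Simulate one null genotype configuration.

    pedigree: (person, father, mother) tuples, parents listed before children;
    founders have father = mother = None.
    """
    genotypes = {}
    for person, father, mother in pedigree:
        if father is None:
            # Founder: draw two alleles from the population allele frequencies
            genotypes[person] = tuple("T" if rng.random() < freq_t else "C" for _ in range(2))
        else:
            # Non-founder: one randomly chosen allele from each parent
            genotypes[person] = (rng.choice(genotypes[father]), rng.choice(genotypes[mother]))
    return genotypes

def empirical_p_value(observed, statistic, pedigree, freq_t, n_sims=1000, seed=1):
    """Proportion of null statistics at least as large as the observed one."""
    rng = random.Random(seed)
    null_stats = [statistic(gene_drop(pedigree, freq_t, rng)) for _ in range(n_sims)]
    return sum(s >= observed for s in null_stats) / n_sims

# Hypothetical nuclear family: two founders and two affected offspring
pedigree = [("dad", None, None), ("mom", None, None),
            ("kid1", "dad", "mom"), ("kid2", "dad", "mom")]
affected = ("kid1", "kid2")

def t_alleles_in_affected(genotypes):
    return sum(genotypes[person].count("T") for person in affected)

p_emp = empirical_p_value(observed=4, statistic=t_alleles_in_affected,
                          pedigree=pedigree, freq_t=0.2)
print(p_emp)
```

Because the null configurations are generated on the actual pedigree structure, the statistic itself can be computed as if individuals were independent; the relatedness is accounted for in the null distribution.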
‹#›
Meta Statistics PedGenie Can Perform
Dichotomous data statistic → Meta statistic
Chi-square test (genotype- or allele-based) → CMH Chi-Square
Odds ratio test (genotype or allele data) → Meta Odds Ratio (95% CI determined from the empirical distribution)
Chi-Square trend test (genotype- or allele-based) → CMH Chi-Sq Trend
‹#›
Genie Meta-association
• Fixed effects approach – assumes same genetic
effect size across studies
• Generalized CMH approach – chi-square general
association test of independence
extension to >2x2 tables across multiple studies
• CMH chi-square test of trend – mean score
statistic where ordered genotypes (i.e. genotypes
aa, aA, and AA) lie on an ordinal scale
• Meta ORs – M-H common odds ratio estimate for
2x2 tables (CT vs CC, TT vs CC)
– 95% CI estimated empirically
‹#›
Empirical 95% Confidence Interval
Distribution of OR estimates from 1,000 configurations in PedGenie null
Why Genie Meta-association?
• Ability to combine family-based and
independent case-control resources and use
all available data
– Genie software corrects for relationships in
family-based resources; all family members with
phenotype and genotype data can be included
– increases the utility of pedigrees previously
ascertained for linkage and can provide increased
power to detect associations..
..particularly in stratified and subset analyses that
may lead to small sample sizes in individual
studies
– needs a logistic regression framework (underway)
‹#›
Association of XRCC2 tag-SNPs with CRC in 4-study meta analysis (Curtin et al. CEBP 2009)
XRCC2 SNP            Genotype        Controls   Cases    Meta OR   95% CI (Lower, Upper)   p*
rs3218536 (R188H)    GG              1,167      1,014    1         Ref
                     GA              204        185      1.0       (0.8, 1.3)
                     AA              9          10       1.3       (0.4, 3.8)              0.87
rs3218501            CC              1,296      1,117    1         Ref
                     CG/GG           100        84       1.0       (0.7, 1.4)              0.93
rs3218499            GG              823        712      1         Ref
                     GC              504        414      1.0       (0.8, 1.1)
                     CC              65         84       1.6       (1.1, 2.2)              0.23
                     CC vs. GC/GG                        1.6       (1.1, 2.2)              0.009
rs3218472            CC              1,355      1,190    1         Ref
                     CT/TT           6          2        0.4       (0.0, 2.7)              0.22
rs3218454            AA              1,140      999      1         Ref
                     AT              206        167      0.9       (0.7, 1.2)
                     TT              20         23       1.3       (0.5, 3.4)              0.89
rs3218418            GG              1,245      1,103    1         Ref
                     GA/AA           139        107      0.8       (0.6, 1.1)              0.23
rs3218402            AA              1,302      1,130    1         Ref
                     AG/GG           70         75       1.2       (0.8, 1.7)              0.35
rs3218400            CC              1,068      974      1         Ref
                     CA              275        210      0.8       (0.7, 1.0)
                     AA              11         16       1.6       (0.7, 3.7)              0.36
rs3218395            CC              1,255      1,081    1         Ref
                     CT/TT           137        129      1.1       (0.8, 1.4)              0.51
rs3218385            TT              1,226      1,052    1         Ref
                     TG/GG           144        138      1.1       (0.8, 1.4)              0.57
rs3218374            CC              411        370      1         Ref
                     CG              677        559      0.9       (0.8, 1.1)
                     GG              299        269      1.0       (0.8, 1.3)              0.94
rs3218373            GG              1,119      983      1         Ref
                     GT              232        193      0.9       (0.8, 1.2)
                     TT              15         16       1.2       (0.5, 2.9)              0.79
*Empirical Cochran-Mantel-Haenszel χ2 test for trend or recessive model based on 10,000 simulations.
Association of XRCC2 rs3218499G>C with CRC in
4-study meta analysis
CC vs. GC/GG          Meta OR   95% CI (Lower, Upper)   p*
Overall               1.6       (1.1, 2.2)              0.009
Tumor site
  Proximal colon      1.2       (0.8, 2.0)              0.44
  Distal colon        1.1       (0.7, 1.8)              0.63
  Rectal              2.1       (1.3, 3.2)              0.0006
  Female Rectal       3.1       (1.6, 6.1)              0.0006
Men                   1.3       (0.8, 2.1)              0.24
Women                 2.0       (1.2, 3.4)              0.01
Age onset <60         1.4       (0.6, 3.1)              0.38
Age onset ≥60         1.6       (1.1, 2.3)              0.03
Family history
  No                  1.7       (1.1, 2.6)              0.02
  Yes                 1.8       (0.6, 5.7)              0.17
*Empirical Cochran-Mantel-Haenszel χ2 test for recessive model based on 10,000 simulations.
‹#›
Genomewide (case-control) Association GWA:
an approach to the study of common diseases
• Complex architecture
– Multiple genes likely involved
– Multiple environmental factors
– Individually low risks
• Argument that the underlying variants may be
common and of modest effect..
– Common variants (>0.05, >0.01)
– Not under intense negative selection
• Agnostic.. no hypothesis
– Hypothesis generating vs. hypothesis driven
(candidate gene or pathway)
‹#›
GWA: What is required?
• Large set of SNPs
• Stringent significance thresholds
– ~5 × 10^-8
• Large case-control sample size
– Example
• Allele frequency 0.15
• OR=1.25
• 80% power
• 6,000 cases and 6,000 controls
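The example can be checked approximately with a normal-approximation power calculation for an allelic test. This is a rough sketch under stated assumptions, not the calculation behind the slide: it treats the OR as a per-allele odds ratio, uses the control allele frequency of 0.15, a two-sided threshold of 5 × 10^-8, and a two-sample comparison of allele frequencies. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def allelic_test_power(maf, odds_ratio, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allelic test (normal approximation)."""
    p_controls = maf
    # Allele frequency in cases implied by the per-allele odds ratio
    p_cases = odds_ratio * maf / (1 - maf + odds_ratio * maf)
    # Each individual contributes two alleles
    se = math.sqrt(p_controls * (1 - p_controls) / (2 * n_controls)
                   + p_cases * (1 - p_cases) / (2 * n_cases))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_cases - p_controls) / se - z_alpha)

power = allelic_test_power(maf=0.15, odds_ratio=1.25, n_cases=6000, n_controls=6000)
print(round(power, 2))
```

Under these assumptions the result lands close to the 80% quoted in the example; halving the sample size drops the power sharply, which is the practical point of the slide.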
‹#›
Large set of SNPs
• Linkage-disequilibrium (LD)-based
– Genomewide tag-SNP set
– Made possible by HAPMAP
– 500,000-1,000,000 SNPs
– High-density arrays with 2 million SNPs
• Not optimal for rare variants…
– tag-SNP methods ignore them
‹#›
Stringent significance thresholds
• Very few ‘hits’ per study
– 1, 3, 4, 5 significant hits per genome using GWA
– If we don’t correct and use a nominal 0.05
• In 500,000 markers
• Can expect 25,000 false positives
– Need to use a correction for multiple testing
• significance threshold of ~5 × 10^-8 (Dudbridge & Koeleman ASHG 2004)
• Good… but not great
– But we’re expecting many more genes to be found…
right?
• Less stringency and instead use replication?
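The numbers on this slide follow from simple arithmetic, sketched here. The Bonferroni-style division by one million tests is an assumption used for illustration (the cited threshold was derived by its authors, not necessarily this way), and the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every null hypothesis is true."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate near family_alpha."""
    return family_alpha / n_tests

print(expected_false_positives(500_000, 0.05))   # nominal 0.05 across 500,000 markers
print(bonferroni_threshold(1_000_000))           # assuming about one million tests
```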
‹#›
Multistage strategies in GWA
Hirschhorn & Daly Nature Reviews 2005
‹#›
Interactions
• An increased (or decreased) effect of one
exposure given another.
• Gene-environment interaction
– Risk (genotype AA / no smoke) = 4
– Risk (genotype AA / smoke)= 6
• Gene-gene interaction
– Epistasis
– Risk (genotype AA / genotype bb) = 4
– Risk (genotype AA / genotype Bb,BB) = 6
‹#›
Statistical Interactions
• Multiplicative model
– Most commonly used
– Natural to a risk framework
• Logistic regression
– Independent loci
• multiply risk OR11=OR10×OR01
– Interaction
• OR11≠OR10×OR01
‹#›
Multiplicative model
• Multiplicative risk for alleles at each locus
• First locus
– aa 1.00
– aA 2.20
– AA 4.84 (= 2.20²)
• Second locus
– bb 1.00
– bB 1.50
– BB 2.25 (= 1.50²)
‹#›
Risk: Two Independent Loci (multiplicative)
OR11 = OR10 × OR01
        aa      aA      AA
bb      1.00    2.20    4.84
bB      1.50    3.30    7.26
BB      2.25    4.95    10.89
‹#›
Statistical Interactions
• Additive model
– Less popular
– Independent loci
• Add risks
• OR11= 1 + (OR10-1) + (OR01-1)
– Interaction
• OR11≠ 1+ (OR10-1) + (OR01-1)
‹#›
Additive model
• Additive risk for alleles for each single locus
• First locus
– aa 1.00
– aA 2.20
– AA 3.40 (= 2 × 2.20 − 1)
• Second locus
– bb 1.00
– bB 1.50
– BB 2.00 (= 2 × 1.50 − 1)
‹#›
Risk: Two Independent Loci (additive)
OR11 = 1 + (OR10 − 1) + (OR01 − 1)
        aa      aA      AA
bb      1.00    2.20    3.40
bB      1.50    2.70    3.90
BB      2.00    3.20    4.40
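Both two-locus risk tables (multiplicative and additive) can be generated from the single-locus risks. A minimal Python sketch, with a hypothetical function name, that combines the marginal risks under either model:

```python
def two_locus_risks(locus_a, locus_b, model):
    """Joint risks for independent loci; rows follow locus_b, columns follow locus_a."""
    table = []
    for risk_b in locus_b:
        row = []
        for risk_a in locus_a:
            if model == "multiplicative":
                row.append(risk_a * risk_b)
            elif model == "additive":
                row.append(1 + (risk_a - 1) + (risk_b - 1))
            else:
                raise ValueError("model must be 'multiplicative' or 'additive'")
        table.append(row)
    return table

multiplicative = two_locus_risks([1.0, 2.2, 4.84], [1.0, 1.5, 2.25], "multiplicative")
additive = two_locus_risks([1.0, 2.2, 3.4], [1.0, 1.5, 2.0], "additive")
for row in multiplicative:
    print([round(v, 2) for v in row])
for row in additive:
    print([round(v, 2) for v in row])
```

The two models agree whenever one locus carries baseline risk and diverge most in the double-homozygote cell, which is where a test of interaction has the most information.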
‹#›
No main effects???
• No main effects
• Only interaction effects
• Problem:
– In a stepwise procedure, if we aren’t able to
identify the main effects, then how do we
know to test the interaction??
• HOWEVER… Thus far, no biological model
has been put forth that supports the lack of
main effects
‹#›
Case-control design: ORs
• Testing in the Odds Ratio framework
• H0: OR11=OR10×OR01
• H0: IOR11=1.0
‹#›
Risk: Multiplicative
        aa      aA      AA
bb      1.00    OR01    OR02
bB      OR10    OR11    OR12
BB      OR20    OR21    OR22
‹#›
Case-control design: ORs
• $IOR_{11} = \frac{OR_{11}}{OR_{10} \times OR_{01}}$
• Under the null, IOR11 = 1
• Can do several IORs
– 11, 12, 21 and 22
• Can construct confidence intervals to test
for a significant interaction
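As a small worked example, the interaction odds ratio can be evaluated on the two illustrative risk tables from the preceding slides (function name hypothetical). Under the multiplicative table the IOR equals 1; the additive table, viewed on the multiplicative scale, gives an IOR below 1.

```python
def interaction_odds_ratio(or11, or10, or01):
    """IOR11 = OR11 / (OR10 x OR01); equals 1 when there is no multiplicative interaction."""
    return or11 / (or10 * or01)

ior_multiplicative = interaction_odds_ratio(or11=3.30, or10=1.50, or01=2.20)
ior_additive = interaction_odds_ratio(or11=2.70, or10=1.50, or01=2.20)
print(round(ior_multiplicative, 3), round(ior_additive, 3))
```

This illustrates that "interaction" is scale dependent: data that are exactly additive show a departure from the multiplicative null.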
‹#›
Case-control design: logistic regression
$\text{logit}\, P(Y=1 \mid G_1, G_2) = \alpha + \beta_1 G_1 + \beta_2 G_2 + \beta_3 G_1 \times G_2$
• Parameter $\beta_3$ is an estimator for ln(IOR) under a
multiplicative model
• G1 and G2 can be modeled several ways
– Dominant
– Recessive
– Additive
– 3 levels
‹#›
Methods: MDR
• Multifactor-Dimensionality Reduction
(MDR)
– Ritchie et al (2001) Am J Hum Genet
– Combinatorial partitioning
– Data mining
– http://www.epistasis.org/software.html
‹#›
MDR
• Divide sample into 10 equal partitions
– Model on 9/10 (1…9)
– Test on 1/10 of data (10)
– Repeat 10 times and average the
misclassification
• Pick n loci from the total N SNPs
– Exhaustively assess all combinations
– All cells cases>controls (high-risk)
– All cells cases<controls (low-risk)
– Group
• Repeat for all possible n of N
• May be too many… doesn’t scale well
‹#›
Machine learning
• Machine learning
– Classification trees (e.g. CART)
• Greedy algorithms
• Not optimal
• Cook et al (2004) Stat Med
– Artificial Neural Networks (ANNs)
• GPNN software
• Motsinger et al (2006) BMC Bioinformatics
– Support Vector Machine Approach
• Combinatorial optimization techniques
– Local search
– Genetic algorithms
• Weng et al (2007) Genet Epidemiol
‹#›
Other approaches
• Logistic regression framework
– tagSNPs and powerful models for
epistasis
– Chapman and Clayton (2007) Genet
Epidemiol
• Case-control
– Haplotype interactions
– FAMHAP
– Becker et al (2005) Genet Epidemiol
‹#›
Thanks for your time..
..questions ?
‹#›
Additional slides (if needed)
‹#›
Quantitative traits
• Simple comparisons
– 2 groups (e.g. alleles, dominant)
• Normal test (large sample sizes)
• T-test (small sample sizes)
• Mann-Whitney (non-parametric)
– >2 groups (e.g. genotypes)
• ANOVA (F-test)
• Kruskal-Wallis (non-parametric)
• Including Covariates
– Linear regression: $y = \alpha + \beta x$
– Again, need to model genetic exposure
‹#›
Family-Based Methods
• Parent-Offspring Trios
– Haplotype Relative Risk (HRR)
– Transmission/Disequilibrium Test (TDT)
– Quantitative TDT (QTDT)
– Generalized Estimating Equations (GEE)
• Nuclear Families
– Sibling TDT (STDT)
– FBAT
– QTDT
– GEEs
‹#›
Family-Based Methods
• General Pedigrees (small to moderate size)
– PDT
– FBAT
– QTDT
– Variance correction (posterior probability)
– CCREL
• Extended Pedigrees
– Variance correction (prior probability)
– Quasi-Likelihood Score (QLS)
– PedGenie
‹#›
Transmission/Disequilibrium Test (TDT)
• Transmission method
• Spielman et al (1993)
• Trio method
– Requires genotype data on all three individuals
• The statistic considers only {parent, affected offspring} pairs from the trio for which the parent
is heterozygous
– Compare the number of times each of the different
alleles is transmitted to the affected offspring
– Is there evidence for preferential transmission of one
allele over the other?
‹#›
TDT: Validity
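• H0: $\delta(1 - 2\theta) = 0$, where $\delta$ is the linkage disequilibrium coefficient and $\theta$ is the recombination fraction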
• A test for both association and
linkage
• Robust to stratification
‹#›
TDT
Parents: CT and CC
One heterozygous parent
Transmits T to offspring
Offspring: CT
‹#›
TDT
Parents: CT and CC
One heterozygous parent
Transmits C to offspring
Offspring: CC
‹#›
TDT
Parents: CT and CT
Two heterozygous parents
One parent:
Transmits C to offspring
Other parent:
Transmits T to offspring
Offspring: CT
‹#›
TDT
CC
CC
No heterozygous parents
No data to record
CT
‹#›
TDT: Tabulation
                      Allele NOT transmitted
Allele transmitted    C      T
C                     a      b
T                     c      d
$\chi^2_{tdt} = \frac{(b - c)^2}{b + c} \sim \chi^2_1$
‹#›