Single Nucleotide Polymorphism
And Association Studies
Stat 115/215
International HapMap Project
• The International HapMap Project is a
recent, large-scale effort to facilitate
genome-wide association studies (GWAS):
– Phase 1: 269 samples, 1.1 M SNPs
– Phase 2: 270 samples, 3.9 M SNPs
– Phase 3: 1115 samples, 1.6 M SNPs
• Phase 3 platforms:
– Illumina Human1M (by Wellcome Trust
Sanger Institute)
– Affymetrix SNP 6.0 (by Broad Institute)
Phase 1 & 2
• 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)
• 90 individuals (30 trios) of European
descent from Utah (CEU)
• 45 Han Chinese individuals from Beijing
(CHB)
• 45 Japanese individuals from Tokyo (JPT)
Phase III samples
• Population descriptors:
– ASW (A): African ancestry in Southwest USA
– CEU (C): Utah residents with Northern and Western
European ancestry from the CEPH collection
– CHB (H): Han Chinese in Beijing, China
– CHD (D): Chinese in Metropolitan Denver, Colorado
– GIH (G): Gujarati Indians in Houston, Texas
– JPT (J): Japanese in Tokyo, Japan
– LWK (L): Luhya in Webuye, Kenya
– MEX (M): Mexican ancestry in Los Angeles, California
– MKK (K): Maasai in Kinyawa, Kenya
– TSI (T): Toscans in Italy
– YRI (Y): Yoruba in Ibadan, Nigeria
[Figure: STRUCTURE analysis of the Phase III populations using 4242 independent SNPs; panels labeled ASW, CEU, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI, YRI]
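PCA plot
[Figure: PCA plot of the HapMap populations]
Non-African Populations
[Figure: PCA plot of the non-African populations, with GIH and MEX labeled]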
Haplotype Maps
• Although there are around 10,000,000
SNPs, they fall into a relatively small number of
groups of SNPs that are correlated with
each other.
• So there are only around 300,000
unique arrangements of the SNPs
• This is not that big of a number!
• CS people can imagine an exhaustive
search
SNP Characteristics:
Linkage Disequilibrium
• Hardy-Weinberg equilibrium
– In a population with genotypes AA, aa, and Aa, if p =
freq(A) and q = freq(a), the frequencies of AA, aa, and Aa
will be p², q², and 2pq, respectively, at equilibrium.
– Similarly for two loci, each with two alleles (A/a and B/b): at
equilibrium the alleles at the two loci are independent, e.g.
freq(AB) = freq(A) × freq(B)
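The equilibrium frequencies above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original slides; the function names and the allele frequency 0.6 are assumptions chosen for illustration.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele frequency p = freq(A)."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def two_locus_equilibrium(p_a, p_b):
    """Expected haplotype frequencies when loci A and B are independent (no LD)."""
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    return {"AB": p_a * p_b, "Ab": p_a * q_b, "aB": q_a * p_b, "ab": q_a * q_b}


# Illustrative allele frequencies (assumed, not from the slides).
genotypes = hardy_weinberg(0.6)
haplotypes = two_locus_equilibrium(0.6, 0.6)
print(genotypes)
print(haplotypes)
```

In both cases the frequencies sum to 1. Any departure of the observed haplotype frequencies from the products computed by `two_locus_equilibrium` is what the LD measures quantify.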
SNP Characteristics:
Linkage Disequilibrium
[Figure: haplotype frequencies under equilibrium vs. disequilibrium]
• LD: alleles at two loci occur together more often than can be
accounted for by chance, which usually indicates that the two loci are
physically close on the DNA
• LD expected to decay monotonically on either
side of each SNP
– In mammals, LD is often lost at ~100 kb
– In fly, LD often decays within a few hundred bases
Basic Concepts
[Figure: Parent 1 (haplotype A B) × Parent 2 (haplotype a b) and the haplotypes seen in their descendants]
• High LD -> No Recombination: only A B and a b haplotypes are seen
(r² = 1), so SNP1 “tags” SNP2
• Low LD -> Recombination: A B, a b, A b, a B, etc. are all seen;
many possibilities
SNP Characteristics:
Linkage Disequilibrium
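• Three ways to calculate LD
\[ D = p_{11} - p_1 q_1 \]
where p_1 and q_1 are the frequencies of the first allele at each locus
(p_2 = 1 - p_1, q_2 = 1 - q_1), p_11 is the observed frequency of the
haplotype carrying both, and p_1 q_1 is its expected frequency under
independence
\[ D_{\max} = \begin{cases} \min(p_1 q_2,\; p_2 q_1) & \text{if } D > 0 \\ \min(p_1 q_1,\; p_2 q_2) & \text{if } D < 0 \end{cases} \]
\[ r^2 = \frac{D^2}{p_1 p_2 q_1 q_2} \]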
SNP Characteristics:
Linkage Disequilibrium
• D’ = D / Dmax (Lewontin 1964)
• Example: p1 = 0.6, q1 = 0.6, D = 0.1, so Dmax = 0.6 × 0.4 = 0.24
and D’ = 0.1/0.24 = 0.417
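The three LD measures can be computed together from one haplotype frequency and two allele frequencies. The sketch below is not from the original slides. It assumes the standard Lewontin normalisation, in which Dmax is the smaller of the two relevant products, and it uses an assumed haplotype frequency p11 = 0.46, chosen so that D = 0.1 when p1 = q1 = 0.6. The function name `ld_measures` is hypothetical.

```python
def ld_measures(p11, p1, q1):
    """Return (D, D', r^2) for two biallelic loci.

    p11: observed frequency of the haplotype carrying the first allele at both loci
    p1:  frequency of the first allele at locus 1 (p2 = 1 - p1)
    q1:  frequency of the first allele at locus 2 (q2 = 1 - q1)
    """
    if not (0.0 < p1 < 1.0 and 0.0 < q1 < 1.0):
        raise ValueError("both loci must be polymorphic")
    p2, q2 = 1.0 - p1, 1.0 - q1
    d = p11 - p1 * q1                      # observed minus expected
    if d >= 0:
        d_max = min(p1 * q2, p2 * q1)      # assumed Lewontin convention
    else:
        d_max = min(p1 * q1, p2 * q2)
    d_prime = d / d_max
    r2 = d * d / (p1 * p2 * q1 * q2)
    return d, d_prime, r2


# p11 = 0.46 is an assumption for illustration.
d, d_prime, r2 = ld_measures(p11=0.46, p1=0.6, q1=0.6)
print(round(d, 3), round(d_prime, 3), round(r2, 3))
```

With these inputs D' comes out at about 0.417 and r² at about 0.17, which shows that the two normalisations can give quite different impressions of the same D.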
SNP Characteristics:
Linkage Disequilibrium
• Statistical Significance of LD
– Chi-square test with 1 df on the 2 × 2 table of counts:

         B1     B2     Total
A1       n11    n12    n1.
A2       n21    n22    n2.
Total    n.1    n.2    nT

\[ X^2 = \sum_{i,j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \]
– General chi-square tests
\[ X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} \]
– Permutation tests
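The permutation approach can be sketched in a few lines: shuffle the alleles at one locus to break any association, recompute D, and see how often the shuffled value is at least as extreme as the observed one. This is an illustrative sketch rather than the procedure used in the course; the 0/1 allele coding, the function names, the number of permutations and the sample itself are all assumptions.

```python
import random


def d_statistic(haplotypes):
    """D = p11 - p1*q1 for haplotypes given as (allele_A, allele_B) pairs coded 0/1."""
    n = len(haplotypes)
    p1 = sum(a for a, _ in haplotypes) / n
    q1 = sum(b for _, b in haplotypes) / n
    p11 = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    return p11 - p1 * q1


def permutation_p_value(haplotypes, n_perm=999, seed=0):
    """Fraction of shuffles whose |D| is at least the observed |D|."""
    rng = random.Random(seed)
    observed = abs(d_statistic(haplotypes))
    a_alleles = [a for a, _ in haplotypes]
    b_alleles = [b for _, b in haplotypes]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(b_alleles)  # break the pairing between the two loci
        permuted = list(zip(a_alleles, b_alleles))
        if abs(d_statistic(permuted)) >= observed - 1e-12:
            hits += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical sample: 20 haplotypes with both "1" alleles, 20 with both "0" alleles.
sample = [(1, 1)] * 20 + [(0, 0)] * 20
p_value = permutation_p_value(sample)
print(d_statistic(sample), p_value)
```

A permutation test makes no distributional assumption, which is useful when some cells of the table are too small for the chi-square approximation.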
SNP Characteristics:
Linkage Disequilibrium
• LD patterns reveal haplotype blocks: clusters of linked
SNPs
Haplotype: an illustration
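Unphased genotypes at three SNPs for six individuals:
A1A1, A2B2, A3A3
A1B1, B2B2, B3B3
A1A1, B2B2, A3B3
A1B1, B2B2, A3B3
B1B1, B2B2, A3B3
A1B1, B2B2, A3B3
For a genotype that is heterozygous at two SNPs, such as A1B1, B2B2, A3B3, the haplotype pair is either
A1-B2-A3 and B1-B2-B3
or
A1-B2-B3 and B1-B2-A3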
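The phase ambiguity in this illustration can be enumerated directly. The sketch below is not from the slides: it lists every haplotype pair that is consistent with an unphased genotype by trying both orientations at each SNP. The function name and the tuple representation are assumptions.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Return the set of unordered haplotype pairs consistent with a genotype.

    genotype: one (allele, allele) tuple per SNP,
              e.g. [("A1", "B1"), ("B2", "B2"), ("A3", "B3")].
    """
    pairs = set()
    # At each SNP, either keep the two alleles as written or swap them
    # between the two haplotypes.
    for flips in product([False, True], repeat=len(genotype)):
        hap1 = tuple(g[1] if f else g[0] for g, f in zip(genotype, flips))
        hap2 = tuple(g[0] if f else g[1] for g, f in zip(genotype, flips))
        pairs.add(frozenset([hap1, hap2]))
    return pairs


genotype = [("A1", "B1"), ("B2", "B2"), ("A3", "B3")]
resolutions = compatible_haplotype_pairs(genotype)
for pair in sorted(tuple(sorted(p)) for p in resolutions):
    print(pair)
```

A genotype with k heterozygous SNPs has 2^(k-1) possible resolutions, so an individual with at most one heterozygous site is unambiguous. Such individuals are what make it possible to infer the likely phase of the ambiguous ones.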
SNP Characteristics:
Linkage Disequilibrium
• Haplotype boundary: blocks of sequence
with strong LD within blocks and no LD
between blocks
• Haplotype size distribution
SNP Characteristics:
Linkage Disequilibrium
• [C/T] A T X C [A/C] [T/A]
– Possible haplotypes: 2³ = 8
– In reality, a few common haplotypes explain 90% of the
variation
• Tagging SNPs:
– SNPs that capture
most of the variation
in haplotypes
– Removes
redundancy among SNPs
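To make the idea of tagging concrete, the sketch below enumerates the 2³ possible haplotypes at the three variable sites and then searches for the smallest sets of SNPs that still distinguish a handful of common haplotypes. It is an illustration only: the three "common" haplotypes are invented, and the brute-force search in the hypothetical `minimal_tag_sets` function is one simple way to pick tags, not the method any particular tool uses.

```python
from itertools import combinations, product

# Alleles at the three variable sites of the example sequence.
snp_alleles = [("C", "T"), ("A", "C"), ("T", "A")]
all_haplotypes = ["".join(h) for h in product(*snp_alleles)]

# Hypothetical common haplotypes (assumed for illustration).
common = ["CAT", "TCA", "CCA"]


def minimal_tag_sets(haplotypes):
    """Smallest sets of SNP positions whose alleles distinguish every haplotype."""
    n_snps = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for size in range(1, n_snps + 1):
        found = [
            subset
            for subset in combinations(range(n_snps), size)
            if len({tuple(h[i] for i in subset) for h in haplotypes}) == n_distinct
        ]
        if found:
            return found
    return []


tags = minimal_tag_sets(common)
print(len(all_haplotypes), tags)
```

In this invented sample the second and third SNPs always change together, so one of them is redundant and typing two SNPs is enough to identify each common haplotype.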
SNP Characteristics:
Population Stratification
• Population Stratification: cases and controls are
drawn from two genetically different
populations in different proportions
• Stratification may be environmental,
cultural, or genetic
• Could give spurious results in case-control
association studies (later this lecture)
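A small numerical example shows how stratification alone can create an association. The counts below are hypothetical and chosen so that, within each population, the marker is unrelated to disease (odds ratio 1). The populations differ in marker frequency, and the cases come mostly from one of them. The helper `odds_ratio` is an illustrative name, not from the slides.

```python
def odds_ratio(table):
    """Odds ratio for [[case_carrier, case_noncarrier], [control_carrier, control_noncarrier]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Hypothetical counts (assumed for illustration).
pop1 = [[80, 20], [20, 5]]    # marker carried by 80%; 100 cases, 25 controls
pop2 = [[5, 20], [20, 80]]    # marker carried by 20%; 25 cases, 100 controls

# Pooling ignores population membership, as an unmatched study would.
pooled = [[pop1[i][j] + pop2[i][j] for j in range(2)] for i in range(2)]

print(odds_ratio(pop1), odds_ratio(pop2), odds_ratio(pooled))
```

The pooled odds ratio is well above 1 even though neither population shows any effect, which is why cases and controls need to be matched on ancestry.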
SNP Discovery Methods
• Where are the SNPs in the human genome?
• Sequence many individuals and find mismatches in the
alignments; too costly to sequence everyone
• Computational:
– Align the genome assembly to ESTs (mRNA) to find SNPs in
the coding regions
– Need to differentiate between SNPs and sequencing
errors
• Resequence to verify
• dbSNP: 6 M SNPs
SNP Discovery Methods
• Sequence-free SNP detection
• First check whether big regions
have SNPs
– Basic idea: denature and re-anneal
two samples, detect heteroduplexes
– Can pool samples (e.g. 10 African
with 10 Caucasian) to speed
screening
• Then sequence smaller regions to
verify
SNP Genotyping
• For a known locus TT C/A AG, does this individual
have CC, AA or AC?
• Use PCR to amplify enough of the surrounding region
• Primer before the SNP, then add ddCTP and ddATP
• Sequence a few bp: add A, C, G, T in turn; the correct nt
is incorporated and gives off light proportional to the number of
incorporated nt
[Figure: example signals for CC, AA and CA genotypes]
• Use fluorescent probes (e.g. CTGAA), which give out light if
hybridized to the target
3’- GACTT -5’
• SNP chip (simultaneously genotype thousands of
SNPs)
Association Studies
• Association between genetic markers and
phenotype
• Especially, find disease genes, SNP /
haplotype markers, for susceptibility
prediction and diagnosis
• Two strategies:
– Population-based case-control association
studies
– Family-based association studies
Case-Control Association Studies
• SNP/haplotype marker frequency in a sample of
affected cases is compared to that in an age/sex/population-matched
sample of unaffected controls
• Expected counts:
– (24 + 278) * (24 + 86) / (24 + 278 + 86 + 296) = 49
– (278 + 296) * (86 + 296) / (24 + 278 + 86 + 296) = 321
• \[ \chi^2 = \sum_{i,j} \frac{(e_{ij} - o_{ij})^2}{e_{ij}} = 27.5, \quad 1\ \text{df}, \quad p < 0.001 \]
Pitfalls of Association Studies
• Association ≠ causation
• Difficult when several genes affect a
quantitative trait
• Penetrance (fraction of people with the marker
who show the trait) and expressivity (severity of
the effect)
• Population stratification
– e.g. some SNPs are unique to an ethnic group
– Need to make sure sample groups match
– Hidden environmental structure
• Not very reproducible
Family-based Association Studies (TDT)
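• Look at allele transmission from heterozygous (Aa) parents to
one affected child in each of many unrelated families
• Under the null, each transmission is like a coin toss: A and a are
equally likely to be passed on
\[ Z_{TDT} = \frac{n_A - n_a}{\sqrt{n_A + n_a}}, \qquad Z_{TDT}^2 \sim \chi^2,\ 1\ \text{df} \]
where n_A and n_a are the numbers of times A and a are transmitted
[Worked example from the slide; only the values 92, 92 and the result 2.11 survive]
• Could also compare
allele frequency
between affected vs
unaffected children
in the same family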
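The TDT statistic is easy to compute once the transmissions from heterozygous parents have been counted. The sketch below uses the usual form Z = (n_A - n_a) / sqrt(n_A + n_a). The counts 60 and 40 are hypothetical and are not the numbers from the slide's worked example, and the function name `tdt` is an assumption.

```python
import math


def tdt(n_a_upper, n_a_lower):
    """TDT from heterozygous (Aa) parents.

    n_a_upper: number of times allele A was transmitted to an affected child
    n_a_lower: number of times allele a was transmitted to an affected child
    """
    z = (n_a_upper - n_a_lower) / math.sqrt(n_a_upper + n_a_lower)
    chi2 = z * z  # approximately chi-square with 1 df under the null
    # Two-sided normal p-value, identical to the 1-df chi-square tail of z^2.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, chi2, p_value


# Hypothetical counts, chosen only for illustration.
z, chi2, p_value = tdt(60, 40)
print(z, chi2, round(p_value, 4))
```

Because each family serves as its own control, this comparison is not distorted by population stratification in the way a case-control comparison can be.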
Reproducibility of Association Studies
• Most reported associations have not been
consistently reproduced
• Hirschhorn et al., Genetics in Medicine, 2002:
review of association studies
– 603 associations of polymorphisms and disease
– 166 studied in at least three populations
– Only 6 seen in > 75% of studies
Causes for Inconsistency
• What explains the lack of reproducibility?
• False positives
– Multiple hypothesis testing
– Ethnic admixture/Stratification
• False negatives
– Lack of power for weak effects
• Population differences
– Variable LD with causal SNP
– Population-specific modifiers
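The multiple-testing problem is easy to quantify. The sketch below computes how many nominally significant results to expect by chance when many SNPs are tested, along with a Bonferroni-corrected threshold. Bonferroni is a common correction that the slides do not name, so it is shown here only as one possible remedy. The number of SNPs and the function names are assumptions.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of nominally significant tests when every null is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


n_snps = 1_000_000  # hypothetical number of SNPs tested
print(expected_false_positives(n_snps, 0.05))
print(bonferroni_threshold(n_snps))
```

With a million tests at the 0.05 level, tens of thousands of SNPs would pass by chance alone, so an uncorrected "hit" carries very little evidence.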
Causes for Inconsistency
• A sizable fraction (but less than half) of
reported associations are likely correct
• Genetic effects are generally modest
– Beware the winner’s curse (auction theory)
– In association studies, first positive report is
equivalent to the winning bid
• Large study sizes are
needed to detect these
reliably
Should we Believe
Association Study Results?
• Initial skepticism is warranted
• Replication, especially with low p values, is
encouraging
• Large sample sizes are crucial
• E.g. PPARγ
Pro12Ala and
diabetes
Acknowledgement
• Tim Niu
• Kenneth Kidd, Judith Kidd and Glenys
Thomson
• Joel Hirschhorn
• Greg Gibson & Spencer Muse