Haplotype Blocks and Tagging SNPs

Association Studies, Haplotype Blocks and Tagging SNPs
Prof. Sorin Istrail
Association studies
[Figure: individuals grouped by phenotype (Disease vs. Control, Responder vs. Non-responder) and labeled by their allele (Allele 0 or Allele 1) at Marker A]
Marker A is associated with Phenotype
Association studies
• Evaluate whether nucleotide
polymorphisms associate with
phenotype
[Figure: nucleotide alleles (A, C, G, T) at polymorphic sites across the sampled sequences]
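One common way to evaluate whether a single marker associates with a phenotype is a chi-square test on a 2x2 table of allele counts. The sketch below is only an illustration of that idea, not a method stated on the slides; the function name `chi_square_2x2` and all counts are hypothetical, invented for the example.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Rows are phenotype groups (e.g. cases, controls); columns are the
    counts of allele 0 and allele 1 at one marker.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical allele counts, invented purely for illustration.
cases = (30, 10)      # allele 0, allele 1 among cases
controls = (10, 30)   # allele 0, allele 1 among controls

statistic = chi_square_2x2(*cases, *controls)
# 3.841 is the 5% critical value of chi-square with 1 degree of freedom.
associated = statistic > 3.841
print(statistic, associated)
```

A genome-wide study would repeat such a test at many markers and would then need a multiple-testing correction, which this sketch leaves out.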
Hypothesis – Haplotype Blocks?
The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks.
Patil et al., Science, 2001;
Jeffreys et al., Nature Genetics, 2001;
Daly et al., Nature Genetics, 2001
Haplotype Block Structure
LD-Blocks, and 4-Gamete Test Blocks
[Figure: a 200 kb region of DNA showing sense genes, antisense genes, SNPs, and haplotype blocks numbered 1 to 4]
One definition of block
• Based on the Four Gamete test.
• Intuition: when all four gametes are observed for a pair of SNPs, there is a recombination point somewhere in between the two sites (assuming no recurrent mutation)
Four Gamete Block Test
• Hudson and Kaplan 1985
A segment of SNPs is a block if, for every pair of SNPs in it, at most 3 of the 4 gametes (00, 01, 10, 11) are observed.
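[Figure: two example 0/1 haplotype matrices. In the first, every pair of SNPs shows at most three gametes: BLOCK. In the second, some pair of SNPs shows all four gametes: VIOLATES THE BLOCK DEFINITION.]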
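The block definition above translates almost directly into code. The following is a minimal sketch, assuming haplotypes are given as equal-length strings of 0/1 alleles; the function names and both example matrices are illustrative and are not taken from the slide's figure.

```python
from itertools import combinations


def gametes(haplotypes, i, j):
    """Set of allele pairs observed at sites i and j across haplotypes."""
    return {(h[i], h[j]) for h in haplotypes}


def is_block(haplotypes):
    """True if no pair of sites shows all four gametes 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    return all(
        len(gametes(haplotypes, i, j)) < 4
        for i, j in combinations(range(n_sites), 2)
    )


# Illustrative 0/1 haplotypes (rows) over three SNP sites (columns).
block = ["000", "011", "111", "111"]
not_block = ["000", "011", "110", "101"]

print(is_block(block))      # sites 0 and 1 show only 00, 01, 11
print(is_block(not_block))  # sites 0 and 1 show 00, 01, 11 and 10
```

The check is quadratic in the number of sites, which is fine for a sketch but may be worth optimizing on long segments.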
Finding Recombination Hotspots:
Many Possible Partitions into Blocks
[Figure: aligned haplotypes, highlighting pairs of sites at which all four gametes are present]
• Find the left-most right endpoint of any constraint and mark the site before it a recombination site.
• Eliminate any constraints crossing that site.
• Repeat until all constraints are gone.
The final result is a minimum-size set of sites crossing all constraints.
A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T
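The greedy procedure above can be sketched as an interval-stabbing routine. This is one possible reading of the steps, under these assumptions: each constraint is a pair `(left, right)` of site indices between which a recombination must fall, and a returned value `b` stands for a breakpoint placed immediately before site `b`. The function name and the example constraints are hypothetical.

```python
def min_recombination_sites(constraints):
    """Greedy choice of breakpoints so that every constraint is crossed.

    A breakpoint b sits between sites b - 1 and b, so it crosses the
    constraint (left, right) exactly when left < b <= right.
    """
    breakpoints = []
    # Visiting constraints by right endpoint means the first uncrossed one
    # always has the left-most right endpoint among those remaining.
    for left, right in sorted(constraints, key=lambda c: c[1]):
        if not breakpoints or breakpoints[-1] <= left:
            # Not crossed yet: break as far right as this constraint allows.
            breakpoints.append(right)
        # Otherwise the latest breakpoint already crosses this constraint,
        # which is the "eliminate" step.
    return breakpoints


# Hypothetical constraints over sites 0..6.
example = [(0, 3), (1, 2), (2, 5), (4, 6)]
print(min_recombination_sites(example))
```

Placing each breakpoint as far right as the current constraint allows is what lets it cross as many later constraints as possible.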
Tagging SNPs
ACGATCGATCATGAT
GGTGATTGCATCGAT
ACGATCGGGCTTCCG
ACGATCGGCATCCCG
GGTGATTATCATGAT
An example of a real data set and its haplotype block structure. Colors refer to the founding population, one color for each founding haplotype.
Only 4 SNPs are needed to tag
all the different haplotypes
A------A---TG--
G------G---CG--
A------G---TC--
A------G---CC--
G------A---TG--
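A quick check of the tagging example: the sketch below keeps only the four tagging positions (sites 1, 8, 12 and 13, read off the masked rows) and confirms that the five haplotypes still have distinct signatures. The helper names `mask` and `signature` are illustrative.

```python
# The five haplotypes listed on the slide.
HAPLOTYPES = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]

# 0-based positions of the four tagging SNPs, read off the masked rows.
TAG_POSITIONS = (0, 7, 11, 12)


def mask(haplotype, keep):
    """Replace every site outside `keep` with '-'."""
    return "".join(c if i in keep else "-" for i, c in enumerate(haplotype))


def signature(haplotype, positions):
    """Alleles of one haplotype at the tagging positions only."""
    return "".join(haplotype[i] for i in positions)


signatures = [signature(h, TAG_POSITIONS) for h in HAPLOTYPES]
for h in HAPLOTYPES:
    print(mask(h, TAG_POSITIONS))
```

The check shows only that these four sites separate the five haplotypes; it does not by itself establish that four is the smallest possible number.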
Informativeness
A measure of the “information” one SNP contains about another SNP. Useful for designing SNP arrays and for selecting tagging SNPs.
[Figure: two haplotypes, h1 = 0 0 1 1 0 and h2 = 0 0 1 0 1, with one SNP s marked]
Informativeness
 0  0  1  1  0
 0  0  1  0  1
 0  1  0  0  0
 1  1  0  1  1
s1 s2 s3 s4 s5
I(s1,s2) = 2/4 = 1/2
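The value I(s1,s2) = 1/2 can be reproduced with a short sketch. It assumes I(s,t) is the fraction of haplotype pairs distinguished by t that are also distinguished by s, a reading that matches the slide's number but is not spelled out there, and it assumes the matrix is read with haplotypes as rows and s1 to s5 as columns. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Haplotype matrix from the slide: rows are haplotypes, columns are s1..s5.
HAPS = ["00110", "00101", "01000", "11011"]


def distinguished_pairs(snp):
    """Haplotype pairs whose alleles differ at column `snp` (0-based)."""
    return {
        (a, b)
        for a, b in combinations(range(len(HAPS)), 2)
        if HAPS[a][snp] != HAPS[b][snp]
    }


def informativeness(s, t):
    """Fraction of the pairs distinguished by t that s also distinguishes.

    Raises ZeroDivisionError if t distinguishes no pair at all.
    """
    target = distinguished_pairs(t)
    return Fraction(len(distinguished_pairs(s) & target), len(target))


print(informativeness(0, 1))  # I(s1, s2), prints 1/2
```

Here s2 distinguishes four haplotype pairs and s1 distinguishes two of those four, which gives 2/4.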
Informativeness
 0  0  1  1  0
 0  0  1  0  1
 0  1  0  0  0
 1  1  0  1  1
s1 s2 s3 s4 s5
I({s1,s2}, s4) = 3/4
Informativeness
 0  0  1  1  0
 0  0  1  0  1
 0  1  0  0  0
 1  1  0  1  1
s1 s2 s3 s4 s5
I({s3,s4},{s1,s2,s5}) = 3
S = {s3,s4} is a Minimal Informative Subset
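The set-valued examples can be checked the same way. The sketch below assumes a set S distinguishes a haplotype pair when at least one of its SNPs does, and that I(S,T) is the sum of I(S,t) over t in T; the second assumption is inferred from the slide's value of 3 and is not stated there. A brute-force search then lists the smallest subsets that distinguish every pair. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Haplotype matrix from the slide: rows are haplotypes, columns are s1..s5.
HAPS = ["00110", "00101", "01000", "11011"]
N_SNPS = len(HAPS[0])


def pairs(snps):
    """Haplotype pairs distinguished by at least one SNP in `snps`."""
    return {
        (a, b)
        for a, b in combinations(range(len(HAPS)), 2)
        if any(HAPS[a][s] != HAPS[b][s] for s in snps)
    }


def info(subset, t):
    """I(S, t): share of the pairs distinguished by t that S also distinguishes."""
    target = pairs([t])
    return Fraction(len(pairs(subset) & target), len(target))


def info_sets(subset, targets):
    """I(S, T), assumed here to be the sum of I(S, t) over t in T."""
    return sum(info(subset, t) for t in targets)


def minimum_informative_subsets():
    """All smallest SNP subsets that distinguish every pair the full set does."""
    everything = pairs(range(N_SNPS))
    for size in range(1, N_SNPS + 1):
        found = [c for c in combinations(range(N_SNPS), size)
                 if pairs(c) == everything]
        if found:
            return found
    return []


print(info([0, 1], 3))               # I({s1,s2}, s4)
print(info_sets([2, 3], [0, 1, 4]))  # I({s3,s4}, {s1,s2,s5})
print(minimum_informative_subsets()) # 0-based column indices
```

On this matrix the search returns several subsets of size 2, with {s3,s4} among them, which is consistent with the slide calling it "a" minimal informative subset. The exhaustive search is exponential in the number of SNPs and is meant only for toy inputs.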
Informativeness
Graph theory insight
Minimum Set Cover = Minimum Informative Subset
[Figure: the haplotype matrix over SNPs s1 to s5, beside a bipartite graph linking the SNPs to the edges e1 to e6]
In the example: Minimum Set Cover {s3, s4} = Minimum Informative Subset
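The set-cover view can be sketched by treating every pair of distinct haplotypes as an edge and letting each SNP cover the edges it distinguishes. Two assumptions are labeled here: the edges e1 to e6 are taken to be the six haplotype pairs (the slide's own numbering is not recoverable), and the classic greedy heuristic stands in for an exact solver, since minimum set cover is NP-hard in general.

```python
from itertools import combinations

# Haplotype matrix from the slide: rows are haplotypes, columns are s1..s5.
HAPS = ["00110", "00101", "01000", "11011"]
SNPS = ["s1", "s2", "s3", "s4", "s5"]

# Edges: every pair of haplotypes that differ at some SNP.
EDGES = [(a, b) for a, b in combinations(range(len(HAPS)), 2) if HAPS[a] != HAPS[b]]

# Each SNP covers the edges whose endpoints carry different alleles at that SNP.
COVERS = {
    name: {(a, b) for a, b in EDGES if HAPS[a][i] != HAPS[b][i]}
    for i, name in enumerate(SNPS)
}


def greedy_set_cover(universe, covers):
    """Repeatedly pick the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(covers, key=lambda name: len(covers[name] & uncovered))
        if not covers[best] & uncovered:
            raise ValueError("universe cannot be covered")
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


cover = greedy_set_cover(EDGES, COVERS)
print(cover)
print(COVERS["s3"] | COVERS["s4"] == set(EDGES))  # does {s3, s4} cover every edge?
```

Greedy ties are broken by SNP order, so the cover it returns may differ from {s3, s4} while having the same size; several minimum covers exist for this matrix.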