The Questions

• Why study haplotypes?
• How can haplotypes be inferred?
• What are haplotype blocks?
• How can haplotype information be used to
test associations with disease phenotypes?
• How shall we select a subset of informative
SNPs for large-scale typing?
• How can haplotype information be visualized?
Methods for inferring haplotype blocks and
informative SNP selection
Detecting haplotype blocks on Chromosomes 6,21,22
Hypothesis – Haplotype Blocks?
• The genome consists largely of blocks of
common SNPs with relatively little
recombination shuffling in the blocks
– Patil et al., Science, 2001; Jeffreys et al., Nature Genetics; Daly
et al., Nature Genetics, 2001
• Compare block detection methods.
– How well we can detect haplotype blocks?
– Are the detection methods consistent?
Block detection methods
• Four-gamete test, Hudson and Kaplan, Genetics,
1985, 111, 147-164.
– A segment of SNPs is a block if between every pair (aA and
bB) of SNPs at most 3 of the 4 possible gametes (ab, aB, Ab, AB) are
observed.
• P-value test
– A segment of SNPs is a block if for 95% of the pairs of
SNPs we can reject the hypothesis (with P-value 0.05
or 0.001) that they are in linkage equilibrium.
• LD-based, Gabriel et al., Science, 2002, 296:2225-9
– Next slide
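The four-gamete criterion above can be sketched in a few lines. This is a minimal illustration, assuming phased, bi-allelic haplotypes given as equal-length strings; the function name and the toy haplotypes are hypothetical and not from the slides.

```python
from itertools import combinations

def passes_four_gamete_test(haplotypes, start, end):
    """Return True if every pair of SNP columns in [start, end) shows at most 3 gametes."""
    for a, b in combinations(range(start, end), 2):
        # The set of allele pairs actually observed at columns a and b.
        gametes = {(h[a], h[b]) for h in haplotypes}
        if len(gametes) > 3:
            return False
    return True

# Toy data (hypothetical): 4 haplotypes over 4 SNPs.
haps = ["AGTA", "ACAC", "AGAC", "ACTA"]
print(passes_four_gamete_test(haps, 0, 2))  # columns 0 and 1
print(passes_four_gamete_test(haps, 1, 3))  # columns 1 and 2
```

In this toy sample, columns 1 and 2 show all four gametes (GT, CA, GA, CT), so a segment containing both would not count as a block under this test.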
Gabriel et al. method
• For every pair of SNPs we calculate an
upper and lower confidence bound on
D’ (Call these D’u, D’l)
• We then split the pairs of SNPs into 3
classes:
– Class I: Two SNPs are in ‘Strong LD’ if D’u
> .98 and D’l > .7.
– Class II: Two SNPs show ‘Strong evidence
for recombination’ if D’u < .9.
– Class III: The remaining SNP pairs, these
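The three-way split above can be written as a small decision function. This is only a sketch: it takes the confidence bounds D’l and D’u as given (how they are computed is not shown here), and the class labels returned are hypothetical names.

```python
def classify_pair(d_lower, d_upper):
    """Classify a SNP pair from the lower/upper confidence bounds on D'."""
    if d_upper > 0.98 and d_lower > 0.7:
        return "strong_ld"        # Class I
    if d_upper < 0.9:
        return "recombination"    # Class II
    return "other"                # Class III: the remaining pairs

print(classify_pair(0.80, 0.99))
print(classify_pair(0.20, 0.85))
print(classify_pair(0.50, 0.95))
```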
Block View
Block comparison
Conclusions
• Clear evidence of “blocky”
structure in Chromosomes
• Different block detection
methods are highly concordant.
• However, boundaries defined by
these methods are not sharp and
we believe there is no single “true”
block partition.
Block free SNP selection
What does it mean to tag SNPs?
• SNP = Single Nucleotide Polymorphism
– Caused by a mutation at a single position in human
genome, passed along through heredity
– Characterizes much of the genetic differences
between humans
– Most SNPs are bi-allelic
– Estimated several million common SNPs (minor allele
frequency >10%)
• To tag = select a subset of SNPs to work with
Why do we tag SNPs?
• Disease Association Studies
– Goal: Find genetic factors correlated with disease
– Look for discrepancies in haplotype structure
– Statistical Power: Determined by sample size
– Cost: Determined by overall number of SNPs typed
• This means, to keep cost down, reduce the number of SNPs
typed
• Choose a subset of SNPs, [tag SNPs] that can
predict other SNPs in the region with small
probability of error
– Remove redundant information
What do we know?
• SNPs physically close to one another tend to be
inherited together
– This means that long stretches of the genome (sans mutational
events) should be perfectly correlated if not for…
• Recombination breaks apart haplotypes and slowly
erodes correlation between neighboring alleles
– Tends to blur the boundaries of LD blocks
• Since SNPs are bi-allelic, each SNP defines a partition
on the population sample.
– If you are able to reconstruct this partition by using other SNPs,
there would be no need to type this SNP
– For any single SNP, this reconstruction is not difficult…
Complications:
• But the Global solution to the minimum
number of tag SNPs necessary is NP-hard
• The predictions made will not be perfect
– Correlation between neighboring tag SNPs
not as strong as correlation between
neighboring (not necessarily tagged) SNPs
• Haplotype information is usually not
available for technical reasons
– Need for Phasing
• Tagging SNPs can be partitioned into the
following three steps:
– Determining neighborhoods of LD: which
SNPs can infer each other
– Tagging quality assessment: Defining a
quality measure that specifies how well a set
of tag SNPs captures the variance observed
– Optimization: Minimizing the number of tag
SNPs
Optimal Haplotype Block-Free
Selection of Tagging SNPs for
Genome-Wide Association
Studies
Halldorsson et al (2004)
The Definition of Perfect Prediction of a SNP from a set of SNPs
“Predict a SNP” (cont)
[Figure: haplotypes Hap1 = AGTA and Hap2 = ACAC over sites (SNPs) 1 to 4. Annotations: “Nothing to Predict”, “Predicts SNP 3”, “Predicts SNP 4”, “Predicts each of SNPs 2 and 4”, “Predicts each of SNPs 2 and 3”.]
A graphical notation
AGTA
ACAC
“ The Blue box Predicts the Green SNP”
Three SNPs Predicting Each
Other
Only one of the three needs to be typed
G T A
C A C
Either one will do
A Pair of SNPs Predicting
Another SNP
SNPs 1 and 3 together Predict SNP 4
SNP #: 1 2 3 4
G T A G
C T A T
G G T T
No single SNP (different than SNP 4) can predict SNP 4
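The claim on this slide can be checked mechanically. The sketch below assumes one reading of "predicts": a set of SNPs predicts a target SNP when every pair of haplotypes that differs at the target also differs at one or more SNPs of the set. It also assumes the three haplotypes in the figure read GTAG, CTAT, GGTT; the function names are hypothetical.

```python
from itertools import combinations

def distinguished_pairs(haplotypes, snp):
    """Pairs of haplotype indices whose alleles differ at column `snp` (0-based)."""
    return {(i, j) for i, j in combinations(range(len(haplotypes)), 2)
            if haplotypes[i][snp] != haplotypes[j][snp]}

def perfectly_predicts(haplotypes, tag_snps, target):
    """True if every pair split by `target` is split by some SNP in `tag_snps`."""
    covered = set()
    for s in tag_snps:
        covered |= distinguished_pairs(haplotypes, s)
    return distinguished_pairs(haplotypes, target) <= covered

haps = ["GTAG", "CTAT", "GGTT"]  # assumed reading of the figure
# SNPs 1 and 3 (columns 0 and 2) together against SNP 4 (column 3):
print(perfectly_predicts(haps, [0, 2], 3))
# Each single SNP on its own against SNP 4:
print([perfectly_predicts(haps, [s], 3) for s in (0, 1, 2)])
```

Under these assumptions the pair succeeds while each single SNP fails, which matches the statement on the slide.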
Finding Neighborhoods:
• Goal is to select SNPs in the sample that characterize
regions of common recent ancestry that will contain
conserved haplotypes
• Recent common ancestry means that there has been
little time for recombination to break apart haplotypes
• Constructing fixed size neighborhoods in which to look
for SNPs is not desirable because of the variability of
recombination rates and historical LD across the
genome
• In fact, the size of informative neighborhoods is highly
variable precisely because of variable recombination
rates and SNP density
• Authors avoid block-building by recursively creating
neighborhood with help of ‘informativeness’ measure
Defining Informativeness:
• A measure of tagging quality assessment
• Assume all SNPs are bi-allelic
• Notation:
I(s,t) = Informativeness of a SNP s with respect to a SNP t
– i, j are two haplotypes drawn at random from the uniform distribution on the set
of distinct haplotype pairs.
– Note: I(s,t) = 1 implies complete predictability, I(s,t) = 0 when t is monomorphic in
the population.
• I(s,t) is easily estimated through the use of the bipartite clique that defines each
SNP
– We can write I(s,t) in terms of an edge set
• Definition of I is easily extended to a set of SNPs S by taking the union of
edge sets
• Assumes the availability of haplotype phases
• New measure avoids some of the difficulties traditional LD measures have
experienced when applied to tagging SNP selection
– The concept of pairwise LD fails to reliably capture the higher-order
dependencies implied by haplotype structure
Bounded-Width Algorithm: k Most
Informative SNPs (k-MIS)
• Input: A set of n SNPs S
• Output: subset of SNPs S’ such that I(S’,S) is
maximal
• In its most general form, k-MIS is NP-hard by
reduction of the set cover problem to MIS
• Algorithm optimizes informativeness, although
easily adapted for other measures
• Define distance between two SNPs as the
number of SNPs in between them
• k-MIS can be solved as long as distance
between adjacent tag SNPs not too large
• Define
– Assignment As[i]
– S(As)
– Recursion function Iw(s,l, S(A)) = score of the
most informative subset of l SNPs chosen
from SNPs 1 through s such that As describes
the assignment for SNP s.
• Pseudocode
• Complexity: O(nk2^w) in time and O(k2^w) in
space, assuming maximal window w
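To make the k-MIS objective concrete, the problem statement can be solved by exhaustive search on very small inputs. This is not the bounded-width dynamic program described on this slide (its pseudocode is not reproduced here). It is a brute-force sketch that assumes I(S’,S) is the sum over all SNPs t in S of the fraction of haplotype pairs split by t that are also split by some SNP in S’. The function names and toy haplotypes are hypothetical.

```python
from itertools import combinations

def split_pairs(haps, s):
    """Haplotype index pairs that carry different alleles at column s."""
    return {(i, j) for i, j in combinations(range(len(haps)), 2)
            if haps[i][s] != haps[j][s]}

def informativeness(haps, tags, t):
    """Fraction of pairs split by t that are split by some tag; 0 if t is monomorphic."""
    e_t = split_pairs(haps, t)
    if not e_t:
        return 0.0
    covered = set().union(*(split_pairs(haps, s) for s in tags))
    return len(covered & e_t) / len(e_t)

def k_mis_exhaustive(haps, k):
    """Try every subset of at most k SNPs; exponential, for tiny examples only."""
    n = len(haps[0])
    best_score, best_tags = -1.0, ()
    for size in range(1, k + 1):
        for tags in combinations(range(n), size):
            score = sum(informativeness(haps, tags, t) for t in range(n))
            if score > best_score:
                best_score, best_tags = score, tags
    return best_score, best_tags

haps = ["GTAG", "CTAT", "GGTT"]
print(k_mis_exhaustive(haps, 2))
```

The search keeps the first subset that reaches the best score, so other subsets with the same score are not reported.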
Evaluation
• Algorithm evaluated by Leave-One-Out Cross-Validation
– accumulated accuracy over all haplotypes gives a global measure of the
accuracy for the given data set.
• SNPs not typed were predicted by a majority vote among all
haplotypes in the training set that were identical to the one being
inferred
– If no such haplotypes existed, the majority vote is taken among all
training haplotypes that have the same allele call on all but one of the
typed SNPs
– etc.
• When compared to block-based method of Zhang:
– Presumably, the advantage is due to the cost imposed by artificially
restricting the range of influence of the few SNPs chosen by block
boundaries
• ‘Informativeness’ was shown to be a “good” measure
– aligned well with the leave-one-out cross validation results
– extremely close to the results of optimizing for haplotype r2
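The leave-one-out procedure with majority-vote imputation described above can be sketched as follows. This is a minimal illustration, not the authors' code: ties in the vote are broken by first occurrence in the training set (an assumption), and the function names and example haplotypes are hypothetical.

```python
from collections import Counter

def impute(training, typed, observed, target):
    """Predict the allele at `target` from the alleles `observed` at the `typed` SNPs."""
    def mismatches(h):
        return sum(h[s] != a for s, a in zip(typed, observed))
    # Use exact matches if any exist, otherwise all-but-one matches, and so on.
    best = min(mismatches(h) for h in training)
    votes = Counter(h[target] for h in training if mismatches(h) == best)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(haplotypes, typed):
    """Fraction of untyped alleles recovered when each haplotype is held out in turn."""
    untyped = [s for s in range(len(haplotypes[0])) if s not in typed]
    correct = total = 0
    for i, h in enumerate(haplotypes):
        training = haplotypes[:i] + haplotypes[i + 1:]
        observed = [h[s] for s in typed]
        for t in untyped:
            correct += impute(training, typed, observed, t) == h[t]
            total += 1
    return correct / total

haps = ["GGGAT", "GCTGA", "ACGAT", "ACGAT", "ACTGA"]
print(leave_one_out_accuracy(haps, typed=[0, 2]))
```

Accumulating `correct / total` over all held-out haplotypes gives the single global accuracy figure mentioned in the first bullet.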
Premise:
Informative SNP selection
• Select SNPs to use in an association study
– Would like to associate single nucleotide
polymorphisms (SNPs) with disease.
• Very large number of SNPs
– Chromosome wide studies, whole genome-scans.
– For cost effectiveness, select only a subset.
• Closely spaced SNPs are highly correlated
– It is less likely that there has been a recombination
between two SNPs if they are close to each other.
SNP selection within blocks
• Partition chromosome into haplotype blocks.
– Zhang et al., PNAS, 2002
– Zhang et al., RECOMB, 2003
– H. I. Avi-Itzhak, X. Su, F. M. De La Vega, PSB, 2003
– Sebastiani et al., PNAS, 2003
– Patil et al., PNAS, 2002
• Within blocks one can select the SNPs that
maximize entropy or diversity.
– Zhang et al., AJHG, 2003
• Select a minimal number of SNPs with limited
resources.
Block free SNP selection
• For each SNP define a neighborhood of
predictive SNPs.
• Define a measure of informativeness, how
well a set of SNPs predicts a target SNP.
• Maximize informativeness over all SNPs.
LD Graph Theory
The Definition of
Perfect Prediction of
a SNP from a set of SNPs
Combinatorial interpretations of
intermediate values of D’ and r2
Distinguishing SNPs
SNPs distinguishing every pair of haplotypes
[Figure: SNP/haplotype matrix illustrating distinguishing SNPs]
Perfect Distinguishability
G T T C G A C A A C A T
A C G T A T C T A T T A
G T T C G A C T A T T A
A C G C G A C A A T T A
Predictive SNPs
A Set of SNPs Predicts SNP s
[Figure: SNP/haplotype matrix illustrating a set of SNPs predicting a target SNP s]
Perfect Prediction
G T T C G A C A A C A T
A C G T A T C T A T T A
G T T C G A C T A T T A
A C G C G A C A A T T A
The Informativeness Duality
Lemma
Let M be the SNPs/Haps matrix.
S be the set of SNPs (columns).
H be the set of Haplotypes (rows)
T a subset of S.
The following are equivalent:
(1) T perfectly predicts every SNP in S
(2) T perfectly distinguishes every pair of distinct
haplotypes in H
Informativeness
• Each SNP defines a partition on the set of chromosomes
– Infer the value of each SNP in the population.
• Our goal is to infer partitions defined by each one of the
SNPs.
• Inferring the partition of every SNP allows us to infer any
possible haplotype.
SNP #: 1 2 3 4 5
GGGAT
GCTGA
ACGAT
ACGAT
ACTGA
[Figure: partition of the five haplotypes defined by a SNP s]
Informativeness
– For a SNP s and haplotypes I, J,
Ds(I,J) is the event that SNP s has
different alleles for haplotypes I, J
– Define I(s,t) = Pr(Ds(I,J) | Dt(I,J))
– I(s,t) can be estimated from a
population sample
• For each SNP s, define a bipartite
graph on the haplotypes
• Let E(s) denote the edge set
I(s,t) = |E(s) ∩ E(t)| / |E(t)|
I(S,t) = |(∪_{s∈S} E(s)) ∩ E(t)| / |E(t)|
I(S,T) = Σ_{t∈T} I(S,t)
[Figure: example columns t = (0, 0, 1, 1, 1) and s = (0, 1, 1, 1, 1) illustrating I(s,t)]
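The edge-set form of informativeness can be computed directly from allele columns. The sketch below assumes each SNP is given as a list of 0/1 alleles, one per haplotype, and reads the example values on this slide as the columns t = (0, 0, 1, 1, 1) and s = (0, 1, 1, 1, 1). The function names are hypothetical, and returning 0 for a monomorphic t follows the convention stated elsewhere in these slides.

```python
from itertools import combinations

def edge_set(column):
    """E(s): pairs of haplotypes (by index) that carry different alleles at the SNP."""
    return {(i, j) for i, j in combinations(range(len(column)), 2)
            if column[i] != column[j]}

def informativeness(S, t):
    """I(S,t) = |(union of E(s) for s in S) ∩ E(t)| / |E(t)|."""
    e_t = edge_set(t)
    if not e_t:
        return 0.0  # t is monomorphic
    covered = set().union(*(edge_set(s) for s in S))
    return len(covered & e_t) / len(e_t)

t = [0, 0, 1, 1, 1]
s = [0, 1, 1, 1, 1]
print(informativeness([s], t))  # 0.5: 3 of the 6 pairs split by t are also split by s
```

Passing several columns in `S` takes the union of their edge sets, which is how the definition extends from a single SNP to a set of SNPs.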
The Minimum Informative
SNPs problem
• Given a set S of SNPs, compute
arg max_{S’ ⊆ S, |S’| ≤ k} I(S’, S\S’)
• The problem is NP-complete in general
– Reduction from set cover
• Tractable in practice
– When only nearby SNPs are used as
candidates
Bounded Width MIS
• Only neighboring SNPs inform meaningfully
– SNP i can only be used to infer SNP j if there is
little evidence of recombination between i and j
• I(w,S,t) = Informativeness of S w.r.t. t when
restricted to SNPs in S that are within a w/2-neighborhood of t.
I(w,S,T) = Σ_{t∈T} I(w,S,t)
• (k,w)-MIS problem:
– Given a set T, compute the k most informative
SNPs S that maximize I(w,S,T)
• (k,w)-MIS can be computed in time O(nk2^w),
and space O(k2^w)
Correct imputation: Block vs. block free
[Figure: # correct imputations vs. #SNPs typed, Perlegen dataset; series: Block Free, Zhang et al.]
Correlation of informativeness with
imputation in leave one out studies
[Figure: Informativeness and Leave one out vs. #SNPs for the block-free method, Perlegen dataset]
Haplotype blocks
[Figure: union of possible haplotype blocks; block-free selected SNPs; haplotype block tagging SNPs]
Homework
Find the minimum subset of SNPs that needs
to be typed, i.e., from which the rest of the SNPs
can be Predicted.
G T A G
C T A T
G G T T
Answer: Solution 1 = Type
SNPs 1 and 3
From SNPs 1 and 3 we can predict SNP 4
From SNP 3 we can predict SNP 2
G T A G
C T A T
G G T T
Another solution (maybe better for Mercury SNPs : )
Solution 2 = Type SNPs 1 and 2.
Informativeness of a SNP
Informativeness of a SNP s with respect to SNP t
Quantifies the confidence with which we can predict t from s.
Let s be a SNP and i, j be haplotypes.
Let D(s, i, j) be the event that haplotypes i and j have different alleles at s
The informativeness of s w.r.t. t is given by
I(s,t) = Prob [ D(s,i,j) | D(t,i,j) ]
i and j are haplotypes drawn uniformly at random
from the set of all distinct haplotype pairs.
The Min Informative Subset
Problems
Observe that:
I(s,t) = 1 implies perfect prediction
I(s,t) = 0 implies no predictability
The Minimum Perfectly Informative Subset of SNPs Problem
Input: A set of n SNPs S, a subset T of S, and 0<k<=n
Output: Does there exist a subset S’ of S-T such that
I(S’,T) = 1 and size of S’ <= k ?
The k-Most Informative Subset of SNPs Problem
Input: A set of n SNPs S, a subset T of S, and 0<k<=n
Output: Find a subset S’ of S-T with size of S’ <= k such that
I(S’,T) = MAX {I(S”, T)} over all subsets S” of S-T with size of S” <= k
Basic Insight: The Set Cover
Problem
The Minimum Perfectly Informative Subset of
SNPs Problem is NP-complete
The k-Most Informative Subset of SNPs Problem
is NP-complete
Graph Theory – Min Set Cover
[Figure: bipartite graph linking Sets (GIRLS) to elements (BOYS)]
Want: Min number of Sets that cover all elements
Or Min number of GIRLS that know all the BOYS
Our Boys and Girls …
The elements:
For a SNP t, the elements are the pairs of haplotypes
that are distinguished by t.
The sets:
Each SNP s defines a set consisting of all pairs of haplotypes
that are distinguished by both s and t.
The Minimum Set Cover is
Minimum subset of SNPs that
Perfectly Predicts the entire sample.
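Because the problem maps onto set cover, the standard greedy set-cover heuristic gives a quick way to pick tag SNPs. The slides do not describe this heuristic, and it is not guaranteed to return a minimum set. In the sketch below the elements are the haplotype pairs split by at least one SNP, and each SNP covers the pairs it splits. The function names are hypothetical; the haplotypes are the five shown on the earlier Informativeness slide.

```python
from itertools import combinations

def pairs_split_by(haps, s):
    """Haplotype index pairs with different alleles at column s."""
    return {(i, j) for i, j in combinations(range(len(haps)), 2)
            if haps[i][s] != haps[j][s]}

def greedy_tag_snps(haps):
    """Greedy set cover: repeatedly take the SNP that splits the most uncovered pairs."""
    sets = {s: pairs_split_by(haps, s) for s in range(len(haps[0]))}
    uncovered = set().union(*sets.values())
    chosen = []
    while uncovered:
        best = max(sets, key=lambda s: len(sets[s] & uncovered))
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

haps = ["GGGAT", "GCTGA", "ACGAT", "ACGAT", "ACTGA"]
print(greedy_tag_snps(haps))  # 0-based column indices of the chosen SNPs
```

Identical haplotypes (rows 3 and 4 here) are never split by any SNP, so that pair is not an element to cover.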
Algorithms
n number of SNPs
m number of Haplotypes
ALGORITHM 1
When S is a set of SNPs in perfect LD with each other
(i.e., all in a no 4-gamete block) the k-Most Informative
Subset of SNPs can be solved exactly in O(nm) time.
ALGORITHM 2
When the distance in SNPs between the predicting SNP(s) and
the target SNP is at most w, the (k,w)-Most Informative
Subset of SNPs problem can be solved exactly in time
O(nk2^w) and space O(k2^w).
Block free SNP selection