The Questions
• Why study haplotypes?
• How can haplotypes be inferred?
• What are haplotype blocks?
• How can haplotype information be used to test associations with disease phenotypes?
• How shall we select a subset of informative SNPs for large-scale typing?
• How can haplotype information be visualized?

Methods for inferring haplotype blocks and informative SNP selection

Detecting haplotype blocks on Chromosomes 6, 21, 22

Hypothesis – Haplotype Blocks?
• The genome consists largely of blocks of common SNPs with relatively little recombination shuffling in the blocks
  – Patil et al., Science, 2001; Jeffreys et al., Nature Genetics; Daly et al., Nature Genetics, 2001
• Compare block detection methods.
  – How well can we detect haplotype blocks?
  – Are the detection methods consistent?

Block detection methods
• Four gamete test, Hudson and Kaplan, Genetics, 1985, 111, 147-164.
  – A segment of SNPs is a block if between every pair (aA and bB) of SNPs at most 3 of the 4 possible gametes (ab, aB, Ab, AB) are observed.
• P-value test
  – A segment of SNPs is a block if for 95% of the pairs of SNPs we can reject the hypothesis (with P-value 0.05 or 0.001) that they are in linkage equilibrium.
• LD-based, Gabriel et al., Science, 2002, 296:2225-9
  – Next slide

Gabriel et al. method
• For every pair of SNPs we calculate an upper and lower confidence bound on D' (call these D'u, D'l)
• We then split the pairs of SNPs into 3 classes:
  – Class I: Two SNPs are in 'Strong LD' if D'u > .98 and D'l > .7.
  – Class II: Two SNPs show 'Strong evidence for recombination' if D'u < .9.
  – Class III: The remaining SNP pairs, these

Block View

Block comparison

Conclusions
• Clear evidence of "blocky" structure in chromosomes.
• Different block detection methods are highly concordant.
• However, boundaries defined by these methods are not sharp and we believe there is no single "true" block partition.

Block free SNP selection

What does it mean to tag SNPs?
• SNP = Single Nucleotide Polymorphism
  – Caused by a mutation at a single position in the human genome, passed along through heredity
  – Characterizes much of the genetic difference between humans
  – Most SNPs are bi-allelic
  – Estimated several million common SNPs (minor allele frequency > 10%)
• To tag = select a subset of SNPs to work with

Why do we tag SNPs?
• Disease Association Studies
  – Goal: Find genetic factors correlated with disease
  – Look for discrepancies in haplotype structure
  – Statistical Power: Determined by sample size
  – Cost: Determined by overall number of SNPs typed
• This means, to keep cost down, reduce the number of SNPs typed
• Choose a subset of SNPs [tag SNPs] that can predict other SNPs in the region with small probability of error
  – Remove redundant information

What do we know?
• SNPs physically close to one another tend to be inherited together
  – This means that long stretches of the genome (sans mutational events) should be perfectly correlated if not for…
• Recombination breaks apart haplotypes and slowly erodes correlation between neighboring alleles
  – Tends to blur the boundaries of LD blocks
• Since SNPs are bi-allelic, each SNP defines a partition on the population sample.
  – If you are able to reconstruct this partition by using other SNPs, there would be no need to type this SNP
  – For any single SNP, this reconstruction is not difficult…

Complications:
• But finding the global minimum number of tag SNPs necessary is NP-hard
• The predictions made will not be perfect
  – Correlation between neighboring tag SNPs is not as strong as correlation between neighboring (not necessarily tagged) SNPs
• Haplotype information is usually not available for technical reasons
  – Need for phasing

• Tagging SNPs can be partitioned into the following three steps:
  – Determining neighborhoods of LD: which SNPs can infer each other
  – Tagging quality assessment: Defining a quality measure that specifies how well a set of tag SNPs captures the variance observed
  – Optimization: Minimizing the number of tag SNPs

Optimal Haplotype Block-Free Selection of Tagging SNPs for Genome-Wide Association Studies
Halldorsson et al. (2004)

The Definition of Perfect Prediction of a SNP from a set of SNPs

"Predict a SNP" (cont)
Site # or SNP #:  1 2 3 4
Hap1:             A G T A
Hap2:             A C A C
• SNP 1: nothing to predict
• SNP 2: predicts SNP 3; predicts SNP 4
• SNP 3: predicts each of SNPs 2 and 4
• SNP 4: predicts each of SNPs 2 and 3

A graphical notation
AGTA
ACAC
"The blue box predicts the green SNP"

Three SNPs Predicting Each Other
• Only one of the three needs to be typed
G T A
C A C
• Any one will do

A Pair of SNPs Predicting Another SNP
• SNPs 1 and 3 together predict SNP 4
SNP #:  1 2 3 4
        G T A G
        C T A T
        G G T T
• No single SNP (other than SNP 4) can predict SNP 4

Finding Neighborhoods:
• Goal is to select SNPs in the sample that characterize regions of common recent ancestry that will contain conserved
haplotypes
• Recent common ancestry means that there has been little time for recombination to break apart haplotypes
• Constructing fixed-size neighborhoods in which to look for SNPs is not desirable because of the variability of recombination rates and historical LD across the genome
• In fact, the size of informative neighborhoods is highly variable precisely because of variable recombination rates and SNP density
• The authors avoid block-building by recursively creating neighborhoods with the help of an 'informativeness' measure

Defining Informativeness:
• A measure of tagging quality assessment
• Assume all SNPs are bi-allelic
• Notation: I(s,t) = informativeness of a SNP s with respect to a SNP t
  – i, j are two haplotypes drawn at random from the uniform distribution on the set of distinct haplotype pairs.
  – Note: I(s,t) = 1 implies complete predictability; I(s,t) = 0 when t is monomorphic in the population.
• I(s,t) is easily estimated through the use of the bipartite clique that defines each SNP
  – We can write I(s,t) in terms of an edge set
• The definition of I is easily extended to a set of SNPs S by taking the union of edge sets
• Assumes the availability of haplotype phases
• The new measure avoids some of the difficulties traditional LD measures have experienced when applied to tagging SNP selection
  – The concept of pairwise LD fails to reliably capture the higher-order dependencies implied by haplotype structure

Bounded-Width Algorithm: k Most Informative SNPs (k-MIS)
• Input: A set of n SNPs S
• Output: A subset of SNPs S' of size at most k such that I(S',S) is maximal
• In its most general form, k-MIS is NP-hard by reduction of the set cover problem to MIS
• The algorithm optimizes informativeness, although it is easily adapted for other measures
• Define the distance between two SNPs as the number of SNPs in between them
• k-MIS can be solved as long as the distance between adjacent tag SNPs is not too large
• Define
  – Assignment As[i]
  – S(As)
  – Recursion function Iw(s, l, S(A)) = score of the most informative subset of
l SNPs chosen from SNPs 1 through s such that As describes the assignment for SNP s.
• Pseudocode
• Complexity: O(nk2^w) in time and O(k2^w) in space, assuming maximal window w

Evaluation
• Algorithm evaluated by leave-one-out cross-validation
  – Accumulated accuracy over all haplotypes gives a global measure of the accuracy for the given data set.
• SNPs not typed were predicted by a majority vote among all haplotypes in the training set that were identical, at the typed SNPs, to the one being inferred
  – If no such haplotypes existed, the majority vote is taken among all training haplotypes that have the same allele call on all but one of the typed SNPs
  – etc.
• When compared to the block-based method of Zhang:
  – Presumably, the advantage is due to the cost imposed by artificially restricting the range of influence of the few SNPs chosen by block boundaries
• 'Informativeness' was shown to be a "good" measure
  – Aligned well with the leave-one-out cross-validation results
  – Extremely close to the results of optimizing for haplotype r^2

Premise: Informative SNP selection
• Select SNPs to use in an association study
  – Would like to associate single nucleotide polymorphisms (SNPs) with disease.
• Very large number of SNPs
  – Chromosome-wide studies, whole-genome scans.
  – For cost effectiveness, select only a subset.
• Closely spaced SNPs are highly correlated
  – It is less likely that there has been a recombination between two SNPs if they are close to each other.

SNP selection within blocks
• Zhang et al., PNAS, 2002.
• Partition chromosome into haplotype blocks.
  – Zhang et al., RECOMB, 2003
  – H. I. Avi-Itzhak, X. Su, F. M. De La Vega, PSB, 2003
  – Sebastiani et al., PNAS, 2003
  – Patil et al., PNAS, 2002
• Within blocks one can select the SNPs that maximize entropy or diversity.
• Zhang et al., AJHG, 2003.
• Select a minimal number of SNPs with limited resources.

Block free SNP selection
• For each SNP define a neighborhood of predictive SNPs.
• Define a measure of informativeness: how well a set of SNPs predicts a target SNP.
• Maximize informativeness over all SNPs.

LD
Graph Theory
The Definition of Perfect Prediction of a SNP from a set of SNPs
Combinatorial interpretations of intermediate values of D' and r^2

[Figure: Distinguishing SNPs (haplotype matrix; SNPs distinguishing pairs of haplotypes)]
[Figure: Perfect Distinguishability]
[Figure: Predictive SNPs (a set of SNPs predicts SNP s)]
[Figure: Perfect Prediction]

The Informativeness Duality Lemma
Let M be the SNPs/haplotypes matrix, S the set of SNPs (columns), H the set of haplotypes (rows), and T a subset of S.
The following are equivalent:
(1) T perfectly predicts every SNP in S
(2) T perfectly distinguishes every pair of distinct haplotypes in H

Informativeness
• Each SNP defines a partition on the set of chromosomes
  – Infer the value of each SNP in the population.
• Our goal is to infer the partitions defined by each one of the SNPs.
• Inferring the partition of every SNP allows us to infer any possible haplotype.
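The duality lemma can be checked mechanically on a small sample. The sketch below is illustrative only: it assumes that "T perfectly predicts s" means any two haplotypes that agree on every SNP in T also agree at s, and it uses the five example haplotypes that appear in this deck (GGGAT, GCTGA, ACGAT, ACGAT, ACTGA). The function names are hypothetical, and SNP indices are 0-based.

```python
from itertools import combinations

# Five example haplotypes over five SNPs (rows = haplotypes, columns = SNPs).
HAPS = ["GGGAT", "GCTGA", "ACGAT", "ACGAT", "ACTGA"]

def project(hap, snps):
    """Alleles of one haplotype restricted to the SNP indices in `snps`."""
    return tuple(hap[i] for i in snps)

def predicts(haps, tags, target):
    """True if haplotypes that agree on all tag SNPs also agree at `target`."""
    seen = {}
    for h in haps:
        key = project(h, tags)
        if seen.setdefault(key, h[target]) != h[target]:
            return False
    return True

def predicts_all(haps, tags):
    """Condition (1): the tag set perfectly predicts every SNP."""
    return all(predicts(haps, tags, t) for t in range(len(haps[0])))

def distinguishes_all(haps, tags):
    """Condition (2): distinct haplotypes differ at some tag SNP."""
    distinct = set(haps)
    return len({project(h, tags) for h in distinct}) == len(distinct)

# The two conditions agree for every subset of SNPs in this sample.
n = len(HAPS[0])
for size in range(n + 1):
    for tags in combinations(range(n), size):
        assert predicts_all(HAPS, tags) == distinguishes_all(HAPS, tags)

print(predicts_all(HAPS, (0, 2)))  # SNPs 1 and 3 as tags
print(predicts_all(HAPS, (0,)))    # SNP 1 alone
```

In this sample, SNPs 1 and 3 together separate all four distinct haplotypes, so they predict every other SNP, while SNP 1 alone does not.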
SNP #:  1 2 3 4 5
        G G G A T
        G C T G A
        A C G A T
        A C G A T
        A C T G A
[Figure: bipartite graph on the haplotypes defined by a SNP s, alleles coded 0/1]

Informativeness
– For a SNP s and haplotypes I, J, $D_s(I,J)$ is the event that SNP s has different alleles for haplotypes I and J
– Define $I(s,t) = \Pr(D_s(I,J) \mid D_t(I,J))$
– I(s,t) can be estimated from a population sample
• For each SNP s, define a bipartite graph on the haplotypes
• Let E(s) denote the edge set
$I(s,t) = |E(s) \cap E(t)| / |E(t)|$
$I(S,t) = \left|\left(\bigcup_{s \in S} E(s)\right) \cap E(t)\right| / |E(t)|$
$I(S,T) = \sum_{t \in T} I(S,t)$
Example SNP pair for I(s,t):
t:  0 0 1 1 1
s:  0 1 1 1 1

The Minimum Informative SNPs problem
• Given a set S of SNPs, compute $\arg\max_{S' \subseteq S,\ |S'| \le k} I(S', S \setminus S')$
• The problem is NP-complete in general
  – Reduction from set cover
• Tractable in practice
  – When only nearby SNPs are used as candidates

Bounded Width MIS
• Only neighboring SNPs inform meaningfully
  – SNP i can only be used to infer SNP j if there is little evidence of recombination between i and j
• I(w,S,t) = informativeness of S w.r.t. t when restricted to SNPs in S that are within a w/2 neighborhood of t.
$I(w,S,T) = \sum_{t \in T} I(w,S,t)$
• (k,w)-MIS problem:
  – Given a set T, compute the k most informative SNPs S, i.e., those that maximize I(w,S,T)
• (k,w)-MIS can be computed in time O(nk2^w) and space O(k2^w)

[Figure: Block vs. block free. Correct imputation (# correct imputations) vs. #SNPs typed; series: Block Free, Zhang et al.; Perlegen dataset]
[Figure: Correlation of informativeness with imputation in leave-one-out studies. Series: Informativeness, Leave one out (block free); x-axis: #SNPs; Perlegen dataset]
[Figure: Haplotype blocks. Union of possible haplotype blocks; block free: SNPs selected; haplotype block tagging SNPs]

Homework
Find the minimum subset of SNPs that needs to be typed, i.e., from which the rest of the SNPs can be predicted.
SNP #:  1 2 3 4
        G T A G
        C T A T
        G G T T

Answer:
Solution 1 = Type SNPs 1 and 3
From SNPs 1 and 3 we can predict SNP 4
From SNP 3 we can predict SNP 2
Another solution (maybe better for Mercury SNPs :)
Solution 2 = Type SNPs 1 and 2.

Informativeness of a SNP
Informativeness of a SNP s with respect to a SNP t
Quantifies the confidence with which we can predict t from s.
Let s be a SNP and i, j be haplotypes.
Let D(s, i, j) be the event that haplotypes i and j have different alleles at s.
The informativeness of s w.r.t. t is given by
I(s,t) = Prob [ D(s,i,j) | D(t,i,j) ]
i and j are haplotypes drawn uniformly at random from the set of all distinct haplotype pairs.
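The conditional probability above can be estimated by counting haplotype pairs. A minimal sketch, assuming haplotypes are given as equal-length strings and that "distinct haplotype pairs" means unordered pairs of different sample positions; `edge_set` and `informativeness` are illustrative names, SNP indices are 0-based, and the data are the five example haplotypes used earlier in the deck.

```python
from itertools import combinations

# Five example haplotypes over five SNPs (rows = haplotypes, columns = SNPs).
HAPS = ["GGGAT", "GCTGA", "ACGAT", "ACGAT", "ACTGA"]

def edge_set(haps, snp):
    """Pairs (i, j), i < j, of haplotypes carrying different alleles at `snp`."""
    return {(i, j) for i, j in combinations(range(len(haps)), 2)
            if haps[i][snp] != haps[j][snp]}

def informativeness(haps, s, t):
    """Estimate I(s,t) = Pr[D(s,i,j) | D(t,i,j)] as |E(s) & E(t)| / |E(t)|."""
    e_t = edge_set(haps, t)
    if not e_t:  # t is monomorphic in the sample: I(s,t) = 0 by convention
        return 0.0
    return len(edge_set(haps, s) & e_t) / len(e_t)

print(informativeness(HAPS, 1, 0))  # SNP 2 with respect to SNP 1
print(informativeness(HAPS, 0, 1))  # SNP 1 with respect to SNP 2
print(informativeness(HAPS, 2, 3))  # SNP 3 with respect to SNP 4
```

In this sample SNP 1 distinguishes 6 haplotype pairs and SNP 2 also distinguishes 3 of them, so I(SNP 2, SNP 1) = 0.5, while I(SNP 1, SNP 2) = 3/4: the measure is not symmetric. SNPs 3 and 4 induce the same partition, so each is fully informative about the other.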
The Min Informative Subset Problems
Observe that:
I(s,t) = 1 implies perfect prediction
I(s,t) = 0 implies no predictability

The Minimum Perfectly Informative Subset of SNPs Problem
Input: A set of n SNPs S, a subset T of S, and 0 < k <= n
Output: Does there exist a subset S' of S-T such that I(S',T) = 1 and size of S' <= k?

The k-Most Informative Subset of SNPs Problem
Input: A set of n SNPs S, a subset T of S, and 0 < k <= n
Output: Find a subset S' of S-T with size of S' <= k such that I(S',T) = MAX {I(S'',T)}, where the maximum is over subsets S'' of S-T with size of S'' <= k.

Basic Insight: The Set Cover Problem
The Minimum Perfectly Informative Subset of SNPs Problem is NP-complete
The k-Most Informative Subset of SNPs Problem is NP-complete

Graph Theory – Min Set Cover
[Figure: bipartite graph of sets and set elements]
Want: Min number of sets that cover all elements
Or: Min number of GIRLS that know all the BOYS
[Figure: bipartite graph of BOYS and GIRLS]

Our Boys and Girls …
The elements: For a SNP t, the elements are the pairs of haplotypes that are distinguished by t.
The sets: Each SNP s defines a set consisting of all pairs of haplotypes that are distinguished by both s and t.
The Minimum Set Cover is the minimum subset of SNPs that perfectly predicts the entire sample.

Algorithms
n = number of SNPs
m = number of haplotypes
ALGORITHM 1
When S is a set of SNPs in perfect LD with each other (i.e., all in a no-4-gamete block), the k-Most Informative Subset of SNPs problem can be solved exactly in O(nm) time.
ALGORITHM 2
When the distance in SNPs between the predicting SNP(s) and the target SNP is at most w, the (k,w)-Most Informative Subset of SNPs problem can be solved exactly in time O(nk2^w) and space O(k2^w).

Block free SNP selection
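For very small instances, the k-Most Informative Subset problem stated above can be solved by exhaustive search. The sketch below is not the bounded-width dynamic program of Algorithm 2; it simply tries every candidate subset of size at most k, so its running time is exponential in k. It assumes the homework haplotypes read row-wise as GTAG, CTAT, GGTT, scores a tag set with the edge-set form of I(S',T), and uses hypothetical function names with 0-based SNP indices.

```python
from itertools import combinations

# Assumed row-wise reading of the homework example (3 haplotypes, 4 SNPs).
HAPS = ["GTAG", "CTAT", "GGTT"]

def edge_set(haps, snp):
    """Pairs (i, j), i < j, of haplotypes carrying different alleles at `snp`."""
    return {(i, j) for i, j in combinations(range(len(haps)), 2)
            if haps[i][snp] != haps[j][snp]}

def set_informativeness(haps, tags, targets):
    """I(S',T): sum over t in T of |(union of E(s), s in S') & E(t)| / |E(t)|."""
    covered = set()
    for s in tags:
        covered |= edge_set(haps, s)
    total = 0.0
    for t in targets:
        e_t = edge_set(haps, t)
        if e_t:  # monomorphic targets contribute 0
            total += len(covered & e_t) / len(e_t)
    return total

def k_most_informative(haps, candidates, targets, k):
    """Exhaustive search over candidate subsets of size 1..k (first best wins)."""
    best_tags, best_score = (), 0.0
    for size in range(1, k + 1):
        for tags in combinations(candidates, size):
            score = set_informativeness(haps, tags, targets)
            if score > best_score:
                best_tags, best_score = tags, score
    return best_tags, best_score

# Target T = {SNP 4}; candidates S - T = {SNP 1, SNP 2, SNP 3}.
print(k_most_informative(HAPS, candidates=[0, 1, 2], targets=[3], k=1))
print(k_most_informative(HAPS, candidates=[0, 1, 2], targets=[3], k=2))
```

With k = 1 no single SNP reaches a score of 1 for SNP 4, and with k = 2 the search returns SNPs 1 and 2, which matches Solution 2 of the homework (SNPs 1 and 3 score equally well).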