A Data Compression Problem The Minimum Informative Subset

A Data Compression Problem
The Minimum Informative Subset
Informativeness-based Tagging
SNPs Algorithm
Outline:

Brief background to SNP selection

A block-free tag SNP selection algorithm that
maximizes ‘informativeness’

Halldorsson et al 2004
What does it mean to tag SNPs?

SNP = Single Nucleotide Polymorphism





Caused by a mutation at a single position in human
genome, passed along through heredity
Characterizes much of the genetic differences
between humans
Most SNPs are bi-allelic
Estimated several million common SNPs (minor allele
frequency >10%)
To tag = select a subset of SNPs to work with
Why do we tag SNPs?

Disease Association Studies




Goal: Find genetic factors correlated with disease
Look for discrepancies in haplotype structure
Statistical Power: Determined by sample size
Cost: Determined by overall number of SNPs typed


This means, to keep cost down, reduce the number of SNPs
typed
Choose a subset of SNPs, [tag SNPs] that can
predict other SNPs in the region with small
probability of error

Remove redundant information
What do we know?

SNPs physically close to one another tend to be
inherited together

This means that long stretches of the genome (sans mutational
events) should be perfectly correlated if not for…

Recombination breaks apart haplotypes and slowly
erodes correlation between neighboring alleles

Tends to blur the boundaries of LD blocks
Since SNPs are bi-allelic, each SNP defines a partition
on the population sample.


If you are able to reconstruct this partition by using other SNPs,
there would be no need to type this SNP
For any single SNP, this reconstruction is not difficult…
Complications:


But the Global solution to the minimum
number of tag SNPs necessary is NP-hard
The predictions made will not be perfect


Correlation between neighboring tag SNPs not as
strong as correlation between neighboring (not
necessarily tagged) SNPs
Haplotype information is usually not available
for technical reasons

Need for Phasing

Tagging SNPs can be partitioned into the
following three steps:



Determining neighborhoods of LD: which SNPs
can infer each other
Tagging quality assessment: Defining a quality
measure that specifies how well a set of tag SNPs
captures the variance observed
Optimization: Minimizing the number of tag SNPs
Haplotype-based tagging SNPs: htSNPs

Block-Based:


Define blocks as a set of SNPs that are in strong LD
with each other, but not with neighboring blocks
Requires inference on exact location of haplotype
blocks



Recombination between the blocks but not within the blocks
Within each block, choose a subset of SNPs
sufficiently rich to be able to reconstruct diversity of
the block
Many algorithms exist for creating blocks… few select
the same boundaries!
How do we create Haplotype Blocks?
1. Recombination-based block building algorithm:
  1. Infinite sites assumption [each site mutates at most once]
  2. Assume no recombination within a block
  3. Implies each block should follow the four-gamete condition for any pair
     of sites (See Hudson and Kaplan)
2. Diversity-based test: A region is a block if at least 80% of the
   sequences occur in more than one chromosome.
  1. Test does not scale well to large sample sizes. (See Patil et al (2001))
  2. To generalize this notion, one could look for sequences within a region
     accounting for 80% of the sampled population that each occur in at least
     10% of the sample.
3. LD-based test:
  1. D’ value of every pair of SNPs within the block shows significant LD
     given the individual SNP frequencies with a P-value of 0.001
4. Two SNPs are considered to have a useful level of correlation if they
   occur in the same haplotype block [i.e. they are physically close with
   little evidence of recombination]. The set of SNPs that can be used
   to predict SNP s can be found by taking the union of all putative
   haplotype blocks that contain SNP s.
  1. It is possible that many overlapping block decompositions will meet the
     rules defined by a rule-based algorithm for finding haplotype blocks
Methods for inferring haplotype blocks
Hypothesis – Haplotype Blocks?

The genome consists largely of blocks of
common SNPs with relatively little
recombination shuffling in the blocks


Patil et al., Science, 2001; Jeffreys et al., Nature Genetics; Daly
et al., Nature Genetics, 2001
Compare block detection methods.


How well can we detect haplotype blocks?
Are the detection methods consistent?
Block detection methods

Four gamete test, Hudson and Kaplan, Genetics,
1985, 111, 147-164.

A segment of SNPs is a block if between every pair (aA and
bB) of SNPs at most 3 gametes (ab, aB, Ab, AB) are
observed.

P-Value test

A segment of SNPs is a block if for 95% of the pairs
of SNPs we can reject the hypothesis (with P-value
0.05 or 0.001) that they are in linkage equilibrium.

LD-based, Gabriel et al. Science, 2002, 296:2225-9

Next slide
Gabriel et al. method

For every pair of SNPs we calculate an
upper and lower confidence bound on D’
(Call these D’u, D’l)
We then split the pairs of SNPs into 3
classes:

Class I: Two SNPs are in ‘Strong LD’ if D’u
> .98 and D’l > .7.
Class II: Two SNPs show ‘Strong evidence for
recombination’ if D’u < .9.
Class III: The remaining SNP pairs; these are
“uninformative”.

A contiguous set of SNPs is a block if
(Class II)/(Class I + Class II) < 5%.
Special rules to determine if 2, 3 or 4
SNPs are a block.
Furthermore there are distance
requirements on the chromosome to
determine if the SNPs are a block.
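
The pair classification and the 5% rule can be sketched in a few lines. This is an illustrative sketch only: the function names and the input layout (a dict of confidence bounds per SNP pair) are assumptions, computing the confidence bounds on D’ is out of scope, and the special rules for 2 to 4 SNPs and the distance requirements are not modelled.

```python
def classify_pair(d_lower, d_upper):
    """Classify one SNP pair from the confidence bounds (D'l, D'u) on D'."""
    if d_upper > 0.98 and d_lower > 0.7:
        return "I"    # strong LD
    if d_upper < 0.9:
        return "II"   # strong evidence for recombination
    return "III"      # uninformative


def is_block(bounds, max_recomb_fraction=0.05):
    """bounds maps each SNP pair (i, j) in a contiguous segment to (D'l, D'u).

    Applies only the ratio rule (Class II) / (Class I + Class II) < 5%.
    """
    classes = [classify_pair(lo, hi) for lo, hi in bounds.values()]
    strong = classes.count("I")
    recomb = classes.count("II")
    if strong + recomb == 0:
        # Assumption: a segment with no informative pairs is not called a block.
        return False
    return recomb / (strong + recomb) < max_recomb_fraction
```

Uninformative (Class III) pairs do not enter the ratio, so a segment is judged only on the pairs for which the data say something definite.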
One definition of block


Based on the Four Gamete test.
Intuition: when all four gametes are
observed between two SNPs, there is a
recombination point somewhere
in between the two sites
Four Gamete Block Test

Hudson and Kaplan 1985
A segment of SNPs is a block if between every pair of SNPs at
most 3 out of the 4 gametes (00, 01,10,11) are observed.
[Figure: two example 0/1 haplotype matrices, one labelled BLOCK and one labelled VIOLATES THE BLOCK DEFINITION]
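
A minimal sketch of the four-gamete block test, assuming haplotypes are given as equal-length strings of 0/1 alleles. The function names and the tiny example inputs are illustrative and are not the matrices from the slide.

```python
from itertools import combinations


def gametes(haplotypes, i, j):
    """Distinct allele pairs observed at sites i and j across all haplotypes."""
    return {(h[i], h[j]) for h in haplotypes}


def is_four_gamete_block(haplotypes):
    """True if every pair of sites shows at most 3 of the 4 possible gametes."""
    n_sites = len(haplotypes[0])
    return all(
        len(gametes(haplotypes, i, j)) <= 3
        for i, j in combinations(range(n_sites), 2)
    )


# Hypothetical examples: the first satisfies the block definition,
# the second shows all four gametes 00, 01, 10, 11 at its two sites.
print(is_four_gamete_block(["000", "011", "111"]))
print(is_four_gamete_block(["00", "01", "10", "11"]))
```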
Finding Recombination Hotspots:
Many Possible Partitions into Blocks
[Figure: aligned haplotype sequences with constraints drawn between pairs of sites]
All four gametes are present:
[Figure: two columns of the alignment in which all four gametes occur]
Find the left-most right endpoint of any constraint and mark the site before it a recombination site.
Eliminate any constraints crossing that site.
Repeat until all constraints are gone.
The final result is a minimum-size set
of sites crossing all constraints.
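
The greedy procedure on this slide can be sketched as follows. The sketch assumes each constraint is a pair of site indices (i, j), i < j, at which all four gametes are present, and that a recombination site is reported as a boundary b lying between site b and site b + 1. Both function names are illustrative.

```python
from itertools import combinations


def four_gamete_constraints(haplotypes):
    """Pairs of sites (i, j) at which all four gametes are observed."""
    n_sites = len(haplotypes[0])
    return [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if len({(h[i], h[j]) for h in haplotypes}) == 4
    ]


def min_recombination_sites(constraints):
    """Greedy choice of boundaries so that every constraint is crossed.

    Boundary b lies between site b and site b + 1, so it crosses the
    constraint (i, j) exactly when i <= b < j.
    """
    breaks = []
    remaining = sorted(constraints, key=lambda c: c[1])
    while remaining:
        # Left-most right endpoint; mark the boundary just before it.
        b = remaining[0][1] - 1
        breaks.append(b)
        # Eliminate every constraint crossing that boundary.
        remaining = [(i, j) for i, j in remaining if not (i <= b < j)]
    return breaks
```

Each marked boundary is forced by a constraint that no earlier boundary crosses, and those forcing constraints are pairwise disjoint, which is why the greedy choice yields a minimum-size set.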
Tagging SNPs
ACGATCGATCATGAT
GGTGATTGCATCGAT
ACGATCGGGCTTCCG
ACGATCGGCATCCCG
GGTGATTATCATGAT
An example of a real data set
and its haplotype block
structure. Colors refer to the
founding population, one
color for each founding
haplotype
Only 4 SNPs are needed to tag
all the different haplotypes
A------A---TG--
G------G---CG--
A------G---TC--
A------G---CC--
G------A---TG--
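
The tagging claim can be checked directly on the five haplotypes shown on the slide. The tag positions below (sites 1, 8, 12 and 13, written 0-based) are read off the dashed pattern; the helper name is illustrative.

```python
haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tag_positions = [0, 7, 11, 12]  # 0-based indices of sites 1, 8, 12, 13


def tag_signature(hap, positions):
    """Alleles of one haplotype at the tag SNP positions only."""
    return "".join(hap[p] for p in positions)


signatures = [tag_signature(h, tag_positions) for h in haplotypes]
# The tags identify every haplotype if no two signatures coincide.
distinguishes_all = len(set(signatures)) == len(haplotypes)
print(signatures, distinguishes_all)
```

Because the five signatures are pairwise different, typing only these four sites is enough to tell which of the five haplotypes a chromosome carries.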
Optimal Haplotype Block-Free
Selection of Tagging SNPs for
Genome-Wide Association Studies
Halldorsson, Bafna, Lippert, Schwartz,
Clark, Istrail (2004)

Tagging SNPs can be partitioned into the
following three steps:



Determining neighborhoods of LD: which SNPs
can infer each other
Tagging quality assessment: Defining a quality
measure that specifies how well a set of tag SNPs
captures the variance observed
Optimization: Minimizing the number of tag SNPs
Finding Neighborhoods:





Goal is to select SNPs in the sample that characterize
regions of common recent ancestry that will contain
conserved haplotypes
Recent common ancestry means that there has been
little time for recombination to break apart haplotypes
Constructing fixed size neighborhoods in which to look
for SNPs is not desirable because of the variability of
recombination rates and historical LD across the
genome
In fact, the size of informative neighborhoods is highly
variable precisely because of variable recombination
rates and SNP density
Authors avoid block-building by recursively creating
neighborhoods with the help of the ‘informativeness’ measure
Defining Informativeness:




A measure of tagging quality assessment
Assume all SNPs are bi-allelic
Notation:
I(s,t) = Informativeness of a SNP s with respect to a SNP t

i, j are two haplotypes drawn at random from the uniform distribution on the set of
distinct haplotype pairs.
Note: I(s,t) = 1 implies complete predictability, I(s,t) = 0 when t is monomorphic in
the population.

I(s,t) is easily estimated through the use of the bipartite clique that defines each
SNP

We can write I(s,t) in terms of an edge set
Definition of I easily extended to a set of SNPs S by taking the union of
edge sets
Assumes the availability of haplotype phases

New measure avoids some of the difficulties traditional LD measures have
experienced when applied to tagging SNP selection

The concept of pairwise LD fails to reliably capture the higher-order dependencies
implied by haplotype structure
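
The formula itself did not survive on this slide. A reconstruction consistent with the notes above and with the worked values later in the deck (for example I(s1,s2) = 2/4) is given below. It assumes that s^i denotes the allele of SNP s on haplotype i and that E(s) denotes the edge set of s, i.e. the set of haplotype pairs {i, j} that s distinguishes; these symbols are assumptions, not taken from the slide.

```latex
I(s,t) \;=\; \Pr\!\left[\, s^{i} \neq s^{j} \;\middle|\; t^{i} \neq t^{j} \,\right]
       \;=\; \frac{\lvert E(s) \cap E(t) \rvert}{\lvert E(t) \rvert},
\qquad
I(S,t) \;=\; \frac{\Bigl\lvert \bigl(\bigcup_{s \in S} E(s)\bigr) \cap E(t) \Bigr\rvert}{\lvert E(t) \rvert}
```

When t is monomorphic, E(t) is empty and the ratio is undefined; the note on the slide sets I(s,t) = 0 in that case.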
Bounded-Width Algorithm: k Most
Informative SNPs (k-MIS)






Input: A set of n SNPs S
Output: a subset S’ of k SNPs from S such that I(S’,S) is
maximal
In its most general form, k-MIS is NP-hard by
reduction of the set cover problem to MIS
Algorithm optimizes informativeness, although
easily adapted for other measures
Define distance between two SNPs as the
number of SNPs in between them
k-MIS can be solved efficiently as long as the distance
between adjacent tag SNPs is not too large

Define





Assignment As[i]
S(As)
Recursion function Iw(s,l, S(A)) = score of the
most informative subset of l SNPs chosen from
SNPs 1 through s such that As describes the
assignment for SNP s.
Pseudocode
Complexity: O(nk2^w) in time and O(k2^w) in
space, assuming maximal window w
Evaluation

Algorithm evaluated by Leave-One-Out Cross-Validation


SNPs not typed were predicted by a majority vote among all
haplotypes in the training set that were identical to the one being
inferred



If no such haplotypes existed, the majority vote is taken among all
training haplotypes that have the same allele call on all but one of the
typed SNPs
etc.
When compared to block-based method of Zhang:


accumulated accuracy over all haplotypes gives a global measure of the
accuracy for the given data set.
Presumably, the advantage is due to the cost imposed by artificially
restricting the range of influence of the few SNPs chosen by block
boundaries
‘Informativeness’ was shown to be a “good” measure


aligned well with the leave-one-out cross validation results
extremely close to the results of optimizing for haplotype r²
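
The imputation rule used in the leave-one-out evaluation can be sketched as below. Voting among the training haplotypes with the fewest mismatches on the typed SNPs is equivalent to "identical first, then all but one, etc." The function names, the string encoding of haplotypes and the tie-breaking rule (smallest allele wins) are assumptions made for illustration.

```python
from collections import Counter


def impute(target, training, typed):
    """Predict the untyped alleles of `target` by majority vote.

    Only the positions listed in `typed` are read from `target`.
    """
    def mismatches(h):
        return sum(h[p] != target[p] for p in typed)

    best = min(mismatches(h) for h in training)
    voters = [h for h in training if mismatches(h) == best]
    predicted = list(target)
    for p in range(len(target)):
        if p not in typed:
            counts = Counter(h[p] for h in voters)
            # Assumption: ties are broken in favour of the smallest allele.
            predicted[p] = max(sorted(counts), key=counts.get)
    return "".join(predicted)


def leave_one_out_accuracy(haplotypes, typed):
    """Fraction of untyped alleles predicted correctly, accumulated over all haplotypes."""
    correct = total = 0
    for k, h in enumerate(haplotypes):
        training = haplotypes[:k] + haplotypes[k + 1:]
        guess = impute(h, training, typed)
        for p in range(len(h)):
            if p not in typed:
                correct += guess[p] == h[p]
                total += 1
    return correct / total
```

For instance, `leave_one_out_accuracy(["0011", "0011", "1100", "1100"], typed=[0])` types only the first SNP and scores how well the other three are recovered.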
A Data Compression Problem

Select SNPs to use in an association study

Would like to associate single nucleotide polymorphisms (SNPs)
with disease.

Very large number of candidate SNPs

Chromosome wide studies, whole genome-scans
For cost effectiveness, select only a subset.

Closely spaced SNPs are highly correlated

It is less likely that there has been a recombination between two
SNPs if they are close to each other.
Association studies
[Figure: Disease / Responder samples versus Control / Non-responder samples, typed at Marker A (Allele 0, Allele 1)]
Marker A is associated with Phenotype
Association studies

Evaluate whether nucleotide
polymorphisms associate with
phenotype
[Figure: example nucleotide sequences at polymorphic sites]
SNP-Selection Axiom:
Hypothesis-free associations

Due to the many unknowns regarding the nature of
common or complex disease, we should aim at SNP
selection that confers maximal resolution power, i.e.,
genome-wide SNP scans with the hope of performing
hypothesis-free disease association studies, as
opposed to hypothesis-driven candidate gene or region
studies.
A New Measure
Informativeness
SNP-Selection Axiom:
Multi-allelic measure

The tagging quality of the selected SNPs should
be described by a multi-allelic measure; sets of
SNPs have combined information about
predicting other SNPs
SNP-Selection Axioms:
LD consistency and Block-freeness
The highly concordant results of the block detection
methods make the interior of LD blocks adequate for
sparse SNP selection. However, block boundaries defined
by these methods are not sharp, with no single “true”
block partition. SNP selection should avoid dependence on
particular definitions of “haplotype block.”
A New SNP Selection Measure:
Informativeness
It satisfies the following six Axioms:
1. Multi-allelic measure
2. LD consistency: compares well with measures of LD
3. Block-freeness: independence from any particular block definition
4. Hypothesis-free associations: optimization achieves maximum
   haplotype resolution
5. Algorithmically sound: practical for genome-wide computations
6. Statistically sound: passes overfitting and imputation tests
Informativeness
     s1  s2  s3  s4  s5
h1    0   0   1   1   0
h2    0   0   1   0   1
h3    0   1   0   0   0
h4    1   1   0   1   1
I(s1,s2) = 2/4 = 1/2
I({s1,s2}, s4) = 3/4
I({s3,s4},{s1,s2,s5}) = 3
S={s3,s4} is a Minimal Informative Subset
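
The three values above can be reproduced with a short edge-set computation. The sketch reads the slide's 0/1 entries row by row as haplotypes h1 to h4 over SNPs s1 to s5, which is an assumption about the layout, and takes informativeness to be the fraction of the target's edge set covered by the chosen SNPs. Function names are illustrative.

```python
from itertools import combinations

# Rows are haplotypes h1..h4, columns are SNPs s1..s5 (0-based below).
H = ["00110", "00101", "01000", "11011"]


def edge_set(s):
    """Pairs of haplotypes that SNP s distinguishes."""
    return {
        (i, j)
        for i, j in combinations(range(len(H)), 2)
        if H[i][s] != H[j][s]
    }


def informativeness(S, t):
    """Fraction of the edges of target SNP t covered by the SNPs in S."""
    covered = set().union(*(edge_set(s) for s in S))
    target = edge_set(t)
    return len(covered & target) / len(target) if target else 0.0


print(informativeness([0], 1))                               # I(s1, s2) -> 0.5
print(informativeness([0, 1], 3))                            # I({s1,s2}, s4) -> 0.75
print(sum(informativeness([2, 3], t) for t in (0, 1, 4)))    # -> 3.0
```

The last line sums to 3 because s3 and s4 together distinguish all six haplotype pairs, so each of s1, s2 and s5 is predicted with informativeness 1.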
Informativeness
Graph theory insight
Minimum Set Cover = Minimum Informative Subset
[Figure: bipartite graph joining the SNPs s1 to s5 to the edges e1 to e6, shown beside the haplotype matrix]
Minimum Set Cover {s3, s4}
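
The set-cover view can be made concrete with an exhaustive search on the deck's small example. The sketch assumes the 4 x 5 haplotype matrix from the informativeness slides (rows h1 to h4, columns s1 to s5, 0-based below) and treats each SNP as the set of haplotype pairs it distinguishes. Exhaustive search is exponential and is used here only because the example is tiny; it is not the algorithm from the talk.

```python
from itertools import combinations

H = ["00110", "00101", "01000", "11011"]
N_SNPS = len(H[0])


def edge_set(s):
    """Haplotype pairs (edges) distinguished, i.e. covered, by SNP s."""
    return {
        (i, j)
        for i, j in combinations(range(len(H)), 2)
        if H[i][s] != H[j][s]
    }


# Every edge that at least one SNP can cover.
ALL_EDGES = set().union(*(edge_set(s) for s in range(N_SNPS)))


def minimum_informative_subsets():
    """All smallest SNP sets whose edge sets together cover ALL_EDGES."""
    for size in range(1, N_SNPS + 1):
        covers = [
            set(c)
            for c in combinations(range(N_SNPS), size)
            if set().union(*(edge_set(s) for s in c)) == ALL_EDGES
        ]
        if covers:
            return covers
    return []


print(minimum_informative_subsets())
```

Under this reading of the matrix, no single SNP covers all six edges, and {s3, s4} (indices 2 and 3) is one of several covers of size two.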
Connecting Informativeness with
Measures of LD
The Minimum Informative SNPs
in a Block of Complete LD
(k,w)-MIS Problem
(k,w)-MIS: O(nk2^w)
[Figure: haplotype matrix with a window of width w, undecided SNPs marked “?”, and the labels Opt, As0, As1, As and solution]
Validation
Tests on Publicly-Accessible Data

We performed tests using two publicly available
datasets:
LPL dataset of Nickerson et al. (2000):
142 chromosomes typed at 88 SNPs
Chromosome 21 dataset of Patil et al. (2001):
20 chromosomes typed at 24,047 SNPs

We also performed tests on an AB dataset
Most of Chromosome 22
45 chromosomes typed at 4102 SNPs
A region of Chr. 22
45 Caucasian samples
Two different runs of the Gabriel et al. Block Detection method +
Zhang et al SNP selection algorithm
Our block-free algorithm
Block free tagging: Minimum informative SNPs
[Figure: Informativeness versus Number of SNPs, Block Free method compared with Block Method. Perlegen Data Set, Chromosome 21: 20 individuals, 24047 SNPs]
Block free tagging: Minimum informative SNPs
[Figure: Information fraction versus #SNPs for Block-free 21, Block-free 15 and With blocks. Lipoprotein Lipase Gene, 71 individuals, 88 SNPs]
Correct imputation: block vs. block free
[Figure: # correct imputations versus #SNPs typed, Block Free compared with Zhang et al. Perlegen dataset]
Correlations of informativeness with
imputation in leave one out studies
[Figure: Leave one out and Informativeness versus #SNPs, Block free. Perlegen dataset]
Conclusions

Existing LD based measures are not adequate for SNP
subset selection, and do not extend easily to multiple
SNPs

The Informativeness measure for SNPs is Block-free,
and extends easily to multiple SNPs.

Practically feasible algorithms for genome-wide studies
to compute minimum informative SNP subsets

We are able to show that by typing only 20-30% of the
SNPs, we are able to retain 90% of the
informativeness.