tagSNP selection

Fast Tag SNP Selection
Wang Yue
Joint work with Postdoc Guimei Liu and Prof Limsoon Wong
NUS Presentation Title 2006
Outline
• Some preliminary definitions
• Previous work
• Our work: FastTagger
• Experiment results
• Tag SNP application
• Future work
SNP (single-nucleotide polymorphism)
Why research SNPs?
• Variation among human beings can affect how
humans develop certain diseases and respond to
pathogens, chemicals, drugs, vaccines, and other
agents
e.g. researchers found that people with specific alterations (SNPs)
have a 50% higher relative risk of developing glioblastoma, a type of
brain cancer.
• A promising route to realizing "personalized
medicine"
• Important in crop and livestock breeding programs
Tag SNP
• A tag SNP is a representative SNP in a region of the genome with
high linkage disequilibrium
• Tag SNPs make it possible to identify genetic variation without
genotyping every SNP in a chromosomal region.
• Tag SNPs are useful in whole-genome SNP
association studies in which hundreds of thousands
of SNPs across the entire genome are genotyped.
Tag SNP: linkage disequilibrium
• In population genetics, linkage disequilibrium is the
non-random association of alleles at two or more
loci, not necessarily on the same chromosome.
• It is usually measured by r2:
r2 = (P(XY)P(xy) − P(Xy)P(xY))^2 / (P(X)P(x)P(Y)P(y))
• where P(XY), P(Xy), P(xY), P(xy) are the frequencies of the
four possible haplotypes; P(X) = P(XY) + P(Xy),
P(x) = P(xY) + P(xy), P(Y) = P(XY) + P(xY), P(y) = P(Xy) + P(xy)
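A minimal sketch of how r2 between two biallelic SNPs could be computed from the four haplotype frequencies. The function name and the choice to return 0 for a monomorphic SNP (where r2 is undefined) are assumptions made for illustration.

```python
def r_squared(p_XY, p_Xy, p_xY, p_xy):
    """r2 between two biallelic SNPs from the four haplotype frequencies.

    Standard definition: r2 = D^2 / (P(X) P(x) P(Y) P(y)),
    with D = P(XY) P(xy) - P(Xy) P(xY).
    """
    # Marginal allele frequencies at each locus.
    p_X = p_XY + p_Xy
    p_x = p_xY + p_xy
    p_Y = p_XY + p_xY
    p_y = p_Xy + p_xy
    denom = p_X * p_x * p_Y * p_y
    if denom == 0:
        # Assumption: one SNP is monomorphic, so r2 is undefined; report 0.
        return 0.0
    d = p_XY * p_xy - p_Xy * p_xY
    return d * d / denom
```

Two SNPs whose alleles always travel together, e.g. `r_squared(0.5, 0.0, 0.0, 0.5)`, give r2 = 1; independent SNPs, e.g. `r_squared(0.25, 0.25, 0.25, 0.25)`, give r2 = 0.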
Tag SNP selection
• Given a dataset, we can find a huge number of tagging
relations among SNPs, as long as we can enumerate
the possible r2 values between SNPs
• The reality is
• We want to select the smallest set of high-quality
SNPs that can tag the remaining SNPs; in other words, if
we know this smallest set of SNPs, we can
infer the rest based on the r2 values.
Tag SNP Selection: a more formal description
• Given a set S of SNPs, find the smallest set of tag
SNPs Stag such that for every SNPj ∈ S − Stag, there is
at least one SNP set Sj ⊆ Stag such that
  – r2(Sj, SNPj) ≥ min_r2
  – |Sj| ≤ max_size
  – the distance between every pair of SNPs in Sj ∪ {SNPj}
    is no larger than max_dist
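The three conditions on a single candidate rule Sj -> SNPj can be sketched as a small check. This assumes the r2 value of the rule has already been computed elsewhere and that SNP positions are base-pair coordinates on one chromosome; the function and argument names are hypothetical.

```python
def rule_satisfies_constraints(lhs_positions, rhs_position, r2_value,
                               min_r2, max_size, max_dist):
    """Check the three conditions for one candidate rule Sj -> SNPj.

    lhs_positions: positions of the SNPs in Sj (hypothetical input layout)
    rhs_position:  position of SNPj
    r2_value:      precomputed r2(Sj, SNPj)
    """
    if r2_value < min_r2:
        return False
    if len(lhs_positions) > max_size:
        return False
    positions = list(lhs_positions) + [rhs_position]
    # Every pair is within max_dist exactly when the two extreme positions are.
    return max(positions) - min(positions) <= max_dist
```

For example, `rule_satisfies_constraints([100, 200], 150, 0.9, min_r2=0.8, max_size=2, max_dist=100)` returns `True`, while raising any one constraint beyond what the rule meets makes it return `False`.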
Previous Work
• Step 1: Calculate correlations between SNPs within a certain distance
• Step 2: Find the smallest set of tag SNPs using the
correlations calculated in Step 1
• Most algorithms use a greedy approach to find a near-optimal set of tag SNPs
in Step 2
• Earlier tag SNP selection methods rely on pairwise correlations
• MultiTag & MMTagger find multi-marker rules, e.g. {SNP1, SNP2, SNP3} -> SNPx
• They cannot handle more than 100k SNPs
• MultiTag takes hundreds of hours for 30k SNPs
• MMTagger takes hours and 1GB of memory for 30k SNPs
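Step 1 only needs to consider SNP pairs that lie within the distance limit. A rough sketch of that candidate enumeration with a sliding window over sorted positions follows; it is not taken from any of the tools named above, and the function name is hypothetical.

```python
def pairs_within_distance(positions, max_dist):
    """Return index pairs (i, j), i < j, whose positions differ by at most max_dist.

    Assumes `positions` is sorted in ascending order.
    """
    pairs = []
    start = 0  # left edge of the window of SNPs still within reach of SNP j
    for j in range(len(positions)):
        # Slide the left edge forward until it is within max_dist of SNP j.
        while positions[j] - positions[start] > max_dist:
            start += 1
        for i in range(start, j):
            pairs.append((i, j))
    return pairs
```

With positions `[0, 50, 120, 400]` and `max_dist=100`, only the pairs `(0, 1)` and `(1, 2)` are produced, so correlations are never computed for distant SNPs.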
FastTagger
• Follows the same two major steps
• First step: borrow typical data mining techniques
to mine tagging rules based on r2 values
• Second step: use a greedy algorithm to select a
small set of tag SNPs from the tagging rules
generated in the first step
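The second step can be illustrated with a simple greedy cover over tagging rules. This is a sketch under stated assumptions, not FastTagger's actual selection procedure: rules are (LHS set, RHS SNP) pairs, a SNP counts as covered once it is selected or once every SNP on the LHS of one of its rules is selected, and all names are hypothetical.

```python
def covered_by(selected, rules):
    """SNPs that are selected, or are the RHS of a rule whose LHS is fully selected."""
    covered = set(selected)
    for lhs, rhs in rules:
        if lhs <= selected:
            covered.add(rhs)
    return covered


def greedy_tag_snps(snps, rules):
    """Greedily pick tag SNPs until every SNP in `snps` is covered.

    snps:  list of SNP identifiers (every SNP named in a rule must be in it)
    rules: list of (frozenset_of_lhs_snps, rhs_snp)
    """
    selected, covered = set(), set()
    while len(covered) < len(snps):
        best, best_cov = None, covered
        for s in snps:
            if s in selected:
                continue
            cov = covered_by(selected | {s}, rules)
            # Keep the candidate that covers the most SNPs (first one wins ties).
            if len(cov) > len(best_cov):
                best, best_cov = s, cov
        # An uncovered SNP always covers at least itself, so best is never None.
        selected.add(best)
        covered = best_cov
    return selected
```

This version rescans all rules for every candidate, which is fine for a toy example but far too slow for whole chromosomes.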
Why does FastTagger beat previous work?
• Previous work such as MMTagger generates many
redundant tagging relations
• FastTagger avoids this by
1. Merging nearby equivalent SNPs
2. Pruning redundant correlation rules
3. Skipping rules whose RHS has already been covered many times
4. If the total size of the rules exceeds memory, dividing the
chromosome into blocks and then finding tag SNPs
within each block
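Technique 1 can be sketched as follows. The slides do not define "equivalent", so this assumes two SNPs are equivalent when their allele patterns across haplotypes are identical up to swapping the two allele labels (which corresponds to r2 = 1), and that "nearby" means within `max_dist` of the first SNP in the group. The input layout and names are hypothetical.

```python
def merge_equivalent_snps(snps, max_dist):
    """Group nearby SNPs that carry the same information.

    snps: list of (position, alleles), sorted by position; `alleles` is a
          non-empty tuple of 0/1 values, one per haplotype (assumed layout).
    Returns a list of groups, each a list of indices into `snps`.
    """
    open_groups = {}  # canonical allele pattern -> (first position, index list)
    groups = []
    for idx, (pos, alleles) in enumerate(snps):
        # Canonical form: relabel alleles so the first haplotype always carries 0.
        key = alleles if alleles[0] == 0 else tuple(1 - a for a in alleles)
        current = open_groups.get(key)
        if current is not None and pos - current[0] <= max_dist:
            current[1].append(idx)
        else:
            # Start a new group for this pattern (the old one is too far away).
            current = (pos, [idx])
            open_groups[key] = current
            groups.append(current[1])
    return groups
```

Each group can then be represented by a single SNP during rule mining, which is why merging shrinks both the number of rules and the runtime.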
Experiment Setting
Japanese and Han in HapMap release 21
– 45 unrelated individuals
– 6 chromosomes
Experiment Results: running time and # of tag SNPs
Comparison with state-of-the-art work: MMTagger
Experiment Results: memory consumption
MMTagger consumes much more memory and
fails on large chromosomes when max_size = 3
Step 2 of FastTagger consumes much more memory than
Step 1 because it must keep the generated rules in
memory
Effectiveness of Merging Nearby Equivalent SNPs
The # of rules, # of tag SNPs, and runtime are significantly
reduced
Effectiveness of Skipping Rules
Memory usage and runtime are significantly
reduced, while # of tag SNPs is marginally
increased
Effectiveness of Pruning Redundant Rules
Memory usage and # rules are significantly
reduced
Conclusions
• Compared to existing genome-wide tag SNP
selection algorithms that use multi-marker correlations,
FastTagger
– is many times faster
– consumes much less memory
– can work on chromosomes with > 100k SNPs
• Merging equivalent SNPs is the most effective
technique for reducing running time and memory
consumption
Tag SNP Application
• Use the tagging rules generated by our data
mining technique to infer extra SNPs from an existing
SNP list
• We obtained SNP lists from two major SNP chip
companies:
• Illumina, 1145784 SNPs
• Affymetrix, 927654 SNPs
• How many extra SNPs can we infer?
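Counting inferable SNPs can be sketched as a single pass over the tagging rules: a SNP that is not on the chip is inferable if at least one rule has it as RHS and every SNP on that rule's LHS is on the chip. This is an illustrative reading of the setup; the function name, the rule layout, and the rs identifiers in the test data are hypothetical, and inferred SNPs are not reused to infer further ones.

```python
def inferable_snps(chip_snps, rules):
    """Return SNPs absent from the chip that some rule can infer from chip SNPs.

    chip_snps: iterable of SNP identifiers genotyped by the chip
    rules:     iterable of (lhs_snps, rhs_snp); lhs_snps is any iterable of IDs
    """
    chip = set(chip_snps)
    return {
        rhs
        for lhs, rhs in rules
        # The RHS must be new, and the whole LHS must be genotyped on the chip.
        if rhs not in chip and set(lhs) <= chip
    }
```

The size of the returned set is the number of extra SNPs for one chip under one r2 threshold and rule size.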
Experiment Setting
• Our rules are generated from the Japanese and
Han data set in HapMap release 21; unlike the previous
experiment, we use 22 chromosomes
• In this experiment, two factors determine how
many extra SNPs we can infer
1. r2 threshold: 0.8 is the usual empirical setting; we use 0.80, 0.85, 0.90 and
0.95
2. Rule size: we use 1 and 2
Number of extra SNPs inferred, by r2 threshold and rule length

r2     Chip         length 2   length 1
0.80   Affymetrix   1006866    382962
0.80   Illumina      993321    417107
0.85   Affymetrix    927026    310671
0.85   Illumina      923971    340070
0.90   Affymetrix    821042    226306
0.90   Illumina      827858    248512
0.95   Affymetrix    118994    112896
0.95   Illumina      131263    125200
Future Work
• Test the accuracy of our selected SNPs against state-of-the-art work
• Support adaptive user requirements when selecting
SNPs, such as "I have only 1 million, just give me the
1000 most informative SNPs"
• How the division of the chromosomes influences the
# of tag SNPs
• More to explore
Many thanks to
• My supervisor: Prof Limsoon Wong
• My senior: Guimei Liu
• Some slides are adapted from Prof Wong's notes
and Wikipedia
• Thank you for listening
• Q&A