Approximate Similarity Search in Sequence Databases

Approximate Similarity Search in
Genomic Sequence Databases using
Landmark-Guided Embedding
Ahmet Sacan and I. Hakki Toroslu
email: [ahmet,toroslu]@ceng.metu.edu.tr
Computer Engineering Department,
Middle East Technical University
Ankara, TURKEY
SISAP’08 – 20080411
Outline
• Background
– Sequence Alignment
– Blast
• Embedding Subsequences
– Fastmap, LMDS
– Analysis of parameters to achieve stable and
accurate mapping
• Indexing Subsequences
Sequence Similarity Search
• Sequence similarity search is at the heart of
bioinformatics research
– Similarity information allows: structural,
functional, and evolutionary inferences
Sequence Alignment
• Goal: maximize “alignment score”
• Score of aligning two residues:
– Substitution matrix
• Optimal solution: Dynamic Programming
– Global: Needleman-Wunsch (1970)
– Local: Smith-Waterman (1981)
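The dynamic-programming recurrences behind both algorithms can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' code: the linear gap penalty `gap`, the toy `identity_score` function, and all function names are hypothetical. A real aligner would take `score` from a substitution matrix.

```python
def identity_score(x, y):
    # Toy scoring function (assumption): +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1


def global_score(a, b, score, gap=-1):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # a[i-1] against a gap
                cur[j - 1] + gap,                         # b[j-1] against a gap
            ))
        prev = cur
    return prev[-1]


def local_score(a, b, score, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            cell = max(
                0,                                        # restart the alignment
                prev[j - 1] + score(a[i - 1], b[j - 1]),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

Both functions fill an (|a|+1) x (|b|+1) table row by row, so each runs in time proportional to the product of the two sequence lengths. That cost is what makes exhaustive alignment against a whole database expensive.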
Blast (Basic Local Alignment Search Tool)
• Popular tool for similarity search in sequence
databases
1) Generate “k-tuples” (“k-mers”, “words”) from the query
• CDEFG → CDE, DEF, EFG
• CDE → ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database
3) For each candidate sequence, extend the k-tuple match in both directions.
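Steps 1 and 2 can be sketched in a few lines. This is an illustrative sketch only: the function names, the toy `match_count` scoring function, and the threshold value are assumptions for the example, and the extension phase of step 3 is not shown.

```python
from itertools import product


def ktuples(seq, k):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def match_count(a, b):
    # Toy residue score (assumption): 1 for identical residues, 0 otherwise.
    return 1 if a == b else 0


def neighborhood(word, alphabet, score, threshold):
    """Words of the same length whose summed residue score against
    `word` is at least `threshold` (brute force over alphabet^k)."""
    out = []
    for cand in product(alphabet, repeat=len(word)):
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold:
            out.append("".join(cand))
    return out


def build_index(db, k):
    """Map each k-tuple to the (sequence name, offset) pairs where it occurs,
    so that exact matches can be looked up directly."""
    index = {}
    for name, seq in db.items():
        for pos, word in enumerate(ktuples(seq, k)):
            index.setdefault(word, []).append((name, pos))
    return index
```

For example, `ktuples("CDEFG", 3)` yields the three words on the slide, and `neighborhood("CDE", "ACDE", match_count, 2)` includes ADE, CDC, CCE and CDE. The brute-force enumeration grows as the alphabet size raised to the power k, which is one reason large k is hard to handle this way.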
Time-accuracy trade-off
• Word length k: 1 2 3 4 … 11
– Proteins: 20^3 tuples (k=3)
– DNA: 4^11 tuples (k=11)
• Small k:
– Too many k-tuple hits to process
– Slows down the extension phase
• Large k:
– Few or no k-tuple hits
– Fast execution
– Exact k-tuple matching not sensitive
– Too many false negatives
• Challenge:
– Allow flexible matching for larger words in
reasonable time
Raising the bar for k
1. Map k-tuples to a vector space
• Mapping cannot be perfect, thus “approximate results”
2. Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples
Mapping k-tuples
• Requirements:
– Need to support out-of-sample extension
– Speed
• Candidate methods:
– Fastmap (Faloutsos, 1995)
– Landmark MDS (de Silva, 2003)
Fastmap
1. Select two pivots
• Distant pivots heuristic
2. Obtain projection using cosine law
3. Project objects to new hyperplane
4. Repeat
Fastmap
• Fast! O(Nd)
– N: number of data points
– d: target dimensionality
• For a query, only the distances to the set of pivots need to be calculated
• Unstable (esp. if the original space is non-Euclidean)
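The Fastmap steps, together with the out-of-sample projection of a query, can be sketched as below. This is a simplified sketch under assumptions: the function names are hypothetical, the pivot search always starts from the first object, and the distance function is passed in as `dist`. It is not the presenters' implementation.

```python
import math


def fastmap(objects, dist, d):
    """Embed objects into d dimensions; returns (coords, pivots)."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]
    pivots = []  # one (a, b) index pair per axis actually used

    def res2(i, j, axis):
        # Squared distance left over after removing the first `axis` coordinates.
        r = dist(objects[i], objects[j]) ** 2
        for t in range(axis):
            r -= (coords[i][t] - coords[j][t]) ** 2
        return r  # can go negative when the original space is non-Euclidean

    for axis in range(d):
        # Distant-pivots heuristic: start anywhere, hop to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: res2(a, j, axis))
        a = max(range(n), key=lambda j: res2(b, j, axis))
        dab2 = res2(a, b, axis)
        if dab2 <= 0:
            break  # no spread left; remaining coordinates stay 0
        pivots.append((a, b))
        for i in range(n):
            # Cosine law: position of object i along the line through a and b.
            coords[i][axis] = (res2(a, i, axis) + dab2 - res2(b, i, axis)) / (
                2 * math.sqrt(dab2))
    return coords, pivots


def project_query(query, objects, dist, coords, pivots):
    """Map an unseen object using only its distances to the pivots."""
    q = []
    for axis, (a, b) in enumerate(pivots):
        def res2_q(i):
            r = dist(query, objects[i]) ** 2
            for t in range(axis):
                r -= (q[t] - coords[i][t]) ** 2
            return r
        dab2 = dist(objects[a], objects[b]) ** 2 - sum(
            (coords[a][t] - coords[b][t]) ** 2 for t in range(axis))
        q.append((res2_q(a) + dab2 - res2_q(b)) / (2 * math.sqrt(dab2)))
    return q
```

The comment in `res2` marks where the instability enters: with a non-Euclidean distance the residual squared distance can become negative, and later axes are then built on distorted values.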
Landmark MDS
1. Select n landmarks (pivots)
2. Embed landmarks using classical
MDS
3. For the remaining objects, apply
triangulation based on their
distances to the landmarks
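A compact sketch of these three steps follows, assuming the landmarks have already been chosen. To stay dependency-free it uses plain power iteration with deflation for the eigenpairs, which is an assumption of this example; a practical implementation would more likely call a symmetric eigensolver from a linear-algebra library. All names are hypothetical.

```python
import math
import random


def top_eigenpairs(B, d, iters=500, seed=0):
    """Largest-magnitude eigenpairs of a symmetric matrix B
    (power iteration with deflation)."""
    rng = random.Random(seed)
    n = len(B)
    B = [row[:] for row in B]  # work on a copy
    pairs = []
    for _ in range(d):
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return pairs


def lmds_fit(landmarks, dist, d):
    """Classical MDS on the landmarks only."""
    n = len(landmarks)
    D2 = [[dist(a, b) ** 2 for b in landmarks] for a in landmarks]
    mean_d2 = [sum(row) / n for row in D2]
    total = sum(mean_d2) / n
    # Double centring of the squared-distance matrix.
    B = [[-0.5 * (D2[i][j] - mean_d2[i] - mean_d2[j] + total)
          for j in range(n)] for i in range(n)]
    # Keep only clearly positive eigenvalues.
    pairs = [(lam, v) for lam, v in top_eigenpairs(B, d) if lam > 1e-9]
    return {"landmarks": landmarks, "mean_d2": mean_d2, "pairs": pairs}


def lmds_embed(model, obj, dist):
    """Distance-based triangulation of any object from its landmark distances."""
    delta = [dist(obj, lm) ** 2 for lm in model["landmarks"]]
    diff = [dv - m for dv, m in zip(delta, model["mean_d2"])]
    return [-0.5 * sum(vi * di for vi, di in zip(v, diff)) / math.sqrt(lam)
            for lam, v in model["pairs"]]
```

Embedding a new object needs only its distances to the n landmarks, so the same `lmds_embed` call serves both database k-tuples and queries.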
Landmark MDS
• Provides stable results
• Good selection of landmarks is critical.
– LMDS_random
– LMDS_maxmin
• Add new landmarks that maximize the minimum
distance to the already selected landmarks
– LMDS_fastmap
• Use the same landmarks as found by Fastmap
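The maxmin strategy can be illustrated with a short greedy loop. This is a sketch under assumptions: the starting landmark (`first`), the plain Hamming distance used in the example, and the function names are all hypothetical choices.

```python
def hamming(x, y):
    # Unweighted Hamming distance between equal-length k-tuples (example only).
    return sum(a != b for a, b in zip(x, y))


def maxmin_landmarks(objects, dist, n, first=0):
    """Greedily pick n landmark indices; each new landmark is the object
    whose distance to its nearest already chosen landmark is largest."""
    chosen = [first]
    # min_d[i] = distance from object i to its nearest chosen landmark
    min_d = [dist(objects[first], o) for o in objects]
    while len(chosen) < n:
        nxt = max(range(len(objects)), key=lambda i: min_d[i])
        chosen.append(nxt)
        min_d = [min(m, dist(objects[nxt], o)) for m, o in zip(min_d, objects)]
    return chosen
```

With the objects `["AAAA", "AAAC", "CCCC", "AACC"]` and `n=3`, the loop starts at AAAA, then picks CCCC (farthest from it), then AACC. Each round costs one distance evaluation per object.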
Evaluation
• Synthetic datasets
– Randomly generate k-tuples for a given k and
alphabet size σ
• Real dataset
– Yeast proteins benchmark (σ=20)
– 6,341 proteins, 2.9 million residues
– 103 query proteins, 38-884 residues
• Weighted Hamming distance
• CB-EUC substitution matrix (Sacan, 2007)
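A weighted Hamming distance sums a per-position substitution cost over two equal-length k-tuples. The sketch below assumes that reading. The `TOY_COST` values are invented for illustration and are not CB-EUC entries.

```python
# Hypothetical symmetric cost table over a three-letter alphabet (not CB-EUC).
TOY_COST = {
    "A": {"A": 0.0, "C": 1.0, "D": 2.0},
    "C": {"A": 1.0, "C": 0.0, "D": 1.5},
    "D": {"A": 2.0, "C": 1.5, "D": 0.0},
}


def weighted_hamming(x, y, cost):
    """Sum of substitution costs over aligned positions of two k-tuples."""
    if len(x) != len(y):
        raise ValueError("k-tuples must have the same length")
    return sum(cost[a][b] for a, b in zip(x, y))
```

With a cost table that is 0 on the diagonal and 1 elsewhere, this reduces to the ordinary Hamming distance.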
Target dimensionality (d)
k=5, synthetic dataset, identity matrix
• Sammon’s metric stress:
• Breaking point dimensionality
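For reference, a stress value of this kind can be computed as below. The sketch assumes the standard definition of Sammon's stress, the sum of (δ − d)² / δ over all pairs, divided by the sum of δ, where δ is the original distance and d the embedded Euclidean distance. The normalisation used in the talk may differ, and the function name is hypothetical.

```python
import math
from itertools import combinations


def sammon_stress(objects, dist, coords):
    """Sammon's stress of an embedding (standard definition, assumed here).
    objects[i] is embedded at coords[i]; dist is the original distance."""
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(objects)), 2):
        delta = dist(objects[i], objects[j])
        if delta == 0:
            continue  # skip identical objects to avoid dividing by zero
        d = math.dist(coords[i], coords[j])
        num += (delta - d) ** 2 / delta
        den += delta
    return num / den if den else 0.0
```

A value of 0 means every pairwise distance is preserved exactly. Plotting the stress against the target dimensionality d is one way to see where adding dimensions stops helping.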
Subsequence length (k)
and alphabet size (σ)
Number of landmarks
k=5, d=7, synthetic dataset, identity matrix
Approximate k-tuple search performance
• Find all k-tuples within a specified radius
from a query k-tuple
k=6, d=8, real dataset, CB-EUC matrix
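The range query and its approximate counterpart in the embedded space can be sketched as follows. This is a toy illustration under assumptions: a linear scan stands in for the spatial index lookup (R-tree or X-tree), the embedded vectors are assumed to be precomputed, and the function names are hypothetical.

```python
import math


def exact_range(query, tuples, dist, radius):
    """Ground truth: every k-tuple within `radius` under the original distance."""
    return {t for t in tuples if dist(query, t) <= radius}


def approx_range(q_vec, vectors, radius):
    """Approximate answer: k-tuples whose embedded points lie within `radius`
    of the embedded query. `vectors` maps each k-tuple to its coordinates."""
    return {t for t, v in vectors.items() if math.dist(q_vec, v) <= radius}


def recall(approx, exact):
    """Fraction of the true results that the approximate search returned."""
    return len(approx & exact) / len(exact) if exact else 1.0
```

Because the embedding distorts distances, `approx_range` can both miss true neighbours and return extra ones. Comparing it with `exact_range` through `recall` quantifies the first kind of error.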
Homology search
k=6, d=8, real dataset, CB-EUC matrix
Search time
search radius=7
Database size=100,000
Conclusion
• Applied an embedding-based approach to
approximate sequence similarity search for
the first time
• Significant time improvements with
negligible degradation in accuracy
• Achieved a more stable embedding with a
combined pivot selection strategy
• Defined intrinsic Euclidean dimensionality of
the dataset