Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding

Ahmet Sacan and I. Hakki Toroslu
email: [ahmet,toroslu]@ceng.metu.edu.tr
Computer Engineering Department, Middle East Technical University
Ankara, TURKEY
SISAP’08 – 20080411

Outline
• Background
  – Sequence Alignment
  – Blast
• Embedding Subsequences
  – Fastmap, LMDS
  – Analysis of parameters to achieve stable and accurate mapping
• Indexing Subsequences

Sequence Similarity Search
• Sequence similarity search is at the heart of bioinformatics research
  – Similarity information allows structural, functional, and evolutionary inferences

Sequence Alignment
• Goal: maximize “alignment score”
• Score of aligning two residues:
  – Substitution matrix
• Optimal solution: Dynamic Programming
  – Global: Needleman-Wunsch (1970)
  – Local: Smith-Waterman (1981)

Blast (Basic Local Alignment Search Tool)
• Popular tool for similarity search in sequence databases
1) Generate “k-tuples” (“k-mers”, “words”) from the query
   • CDEFG → CDE, DEF, EFG
   • CDE → ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database
3) For each candidate sequence, extend the k-tuple match in both directions

Time-accuracy trade-off
k: 1 2 3 4 … 11
Proteins (20^3 tuples), DNA (4^11 tuples)
• Small k:
  – Too many k-tuple hits to process
  – Slows down the extension phase
• Large k:
  – Few or no k-tuple hits
  – Fast execution
  – Exact k-tuple matching is not sensitive
  – Too many false negatives
• Challenge:
  – Allow flexible matching for larger words at reasonable time

Raising the bar for k
1. Map k-tuples to a vector space
   • Mapping cannot be perfect, thus “approximate results”
2. Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples

Mapping k-tuples
• Requirements:
  – Need to support out-of-sample extension
  – Speed
• Candidate methods:
  – Fastmap (Faloutsos, 1995)
  – Landmark MDS (de Silva, 2003)

Fastmap
1. Select two pivots
   • Distant pivots heuristic
2. Obtain projection using cosine law
3. Project objects to new hyperplane
4. Repeat

Fastmap
• Fast! O(Nd)
  – N: number of data points
  – d: target dimensionality
• For a query, only the distances to the set of pivots need to be calculated
• Unstable (esp. if the original space is non-Euclidean)

Landmark MDS
1. Select n landmarks (pivots)
2. Embed landmarks using classical MDS
3. For the remaining objects, apply distance-based triangulation based on distances to landmarks

Landmark MDS
• Provides stable results
• Good selection of landmarks is critical
  – LMDSrandom
  – LMDSmaxmin
    • Add new landmarks that maximize the minimum distance to already selected landmarks
  – LMDSfastmap
    • Use the same landmarks as found by Fastmap

Evaluation
• Synthetic datasets
  – Randomly generate k-tuples for a given k and alphabet size σ
• Real dataset
  – Yeast proteins benchmark (σ=20)
  – 6,341 proteins, 2.9 million residues
  – 103 query proteins, 38-884 residues
• Weighted Hamming distance
• CB-EUC substitution matrix (Sacan, 2007)

Target dimensionality (d)
k=5, synthetic dataset, identity matrix
• Sammon’s metric stress: [equation not preserved]
• Breaking point dimensionality

Subsequence length (k) and alphabet size (σ)

Number of landmarks
k=5, d=7, synthetic dataset, identity matrix

Approximate k-tuple search performance
• Find all k-tuples within a specified radius from a query k-tuple
k=6, d=8, real dataset, CB-EUC matrix

Homology search
k=6, d=8, real dataset, CB-EUC matrix

Search time
Search radius=7, database size=100,000

Conclusion
• Applied an embedding-based approach to approximate sequence similarity search for the first time
• Significant time improvements with negligible degradation in accuracy
• Achieved more stable embedding with combined pivot selection strategy
• Defined intrinsic Euclidean dimensionality of the dataset
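
Sketch: Landmark MDS on k-tuples
The three Landmark MDS steps from the slides (select landmarks, embed them with classical MDS, triangulate the remaining objects from their distances to the landmarks) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: it uses NumPy, a plain unweighted Hamming distance (the "identity matrix" setting) instead of the CB-EUC weights, and maxmin landmark selection only. The function names hamming, maxmin_landmarks, fit_landmarks and embed are hypothetical.

```python
import numpy as np

def hamming(a, b):
    """Unweighted Hamming distance between two equal-length k-tuples (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))

def maxmin_landmarks(items, n, dist, seed=0):
    """Maxmin selection: start from a random item, then repeatedly add the
    item whose minimum distance to the already chosen landmarks is largest."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(items)))]
    min_d = np.array([dist(items[chosen[0]], x) for x in items], dtype=float)
    while len(chosen) < n:
        nxt = int(np.argmax(min_d))
        chosen.append(nxt)
        min_d = np.minimum(min_d, [dist(items[nxt], x) for x in items])
    return chosen

def fit_landmarks(landmarks, dist, d):
    """Classical MDS on the landmarks. Returns their coordinates plus the
    quantities needed to triangulate any other object later."""
    n = len(landmarks)
    sq = np.array([[dist(a, b) ** 2 for b in landmarks] for a in landmarks],
                  dtype=float)
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ sq @ J                        # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:d]             # indices of the d largest
    vals, vecs = vals[top], vecs[:, top]
    keep = vals > 1e-9                           # drop non-positive directions
    vals, vecs = vals[keep], vecs[:, keep]
    coords = vecs * np.sqrt(vals)                # landmark coordinates
    pinv = vecs / np.sqrt(vals)                  # transposed pseudo-inverse
    return coords, pinv, sq.mean(axis=0)

def embed(item, landmarks, dist, pinv, mean_sq):
    """Distance-based triangulation: place one object using only its
    squared distances to the landmarks (out-of-sample extension)."""
    delta = np.array([dist(item, lm) ** 2 for lm in landmarks], dtype=float)
    return -0.5 * pinv.T @ (delta - mean_sq)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic 5-tuples over a 4-letter alphabet (toy sizes, chosen arbitrarily)
    tuples = ["".join(rng.choice(list("ACGT"), size=5)) for _ in range(500)]
    idx = maxmin_landmarks(tuples, 20, hamming)
    lms = [tuples[i] for i in idx]
    coords, pinv, mean_sq = fit_landmarks(lms, hamming, d=7)
    print(embed("ACGTA", lms, hamming, pinv, mean_sq))
```

A query k-tuple is mapped with embed alone, so only its distances to the landmarks are computed. The resulting vectors are what a spatial access method would index for the approximate range search described in the slides.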