Approximate Similarity Search in Sequence Databases

Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding

Ahmet Sacan and I. Hakki Toroslu
email: [ahmet,toroslu]@ceng.metu.edu.tr
Computer Engineering Department, Middle East Technical University, Ankara, TURKEY
SISAP’08 – 20080411

Outline
• Background
  – Sequence Alignment
  – Blast
• Embedding Subsequences
  – Fastmap, LMDS
  – Analysis of parameters to achieve stable and accurate mapping
• Indexing Subsequences

Sequence Similarity Search
• Sequence similarity search is at the heart of bioinformatics research
  – Similarity information allows structural, functional, and evolutionary inferences

Sequence Alignment
• Goal: maximize “alignment score”
• Score of aligning two residues:
  – Substitution matrix
• Optimal solution: Dynamic Programming
  – Global: Needleman-Wunsch (1970)
  – Local: Smith-Waterman (1981)

Blast (Basic Local Alignment Search Tool)
• Popular tool for similarity search in sequence databases
1) Generate “k-tuples” (“k-mers”, “words”) from the query
  • CDEFG → CDE, DEF, EFG
  • CDE → ADE, CDC, CCE, CDE, …
2) Find (exact) matching k-tuples in the database
3) For each candidate sequence, extend the k-tuple match in both directions

Time-accuracy trade-off
• Word length k ranges from 1 to 11: Proteins (20^3 tuples), DNA (4^11 tuples)
• Small k:
  – Too many k-tuple hits to process
  – Slows down the extension phase
• Large k:
  – Few or no k-tuple hits
  – Fast execution
  – Exact k-tuple matching is not sensitive
  – Too many false negatives
• Challenge:
  – Allow flexible matching for larger words at reasonable time

Raising the bar for k
1. Map k-tuples to a vector space
  • Mapping cannot be perfect, thus “approximate results”
2. Use Spatial Access Methods (e.g. R-tree, X-tree) to index and retrieve k-tuples

Mapping k-tuples
• Requirements:
  – Need to support out-of-sample extension
  – Speed
• Candidate methods:
  – Fastmap (Faloutsos, 1995)
  – Landmark MDS (de Silva, 2003)

Fastmap
1. Select two pivots
  • Distant pivots heuristic
2. Obtain projection using cosine law
3. Project objects to new hyperplane
4. Repeat

• Fast! O(Nd)
  – N: number of data points
  – d: target dimensionality
• For a query, only the distances to the set of pivots need to be calculated
• Unstable (esp. if the original space is non-Euclidean)

Landmark MDS
1. Select n landmarks (pivots)
2. Embed landmarks using classical MDS
3. For the remaining objects, apply distance-based triangulation based on distances to landmarks

• Provides stable results
• Good selection of landmarks is critical
  – LMDSrandom
  – LMDSmaxmin
    • Add new landmarks that maximize the minimum distance to already selected landmarks
  – LMDSfastmap
    • Use the same landmarks as found by Fastmap

Evaluation
• Synthetic datasets
  – Randomly generate k-tuples for a given k and alphabet size σ
• Real dataset
  – Yeast proteins benchmark (σ=20)
  – 6,341 proteins, 2.9 million residues
  – 103 query proteins, 38-884 residues
• Weighted Hamming distance
• CB-EUC substitution matrix (Sacan, 2007)

Target dimensionality (d)
k=5, synthetic dataset, identity matrix
• Sammon’s metric stress: [equation not preserved]
• Breaking point dimensionality

Subsequence length (k) and alphabet size (σ)

Number of landmarks
k=5, d=7, synthetic dataset, identity matrix

Approximate k-tuple search performance
• Find all k-tuples within a specified radius from a query k-tuple
k=6, d=8, real dataset, CB-EUC matrix

Homology search
k=6, d=8, real dataset, CB-EUC matrix

Search time
search radius=7, database size=100,000

Conclusion
• Applied an embedding-based approach to approximate sequence similarity search for the first time
• Significant time improvements with negligible degradation in accuracy
• Achieved more stable embedding with combined pivot selection strategy
• Defined intrinsic Euclidean dimensionality of the dataset
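
The Fastmap steps listed in the slides (pick two distant pivots, project with the cosine law, repeat on the residual distances) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes an unweighted Hamming distance in place of the weighted one with the CB-EUC matrix. The toy protein string and the function names (k_tuples, hamming, residual_sq, axis_coordinate, fastmap_fit, fastmap_transform) are hypothetical and chosen only for this example.

```python
import random

def k_tuples(seq, k):
    """All overlapping k-tuples of seq (CDEFG with k=3 gives CDE, DEF, EFG)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hamming(s, t):
    """Unweighted Hamming distance (assumption: stands in for the weighted one)."""
    return sum(a != b for a, b in zip(s, t))

def residual_sq(dist, x, cx, y, cy):
    """Squared distance between x and y after removing the axes found so far."""
    r = dist(x, y) ** 2 - sum((p - q) ** 2 for p, q in zip(cx, cy))
    return max(r, 0.0)  # can go negative in a non-Euclidean space, so clamp

def axis_coordinate(dist, obj, c, pa, pb, dab_sq):
    """Cosine-law projection of obj onto the line through pivots pa and pb."""
    da_sq = residual_sq(dist, obj, c, pa[0], pa[1])
    db_sq = residual_sq(dist, obj, c, pb[0], pb[1])
    return (da_sq + dab_sq - db_sq) / (2 * dab_sq ** 0.5)

def fastmap_fit(objects, dist, d, seed=0):
    """Embed objects into at most d dimensions; return (coords, pivots)."""
    rng = random.Random(seed)
    coords = [[] for _ in objects]
    pivots = []  # per axis: ((obj_a, coords_a), (obj_b, coords_b), d_ab squared)

    def farthest_from(i):
        return max(range(len(objects)), key=lambda j: residual_sq(
            dist, objects[i], coords[i], objects[j], coords[j]))

    for _ in range(d):
        # Distant-pivots heuristic: start anywhere, jump to the farthest object twice.
        b = farthest_from(rng.randrange(len(objects)))
        a = farthest_from(b)
        dab_sq = residual_sq(dist, objects[a], coords[a], objects[b], coords[b])
        if dab_sq == 0:
            break  # no residual distance left to explain
        pa = (objects[a], list(coords[a]))  # snapshot of pivot coordinates so far
        pb = (objects[b], list(coords[b]))
        pivots.append((pa, pb, dab_sq))
        new_axis = [axis_coordinate(dist, o, c, pa, pb, dab_sq)
                    for o, c in zip(objects, coords)]
        for c, v in zip(coords, new_axis):
            c.append(v)
    return coords, pivots

def fastmap_transform(obj, dist, pivots):
    """Out-of-sample mapping: only distances to the stored pivots are needed."""
    c = []
    for pa, pb, dab_sq in pivots:
        c.append(axis_coordinate(dist, obj, c, pa, pb, dab_sq))
    return c

# Toy usage on an arbitrary, made-up protein string.
words = sorted(set(k_tuples("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 3)))
coords, pivots = fastmap_fit(words, hamming, d=3)
query_point = fastmap_transform("AKQ", hamming, pivots)
```

The clamp in residual_sq is one common way to cope with the instability the slides mention for non-Euclidean spaces; it hides the problem rather than fixing it. The resulting coordinate vectors are what a spatial access method such as an R-tree would index, which this sketch leaves out.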
