Fast Searching in Wikipedia, P2P, and Biological Sequences
Ela Hunt, Marie Curie Fellow
Department of Computer Science, GlobIS, hunt@inf.ethz.ch
Edinburgh, May 1st 2008

My background
- Krakow, Glasgow, Berlin, Glasgow, Zurich
- Programming: BP Exploration; Max-Planck Institute for Molecular Genetics, Berlin
- 1998-2002: PhD at Glasgow, disk-based large indexes for DNA/AA
- 2001-2008: Fellowships (MRC, EU)

My fellowship – GlobalBioRes – global information systems for biomedical research
- APPLICATION AREAS: search for genes causing disease, and for drug targets in heart disease and cancer
- GOALS: human and computing efficiency and speed
- METHODS: algorithms and indexing; visualisation and user studies; prototypes

Overview
- WHY bother doing indexing research
- HOW – the options
- Some solutions
- Future developments

Search problems
- DNA: find sequences of 25 letters with 1-2 mismatches in HS, MM, RN (human, mouse, rat; 9 GB)
- Proteins: find motifs of length 7 with 1-2 mismatches
- Wikipedia

Corporate vs private search
- Corporate (Google): data mining, cannot predict new misspellings, privacy (??)
- Wikipedia: no logging, no data mining, privacy protection
- Exhaustive enumeration of POSSIBLE misspellings, instead of lookup of known ones

MOTIVATION for new search techniques
- Current practice is heuristics + parallelism: query criteria cannot be defined algorithmically; statistical criteria are used, but are poorly understood by biologists and other humans
- Parallelism: O(n) -> O(n)
- Indexing: O(n) -> O(log n)
- GOALS: exhaustiveness, quality, speed

Parallelism vs indexing
- O(n)/k is still O(n) (FPGAs)
- O(n) -> O(log n)
  - cheaper (fewer boxes)
  - an algorithmic challenge

VLDB’01: index building
- Challenge: cannot build indexes larger than RAM (memory bottleneck)
- Cause: indexes have vertical and horizontal links; when the graph is serialised depth-first or breadth-first, the two link directions cannot be combined efficiently while writing to disk and constructing the tree
- Cure: drop the suffix links and partition the tree vertically; construction goes from O(n) to O(n log n), but runs at the same speed in practice
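A minimal sketch of the vertical-partitioning idea follows. It is an illustration only, not the VLDB’01 construction: it buckets sorted suffixes instead of building a suffix tree, and the class name, sample text and prefix length are invented for the example.

    import java.util.*;

    // Toy illustration of vertical partitioning: bucket suffixes by a short prefix,
    // sort each bucket independently (one bucket in RAM at a time in a real system),
    // and append each sorted bucket to the on-disk index. Overall cost is O(n log n).
    public class PartitionedSuffixIndex {
        public static void main(String[] args) {
            String text = "ACGTACGTGACG";   // toy sequence; real inputs are gigabytes
            int prefixLen = 2;              // partition key length (an assumed parameter)

            // 1. Group suffix start positions by their first prefixLen characters.
            Map<String, List<Integer>> partitions = new TreeMap<>();
            for (int i = 0; i + prefixLen <= text.length(); i++) {
                String key = text.substring(i, i + prefixLen);
                partitions.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
            }

            // 2. Sort each partition on its own; each one corresponds to a vertical
            //    slice of the index and never has to share RAM with the others.
            for (Map.Entry<String, List<Integer>> e : partitions.entrySet()) {
                List<Integer> bucket = e.getValue();
                bucket.sort((x, y) -> text.substring(x).compareTo(text.substring(y)));
                System.out.println(e.getKey() + " -> " + bucket);
            }
        }
    }

Each bucket can be built and written out independently, which is what removes the requirement that the whole index fit in RAM.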
Suffix Sequoia: IEEE Data Eng. Bull. ’04
- Enhancing an index to the genome with a bit map, using Java direct disk reads; tests with proteins
- [Diagram: query -> mutated query (variants) -> FILTER: bitmap encoding -> indexed substrings (substring index)]

BNCOD’07: approximate search for short strings on relations
- A substring (n-gram) index for biological sequences
- Approximate search: mutate the query, check in a bit map whether each variant is present in the index, then query the index
- Good performance on large data sets (proteins, queries of length 7); faster than other methods, and exhaustive

Approximate search in P2P
- Finding misspelled words in web service descriptions (data mining cannot identify new misspellings, see Google)
- Deletion neighbourhood concept
- Index of all substrings with deletions
- Tested on PlanetLab with Wikipedia articles
- Good results for English, worse for German and Dutch (due to long words)

Edit distance
- Minimum number of insertions, replacements and deletions needed to transform one word into another
- Example: test -> est -> east, so edit distance(test, east) = 2 (1 deletion, 1 insertion)
- Computed by dynamic programming in O(mn) time
- [Figure: dynamic-programming matrix for test vs east]

HASH indexing based on edit distance and mutations
- Query -> generate mutated variants
- Look up the variants in the index
- PROBLEM: many variants
- Neighbourhood size for word length m, alphabet size a and edit distance k: choose k of the m positions (roughly m^k choices), then delete, insert or replace at each chosen position (a replacement can use any of the a-1 other letters, adding a factor of up to (a-1)^k), so the neighbourhood grows roughly as m^k · (a-1)^k

Deletion neighbourhood (k = number of deletions)
- U(test, k=2) = {(test), (est,1), (tst,2), (tet,3), (tes,4), (st,1,1), (et,1,2), (es,1,3), (tt,2,2), (ts,2,3), (te,3,3)}
- [Figure: tree of deletion variants at levels k=0, k=1 and k=2, each annotated with its delete positions]

HASH indexing based on the deletion neighbourhood
- Query -> deletion neighbourhood (size about m^k)
- Index all substrings of the n words with their deletions (size about n · m^k)
- Look up candidates by exact matching
- Traverse the deletion lists, compare deletion offsets, and calculate the edit distance
- SMALLER neighbourhood, LARGER index

Deriving edit distance via the deletion neighbourhood
- FastSS Example (1): edit distance(test, fest) = 1, same delete positions
- FastSS Example (2): edit distance(test, east) = 2, different delete positions
- FastSS Example (3): edit distance(est, east) = 1, different word lengths
- [Figures: query and target deletion lists for each example]

FastSS in memory is fast

Index sizes – serialised Java index

FastSS on MySQL+PHP (English Wikipedia)

P2P scenario, based on a DHT (NOMS’08)
- DHT (distributed hash table): a network overlay
- Insert: put(key, value)
- Lookup: value = get(key)
- With N nodes, O(log N) requests are needed
- Test data: Wikipedia, simulating service lookup

Indexing in the DHT
- Index documents using put(hash(document), document)
- Index all neighbours with k=1 (test, tes, tst, tet, est) using put(hash(neighbour), document address)

Search for “tesx”
- Neighbours are generated (tesx, esx, tsx, tex, tes)
- get(hash(neighbour)) yields a pointer to the document
- get(pointer) retrieves the document
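To make the put/get scheme concrete, here is a minimal self-contained sketch. It is not the P2PFastSS implementation: an in-memory HashMap stands in for the Kademlia DHT, the names FastSSOverDHT, indexWord and search are invented, the stored values are the words themselves rather than document addresses, and candidates are verified with a plain dynamic-programming edit distance instead of FastSS’s deletion-offset comparison.

    import java.util.*;

    public class FastSSOverDHT {
        // In-memory map standing in for the DHT's put(key, value) / get(key) interface.
        static final Map<Integer, Set<String>> dht = new HashMap<>();
        static final int K = 1;   // edit distance bound, as in the PlanetLab experiments

        // Deletion neighbourhood: the word itself plus every variant with up to k deletions.
        static Set<String> deletionNeighbourhood(String w, int k) {
            Set<String> result = new HashSet<>();
            result.add(w);
            if (k == 0 || w.isEmpty()) return result;
            for (int i = 0; i < w.length(); i++) {
                String deleted = w.substring(0, i) + w.substring(i + 1);
                result.addAll(deletionNeighbourhood(deleted, k - 1));
            }
            return result;
        }

        // Indexing: put(hash(neighbour), word) for every deletion variant of the word.
        static void indexWord(String word) {
            for (String n : deletionNeighbourhood(word, K)) {
                dht.computeIfAbsent(n.hashCode(), h -> new HashSet<>()).add(word);
            }
        }

        // Search: generate the query's neighbourhood, get() each variant, verify candidates.
        static Set<String> search(String query) {
            Set<String> hits = new TreeSet<>();
            for (String n : deletionNeighbourhood(query, K)) {
                for (String candidate : dht.getOrDefault(n.hashCode(), Collections.emptySet())) {
                    if (editDistance(query, candidate) <= K) hits.add(candidate);
                }
            }
            return hits;
        }

        // Standard O(mn) dynamic-programming edit distance, used here only for verification.
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            for (String w : List.of("test", "east", "fest")) indexWord(w);
            System.out.println(search("tesx"));   // prints [test]
        }
    }

Running the sketch indexes test, east and fest; search("tesx") returns [test], because query and target share the deletion neighbour tes. In the real system each neighbour key is looked up on a different node, which is why the extra lookups can proceed in parallel.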
P2PFastSS implementation
- Java, with a DHT based on the Kademlia routing algorithm
- Deployed on ~360 PlanetLab hosts, with up to 100 nodes per host (~34,200 nodes in total)
- Edit distance bound k set to 1
- 100 Wikipedia abstracts indexed; every word of length 3 to 16 was indexed (2,392 words in total)
- All experiments carried out 50 times; average values shown, with error bars

P2PFastSS – number of messages for indexing
- (1) High standard deviation (short words need fewer messages)
- (2) Overhead over exact indexing

P2PFastSS – number of messages in search
- Word length 7, k=1
- The overhead introduced by P2PFastSS is the neighbourhood size, roughly m^k (here, the 7 deletion variants of a 7-letter word)

P2PFastSS – performance
- Indexing time: similarity indexing 0.67 to 16.99 s; exact indexing 0.18 to 15.94 s
- Lookup time: similarity search 0.5 to 11.6 s (average under 3 s); exact search 0.2 to 4.5 s (average about 2 s)
- High variability due to real-world conditions
- Storage operations are slower than searching, because keywords are stored redundantly

Conclusions for P2PFastSS
- Message overhead is ~7 times larger than for exact search (7 deletion variants of a word of length 7)
- P2PFastSS is only about 1.5 times slower than exact search
- The difference is due to the benefits of parallel communication (neighbours are searched in parallel)
- FastSS is the key to similarity search in structured P2P networks, on intranets, and for service descriptions

PLANS: Mobile FastSS
- Mobile context: P2P, semantics lookup, indexing, DHT, phones
- Android from Google

PLANS: FastSS for biological data
- Index a number of species (12 Drosophila species)
- Search for RNA motifs
- Goal: understand some complex biological phenomena
- Algorithmic challenges remain to be resolved

Selected collaborations and other work
- Semantics and data integration in P2P and client-server settings – Uni ZH (P. Ziegler), Montpellier (Z. Bellahsene)
- Searching with errors – Uni ZH (T. Bocek, B. Stiller), ETH (pharmacogenomics, A. Gerber), Glasgow (cardiology, A. Dominiczak)
- User studies in biomedical visualisation – Glasgow (A. Jakubowska, M. Chalmers, A. Dominiczak)

Recent co-authored papers
- CartoonPlus: A New Scaling Algorithm for Genomics Data, ICCS 2008
- VisGenome and Ensembl: Usability of Integrated Genome Maps, DILS 2008 (one of five best papers, nominated for the best paper award)
- A Call for Personal Semantic Data Integration, IIMAS 2008, in conjunction with ICDE 2008
- PORSCHE: Performance ORiented SCHEma Mediation, Information Systems, 2008
- VisGenome: visualisation of single and comparative genome representations, Bioinformatics, 2007
- XBenchMatch: a Benchmark for XML Schema Matching Tools, VLDB’07 (demo)
- Defining Mapping Mashups with BioXMash, Journal of Integrative Bioinformatics, 2007

Thanks
Hepatica nobilis