Fast Searching in Wikipedia, P2P, and Biological Sequences

Ela Hunt, Marie Curie Fellow
Ela Hunt, Department of Computer Science, GlobIS, hunt@inf.ethz.ch
Edinburgh, May 1st 2008
My background
• Krakow, Glasgow, Berlin, Glasgow, Zurich
• Programming: BP Exploration, Max-Planck Institute for Molecular Genetics, Berlin
• 1998-2002: PhD at Glasgow, large disk-based indexes for DNA/AA
• 2001-2008: Fellowships (MRC, EU)
Ela Hunt, ETH Zurich, elahunt@inf.ethz.ch
My fellowship – GlobalBioRes – global information systems for biomedical research
• APPLICATION AREAS
  - Search for genes causing disease, and for drug targets in heart disease and cancer
• GOALS
  - Human/computing efficiency and speed
• METHODS
  - Algorithms and indexing
  - Visualisation and user studies
  - Prototypes
Overview
• WHY bother doing indexing research
• HOW – options
• Some solutions
• Future developments
Search Problems
• DNA: find sequences of 25 letters with 1-2 mismatches in HS, MM, and RN (human, mouse, and rat genomes; 9 GB)
• Proteins: find motifs of length 7 with 1-2 mismatches
• Wikipedia
Corporate vs private search
• Corporate (Google) – data mining, cannot predict new misspellings, privacy (??)
• Wikipedia – no logging, no data mining, privacy protection
• Exhaustive enumeration of POSSIBLE misspellings, instead of lookup of known ones
MOTIVATION for new search techniques
• Currently heuristics + parallelism: with heuristics it is impossible to define query criteria algorithmically; statistical criteria are used instead but are poorly understood by biologists and other humans
• Parallelism: O(n) -> O(n)
• Indexing: O(n) -> O(log n)
• GOALS: exhaustiveness, quality, speed
Parallelism vs indexing
• Parallelism (FPGAs): O(n)/k is still O(n)
• Indexing: O(n) -> O(log n)
  - Cheaper (fewer boxes)
  - Algorithmic challenge
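A rough illustration of the asymptotic argument, assuming an idealised cost model in which a scan touches every item and an index lookup behaves like a binary search. The function names and the choice of k = 1000 machines are assumptions made for this sketch, not figures from the talk.

```python
import math
from bisect import bisect_left

def scan_steps(n: int, k: int) -> int:
    """Items each of k parallel workers must inspect in a full scan: O(n)/k."""
    return math.ceil(n / k)

def index_steps(n: int) -> int:
    """Comparisons for a binary search over n sorted items: O(log n)."""
    return math.ceil(math.log2(n)) if n > 1 else 1

def indexed_contains(sorted_items, target) -> bool:
    """Membership test on a sorted list, standing in for an index lookup."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

n = 9_000_000_000            # roughly the 9 GB of sequence data mentioned earlier
print(scan_steps(n, 1000))   # per-machine work still grows linearly with n
print(index_steps(n))        # grows only logarithmically with n
```

Doubling n doubles the first number regardless of k, while the second grows by one step.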
VLDB’01: index building
• Challenge: cannot build indexes larger than RAM (memory bottleneck)
• Cause: indexes have vertical and horizontal links; when the graph is serialized depth-first or breadth-first, both directions cannot be combined efficiently when writing to disk, nor combined with tree construction
• Cure: drop suffix links, partition the tree vertically; O(n) construction -> O(n log n), same speed
Suffix Sequoia: IEEE Data Eng. Bull. '04
• Enhancing an index to the genome with a bit map, using Java direct disk reads, tests with proteins
[Figure: query -> mutated query (variants) -> FILTER: bitmap encoding indexed substrings -> substring index]
BNCOD’07: approximate search for short strings on relations
• A substring (n-gram) index for biological sequences
• Approximate search: mutate a query, then check in a bit map if the string is present in the index, then query the index
• Good performance for large sets of data (proteins, queries of length 7), faster than other methods and exhaustive
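The mutate, filter, then look-up sequence can be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the BNCOD'07 implementation: the names `encode`, `build`, `variants`, and `search` are hypothetical, the n-gram length is shortened to 4 to keep the bitmap small, and only replacement mismatches (at most one) are generated.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"               # the 20 standard amino-acid letters
CODE = {c: i for i, c in enumerate(AMINO)}

def encode(gram: str) -> int:
    """Map an n-gram to its base-20 integer, used as a bit position."""
    value = 0
    for c in gram:
        value = value * len(AMINO) + CODE[c]
    return value

def build(sequences, n):
    """Build a presence bitmap and an n-gram -> [(sequence id, offset)] index."""
    bitmap = bytearray((len(AMINO) ** n + 7) // 8)
    index = {}
    for sid, seq in enumerate(sequences):
        for pos in range(len(seq) - n + 1):
            gram = seq[pos:pos + n]
            code = encode(gram)
            bitmap[code >> 3] |= 1 << (code & 7)
            index.setdefault(gram, []).append((sid, pos))
    return bitmap, index

def variants(query):
    """Yield the query and every string that differs from it in one position."""
    yield query
    for i, original in enumerate(query):
        for c in AMINO:
            if c != original:
                yield query[:i] + c + query[i + 1:]

def search(bitmap, index, query):
    """Check each variant in the bitmap first; only hits reach the index."""
    hits = {}
    for v in variants(query):
        code = encode(v)
        if bitmap[code >> 3] & (1 << (code & 7)):
            hits[v] = index[v]
    return hits

bitmap, index = build(["MKVLAAGIV", "MKTLAAGLV"], n=4)
print(search(bitmap, index, "KVLA"))
# {'KVLA': [(0, 1)], 'KTLA': [(1, 1)]}
```

The bitmap costs one bit per possible n-gram, so most mutated variants are rejected without touching the (potentially disk-resident) index.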
Approximate search in P2P
• Finding words that are misspelled in web service descriptions (data mining cannot identify new misspellings, see Google)
• Deletion neighbourhood concept
• Index to all substrings with deletions
• Test on PlanetLab with Wikipedia articles
• Good results for English, worse for German and Dutch (due to long words)
• Edit distance: the minimum number of insertions, replacements, and deletions needed to transform one word into another
• test -> est -> east
• Edit distance (test, east) = 2 (1 delete, 1 insert)
• Time O(mn)
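For reference, a minimal sketch of the standard dynamic-programming computation of edit distance, which takes O(mn) time for words of lengths m and n. The function name `edit_distance` is chosen here for illustration; it keeps only two rows of the table at a time.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b in O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + cost))    # replace ca by cb, or match
        prev = curr
    return prev[-1]

print(edit_distance("test", "east"))  # 2
print(edit_distance("test", "fest"))  # 1
print(edit_distance("est", "east"))   # 1
```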
HASH Indexing based on edit distance and mutations
• Query -> generate mutated variants
• Look up variants in the index
• PROBLEM: many variants
• Neighbourhood size for word length m, alphabet size a, and edit distance k: choose k out of m positions (m^k), then for each chosen position delete, insert, or replace (with a-1 letters, (a-1)^k)
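To give a feel for how many variants a single query generates, here is a small sketch that enumerates every distinct string at edit distance 1 from a word. The function name `mutation_neighbourhood` and the two alphabets are assumptions made for illustration.

```python
import string

def mutation_neighbourhood(word: str, alphabet: str) -> set:
    """All distinct strings reachable from word by one delete, replace, or insert."""
    out = set()
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])                  # delete position i
        for c in alphabet:
            if c != word[i]:
                out.add(word[:i] + c + word[i + 1:])      # replace position i
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])              # insert before position i
    return out

print(len(mutation_neighbourhood("test", string.ascii_lowercase)))  # 230
print(len(mutation_neighbourhood("acgt", "acgt")))                  # 32
```

Even for k = 1 and a four-letter word, the 26-letter alphabet yields 230 lookups, and the count grows with both the alphabet size and k.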
Deletion neighbourhood, k – number of deletions
U(test, k=2) = {(test), (est,1), (tst,2), (tet,3), (tes,4), (st,1,1), (et,1,2), (es,1,3), (tt,2,2), (ts,2,3), (te,3,3)}
[Figure: tree of deletions from "test", with levels k=0, k=1, and k=2]
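The set U(test, k=2) can be reproduced with a short sketch. It assumes the notation used in the set above: each entry is a residual string followed by the 1-based positions deleted in succession, and successive positions are kept non-decreasing so that every choice of deleted letters is listed once. The function name `deletion_neighbourhood` is hypothetical.

```python
def deletion_neighbourhood(word: str, k: int) -> set:
    """Return {(residual, positions)} for all ways to delete up to k letters.

    positions holds the 1-based index deleted at each step, counted in the
    string as it stands at that step.
    """
    result = {(word, ())}
    frontier = [(word, ())]
    for _ in range(k):
        next_frontier = []
        for residual, positions in frontier:
            start = positions[-1] if positions else 1
            for p in range(start, len(residual) + 1):
                entry = (residual[:p - 1] + residual[p:], positions + (p,))
                result.add(entry)
                next_frontier.append(entry)
        frontier = next_frontier
    return result

for residual, positions in sorted(deletion_neighbourhood("test", 2),
                                  key=lambda e: (len(e[1]), e[1])):
    print(residual, positions)
```

For "test" and k = 2 this yields 11 entries: the word itself, 4 single deletions, and 6 double deletions.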
HASH indexing based on deletion neighbourhood
• Query -> deletion neighbourhood (size m^k)
• Index all substrings of n words with deletions (size n * m^k)
• Look up candidates by exact matching
• Traverse deletion lists, compare deletion offsets, calculate edit distance
• SMALLER neighbourhood, LARGER INDEX
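A minimal in-memory sketch of this scheme, under two simplifying assumptions: the index is a plain Python dictionary, and candidates are verified with a full dynamic-programming edit distance rather than by comparing deletion offsets as the slides describe. The names `residuals`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def residuals(word: str, k: int) -> set:
    """All strings obtained by deleting at most k letters from word."""
    out = set()
    for d in range(min(k, len(word)) + 1):
        for drop in combinations(range(len(word)), d):
            out.add("".join(c for i, c in enumerate(word) if i not in drop))
    return out

def edit_distance(a: str, b: str) -> int:
    """Standard O(mn) Levenshtein distance, used here only for verification."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def build_index(words, k: int):
    """Hash every residual of every dictionary word to the words producing it."""
    index = defaultdict(set)
    for w in words:
        for r in residuals(w, k):
            index[r].add(w)
    return index

def search(index, query: str, k: int):
    """Exact-match the query's residuals, then keep candidates within distance k."""
    candidates = set()
    for r in residuals(query, k):
        candidates |= index.get(r, set())
    return sorted(w for w in candidates if edit_distance(query, w) <= k)

index = build_index(["test", "east", "fest", "toast", "zebra"], k=2)
print(search(index, "test", 2))  # ['east', 'fest', 'test', 'toast']
print(search(index, "tesx", 2))  # ['fest', 'test']
```

In the second query "east" shares the residual "es" with "tesx" and so becomes a candidate, but its edit distance is 3, which is why a verification step is needed after the exact lookups.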
Deriving edit distance via deletion neighbourhood
• Edit distance (test, fest) = 1, same delete positions
FastSS Example (2)
• Edit distance (test, east) = 2, different delete positions
[Figure: Query and Target]
FastSS Example (3)
• Edit distance (est, east) = 1, different word lengths
[Figure: Query and Target]
FastSS in memory is fast
Index sizes – serialised Java index
FastSS on MySQL+PHP (English Wikipedia)
P2P scenario, based on DHT (NOMS’08)
• DHT (distributed hash table) – network overlay
• Insert: put(key, value)
• Lookup: value = get(key)
• With N nodes, O(log N) requests are needed
• Test data – Wikipedia, simulating service lookup
• Index documents using put(hash(document), document)
• Index all neighbours with k=1 (test, tes, tst, tet, est) using put(hash(neighbour), document address)
• Search for “tesx”
• Neighbours are generated (tesx, esx, tsx, tex, tes)
• get(hash(neighbour)) yields a pointer to the document
• get(pointer)
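The indexing and search steps can be sketched with a toy, single-process stand-in for the DHT. Everything here is an assumption made for illustration: `ToyDHT` is a dictionary with put/get, not Kademlia; keys are SHA-1 digests with a "doc:" or "word:" prefix to keep documents and neighbours apart; and the final edit-distance check on candidates is omitted.

```python
import hashlib

class ToyDHT:
    """In-memory stand-in for a distributed hash table (no networking)."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store.setdefault(key, []).append(value)

    def get(self, key):
        return self._store.get(key, [])

def h(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def neighbours(word: str) -> set:
    """Deletion neighbourhood for k=1: the word plus every single deletion."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def index_document(dht, document: str) -> str:
    address = h("doc:" + document)
    dht.put(address, document)                    # put(hash(document), document)
    for word in set(document.split()):
        if 3 <= len(word) <= 16:                  # word lengths used in the tests
            for n in neighbours(word):
                dht.put(h("word:" + n), address)  # put(hash(neighbour), address)
    return address

def search(dht, query: str):
    pointers = set()
    for n in neighbours(query):
        pointers.update(dht.get(h("word:" + n)))  # get(hash(neighbour))
    return sorted(doc for p in pointers for doc in dht.get(p))  # get(pointer)

dht = ToyDHT()
index_document(dht, "a test service")
index_document(dht, "weather forecast")
print(search(dht, "tesx"))  # ['a test service']
```

The query "tesx" and the indexed word "test" meet at the shared neighbour "tes". In a real DHT the get calls for the neighbours can be issued in parallel.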
• P2PFastSS implementation
  - Java, a DHT based on the Kademlia routing algorithm
  - Deployed on ~360 PlanetLab hosts
    - up to 100 nodes on each PlanetLab host
    - ~34,200 nodes in total
  - Edit distance (k) set to 1
• 100 Wikipedia abstracts indexed
  - Every word with length 3 to 16 was indexed
  - Total 2,392 words
• All experiments carried out 50 times
  - Average values shown, with error bars
P2PFastSS – Number of Messages for Indexing
• (1) High standard deviation (short words need fewer messages), (2) Overhead over exact indexing
P2PFastSS – Number of Messages in Search
• Word length 7, k=1; overhead introduced by P2PFastSS is m^k
P2PFastSS – Performance
• Indexing time
  - Similarity indexing: 0.67 to 16.99 s
  - Exact indexing: 0.18 to 15.94 s
• Lookup time
  - Similarity search: 0.5 to 11.6 s (average is less than 3 s)
  - Exact search: 0.2 to 4.5 s (average about 2 s)
• High variability due to real-world conditions
• Storage operation is slower than searching
  - keywords are stored redundantly
Conclusions for P2PFastSS
• Message overhead
  - ~7 times larger than exact search (7 deletions in a word of length 7)
• P2PFastSS: only 1.5 times slower than exact search
• Difference due to benefits of parallel communication (neighbours are searched in parallel)
• FastSS is the key for similarity searches in structured P2P networks, Intranets, and for service descriptions
PLANS: Mobile FastSS
• Mobile context: P2P
• Semantics lookup
• Indexing
• DHT
• Phones
• Android from Google
PLANS: FastSS for biological data
• Index a number of species (12 Drosophila)
• Search for RNA motifs
• Goal: understand some complex biological phenomena
• Algorithmic challenges to be resolved
Selected collaborations and other work
• Semantics and data integration in P2P and client-server – Uni ZH (P. Ziegler), Montpellier (Z. Bellahsene)
• Searching with errors – Uni ZH (T. Bocek, B. Stiller), ETH (pharmacogenomics, A. Gerber), Glasgow (cardiology, A. Dominiczak)
• User studies in biomedical visualisation – Glasgow (A. Jakubowska, M. Chalmers, A. Dominiczak)
Recent co-authored papers
• CartoonPlus: A New Scaling Algorithm for Genomics Data, ICCS 2008
• VisGenome and Ensembl: Usability of Integrated Genome Maps, DILS 2008, one of five best papers, nominated for best paper award
• A Call for Personal Semantic Data Integration, IIMAS 2008, in conjunction with ICDE 2008
• PORSCHE: Performance ORiented SCHEma Mediation, Information Systems, 2008
• VisGenome: visualisation of single and comparative genome representations, Bioinformatics, 2007
• XBenchMatch: a Benchmark for XML Schema Matching Tools, VLDB’07, Demo
• Defining Mapping Mashups with BioXMash, Journal of Integrative Bioinformatics, 2007
Thanks
Hepatica nobilis