Lucene Near Realtime Search

Lucene
Near Realtime Search
Jason Rutherglen & Jake Mannix
LinkedIn
6/3/2009
SOLR/Lucene User’s Group
San Francisco
What is NRT?
• Search documents almost as soon as
they are indexed
• Delete documents in a way that is
immediate and IO efficient
• Good for things like Twitter and other apps
that require realtime searching (Social 2.0)
Today?
• Users expect to search their data
immediately after updating it (Web/Social
2.0 apps)
• Search engines are designed to perform
efficient batch indexing (not realtime)
• Batch indexing is slow and updates take a
while to be searchable
NRT in Lucene
• Uses core Lucene code to make existing
batch indexing nearly realtime
• Required retrofitting of some of the core
implementation
• Details are hidden
• Hopefully really easy for developers to use
Lucene NRT Patches
• LUCENE-1314 – IndexReader.clone
• LUCENE-1516 – IndexWriter.getReader
• LUCENE-1313 – RAMDir in IndexWriter
• LUCENE-1483 – Fast FieldCache loading
• LUCENE-1231 – Column stride fields
• LUCENE-1526 – Incremental copy-on-write
LUCENE-1314
• IndexReader.clone is like reopen
• However it performs a copy-on-write of
norms and deletes
• Used by LUCENE-1516 to keep deletes in
RAM (rather than flush them to disk)
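The copy-on-write idea behind clone can be sketched in plain Java. This is only an illustration under assumptions: CowDeletes, cloneReader, and the BitSet storage are hypothetical names chosen here, not Lucene's actual classes. A clone shares the deleted-docs bits with its source until one of them records a new delete, at which point the writer copies the bits first.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a reader's deleted-docs state; not Lucene's actual class. */
final class CowDeletes {
    private BitSet bits;     // may be shared with clones until the first write
    private boolean shared;  // true while another instance may reference 'bits'

    CowDeletes(int maxDoc) {
        this.bits = new BitSet(maxDoc);
        this.shared = false;
    }

    private CowDeletes(BitSet sharedBits) {
        this.bits = sharedBits;
        this.shared = true;
    }

    /** Cheap clone: no copy is made here, both instances point at the same bits. */
    CowDeletes cloneReader() {
        this.shared = true;
        return new CowDeletes(bits);
    }

    /** Copy-on-write: copy the shared bits once, then mutate the private copy. */
    void delete(int docId) {
        if (shared) {
            bits = (BitSet) bits.clone();
            shared = false;
        }
        bits.set(docId);
    }

    boolean isDeleted(int docId) {
        return bits.get(docId);
    }
}
```

A delete applied to the clone stays invisible to the source, which is what lets an older reader keep serving searches while a newer one accumulates deletes in RAM.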
LUCENE-1516
• Adds ability to obtain an IndexReader from
IndexWriter
• Efficient in-RAM deletes
• Call IndexWriter.getReader instead of
IndexReader.reopen
• All updating, deleting, reopening, and
flushing details hidden from user
• Will be in Lucene 2.9
Sample IW.getReader Code
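// Assumes Lucene 2.9; the directory, analyzer, and "id" field are illustrative
IndexWriter writer = new IndexWriter(new RAMDirectory(),
    new StandardAnalyzer(Version.LUCENE_29), IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);
// The reader sees the added document without a commit or reopen
IndexReader reader = writer.getReader();
Document sameDoc = reader.document(0);
// Document does not override equals(), so compare a stored field instead
assert "1".equals(sameDoc.get("id"));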
LUCENE-1313
• Near Realtime Search
• Makes IW.getReader faster
• New segments are flushed to IndexWriter
internal RAMDirectory
• Could increase overall indexing
performance because there’s no pause
while the ram buffer is being written to disk
• Will be in Lucene 2.9?
LUCENE-1483
• Searches on fieldcaches at the segment
level
• Means faster field cache loading and more
efficient memory usage
• Good for realtime because field cache
loading is less of a bottleneck, less ram
usage
• Will be in Lucene 2.9
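Why per-segment caching helps after a reopen can be shown with a small plain-Java sketch. The names here (SegmentFieldCache, Segment, loadCount) are hypothetical and the in-memory array stands in for values read from disk. The cache is keyed by segment, so when a reopen adds one new segment only that segment's values are loaded.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-segment field cache; illustrates the idea, not Lucene's FieldCache. */
final class SegmentFieldCache {

    /** Hypothetical segment: a name plus one field value per document. */
    static final class Segment {
        final String name;
        final int[] values;

        Segment(String name, int[] values) {
            this.name = name;
            this.values = values;
        }
    }

    private final Map<String, int[]> cache = new HashMap<>();
    private int loads = 0; // number of segments actually loaded

    /** Returns cached values for the segment, loading them only on first use. */
    int[] get(Segment seg) {
        int[] cached = cache.get(seg.name);
        if (cached == null) {
            cached = seg.values.clone(); // stands in for reading the field from disk
            cache.put(seg.name, cached);
            loads++;
        }
        return cached;
    }

    int loadCount() {
        return loads;
    }
}
```

With a cache keyed on the whole index, the same reopen would reload every value. Here unchanged segments are cache hits, so the cost of a reopen follows the size of the new segment.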
LUCENE-1526
• Optimize copy-on-write
• When we’re doing IndexReader.clone, we
may be creating a huge new array for a
small number of deletes or norms updates
• So we need to do incremental copy-on-write
of things like deletes, norms, and
field caches (?)
• Lucene 3.0?
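One possible way to make copy-on-write incremental is to split the array into fixed-size pages and copy only the page being written. The sketch below assumes that approach; PagedDeletes, the 1024-bit page size, and pagesCopied are hypothetical choices made here, not what the patch implements.

```java
import java.util.Arrays;

/** Hypothetical paged deleted-docs bit set with per-page copy-on-write. */
final class PagedDeletes {
    private static final int PAGE_BITS = 1024;       // bits per page (assumed size)
    private static final int WORDS = PAGE_BITS / 64; // longs per page

    private final long[][] pages;
    private final boolean[] owned; // owned[i]: pages[i] may be mutated in place
    private int copied = 0;        // pages this instance has had to copy

    PagedDeletes(int maxDoc) {
        int n = (maxDoc + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[n][WORDS];
        owned = new boolean[n];
        Arrays.fill(owned, true);
    }

    private PagedDeletes(long[][] sharedPages) {
        pages = sharedPages.clone();       // shallow copy: page pointers only
        owned = new boolean[pages.length]; // all false, every page is shared
    }

    /** Clone cost is proportional to the page count, not the document count. */
    PagedDeletes cloneReader() {
        Arrays.fill(owned, false); // the source must also copy before its next write
        return new PagedDeletes(pages);
    }

    void delete(int docId) {
        int p = docId / PAGE_BITS;
        if (!owned[p]) {           // copy just the one page being modified
            pages[p] = pages[p].clone();
            owned[p] = true;
            copied++;
        }
        int bit = docId % PAGE_BITS;
        pages[p][bit >>> 6] |= 1L << (bit & 63);
    }

    boolean isDeleted(int docId) {
        int bit = docId % PAGE_BITS;
        return (pages[docId / PAGE_BITS][bit >>> 6] & (1L << (bit & 63))) != 0;
    }

    int pagesCopied() {
        return copied;
    }
}
```

With this layout a single delete on a clone copies one 128-byte page rather than the whole bit set, which is the saving the bullet above is asking for.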
LUCENE-1231
• Column stride fields will make field cache
loading faster because data will be loaded
sequentially from disk
• Today there are potentially two hard drive
seeks per field cache value
(TermEnum.next, TermDocs.next)
• Lucene 3.0?
Future of Lucene NRT
• LUCENE-1292 – Realtime parallel
untokenized field index (for tags)
• Pulsing - Store smaller postings directly in
the term dictionary (to avoid seeks) for
faster field cache loading
• Replication
• More benchmarks
LinkedIn Open Source Projects
• Bobo – Facet library that counts using
custom field caches
http://code.google.com/p/bobo-browse/
• Zoie – Realtime search on top of Lucene
http://code.google.com/p/zoie/
• Voldemort – Distributed key-value storage
http://project-voldemort.com/
BoboBrowse: facet features
• MultiSelect
• Runtime-defined facets (query-based, etc)
• Fast (custom field-cache based)
• Custom facet types:
– Hierarchical (/a/b/c)
– Range
– Multivalued
Zoie: realtime features
• No modifications to core Lucene
• Multiple read/write: RAMDir + FSDir
• IndexReader on (small) RAMDir opened
per request: instantly realtime
• IndexReaderDecorator for custom Reader
• Transparent Indexing: implement
StreamDataProvider then inject
Next Steps
• Help work on the patches?
https://issues.apache.org/jira/browse/LUCENE
• LinkedIn is hiring
• Contact: jason.rutherglen@gmail.com or
jake.mannix@gmail.com