Lucene Near Realtime Search

Jason Rutherglen & Jake Mannix
LinkedIn
6/3/2009
SOLR/Lucene User’s Group, San Francisco

What is NRT?
• Search on documents nearly as fast as they are indexed
• Delete documents in a way that is immediate and IO efficient
• Good for things like Twitter and other apps that require realtime searching (Social 2.0)

Today?
• Users expect to search their data immediately after updating it (Web/Social 2.0 apps)
• Search engines are designed to perform efficient batch indexing (not realtime)
• Batch indexing is slow and updates take a while to be searchable

NRT in Lucene
• Uses core Lucene code to make existing batch indexing nearly realtime
• Required retrofitting of some of the core implementation
• Details are hidden
• Hopefully really easy for developers to use

Lucene NRT Patches
• LUCENE-1314 – IndexReader.clone
• LUCENE-1516 – IndexWriter.getReader
• LUCENE-1313 – RAMDir in IndexWriter
• LUCENE-1483 – Fast FieldCache loading
• LUCENE-1231 – Column stride fields
• LUCENE-1526 – Incremental copy-on-write

LUCENE-1314
• IndexReader.clone is like reopen
• However, it performs a copy-on-write of norms and deletes
• Used by LUCENE-1516 to keep deletes in RAM (rather than flush them to disk)

LUCENE-1516
• Adds the ability to obtain an IndexReader from IndexWriter
• Efficient in-RAM deletes
• Call IndexWriter.getReader instead of IndexReader.reopen
• All updating, deleting, reopening, and flushing details are hidden from the user
• Will be in Lucene 2.9

Sample IW.getReader Code
    // Lucene 2.9 API; classes come from org.apache.lucene.index, .document, .store and .analysis
    IndexWriter writer = new IndexWriter(new RAMDirectory(), new WhitespaceAnalyzer(),
                                         IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    IndexReader reader = writer.getReader();  // near-realtime reader, no commit needed
    Document sameDoc = reader.document(0);
    // Document does not override equals(), so compare a stored field instead
    assert "1".equals(sameDoc.get("id"));

LUCENE-1313
• Near Realtime Search
• Makes IW.getReader faster
• New segments are flushed to IndexWriter’s internal RAMDirectory
• Could increase overall indexing performance because there’s no pause while the RAM buffer is being written to disk
• Will be in Lucene 2.9?
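
The copy-on-write of deletes described for IndexReader.clone (LUCENE-1314) can be illustrated with a small, self-contained sketch. This is not Lucene code: `CowReader` and its methods are hypothetical names, and the real patch shares norms and deleted-doc data using its own internal bookkeeping. The sketch only shows the basic idea, assuming a single thread: a clone shares the deleted-docs bits with its source, and whichever reader deletes first makes a private copy before writing.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a segment reader; not a Lucene class. */
final class CowReader {
    private BitSet deleted;       // may be shared with clones until someone writes
    private boolean ownsDeleted;  // true once this reader holds a private copy

    CowReader(int maxDoc) {
        this.deleted = new BitSet(maxDoc);
        this.ownsDeleted = true;
    }

    private CowReader(BitSet shared) {
        this.deleted = shared;
        this.ownsDeleted = false;
    }

    /** Cheap clone: no bits are copied, both readers point at the same BitSet. */
    CowReader cloneReader() {
        this.ownsDeleted = false;  // the source must also copy before its next write
        return new CowReader(deleted);
    }

    /** Copy-on-write: the shared BitSet is duplicated only on the first delete. */
    void deleteDocument(int docId) {
        if (!ownsDeleted) {
            deleted = (BitSet) deleted.clone();
            ownsDeleted = true;
        }
        deleted.set(docId);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }
}
```

A delete made through the clone is invisible to the source reader, and vice versa, while deletes made before the clone are visible to both.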
LUCENE-1483
• Searches on field caches at the segment level
• Means faster field cache loading and more efficient memory usage
• Good for realtime because field cache loading is less of a bottleneck, less RAM usage
• Will be in Lucene 2.9

LUCENE-1526
• Optimize copy-on-write
• When we’re doing IndexReader.clone, we may be creating a huge new array for a small number of deletes or norms updates
• So we need to do incremental copy-on-write of things like deletes, norms, and field caches (?)
• Lucene 3.0?

LUCENE-1231
• Column stride fields will make field cache loading faster because data will be loaded sequentially from disk
• Today there are potentially two hard drive seeks per field cache value (TermEnum.next, TermDocs.next)
• Lucene 3.0?

Future of Lucene NRT
• LUCENE-1292 – Realtime parallel untokenized field index (for tags)
• Pulsing - Store smaller postings directly in the term dictionary (to avoid seeks) for faster field cache loading
• Replication
• More benchmarks

LinkedIn Open Source Projects
• Bobo – Facet library that counts using custom field caches http://code.google.com/p/bobo-browse/
• Zoie – Realtime search on top of Lucene http://code.google.com/p/zoie/
• Voldemort – Distributed key-value storage http://project-voldemort.com/

BoboBrowse: facet features
• MultiSelect
• Runtime-defined facets (query-based, etc.)
• Fast (custom field-cache based)
• Custom facet types:
  – Hierarchical (/a/b/c)
  – Range
  – Multivalued

Zoie: realtime features
• No modifications to core Lucene
• Multiple read/write: RAMDir + FSDir
• IndexReader on (small) RAMDir opened per request: instantly realtime
• IndexReaderDecorator for custom Reader
• Transparent Indexing: implement StreamDataProvider then inject

Next Steps
• Help work on the patches? https://issues.apache.org/jira/browse/LUCENE
• LinkedIn is hiring
• Contact: jason.rutherglen@gmail.com or jake.mannix@gmail.com
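
As a note on the incremental copy-on-write idea from the LUCENE-1526 slide: one possible approach, assumed here for illustration and not taken from the patch, is to split the deleted-docs bits into fixed-size pages so that a delete after a clone copies one small page instead of the whole array. `PagedDeletes`, its page size, and its method names are all hypothetical, and the sketch ignores thread safety.

```java
import java.util.Arrays;

/** Hypothetical paged bit set; illustrates incremental copy-on-write, not the LUCENE-1526 patch. */
final class PagedDeletes {
    private static final int PAGE_BITS = 1024;               // bits per page (arbitrary choice)
    private static final int WORDS_PER_PAGE = PAGE_BITS / 64;

    private final long[][] pages;   // page table; individual pages may be shared with copies
    private final boolean[] owned;  // owned[i] is true if pages[i] is private to this instance

    PagedDeletes(int maxDoc) {
        int pageCount = (maxDoc + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[pageCount][WORDS_PER_PAGE];
        owned = new boolean[pageCount];
        Arrays.fill(owned, true);
    }

    private PagedDeletes(long[][] sharedPages) {
        pages = sharedPages.clone();        // copies only the page table, not the pages
        owned = new boolean[pages.length];  // all false: every page is shared
    }

    /** Cheap copy: all pages are shared until one side writes to them. */
    PagedDeletes copy() {
        Arrays.fill(owned, false);  // the source must also copy a page before writing to it
        return new PagedDeletes(pages);
    }

    /** Copies only the page that contains docId, and only if it is still shared. */
    void delete(int docId) {
        int p = docId / PAGE_BITS;
        if (!owned[p]) {
            pages[p] = pages[p].clone();
            owned[p] = true;
        }
        int bit = docId % PAGE_BITS;
        pages[p][bit >>> 6] |= 1L << (bit & 63);
    }

    boolean isDeleted(int docId) {
        int bit = docId % PAGE_BITS;
        return (pages[docId / PAGE_BITS][bit >>> 6] & (1L << (bit & 63))) != 0;
    }

    /** For demonstration: true if page p is the same array in both instances. */
    boolean sharesPageWith(PagedDeletes other, int p) {
        return pages[p] == other.pages[p];
    }
}
```

With this layout, a single delete after a copy duplicates 128 bytes of bits rather than an array sized to the whole segment; the trade-off is an extra indirection on every lookup.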
