Lucene Near Realtime Search
Jason Rutherglen & Jake Mannix
LinkedIn
6/3/2009
SOLR/Lucene User’s Group San Francisco

What is NRT?
• Search on documents nearly as fast as they are indexed
• Delete documents in a way that is immediate and IO efficient
• Good for things like Twitter and other apps that require realtime searching (Social 2.0)

Today?
• Users expect to search their data immediately after updating it (Web/Social 2.0 apps)
• Search engines are designed to perform efficient batch indexing (not realtime)
• Batch indexing is slow and updates take a while to be searchable

NRT in Lucene
• Uses core Lucene code to make existing batch indexing nearly realtime
• Required retrofitting of some of the core implementation
• Details are hidden
• Hopefully really easy for developers to use

Lucene NRT Patches
• LUCENE-1314 – IndexReader.clone
• LUCENE-1516 – IndexWriter.getReader
• LUCENE-1313 – RAMDir in IndexWriter
• LUCENE-1483 – Fast FieldCache loading
• LUCENE-1231 – Column stride fields
• LUCENE-1526 – Incremental copy-on-write

LUCENE-1314
• IndexReader.clone is like reopen
• However, it performs a copy-on-write of norms and deletes
• Used by LUCENE-1516 to keep deletes in RAM (rather than flushing them to disk)

LUCENE-1516
• Adds the ability to obtain an IndexReader from IndexWriter
• Efficient in-RAM deletes
• Call IndexWriter.getReader instead of IndexReader.reopen
• All updating, deleting, reopening, and flushing details are hidden from the user
• Will be in Lucene 2.9

Sample IW.getReader Code
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.RAMDirectory;

    // Lucene 2.9 API; an in-memory directory keeps the example small
    IndexWriter writer = new IndexWriter(new RAMDirectory(), new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    IndexReader reader = writer.getReader();  // sees the added document without a commit
    Document sameDoc = reader.document(0);
    assert "1".equals(sameDoc.get("id"));     // Document has no equals(); compare a stored field

LUCENE-1313
• Near Realtime Search
• Makes IW.getReader faster
• New segments are flushed to a RAMDirectory internal to IndexWriter
• Could increase overall indexing performance because there’s no pause while the RAM buffer is being written to disk
• Will be in Lucene 2.9?

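A note on the copy-on-write behavior mentioned under LUCENE-1314: the idea can be illustrated with a small, self-contained sketch. This is not Lucene's implementation. The class name CowDeletes and its methods are hypothetical, and it uses java.util.BitSet and a simple "shared" flag in place of Lucene's internal data structures. It only shows the principle that a clone is cheap until one side records a delete.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a reader's deleted-docs state; not a Lucene class. */
final class CowDeletes {
    private BitSet deleted;   // bit i set => document i is deleted
    private boolean shared;   // true while the bit set may be referenced by another instance

    CowDeletes(int maxDoc) {
        this.deleted = new BitSet(maxDoc);
        this.shared = false;
    }

    private CowDeletes(BitSet sharedBits) {
        this.deleted = sharedBits;
        this.shared = true;
    }

    /** Cheap clone: both instances reference the same bits until one of them writes. */
    CowDeletes cloneReader() {
        this.shared = true;
        return new CowDeletes(this.deleted);
    }

    /** Copy the bits on the first write after a clone, then mark the document deleted. */
    void delete(int docId) {
        if (shared) {
            deleted = (BitSet) deleted.clone();  // the full copy happens here, not in cloneReader()
            shared = false;
        }
        deleted.set(docId);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }

    public static void main(String[] args) {
        CowDeletes original = new CowDeletes(8);
        original.delete(1);
        CowDeletes clone = original.cloneReader();
        clone.delete(2);  // triggers the copy; the original is unaffected
        System.out.println(original.isDeleted(2) + " " + clone.isDeleted(2));
    }
}
```

The flag is deliberately conservative: an instance that has become the sole owner may still copy once more on its next write. A reference count would avoid that extra copy at the cost of more bookkeeping.
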
LUCENE-1483
• Searches on field caches at the segment level
• Means faster field cache loading and more efficient memory usage
• Good for realtime because field cache loading is less of a bottleneck, less RAM usage
• Will be in Lucene 2.9

LUCENE-1526
• Optimize copy-on-write
• When we’re doing IndexReader.clone, we may be creating a huge new array for a small number of deletes or norms updates
• So we need to do incremental copy-on-write of things like deletes, norms, and field caches (?)
• Lucene 3.0?

LUCENE-1231
• Column stride fields will make field cache loading faster because data will be loaded sequentially from disk
• Today there are potentially two hard drive seeks per field cache value (TermEnum.next, TermDocs.next)
• Lucene 3.0?

Future of Lucene NRT
• LUCENE-1292 – Realtime parallel untokenized field index (for tags)
• Pulsing – Store smaller postings directly in the term dictionary (to avoid seeks) for faster field cache loading
• Replication
• More benchmarks

LinkedIn Open Source Projects
• Bobo – Facet library that counts using custom field caches
  http://code.google.com/p/bobo-browse/
• Zoie – Realtime search on top of Lucene
  http://code.google.com/p/zoie/
• Voldemort – Distributed key-value storage
  http://project-voldemort.com/

BoboBrowse: facet features
• MultiSelect
• Runtime-defined facets (query-based, etc)
• Fast (custom field-cache based)
• Custom facet types:
  – Hierarchical (/a/b/c)
  – Range
  – Multivalued

Zoie: realtime features
• No modifications to core Lucene
• Multiple read/write: RAMDir + FSDir
• IndexReader on (small) RAMDir opened per request: instantly realtime
• IndexReaderDecorator for custom Reader
• Transparent Indexing: implement StreamDataProvider then inject

Next Steps
• Help work on the patches? https://issues.apache.org/jira/browse/LUCENE
• LinkedIn is hiring
• Contact: jason.rutherglen@gmail.com or jake.mannix@gmail.com
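
As a closing note on the LUCENE-1526 slide, here is one way the incremental copy-on-write idea could look. This is a sketch under assumptions, not the patch itself: the class PagedDeletes, its page size, and its method names are all hypothetical. The deletes are split into fixed-size pages, a snapshot copies only the page references, and a later delete copies just the one page it touches instead of the whole array.

```java
import java.util.Arrays;

/** Hypothetical paged delete set; not a Lucene class. */
final class PagedDeletes {
    private static final int PAGE_BITS = 1024;        // documents per page (illustrative value)
    private static final int WORDS = PAGE_BITS / 64;  // 64-bit words per page

    private final long[][] pages;   // a null page means no deletes in that range
    private final boolean[] owned;  // owned[p] => this instance may mutate pages[p] in place

    PagedDeletes(int maxDoc) {
        int pageCount = (maxDoc + PAGE_BITS - 1) / PAGE_BITS;
        pages = new long[pageCount][];
        owned = new boolean[pageCount];
    }

    private PagedDeletes(long[][] sharedPages) {
        pages = sharedPages.clone();               // shallow copy: page references only
        owned = new boolean[sharedPages.length];   // nothing is owned until written
    }

    /** Cheap snapshot: cost is proportional to the number of pages, not documents. */
    PagedDeletes snapshot() {
        Arrays.fill(owned, false);  // every page is now shared with the snapshot
        return new PagedDeletes(pages);
    }

    /** Copies at most one page before recording the delete. */
    void delete(int docId) {
        int p = docId / PAGE_BITS;
        if (!owned[p]) {
            pages[p] = (pages[p] == null) ? new long[WORDS] : pages[p].clone();
            owned[p] = true;
        }
        int bit = docId % PAGE_BITS;
        pages[p][bit >>> 6] |= 1L << (bit & 63);
    }

    boolean isDeleted(int docId) {
        long[] page = pages[docId / PAGE_BITS];
        if (page == null) {
            return false;
        }
        int bit = docId % PAGE_BITS;
        return (page[bit >>> 6] & (1L << (bit & 63))) != 0;
    }

    /** Number of pages this instance has allocated or copied since it was created or last snapshotted. */
    int ownedPageCount() {
        int count = 0;
        for (boolean o : owned) {
            if (o) {
                count++;
            }
        }
        return count;
    }
}
```

With 4096 documents and this page size there are four pages, so a single delete after a snapshot copies 128 bytes of bits rather than the full set. The same paging approach would apply to norms, with byte pages in place of bit pages.
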