Lucene Boot Camp
Grant Ingersoll
Lucid Imagination
Nov. 4, 2008
New Orleans, LA

Schedule
• In-depth Indexing/Searching
  – Performance, Internals
  – Filters, Sorting
• Terms and Term Vectors
• Class Project
• Q&A

Day I Recap
• Indexing
  – IndexWriter
  – Document/Field
  – Analyzer
• Searching
  – IndexSearcher
  – IndexReader
  – QueryParser
• Analysis
• Contrib

Indexing In-Depth
• Deletions and Updates
• Optimize
• Important Internals
  – File Formats
  – Segments, Commits, Merging
  – Compound File System
• Performance

Lucene File Formats and Structures
• http://lucene.apache.org/java/2_4_0/fileformats.html
• A Lucene index is made up of one or more Segments
• Lucene tracks Documents internally by an int “id”
• This id may change across index operations
  – You should not rely on it unless you know your index isn’t changing
• You can ask for a Document by this id on the IndexReader

Segments
• Each Segment is an independent index containing:
  – Field Names
  – Stored Field values
  – Term Dictionary, proximity info and normalization factors
  – Term Vectors (optional)
  – Deleted Docs
• Compound File System (CFS) stores all of these logical pieces in a single file

How Lucene Indexes
• Lucene indexes Documents into memory
  – At certain trigger points, memory (segments) are committed/flushed to the Directory
    • Can be forced by calling commit()
  – Segments are periodically merged (more in a moment)

Segments and Merging
• May be created when new documents are added
• Are merged from time to time based on segment size in relation to:
  – MergePolicy
  – MergeScheduler
  – Optimization

Merge Policy
• Identifies Segments to be merged
• Two Current Implementations
  – LogDocMergePolicy
  – LogByteSizeMergePolicy
• mergeFactor - Max # of segments allowed before merging

MergeScheduler
• Responsible for performing the merge
• Two Implementations:
  – Serial - blocking
  – Concurrent - new, background

Optimize
• Optimize is the process of merging segments down into a single segment
• This process can
yield significant speedups in search
• Can be slow
• Can also do partial optimizes

Final Thoughts On Merging
• Usually don’t have to think about it, except when to optimize
• In high update, performance critical environments, you may need to dig into it more as it can sometimes cause long pauses
• Good to optimize when you can, otherwise, keep a low mergeFactor

Deletion
• A deletion only marks the Document as deleted
  – Doesn’t get physically removed until a merge
• Deletions can be a bit confusing
  – Both IndexReader and IndexWriter have delete methods
• By: id, term(s), Query(s)

Task
– Build your index from yesterday and then try some deletes
  • Id, term, Query
– Also try out an optimize on a FSDirectory against the full Reuters sample
– 15-20 minutes

Updates
• Updates are always a delete and an add
• Updates are always a delete and an add
  – Yes, that is a repeat!
  – Nature of data structures used in search
• See IndexWriter.updateDocument()

Performance Factors
• setRAMBufferSizeMB
  – New model for automagically controlling indexing factors based on the amount of memory in use
  – Obsoletes setMaxBufferedDocs
• maxBufferedDocs
  – Minimum # of docs buffered before a flush occurs and a new segment is created
  – Usually, Larger == faster, but more RAM

More Factors
• mergeFactor
  – How often segments are merged
  – Smaller == less RAM, better for incremental updates
  – Larger == faster, better for batch indexing
• maxFieldLength
  – Limit the number of terms in a Document
• Analysis
• Reuse
  – Document, TokenStream, Token

Index Threading
• IndexWriter and IndexReader are threadsafe and can be shared between threads without external synchronization
• One open IndexWriter per Directory
• Parallel Indexing
  – Index to separate Directory instances
  – Merge using IndexWriter.addIndexes
  – Could also distribute and collect

Benchmarking Indexing
• contrib/benchmark
• Try out different algorithms between Lucene 2.2 and 2.3
  – contrib/benchmark/conf:
    • indexing.alg
    •
indexing-multithreaded.alg
• Info:
  – Mac Pro 2 x 2GHz Dual-Core Xeon
  – 4 GB RAM
  – ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M

Benchmarking Results
                Records/Sec   Avg. T Mem
2.2                     421          39M
Trunk                 2,122          52M
Trunk-mt (4)          3,680          57M
Your results will depend on analysis, etc.

Searching
• Earlier we touched on basics of search using the QueryParser
• Now look at:
  – Searcher/IndexReader Lifecycle
  – Query classes
  – More details on the QueryParser
  – Filters
  – Sorting

Lifecycle
• Recall that the IndexReader loads a snapshot of index into memory
  – This means updates made since loading the index will not be seen
• Business rules are needed to define how often to reload the index, if at all
  – IndexReader.isCurrent() can help
• Loading an index is an expensive operation
  – Do not open a Searcher/IndexReader for every search

Reopen
• It is possible to have IndexReader reopen new or changed segments
  – Save some on the cost of loading a new index
• Does not close the old reader, so application must
• See DeletionsUpdatesTest.testReopen()

Query Classes
• TermQuery is basis for all non-span queries
• BooleanQuery combines multiple Query instances as clauses
  – should
  – required
• PhraseQuery finds terms occurring near each other, position-wise
  – “slop” is an edit distance: the number of position moves allowed between the terms
• Take 2-3 minutes to explore Query implementations

Spans
• Spans provide information about where matches took place
• Not supported by the QueryParser
• Can be used in BooleanQuery clauses
• Take 2-3 minutes to explore SpanQuery classes
  – SpanNearQuery useful for doing phrase matching

QueryParser
• MultiFieldQueryParser
• Boolean operators cause confusion
  – Better to think in terms of required (+ operator) and not allowed (- operator)
• Check JIRA for QueryParser issues
• http://www.gossamer-threads.com/lists/lucene/java-user/40945
• Most applications either modify QP, create their own, or restrict to a subset of the syntax
• Your users may not need all the “flexibility” of the QP

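To make the "required (+) and not allowed (-)" way of thinking concrete, here is a small plain-Java toy. It is not Lucene code and does not use the Lucene API; the class ClauseSketch and its methods are hypothetical names made up for this sketch. A Document is modeled as a set of terms. The rule for optional terms (they only decide a match when nothing is required) is an assumption chosen to mirror how BooleanQuery clauses (Occur.MUST, SHOULD, MUST_NOT in Lucene) are generally understood to behave.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of +required / -prohibited / optional clause matching (not Lucene code). */
public class ClauseSketch {

    /**
     * Returns true if a document (modeled as a set of terms) matches the clauses.
     * required:   every term must be present (the + operator)
     * prohibited: no term may be present     (the - operator)
     * optional:   only decides matching when there are no required terms,
     *             in which case at least one must be present
     */
    public static boolean matches(Set<String> doc, Set<String> required,
                                  Set<String> prohibited, Set<String> optional) {
        // A single prohibited term rules the document out.
        for (String t : prohibited) {
            if (doc.contains(t)) return false;
        }
        // With required terms, all of them must be present.
        if (!required.isEmpty()) {
            return doc.containsAll(required);
        }
        // No required terms: at least one optional term must be present.
        for (String t : optional) {
            if (doc.contains(t)) return true;
        }
        return false; // nothing positive to match on
    }

    /** Convenience helper to build a term set. */
    public static Set<String> terms(String... ts) {
        return new HashSet<String>(Arrays.asList(ts));
    }

    public static void main(String[] args) {
        Set<String> doc = terms("lucene", "index", "segment");
        // Reads like the query string: +lucene -solr merge
        System.out.println(matches(doc, terms("lucene"), terms("solr"), terms("merge")));
        // Reads like the query string: +lucene -index
        System.out.println(matches(doc, terms("lucene"), terms("index"), terms()));
    }
}
```

Note that a query made only of prohibited terms matches nothing in this model, which is one reason AND/OR/NOT intuition tends to mislead users.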
Sorting
• Lucene default sort is by score
• Searcher has several methods that take in a Sort object
• Sorting should be addressed during indexing
• Sorting is done on Fields containing a single term that can be used for comparison
• The SortField defines the different sort types available
  – AUTO, STRING, INT, FLOAT, CUSTOM, SCORE, DOC

Sorting II
• Look at Searcher, Sort and SortField
• Custom sorting is done with a SortComparatorSource
• Sorting can be very expensive
  – Terms are cached in the FieldCache

Filters
• Filters restrict the search space to a subset of Documents
• Use Cases
  – Search within a Search
  – Restrict by date
  – Rating
  – Security
  – Author

Filter Classes
• QueryWrapperFilter (QueryFilter)
  – Restrict to subset of Documents that match a Query
• RangeFilter
  – Restrict to Documents that fall within a range
  – Better alternative to RangeQuery
• CachingWrapperFilter
  – Wrap another Filter and provide caching

Task
• Modify your program to sort by a field and to filter by a query or some other criteria
  – ~15 minutes

Searchers
• MultiSearcher
  – Search over multiple Searchables, including remote
• MultiReader
  – Not a Searcher, but can be used with IndexSearcher to achieve same results for local indexes
• ParallelMultiSearcher
  – Like MultiSearcher, but threaded
• RemoteSearchable
  – RMI based remote searching
• Look at MultiSearcherTest in example code

Expert Results
• Searcher has several “expert” methods
• HitCollector allows low-level access to all Documents as they are scored

Search Performance
• Search speed is based on a number of factors:
  – Query Type(s)
  – Query Size
  – Analysis
  – Occurrences of Query Terms
  – Optimize
  – Index Size
  – Index type (RAMDirectory, other)
  – Usual Suspects
    • CPU
    • Memory
    • I/O
    • Business Needs

Query Types
• Be careful with WildcardQuery as it rewrites to a BooleanQuery containing all the terms that match the wildcards
• Avoid starting a WildcardQuery with wildcard
• Use ConstantScoreRangeQuery instead of RangeQuery
• Be careful
with range queries and dates
  – User mailing list and Wiki have useful tips for optimizing date handling

Query Size
• Stopword removal
• Search an “all” field instead of many fields with the same terms
• Disambiguation
  – May be useful when doing synonym expansion
  – Difficult to automate and may be slower
  – Some applications may allow the user to disambiguate
• Relevance Feedback/More Like This
  – Use most important words
  – “Important” can be defined in a number of ways

Usual Suspects
• CPU
  – Profile your application
• Memory
  – Examine your heap size, garbage collection approach
• I/O
  – Cache your Searcher
    • Define business logic for refreshing based on indexing needs
  – Warm your Searcher before going live -- See Solr
• Business Needs
  – Do you really need to support Wildcards?
  – What about date range queries down to the millisecond?

FieldSelector
• Prior to version 2.1, Lucene always loaded all Fields in a Document
• FieldSelector API addition allows Lucene to skip large Fields
  – Options: Load, Lazy Load, No Load, Load and Break, Load for Merge, Size, Size and Break
• Makes storage of original content more viable without large cost of loading it when not used
• FieldSelectorTest in example code

Relevance
• At some point along your journey, you will get results that you think are “bad”
• Is it a big deal?
  – Content, Content, Content!
  – Relevance Judgments
  – Don’t break other queries just to “fix” one
• Hardcode it!
  – A query doesn’t always have to result in a “search”

Scoring and Similarity
• Lucene has a sophisticated scoring mechanism designed to meet most needs
• Has hooks for modifying scores
• Scoring is handled by the Query, Weight and Scorer classes

Explanations
• explain(Query, int) method is useful for understanding why a Document scored the way it did
• Shows all the pieces that went into scoring the result:
  – Tf, DF, boosts, etc.

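To get a feel for the pieces an explanation reports (tf, document frequency, boosts), here is a simplified plain-Java sketch of one term's contribution to a score. It does not call Lucene; ExplainSketch and its method names are hypothetical. The formulas (square-root tf, log-based idf, 1/sqrt length norm) are what Lucene's default Similarity is generally understood to use in the 2.x line, so treat them as an assumption and check against real explain() output. Coord, queryNorm and the lossy byte encoding of norms are deliberately ignored.

```java
/** Simplified single-term score pieces, in the spirit of what explain() reports (not Lucene code). */
public class ExplainSketch {

    /** Term frequency factor: square root of the raw count in the Document. */
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: terms found in fewer Documents score higher. */
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Length normalization: matches in shorter fields score higher. */
    public static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    /** One term's contribution; ignores coord, queryNorm and norm byte-encoding. */
    public static double termScore(int freq, int docFreq, int numDocs,
                                   int numTermsInField, double boost) {
        double idf = idf(docFreq, numDocs);
        // idf is applied twice: once on the query side, once on the field side.
        return tf(freq) * idf * idf * boost * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        int freq = 4, docFreq = 9, numDocs = 1000, fieldLen = 100;
        System.out.println("tf         = " + tf(freq));
        System.out.println("idf        = " + idf(docFreq, numDocs));
        System.out.println("lengthNorm = " + lengthNorm(fieldLen));
        System.out.println("termScore  = " + termScore(freq, docFreq, numDocs, fieldLen, 1.0));
    }
}
```

Changing one input at a time (a higher boost, a rarer term, a longer field) shows which factor moves the score, which is the same exercise as reading an explanation.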
Tuning Relevance
• FunctionQuery from Solr (variation in Lucene)
• Override Similarity
• Implement own Query and related classes
• Payloads
• Boosts

Task
• Open Luke and try some queries and then use the “explain” button
• Or, write some code to do explains on a query and some documents
• See how Query type, boosting, other factors play a role in the score

Terms and Term Vectors
• Sometimes you need access to the Term Dictionary:
  – Auto suggest
  – Frequency information
• Sometimes you need a Document-centric view of terms, frequencies, positions and offsets
  – Term Vectors

Term Information
• TermEnum gives access to terms and how many Documents they occur in
  – IndexReader.terms()
• TermDocs gives access to the frequency of a term in a Document
  – IndexReader.termDocs()
  – TermPositions extends TermDocs and provides access to position and payload info
  – IndexReader.termPositions()

Term Vectors
• Term Vectors give access to term frequency information in a given Document
  – IndexReader.getTermFreqVector
• TermVectorMapper provides callbacks for working with Term Vectors

TermsTest
• Provides samples of working with terms and term vectors

Lunch ? 1-2:30

Recap
• Indexing
• Searching
• Performance
• Odds and Ends
  – Explains
  – FieldSelector
  – Relevance
  – Terms and Term Vectors

Class Project
• Your chance to really dig in and get your hands dirty
• Ask Questions
• Options…

Option I
• Start building out your Lucene Application!
  – Index your Data (or any data)
    • Threading/Updates/Deletions
    • Analysis
  – Search
    • Caching/Warming
    • Dealing with Updates
    • Multi-threaded
  – Display

Option II
• Dig deeper into an area of interest
  – Performance
    • How fast can you index?
    • Search? Queries per Second?
  – Analysis
  – Query Parsing
  – Scoring
  – Contrib

Option III
• Dig into JIRA issues and find something to fix in Lucene
• https://issues.apache.org/jira/secure/Dashboard.jspa
• http://wiki.apache.org/lucenejava/HowToContribute

Option IV
• Try out Solr
• http://lucene.apache.org/solr

Option V
• Other?
  – Architecture Review/Discussion
  – Use Case Discussion

Project Post-Mortem
• Volunteers to share?

Open Discussion
• Multilingual Best Practices
  – UNICODE
  – One Index versus many
• Advanced Analysis
• Distributed Lucene
• Crawling
• Hadoop
• Nutch
• Solr

Resources
• trainer@lucenebootcamp.com
• Lucid Imagination
  – Support
  – Training
  – Value Add
  – grant@lucidimagination.com

Finally…
• Please take the time to fill out a survey to help me improve this training
  – Located in base directory of source
  – Email it to me at trainer@lucenebootcamp.com
• There are several Lucene related talks on Wednesday