Lucene Boot Camp

Grant Ingersoll
Lucid Imagination
Nov. 4, 2008
New Orleans, LA
Schedule
• In-depth Indexing/Searching
– Performance, Internals
– Filters, Sorting
• Terms and Term Vectors
• Class Project
• Q&A
Day I Recap
• Indexing
– IndexWriter
– Document/Field
– Analyzer
• Searching
– IndexSearcher
– IndexReader
– QueryParser
• Analysis
• Contrib
Indexing In-Depth
• Deletions and Updates
• Optimize
• Important Internals
– File Formats
– Segments, Commits, Merging
– Compound File System
• Performance
Lucene File Formats and Structures
• http://lucene.apache.org/java/2_4_0/fileformats.html
• A Lucene index is made up of one or more
Segments
• Lucene tracks Documents internally by an int “id”
• This id may change across index operations
– You should not rely on it unless you know your index isn’t
changing
• You can ask for a Document by this id on the
IndexReader
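Fetching a stored Document by its internal id is a one-liner on IndexReader; a sketch against the 2.4 API (the index path and field name are made up):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Open a reader and ask for a Document by internal id.
// Remember: the id is only stable while the index is not changing.
IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/path/to/index"));
Document doc = reader.document(5);
System.out.println(doc.get("title"));   // read back a stored field
reader.close();
```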
Segments
• Each Segment is an independent index containing:
– Field Names
– Stored Field values
– Term Dictionary, proximity info and normalization
factors
– Term Vectors (optional)
– Deleted Docs
• Compound File System (CFS) stores all of these logical
pieces in a single file
How Lucene Indexes
• Lucene indexes Documents into memory
– At certain trigger points, the in-memory segments are committed/flushed to the Directory
• Can be forced by calling commit()
– Segments are periodically merged (more in a
moment)
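The buffering and flushing above can be sketched with the 2.4 IndexWriter API (the analyzer, field name, and buffer size are just example choices):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
    true, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setRAMBufferSizeMB(32.0);  // flush a new segment at ~32 MB of buffered docs

Document doc = new Document();
doc.add(new Field("body", "hello lucene", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);          // buffered in memory

writer.commit();                  // force the buffered segment out to the Directory
writer.close();
```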
Segments and Merging
• May be created when new documents are
added
• Are merged from time to time based on
segment size in relation to:
– MergePolicy
– MergeScheduler
– Optimization
Merge Policy
• Identifies Segments to be merged
• Two Current Implementations
– LogDocMergePolicy
– LogByteSizeMergePolicy
• mergeFactor - Max # of segments allowed
before merging
MergeScheduler
• Responsible for performing the merge
• Two Implementations:
– SerialMergeScheduler - blocks the calling thread
– ConcurrentMergeScheduler - newer; merges in background threads
Optimize
• Optimize is the process of merging
segments down into a single segment
• This process can yield significant speedups
in search
• Can be slow
• Can also do partial optimizes
Final Thoughts On Merging
• Usually don’t have to think about it, except
when to optimize
• In high-update, performance-critical environments, you may need to dig into it more, as merging can sometimes cause long pauses
• Good to optimize when you can, otherwise,
keep a low mergeFactor
Deletion
• A deletion only marks the Document as
deleted
– Doesn’t get physically removed until a merge
• Deletions can be a bit confusing
– Both IndexReader and IndexWriter
have delete methods
• By: id, term(s), Query(s)
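The delete-by-term and delete-by-query variants on IndexWriter look like this in the 2.4 API; `writer` is assumed to be an open IndexWriter, and the field names are hypothetical:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

// Mark documents deleted by Term; nothing is physically
// reclaimed until the affected segments are merged.
writer.deleteDocuments(new Term("id", "doc-42"));

// Or delete everything matching a Query (Lucene 2.4):
writer.deleteDocuments(new TermQuery(new Term("category", "obsolete")));

writer.commit();   // make the deletions visible to newly opened readers
```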
Task
– Build your index from yesterday and then try
some deletes
• Id, term, Query
– Also try out an optimize on a FSDirectory
against the full Reuters sample
– 15-20 minutes
Updates
• Updates are always a delete and an add
• Updates are always a delete and an add
– Yes, that is a repeat!
– Nature of data structures used in search
• See
IndexWriter.updateDocument()
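A sketch of the update idiom; it assumes an open IndexWriter `writer` and that every document carries a unique, untokenized "id" field:

```java
import org.apache.lucene.index.Term;

// updateDocument() is the delete-then-add performed atomically:
// delete everything matching the term, then add the new version.
writer.updateDocument(new Term("id", "doc-42"), newVersionOfDoc);
writer.commit();
```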
Performance Factors
• setRAMBufferSizeMB
– New model for automagically controlling indexing
factors based on the amount of memory in use
– Obsoletes setMaxBufferedDocs
• maxBufferedDocs
– Minimum # of docs buffered before a flush creates a new segment
– Usually, Larger == faster, but more RAM
More Factors
• mergeFactor
– How often segments are merged
– Smaller == less RAM, better for incremental updates
– Larger == faster, better for batch indexing
• maxFieldLength
– Limit the number of terms in a Document
• Analysis
• Reuse
– Document, TokenStream, Token
Index Threading
• IndexWriter and IndexReader are thread-safe and can be shared between threads without external synchronization
• One open IndexWriter per Directory
• Parallel Indexing
– Index to separate Directory instances
– Merge using IndexWriter.addIndexes
– Could also distribute and collect
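The collect step can be sketched with the 2.4 API: once each indexing thread has filled its own Directory, merge them into the main index (`writer`, `dir1`..`dir3` are assumed to exist):

```java
import org.apache.lucene.store.Directory;

// Fold the separately built indexes into the main one.
// addIndexesNoOptimize() merges without forcing a full optimize.
writer.addIndexesNoOptimize(new Directory[] { dir1, dir2, dir3 });
writer.optimize();   // optional: squeeze down to one segment afterwards
```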
Benchmarking Indexing
• contrib/benchmark
• Try out different algorithms between Lucene 2.2
and 2.3
– contrib/benchmark/conf:
• indexing.alg
• indexing-multithreaded.alg
• Info:
– Mac Pro 2 x 2GHz Dual-Core Xeon
– 4 GB RAM
– ant run-task -Dtask.alg=./conf/indexing.alg -Dtask.mem=1024M
Benchmarking Results
• 2.2: 421 records/sec, 39M avg. mem
• Trunk: 2,122 records/sec, 52M avg. mem
• Trunk-mt (4): 3,680 records/sec, 57M avg. mem
• Your results will depend on analysis, etc.
Searching
• Earlier we touched on basics of search
using the QueryParser
• Now look at:
– Searcher/IndexReader Lifecycle
– Query classes
– More details on the QueryParser
– Filters
– Sorting
Lifecycle
• Recall that the IndexReader loads a snapshot
of index into memory
– This means updates made since loading the index will
not be seen
• Business rules are needed to define how often to
reload the index, if at all
– IndexReader.isCurrent() can help
• Loading an index is an expensive operation
– Do not open a Searcher/IndexReader for every
search
Reopen
• It is possible to have IndexReader reopen new
or changed segments
– Save some on the cost of loading a new index
• Does not close the old reader, so the application must close the old one itself
• See
DeletionsUpdatesTest.testReopen()
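The reopen idiom, sketched against the 2.4 API (`dir` is an existing Directory):

```java
import org.apache.lucene.index.IndexReader;

IndexReader reader = IndexReader.open(dir);
// ... later, after the index may have changed ...
IndexReader newReader = reader.reopen();  // loads only new/changed segments
if (newReader != reader) {
  reader.close();        // reopen() does NOT close the old reader for you
  reader = newReader;
}
```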
Query Classes
• TermQuery is the basis for all non-span queries
• BooleanQuery combines multiple Query
instances as clauses
– should
– required
• PhraseQuery finds terms occurring near each
other, position-wise
– “slop” is the edit distance between two terms
• Take 2-3 minutes to explore Query
implementations
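A quick sketch of the clause types and slop described above (2.4 API; field and term values are made up):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

// required (MUST) and optional (SHOULD) clauses
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST);
bq.add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD);

// phrase query: "open source" allowing an edit distance (slop) of 1
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("body", "open"));
pq.add(new Term("body", "source"));
pq.setSlop(1);
```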
Spans
• Spans provide information about where
matches took place
• Not supported by the QueryParser
• Can be used in BooleanQuery clauses
• Take 2-3 minutes to explore SpanQuery
classes
– SpanNearQuery useful for doing phrase
matching
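A SpanNearQuery sketch showing the match-position access that plain queries lack (2.4 API; `reader` is an open IndexReader, terms are hypothetical):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

// "grant" within 3 positions of "lucene", in order
SpanQuery q = new SpanNearQuery(new SpanQuery[] {
    new SpanTermQuery(new Term("body", "grant")),
    new SpanTermQuery(new Term("body", "lucene"))
  }, 3, true);

Spans spans = q.getSpans(reader);   // where did the matches occur?
while (spans.next()) {
  System.out.println("doc " + spans.doc()
      + " positions [" + spans.start() + ", " + spans.end() + ")");
}
```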
QueryParser
• MultiFieldQueryParser
• Boolean operators cause confusion
– Better to think in terms of required (+ operator) and not
allowed (- operator)
• Check JIRA for QueryParser issues
•
http://www.gossamer-threads.com/lists/lucene/java-user/40945
• Most applications either modify QP, create their
own, or restrict to a subset of the syntax
• Your users may not need all the “flexibility” of the
QP
Sorting
• Lucene default sort is by score
• Searcher has several methods that take in a
Sort object
• Sorting should be addressed during indexing
• Sorting is done on Fields containing a single
term that can be used for comparison
• The SortField defines the different sort types
available
– AUTO, STRING, INT, FLOAT, CUSTOM, SCORE,
DOC
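A sorted search in the 2.4 API; `searcher` and `query` are assumed to exist, and the "date" field is a hypothetical single-term field:

```java
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopFieldDocs;

// sort by the untokenized "date" field, descending (true = reverse)
Sort sort = new Sort(new SortField("date", SortField.STRING, true));
TopFieldDocs top = searcher.search(query, null, 10, sort);
```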
Sorting II
• Look at Searcher, Sort and
SortField
• Custom sorting is done with a
SortComparatorSource
• Sorting can be very expensive
– Terms are cached in the FieldCache
Filters
• Filters restrict the search space to a
subset of Documents
• Use Cases
– Search within a Search
– Restrict by date
– Rating
– Security
– Author
Filter Classes
• QueryWrapperFilter (QueryFilter)
– Restrict to subset of Documents that match a Query
• RangeFilter
– Restrict to Documents that fall within a range
– Better alternative to RangeQuery
• CachingWrapperFilter
– Wrap another Filter and provide caching
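Combining the classes above, a sketch of a cached author filter (2.4 API; `searcher` and `query` assumed, field values made up):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// restrict the search space to one author's documents,
// and cache the filter's bits across searches on the same reader
Filter byAuthor = new QueryWrapperFilter(new TermQuery(new Term("author", "grant")));
Filter cached = new CachingWrapperFilter(byAuthor);
TopDocs hits = searcher.search(query, cached, 10);
```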
Task
• Modify your program to sort by a field and
to filter by a query or some other criteria
– ~15 minutes
Searchers
• MultiSearcher
– Search over multiple Searchables, including remote
• MultiReader
– Not a Searcher, but can be used with
IndexSearcher to achieve same results for local
indexes
• ParallelMultiSearcher
– Like MultiSearcher, but threaded
• RemoteSearchable
– RMI based remote searching
• Look at MultiSearcherTest in example
code
Expert Results
• Searcher has several “expert” methods
• HitCollector allows low-level access to all
Documents as they are scored
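The HitCollector hook looks like this in the 2.4 API (`searcher` and `query` assumed):

```java
import org.apache.lucene.search.HitCollector;

// see every matching doc as it is scored -- no sorting, no normalization
searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    // doc is the internal id; resolve stored fields later, not here,
    // since this is called in the hot path of scoring
  }
});
```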
Search Performance
• Search speed is based on a number of factors:
– Query Type(s)
– Query Size
– Analysis
– Occurrences of Query Terms
– Optimize
– Index Size
– Index type (RAMDirectory, other)
Usual Suspects
• CPU
• Memory
• I/O
• Business Needs
Query Types
• Be careful with WildcardQuery as it rewrites
to a BooleanQuery containing all the terms
that match the wildcards
• Avoid starting a WildcardQuery with a wildcard
• Use ConstantScoreRangeQuery instead of
RangeQuery
• Be careful with range queries and dates
– User mailing list and Wiki have useful tips for
optimizing date handling
Query Size
• Stopword removal
• Search an “all” field instead of many fields with the same
terms
• Disambiguation
– May be useful when doing synonym expansion
– Difficult to automate and may be slower
– Some applications may allow the user to disambiguate
• Relevance Feedback/More Like This
– Use most important words
– “Important” can be defined in a number of ways
Usual Suspects
• CPU
– Profile your application
• Memory
– Examine your heap size, garbage collection approach
• I/O
– Cache your Searcher
• Define business logic for refreshing based on indexing needs
– Warm your Searcher before going live -- See Solr
• Business Needs
– Do you really need to support Wildcards?
– What about date range queries down to the millisecond?
FieldSelector
• Prior to version 2.1, Lucene always loaded all
Fields in a Document
• FieldSelector API addition allows Lucene to
skip large Fields
– Options: Load, Lazy Load, No Load, Load and Break,
Load for Merge, Size, Size and Break
• Makes storage of original content more viable
without large cost of loading it when not used
• FieldSelectorTest in example code
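A FieldSelector sketch (2.4 API); `reader` and `docId` are assumed, and the "title"/"body" schema is hypothetical:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;

// load only "title" eagerly; any large "body" field is never read from disk
FieldSelector selector = new MapFieldSelector(new String[] { "title" });
Document doc = reader.document(docId, selector);
System.out.println(doc.get("title"));
```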
Relevance
• At some point along your journey, you will
get results that you think are “bad”
• Is it a big deal?
– Content, Content, Content!
– Relevance Judgments
– Don’t break other queries just to “fix” one
• Hardcode it!
– A query doesn’t always have to result in a “search”
Scoring and Similarity
• Lucene has sophisticated scoring
mechanism designed to meet most needs
• Has hooks for modifying scores
• Scoring is handled by the Query, Weight and Scorer classes
Explanations
• explain(Query, int) method is
useful for understanding why a Document
scored the way it did
• Shows all the pieces that went into scoring
the result:
– tf, idf, boosts, etc.
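Getting an explanation is a single call (2.4 API; `searcher`, `query`, and `docId` assumed):

```java
import org.apache.lucene.search.Explanation;

// why did this document get this score for this query?
Explanation expl = searcher.explain(query, docId);
System.out.println(expl.toString());  // term freq, idf, boosts, norms, clause by clause
```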
Tuning Relevance
• FunctionQuery from Solr (variation in
Lucene)
• Override Similarity
• Implement own Query and related classes
• Payloads
• Boosts
Task
• Open Luke and try some queries and then
use the “explain” button
• Or, write some code to do explains on a
query and some documents
• See how Query type, boosting, other
factors play a role in the score
Terms and Term Vectors
• Sometimes you need access to the Term
Dictionary:
– Auto suggest
– Frequency information
• Sometimes you need a Document-centric
view of terms, frequencies, positions and
offsets
– Term Vectors
Term Information
• TermEnum gives access to terms and how many
Documents they occur in
– IndexReader.terms()
• TermDocs gives access to the frequency of a
term in a Document
– IndexReader.termDocs()
– TermPositions extends TermDocs and
provides access to position and payload info
– IndexReader.termPositions()
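The term-dictionary walks above, sketched with the 2.4 API (`reader` is an open IndexReader; field/term values are made up):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// walk the term dictionary: every term plus its document frequency
TermEnum terms = reader.terms();
while (terms.next()) {
  System.out.println(terms.term() + " in " + terms.docFreq() + " docs");
}
terms.close();

// per-document frequency of a single term
TermDocs docs = reader.termDocs(new Term("body", "lucene"));
while (docs.next()) {
  System.out.println("doc " + docs.doc() + " freq " + docs.freq());
}
docs.close();
```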
Term Vectors
• Term Vectors give access to term frequency
information in a given Document
– IndexReader.getTermFreqVector
• TermVectorMapper provides callbacks
for working with Term Vectors
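A term vector sketch (2.4 API); it assumes the "body" field was indexed with Field.TermVector.YES, and `reader`/`docId` exist:

```java
import org.apache.lucene.index.TermFreqVector;

// terms and frequencies for one document's "body" field
TermFreqVector tfv = reader.getTermFreqVector(docId, "body");
String[] terms = tfv.getTerms();
int[] freqs = tfv.getTermFrequencies();
for (int i = 0; i < terms.length; i++) {
  System.out.println(terms[i] + ": " + freqs[i]);
}
```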
TermsTest
• Provides samples of working with terms
and term vectors
Lunch ?
1-2:30
Recap
• Indexing
• Searching
• Performance
• Odds and Ends
– Explains
– FieldSelector
– Relevance
– Terms and Term Vectors
Class Project
• Your chance to really dig in and get your
hands dirty
• Ask Questions
• Options…
Option I
• Start building out your Lucene Application!
– Index your Data (or any data)
• Threading/Updates/Deletions
• Analysis
– Search
• Caching/Warming
• Dealing with Updates
• Multi-threaded
– Display
Option II
• Dig deeper into an area of interest
– Performance
• How fast can you index?
• Search? Queries per Second?
– Analysis
– Query Parsing
– Scoring
– Contrib
Option III
• Dig into JIRA issues and find something to
fix in Lucene
• https://issues.apache.org/jira/secure/Dashboard.jspa
• http://wiki.apache.org/lucenejava/HowToContribute
Option IV
• Try out Solr
• http://lucene.apache.org/solr
Option V
• Other?
– Architecture Review/Discussion
– Use Case Discussion
Project Post-Mortem
• Volunteers to share?
Open Discussion
• Multilingual Best Practices
– UNICODE
– One Index versus many
• Advanced Analysis
• Distributed Lucene
• Crawling
• Hadoop
• Nutch
• Solr
Resources
• trainer@lucenebootcamp.com
• Lucid Imagination
– Support
– Training
– Value Add
– grant@lucidimagination.com
Finally…
• Please take the time to fill out a survey to
help me improve this training
– Located in base directory of source
– Email it to me at trainer@lucenebootcamp.com
• There are several Lucene related talks on
Wednesday