Apache Lucene
Ioan Toma
based on slides from Aaron Bannert aaron@codemass.com
Copyright 2012 STI INNSBRUCK www.sti-innsbruck.at
What is Apache Lucene?
“Apache Lucene(TM) is a high-performance, full-featured text search
engine library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially cross-platform.”
- from http://lucene.apache.org/
Features
• Scalable, High-Performance Indexing
– over 95GB/hour on modern hardware
– small RAM requirements -- only 1MB heap
– incremental indexing as fast as batch indexing
– index size roughly 20-30% the size of text indexed
• Powerful, Accurate and Efficient Search Algorithms
– ranked searching -- best results returned first
– many powerful query types: phrase queries, wildcard queries, proximity queries, range
queries and more
– fielded searching (e.g., title, author, contents)
– date-range searching
– sorting by any field
– multiple-index searching with merged results
– allows simultaneous update and searching
Features
• Cross-Platform Solution
– Available as Open Source software under the Apache License which
lets you use Lucene in both commercial and Open Source programs
– 100%-pure Java
– Implementations in other programming languages available that are
index-compatible
• CLucene - Lucene implementation in C++
• Lucene.Net - Lucene implementation in .NET
• Zend Search - Lucene implementation in the Zend Framework for PHP 5
Ranked Searching
1. Phrase Matching
2. Keyword Matching
– Prefer more unique terms first
– Scoring and ranking take into account the uniqueness of each term
when determining a document’s relevance score
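The idea that rarer terms count for more can be sketched with a simple term-frequency times inverse-document-frequency score. This is only an illustration of the principle, not Lucene's actual scoring formula; the function names and the sample documents are assumptions made for the example.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency: terms found in fewer documents weigh more."""
    containing = sum(1 for tokens in docs if term in tokens)
    return math.log((1 + len(docs)) / (1 + containing)) + 1.0

def score(query_terms, tokens, docs):
    """Sum over query terms of (occurrences in this document) * (term rarity)."""
    counts = Counter(tokens)
    return sum(counts[t] * idf(t, docs) for t in query_terms)

# Hypothetical, already-tokenized documents.
docs = [
    ["star", "wars", "a", "new", "hope"],
    ["star", "trek", "generations"],
    ["a", "star", "is", "born"],
]

query = ["star", "wars"]
# Document positions ordered by descending score (best match first).
ranked = sorted(range(len(docs)),
                key=lambda i: score(query, docs[i], docs),
                reverse=True)
```

Here "wars" appears in one document while "star" appears in all three, so "wars" contributes more to the score and the first document ranks highest.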
Flexible Queries
• Phrases
"star wars"
• Wildcards
star*
• Ranges
{star TO stun}
[2006 TO 2007]
• Boolean Operators
star AND wars
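What each query type asks of a document can be sketched over a plain token list. This is a conceptual illustration only: the helper names are hypothetical, the matching is done by scanning tokens rather than by consulting an index, and it is not how Lucene evaluates queries internally.

```python
from fnmatch import fnmatchcase

def phrase_match(tokens, phrase):
    """True if the phrase terms occur next to each other, in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def wildcard_match(tokens, pattern):
    """True if any token matches a shell-style pattern such as 'star*'."""
    return any(fnmatchcase(t, pattern) for t in tokens)

def range_match(tokens, low, high, inclusive=True):
    """True if any token sorts between low and high (string comparison)."""
    if inclusive:
        return any(low <= t <= high for t in tokens)
    return any(low < t < high for t in tokens)

# Hypothetical token list for one document.
tokens = ["star", "wars", "box", "set", "released", "2006"]

is_phrase = phrase_match(tokens, ["star", "wars"])
is_wildcard = wildcard_match(tokens, "star*")
in_year_range = range_match(tokens, "2006", "2007")                    # inclusive
in_term_range = range_match(tokens, "star", "stun", inclusive=False)   # exclusive
# Boolean operators combine the individual checks.
both = wildcard_match(tokens, "star*") and wildcard_match(tokens, "wars")
```

The exclusive range check fails for this document because "star" itself is the lower bound and no other token sorts between the two bounds.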
Field-specific Queries
• Field-specific queries can be used to target specific
fields in the Document Index.
• For example
title:"star wars" AND director:"George Lucas"
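A fielded query restricts each clause to one named field. The sketch below models documents as dictionaries and checks a phrase against a single field; the helper name and the sample records are assumptions for illustration, not part of Lucene.

```python
def field_contains_phrase(doc, field, phrase):
    """True if the words of `phrase` appear consecutively in doc[field]."""
    tokens = doc.get(field, "").lower().split()
    words = phrase.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

# Hypothetical documents, each a mapping of field name to text.
movies = [
    {"title": "Star Wars", "director": "George Lucas"},
    {"title": "American Graffiti", "director": "George Lucas"},
    {"title": "Star Trek", "director": "J. J. Abrams"},
]

# Both clauses must hold, each against its own field.
hits = [
    m["title"]
    for m in movies
    if field_contains_phrase(m, "title", "star wars")
    and field_contains_phrase(m, "director", "George Lucas")
]
```

Only the first record satisfies both clauses: the second matches the director field alone and the third matches neither phrase.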
Sorting
• Can sort by any field in a Document
– For example, by Price, Release Date, Amazon Sales Rank, etc…
• By default, Lucene will sort results by their relevance
score. Sorting by any other field in a Document is also
supported.
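The difference between the default ordering and a field sort can be sketched with plain sorting over a hit list. The field names and values are made up for the example.

```python
# Hypothetical search hits: a relevance score plus stored fields.
hits = [
    {"title": "Box Set", "score": 0.42, "price": 59.99},
    {"title": "Episode IV", "score": 0.91, "price": 14.99},
    {"title": "Soundtrack", "score": 0.67, "price": 9.99},
]

# Default behaviour: highest relevance score first.
by_relevance = sorted(hits, key=lambda h: h["score"], reverse=True)

# Alternative: order by a field of the document, here ascending price.
by_price = sorted(hits, key=lambda h: h["price"])
```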
LUCENE INTERNALS
Everything is a Document
• A document can represent anything textual:
– Word Document
– DVD (the textual metadata only)
– Website Member (name, ID, etc…)
• A Lucene Document need not refer to an actual file on disk;
it could also represent a row in a relational database.
• Developers are responsible for turning their own data
sets into Lucene Documents
• A document is seen as a list of fields, where a field has
a name and a value
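A document as a list of name/value fields, built by the developer from their own data, might look like the sketch below. The class names mirror the concepts on the slide but are plain Python stand-ins, not Lucene classes, and the database row is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    name: str
    value: str

@dataclass
class Document:
    fields: list = field(default_factory=list)

    def add(self, name, value):
        """Append a field; values are stored as text."""
        self.fields.append(Field(name, str(value)))

    def get(self, name):
        """Return the value of the first field with this name, or None."""
        for f in self.fields:
            if f.name == name:
                return f.value
        return None

def row_to_document(row):
    """Turn one database row (column -> value) into a Document."""
    doc = Document()
    for column, value in row.items():
        doc.add(column, value)
    return doc

# Hypothetical row from a "members" table.
member = row_to_document({"id": 42, "name": "Ada", "city": "Innsbruck"})
```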
Indexes
• The unit of indexing in Lucene is a term. A term is often
a word.
• Indexes track term frequencies
• Every term maps back to a Document
• Lucene uses an inverted index, which allows it to
quickly locate every document currently associated with
a given set of input search terms.
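A minimal inverted index can be sketched as a mapping from each term to the documents that contain it, together with a per-document term frequency. This is a toy in-memory structure under simplifying assumptions (whitespace tokenization, lowercase terms); Lucene's on-disk index format is far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map term -> {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def lookup(index, terms):
    """Return the ids of documents containing every one of the terms."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical documents keyed by id.
docs = {
    1: "star wars",
    2: "star trek",
    3: "wars of the roses",
}
index = build_inverted_index(docs)
matches = lookup(index, ["star", "wars"])
```

Answering the query touches only the posting lists for "star" and "wars"; no document text is scanned at search time.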
Basic Indexing
1. Parse different types of documents (HTML, PDF, Word, text files, etc.)
2. Extract tokens and related info (Lucene Analyzer)
3. Add the Document to an Index
Lucene provides a standard analyzer for English and Latin-based languages.
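The three indexing steps can be sketched end to end as follows. Everything here is a simplified stand-in: the regular expression that strips tags is no substitute for a real HTML parser, the stop-word list is hypothetical, and the analyzer only lowercases and splits on non-alphanumeric characters.

```python
import re

# Hypothetical stop-word list; a real analyzer ships its own.
STOP_WORDS = {"a", "an", "and", "the", "of", "is"}

def parse_html(raw):
    """Step 1: reduce a document format to plain text (naive tag stripping)."""
    return re.sub(r"<[^>]+>", " ", raw)

def analyze(text):
    """Step 2: extract tokens: lowercase, split, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def add_document(index, doc_id, text):
    """Step 3: record each term with its document id and token position."""
    for position, term in enumerate(analyze(text)):
        index.setdefault(term, []).append((doc_id, position))

index = {}
add_document(index, "doc1", parse_html("<h1>The Star Wars Saga</h1>"))
```

Positions are counted after stop words are removed, so "star" is at position 0 in this sketch.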
Basic Searching
1. Create a Query
• (e.g. by parsing user input)
2. Open an Index
3. Search the Index
• Use the same Analyzer as before
4. Iterate through returned Documents
• Extract needed results
• Extract result scores (if needed)
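The search steps can be sketched in the same spirit. The point to notice is that the query text goes through the same analysis function that was used at indexing time, otherwise query terms and indexed terms would not line up. The in-memory index, the scoring by raw term counts, and all names are assumptions made for the example.

```python
import re
from collections import Counter

def analyze(text):
    """Shared analyzer: used for both indexing and querying."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical documents; "opening the index" is just building this mapping.
docs = {
    "d1": "Star Wars: A New Hope",
    "d2": "Star Trek",
    "d3": "The War of the Worlds",
}
index = {doc_id: Counter(analyze(text)) for doc_id, text in docs.items()}

def search(index, user_input):
    query_terms = analyze(user_input)            # 1. create a query
    hits = []
    for doc_id, counts in index.items():         # 3. search the index
        doc_score = sum(counts[t] for t in query_terms)
        if doc_score > 0:
            hits.append((doc_id, doc_score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# 4. iterate through the returned documents and their scores
results = search(index, "STAR wars!")
titles = [docs[doc_id] for doc_id, _ in results]
```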
Lucene as SOA
1. Design an HTTP query syntax
– GET queries
– XML for results
2. Wrap Tomcat around core code
3. Write a Client Library
As it follows SOA principles, basic building blocks such
as load balancers can be deployed to quickly scale up
the capacity of the search subsystem.
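A GET-based query syntax with XML results could look roughly like the sketch below. The host name, the `q` and `limit` parameters, and the XML element names are all invented for illustration; the slides do not specify them. Python is used only to show the request and response shapes, not the Tomcat wrapper itself.

```python
from urllib.parse import urlparse, parse_qs
import xml.etree.ElementTree as ET

def parse_search_url(url):
    """Read the (hypothetical) q and limit parameters from a GET URL."""
    params = parse_qs(urlparse(url).query)
    return {
        "q": params.get("q", [""])[0],
        "limit": int(params.get("limit", ["10"])[0]),
    }

def results_to_xml(query, hits):
    """Serialize (doc_id, score) pairs as a small XML document."""
    root = ET.Element("results", {"query": query})
    for doc_id, hit_score in hits:
        ET.SubElement(root, "hit", {"id": doc_id, "score": f"{hit_score:.2f}"})
    return ET.tostring(root, encoding="unicode")

request = parse_search_url("http://search.example/search?q=star+wars&limit=5")
response_xml = results_to_xml(request["q"], [("d1", 2.0), ("d2", 1.0)])
```

A client library would then wrap the two directions: build the URL from a query, and parse the XML response back into result objects.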
Lucene as SOA Diagram
Single-Machine Architecture
A Lucene-based application includes three components:
1. Lucene Custom Client Library
2. Search Service
3. Custom Core Search Library
LUCENE SCALABILITY
Scalability Limits
• 3 main scalability factors:
– Query Rate
– Index Size
– Update Rate
Query Rate Scalability
• Lucene is already fast
– Built-in caching
• Easy solution for heavy workloads (gives near-linear scaling):
– Add more query servers behind a load balancer
– Can grow as your traffic grows
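Spreading queries over several identical query servers can be sketched with a round-robin dispatcher. In a real deployment this role is played by a dedicated load balancer in front of the search service; the classes and server names below are purely illustrative.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand each incoming query to the next server in turn."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def route(self, query):
        server = next(self._servers)
        return server(query)

def make_query_server(name):
    """Stand-in for a query server holding a copy of the index."""
    def handle(query):
        return f"{name} answered {query!r}"
    return handle

balancer = RoundRobinBalancer([make_query_server("s1"), make_query_server("s2")])
answers = [balancer.route("star wars") for _ in range(3)]
```

Because every server can answer any query, adding one more server to the list raises capacity without changing the clients.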
Lucene as SOA Diagram
High-Scale Multi-Machine Architecture
Index Size Scalability
• Can easily handle millions of Documents
• Lucene is very commonly deployed into systems with
tens of millions of Documents.
• The main index-size limits one is likely to run into
are disk capacity and disk I/O.
If you need bigger:
• Built-in multi-machine capabilities
– Can merge multiple remote indexes at query-time.
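Merging results from several indexes at query time amounts to combining ranked lists that are each already sorted by score. The sketch below does this lazily with a heap-based merge; the result lists are hypothetical, and it assumes the scores from the different indexes are directly comparable, which a real system has to ensure.

```python
import heapq
from itertools import islice

def merge_ranked(result_lists, limit):
    """Merge per-index hit lists (each sorted by descending score)."""
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[1])
    return list(islice(merged, limit))

# Hypothetical (doc_id, score) results from two remote indexes.
index_a_hits = [("a1", 0.9), ("a2", 0.4)]
index_b_hits = [("b1", 0.7), ("b2", 0.1)]

top = merge_ranked([index_a_hits, index_b_hits], limit=3)
```

Only as many hits as requested are pulled from the merged stream, so each index needs to return just its own top results.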
Update Rate
• Lucene is thread-safe
– Can update and query at the same time
• I/O is the limiting factor
Strategies for achieving even higher update rates:
– Vertical Split – for big workloads (Centralized Index Building)
1. Build indexes apart from query service
2. Push updated indexes on intervals
– Horizontal Split – for huge workloads
1. Split data into columns
2. Merge columns for queries
3. Columns only receive their own data for updates
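The horizontal split can be sketched as a set of independent partitions: an update is routed to exactly one column, while a query is sent to all columns and the results are merged. The hash-based routing, the class names, and the linear scan inside each column are assumptions made to keep the example small.

```python
import zlib

class Column:
    """One partition holding only its own share of the documents."""

    def __init__(self):
        self.docs = {}

    def update(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, term):
        return [doc_id for doc_id, text in self.docs.items()
                if term in text.lower().split()]

class PartitionedIndex:
    def __init__(self, column_count):
        self.columns = [Column() for _ in range(column_count)]

    def _column_for(self, doc_id):
        # Stable hash so a document always lands in the same column.
        slot = zlib.crc32(doc_id.encode("utf-8")) % len(self.columns)
        return self.columns[slot]

    def update(self, doc_id, text):
        """Updates touch a single column."""
        self._column_for(doc_id).update(doc_id, text)

    def search(self, term):
        """Queries fan out to every column and merge the hits."""
        hits = []
        for column in self.columns:
            hits.extend(column.search(term))
        return sorted(hits)

index = PartitionedIndex(column_count=3)
index.update("d1", "Star Wars")
index.update("d2", "Star Trek")
index.update("d3", "War of the Worlds")
found = index.search("star")
```

Each column sees only a fraction of the update traffic, which is what lets the update rate grow with the number of columns.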