Apache Lucene

Kiran Manzoor
Sajjad Athar
Waseem Zaman

What is Lucene
A free and open-source information retrieval software library, written in Java.
Download: http://lucene.apache.org/

Lucene-based projects
• Lucene Core: the flagship sub-project; provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities
• Apache Nutch: provides web crawling and HTML parsing
• Apache Solr: a high-performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface
• PyLucene: a Python port of the Core project

Basic Application
1. Parse docs
2. Write Index
3. Make query
4. Display results
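
The four steps above can be sketched without Lucene at all. The toy class below is only an illustration under stated assumptions: `TinySearchSketch` and its methods are hypothetical names, the "index" is an in-memory map from term to document ids, and the only query type is a single term.

```java
import java.util.*;

// Hypothetical illustration: a toy in-memory inverted index, not Lucene's API.
final class TinySearchSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // Steps 1 and 2: parse a document into terms and write them to the index.
    void add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // Step 3: a single-term query returns the ids of matching documents.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    // Step 4: display the matching documents.
    public static void main(String[] args) {
        TinySearchSketch index = new TinySearchSketch();
        index.add("Lucene is a search library");
        index.add("Solr is a search server built on Lucene");
        index.add("Nutch provides web crawling");
        for (int id : index.search("lucene")) {
            System.out.println(id + ": " + index.docs.get(id));
        }
    }
}
```

A real Lucene application replaces the map with an on-disk index and adds analysis, ranking and richer query types, but the parse, write, query, display flow stays the same.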

Why Lucene?
Scalable, High-Performance Indexing
• over 150GB/hour on modern hardware
• small RAM requirements -- only 1MB heap
• incremental indexing as fast as batch indexing
• index size roughly 20-30% the size of text indexed

Why Lucene? (contd.)
Powerful, Accurate and Efficient Search Algorithms
• ranked searching -- best results returned first
• many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
• fielded searching (e.g. title, author, contents)
• sorting by any field
• multiple-index searching with merged results
• allows simultaneous update and searching
• flexible faceting, highlighting, joins and result grouping
• fast, memory-efficient and typo-tolerant suggesters
• pluggable ranking models, including the Vector Space Model

Why Lucene? (contd.)
Cross-Platform Solution
• Open source under the Apache License, which lets you use Lucene in both commercial and open-source programs
• 100%-pure Java
• Implementations in other programming languages are available that are index-compatible

Languages with Lucene implementations
• Object Pascal
• Perl
• C#
• C++
• Python
• Ruby
• PHP

Websites using Lucene
• Twitter
• Apple
• Disney
• Chegg
• Instagram
• eBay
• NASA
• Netflix
• HP
https://lucidworks.com/2012/01/21/who-uses-lucenesolr/

Websites using Lucene (contd.)
• CiteSeerX: academic document search engine
• Bridge Loan: Java-based financial site. It relies on Lucene for its main operation
• CodeCrawler: a smart, web-based search engine built specifically for developers searching source code
• Hotels and Accommodation: a hotel & accommodation comparison engine
• Ghent University Library
• dinnerbase recept: Swedish search engine for recipes

How Lucene Works

Lucene in a search system
[Figure: components of a search system built on Lucene. Indexing side: Raw Content, Acquire content, Build document, Analyze document, Index document, Index. Search side: Users, Search UI, Build query, Run query, Render results.]

Lucene Jars
Core indexing classes
• IndexWriter
– Central component that allows you to create a new index, open an existing one, and add, remove, or update documents in an index
– Built on an IndexWriterConfig and a Directory
• Directory
– Abstract class that represents the location of an index
• Analyzer
– Extracts tokens from a text stream

Core indexing classes (contd.)
• Document
– Represents a collection of named Fields. Text in these Fields is indexed.
• Field
– Note: Lucene Fields can represent both “fields” and “zones” as described in the textbook
– Or even other things like numbers.
– StringFields are indexed but not tokenized
– TextFields are indexed and tokenized
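
To make the tokenized versus untokenized distinction concrete, here is a small plain-Java sketch. It does not use Lucene; `FieldKindsSketch` and both method names are hypothetical, and the splitting rule (lowercased runs of letters and digits) is an assumption chosen for illustration.

```java
import java.util.*;

// Toy model of the two field kinds; the method names are hypothetical, not Lucene API.
final class FieldKindsSketch {
    // StringField-like: the whole value is indexed as one term, so only an exact match finds it.
    static List<String> untokenized(String value) {
        return Collections.singletonList(value);
    }

    // TextField-like: the value is split into terms (here: lowercased runs of letters and digits).
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    public static void main(String[] args) {
        String path = "/docs/Lucene Notes.txt";
        System.out.println(untokenized(path)); // one term: the full path
        System.out.println(tokenized(path));   // several terms: docs, lucene, notes, txt
    }
}
```

This is why identifiers such as file names and paths are usually indexed untokenized, while body text is tokenized so that individual words can be searched.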
Indexing a directory
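// Index every readable, non-hidden regular file that passes the filter.
File[] files = new File(dataDirPath).listFiles();
if (files != null) { // listFiles() returns null if the path is not a directory
    for (File file : files) {
        if (!file.isDirectory() && !file.isHidden() && file.exists()
                && file.canRead() && filter.accept(file)) {
            logString += indexFile(file) + "\n";
        }
    }
}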
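
Prepare Document
// Pre-4.0 Field API; Lucene 4.0 and later offer TextField and StringField instead.
Document document = new Document();
// File contents: read from a Reader, tokenized and indexed but not stored.
Field contentField = new Field(LuceneConstants.CONTENTS, new FileReader(file));
// File name and path: stored and indexed as single, untokenized terms.
Field fileNameField = new Field(LuceneConstants.FILE_NAME, file.getName(),
        Field.Store.YES, Field.Index.NOT_ANALYZED);
Field filePathField = new Field(LuceneConstants.FILE_PATH, file.getCanonicalPath(),
        Field.Store.YES, Field.Index.NOT_ANALYZED);
document.add(contentField);
document.add(fileNameField);
document.add(filePathField);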

Creating an IndexWriter
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
...
private IndexWriter writer;

public Indexer(String dir) throws IOException {
    Directory indexDir = FSDirectory.open(new File(dir));
    ...
}

Index a Document with IndexWriter
private IndexWriter writer;
...
// Builds a Document from the file and adds it to the index.
private void indexFile(File f) throws Exception {
    Document doc = getDocument(f);
    writer.addDocument(doc);
}

Core searching classes
• IndexSearcher
– Central class that exposes several search methods on an index
– Reads the index through an IndexReader
• Query
– Abstract query class. Concrete subclasses represent specific types of queries, e.g., matching terms in fields, boolean queries, phrase queries, …
• QueryParser
– Parses a textual representation of a query into a Query instance

Core searching classes (contd.)
• TopDocs
– Contains references to the top documents returned by a search
• ScoreDoc
– Represents a single search result

IndexSearcher
[Figure: IndexSearcher takes a Query and returns TopDocs; it reads the index through an IndexReader opened on a Directory.]

Creating an IndexSearcher
import org.apache.lucene.search.IndexSearcher;
...
public static void search(String indexDir, String q)
        throws IOException, ParseException {
    // Open a reader on the index directory and wrap it in a searcher.
    IndexReader rdr = DirectoryReader.open(FSDirectory.open(new File(indexDir)));
    IndexSearcher is = new IndexSearcher(rdr);
    ...
}

Query and QueryParser
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
...
public static void search(String indexDir, String q)
        throws IOException, ParseException {
    ...
    // Parse the query string against the "contents" field.
    QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
    Query query = parser.parse(q);
    ...
}

search() returns TopDocs
import org.apache.lucene.search.TopDocs;
...
public static void search(String indexDir, String q)
        throws IOException, ParseException {
    ...
    // Ask for the 10 best hits; the limit of 10 is an arbitrary example value.
    TopDocs hits = is.search(query, 10);
}

TopDocs contain ScoreDocs
import org.apache.lucene.search.ScoreDoc;
...
public static void search(String indexDir, String q)
        throws IOException, ParseException {
    ...
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        // Fetch the stored fields of each hit by its document id.
        Document doc = is.doc(scoreDoc.doc);
        ...
    }
}
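
As a rough picture of what "top documents" means, the sketch below keeps the N best-scoring hits with a bounded min-heap. It is plain Java, not Lucene: `TopHitsSketch`, `Hit` and `topN` are hypothetical stand-ins, and how Lucene collects hits internally is not claimed here.

```java
import java.util.*;

// Hypothetical stand-ins for ScoreDoc/TopDocs, written to show how "top N by score" can be collected.
final class TopHitsSketch {
    static final class Hit {
        final int doc;      // document id
        final float score;  // relevance score
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Keeps only the n best hits using a min-heap ordered by score.
    static List<Hit> topN(List<Hit> all, int n) {
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (Hit h : all) {
            heap.add(h);
            if (heap.size() > n) heap.poll(); // drop the current lowest score
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return best; // highest score first
    }

    public static void main(String[] args) {
        List<Hit> all = Arrays.asList(new Hit(0, 0.2f), new Hit(1, 0.9f), new Hit(2, 0.5f));
        for (Hit h : topN(all, 2)) {
            System.out.println("doc=" + h.doc + " score=" + h.score);
        }
    }
}
```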
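
Closing IndexSearcher
public static void search(String indexDir, String q)
        throws IOException, ParseException {
    ...
    IndexSearcher is = ...;
    ...
    // Lucene 4.0 and later: IndexSearcher has no close(); close the IndexReader it wraps.
    is.getIndexReader().close();
}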

Analyzer
• Tokenizes the input text
• Common Analyzers
– WhitespaceAnalyzer: splits tokens on whitespace
– SimpleAnalyzer: splits tokens on non-letters, and then lowercases
– StopAnalyzer: same as SimpleAnalyzer, but also removes stop words
– StandardAnalyzer: most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...

Analysis example
• “The quick brown fox jumped over the lazy dog”
• WhitespaceAnalyzer
– [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
• SimpleAnalyzer
– [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
• StopAnalyzer
– [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
• StandardAnalyzer
– [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
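
The first three outputs above can be reproduced with a few lines of plain Java. These are toy re-implementations for illustration, not the Lucene classes: `AnalyzerSketch` is a hypothetical name and its stop list is a small assumed subset, shorter than Lucene's default English list.

```java
import java.util.*;

// Toy re-implementations for illustration only; these are not Lucene classes.
final class AnalyzerSketch {
    // Assumption: a small stop list; Lucene's default English list is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Like WhitespaceAnalyzer: split on whitespace, keep case.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Like SimpleAnalyzer: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: the SimpleAnalyzer steps plus stop-word removal.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String s = "The quick brown fox jumped over the lazy dog";
        System.out.println(whitespace(s));
        System.out.println(simple(s));
        System.out.println(stop(s));
    }
}
```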

Tokenizers and TokenFilters
• Tokenizer
– WhitespaceTokenizer
– KeywordTokenizer
– LetterTokenizer
– StandardTokenizer
– ...
• TokenFilter
– LowerCaseFilter
– StopFilter
– PorterStemFilter
– ASCIIFoldingFilter
– StandardFilter
– ...

Scoring
• Lucene = Boolean Model + Vector Space Model
• Similarity = Cosine Similarity
– Term Frequency
– Inverse Document Frequency
– Other factors: length normalization, coord. factor (matching terms in OR queries), boosts (per query or per doc)
• To build your own: extend Similarity and call IndexSearcher.setSimilarity
• For debugging: IndexSearcher.explain(Query q, int doc)
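
For intuition about the vector space part, here is textbook tf-idf cosine similarity in plain Java. It is a simplification under stated assumptions, not Lucene's scoring formula: there is no coord factor or boost, the idf smoothing is an arbitrary choice, and `TfIdfCosineSketch` with its methods is a hypothetical name.

```java
import java.util.*;

// Textbook tf-idf cosine similarity; a simplification, not Lucene's exact scoring formula.
final class TfIdfCosineSketch {
    // Builds a tf-idf weight vector for one tokenized document.
    static Map<String, Double> weights(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> w = new HashMap<>();
        for (String term : doc) {
            w.merge(term, 1.0, Double::sum); // raw term frequency
        }
        for (Map.Entry<String, Double> e : w.entrySet()) {
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            // Smoothed idf (an assumption); max(1, df) guards terms absent from the corpus.
            double idf = Math.log(1.0 + (double) corpus.size() / Math.max(1L, df));
            e.setValue(e.getValue() * idf);
        }
        return w;
    }

    // Cosine of the angle between two sparse vectors; dividing by the norms
    // is the length normalization.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return dot / (norm(a) * norm(b));
    }

    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) sum += x * x;
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("lucene", "search", "library"),
                Arrays.asList("web", "crawling"));
        Map<String, Double> query = weights(Arrays.asList("lucene", "search"), corpus);
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + cosine(query, weights(doc, corpus)));
        }
    }
}
```

A document sharing no terms with the query scores 0, and a document identical to the query scores 1; everything else falls in between.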

Performance
• Indexing
– Batch indexing
– Raise mergeFactor
– Raise maxBufferedDocs
• Searching
– Reuse IndexSearcher
– Optimize: IndexWriter.optimize() (replaced by forceMerge(int) in Lucene 4.0 and later)
– Use cached filters: QueryFilter

Index structure

Thank you…