Searching with Lucene Chapter 2

For discussion
• Information retrieval
• What is Lucene?
• Code for indexer using Lucene
• PageRank algorithm
Information Retrieval
• Consider a collection of documents
• You want to know which words are in each of
the documents
• Given a word, you want to know which
documents it occurs in
• You want to know how many times a word
occurs in each document
• You want to rank the documents according to
that count
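The requirements above can be sketched in plain Java without any Lucene dependency. This is a minimal illustration, not how Lucene itself is implemented: the class name TinyInvertedIndex, its method names, and the sample documents are hypothetical, and tokenization is assumed to be a simple lowercase split on non-word characters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy inverted index: word -> (document id -> occurrence count).
public class TinyInvertedIndex {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Record every word of the text under the given document id.
    public void add(String docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // How many times does the word occur in the document?
    public int count(String word, String docId) {
        Map<String, Integer> counts =
                postings.getOrDefault(word.toLowerCase(), Collections.emptyMap());
        return counts.getOrDefault(docId, 0);
    }

    // Which documents contain the word, ranked by occurrence count (highest first)?
    public List<String> search(String word) {
        Map<String, Integer> counts =
                postings.getOrDefault(word.toLowerCase(), Collections.emptyMap());
        List<String> docs = new ArrayList<>(counts.keySet());
        docs.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b); // ties: by document id
        });
        return docs;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("d1", "Lucene indexes text. Lucene is fast.");
        index.add("d2", "Hadoop grew out of Lucene work");
        System.out.println(index.search("lucene"));      // documents ranked by count
        System.out.println(index.count("lucene", "d1")); // occurrences in d1
    }
}
```

Ranking by raw count is the simplest possible scoring rule; it matches the bullets above but ignores document length and how rare a word is across the collection.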
What is Lucene?
• Apache Lucene is a free/open source
information retrieval software library,
originally created in Java by Doug Cutting.
• It is supported by the Apache Software
Foundation and is released under the Apache
License
• It does indexing at lightning speed.
• Experience with Lucene led to the development of
Hadoop (also by Doug Cutting).
Why do we need to study it?
• But search is more than indexing: link analysis,
click analysis, natural language processing, latent
Dirichlet allocation (LDA), PageRank, …
• We are interested in data-intensive computing
algorithms such as MapReduce and data structures
such as the Google File System.
• The algorithms we discuss in the context of Lucene
could all be converted to data-intensive methods
to improve performance and scalability.
PageRank algorithm