Searching with Lucene
Chapter 2

For discussion
• Information retrieval
• What is Lucene?
• Code for indexer using Lucene
• PageRank algorithm

Information Retrieval
• Consider a collection of documents.
• You want to know which words are in each document.
• Given a word, you want to know which documents it occurs in.
• You want to know how many times a word occurs in a document.
• You want to rank documents according to that count.

What is Lucene?
• Apache Lucene is a free, open-source information retrieval software library, originally written in Java by Doug Cutting.
• It is supported by the Apache Software Foundation and is released under the Apache License.
• It does indexing at lightning speed.
• The Lucene experience led to the development of Hadoop (by Doug Cutting).

Why do we need to study it?
• But search is more than indexing: link analysis, click analysis, natural language processing, latent Dirichlet allocation (LDA), …, PageRank, …
• We are interested in data-intensive computing algorithms such as MapReduce and data structures such as the Google File System.
• The algorithms we discuss in the context of Lucene could all be converted to data-intensive methods to improve performance and scalability.

PageRank algorithm
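
For contrast with link-based ranking, the count-based goals listed under Information Retrieval (which documents contain a word, how often, and ranking by that count) can be sketched as a small in-memory inverted index. This is a minimal illustration in plain Java, not Lucene's API or its actual implementation; the class name `TinyIndex`, its methods, and the sample documents are all hypothetical, and the tokenizer is a deliberately naive assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy inverted index (hypothetical, not Lucene): word -> (document id -> count). */
final class TinyIndex {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    /** Lower-cases the text, splits on non-letter/non-digit runs, and counts each word. */
    void add(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if text starts with a separator
            }
            postings.computeIfAbsent(token, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** How many times the word occurs in the given document (0 if never). */
    int count(String word, String docId) {
        return postings
                .getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyMap())
                .getOrDefault(docId, 0);
    }

    /** Documents containing the word, highest count first; ties broken by document id. */
    List<String> search(String word) {
        Map<String, Integer> hits =
                postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyMap());
        List<String> docs = new ArrayList<>(hits.keySet());
        docs.sort((a, b) -> {
            int byCount = Integer.compare(hits.get(b), hits.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return docs;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("doc1", "Lucene builds an index; the index maps words to documents.");
        index.add("doc2", "Hadoop grew out of work on a search index.");
        index.add("doc3", "PageRank ranks pages by links.");

        System.out.println(index.search("index"));        // prints [doc1, doc2]
        System.out.println(index.count("index", "doc1")); // prints 2
    }
}
```

Ranking purely by occurrence count uses only the text of each document; the sketch ignores links between documents entirely, which is the information a link-analysis method such as PageRank adds.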