For ITCS 6265
Professor: Wensheng Wu
Presented by TA: Xu Fei

What is Lucene
- “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”
- high-performance, scalable Information Retrieval (IR) library
- a project in the Apache Software Foundation
- mature, free, open-source
- implemented in Java
- full-text indexing and searching
  - “In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.”
  - “Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”

Lucene is popular
- a number of ports or integrations to other programming languages: C/C++, C#, Ruby, Perl, Python, PHP, etc.
- 1500+ installations: HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster, …

Lucene is just a hammer!
- NOT a ready-to-use search application like Google
- a software library, a toolkit
- a single compact JAR file (less than 1 MB!)
- A number of full-featured search applications have been built on top of Lucene.

What Lucene can do for you
- add search capabilities to your application
- index and make searchable any data that you can extract text from
- Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.
- You can even index data stored in your databases, indirectly!

Search Application
- Components for indexing
  - Acquire Content
  - Build Document
  - Analyze Document
  - Index Document
- Components for searching
  - Search User Interface
  - Build Query
  - Search Query
  - Render Results
- Others
  - Administration Interface
  - Analytics Interface
  - Scale-out

Figure 1.
Typical components of a search application; the shaded components show which parts Lucene handles.

Ranking formula
$$\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Bigl( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Bigr)$$
- tf–idf weight (term frequency–inverse document frequency)

Key index files in Lucene
- Segments file
- Fields information file
- Text information file
- Frequency file
- Position file

Inverted Index Example
Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting Table
                  |---- Posting ----|
id   word         doc       offset
1    football     Doc 1     3
                  Doc 1     67
                  Doc 2     1
2    penn         Doc 1     1
3    players      Doc 2     2
4    state        Doc 1     2
                  Doc 2     13

Demo
- How to install Lucene and run the demo
- Boolean retrieval example
  - apache – lucene
  - apache + lucene
  - apache lucene
- Luke: http://www.getopt.org/luke/
- An online demo (PHP + Lucene): http://tiny.cc/JCA9K

Reference
- Lucene: http://lucene.apache.org/
- Apache: http://www.apache.org/
- “Lucene in Action” Chapter 1 and code: Link
- Lucene index: http://www.ibm.com/developerworks/library/walucene/
- http://lucene.apache.org/java/2_4_0/scoring.html
- http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
- http://en.wikipedia.org/wiki/Full_text_search
- http://en.wikipedia.org/wiki/Index_%28search_engine%29
- http://en.wikipedia.org/wiki/Tf-idf
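
Supplementary sketch for the Inverted Index Example and Boolean retrieval slides
The posting table above can be reproduced with a few lines of code. What follows is a minimal sketch in plain Python, not Lucene's API or its on-disk index format. The function names `build_index` and `docs_with` are hypothetical, and the two documents contain only the words visible on the slide (the elided "…" parts are left out, so the offsets 67 and 13 do not appear). Reading the three Boolean queries as NOT, AND, and OR over posting lists is also an assumption made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to {doc_id: [1-based word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(offset)
    return dict(index)


def docs_with(index, word):
    """Return the set of document ids whose postings contain the word."""
    return set(index.get(word.lower(), {}))


# Only the words visible on the slide; the elided text is not reconstructed.
docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players",
}
index = build_index(docs)

# Print the posting table: one line per (word, doc, offset) posting.
for word in sorted(index):
    for doc_id, offsets in sorted(index[word].items()):
        for offset in offsets:
            print(f"{word:10} {doc_id:6} {offset}")

# Boolean retrieval becomes set algebra over the posting lists.
both = docs_with(index, "football") & docs_with(index, "state")    # AND
either = docs_with(index, "penn") | docs_with(index, "players")    # OR
only = docs_with(index, "football") - docs_with(index, "state")    # NOT
```

Because every word maps directly to the documents that contain it, a query only touches the posting lists of its own terms and never scans the full text of every document.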