For ITCS 6265
Professor: Wensheng Wu
Presented by TA: Xu Fei
What is Lucene
• “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”
• A high-performance, scalable Information Retrieval (IR) library
• A project of the Apache Software Foundation
• Mature, free, open-source
• Implemented in Java
Full-text indexing and searching
• “In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.”
• “Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”
Lucene is popular
• A number of ports or integrations to other programming languages
• C/C++, C#, Ruby, Perl, Python, PHP, etc.
• 1500+ installations: HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster, …
Lucene is just a hammer!
• NOT a ready-to-use search application like Google
• A software library, a toolkit
• A single compact JAR file (less than 1 MB!)
• A number of full-featured search applications have been built on top of Lucene.
What Lucene can do for you
• Add search capabilities to your application
• Index and make searchable any data that you can extract text from
• Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.
• You can even index data stored in your databases, indirectly!
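Indexing a database "indirectly" can be read as: pull the rows out, derive text from them, and hand that text to the indexer. The sketch below illustrates only the extraction step, using Python's built-in sqlite3 module rather than Lucene itself. The table name `articles`, its columns, the sample rows, and the helper `rows_to_documents` are all hypothetical.

```python
import sqlite3

# Hypothetical in-memory database standing in for an application's real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title, body) VALUES (?, ?, ?)",
    [
        (1, "Lucene basics", "Lucene is a search library."),
        (2, "Apache projects", "Lucene is an Apache project."),
    ],
)

def rows_to_documents(connection):
    """Turn each row into a dict of text fields that an indexer could consume."""
    cursor = connection.execute("SELECT id, title, body FROM articles ORDER BY id")
    return [
        {"id": str(row_id), "title": title, "contents": f"{title} {body}"}
        for row_id, title, body in cursor
    ]

documents = rows_to_documents(conn)
for doc in documents:
    print(doc["id"], doc["contents"])
```

Each resulting dict plays the role of one document; the indexer never touches the database directly.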
Search Application
• Components for indexing
  • Acquire Content
  • Build Document
  • Analyze Document
  • Index Document
• Components for searching
  • Search User Interface
  • Build Query
  • Search Query
  • Render Results
• Others
  • Administration Interface
  • Analytics Interface
  • Scaleout
Figure 1. Typical components of search application; the shaded components show which parts Lucene handles.
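The indexing and searching components can be sketched as a toy pipeline. This is a minimal illustration in plain Python, not Lucene's API: the function names (`analyze`, `index_document`, `search`, `render`) and the sample documents are hypothetical, and the query logic simply requires every query term to be present.

```python
import re
from collections import defaultdict

def analyze(text):
    """Analyze: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_document(index, doc_id, text):
    """Index: record which documents each token appears in."""
    for token in analyze(text):
        index[token].add(doc_id)

def search(index, query):
    """Build and run a query: ids of documents containing every query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def render(doc_ids, docs):
    """Render results: one line per hit, sorted by document id."""
    return [f"{doc_id}: {docs[doc_id]}" for doc_id in sorted(doc_ids)]

# Acquire content and build documents (hypothetical sample data).
docs = {
    1: "Apache Lucene is a search library",
    2: "Lucene is written in Java",
    3: "Apache is a software foundation",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    index_document(index, doc_id, text)

hits = render(search(index, "apache lucene"), docs)
for line in hits:
    print(line)
```

The user interface, administration, analytics, and scale-out components are left out; they sit around this core rather than inside it.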
Ranking formula
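\[
\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Big( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^{2} \cdot \mathrm{t.getBoost}() \cdot \mathrm{norm}(D) \Big)
\]
• tf–idf weight (term frequency–inverse document frequency)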
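To make the tf–idf core of the formula concrete, here is a simplified sketch that keeps only the tf(t in D) · idf(t)² sum and treats coord, queryNorm, boosts, and norm as 1. The definitions tf = sqrt(frequency) and idf = 1 + ln(numDocs / (docFreq + 1)) are assumed here to mirror Lucene's classic default similarity; the sample documents and function names are hypothetical.

```python
import math
from collections import Counter

def tf(freq):
    """Term-frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query_terms, doc_tokens, all_docs):
    """Sum tf * idf^2 over the query terms (all other factors assumed to be 1)."""
    counts = Counter(doc_tokens)
    num_docs = len(all_docs)
    total = 0.0
    for term in query_terms:
        doc_freq = sum(1 for tokens in all_docs if term in tokens)
        total += tf(counts[term]) * idf(doc_freq, num_docs) ** 2
    return total

# Hypothetical, already-analyzed documents.
docs = [
    ["penn", "state", "football"],
    ["football", "players", "state"],
    ["apache", "lucene"],
]

scores = [score(["penn", "football"], d, docs) for d in docs]
for doc_number, value in enumerate(scores, start=1):
    print(f"Doc {doc_number}: {value:.3f}")
```

The first document ranks highest because it matches both query terms and "penn" is the rarer of the two.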
Key index files in Lucene
• Segments file
• Fields information file
• Text information file
• Frequency file
• Position file
Inverted Index Example
Doc 1: Penn State Football … football
Doc 2: Football players … State

Posting Table
id   word       doc     offset
1    football   Doc 1   3
                Doc 1   67
                Doc 2   1
2    penn       Doc 1   1
3    players    Doc 2   2
4    state      Doc 1   2
                Doc 2   13
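The posting table can be built mechanically from the documents. The sketch below is a minimal illustration, not Lucene's on-disk format: it assumes offsets are 1-based word positions, and it uses shortened sample texts without the elided middle parts, so the later offsets (67 and 13 in the slide) are not reproduced.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to a list of (doc name, 1-based word offset)."""
    index = defaultdict(list)
    for name, text in docs.items():
        for offset, word in enumerate(text.lower().split(), start=1):
            index[word].append((name, offset))
    return index

# Shortened versions of the two example documents (assumption: no elided text).
docs = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players State",
}

index = build_index(docs)
for word in sorted(index):
    print(word, index[word])
```

Looking up a word is then a single dictionary access, which is what makes search fast compared with scanning every document.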
Demo
• How to install Lucene and run the demo
• Boolean retrieval example
  • apache -lucene
  • apache +lucene
  • apache lucene
• Luke: http://www.getopt.org/luke/
• An online demo (PHP + Lucene): http://tiny.cc/JCA9K
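The three Boolean queries in the demo differ only in their operators. The sketch below shows one way to interpret them, assuming the usual query-syntax reading: a leading `+` marks a required term, a leading `-` marks a prohibited term, and bare terms are optional (OR). It is a toy matcher in Python, not Lucene's query parser; the documents and the helpers `parse`, `matches`, and `run` are hypothetical.

```python
def parse(query):
    """Split a query into required (+), prohibited (-), and optional terms."""
    required, prohibited, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            prohibited.append(token[1:])
        else:
            optional.append(token)
    return required, prohibited, optional

def matches(query, doc_words):
    """Return True if the document's word set satisfies the query."""
    required, prohibited, optional = parse(query)
    if any(term in doc_words for term in prohibited):
        return False
    if required:
        return all(term in doc_words for term in required)
    return any(term in doc_words for term in optional)

# Hypothetical documents, already reduced to sets of lowercased words.
docs = {
    1: {"apache", "lucene"},
    2: {"apache", "httpd"},
    3: {"lucene", "in", "action"},
}

def run(query):
    """Return the sorted ids of all matching documents."""
    return sorted(doc_id for doc_id, words in docs.items() if matches(query, words))

for query in ["apache -lucene", "apache +lucene", "apache lucene"]:
    print(query, "->", run(query))
```

In this sketch optional terms only decide whether a document matches when no term is required; in a real engine they would also raise the score of documents that contain them.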
References:
• Lucene: http://lucene.apache.org/
• Apache: http://www.apache.org/
• “Lucene in Action” Chapter 1 and code: Link
• Lucene index: http://www.ibm.com/developerworks/library/walucene/
• http://lucene.apache.org/java/2_4_0/scoring.html
• http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
• http://en.wikipedia.org/wiki/Full_text_search
• http://en.wikipedia.org/wiki/Index_%28search_engine%29
• http://en.wikipedia.org/wiki/Tf-idf