Introduction to Lucene and Solr

Introduction to Apache Lucene/Solr
CSCI 572: Information Retrieval and
Search Engines
Summer 2010
Outline
• What is Lucene/Solr?
• Where did it come from?
• What are the current versions of Lucene/Solr?
• What can it do?
May-20-10
CS572-Summer2010
CAM-2
Apache Lucene
• The brainchild of Doug Cutting
• Free-text indexing library that implements most of the functionality I’ve talked to you about
  – Query Models, Ranking, Indexing
• Core API is implemented in Java
  – C++/C, Ruby, and Python APIs exist as well, but they have small communities or are automatically generated
• Initially hosted on SourceForge; moved to Apache in 2001
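To make the indexing, querying, and ranking ideas concrete, here is a minimal sketch of a toy inverted index with a simple tf-idf score. It is not Lucene's API or its actual scoring formula; the class and function names (TinyIndex, tokenize) are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()


class TinyIndex:
    """A toy inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        """Indexing: record how often each term occurs in the document."""
        self.doc_count += 1
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Ranking: score documents by summed tf * idf over the query terms."""
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rarer terms get a larger weight (assumed idf variant).
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        # Best score first; ties broken by document id.
        return sorted(scores, key=lambda d: (-scores[d], d))


index = TinyIndex()
index.add("d1", "Lucene is a search library")
index.add("d2", "Solr is a search server built on Lucene")
index.add("d3", "Tomcat is an application server")
```

Calling index.search("tomcat") returns only the one document that contains the term; Lucene adds analyzers, richer query types, and a more careful scoring model on top of the same basic structure.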
Apache Solr
• Originally developed at CNET
• Web service layer built on top of the Lucene library
• Provides a schema and an understanding of field types, with conversion to and from their index representation
• Scales to very large deployments; deployed on top of an application server such as Tomcat or Jetty
• Programming-language-independent APIs
• Sharding, replication, faceting, highlighting, explain, more-like-this, and other functionality provided easily
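Because Solr is exposed over HTTP, any language that can build a URL can query it. The sketch below only constructs a select URL that asks for highlighting and score explanations; it does not contact a server. The base URL and the field name `name` are assumptions for illustration, and select_url is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode


def select_url(base_url, query, highlight_field=None, explain=False):
    """Build a Solr /select URL (hypothetical helper; nothing is sent)."""
    params = [("q", query), ("wt", "json")]
    if highlight_field:
        # hl / hl.fl switch on highlighting for the given field.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    if explain:
        # debugQuery asks Solr to include score explanations.
        params.append(("debugQuery", "true"))
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Assumed base URL and field name, for illustration only.
url = select_url(
    "http://localhost:8983/solr",
    "name:beatles",
    highlight_field="name",
    explain=True,
)
```

The resulting URL can be fetched with curl, a browser, or urllib.request.urlopen once a Solr instance is running at that address.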
How to get started
• Lucene (2.9.2 and 3.0.1 stable)
  – Put your Java hat on
  – Have Eclipse or your favorite IDE ready
  – Download lucene-core-<version>.jar from
    • http://repo1.maven.org/maven2/org/apache/lucene/
  – Or download the source and build it from
    • http://www.apache.org/dyn/closer.cgi/lucene/java/
  – Check out some example Java code from Otis Gospodnetic that demonstrates indexing and querying
    • http://onjava.com/pub/a/onjava/2003/01/15/lucene.html
How to get started
• Solr
  – Grab a release of Solr (1.4.0 stable)
    • http://www.apache.org/dyn/closer.cgi/lucene/solr/
  – Unpack into, e.g., /usr/local/solr
  – Deploy onto Tomcat
    • Install Tomcat into /usr/local/tomcat
    • Create a solr.xml file and drop it into /usr/local/tomcat/conf/Catalina/localhost/
      – Create the solr/home JNDI property and point it to /usr/local/solr/solr
    • Start Tomcat
  – Head over to $solr/example/example-docs
    • curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' --data-binary @artists.xml
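The curl command above posts an XML file to the update handler. A rough Python equivalent is sketched below: it builds an `<add>` message and the matching HTTP request without sending anything. The document fields (`id`, `name`) and the helper names are assumptions for illustration; the base URL is copied from the curl example and should match wherever your container actually listens.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request


def build_add_xml(docs):
    """Serialize dicts into a Solr <add><doc><field .../></doc></add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def build_update_request(base_url, body):
    """Prepare (but do not send) a POST to the /update handler."""
    return Request(
        base_url.rstrip("/") + "/update",
        data=body,
        headers={"Content-type": "text/xml; charset=utf-8"},
    )


# Hypothetical document; real field names must exist in your schema.xml.
body = build_add_xml([{"id": "1", "name": "Example Artist"}])
add_request = build_update_request("http://localhost:8983/solr", body)

# Added documents become searchable only after a commit message is posted.
commit_request = build_update_request("http://localhost:8983/solr", b"<commit/>")
```

With a running Solr instance, passing each request to urllib.request.urlopen would perform the actual upload and commit.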
Modifying your schema.xml
• Field Types
• Analyzers
• Tokenizers
http://wiki.apache.org/solr/SchemaXml
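A field type in schema.xml ties a field to an analyzer, and an analyzer is a tokenizer followed by a chain of filters. The sketch below imitates that pipeline in plain Python to show how the pieces compose. It is a conceptual model only, not Solr's implementation; the function names and the stop-word list are assumptions.

```python
import re

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def whitespace_tokenizer(text):
    """Tokenizer: split raw text into tokens."""
    return text.split()


def lowercase_filter(tokens):
    """Filter: normalize case so 'Lucene' and 'lucene' match."""
    return [t.lower() for t in tokens]


def strip_punctuation_filter(tokens):
    """Filter: trim leading/trailing punctuation and drop empty tokens."""
    cleaned = (re.sub(r"^\W+|\W+$", "", t) for t in tokens)
    return [t for t in cleaned if t]


def stop_filter(tokens):
    """Filter: remove very common words."""
    return [t for t in tokens if t not in STOP_WORDS]


def make_analyzer(tokenizer, *filters):
    """Analyzer: one tokenizer followed by zero or more filters, in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


def string_analyzer(text):
    """An untokenized 'string'-style type: the whole value is one token."""
    return [text]


text_analyzer = make_analyzer(
    whitespace_tokenizer, lowercase_filter, strip_punctuation_filter, stop_filter
)
tokens = text_analyzer("The Rise and Fall of Ziggy Stardust!")
```

The same input run through string_analyzer stays a single token, which is the practical difference between a tokenized text field and an exact-match string field.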
Solr Faceting
• facet=on&facet.field=&facet.field=…
• http://wiki.apache.org/solr/SimpleFacetParameters
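The facet parameters above are ordinary query-string parameters, with facet.field repeated once per field. A minimal sketch of building such a URL follows; the base URL and the field names `genre` and `country` are assumptions, and facet_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def facet_url(base_url, query, facet_fields):
    """Build a /select URL that requests facet counts for each given field."""
    params = [("q", query), ("facet", "on")]
    # facet.field is repeated, one occurrence per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Assumed field names, for illustration only.
url = facet_url("http://localhost:8983/solr", "*:*", ["genre", "country"])
```

Using a list of pairs rather than a dict is what allows the repeated facet.field key to survive encoding.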
Advanced Topics
• Standing up cores
• Sharding
• Replication
• Zookeeper and Cloud
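As one illustration of sharding, Solr's distributed search lets a single request fan out to several shards through the `shards` query parameter, a comma-separated list of host:port/path entries. The sketch below only builds such a URL; the host names are placeholders and distributed_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def distributed_url(coordinator_url, query, shards):
    """Build a /select URL that asks one Solr node to query several shards."""
    params = [("q", query), ("shards", ",".join(shards))]
    # Leave ':', '/' and ',' unescaped so the shard list stays readable.
    return coordinator_url.rstrip("/") + "/select?" + urlencode(params, safe=":/,")


# Placeholder hosts, for illustration only.
url = distributed_url(
    "http://host1:8983/solr",
    "solr",
    ["host1:8983/solr", "host2:8983/solr"],
)
```

The node that receives the request merges the per-shard results, so clients see a single ranked list.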
Development currently in flux
• Stick with release versions
• Depending on trunk won’t really help while it is changing this quickly
• Lucene and Solr have merged their development
Wrapup
• Lots more information at
  – http://lucene.apache.org
  – http://lucene.apache.org/solr/
  – http://lucene.apache.org/java/
• Possible projects
  – Geospatial search
    • Improving existing code and contributing back to Apache SIS and to Apache Solr
  – Improving date faceting
  – Rewriting the ResponseWriter framework
Acknowledgements
• Material inspired by discussions and talks on the Apache mailing lists for Solr and Lucene, and by discussions with the rest of the Lucene community