Introduction to Apache Lucene/Solr
CSCI 572: Information Retrieval and Search Engines
Summer 2010

Outline
• What is Lucene/Solr?
• Where did it come from?
• What are the current versions of Lucene/Solr?
• What can it do?

May-20-10 CS572-Summer2010 CAM-2

Apache Lucene
• The brainchild of Doug Cutting
• Free-text indexing library that implements most of the functionality I’ve talked to you about
  – Query models, ranking, indexing
• Core API is implemented in Java
  – C++/C, Ruby, and Python APIs exist as well, but they have small communities or are automatically generated
• Initially hosted on SourceForge; moved to Apache in 2001

Apache Solr
• Originally developed at CNET
• Web service layer built on top of the Lucene library
• Provides a schema and an understanding of field types, with conversion to and from their internal representation
• Scales to very large deployments; deployed on top of an application server such as Tomcat or Jetty
• Programming-language-independent APIs
• Sharding, replication, faceting, highlighting, explain, "more like this", and other functionality provided easily

How to get started
• Lucene (2.9.2 and 3.0.1 are the stable releases)
  – Put your Java hat on
  – Have Eclipse or your favorite IDE ready
  – Download lucene-core-<version>.jar from
    • http://repo1.maven.org/maven2/org/apache/lucene/
  – Or download the source and build it from
    • http://www.apache.org/dyn/closer.cgi/lucene/java/
  – Check out some example Java code from Otis Gospodnetic that demonstrates indexing and querying
    • http://onjava.com/pub/a/onjava/2003/01/15/lucene.html

How to get started
• Solr
  – Grab a release of Solr (1.4.0 is the stable release)
    • http://www.apache.org/dyn/closer.cgi/lucene/solr/
  – Unpack it into, e.g., /usr/local/solr
  – Deploy onto Tomcat
    • Install Tomcat into /usr/local/tomcat
    • Create a solr.xml context file and drop it into /usr/local/tomcat/conf/Catalina/localhost/
      – In it, create the solr/home JNDI entry and point it to /usr/local/solr/solr
    • Start Tomcat
  – Head over to $solr/example/exampledocs
    • curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' --data-binary @artists.xml
    • Port 8983 is the default of the Jetty example server bundled with Solr; a stock Tomcat listens on 8080, so adjust the port to match your server

Modifying your schema.xml
• Field Types
• Analyzers
• Tokenizers
http://wiki.apache.org/solr/SchemaXml

Solr Faceting
• facet=on&facet.field=&facet.field=…
• http://wiki.apache.org/solr/SimpleFacetParameters

Advanced Topics
• Standing up cores
• Sharding
• Replication
• ZooKeeper and Cloud

Development currently in flux
• Stick with release versions
• Depending on trunk won’t really help
• Lucene and Solr have merged

Wrapup
• Lots more information at
  – http://lucene.apache.org
  – http://lucene.apache.org/solr/
  – http://lucene.apache.org/java/
• Possible projects
  – Geospatial search
    • Improving existing code and contributing back to Apache SIS and to Apache Solr
  – Improving date faceting
  – Rewriting the ResponseWriter framework

Acknowledgements
• Material inspired by discussions and talks on the Apache mailing lists for Solr and Lucene, and by discussions with the rest of the Lucene community
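
Appendix: building a faceting request
The facet=on and repeated facet.field parameters from the Solr Faceting slide can be assembled into a request URL with plain Java. The following is only a minimal sketch under stated assumptions: the class name FacetQueryUrl, the field names "genre" and "country", and the host and port are hypothetical and chosen for illustration; the /select path and the q, facet, and facet.field parameters are standard Solr request parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Solr): assembles a faceted select URL.
class FacetQueryUrl {

    // URL-encodes a single parameter value as UTF-8.
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds a URL that asks Solr for facet counts on each given field.
    static String build(String selectUrl, String query, List<String> facetFields) {
        StringBuilder url = new StringBuilder(selectUrl);
        url.append("?q=").append(enc(query));
        url.append("&facet=on");
        for (String field : facetFields) {
            // facet.field is repeated once per field to facet on.
            url.append("&facet.field=").append(enc(field));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // "genre" and "country" are assumed field names; use fields from your own schema.xml.
        String url = build("http://localhost:8983/solr/select", "*:*",
                Arrays.asList("genre", "country"));
        System.out.println(url);
    }
}
```

The printed URL can be pasted into a browser or passed to curl against a running Solr instance; the facet counts come back alongside the normal search results.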