Introduction to Open Source Search with Apache Lucene and Solr

Grant Ingersoll
The How Many Game
•How many of you:
o Have taken a class in Information Retrieval (IR)?
o Are doing work/research in IR?
o Have heard of or are using Lucene?
o Have heard of or are using Solr?
o Are doing work on core IR algorithms such as compression techniques or scoring?
o Are doing UI/Application work/research as they relate to search?
Lucid Imagination, Inc.
Topics
•Brief Bio
•Search 101 (skip?)
•What is:
o Apache Lucene
o Apache Solr
•What can they do?
o Features and functionality
o Intangibles
•What’s new in Lucene and Solr?
o How can they help my research/work/____?
Brief Bio
•Apache Lucene/Solr Committer
•Apache Mahout co-founder
o Scalable Machine Learning
•Co-founder of Lucid Imagination
o http://www.lucidimagination.com
•Previously worked at the Center for Natural Language Processing at Syracuse University with Dr. Liddy
•Co-Author of upcoming “Taming Text” (Manning Publications)
o http://www.manning.com/ingersoll
Search 101
•Search tools are designed for dealing with fuzzy data/questions
o Works well with structured and unstructured data
o Performs well when dealing with large volumes of data
o Many apps don’t need the limits that databases place on content
o Search fits well alongside a DB too
•Given a user’s information need (query), find and, optionally, score content relevant to that need
o Many different ways to solve this problem, each with tradeoffs
•What’s “relevant” mean?
Search 101
•Relevance
o Vector Space Model (VSM) for relevance
o Common across many search engines
o Apache Lucene is a highly optimized implementation of the VSM
•Indexing
o Finds and maps terms and documents
o Conceptually similar to a book index
o At the heart of fast search/retrieve
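The two ideas on this slide can be sketched in a few lines of Python. This is a toy illustration only, not Lucene's implementation: the sample documents, the smoothed IDF formula, and all function names are assumptions made for the example. It builds an inverted index (the "book index" that maps terms to documents) and ranks candidates by cosine similarity between TF-IDF vectors (a basic form of the VSM).

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample documents, invented for this sketch.
docs = {
    "d1": "lucene is a search library",
    "d2": "solr is a search server built on lucene",
    "d3": "mahout is a machine learning library",
}

def tokenize(text):
    return text.lower().split()

# Inverted index: term -> {doc_id: term frequency}, like a book index.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(tokenize(text)).items():
        index[term][doc_id] = tf

def idf(term):
    # Smoothed inverse document frequency (one of several common variants).
    df = len(index.get(term, {}))
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def vector(tokens):
    # TF-IDF weighted term vector for a bag of tokens.
    return {t: tf * idf(t) for t, tf in Counter(tokens).items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query):
    q_vec = vector(tokenize(query))
    # The inverted index narrows scoring to documents sharing a query term.
    candidates = set()
    for term in q_vec:
        candidates.update(index.get(term, {}))
    scored = [(cosine(q_vec, vector(tokenize(docs[d]))), d) for d in candidates]
    return [d for _, d in sorted(scored, reverse=True)]
```

Calling search("lucene search") scores only the two documents that contain a query term, and the shorter document ranks first because its vector points more closely in the query's direction.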
Apache Lucene in a Nutshell
•http://lucene.apache.org/java
•Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
•Fast and efficient scoring and indexing algorithms
•Lots of contributions to make common tasks easier:
o Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
•Most widely deployed search library on the planet
Lucene Basics
•Content is modeled via Documents and Fields
o Content can be text, integers, floats, dates, custom
o Analysis can be employed to alter content before indexing
•Searches are supported through a wide range of Query options
o Keyword
o Terms
o Phrases
o Wildcards
o Many, many more
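As a rough picture of the Document/Field model and a few of the query types above, here is a toy Python sketch. It is not the Lucene API: the documents, field names, and helper functions are hypothetical, and it scans every document linearly where a real engine would consult an index.

```python
import fnmatch

# Hypothetical documents: each is a set of named fields with typed values.
docs = [
    {"id": 1, "title": "open source search", "pages": 10},
    {"id": 2, "title": "search server basics", "pages": 25},
    {"id": 3, "title": "open relevance", "pages": 5},
]

def tokens(doc, field):
    # Minimal "analysis": lowercase and split on whitespace.
    return str(doc[field]).lower().split()

def term_query(field, term):
    # Matches documents whose field contains the exact term.
    return lambda doc: term in tokens(doc, field)

def phrase_query(field, phrase):
    # Matches documents where the words appear adjacent and in order.
    words = phrase.lower().split()
    def match(doc):
        toks = tokens(doc, field)
        return any(toks[i:i + len(words)] == words
                   for i in range(len(toks) - len(words) + 1))
    return match

def wildcard_query(field, pattern):
    # Matches documents with any term fitting a shell-style pattern.
    return lambda doc: any(fnmatch.fnmatchcase(t, pattern)
                           for t in tokens(doc, field))

def search(query):
    return [doc["id"] for doc in docs if query(doc)]
```

Each query is just a predicate over a document, so search(phrase_query("title", "source search")) and search(wildcard_query("title", "rel*")) go through the same code path.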
Apache Solr in a Nutshell
•http://lucene.apache.org/solr
•Lucene-based Search Server + other features and functionality
•Access Lucene over HTTP:
o Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
•Most programming tasks in Lucene are configuration tasks in Solr
•Faceting (guided navigation, filters, etc.)
•Replication and distributed search support
•Lucene Best Practices
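Because Solr is reached over HTTP, a client in any language mostly just builds a URL. The Python sketch below assembles a query URL with faceting turned on, without sending it. The host and /solr/select path are assumptions based on a default local example install, and the field names cat and manu are hypothetical; q, wt, rows, facet, and facet.field are standard Solr request parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint for a local example install; adjust for a real deployment.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_query_url(query, facet_fields=(), rows=10):
    # Core parameters: the query, JSON response format, and page size.
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        # Faceting is switched on once, then one facet.field per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", f) for f in facet_fields)
    return SOLR_SELECT + "?" + urlencode(params)

# Hypothetical field names, used only for illustration.
url = build_query_url("title:lucene", facet_fields=["cat", "manu"])

# Round-trip the query string to inspect what would be sent.
sent = parse_qs(urlsplit(url).query)
```

The resulting URL can be fetched with any HTTP client, which is why so many languages can talk to Solr without a dedicated library.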
A small sampling of Lucene/Solr-Powered Sites
Buy.com
Features and Functionality
Quick Solr/Lucene Demo
•Pre-reqs:
o Apache Ant 1.7.x, Subversion (SVN)
•Command Line 1:
o svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
o cd solr-trunk/solr/
o ant example
o cd example
o java -Dsolr.clustering.enabled=true -jar start.jar
•Command Line 2:
o cd exampledocs; java -jar post.jar *.xml
•http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
Other Features
•Data Import Handler
o Database, Mail, RSS, etc.
•Rich document support via Apache Tika
o PDF, MS Office, Images, etc.
•Replication for high query volume
•Distributed search for large indexes
o Production systems with 1B+ documents
•Configurable Analysis chain and other extension points
o Total control over tokenization, stemming, etc.
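To make the idea of a configurable analysis chain concrete, here is a small Python sketch. It only illustrates the concept of chaining a tokenizer with filters: the stopword list, the crude suffix-stripping "stemmer", and every function name are assumptions for this example and do not correspond to Lucene or Solr classes.

```python
import re

# Hypothetical stopword list for the sketch.
STOPWORDS = {"a", "an", "and", "the", "of", "is"}

def tokenize(text):
    # Tokenizer: split the text into runs of letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(tokens):
    # Very crude suffix stripping; real stemmers are far more careful.
    out = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

def analyze(text, chain=(lowercase, remove_stopwords, naive_stem)):
    # Run the tokenizer, then apply each filter in the configured order.
    tokens = tokenize(text)
    for step in chain:
        tokens = step(tokens)
    return tokens
```

Swapping, removing, or reordering the entries in chain changes what ends up in the index, which is the kind of control the bullet above refers to. For example, analyze("Searches", chain=(lowercase,)) skips stopword removal and stemming entirely.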
Intangibles
•Open Source
•Flexible, non-restrictive license
o Apache License v2 – non-viral
o “Do what you want with the software, just don’t claim you wrote it”
•Large community willing to help
o Great place to learn about real world IR systems
•Many books and other documentation
o Lucene in Action by Hatcher, McCandless and Gospodnetic
What’s New?
•https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
•https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
•Codecs
o Pluggable Index Formats
o Provide Different index compression techniques
•Stats to enable alternate scoring approaches
o BM25, Language Modeling, etc. -- More work to be done here
•Faster
o Java Strings are slow; convert to use byte arrays
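For readers unfamiliar with BM25, mentioned above as one of the alternate scoring approaches, the Python sketch below shows why it needs collection statistics such as document frequency and average document length. It is a textbook-style Okapi BM25 with one common IDF variant, not Lucene's code; the defaults k1=1.2 and b=0.75 are conventional choices, and the function name is an assumption.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against query_terms."""
    n = len(docs)
    # Collection statistics: average length and per-term document frequency.
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # IDF variant with +1 inside the log, so it is never negative.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical tokenized documents for illustration.
sample_docs = [
    ["lucene", "search"],
    ["solr", "search", "server", "search"],
    ["machine", "learning"],
]
sample_scores = bm25_scores(["search"], sample_docs)
```

Unlike raw term frequency, a second occurrence of a term adds less than the first, and longer documents are penalized relative to the average length.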
Other New Items
•Many new Analyzers (tokenizers, etc.)
o Richer Language support (Hindi, Indonesian, Arabic, …)
•Richer Geospatial (Local) Search capabilities
o Score, filter, sort by distance
o http://wiki.apache.org/solr/SpatialSearch
•Results Grouping
o Group Related Results
o http://wiki.apache.org/solr/FieldCollapsing
•More Faceting Capabilities
o Pivot
o New underlying algorithms
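Two of the items above, filtering and sorting by distance and counting facet values, can be approximated in plain Python to show what the features compute. This is a sketch under stated assumptions: the store records and coordinates are made up, the great-circle (haversine) distance is one reasonable choice of metric, and none of the function names come from Solr.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two latitude/longitude points.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical records with a category and a location.
stores = [
    {"name": "A", "cat": "books", "lat": 43.05, "lon": -76.15},
    {"name": "B", "cat": "music", "lat": 40.71, "lon": -74.01},
    {"name": "C", "cat": "books", "lat": 42.36, "lon": -71.06},
]

def nearby(docs, lat, lon, max_km):
    # Filter by distance, then sort the survivors nearest-first.
    with_dist = [(haversine_km(lat, lon, d["lat"], d["lon"]), d) for d in docs]
    hits = [(dist, d) for dist, d in with_dist if dist <= max_km]
    return [d["name"] for dist, d in sorted(hits, key=lambda pair: pair[0])]

def facet_counts(docs, field):
    # Facet: how many matching documents carry each value of a field.
    return Counter(d[field] for d in docs)
```

A search engine computes the same kinds of results against its index rather than by scanning every record, which is what makes them practical at scale.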
How can Lucene/Solr help me?
•Everyone
o Fast indexing/search times mean less time waiting for jobs to complete
o Completely Open (source, community)
o Free to use, modify, etc.
o Large community ready and willing to help
•User Experience Researchers
o Rapid UI prototyping
o Total Control of results and facets
o Easy to set up and use with little to no programming required
•IR Researchers
o Flexible Indexing models (trunk)
o Flexible Relevance models via functions and other mechanisms
o Extendable
•Job Seekers
o Google Summer of Code
o Other Internships (see me)
o Real programming skills that are highly valued in industry
o Publicly visible, demonstrable skills
Job Trends
http://www.indeed.com
Other Things that Can Help
• Nutch
o Crawling
o http://nutch.apache.org
• Mahout
o Machine learning (clustering, classification, others)
o http://mahout.apache.org
• OpenNLP
o Part of Speech, Parsers, Named Entity Recognition
o http://incubator.apache.org/opennlp
• Open Relevance Project
o Relevance Judgments
o http://lucene.apache.org/openrelevance
Resources
•http://lucene.apache.org
•http://www.lucidimagination.com
•{java-user|solr-user}@lucene.apache.org
•@gsingers
•http://www.slideshare.net/gsingers
•grant@lucidimagination.com