Introduction to Nutch

CSCI 572: Information Retrieval and
Search Engines
Summer 2010
Outline
• What is Nutch?
– Motivation
– Architecture
– What currently exists?
– How I got involved
• Deploying Nutch on NASA’s Planetary Data
System (PDS)
– Free text “Google-like” search of the PDS catalog
– Architecture/Implementation
May-20-10
CS572-Summer2010
CAM-2
What is Nutch?
• The brainchild of Doug Cutting
– Research/programmer guru who has worked at several
high profile research labs (Yahoo, Bell Labs)
• Nutch builds upon Cutting’s lower level text
indexing library and API called Lucene
• Nutch provides crawling, protocol, parsing, and
content management services on top of the
indexing capability provided by Lucene
Motivation
• Observation: Web Search is a commodity
– Why can’t it be provided freely?
• Allows tweaking of typically “hidden” ranking algorithms
• Allows developers to focus less on the infrastructure (since Brin
& Page’s paper, the infrastructure is well-known), and more on
providing value-added capabilities
Motivation
• Value-added capabilities
– Improving fetching speed
– Parsing and handling of the hundreds of different
content types available on the internet
– Handling different protocols for obtaining content
– Better ranking algorithms (OPIC, PageRank)
• More or less, in Nutch, these capabilities all map to
extension points available via Nutch’s plugin
framework
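As one illustration of the ranking algorithms named above, here is a minimal PageRank power-iteration sketch in plain Java. It is not Nutch's scoring code: the class name `PageRankSketch`, the damping factor of 0.85, and the three-page link graph are all assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical illustration of PageRank; not part of Nutch.
class PageRankSketch {

    // links[i] lists the pages that page i links to.
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no outlinks
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                } else {
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;
                    }
                }
            }
            // Spread dangling rank evenly, then apply the damping factor.
            for (int i = 0; i < n; i++) {
                next[i] = (1.0 - damping) / n
                        + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Made-up graph: page 0 -> 2, page 1 -> 2, page 2 -> 0.
        int[][] links = { {2}, {2}, {0} };
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}
```

In this toy graph page 2 has the most inbound links and ends up with the highest score, while page 1, which nothing links to, keeps only the baseline share.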
Nutch’s Architecture
• Nutch Core facilities
– Parsing
– Indexing
– Crawling
– Content Management
– Querying
– Plugin Framework
• Nutch’s extension points
– Scoring, Parsing, Indexing, Querying, URLFiltering
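To make the idea of an extension point concrete, the sketch below models a URL-filtering chain in plain Java. The `UrlFilter` interface and the two filters are hypothetical stand-ins written for this example, not Nutch's actual plugin classes, and the convention of returning null to reject a URL is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a URL-filtering extension point; not Nutch code.
class UrlFilterChainSketch {

    // Accepts only URLs that start with one of the given prefixes.
    static UrlFilter prefixFilter(String... prefixes) {
        return url -> {
            for (String p : prefixes) {
                if (url.startsWith(p)) return url;
            }
            return null;
        };
    }

    // Rejects URLs that end with one of the given suffixes (case-insensitive).
    static UrlFilter suffixFilter(String... suffixes) {
        return url -> {
            String lower = url.toLowerCase();
            for (String s : suffixes) {
                if (lower.endsWith(s)) return null;
            }
            return url;
        };
    }

    // Runs the URL through every filter in order; the first rejection wins.
    static String apply(List<UrlFilter> chain, String url) {
        String current = url;
        for (UrlFilter f : chain) {
            current = f.filter(current);
            if (current == null) return null;
        }
        return current;
    }

    public static void main(String[] args) {
        List<UrlFilter> chain = Arrays.asList(
                prefixFilter("http://", "file:"),
                suffixFilter(".jpg", ".gif"));
        System.out.println(apply(chain, "http://example.org/index.html"));
        System.out.println(apply(chain, "http://example.org/logo.gif"));
    }
}

// Hypothetical plugin contract: return the URL to keep it, null to drop it.
interface UrlFilter {
    String filter(String url);
}
```

Each filter is independent of the others, which is the point of a plugin design: a deployment can add or remove filters without touching the crawler core.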
Nutch’s Architecture
• Maps to the search engine architecture proposed by Brin & Page
What Currently Exists?
• Version 0.6.x
– First easily deployable version
• Version 0.7.x
– Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter
extension point, first Apache release after Incubation, mime type system
• Version 0.8.x
– Completely new underlying architecture based on Hadoop
– Parse plugins framework, multi-valued metadata container
– Parser Factory enhancement
• Version 0.9.x
– Major bug fixes
– Hadoop and Lucene library upgrades
• Version 1.0
– Flexible filter framework
– Flexible scoring
– Initial integration with Tika
– Full search engine functionality and capabilities, in production at large scale (Internet Archive)
• Version 1.1: for the full list, see http://svn.apache.org/repos/asf/nutch/trunk/CHANGES.txt
What Doesn’t?
• Plenty!
• Bug fixes (> 200 issues in JIRA right now with no
resolution)
• Nutch 2.0 architecture
– http://search-lucene.com/m/gbrBF1RMWk9
– Refactored Nutch architecture, delegating to Solr,
HBase, Tika, and ORM
How I got involved
• In this very class!
– Okay, well, it used to be called CS599, but you get the picture
• Started out by contributing RSS parsing plugin
– My final project in 599
• Moved on from there to
– NUTCH-88, redesign of the parsing framework
– NUTCH-139, Metadata container support
– NUTCH-210, Web Context application file
– And various other bug fixes, and contributions here and there
– Mailing list support
– Wiki support
• Became committer in October 2006
• Helped spin Nutch into Apache TLP, March 2010, Nutch
PMC member
Real world application of Nutch
• I work at NASA’s Jet Propulsion
Laboratory
• NASA’s Planetary Data System
– NASA’s archive for all planetary science
data collected by missions over the past 30
years
– Collected 20 TB over the past 30 years
• Increasing to over 200 TB in the next 3
years!
– Built up a catalog of all data collected
• Where does Nutch fit in?
Where does Nutch fit into the PDS?
• PDS Management Council decided it wanted
“Google-like” search of the PDS catalog
• Our plan: use Nutch to implement capability for
PDS
PDS Google-like Search Architecture
[Figure: architecture diagram in two parts, “Existing PDS” and “Search Engine Architecture (e.g. Nutch, Google)”. Component labels: Tomcat Web Server, PDS Catalog, PDSD, pds.war, Crawler, Catalog Metadata, PDS Extract, Query Parser, PDS Parser, Lucene Indexer, Index]
Approach
• Export PDS catalog datasets in RDF format (flat files)
• Use Nutch to crawl RDF files
– protocol-file plugin in Nutch
• Wrote our own parse-pds plugin
– Parse the RDF files, and then extract the metadata
• Wrote our own index-pds plugin
– Index the fields that we want from the parsed metadata
• Wrote our own query-pds plugin
– Search the index on the fields that we want
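The parse, index, and query steps above can be sketched end to end in plain Java. This is a toy model under stated assumptions, not the actual parse-pds, index-pds, or query-pds plugins: the class `PdsPipelineSketch`, the `key=value` record format standing in for RDF, the field names `title` and `target`, and the sample records are all hypothetical.

```java
import java.util.*;

// Hypothetical parse -> index -> query pipeline; not the real PDS plugins.
class PdsPipelineSketch {

    // field -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();
    private final Set<String> indexedFields;

    PdsPipelineSketch(String... fields) {
        this.indexedFields = new HashSet<>(Arrays.asList(fields));
    }

    // Parse step: turn lines of key=value text into a metadata map.
    static Map<String, String> parse(String record) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String line : record.split("\n")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                metadata.put(line.substring(0, eq).trim(),
                             line.substring(eq + 1).trim());
            }
        }
        return metadata;
    }

    // Index step: keep only the chosen fields, split into lowercase terms.
    void index(String docId, Map<String, String> metadata) {
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (!indexedFields.contains(e.getKey())) continue;
            Map<String, Set<String>> terms =
                    index.computeIfAbsent(e.getKey(), k -> new HashMap<>());
            for (String term : e.getValue().toLowerCase().split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    terms.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    // Query step: look up a single term within a single indexed field.
    Set<String> query(String field, String term) {
        Map<String, Set<String>> terms = index.get(field);
        if (terms == null) return Collections.emptySet();
        Set<String> ids = terms.get(term.toLowerCase());
        if (ids == null) return Collections.emptySet();
        return ids;
    }

    public static void main(String[] args) {
        PdsPipelineSketch engine = new PdsPipelineSketch("title", "target");
        // Made-up records for illustration only.
        engine.index("ds1", parse("title=Mars Orbiter Images\ntarget=Mars\nnote=ignored"));
        engine.index("ds2", parse("title=Saturn Ring Survey\ntarget=Saturn"));
        System.out.println(engine.query("target", "mars"));
        System.out.println(engine.query("note", "ignored"));
    }
}
```

Because only the configured fields reach the index, the `note` field in the first record is parsed but never becomes searchable, mirroring the "index the fields that we want" step.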
Search Interface
Results
Lessons Learned
• Nutch currently isn’t exactly simple to deploy or
configure
– There is much discussion on the mailing lists that refers to
“magic configuration” properties that aren’t intuitive
• Nutch documentation is currently…lacking
• If you know how to use Nutch then it is extremely
easy to use, and a time-saver
• Active participation in the mailing lists and wiki is
necessary to use Nutch
Good News
• Nutch is here to stay
– The only open source implementation of commodity web
search
– If you want to start your own Google++, Nutch is a great
place to start
• Participation is welcome
– Look what happened to me (student -> committer)
– Plenty of areas to improve (including documentation)
Your Class Project
• It’s probably a good idea to at least take a look at Nutch,
whether you use it or not
• You can see how a real implementation of theory described
in class operates
– Implemented in pure Java (1.5)
• Add/extend capabilities within Nutch
– Help finish plugging Nutch into HBase
– Configure Nutch using Spring
– Fully integrate Nutch and Solr
– Fix *important* bugs
– Add more scoring algorithm implementations
Wrapup
• Thanks for your attention!
• Nutch home page:
– http://nutch.apache.org
• Mailing lists
– dev@nutch.apache.org (developer’s list)
– user@nutch.apache.org (user’s list)