Introduction to Nutch
CSCI 572: Information Retrieval and Search Engines
Summer 2010

Outline
• What is Nutch?
  – Motivation
  – Architecture
  – What currently exists?
  – How I got involved
• Deploying Nutch on NASA’s Planetary Data System (PDS)
  – Free text “Google-like” search of the PDS catalog
  – Architecture/Implementation

May-20-10 CS572-Summer2010 CAM-2

What is Nutch?
• The brainchild of Doug Cutting
  – Research/programmer guru who has worked at several high profile research labs (Yahoo, Bell Labs)
• Nutch builds upon Cutting’s lower level text indexing library and API called Lucene
• Nutch provides crawling services, protocol services, parsing services, content management services on top of the indexing capability provided by Lucene

Motivation
• Observation: Web Search is a commodity
  – Why can’t it be provided freely?
• Allows tweaking of typically “hidden” ranking algorithms
• Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities

Motivation
• Value-added capabilities
  – Improving fetching speed
  – Parsing and handling of the hundreds of different content types available on the internet
  – Handling different protocols for obtaining content
  – Better ranking algorithms (OPIC, PageRank)
• More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework

Nutch’s Architecture
• Nutch Core facilities
  – Parsing
  – Indexing
  – Crawling
  – Content Management
  – Querying
  – Plugin Framework
• Nutch’s extension points
  – Scoring, Parsing, Indexing, Querying, URLFiltering

Nutch’s Architecture
Maps to Search engine architecture proposed by Brin & Page

What Currently Exists?
• Version 0.6.x
  – First easily deployable version
• Version 0.7.x
  – Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system
• Version 0.8.x
  – Completely new underlying architecture based on Hadoop
  – Parse plugins framework, multi-valued metadata container
  – Parser Factory enhancement
• Version 0.9.x
  – Major bug fixes
  – Hadoop and Lucene library upgrades
• Version 1.0
  – Flexible filter framework
  – Flexible scoring
  – Initial integration with Tika
  – Full Search Engine functionality and capabilities, in production at large scale (Internet Archive)
• Version 1.1
For the full list, see http://svn.apache.org/repos/asf/nutch/trunk/CHANGES.txt

What Doesn’t?
• Plenty!
• Bug fixes (> 200 issues in JIRA right now with no resolution)
• Nutch 2.0 architecture
  – http://search-lucene.com/m/gbrBF1RMWk9
  – Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM

How I got involved
• In this very class!
  – Okay, well, it used to be called CS599, but you get the picture
• Started out by contributing an RSS parsing plugin
  – My final project in 599
• Moved on from there to
  – NUTCH-88, redesign of the parsing framework
  – NUTCH-139, Metadata container support
  – NUTCH-210, Web Context application file
  – And various other bug fixes, and contributions here and there
  – Mailing list support
  – Wiki support
• Became committer in October 2006
• Helped spin Nutch into Apache TLP, March 2010, Nutch PMC member

Real world application of Nutch
• I work at NASA’s Jet Propulsion Laboratory
• NASA’s Planetary Data System
  – NASA’s archive for all planetary science data collected by missions over the past 30 years
  – Collected 20 TB over the past 30 years
    • Increasing to over 200 TB in the next 3 years!
  – Built up a catalog of all data collected
• Where does Nutch fit in?

Where does Nutch fit into the PDS?
• PDS Management Council decided they want “Google-like” search of the PDS catalog
• Our plan: use Nutch to implement this capability for PDS

PDS Google-like Search Architecture
[Figure: the existing PDS (PDS Catalog, catalog metadata extract) feeding a search engine architecture (e.g. Nutch, Google): Crawler, PDS Parser, Lucene Indexer, Index, and PDS Query Parser, served by pds.war on a Tomcat web server]

Approach
• Export PDS catalog datasets in RDF format (flat files)
• Use Nutch to crawl the RDF files
  – protocol-file plugin in Nutch
• Wrote our own parse-pds plugin
  – Parse the RDF files, and then extract the metadata
• Wrote our own index-pds plugin
  – Index the fields that we want from the parsed metadata
• Wrote our own query-pds plugin
  – Search the index on the fields that we want

Search Interface
[Figure: search interface screenshot]

Results
[Figure: results screenshot]

Lessons Learned
• Nutch currently isn’t exactly simple to deploy or configure
  – There is much discussion on the mailing lists that refers to “magic configuration” properties that aren’t intuitive
• Nutch documentation is currently…lacking
• Once you know how to use Nutch, it is extremely easy to use and a time-saver
• Active participation in the mailing lists and wiki is necessary to use Nutch

Good News
• Nutch is here to stay
  – Only open source implementation for commodity web search
  – If you want to start your own Google++, Nutch is a great place to start
• Participation is welcome
  – Look what happened to me (student -> committer)
  – Plenty of areas to improve (including documentation)

Your Class Project
• It’s probably a good idea to at least take a look at Nutch, whether you use it or not
• You can see how a real implementation of the theory described in class operates
  – Implemented in pure Java (1.5)
• Add/extend
capabilities within Nutch
  – Help finish plugging Nutch into HBase
  – Configure Nutch using Spring
  – Fully integrate Nutch and Solr
  – Fix *important* bugs
  – Add more scoring algorithm implementations

Wrapup
• Thanks for your attention!
• Nutch home page:
  – http://nutch.apache.org
• Mailing lists
  – dev@nutch.apache.org (developer’s list)
  – user@nutch.apache.org (user’s list)
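
As a rough illustration of the parse, index, and query steps listed on the Approach slide, here is a minimal, self-contained Java sketch. It does not use Nutch's plugin API or Lucene, and it is not the actual parse-pds, index-pds, or query-pds code. The class name `PdsSearchSketch`, the field names `title` and `target`, the `http://example.org/pds#` namespace, and the sample record are all hypothetical, chosen only to show the flow: extract metadata from an RDF file, index only the wanted fields, then search on those fields.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class PdsSearchSketch {

    // Hypothetical field names; the real PDS catalog fields are not given in the slides.
    private static final Set<String> INDEXED_FIELDS =
            new HashSet<>(Arrays.asList("title", "target"));

    // field -> term -> ids of the documents containing that term in that field
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    /** Parse step: copy the child elements of the first resource description into a metadata map. */
    public static Map<String, String> parse(String rdfXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed so getLocalName() drops the prefix
        Element root = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        Map<String, String> metadata = new LinkedHashMap<>();
        Node description = firstElement(root.getFirstChild());
        if (description == null) {
            return metadata;
        }
        for (Node n = firstElement(description.getFirstChild()); n != null;
                n = firstElement(n.getNextSibling())) {
            metadata.put(n.getLocalName(), n.getTextContent().trim());
        }
        return metadata;
    }

    /** Returns the given node or its first following sibling that is an element, or null. */
    private static Node firstElement(Node n) {
        while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
            n = n.getNextSibling();
        }
        return n;
    }

    /** Index step: keep only the wanted fields and record which document each term came from. */
    public void addDocument(String docId, Map<String, String> metadata) {
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (!INDEXED_FIELDS.contains(e.getKey())) {
                continue; // fields outside the list are never searchable
            }
            for (String term : e.getValue().toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(e.getKey(), f -> new HashMap<>())
                     .computeIfAbsent(term, t -> new TreeSet<>())
                     .add(docId);
            }
        }
    }

    /** Query step: look up a single term in a single indexed field. */
    public Set<String> search(String field, String term) {
        Map<String, Set<String>> terms = index.get(field);
        if (terms == null) {
            return Collections.emptySet();
        }
        Set<String> ids = terms.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return Collections.emptySet();
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample record standing in for one exported catalog file.
        String rdf =
              "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
            + " xmlns:pds=\"http://example.org/pds#\">"
            + "<rdf:Description rdf:about=\"urn:example:dataset-1\">"
            + "<pds:title>Mars Surface Images</pds:title>"
            + "<pds:target>Mars</pds:target>"
            + "<pds:notes>not indexed</pds:notes>"
            + "</rdf:Description></rdf:RDF>";
        PdsSearchSketch engine = new PdsSearchSketch();
        engine.addDocument("dataset-1", parse(rdf));
        System.out.println(engine.search("target", "mars"));   // prints [dataset-1]
        System.out.println(engine.search("notes", "indexed")); // prints []
    }
}
```

In a real deployment each of these three methods would presumably live in its own plugin behind the corresponding Nutch extension point, with Lucene providing the index; the in-memory map here only stands in for that.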