CSCI 572: Information Retrieval and Search Engines


Introduction to Apache Hadoop


Summer 2010


• What is Hadoop?

• Where did it come from?

• What are the current versions of Hadoop?

• What can it do?

May-20-10 CS572-Summer2010 CAM-2

Apache Hadoop

• The brainchild of Doug Cutting


• Built out by brilliant engineers and contributors from Yahoo, Facebook, Cloudera, and other companies

• Started in 2007/2008 when code was spun out of Apache Nutch


• Has grown into a really large project at Apache with a significant ecosystem


How to get started

• Hadoop (0.20.0/0.20.2)

– Put your Java hat on

– Go here:


• If you want to do this on Windows, get Cygwin, or use VMware or another tool that lets you run Linux

– Run the MapReduce examples in local mode

• Check on the data generated in your HDFS

– Scaling it out

• Amazon Elastic MapReduce


• Setting it up on your own cluster: DataNodes and



Basic Operations

• Listing files

– ./bin/hadoop fs -ls

• Writing files

– ./bin/hadoop fs -put

• Running MapReduce jobs

– mkdir input

– cp conf/*.xml input

– ./bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

– cat output/*
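
To make the grep job above more concrete, here is a minimal Python sketch of what it appears to compute: every string matching the regular expression is counted across the input files, and the matches are listed by frequency. This is an in-memory illustration only, not Hadoop code. The function names (map_matches, reduce_counts, grep_job) and the sample strings are hypothetical, and the ordering by descending count is an assumption about the example's output.

```python
import re
from collections import defaultdict


def map_matches(text, pattern):
    """Map step: emit a (match, 1) pair for every regex match in the text."""
    return [(m.group(0), 1) for m in re.finditer(pattern, text)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct match."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def grep_job(documents, pattern):
    """Run the map step over every document, then reduce and sort by count."""
    pairs = []
    for text in documents:
        pairs.extend(map_matches(text, pattern))
    totals = reduce_counts(pairs)
    # Most frequent match first; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical stand-ins for the XML files copied into input/
docs = [
    "<name>dfs.replication</name> <name>dfs.name.dir</name>",
    "<name>dfs.replication</name>",
]
result = grep_job(docs, r"dfs[a-z.]+")
print(result)
```

On a real cluster the map step runs in parallel over blocks of the input files and the framework handles the grouping; the sketch only mirrors the data flow.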


Advanced Topics

• Writing your Mappers and Reducers

– Check out the MapReduce Tutorial here:



– Code for several examples, including Word Count
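
As a rough picture of what a mapper and a reducer do in Word Count, the following Python sketch walks through the three phases (map, shuffle, reduce) on in-memory data. It is not the tutorial's Java code and does not use Hadoop's API; the names mapper, shuffle, reducer, and word_count are chosen here for illustration, and the shuffle function stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for each whitespace-separated word in one input line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Chain map, shuffle, and reduce over a list of input lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


counts = word_count(["hello hadoop", "hello world"])
print(counts)
```

In a real job you would write only the mapper and reducer; splitting the input, grouping by key, and distributing the work across nodes are handled by the framework.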


Other Hadoop ecosystem projects


– Big Table


– Built at Facebook, provides a SQL interface on HDFS


– Log Processing


– Scientific data analysis language on top of M/R and HDFS


– Distributed systems management


• No releases in a while

• Stick with 0.20.x



• Lots more information at




• Project ideas

– Implement a GIS or geometric algorithm in MapReduce

– Write a REST interface to control HDFS and M/R

– Add new Writable input data formats

– Integrate Solr and Hadoop



• Material inspired by discussions and talks on the Apache mailing lists for Hadoop and by discussions with the rest of the Hadoop community
