CSCI 572: Information Retrieval and Search Engines

Introduction to Apache Hadoop

Summer 2010

Outline

• What is Hadoop?

• Where did it come from?

• What are the current versions of Hadoop?

• What can it do?

May-20-10 CS572-Summer2010 CAM-2

Apache Hadoop

• The brainchild of Doug Cutting

• Built out by brilliant engineers and contributors from Yahoo, Facebook, Cloudera, and other companies

• Started in 2006, when the code was spun out of Nutch; became a top-level Apache project in 2008

• Has grown into a very large Apache project with a significant ecosystem

How to get started

• Hadoop (0.20.0/0.20.2)

– Put your Java hat on

– Go here:

• http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html

• If you want to do this on Windows, get Cygwin, VMware, or something else that lets you run Linux

– Run the MapReduce examples in local mode

• Check on the data generated in your HDFS

– Scaling it out

• Amazon Elastic MapReduce

• Setting it up on your own cluster: DataNodes, TaskTrackers, and a JobTracker

Basic Operations

• Listing files

– ./bin/hadoop fs -ls

• Writing files

– ./bin/hadoop fs -put <localsrc> <dst>

• Running MapReduce jobs

– mkdir input

– cp conf/*.xml input

– ./bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

– cat output/*
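
To make the grep example above more concrete, here is a minimal sketch in plain Java of what that job computes: it extracts every substring matching the regular expression and counts how often each distinct match occurs. This is not the Hadoop implementation and uses no Hadoop classes; the class name GrepSketch, its method names, and the sample input lines are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, Hadoop-free sketch of the computation behind the grep example.
class GrepSketch {

    // "Map" step: emit every substring of one line that matches the pattern.
    static List<String> findMatches(String line, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(line);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    // "Reduce" step: sum the occurrences of each distinct match.
    static Map<String, Long> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String match : findMatches(line, pattern)) {
                counts.merge(match, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up lines standing in for the copied conf/*.xml files.
        List<String> lines = List.of(
            "<name>dfs.replication</name>",
            "<name>dfs.name.dir</name>",
            "<description>see dfs.replication</description>");
        countMatches(lines, "dfs[a-z.]+")
            .forEach((match, count) -> System.out.println(count + "\t" + match));
    }
}
```

On a real cluster the same two steps run in parallel over the blocks of the input files, which is why the results have to be read back from the output directory afterwards.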

Advanced Topics

• Writing your Mappers and Reducers

– Check out the MapReduce tutorial here:

• http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html

– Code for several examples, including Word Count
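
Before reading the tutorial's Word Count code, it may help to see the map, shuffle, and reduce phases spelled out in plain Java. The sketch below runs in a single process and deliberately avoids the Hadoop API, so the class WordCountSketch and its method names are assumptions for illustration, not the tutorial's Mapper and Reducer classes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the MapReduce word count data flow.
class WordCountSketch {

    // Map phase: turn one line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key (the framework does this on a cluster).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the list of values for one word into a single count.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: run the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        wordCount(List.of("hello hadoop", "hello world"))
            .forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real job you write only the map and reduce logic; splitting the input, shuffling, and writing the output are handled by the framework.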

Other Hadoop ecosystem projects

• HBase

– Modeled on Google's Bigtable

• Hive

– Built at Facebook; provides a SQL-like interface to data in HDFS

• Chukwa

– Log processing

• Pig

– Scientific data analysis language on top of MapReduce and HDFS

• ZooKeeper

– Distributed systems management

No releases in a while

• Stick with 0.20.x

Wrapup

• Lots more information at

– http://hadoop.apache.org

– http://hadoop.apache.org/mapreduce/

– http://hadoop.apache.org/hdfs/

• Project ideas

– Implement a GIS or geometrical algorithm in MapReduce

– Write a REST interface to control HDFS and MapReduce

– Add new Writable input data formats

– Integrate Solr and Hadoop
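
For the Writable project idea, the following sketch shows the general shape such a type takes: one method that serializes the fields to a DataOutput and one that reads them back from a DataInput in the same order. To stay compilable without Hadoop on the classpath, the class does not implement Hadoop's Writable interface; it only mirrors its two method signatures. The record type GeoPointRecord, its fields, and the roundTrip helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type shaped like a Hadoop Writable (assumption: not tied to the real interface).
class GeoPointRecord {
    double latitude;
    double longitude;
    String label = "";

    // A no-argument constructor lets a framework create an empty instance and then fill it.
    GeoPointRecord() {}

    GeoPointRecord(double latitude, double longitude, String label) {
        this.latitude = latitude;
        this.longitude = longitude;
        this.label = label;
    }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeDouble(latitude);
        out.writeDouble(longitude);
        out.writeUTF(label);
    }

    // Deserialize the fields in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        latitude = in.readDouble();
        longitude = in.readDouble();
        label = in.readUTF();
    }

    // Helper for illustration: serialize to bytes and read back into a fresh instance.
    static GeoPointRecord roundTrip(GeoPointRecord original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            GeoPointRecord copy = new GeoPointRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GeoPointRecord copy = roundTrip(new GeoPointRecord(34.02, -118.28, "campus"));
        System.out.println(copy.label + " " + copy.latitude + " " + copy.longitude);
    }
}
```

A real implementation would declare the Hadoop interface on the class and would need a matching input format to produce these records from files.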

Acknowledgements

• Material inspired by discussions and talks on the Apache mailing lists for Hadoop and by discussions with the rest of the Hadoop community
