CSCI 572: Information Retrieval and Search Engines
Summer 2010
• What is Hadoop?
• Where did it come from?
• What are the current versions of Hadoop?
• What can it do?
May-20-10 CS572-Summer2010 CAM-2
• The brainchild of Doug Cutting
• Built out by brilliant engineers and contributors from Yahoo, Facebook, Cloudera, and other companies
• Started in 2006 when code was spun out of Nutch; became a top-level Apache project in 2008
• Has grown into a really large project at Apache with a significant ecosystem
• Hadoop (0.20.0/0.20.2)
– Put your Java hat on
– Go here:
• http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
• If you want to do this on Windows, get Cygwin, VMware, or something else that lets you run Linux
• Run the Map Reduce examples in local mode
• Check on the data generated in your HDFS
– Scaling it out
• Amazon Elastic Map Reduce
• Setting it up on your own cluster: DataNodes and Task/JobTracker
• Listing files
– ./bin/hadoop fs -ls
• Writing files
– ./bin/hadoop fs -put
• Running Map Reduce Jobs
– mkdir input
– cp conf/*.xml input
– ./bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
– cat output/*
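To make the grep example above more concrete, here is a minimal plain-Java sketch of what that job roughly computes: it counts how often each string matching the regular expression occurs in the input lines. It does not use the Hadoop API, and the class name GrepCountSketch, the method countMatches, and the sample lines are hypothetical names chosen for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is NOT the Hadoop grep example's source code.
class GrepCountSketch {

    // Count every substring that matches the regex, across all lines.
    static Map<String, Integer> countMatches(Iterable<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                // Add 1 to the running count for this matched string.
                counts.merge(matcher.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for lines from the conf/*.xml files.
        List<String> lines = Arrays.asList(
                "<name>dfs.replication</name>",
                "see dfs.name.dir and dfs.replication");
        System.out.println(countMatches(lines, "dfs[a-z.]+"));
    }
}
```

With the two sample lines, this prints {dfs.name.dir=1, dfs.replication=2}. In the real job the same counting is spread across map and reduce tasks, and the results land in the output directory.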
• Writing your Mappers and Reducers
– Check out the Map Reduce Tutorial here:
– http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html
– Code for several examples including Word Count
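Before reading the tutorial's code, it may help to see the Word Count data flow on its own. The sketch below is plain Java with no Hadoop dependency: a map step emits (word, 1) pairs, a grouping step stands in for the shuffle, and a reduce step sums the counts per word. The class WordCountSketch and its methods are hypothetical and are not the tutorial's Mapper and Reducer classes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the Word Count data flow; it does NOT use the Hadoop API.
class WordCountSketch {

    // "Map" step: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum all the counts emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map every line, group values by key (the "shuffle"), reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello hadoop", "hello map reduce")));
    }
}
```

Running main prints {hadoop=1, hello=2, map=1, reduce=1}. In Hadoop itself the framework handles the grouping between the two steps, so you only write the map and reduce logic.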
• HBase
– Modeled after Google's Bigtable
• Hive
– Built at Facebook; provides a SQL-like interface on HDFS
• Chukwa
– Log processing
• Pig
– Scientific data analysis language on top of M/R and HDFS
• ZooKeeper
– Distributed systems management
• Stick with 0.20.x
• Lots more information at
– http://hadoop.apache.org
– http://hadoop.apache.org/mapreduce/
– http://hadoop.apache.org/hdfs/
• Project ideas
– Implement a GIS or geometric algorithm in Map Reduce
– Write a REST interface to control HDFS and M/R
– Add new Writable input data formats
– Integrate Solr and Hadoop
• Material inspired by discussions and talks on the Apache mailing lists for Hadoop and by discussions with the rest of the Hadoop community