Hadoop on Demand

Grid Computing at Yahoo!
Sameer Paranjpye
Mahadev Konar
Yahoo!
Outline
• Introduction
– What do we mean by ‘Grid’?
– Technology Overview
• Technologies
– HDFS
– Hadoop Map-Reduce
– Hadoop on Demand
Condor Week 2007
Introduction
What do we mean by ‘Grid’?
• Computing platform that can support many distributed applications
– Runs on dedicated clusters of commodity PCs (a Grid)
– Hardware can be dynamically allocated to a “job”
– Plan to support many applications per Grid
• Good for batch data processing
– Log Processing
– Document Analysis and Indexing
– Web Graphs and Crawling
• Large scale is a primary design goal
– 10,000 PCs per Grid is a design goal (working at 1,000 now)
– Very large data (10 petabytes of storage is a design goal)
• 100+ TB inputs to a single job
• Bandwidth to data is a significant design driver
• Large production deployments
– The number of CPUs that can be applied gates what you can do
– Several clusters of thousands of nodes
Technology Overview
• Hadoop (Our primary Grid project)
– An open source Apache project, started by Doug Cutting
– HDFS, a distributed file system
– Implementation of the Map-Reduce programming model
– http://lucene.apache.org/hadoop
• HOD (Hadoop-on-Demand)
– Adaptor that runs Hadoop tools on batch systems
– Hadoop expressed as a parallel job
– Manages setup, startup, shutdown and cleanup of Hadoop
– Currently supports Condor and Torque
Technologies
HDFS - Hadoop Distributed FS
• Very Large Distributed File System
– We plan to support 10k nodes and 10 PB data
– Current deployment of 1k+ nodes, 1PB data
• Assumes commodity hardware that fails
– Files are replicated to handle hardware failure
– Checksums for corruption detection and recovery
– Continues operation as nodes / racks added / removed
• Optimized for fast batch processing
– Data location exposed to allow computes to move to data
– Stores data in chunks on every node in the cluster
– Provides VERY high aggregate bandwidth
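The chunk/replica scheme above can be sketched in miniature. This is a hypothetical illustration, not HDFS internals: the chunk size mirrors HDFS-style large blocks, but the function names, MD5 checksums and round-robin placement policy are invented for the example.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # large, HDFS-style chunks
REPLICATION = 3                # copies of each chunk on distinct nodes

def chunk_and_checksum(data, chunk_size=CHUNK_SIZE):
    """Split a byte string into chunks, pairing each with a checksum
    so corruption can be detected on read."""
    chunks = []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        chunks.append((chunk, hashlib.md5(chunk).hexdigest()))
    return chunks

def place_replicas(num_chunks, datanodes, replication=REPLICATION):
    """Round-robin each chunk onto `replication` distinct datanodes;
    exposing this placement map is what lets computes move to the data."""
    return {
        i: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        for i in range(num_chunks)
    }
```

Because every node holds chunks, a job that consults the placement map can schedule tasks next to their data, which is where the very high aggregate bandwidth comes from.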
Hadoop DFS Architecture
[Figure: a single Namenode holds the filesystem metadata (name, replica
count, …; e.g. /home/sameerp/foo, 3); clients send metadata ops to the
Namenode and perform block I/O directly against Datanodes spread across
Rack 1 and Rack 2.]
Hadoop Map-Reduce
• Implementation of the Map-Reduce programming model
– Framework for distributed processing of large data sets
– Resilient to nodes failing and joining during a job
– Great for web data and log processing
• Pluggable user code runs in generic reusable framework
– Input records are transformed, sorted and combined to produce a new output
– All actions are pluggable / configurable
• A reusable design pattern
Input | Map | Shuffle | Reduce | Output
(example)
cat * | grep | sort | uniq -c > file
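The pipeline above can be mirrored in a few lines of plain Python. This is a sketch of the design pattern itself (word count), not the Hadoop API:

```python
from collections import defaultdict

def map_phase(records):
    """Map: transform each input record into (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key (Hadoop also sorts here)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    """Reduce: combine each key's values into one output record."""
    for key, values in groups:
        yield key, sum(values)

counts = dict(reduce_phase(shuffle(map_phase(["condor grid", "grid grid"]))))
```

Hadoop runs the same three phases, but with the map and reduce calls distributed across nodes and the shuffle done over the network.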
HOD (Hadoop on Demand)
• Adaptor that enables Hadoop use with batch schedulers
– Provisions Hadoop clusters on demand
– Scheduling is handled by resource managers like Condor
– Requests N nodes from a resource manager and provisions them
with a Hadoop cluster
• Condor interaction
– User specifies: number of nodes, workload to launch
– HOD generates ClassAds for the Hadoop master and slaves and submits
them as Condor jobs
– Cluster comes up when the jobs start running
– HOD launches workload
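The submission step might be sketched as follows. This is hypothetical: the helper name and attribute values are invented for illustration, not HOD's actual output, though `bin/hadoop-daemon.sh start tasktracker` is roughly how a Hadoop worker of that era was started.

```python
def hadoop_submit_description(num_nodes, hadoop_dir="/opt/hadoop"):
    """Build a Condor submit description that queues `num_nodes`
    identical Hadoop slave jobs (attribute values are illustrative)."""
    return "\n".join([
        f"executable   = {hadoop_dir}/bin/hadoop-daemon.sh",
        "universe     = vanilla",
        "arguments    = start tasktracker",
        'requirements = (OpSys == "LINUX")',
        f"queue {num_nodes}",
    ])
```

condor_submit turns each queued job into a ClassAd that the matchmaker pairs with a machine ad; when the jobs start running, the cluster is up.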
HOD (Hadoop on Demand)
• HOD shell
– User interface to HOD is a command shell
– Workloads are specified as command lines
– Example:
% bin/hod -c hodconf -n 100
>> run hadoop-streaming.jar -mapper 'grep condor' -reducer 'uniq -c'
-input /user/sameerp/data -output /user/sameerp/condor
• Work in progress
– Data affinity for workloads
– Implementation of elastic workloads
– Software distribution via BitTorrent
Hadoop on Condor
• Clients launch jobs
• Condor dynamically allocates clusters
• HOD is used to start Hadoop Map-Reduce on the cluster
• Map-Reduce reads/writes data from HDFS
• When done
– Results are stored in HDFS and/or returned to the client
– Condor reclaims nodes
[Figure: Client 1 and Client 2 submit to Condor, which provisions
dynamic Hadoop Map-Reduce clusters; the clusters share a common HDFS.]
Other things in the works
• Record I/O
– Define a structure once, use it in C, Java, Python…
– Export it in a binary or XML format
• Streaming
– A simple way to use existing Unix filters and/or stdin/stdout programs
in any language with Map-Reduce
• Pig - Y! Research
– Higher level data manipulation language, uses Hadoop
– Data analysis tasks expressed as queries, in the style of
SQL or Relational Algebra
– http://research.yahoo.com/project/pig
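The Streaming item above boils down to: any program that reads lines on stdin and writes tab-separated key/value lines on stdout can act as a mapper or reducer. A minimal sketch of such filters (the tab-separated contract is Streaming's; the function names are illustrative):

```python
def streaming_mapper(lines):
    """Mapper filter: emit 'word<TAB>1' for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    """Reducer filter: sum counts of consecutive identical keys
    (the framework delivers reducer input sorted by key)."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rsplit("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"
```

In spirit this is `cat data | mapper | sort | reducer` — exactly the Unix pipeline from the Map-Reduce slide.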
The end