The Hadoop Distributed File System Konstantin

advertisement
The Hadoop Distributed File System
PaoMin Wu
University at Buffalo
ARCHITECTURE
1. Namenode
stores matadata of the system
keeps all namespace in RAM
2. Datanode
block replica
stores application data
3. HDFS-Client
User applications access the file system using the HDFS
client
HDFS Client Process
ARCHITECTURE
4. Image and Journal
Namespace image = file system metadata
Peresistent record of image = checkpoint
5. CheckpointNode (NameNode)
Protects file system metadata
6. BackupNode (NameNode)
Capable of creating periodic checkpoints
FILE I/O OPERATIONS AND REPLICA
MANGEMENT
FILE I/O OPERATIONS AND REPLICA
MANGEMENT
Sort Benchmark
Future Work
Problem:
NameNode contains all important information
Solution:
Allow multiple namespaces(and NameNodes) to share
the physical storage within a cluster
MapReduce: Simplied Data Processing
on Large Clusters
PaoMin Wu
University at Buffalo
Introduction
•key/value pair
•execution across a set of machines
•handling machine failures
•managing the required inter-machine communication
•runs on a large cluster
•powerful interface
•automatic parallelization
•distribution of large-scale computations
Programming Model
Map, written by the user, takes an input pair and
produces a set of intermediate key/value pairs.
The Reduce function, also written by the user, accepts
an intermediate key and a set of values for that key.
The intermediate values are supplied to the user's
reduce function via an iterator.
Example:
Execution Overflow:
Backup Tasks:
Conclusions
1. Restricting the programming model is beneficial
2. Network bandwidth is a scarce resource
3. Redundant execution can help
References:
The Hadoop Distributed File System
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler
Yahoo!
Sunnyvale, California USA
{Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com
MapReduce: Simplied Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com
Google, Inc.
Download