The Hadoop Distributed File System PaoMin Wu University at Buffalo ARCHITECTURE 1. Namenode stores matadata of the system keeps all namespace in RAM 2. Datanode block replica stores application data 3. HDFS-Client User applications access the file system using the HDFS client HDFS Client Process ARCHITECTURE 4. Image and Journal Namespace image = file system metadata Peresistent record of image = checkpoint 5. CheckpointNode (NameNode) Protects file system metadata 6. BackupNode (NameNode) Capable of creating periodic checkpoints FILE I/O OPERATIONS AND REPLICA MANGEMENT FILE I/O OPERATIONS AND REPLICA MANGEMENT Sort Benchmark Future Work Problem: NameNode contains all important information Solution: Allow multiple namespaces(and NameNodes) to share the physical storage within a cluster MapReduce: Simplied Data Processing on Large Clusters PaoMin Wu University at Buffalo Introduction •key/value pair •execution across a set of machines •handling machine failures •managing the required inter-machine communication •runs on a large cluster •powerful interface •automatic parallelization •distribution of large-scale computations Programming Model Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key. The intermediate values are supplied to the user's reduce function via an iterator. Example: Execution Overflow: Backup Tasks: Conclusions 1. Restricting the programming model is beneficial 2. Network bandwidth is a scarce resource 3. Redundant execution can help References: The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com MapReduce: Simplied Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat jeff@google.com, sanjay@google.com Google, Inc.