Hadoop File Systems

Lei Xu
Brief Introduction
 Hadoop
 An Apache project for data-intensive applications
 Typical application: MapReduce (OSDI’04), a distributed programming model for massive-data computation
 Crawl and index web pages (Y!)
 Analyze popular topics and trends (Twitter)
 Led by Yahoo!/Facebook/Cloudera
Brief Introduction (cont’d)
 Hadoop Distributed File System (HDFS)
 A scalable distributed file system to serve Hadoop
MapReduce applications
 Borrows the essential ideas from the Google File System
 Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles (SOSP’03)
 Shares the same design assumptions
Google File System
 A scalable distributed file system designed for:
 Data-intensive applications (mainly MapReduce)
 Web page indexing
 Later spread to other applications
 E.g., Gmail, Bigtable, App Engine
 Fault-tolerant
 Low-cost hardware
 High throughput
Google File System (cont’d)
 Departure from other file system assumptions
 Runs on top of commodity hardware
 Component failures are common
 Files are huge
 Basic block size 64~128 MB
 1~64 KB in traditional file systems (Ext3, NTFS, etc.)
 Massive-data/data-intensive processing
 Large streaming reads and small random reads
 Large, sequential writes
 No (or barely any) random writes
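A rough count of blocks per file shows why the block size matters: fewer, larger blocks mean less metadata to track per file. The sketch below is plain arithmetic. The 1 GB file is a hypothetical example, and the two block sizes are picked from the ranges listed above.

```python
KB, MB, GB = 1024, 1024**2, 1024**3


def block_count(file_size: int, block_size: int) -> int:
    """Number of fixed-size blocks needed to hold file_size bytes (rounded up)."""
    return (file_size + block_size - 1) // block_size


file_size = 1 * GB  # hypothetical file size
print(block_count(file_size, 64 * MB))  # 16
print(block_count(file_size, 4 * KB))   # 262144
```

With 64 MB blocks the master tracks 16 entries for this file; with 4 KB blocks it would track 262144.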
Hadoop DFS Assumptions
 In addition to the assumptions of the Google File System, HDFS assumes:
 Simple Coherency Model
 Write-once-read-many
 Once a file is created, written and closed, it cannot be changed.
 Moving Computation Is Cheaper than Moving Data
 “Semi-Location-Aware” computation
 Tries to schedule computations close to the data they use
 Portability Across Heterogeneous Hardware and Software
 Written in Java, with multi-platform support
 The Google File System is written in C++ and runs on Linux
 Stores data on top of existing file systems (NTFS/Ext4/Btrfs…)
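The write-once-read-many model can be illustrated with a toy in-memory file. This is only a sketch of the coherency rule stated above; `WriteOnceFile` is a hypothetical class, not part of any Hadoop API.

```python
class WriteOnceFile:
    """Toy file that accepts writes until closed, then becomes read-only."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []
        self._closed = False

    def write(self, data: bytes) -> None:
        if self._closed:
            raise PermissionError("file is closed; its contents are immutable")
        self._chunks.append(data)

    def close(self) -> None:
        self._closed = True

    def read(self) -> bytes:
        return b"".join(self._chunks)


f = WriteOnceFile()
f.write(b"crawl-")
f.write(b"results")
f.close()
print(f.read())          # any number of readers see the same bytes
try:
    f.write(b"more")     # rejected: the file was already closed
except PermissionError as err:
    print("rejected:", err)
```

Because closed files never change, readers need no locking or cache invalidation, which is what keeps the coherency model simple.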
HDFS Architecture
 Master/Slave Architecture
 NameNode
 Metadata Server
 File location (file name -> DataNodes)
 File attributes (atime/ctime/mtime, size, number of replicas, etc.)
 DataNode
 Manages the storage attached to the node it runs on
 Client
 Producers and consumers of data
HDFS Architecture (cont’d)
 Metadata Server
 Only one NameNode in one cluster
 Single point of failure
 Potential performance bottleneck
 Manage the file system namespace
 Traditional hierarchical namespace
 Keep all file metadata in memory for fast access
 The memory size of NameNode determines how many files
can be supported
 Executes file system namespace operations:
 Open/close/rename/create/unlink…
 Returns the locations of data blocks
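The metadata-server role described above can be sketched as two in-memory maps: one from path to file metadata and one from block ID to the DataNodes holding it. `ToyNameNode`, `FileMeta` and the block and node names are all hypothetical; the real NameNode is far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    replication: int
    blocks: list[str] = field(default_factory=list)  # ordered block IDs


class ToyNameNode:
    """Keeps the whole namespace in memory, as the slide describes."""

    def __init__(self) -> None:
        self.namespace: dict[str, FileMeta] = {}          # path -> metadata
        self.block_locations: dict[str, list[str]] = {}   # block ID -> DataNodes

    def create(self, path: str, replication: int = 3) -> None:
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = FileMeta(replication)

    def add_block(self, path: str, block_id: str, datanodes: list[str]) -> None:
        self.namespace[path].blocks.append(block_id)
        self.block_locations[block_id] = list(datanodes)

    def rename(self, src: str, dst: str) -> None:
        # A pure metadata operation: no block data moves.
        self.namespace[dst] = self.namespace.pop(src)

    def get_block_locations(self, path: str) -> list[tuple[str, list[str]]]:
        return [(b, self.block_locations[b]) for b in self.namespace[path].blocks]


nn = ToyNameNode()
nn.create("/logs/day1")
nn.add_block("/logs/day1", "blk_1", ["dn1", "dn2", "dn3"])
nn.rename("/logs/day1", "/archive/day1")
print(nn.get_block_locations("/archive/day1"))
```

Every file and block adds entries to these maps, which is why the NameNode's memory bounds the number of files.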
NameNode (cont’d)
 Maintains system-wide activities
 E.g., creating new replicas of file data, garbage collection, load balancing, etc.
 Periodically communicates with the DataNodes to collect their status
 Is the DataNode alive?
 Is the DataNode overloaded?
DataNode
 Storage server
 Stores fixed-size data blocks on local file systems (ext4/zfs/btrfs)
 Serves read/write operations from the clients
 Creates, deletes and replicates data blocks upon instruction from the NameNode
 Block size = 64MB
Client
 Application-level implementation
 Does not provide a POSIX API
 Hadoop has a FUSE interface
 FUSE: Filesystem in Userspace
 Has limited functionality (e.g., no random-write support)
 Queries the NameNode for file locations, then
 Contacts the corresponding DataNodes for file I/O
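The two-step client read path (metadata from the NameNode, bytes from the DataNodes) can be sketched with plain dictionaries standing in for the servers. The cluster state, paths and names below are hypothetical, and no real RPC is involved.

```python
# Hypothetical cluster state: the NameNode knows where blocks live,
# the DataNodes hold the actual bytes.
namenode = {
    "/logs/day1": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
}
datanodes = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}


def read_file(path: str) -> bytes:
    data = bytearray()
    for block_id, holders in namenode[path]:       # 1. ask the NameNode
        for dn in holders:                         # 2. fetch from a DataNode
            block = datanodes.get(dn, {}).get(block_id)
            if block is not None:
                data += block
                break
        else:
            raise IOError(f"no reachable replica for {block_id}")
    return bytes(data)


print(read_file("/logs/day1"))
```

The NameNode never touches file contents here; it only hands out locations, so bulk data traffic stays between clients and DataNodes.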
Data Replication
 Files are stored as a sequence of blocks
 The blocks (typically 64MB) are replicated for fault tolerance
 Replication factor is configurable per file
 Can be specified at creation time, and can be changed later
 The NameNode decides how to replicate blocks. It periodically receives from each DataNode:
 Heartbeat, which implies the DataNode is alive
 Blockreport, which contains a list of all blocks on a DataNode
 When a DataNode is down, the NameNode re-replicates all blocks on that DataNode to other active DataNodes to restore the replication factor
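The re-replication decision can be sketched as a pure function over the block map built from Blockreports. This is a minimal illustration under assumed data structures; `re_replicate` and its target-selection rule (the first live nodes in sorted order) are hypothetical, not the NameNode's actual policy.

```python
def re_replicate(block_map: dict[str, set[str]],
                 live_nodes: set[str],
                 replication: int) -> dict[str, list[str]]:
    """Return, per under-replicated block, the DataNodes that should get a new copy.

    block_map maps block ID -> DataNodes that reported holding a replica.
    """
    plan: dict[str, list[str]] = {}
    for block_id, holders in block_map.items():
        alive = holders & live_nodes
        missing = replication - len(alive)
        if missing > 0 and alive:                  # need a surviving source copy
            candidates = sorted(live_nodes - alive)
            plan[block_id] = candidates[:missing]
    return plan


block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live = {"dn2", "dn3", "dn4", "dn5"}                # dn1 stopped sending heartbeats
print(re_replicate(block_map, live, replication=3))  # {'blk_1': ['dn4']}
```

Only `blk_1` lost a replica when `dn1` went down, so only that block is scheduled for copying.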
Data Replication (cont’d)
 Rack Awareness
 A Hadoop instance runs on a cluster of computers spread across many racks:
 Nodes in the same rack are connected by one switch
 Communication between two nodes in different racks goes through multiple switches
 Slower than communication between nodes in the same rack
 One rack may fail due to network/power issues
 Rack-aware placement improves data reliability, availability and network bandwidth utilization
Data Replication (cont’d)
 Rack Awareness (cont’d)
 In the common case, the replication factor is three
 Two replicas are placed on two different nodes in the same rack
 The third replica is placed on a node in a remote rack
 Improves write performance
 2/3 of the writes stay within the same rack, which is faster
 Without compromising data reliability
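The placement rule above (two replicas in the writer's rack, one in a remote rack) can be sketched as follows. The topology, rack names and the `place_replicas` helper are hypothetical, and random choice among eligible nodes is an assumption made for illustration.

```python
import random


def place_replicas(topology: dict[str, list[str]], writer_rack: str, rng=random):
    """Pick three (rack, node) targets: two in the local rack, one in a remote rack."""
    local_nodes = topology[writer_rack]
    if len(local_nodes) < 2:
        raise ValueError("need at least two nodes in the local rack")
    remote_racks = [r for r in topology if r != writer_rack and topology[r]]
    if not remote_racks:
        raise ValueError("need at least one non-empty remote rack")
    first, second = rng.sample(local_nodes, 2)     # two distinct local nodes
    remote_rack = rng.choice(remote_racks)
    third = rng.choice(topology[remote_rack])
    return [(writer_rack, first), (writer_rack, second), (remote_rack, third)]


topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"], "rack3": ["n6"]}
print(place_replicas(topology, "rack1"))
```

Losing one whole rack still leaves at least one replica, while two of the three writes never cross the inter-rack link.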
Replica Selection
 For READ operation:
 Minimizes bandwidth consumption and latency
 Prefers the nearest node:
 If there is a replica on the same node as the reader, it is preferred
 The cluster may span multiple data centers; replicas in the same data center are preferred
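Replica selection for reads can be sketched as picking the replica with the smallest topological distance to the reader. Locations are modeled as hypothetical `(data_center, rack, node)` tuples, and the distance values 0 to 3 are assumed for illustration only.

```python
Location = tuple[str, str, str]  # (data center, rack, node)


def distance(a: Location, b: Location) -> int:
    if a == b:
        return 0        # same node
    if a[:2] == b[:2]:
        return 1        # same rack
    if a[0] == b[0]:
        return 2        # same data center
    return 3            # different data centers


def pick_replica(reader: Location, replicas: list[Location]) -> Location:
    """Choose the replica closest to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))


reader = ("dc1", "rack1", "n1")
replicas = [("dc2", "rack9", "n7"), ("dc1", "rack2", "n4"), ("dc1", "rack1", "n2")]
print(pick_replica(reader, replicas))  # ('dc1', 'rack1', 'n2')
```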
Filesystem Metadata
 HDFS stores all file metadata on the NameNode
 An EditLog
 Records every change made to filesystem metadata
 For failure recovery
 Similar to the journal in journaling file systems (Ext3/NTFS)
 An FSImage
 Stores the mapping of blocks to files and the file attributes
 EditLog and FSImage are stored on the NameNode
Filesystem Metadata (cont’d)
 A DataNode has no knowledge of HDFS files
 It only stores data blocks as regular files on its local file system
 With a checksum for data integrity
 It periodically sends the NameNode a Blockreport that lists all blocks stored on this DataNode
 Only the DataNode has knowledge about the
availability of one block replica.
Filesystem Metadata (cont’d)
 When the NameNode starts up, it:
 Loads the FSImage and EditLog from the local file system
 Applies the EditLog entries to the FSImage
 Creates a new FSImage as the latest checkpoint and stores it permanently on the local file system
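The start-up sequence above amounts to replaying a log on top of a snapshot. In the sketch below the FSImage is a dictionary from path to block IDs and the EditLog is a list of tuples; both formats and the operation names are hypothetical stand-ins for the real on-disk structures.

```python
def apply_edit(namespace: dict[str, list[str]], edit: tuple) -> None:
    """Apply one logged namespace change to the in-memory namespace."""
    op, *args = edit
    if op == "create":
        namespace[args[0]] = []
    elif op == "add_block":
        namespace[args[0]].append(args[1])
    elif op == "rename":
        namespace[args[1]] = namespace.pop(args[0])
    elif op == "delete":
        namespace.pop(args[0], None)
    else:
        raise ValueError(f"unknown edit: {op}")


def start_up(fsimage: dict[str, list[str]], editlog: list[tuple]):
    # 1. Load the last checkpoint.
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    # 2. Replay every change recorded since that checkpoint.
    for edit in editlog:
        apply_edit(namespace, edit)
    # 3. The merged state is the new checkpoint; the log can start empty again.
    return namespace, []


fsimage = {"/a": ["blk_1"]}
editlog = [("create", "/b"), ("add_block", "/b", "blk_2"), ("rename", "/a", "/c")]
new_image, new_log = start_up(fsimage, editlog)
print(new_image, new_log)
```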
Communication Protocol
 A Hadoop-specific RPC on top of TCP/IP
 The NameNode is simply a server that only responds to requests issued by DataNodes or clients
 ClientProtocol.java – client protocol
 DatanodeProtocol.java – DataNode protocol
Robustness
 Primary objective of HDFS:
 Reliable despite component failures
 In a typical large cluster (>1K nodes), component
failures are common
 Three common types of failures:
 NameNode failures
 DataNode failures
 Network failures
Robustness (cont’d)
 Heartbeats
 Each DataNode sends heartbeats to the NameNode
 Carrying system status and block reports
 The NameNode marks DataNodes without recent heartbeats as dead
 Does not forward I/O to them
 Marks all data blocks on these DataNodes as unavailable
 Re-replicates these blocks if necessary (according to the replication factor)
 Detects both network failures and DataNode crashes
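Heartbeat-based failure detection reduces to remembering when each DataNode was last heard from. The sketch below assumes a hypothetical 30-second timeout and passes timestamps in explicitly so the example is deterministic; `HeartbeatMonitor` is not a Hadoop class.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per DataNode and reports silent ones as dead."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, datanode: str, now: float) -> None:
        self.last_seen[datanode] = now

    def dead_nodes(self, now: float) -> set[str]:
        return {dn for dn, t in self.last_seen.items() if now - t > self.timeout}


monitor = HeartbeatMonitor(timeout=30.0)   # assumed timeout, in seconds
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=25.0)         # dn2 stays silent
print(monitor.dead_nodes(now=40.0))        # {'dn2'}
```

A crashed DataNode and one cut off by a network failure look identical from the NameNode's side: both simply stop sending heartbeats.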
Robustness (cont’d)
 Re-Balancing
 Automatically moves data from one DataNode to another
 If the free space on a DataNode falls below a threshold
 Data-Integrity
 A block of data may be corrupted
 Disk faults, network faults, buggy software
 The client computes a checksum for each block and stores the checksums in a separate hidden file in the HDFS namespace
 Verifies the data against the stored checksum when reading it
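Checksum verification on read can be sketched with the standard library. CRC-32 via `zlib.crc32` is used here only as an example checksum; the slides do not say which algorithm HDFS uses, and `verified_read` is a hypothetical helper.

```python
import zlib


def checksum(block: bytes) -> int:
    return zlib.crc32(block)


def verified_read(block: bytes, expected: int) -> bytes:
    """Return the block only if it still matches the checksum stored at write time."""
    if checksum(block) != expected:
        raise IOError("checksum mismatch: block is corrupted, try another replica")
    return block


original = b"some block payload"
stored_sum = checksum(original)            # computed by the client at write time

print(verified_read(original, stored_sum) == original)   # True
corrupted = b"some block pay1oad"          # one flipped character
try:
    verified_read(corrupted, stored_sum)
except IOError as err:
    print("detected:", err)
```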
Robustness (cont’d)
 Metadata failures
 FSImage and EditLog are the central data structures
 Once they are corrupted, HDFS cannot rebuild the namespace or access data
 The NameNode can be configured to maintain multiple copies of FSImage and EditLog
 E.g., one FSImage/EditLog on the local machine and another stored on a mounted remote NFS server
 Reduces update performance
 If the NameNode goes down, the cluster must be restarted
Data Organization
 Data Blocks
 HDFS is designed to support very large files and
streaming I/Os
 A file is chopped up into 64MB blocks
 Reduces the number of connection establishments and speeds up TCP transfers
 If possible, each block of a file resides on a different DataNode
 For future parallel I/O and computation (MapReduce)
Data Organization (cont’d)
 Staging
 When writing a new file:
 The client first caches the file data in a temporary local file until the cached data exceeds one HDFS block
 Then the client contacts the NameNode, which assigns a DataNode
 The client flushes the cached data to the chosen DataNode
 Fully utilizes the network bandwidth
Data Organization (cont’d)
 Replication Pipeline
 A client obtains a list of DataNodes to flush one block to
 The client first flushes the data to the first DataNode
 The first DataNode receives the data in small portions (4 KB), writes each portion to local storage, and forwards it to the next DataNode in the list
 The second DataNode does the same as the first
 The total transfer time for one block (64 MB) with three replicas is:
 T(64 MB) + 2 × T(4 KB) with pipelining
 3 × T(64 MB) without pipelining
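The two transfer-time formulas can be evaluated numerically. The sketch assumes T(size) = size / bandwidth with a hypothetical link speed of 100 MB/s and ignores latency and disk time, so the numbers are illustrative only.

```python
KB, MB = 1024, 1024**2


def transfer_time(size_bytes: int, bandwidth: float) -> float:
    """T(size): seconds to push size_bytes over one link (assumed model)."""
    return size_bytes / bandwidth


def pipelined(block: int, portion: int, replicas: int, bandwidth: float) -> float:
    # The last replica finishes one portion-time per extra hop after the first.
    return transfer_time(block, bandwidth) + (replicas - 1) * transfer_time(portion, bandwidth)


def sequential(block: int, replicas: int, bandwidth: float) -> float:
    # Each replica receives the whole block only after the previous one is done.
    return replicas * transfer_time(block, bandwidth)


bandwidth = 100 * MB  # bytes per second, hypothetical
print(f"pipelined:  {pipelined(64 * MB, 4 * KB, 3, bandwidth):.3f} s")
print(f"sequential: {sequential(64 * MB, 3, bandwidth):.3f} s")
```

Under these assumptions the pipelined write takes roughly 0.64 s against 1.92 s for the sequential one: the extra cost per replica is one 4 KB portion rather than one full block.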
Replication Pipeline
 The client asks the NameNode where to put the data
 The client pushes data to the DataNodes linearly to fully utilize network bandwidth
 The secondary replicas reply to the primary. Then the primary replies to the client to report success.
* The figure on this slide is from the paper “The Google File System”
Google vs. Y!/Facebook/Amazon…
Google:
• Google File System
• MapReduce
• BigTable
Y!/Facebook/Amazon…:
• Hadoop DFS
• Hadoop MapReduce
• HBase
Known Issues and Research Interests
 The NameNode is a single point of failure
 It also limits the total number of files HDFS can support
 RAM limitation
 Google has changed its one-master architecture to a multi-master cluster
 However, the details have not been disclosed
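The RAM limitation can be made concrete with a back-of-the-envelope estimate. The figure of about 150 bytes per namespace object (file or block) is an assumed rule of thumb that does not come from these slides, and the 16 GB heap is a hypothetical example.

```python
GB = 1024**3
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a figure from these slides


def max_files(heap_bytes: int, blocks_per_file: int = 1) -> int:
    """Rough upper bound on files a NameNode heap can hold under the assumption above."""
    objects_per_file = 1 + blocks_per_file   # one file entry plus its block entries
    return heap_bytes // (objects_per_file * BYTES_PER_OBJECT)


heap = 16 * GB  # hypothetical NameNode heap
print(max_files(heap))                       # small files, one block each
print(max_files(heap, blocks_per_file=16))   # 1 GB files with 64 MB blocks
```

Under these assumptions the limit is in the tens of millions of files, and it depends on the number of files and blocks rather than on the bytes stored, which is why many small files are costly for HDFS.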
Known Issues and Research Interests (cont’d)
 Uses replication to provide data reliability
 Same problems as RAID-1?
 Apply RAID techniques to HDFS?
 “DiskReduce: RAID for Data-Intensive Scalable
Computing”, PDSW’09
Known Issues and Research Interests (cont’d)
 Energy Efficiency
 DataNodes stay powered on for data availability
 However, there may be no MapReduce computations running on them
 Wasted energy
Conclusion
 The Hadoop Distributed File System is designed to serve MapReduce computations
 Provides highly reliable storage
 Supports massive amounts of data
 Optimizes data placement policies based on the topology of data centers
 Large companies build their core businesses on
top of these infrastructures
 Google: GFS/MapReduce/BigTable
 Yahoo!/Facebook/Amazon/Twitter/NY Times:
 HDFS Architecture Guide:
