HDFS/GFS
Outline
• Requirements for a Distributed File System
• HDFS
– Architecture
– Read/Write
• Research Directions
– Popularity
– Failures
– Network
Properties of a Data Center
• Servers are built from commodity devices
– Failure is extremely common
– Servers only have a limited amount of HDD space
• Network is over-subscribed
– Bandwidth between servers differs depending on where they sit in the topology
• Demanding applications
– High throughput, low latency
• Resources are grouped into failure zones
– Independent units of failure
Data-Center Architecture
[Figure: over-subscribed data-center network with 10GB, 25GB, and 100GB links]
Data-Center Architecture
[Figure: servers grouped into Failure Domain 1 and Failure Domain 2]
Goals for a Data Center File System
• Reliable
– Overcome server failures
• High performing
– Provide good performance to applications
• Aware of network disparities
– Make data local to the applications
Common Design Principles
• For performance: Partition the data (sketched in the code below)
– Split data into chunks (blocks) and distribute them
• Provides high throughput
– Many readers can fetch different chunks in parallel
• Better than everyone reading the same copy of the file
How data is partitioned across nodes
• For reliability: Replicate the data
– Overcome failures by making copies
• At least one copy should always be online
How data is duplicated across nodes
• For network disparity: Rack-aware allocation
– Read from the closest copy
– Write to the closest location
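A minimal sketch of the partitioning principle, assuming a 128 MB block size and made-up node names. This is illustrative Java, not HDFS code; replication and rack-aware placement are sketched later in the deck.

```java
import java.util.*;

// Illustrative sketch only: split a file into fixed-size blocks and spread
// the blocks across nodes so readers can fetch them in parallel.
class PartitionSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;   // assumed 128 MB blocks

    // A file of fileSize bytes becomes ceil(fileSize / BLOCK_SIZE) blocks.
    static int numBlocks(long fileSize) {
        return (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE);
    }

    // Naive round-robin distribution of blocks over nodes.
    static Map<Integer, String> distribute(long fileSize, List<String> nodes) {
        Map<Integer, String> blockToNode = new LinkedHashMap<>();
        for (int b = 0; b < numBlocks(fileSize); b++) {
            blockToNode.put(b, nodes.get(b % nodes.size()));
        }
        return blockToNode;
    }

    public static void main(String[] args) {
        // A 300 MB file becomes three blocks on three different nodes.
        System.out.println(distribute(300L * 1024 * 1024,
                Arrays.asList("node1", "node2", "node3")));
        // -> {0=node1, 1=node2, 2=node3}
    }
}
```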
Outline
• Requirements for a Distributed File System
• HDFS
– Architecture
– Read/Write
• Research Directions
– Popularity
– Failures
– Network
HDFS Architecture
• Name Node – the master (only one per data center)
– All reads/writes go through the master
– Manages the data nodes
• Detects failures – triggers re-replication
• Tracks performance
– Tracks the location of blocks
• Maintains the block-to-node mapping
• Tracks the status of data nodes
• Rebalances the data center
• Orchestrates reads and writes
• Data Node
– One per server
– Stores the blocks
• Tracks the status of its blocks
• Ensures the integrity of its blocks
[Figure: one Name Node coordinating several Data Nodes, each storing blocks]
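A hedged example of this split of responsibilities from the client's point of view, using the Hadoop FileSystem API (the path is hypothetical). The metadata query below is answered by the Name Node; the actual bytes would later be streamed from Data Nodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up the cluster config on the classpath
        FileSystem fs = FileSystem.get(conf);           // client handle for the fs.defaultFS file system
        Path file = new Path("/data/example.txt");      // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        // The Name Node answers this metadata query: which hosts hold each block.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + loc.getOffset()
                    + " on " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```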
What is a Distributed FS Write?
• HDFS
– For high performance
• Make N copies of the data to be written
• Default N = 3
[Figure: client writes block B; the HDFS master directs three copies of B to data nodes]
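A sketch of an application-level write through the Hadoop FileSystem API (path and contents are hypothetical). The single create/write below fans out into N = 3 replicated block writes behind the scenes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class WriteWithReplicas {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");               // N = 3 is already the HDFS default
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt");      // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) { // one application-level write...
            out.writeUTF("hello hdfs");                  // ...becomes three replicated block writes
        }
        fs.close();
    }
}
```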
What is a Distributed FS Write?
• HDFS
– For fault tolerance
• Place in two different fault domains
• 2 copies in the same rack
• 1 in a different rack
[Figure: block B replicated across Zone 1 and Zone 2, two copies in one zone and one in the other]
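An illustrative sketch, in plain Java rather than HDFS source code, of the placement rule described above: two replicas in the writer's rack and one in a different rack. Rack and node names are made up, and the sketch assumes at least two nodes in the writer's rack.

```java
import java.util.*;

// Illustrative sketch of rack-aware replica placement (not the HDFS implementation).
class RackAwarePlacement {
    static List<String> place(String writerRack, Map<String, List<String>> nodesByRack) {
        List<String> replicas = new ArrayList<>();
        List<String> sameRack = nodesByRack.get(writerRack);
        replicas.add(sameRack.get(0));                    // replica 1: writer's rack
        replicas.add(sameRack.get(1));                    // replica 2: same rack (cheap to write)
        for (Map.Entry<String, List<String>> e : nodesByRack.entrySet()) {
            if (!e.getKey().equals(writerRack)) {         // replica 3: a different fault domain
                replicas.add(e.getValue().get(0));
                break;
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("rack1", Arrays.asList("node1", "node2"));
        racks.put("rack2", Arrays.asList("node3", "node4"));
        System.out.println(place("rack1", racks));        // [node1, node2, node3]
    }
}
```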
What is a Distributed FS Write?
• HDFS
– For network awareness
• Currently does nothing: it just picks two racks at random
What is a Distributed FS Read?
• HDFS
– For network awareness/performance
• Pick the closest copy to read from
– Nothing specific for reliability
[Figure: client reads block B; the Name Node directs it to the closest replica across Zone 1 and Zone 2]
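An illustrative sketch of "pick the closest copy": rank replicas by a simple distance (same node, then same rack, then a different rack). The real HDFS client performs this selection internally when a file is opened; the names and types below are hypothetical.

```java
import java.util.*;

// Illustrative sketch of replica selection by network distance (not HDFS code).
class ClosestReplica {
    static String pick(String readerNode, String readerRack,
                       Map<String, String> replicaToRack) {
        String best = null;
        int bestCost = Integer.MAX_VALUE;
        for (Map.Entry<String, String> e : replicaToRack.entrySet()) {
            int cost = e.getKey().equals(readerNode) ? 0        // data is local to the reader
                     : e.getValue().equals(readerRack) ? 1      // same rack
                     : 2;                                       // different rack
            if (cost < bestCost) {
                bestCost = cost;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> replicas = Map.of("node1", "rack1", "node3", "rack2");
        System.out.println(pick("node2", "rack1", replicas));   // node1 (same rack as the reader)
    }
}
```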
Implications of Read/Write Semantics
• One application write == 3 HDFS writes
– Writes are costly!
– HDFS is optimized for write-once, read-many workloads
• What is an update/edit? Rewrite the blocks?
[Figure: client asks to modify block B, which is replicated across Zone 1 and Zone 2]
Implications of Read/Write Semantics
• One application write == 3 HDFS writes
– Writes are costly!
– HDFS is optimized for write-once, read-many workloads
• An update/edit:
– Delete the old data + write the new data
[Figure: updating block B: the old replicas of B are removed and new replicas B` are written]
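A sketch of the delete-old-plus-write-new pattern using the Hadoop FileSystem API (paths are hypothetical): the new version is written to a temporary path, then swapped in place of the old one.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class UpdateByRewrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path current = new Path("/data/example.txt");       // hypothetical paths
        Path updated = new Path("/data/example.txt.tmp");

        try (FSDataOutputStream out = fs.create(updated)) {  // write the new blocks (B`)
            out.writeUTF("new contents");
        }
        fs.delete(current, false);                           // drop the old blocks (B)
        fs.rename(updated, current);                         // the new version takes their place
        fs.close();
    }
}
```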
Interesting Challenges
• What happens with more popular blocks?
– Or less popular blocks?
• What happens during server failures?
– Can you lose data?
• What happens if you have a better network?
– No oversubscription
Outline
• Requirements for a Distributed File System
• HDFS
– Architecture
– Read/Write
• Research Directions
– Popularity
– Failures
– Network
Popularity in HDFS
• Not all files are equally popular
– E.g., more people search for bball than for hockey
• More popular blocks see more contention
– Leads to slower performance
– Searches for bball will be slower
Popularity in HDFS
• # of copies of a block = function(popularity)
– If 50 people search for bball, then make 50 copies
– If only 3 search for hockey, then make 3 copies
• You want as many copies of a block as readers
Popularity in HDFS
• As data becomes old, fewer people care about it
– E.g., last year's weather versus today's weather
• When a block becomes old (older than a week)
– Reduce the number of copies
– In Facebook data centers, old data keeps only one copy
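Stock HDFS does not adjust the number of copies by popularity or age on its own, but the per-file replication factor can be changed through the Hadoop API. A sketch, with made-up thresholds, of how a policy like the one above could be expressed:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of popularity/age-driven replication. The thresholds are illustrative
// assumptions, not HDFS behavior; only setReplication() is a real API call.
public class AdaptiveReplication {
    static void adjust(FileSystem fs, Path file, int concurrentReaders, long ageDays)
            throws Exception {
        short replicas;
        if (ageDays > 7) {
            replicas = 1;                                              // cold data: one copy
        } else {
            replicas = (short) Math.max(3, Math.min(concurrentReaders, 10));  // hot data: more copies
        }
        fs.setReplication(file, replicas);   // the Name Node adds or removes copies to match
    }
}
```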
Failures in Data Center
• Do servers fail?
– Facebook: 1% of servers fail after a reboot
– Google: at least one server fails every day
[Figure: a Data Node holding blocks B and B` fails; the Name Node re-replicates those blocks on the surviving Data Nodes]
• The failed node stops sending heartbeats
• The Name Node determines which blocks were on the failed node
• It starts re-replication
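An illustrative sketch of the heartbeat-based failure handling described above. The timeout and data structures are assumptions for illustration, not the Name Node's actual implementation.

```java
import java.util.*;

// Sketch: a node that misses its heartbeat deadline is declared dead and its
// blocks are queued for re-replication elsewhere.
class HeartbeatMonitor {
    static final long TIMEOUT_MS = 10 * 60 * 1000;            // assumed 10-minute timeout

    final Map<String, Long> lastHeartbeat = new HashMap<>();  // node -> last heartbeat time
    final Map<String, Set<String>> blocksOnNode = new HashMap<>();

    void heartbeat(String node) {
        lastHeartbeat.put(node, System.currentTimeMillis());
    }

    // Called periodically by the Name Node's monitor loop.
    List<String> blocksToReReplicate() {
        List<String> lost = new ArrayList<>();
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {             // node is considered dead
                lost.addAll(blocksOnNode.getOrDefault(e.getKey(), Set.of()));
            }
        }
        return lost;                                            // schedule new copies of these blocks
    }
}
```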
Problems with Locality-Aware DFS
• Ignores contention on the servers
– I/O contention greatly impacts performance
• Ignores contention in the network
– Causes similar performance degradation
[Figure: over-subscribed network with 10GB, 25GB, and 100GB links]
Types of Network Topologies
• Current networks
– Uneven bandwidth everywhere
• Future networks
– Even bandwidth everywhere
[Figure: a flat future network with 100GB links throughout]
Implications of Network Topologies
• Blocks can be spread out more!
– No need to keep two replicas within the same rack
– Same bandwidth everywhere, so no need for locality-aware placement
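A sketch of what placement could look like on such a flat network: with equal bandwidth everywhere, the rack constraint can be dropped and any distinct nodes chosen at random. Illustrative Java, not HDFS code.

```java
import java.util.*;

// Sketch: replica placement without a locality constraint.
class FlatNetworkPlacement {
    static List<String> place(List<String> allNodes, int replicas) {
        List<String> shuffled = new ArrayList<>(allNodes);
        Collections.shuffle(shuffled);                 // no rack or locality preference
        return shuffled.subList(0, replicas);          // any distinct nodes will do
    }
}
```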
Summary
• Properties for a DFS
• Research Challenges
– Popularity
– Failure
– Data Placement
Un-discussed
• Cluster rebalancing
– Move blocks around based on utilization
• Data integrity
– Use checksums to detect corrupted data (sketched below)
• Staging + write pipelining
– The client buffers data locally and streams blocks through a pipeline of data nodes
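A sketch of the checksum idea using CRC32. HDFS keeps per-chunk checksums alongside each block; this only shows the write-time versus read-time comparison, with made-up contents.

```java
import java.util.zip.CRC32;

// Sketch of block-integrity checking: compute a checksum when the block is
// written, recompute it when the block is read, and compare the two.
public class BlockChecksum {
    public static void main(String[] args) {
        byte[] block = "block contents".getBytes();

        CRC32 crc = new CRC32();
        crc.update(block);
        long storedChecksum = crc.getValue();           // stored when the block is written

        CRC32 verify = new CRC32();                     // recomputed when the block is read
        verify.update(block);
        System.out.println(verify.getValue() == storedChecksum ? "block OK" : "block corrupted");
    }
}
```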