Bryant Yao - Center for Software Engineering

The Hadoop Distributed File System:
Architecture and Design
by Dhruba Borthakur
Presented by Bryant Yao
Introduction

• What is it? It's a file system!
◦ Supports most of the operations a normal file system would.
• Open source implementation of GFS (Google File System).
• Written in Java.
• Designed primarily for GNU/Linux.
◦ Some support for Windows.
Design Goals

• HDFS is designed to store large files (think TB or PB).
• HDFS is designed for computer clusters made up of racks.

[Diagram: a cluster composed of Rack 1 and Rack 2.]

• Write once, read many model
◦ Useful for reading many files at once, but not for working with single files.
• Streaming access of data
◦ Data arrives as a constant stream, not in waves.
• Make use of commodity computers.
• Expect hardware to fail.
• "Moving computation is cheaper than moving data."

Master/Slave Architecture

[Diagram: a namenode (master) connected to several datanodes (slaves).]
Master/Slave Architecture cont.

• 1 master, many slaves.
• The master manages the file system namespace and regulates access to files by clients.
• Data is distributed across the slaves, which store it as "blocks".
• What is a block?
◦ A portion of a file.
◦ Files are broken down into and stored as a sequence of blocks (see the sketch after the diagram below).
[Diagram: File 1 is broken down into blocks A, B, and C.]
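As a quick illustration (mine, not the paper's) of this decomposition, a client can ask where each block of a file lives through Hadoop's Java FileSystem API. The path /user/demo/file1 is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockDemo {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml/hdfs-site.xml from the classpath.
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file already stored in HDFS.
            Path file = new Path("/user/demo/file1");
            FileStatus status = fs.getFileStatus(file);

            // One BlockLocation per block (A, B, C, ...) of the file.
            for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("block at offset %d (%d bytes) on %s%n",
                        b.getOffset(), b.getLength(), String.join(", ", b.getHosts()));
            }
        }
    }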
Task Flow
Namenode

• Master.
• Handles metadata operations.
◦ Changes are stored in a transaction log called the EditLog.
• Manages datanodes.
◦ Passes I/O requests to datanodes.
◦ Informs datanodes when to perform block operations.
◦ Maintains a BlockMap, which keeps track of which blocks each datanode is responsible for.
• Stores all files' metadata in memory (see the sketch below).
◦ File attributes, number of replicas, a file's blocks, block locations, and block checksums.
• Stores a copy of the namespace in the FsImage on disk.
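To make that metadata concrete, here is a minimal sketch (my own, not from the paper) of a client asking the namenode for a file's attributes; /user/demo/file1 is again a hypothetical path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MetadataDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Served from the namenode's in-memory metadata.
            FileStatus status = fs.getFileStatus(new Path("/user/demo/file1"));
            System.out.println("length:      " + status.getLen());
            System.out.println("replication: " + status.getReplication());
            System.out.println("block size:  " + status.getBlockSize());
            System.out.println("modified:    " + status.getModificationTime());
        }
    }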
Datanode

• Slave.
• Handles data I/O.
• Handles block creation, deletion, and replication.
• Local storage is optimized so blocks are stored over multiple directories.
◦ Avoids storing all of its data in a single directory.
Data Replication

• Makes copies of the data!
• The replication factor determines the number of copies.
◦ Specified by the namenode or during file creation (see the sketch below).
• Replication is pipelined!
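A minimal sketch (my illustration; the path and values are made up) of both ways of specifying the replication factor through the Java API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/demo/three-copies.txt");

            // Set the replication factor at file creation time:
            // create(path, overwrite, bufferSize, replication, blockSize).
            FSDataOutputStream out = fs.create(
                    path, true, 4096, (short) 3, fs.getDefaultBlockSize());
            out.writeBytes("replicated three times\n");
            out.close();

            // ...or change it later; the namenode schedules the extra
            // copies (or deletions) in the background.
            fs.setReplication(path, (short) 2);
        }
    }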
Pipelining Data Replication

• Blocks are split into portions (4KB), and each portion is passed down a chain of datanodes.

[Diagram: assume a block is split into 3 portions: A, B, and C. The portions stream through datanodes 1, 2, and 3 in a pipeline; while datanode 1 forwards portion A to datanode 2, it receives portion B, and so on down the chain.]
Replication Policy

• Communication bandwidth between computers within a rack is greater than between computers in different racks.
• We could replicate data across racks… but this would consume the most bandwidth.
• We could replicate data across all the computers in one rack… but if the rack dies, we're in the same position as before.
Replication Policy cont.

• Assume only three replicas are created.
◦ Split the replicas between 2 racks.
◦ Rack failure is rare, so we can still maintain good data reliability while minimizing bandwidth cost.
• Version 0.18.0
◦ 2 replicas in the current rack (2 different nodes)
◦ 1 replica in a remote rack
• Version 0.20.3.x
◦ 1 replica in the current rack
◦ 2 replicas in a remote rack (2 different nodes)
• What happens if the replication factor is 2 or greater than 3?
◦ No answer in this paper.
◦ Some other papers state that the minimum is 3.
◦ The author wrote a separate paper stating that every replica after the 3rd is placed randomly.
Reading Data

• Read the data that's closest to you!
◦ If the block/replica you want is on the datanode/rack/data center you're on, read it from there!
• Read from datanodes directly (see the sketch below).
◦ Can be done in parallel.
• The namenode is used to generate the list of datanodes that host a requested file, as well as to get checksum values for validating the blocks retrieved from the datanodes.
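A small sketch of a client-side read (mine, with the hypothetical path /user/demo/file1). The open() call consults the namenode for block locations; the bytes themselves stream directly from the datanodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // open() returns once the namenode has supplied block locations.
            FSDataInputStream in = fs.open(new Path("/user/demo/file1"));
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, n); // copy file contents to stdout
            }
            in.close();
        }
    }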
Writing Data

• Data is written once.
• Files are split into blocks, typically of size 64MB (see the sketch below).
◦ The larger the block size, the less metadata is stored by the namenode.
• Data is written to a temporary local block on the client side and then flushed to a datanode once the block is full.
◦ If a file is closed while the temporary block isn't full, the remaining data is flushed to the datanode.
• If the namenode dies during file creation, the file is lost!
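A corresponding write sketch (again my illustration, with a made-up path), specifying the 64MB block size and the replication factor explicitly at creation:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // 64MB blocks, 3 replicas; the client buffers locally and
            // flushes to the datanodes block by block.
            long blockSize = 64L * 1024 * 1024;
            FSDataOutputStream out = fs.create(
                    new Path("/user/demo/output.txt"),
                    true, 4096, (short) 3, blockSize);
            out.writeBytes("written once, read many times\n");
            out.close(); // flushes the partially filled final block
        }
    }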
Hardware Failure

• Imagine a file is broken into 3 blocks spread over three datanodes.

[Diagram: datanode 1 holds Block A, datanode 2 holds Block B, and datanode 3 holds Block C.]

• If the third datanode died, we would have no access to Block C, and we couldn't retrieve the file.
Designing for Hardware Failure

• Data replication
• Safemode
◦ Heartbeat
◦ Block report
• Checkpoints
• Re-replication
Checkpoints

EditLog + FsImage = File System Namespace
Checkpoints cont.

• The FsImage is a copy of the namespace taken before any changes have occurred.
• The EditLog is a log of all the changes to the namenode since its startup.
• Upon startup, the namenode applies all the changes in the EditLog to the FsImage to create an up-to-date version of itself.
◦ The resulting FsImage is the checkpoint.
• If either the FsImage or the EditLog is corrupt, HDFS will not start!
Heartbeat and Blockreport

• A heartbeat is a message sent from a datanode to the namenode.
◦ Sent periodically, letting the namenode know the datanode is "alive."
◦ If a datanode is dead, the namenode assumes it can't be used.
• Blockreport
◦ A list of blocks the datanode is handling.
Safemode

• Upon startup, the namenode enters "safemode" to check the health status of the cluster. This is only done once.
• Heartbeats are used to ensure all datanodes are available for use.
• Blockreports are used to check data integrity.
◦ If the number of replicas found differs from the number of replicas expected, there is a problem.

[Example: block A is replicated 3 times, but only 2 replicas are found.]
Other

• The file system can be viewed through the FS Shell or the web interface.
• Communicates through TCP/IP.
• File deletes are a move operation to a "trash" folder, which auto-deletes files after a specified time (the default is 6 hours).
• The Rebalancer moves data away from datanodes that are close to filling up their local storage.

Relation with Search Engines

• Originally built for Nutch.
◦ Intended to be the backbone of a search engine.
• HDFS is the file system used by Hadoop.
◦ Hadoop also contains a MapReduce framework with many applications, like indexing the web and analyzing large amounts of data!
• Used by many, many companies.
◦ Google, Yahoo!, Facebook, etc.
• It can store the web!
◦ Just kidding.
“Pros/Cons”

• The goal of this paper is to describe the system, not analyze it. It gives a great beginning overview.
• It probably could have been condensed and organized better.
• Some information is missing:
◦ SecondaryNameNode
◦ CheckpointNode
◦ Etc.
Pros/Cons of HDFS
In and Beyond the Paper

• Pros
◦ It accomplishes everything it set out to do.
◦ Horizontally scalable: just add a new datanode!
◦ Cheap, cheap, cheap to build.
◦ Good for reading and storing large amounts of data.
• Cons
◦ Security
◦ No redundancy of the namenode
▪ Single point of failure
◦ The namenode is not scalable
◦ Doesn't handle small files well
◦ Still in development; many features missing
Questions?
Thank you for listening!