CS 491/591: Cloud Computing


Question

• Scalability vs Elasticity

– What is the difference?

Homework 1

• Installing the open source cloud Eucalyptus in SEC3429

• Individual assignment

• Will need two machines – a machine to help with installation and a machine on which to install the cloud, so BRING YOUR LAPTOP

• Guide to help you – step by step, but you will also need to use the Eucalyptus Installation Guide

• When you are done you will have a cloud with a VM instance running on it

• Can use it for future work; if not, you can at least say you have installed a cloud and a VM image

Components of Eucalyptus

• CLC – cloud controller

• Walrus – Eucalyptus’s equivalent of Amazon’s S3, for storing VM images

• SC – storage controller

• CC – cluster controllers

• NC – node controllers

Cloud Controller - CLC

• Java program (EC2 compatible interface) and web interface

• Administrative interface for cloud management

• Resource scheduling

• Authentication, accounting, reporting

• Only one CLC per cloud

Walrus

• Written in Java (equivalent to AWS Simple Storage Service, S3)

• Persistent storage available to all VMs

– VM images

– Application data

– Volume snapshots (point-in-time copies)

• Can be used as put/get storage as a service

• Only one Walrus per cloud

• Why is it called Walrus – WS3?

Cluster Controller - CC

• Written in C

• Front-end for a cluster within the cloud

• Communicates with the Storage Controller and Node Controller

• Manages VM instance execution and SLAs per cluster

Availability Zones

• Each cluster exists in an availability zone

• A cloud can span multiple geographic locations

• Each such location is a region

• Each region contains multiple isolated locations called availability zones

• Availability zones are connected through low-latency links

Storage Controller - SC

• Written in Java (equivalent to AWS Elastic Block Store, EBS)

• Communicates with CC and NC

• Manages Eucalyptus block volumes and snapshots of instances within cluster

• Used when an application needs larger persistent storage than an instance’s local disk

Node Controller - NC

• Written in C

• Hosts VM instances – where they run

• Manages virtual network endpoints

• Downloads, caches images from Walrus

• Creates and caches instances

• Many NCs per cluster

Interesting Info on Clouds

• What Americans think a compute cloud is

– http://www.citrix.com/lang/English/lp/lp_2328330.asp

Send me interesting links about clouds

Homework

• Read the paper on GFS: Evolution on Fast-forward

• Also a link to a longer paper on GFS – original paper from 2003

• I assume you are reading papers as specified in the class schedule

The Original Google File System (GFS)

Some slides from Michael Raines

• During the lecture, you should point out problems with GFS design decisions

Common Goals of GFS and most Distributed File Systems

• Performance

• Reliability

• Scalability

• Availability

GFS Design Considerations

• Component failures are the norm rather than the exception.

• File System consists of hundreds or even thousands of storage machines built from inexpensive commodity parts.

• Files are Huge. Multi-GB Files are common.

• Each file typically contains many application objects such as web documents.

• Append, Append, Append.

• Most files are mutated by appending new data rather than overwriting existing data.

Co-Designing

Co-designing applications and file system API benefits overall system by increasing flexibility

GFS

• Why assume hardware failure is the norm?

• The number of layers in a distributed system (network, disk, memory, physical connections, power, OS, application) means that a failure in any one of them can contribute to data corruption.

• It is cheaper to assume common failure on poor hardware and account for it, rather than invest in expensive hardware and still experience occasional failure.

Initial Assumptions

• System built from inexpensive commodity components that fail

• Modest number of files – expect a few million, typically 100 MB or larger. Didn’t optimize for smaller files

• 2 kinds of reads – large streaming reads (1 MB or more), small random reads (which applications often batch and sort)

• Well-defined semantics:

– Master/slave, producer/consumer queues, and many-way merges; one producer per machine appends to a shared file

– Atomic RW

• High sustained bandwidth chosen over low latency

(difference?)

High bandwidth versus low latency

• Example:

– An airplane flying across the country filled with backup tapes has very high bandwidth because it gets all of the data to the destination faster than any existing network

– However – each individual piece of data has high latency
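A rough back-of-the-envelope calculation (with made-up but plausible numbers for tape capacity, tape count, and flight time) makes the distinction concrete:

    # Illustrative numbers only – all values are assumptions.
    TAPE_CAPACITY_TB = 10        # assume each backup tape holds 10 TB
    NUM_TAPES = 1000             # assume the plane carries 1000 tapes
    FLIGHT_HOURS = 5             # assumed coast-to-coast flight time

    payload_bits = NUM_TAPES * TAPE_CAPACITY_TB * 1e12 * 8
    flight_seconds = FLIGHT_HOURS * 3600

    bandwidth_gbps = payload_bits / flight_seconds / 1e9
    print(f"Effective bandwidth: {bandwidth_gbps:,.0f} Gbit/s")  # thousands of Gbit/s
    print(f"Latency of any single byte: {FLIGHT_HOURS} hours")   # vs. milliseconds on a network

GFS favors the first number (sustained throughput for bulk data) over the second (response time for an individual request).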

Interface

• GFS – familiar file system interface

• Files organized hierarchically in directories, path names

• Create, delete, open, close, read, write

• Snapshot and record append (allows multiple clients to append simultaneously)

– This means atomic read/writes – not transactions!

Master/Servers (Slaves)

• Single master, multiple chunkservers

• Each file divided into fixed-size chunks of 64 MB

– Chunks stored by chunkservers on local disks as Linux files

– Immutable and globally unique 64-bit chunk handle (name or number) assigned at creation

Master/Servers

– R or W chunk data specified by chunk handle and byte range

– Each chunk replicated on multiple chunkservers – default is 3

Master/Servers

• Master maintains all file system metadata

– Namespace, access control info, mapping from files to chunks, location of chunks

– Controls garbage collection of chunks

– Communicates with each chunkserver through HeartBeat messages

– Clients interact with master for metadata, chunkservers do the rest, e.g. R/W on behalf of applications

– No file data caching

• For clients, working sets are too large to cache, and not caching simplifies coherence

• For chunkservers, chunks are already stored as local files, and the Linux buffer cache keeps frequently used data in memory

Heartbeats

• What do we gain from Heartbeats?

• Not only does the master learn the current state of each remote system; HeartBeats also reveal failures.

• Any system that fails to respond to a HeartBeat message is assumed dead. This allows the master to update its metadata accordingly.

• This also cues the master to create more replicas of the lost data.
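A minimal sketch of the failure-detection side of HeartBeats, with hypothetical names and intervals (the real protocol also piggybacks chunk reports, lease extensions, and deletion instructions):

    import time

    HEARTBEAT_INTERVAL = 10     # assumed seconds between HeartBeats
    DEAD_AFTER_MISSED = 3       # assumed missed beats before a server is presumed dead

    class MasterView:
        """Master-side liveness bookkeeping (illustrative only)."""
        def __init__(self):
            self.last_seen = {}                      # chunkserver id -> last HeartBeat time

        def on_heartbeat(self, server_id, chunk_report):
            self.last_seen[server_id] = time.time()  # refresh liveness
            # (real GFS also reconciles chunk_report against its chunk metadata here)

        def dead_servers(self):
            limit = HEARTBEAT_INTERVAL * DEAD_AFTER_MISSED
            now = time.time()
            return [s for s, t in self.last_seen.items() if now - t > limit]

        def handle_failures(self):
            for server in self.dead_servers():
                del self.last_seen[server]
                # chunks that fell below their replication goal get queued for re-replication
                print(f"{server} presumed dead; queue its chunks for re-replication")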

Client

• Client translates offset in file into chunk index within file

• Send master request with file name/chunk index

• Master replies with chunk handle and location of replicas

• Client caches info using file name/chunk index as key

• Client sends request to one of the replicas (closest)

• Further reads of same chunk require no interaction

• Can ask for multiple chunks in same request
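Putting the steps above together, the client-side read path looks roughly like this; a sketch with hypothetical master/replica RPC stubs, not the actual GFS client library:

    CHUNK_SIZE = 64 * 1024 * 1024                  # fixed 64 MB chunks

    chunk_cache = {}                               # (file_name, chunk_index) -> (handle, replicas)

    def read(file_name, offset, length, master, pick_closest, read_from):
        """Sketch of a GFS-style read; assumes the read stays within one chunk."""
        chunk_index = offset // CHUNK_SIZE         # translate byte offset into chunk index
        key = (file_name, chunk_index)
        if key not in chunk_cache:                 # contact the master only on a cache miss
            handle, replicas = master.lookup(file_name, chunk_index)
            chunk_cache[key] = (handle, replicas)
        handle, replicas = chunk_cache[key]
        replica = pick_closest(replicas)           # read from the nearest replica
        return read_from(replica, handle, offset % CHUNK_SIZE, length)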

Master Operations

• Master executes all namespace operations

• Manages chunk replicas

• Makes placement decision

• Creates new chunks (and replicas)

• Coordinates various system-wide activities to keep chunks fully replicated

• Balance load

• Reclaim unused storage

• Do you see any problems?

• Do you question any design decisions?

Master - Justification

• Single Master –

– Simplifies design

– Placement, replication decisions made with global knowledge

– Doesn’t handle file data reads/writes itself, so it is not a bottleneck

– Client asks master which chunkservers to contact

Chunk Size - Justification

• 64 MB, larger than typical

• Replica stored as plain Linux file, extended as needed

• Lazy space allocation

• Reduces interaction of client with master

– R/W on the same chunk needs only 1 request to the master

– Mostly R/W large sequential files

• Likely to perform many operations on given chunk (keep persistent TCP connection)

• Reduces size of metadata stored on master

Chunk problems

• But –

– If small file – one chunk may be hot spot

– Can mitigate with more replicas and by staggering batch application start times

Metadata

• 3 types:

– File and chunk namespaces

– Mapping from files to chunks

– Location of each chunk’s replicas

• All metadata in memory

• First two types stored in logs for persistence (on the master’s local disk and replicated remotely)

Metadata

• Chunk location info is not stored persistently

– Instead, the master polls chunkservers (at startup and via HeartBeats) to learn which chunkserver has which replica

– The master controls all chunk placement anyway

– Simpler than staying in sync when disks go bad, chunkservers fail, etc.

Metadata - Justification

• In memory – fast

– Periodically scans state

• garbage collect

• Re-replication if chunkserver failure

• Migration to load balance

– Master maintains < 64 B of metadata for each 64 MB chunk

• File namespace entries < 64 B each
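The “< 64 B per 64 MB chunk” figure is what makes an all-in-memory design feasible. A quick calculation with an illustrative cluster size:

    CHUNK_SIZE = 64 * 1024**2            # 64 MB chunks
    METADATA_PER_CHUNK = 64              # < 64 bytes of master metadata per chunk

    file_data = 300 * 1024**4            # e.g. 300 TB of file data (scale of the 2003 paper)
    num_chunks = file_data // CHUNK_SIZE
    master_memory = num_chunks * METADATA_PER_CHUNK

    print(f"{num_chunks:,} chunks -> about {master_memory / 1024**3:.1f} GB of metadata")
    # roughly 5 million chunks and ~0.3 GB, which fits comfortably in one machine's RAM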

Chunk size (again)- Justification

• 64 MB is large – think of typical size of email

• Why Large Files?

o METADATA!

• Every file in the system adds to the total overhead metadata that the system must store.

• More individual files mean more data about the data must be stored.

Operation Log

• Historical record of critical metadata changes

• Provides a logical timeline of concurrent ops

• Log replicated on remote machines

• Flush record to disk locally and remotely

• Log kept small – checkpoint when it exceeds a size threshold

• Checkpoint in B-tree form

• New checkpoint built without delaying incoming mutations (takes about 1 min for 2 M files)

• Only keep latest checkpoint and subsequent logs
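A minimal sketch of the write-ahead discipline described above (hypothetical paths and stub replica objects; the real master batches log records and streams them to remote replicas):

    import os, pickle

    LOG_PATH = "/tmp/gfs_oplog"             # hypothetical local log location
    CHECKPOINT_THRESHOLD = 64 * 1024**2     # assumed log size at which to checkpoint

    def append_metadata_change(record, remote_replicas):
        """Flush a metadata mutation locally and remotely before applying it."""
        with open(LOG_PATH, "ab") as log:
            log.write(pickle.dumps(record))
            log.flush()
            os.fsync(log.fileno())           # local durability
        for replica in remote_replicas:
            replica.append(record)           # remote durability (stub call)
        # only now is the mutation applied to in-memory metadata and made visible,
        # so the log defines the logical order of concurrent operations

    def maybe_checkpoint(metadata):
        """Checkpoint when the log grows past the threshold, then start a fresh log."""
        if os.path.exists(LOG_PATH) and os.path.getsize(LOG_PATH) > CHECKPOINT_THRESHOLD:
            metadata.dump_checkpoint()       # compact B-tree-like image (stub call)
            open(LOG_PATH, "wb").close()     # truncate: keep only latest checkpoint + new log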

Snapshot

• Snapshot makes copy of file

• Used to create checkpoint or branch copies of huge data sets

• Master first revokes outstanding leases on the affected chunks

• Newly created snapshot points to same chunks as source file

• After the snapshot, when a client asks the master for the lease holder of a shared chunk

• Master makes a new copy of the chunk and grants the lease on that copy
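The snapshot is copy-on-write at the chunk level. A minimal metadata-side sketch, using a per-chunk reference count as the paper describes (names are illustrative):

    # file -> list of chunk handles; refcount marks chunks shared with a snapshot
    files = {"/logs/day1": ["chunkA", "chunkB"]}
    refcount = {"chunkA": 1, "chunkB": 1}

    def snapshot(src, dst):
        # (step 1 in GFS: revoke outstanding leases on src's chunks first)
        files[dst] = list(files[src])            # new file points at the same chunks
        for handle in files[src]:
            refcount[handle] += 1                # mark chunks as shared

    def write_intent(path, index):
        """Called when a client asks for the lease holder of a chunk it wants to write."""
        handle = files[path][index]
        if refcount[handle] > 1:                 # chunk shared with a snapshot: copy it
            new_handle = handle + "'"            # fresh chunk copy (creation stubbed out)
            refcount[handle] -= 1
            refcount[new_handle] = 1
            files[path][index] = new_handle      # the lease (and future writes) go here
        return files[path][index]

    snapshot("/logs/day1", "/snapshots/day1")
    print(write_intent("/logs/day1", 0))         # -> chunkA' (copy made on first write)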

Shadow Master

• Master Replication

– Replicated for reliability

– Not mirrors, so may lag the primary slightly (fractions of a second)

– Shadow master reads a replica of the operation log and applies the same sequence of changes to its data structures as the primary does

Shadow Master

• If Master fails:

– Start shadow instantly

– Read-only access to file systems even when primary master down

– If the master’s machine or disk fails, monitoring infrastructure outside GFS starts a new master process with the replicated log

– Clients use only the canonical name of the master

Creation, Re-replication, Rebalancing

• Master creates chunk

– Place replicas on chunkservers with below-average disk utilization

– Limit number of recent creates per chunkserver

• New chunks may be hot

– Spread replicas across racks

• Re-replicate

– When number of replicas falls below goal

• Chunkserver unavailable, corrupted, etc.

• Replicate based on priority (fewest replicas)

– Master limits number of active clone ops

Creation, Re-replication, Rebalancing

• Rebalance

– Periodically moves replicas for better disk space and load balancing

– Gradually fills up new chunkserver

– Removes replicas from chunkservers with below-average free space
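A sketch of the placement and re-replication policy described on these two slides (data structures and thresholds are assumptions, not the real implementation):

    REPLICATION_GOAL = 3                       # default replicas per chunk

    def place_new_replica(chunkservers, recent_creates, rack_of, used_racks):
        """Prefer below-average disk utilization, few recent creates, and a new rack."""
        avg_util = sum(cs["disk_util"] for cs in chunkservers) / len(chunkservers)
        candidates = [cs for cs in chunkservers
                      if cs["disk_util"] <= avg_util
                      and recent_creates.get(cs["id"], 0) < 5     # assumed per-server cap
                      and rack_of[cs["id"]] not in used_racks]
        return min(candidates, key=lambda cs: cs["disk_util"]) if candidates else None

    def rereplication_queue(chunks):
        """Chunks furthest below the replication goal are cloned first."""
        under_replicated = [c for c in chunks if c["replicas"] < REPLICATION_GOAL]
        return sorted(under_replicated, key=lambda c: c["replicas"])   # fewest replicas first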

Leases and Mutation Order

• Chunk lease

• One replica chosen as primary - given lease

• Primary picks serial order for all mutations to chunk

• Lease expires after 60 s
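A minimal sketch of lease bookkeeping on the master (illustrative; in real GFS, lease extensions are piggybacked on HeartBeat messages):

    import time

    LEASE_SECONDS = 60                        # initial lease timeout from the paper

    leases = {}                               # chunk handle -> (primary replica, expiry time)

    def grant_lease(handle, replicas):
        """Grant the lease to one replica; it becomes the primary for this chunk."""
        primary = replicas[0]                 # any replica can be chosen as primary
        leases[handle] = (primary, time.time() + LEASE_SECONDS)
        return primary                        # primary now picks the serial mutation order

    def current_primary(handle, replicas):
        primary, expiry = leases.get(handle, (None, 0.0))
        if time.time() >= expiry:             # expired or never granted: re-grant
            return grant_lease(handle, replicas)
        return primary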

Consistency Model

• Why Append Only?

• Overwriting existing data is not state-safe.

o Data cannot be safely read while it is being modified in place.

• A customized (atomic) record append is implemented by the system, allowing concurrent read/write and write/write activity to proceed safely.

Consistency Model

Table 1: File Region State After Mutation

                        Write                       Record Append
Serial success          defined                     defined interspersed with inconsistent
Concurrent successes    consistent but undefined    defined interspersed with inconsistent
Failure                 inconsistent                inconsistent

Consistency Model

• File namespace mutation (update) atomic

• File Region

• Consistent if all clients see same data

• Defined – after a successful mutation without concurrent writers, all clients see the write in its entirety (no interference from other writes)

• Consistent but undefined – after concurrent successful mutations, all clients see the same data, but it may not reflect what any single mutation wrote; it may contain fragments of multiple updates

• Inconsistent – after a failed mutation (clients may see different data at different times)

Consistency

• Applications can accommodate the relaxed consistency by relying on appends instead of overwrites

• Appending more efficient/resilient to failure than random writes

• Checkpointing lets writers restart incrementally and keeps readers from processing successfully written but still incomplete data

Namespace Management and Locking

• Master ops can take time, e.g. revoking leases

– allow multiple ops at same time, use locks over regions for serialization

– GFS does not have per directory data structure listing all files

– Instead lookup table mapping full pathnames to metadata

• Each name in tree has R/W lock

• If accessing /d1/d2/.../dn/leaf: R locks on /d1, /d1/d2, ..., /d1/d2/.../dn; W lock on /d1/d2/.../dn/leaf

Locking

• Allows concurrent mutations in same directory

• R lock on directory name prevents directory from being deleted, renamed or snapshotted

• W locks on file names serialize attempts to create a file with the same name twice

• R/W lock objects allocated lazily, deleted when no longer in use

• Locks acquired in a consistent total order (by level in the namespace tree, then lexicographically within a level), which prevents deadlock
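A sketch of the lock-acquisition rule above; names and helpers are illustrative, with the sort standing in for the total order that prevents deadlock:

    def lock_plan(path):
        """For /d1/d2/.../leaf return [(name, mode), ...] in acquisition order."""
        parts = path.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        plan = [(name, "R") for name in ancestors] + [(path, "W")]
        # sort by tree level, then lexicographically, so every op agrees on the order
        return sorted(plan, key=lambda entry: (entry[0].count("/"), entry[0]))

    print(lock_plan("/home/user/foo"))
    # [('/home', 'R'), ('/home/user', 'R'), ('/home/user/foo', 'W')]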

Fault Tolerance

• Fast Recovery

– Master/chunkservers restore state and start in seconds regardless of how terminated

• Abnormal or normal

• Chunk Replication

Data Integrity

• Checksumming to detect corruption of stored data

• Impractical to compare replicas across chunkservers to detect corruption

• Divergent replicas may be legal

• Chunk divided into 64 KB blocks, each with a 32-bit checksum

• Checksums stored in memory and persistently with logging

Data Integrity

• Before returning data for a read, the chunkserver verifies the checksums of the blocks involved

• If there is a mismatch, it returns an error to the requestor and reports it to the master

• The requestor then reads from another replica; the master clones the chunk from a good replica and deletes the bad one

• Most reads span multiple blocks, so only a small amount of extra data needs to be read and checksummed

• Checksum lookups done without I/O

• Checksum computation optimized for appends

• If a partially written block is already corrupted, the corruption is detected on the next read of that block

• During idle, chunkservers scan and verify inactive chunks
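A sketch of block-level checksumming, using zlib’s CRC-32 as a stand-in for whichever 32-bit checksum GFS actually uses:

    import zlib

    BLOCK_SIZE = 64 * 1024                        # 64 KB checksum blocks

    def checksum_blocks(chunk_data):
        """One 32-bit checksum per 64 KB block; kept in memory and logged."""
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verified_read(chunk_data, checksums, offset, length):
        """Verify only the blocks the read touches before returning any data."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for b in range(first, last + 1):
            block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
            if zlib.crc32(block) != checksums[b]:
                raise IOError(f"block {b} corrupted: report to master, read another replica")
        return chunk_data[offset:offset + length]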

Garbage Collection

• Lazy at both file and chunk levels

• When a file is deleted, it is renamed to a hidden name that includes the deletion timestamp

• During regular scan of file namespace

– hidden files removed if existed > 3 days

– Until then can be undeleted

– When removed, in-memory metadata erased

– Orphaned chunks identified and erased

– In HeartBeat messages, chunkserver and master exchange info about which chunks the chunkserver holds; the master identifies chunks no longer in its metadata, and the chunkserver is free to delete them

Garbage Collection

• Easy in GFS

• All live chunks appear in the file-to-chunk mappings of the master

• All chunk replicas are Linux files under designated directories on each chunkserver

• Everything else garbage
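A sketch of the two-level laziness (hypothetical naming convention for hidden files; in real GFS the chunk exchange happens inside HeartBeat messages):

    import time

    GRACE_PERIOD = 3 * 24 * 3600                       # hidden files kept for 3 days

    namespace = {}                                      # path -> {"chunks": [handles], ...}

    def delete_file(path):
        """Deletion just renames the file to a hidden name with a timestamp."""
        hidden = f"{path}.deleted.{int(time.time())}"
        namespace[hidden] = namespace.pop(path)         # can still be undeleted by renaming back

    def namespace_scan(live_chunk_handles):
        """Periodic scan: drop expired hidden files, then compute orphaned chunks.
        live_chunk_handles is a set of handles reported by chunkservers."""
        now = time.time()
        for name in list(namespace):
            if ".deleted." in name and now - int(name.rsplit(".", 1)[1]) > GRACE_PERIOD:
                del namespace[name]                     # in-memory metadata erased
        referenced = {h for meta in namespace.values() for h in meta["chunks"]}
        return live_chunk_handles - referenced          # orphans: chunkservers may delete these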

Conclusions

• GFS – qualities essential for large-scale data processing on commodity hardware

• Component failures the norm rather than exception

• Optimize for huge files appended to

• Fault tolerance by constant monitoring, replication, fast/automatic recovery

• High aggregate throughput

– Separating file system control (through the master) from data transfer (directly between clients and chunkservers)

– Large file size

GFS In the Wild - 2003

• Google currently has multiple GFS clusters deployed for different purposes.

• The largest currently implemented systems have over 1000 storage nodes and over 300 TB of disk storage.

• These clusters are heavily accessed by hundreds of clients on distinct machines.

• Has Google made any adjustments?

• Read paper on New GFS

• Google’s Colossus

OpenStack

• 3 components in the architecture

– Cloud Controller - Nova (compute)

• The cloud computing fabric controller

• Written in Python, uses external libraries

– Storage Controller – Swift

• Analogous to AWS S3

• Can store billions of objects across nodes

• Built-in redundancy

– Image Controller – Glance

• Manages/stores VM images

• Can use local file system, OpenStack Object Store, S3
