Data Management in the Cloud

advertisement

Data Management in the

Cloud

Paul Szerlip

The rise of data

Think about this o o

For the past two decades, the largest generator of data was humans -- now it's our devices

Cheap sensors

Cellphones are packed with sensory information

 Images, video, audio, etc

Expensive sensors o DZero, high energy physics, generates 1 TB a day

How do you deal with that much data?

[1,2]

Data in the cloud

Storing the data o Bigtable, S3, NoSQL, etc

Processing the data o MapReduce, Hadoop, etc

Good data management in the cloud

Availability o Accessible in cases of partial network failure or datacenter failure

Scalability o Support for massive database sizes - spread across many servers

Elasticity o Scaling up and scaling down

Performance o Efficient system storage utilization

Multitenancy o Many applications on the same hardware

Good data management

(continued)

Load and Tenant Balancing o Moving load between servers

Fault Tolerance o Tolerating network or hardware failures

Running in heterogeneous environment o Dealing with hardware degredation

Flexible query interface o Providing ways to access both SQL and non-SQL languages

Overarching Themes

Frustration with ACID on the cloud o (Atomicity, consistency, isolation, durability)

Hard to maintain ACID guarantees with data replication over large geographic distances [1] o Consistency, Availability, Tolerance to partitions, choose 2

Rise of NoSQL (a misnomer) [2] o Eventually consistent can be okay, some ACID properties are relaxed or left to application developers

Investigating 3 Systems

Bigtable (Google) o And quick look at MapReduce

Amazon:S3/SimpleDB

Open source NoSQL alternatives: o Cassandra (key-value) o MongoDB (document)

Bigtable

Distributed storage designed to scale to petabyte size databases spread across thousands of servers [1]

Used extensively by Google

Not fully relational o "Sparse, distributed, persistent multidimentional sorted map" [1]

Uses Google File System (GFS) under the hood

Index using row keys o Tablet = range of row keys, used for load balancing

Bigtable Diagram [2]

Bigtable

GFS o SSTable

 Provides a persistent immutable ordered map o

Chubby provides locking mechanism

Ensures single master

Location of bigtable data

 Storing schema information and access control lists

Each Bigtable is allocated to one master, and many multiple tablet servers o Master assigns tablets to different tablet servers, dynamically based on server load

Tablets handle read-write

MapReduce

Introduced by Google in 2004 [1]

Often used to operate on Bigtable data [1]

A means to process large amounts of data in a distributed environment in a highly parallelized manner

MapReduce Steps

1. Input files split into M pieces, multiple copies of program started on cluster

2. One copy is master, M map tasks, R reduce tasks assigned to idle workers

3. Worker reads file split contents, passes to map function - results buffered in memory

4. Buffered results written to local disk periodically, partitioned into R regions by partitioning function, locations passed to master

MapReduce (continued)

5. Reduce worker notified about location, reads buffered data from map workers, sorts so that same keys are grouped together

6. Reduce worker passes key and intermediate values to Reduce function, output is appended to final output file

7. After all map and reduce tasks completed, master wakes up user program

S3 - Simple Storage Service

"Infinite" store for objects of variable size [1]

Organized in 2 levels o Buckets

 Like folders, you can save any number of objects in them o Objects

 Byte container (up to 5 GB) and metadata (up to

2KB)

Limited search o Single bucket, name only

SimpleDB

Organized into domains (tables) where you can insert data, get data, or run queries [1]

Each domain has items which are descibed by attribute name/value pairs

No schema

API Accesso CreateDomain, DeleteDomain, PutAttributes,

DeleteAttributes, GetAttributes, and Select

Meant for fast reads

Keeps multiple copies of the domains

NoSQL

What does this mean?

o More about relaxing ACID than being "No" SQL [2]

Lots of open source NoSQL systems o Zynga was big on NoSQL

Why to use them?

o Excellent elasticity o o o

Flexible data models - often schema-less

CHEAP (relative to RDBMS)

(if you have lots of frequent and small writes)

Types of NoSQL

Key-value o Redis, Cassandra, etc.

Document store o CouchDB, mongoDB, etc

Graph dbs, object stores o Won't go into these much

Cassandra

Highly scalable, eventually consistent, distributed, structured, key-value store [1]

Open sourced by Facebook (2008) [1]

ColumnFamily based o Column is a tuple of {key, value, timestamp} o ColumnFamilies contain many columns, all referenced by row-key

Kind of like a hybrid of Dynamo and Bigtable

[1]

MongoDB

Document-oriented o High input read/write o High availability o o

Scalability

Flexible query language

References

[1] Sakr, S., Liu, A., Batista, D.M., Alomari,

M., A Survey of Large Scale Data

Management Approaches in Cloud

Environments, IEEE Communications, 2011.

[2] Cloud Computing: Theory and Practice

(our lecture notes)

[3] http://www.mongodb.org/display/DOCS/Intro duction

Download