Cassandra - A Decentralized Structured Storage System

Subject: Cassandra - A Decentralized Structured Storage System
Professor: Dr. sh.Esmaili
Students:
Mr. Houshyar Mohammadi Talvar (Slides 4 to 17)
Miss Hakimi (Slides 19 to 27)
Mr. Hossien Sadrizadeh (Slides 29 to 65)
Date: June 6th, 2012 (Thursday, 25th Khordad 1391)
Cassandra - A Decentralized Structured Storage System
Azad Kurdistan University
Contents of the Presentation:
• Abstract
• Introduction
• Related Work
• Data Model
• API
• System Architecture
  • Partitioning
  • Replication
  • Membership
  • Bootstrapping
  • Scaling the Cluster
  • Local Persistence
  • Implementation Details
• Practical Experiences
• Facebook Inbox Search
• Conclusion
• Acknowledgements
Mr. Houshyar Mohammadi Talvar
Slides From 4 To 17
Abstract
Cassandra is a distributed storage system for managing very
large amounts of structured data spread out across many
commodity servers, while providing highly available service
with no single point of failure.
Introduction
Facebook runs the largest social networking
platform that serves hundreds of millions of users
at peak times using tens of thousands of servers
located in many data centers around the world.
Related Work
Systems like Ficus and Coda replicate files for high
availability at the expense of consistency. Update
conflicts are typically managed using specialized
conflict resolution procedures.




Related systems: Bayou, Coda, Ficus, Dynamo
Data Model
A table in Cassandra is a distributed multidimensional
map indexed by a key. The value is
an object which is highly structured.
Cassandra exposes two kinds of column families:
• Simple column families
• Super column families
Data Model (Continued)
[Figure: one row (KEY) holding two column families]
ColumnFamily1: Name: MailList, Type: Simple, Sort: Name
• Columns: Name: tid1, tid2, tid3, tid4; each with Value: <Binary> and TimeStamp: t1, t2, t3, t4
ColumnFamily2: Name: WordList, Type: Super, Sort: Time
• Super columns: Name: aloha, Name: dude
• Columns: C1, C2, C3, C4, C2, C6 with values V1, V2, V3, V4, V2, V6 and timestamps T1, T2, T3, T4, T2, T6
Notes on the figure:
• Column families are declared upfront.
• Super columns are added and modified dynamically.
• Columns are added and modified dynamically.
Data Model (Continued)
Any column within a column family is accessed
using the convention
column family : column
Any column within a column family that is of
type super is accessed using the convention
column family : super column : column
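The two access conventions can be pictured with nested dictionaries. This is only a sketch: the `row` layout, the `lookup` helper, and the sample values are assumptions made for illustration (the names MailList, WordList, aloha and dude come from the figure, but which columns sit under which super column is guessed).

```python
# Hypothetical in-memory picture of a single row; not Cassandra's real storage format.
row = {
    "MailList": {                       # simple column family: column -> (value, timestamp)
        "tid1": (b"...", 1),
        "tid2": (b"...", 2),
    },
    "WordList": {                       # super column family: super column -> columns
        "aloha": {"C1": ("V1", 1), "C2": ("V2", 2)},
        "dude": {"C2": ("V2", 2), "C6": ("V6", 6)},
    },
}


def lookup(row, path):
    """Resolve 'column family : column' or 'column family : super column : column'."""
    node = row
    for part in (p.strip() for p in path.split(":")):
        node = node[part]               # descend one level per path component
    return node


print(lookup(row, "MailList : tid1"))
print(lookup(row, "WordList : dude : C6"))
```

A simple column family needs two path components, a super column family needs three; the helper treats both the same way by walking one level per component.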
API
The Cassandra API consists of the following
three simple methods.
insert(table, key, rowMutation)
get(table, key, columnName)
delete(table, key, columnName)
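To make the three calls concrete, here is a toy, single-process stand-in. It is a sketch under assumptions: a table is modelled as a plain dictionary, `rowMutation` is assumed to be a mapping of column names to values, and none of this reflects how Cassandra implements the calls internally.

```python
import time


def insert(table, key, row_mutation):
    """Apply a row mutation, assumed here to be {column_name: value}."""
    columns = table.setdefault(key, {})
    timestamp = time.time_ns()
    for name, value in row_mutation.items():
        columns[name] = (value, timestamp)


def get(table, key, column_name):
    """Return the stored value, or None if the key or column is absent."""
    entry = table.get(key, {}).get(column_name)
    return None if entry is None else entry[0]


def delete(table, key, column_name):
    """Remove one column from a row; missing keys or columns are ignored."""
    table.get(key, {}).pop(column_name, None)


inbox = {}                                   # hypothetical table
insert(inbox, "user42", {"tid1": b"hello"})
print(get(inbox, "user42", "tid1"))
```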
System Architecture
The architecture of a storage system that needs
to operate in a production setting is complex.
In addition to the actual data persistence
component, the system needs to have the
following characteristics:
System Architecture (Continued)
• scalable and robust solutions for load balancing
• membership and failure detection
• failure recovery
• replica synchronization
• overload handling
• state transfer
• concurrency and job scheduling
• request marshalling
• request routing
• system monitoring and alarming
• configuration management
Read
[Figure: read path. Labels: Client, Query, Result, Cassandra Cluster, Closest replica, Replica A, Replica B, Replica C, Digest Query]
Miss. Hakimi
Slides From 19 To 27
Partitioning
One of the key design features for Cassandra is
the ability to scale incrementally.
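Partitioning and Replication
[Figure: hash ring running from 0 to 1 (with 1/2 marked at the halfway point), nodes A, B, C, D, E, F placed on the ring, keys placed at h(key1) and h(key2), replication factor N=3]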
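The ring in the figure can be sketched in a few lines. This is an illustration under assumptions: MD5 is an arbitrary choice of hash, each node gets a single position derived from its name, and the `Ring` class and `replicas` method are hypothetical names.

```python
import bisect
import hashlib


def h(name):
    """Map a string to a point in [0, 1) on the ring (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas
        self.points = sorted((h(node), node) for node in nodes)

    def replicas(self, key):
        """Walk clockwise from h(key) and return the first N nodes met."""
        positions = [position for position, _ in self.points]
        start = bisect.bisect_left(positions, h(key))
        count = min(self.n_replicas, len(self.points))
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]


ring = Ring(["A", "B", "C", "D", "E", "F"])
print(ring.replicas("key1"))
print(ring.replicas("key2"))
```

Because only the neighbours of a position are affected when a node joins or leaves, this layout is what makes incremental scaling practical.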
Membership
Cluster membership in Cassandra is based on
Scuttlebutt, a very efficient anti-entropy,
gossip-based mechanism.
Membership
• Gossip protocol is used for cluster membership.
• Super lightweight with mathematically provable
properties.
• State is disseminated in O(log N) rounds, where N is
the number of nodes in the cluster.
• Every T seconds each member increments its
heartbeat counter and selects one other member
to send its list to.
• The receiving member merges the list with its own list.
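The heartbeat-and-merge loop described above can be simulated in one process. This is a simplified sketch, not Scuttlebutt itself: the `Member` class, the fixed random seed, and the cluster size of 8 are all assumptions, and real gossip exchanges happen over the network rather than by direct method calls.

```python
import random


class Member:
    def __init__(self, name):
        self.name = name
        self.heartbeats = {name: 0}      # member name -> highest heartbeat seen

    def tick(self, peers, rng):
        """One gossip round: bump own heartbeat, send the list to one other member."""
        self.heartbeats[self.name] += 1
        others = [peer for peer in peers if peer is not self]
        if others:
            rng.choice(others).merge(self.heartbeats)

    def merge(self, incoming):
        """Keep the larger heartbeat counter for every known member."""
        for name, beat in incoming.items():
            if beat > self.heartbeats.get(name, -1):
                self.heartbeats[name] = beat


rng = random.Random(7)                   # fixed seed so the run is repeatable
members = [Member(f"n{i}") for i in range(8)]
rounds = 0
while any(len(m.heartbeats) < len(members) for m in members):
    for m in members:
        m.tick(members, rng)
    rounds += 1
print(f"every member knows every other member after {rounds} rounds")
```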
Failure Detection
Failure detection is a mechanism by which a node
can locally determine if any other node in the
system is up or down.
Accrual Failure Detector
• Valuable for system management, replication, load
balancing etc.
• Defined as a failure detector that outputs a value, PHI,
associated with each process.
• Also known as Adaptive Failure detectors - designed to
adapt to changing network conditions.
• The value output, PHI, represents a suspicion level.
• Applications set an appropriate threshold, trigger
suspicions and perform appropriate actions.
• In Cassandra, the average time taken to detect a failure is
10-15 seconds with the PHI threshold set at 5.
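A suspicion level of this kind can be computed from heartbeat arrival times. The sketch below assumes an exponential model of inter-arrival times, so that PHI is the negative base-10 logarithm of the probability that a heartbeat is still outstanding after the elapsed time; the class name, method names, and the use of a plain mean over all samples are illustrative assumptions.

```python
import math


class PhiAccrualDetector:
    def __init__(self, threshold=5.0):
        self.threshold = threshold
        self.intervals = []              # observed gaps between heartbeats, in seconds
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """phi = -log10(exp(-elapsed / mean)) = elapsed / (mean * ln 10)."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        return elapsed / (mean * math.log(10))

    def is_suspected(self, now):
        return self.phi(now) > self.threshold


detector = PhiAccrualDetector(threshold=5.0)
for t in (0.0, 1.0, 2.0, 3.0):           # heartbeats one second apart
    detector.heartbeat(t)
print(detector.phi(4.0), detector.is_suspected(4.0))
print(detector.phi(20.0), detector.is_suspected(20.0))
```

The threshold turns the continuous suspicion level into a decision, which is the part each application is free to tune.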
Bootstrapping
When a node starts for the first time, it chooses
a random token for its position in the ring.
Scaling The Cluster
When a new node is added into the system, it
gets assigned a token such that it can alleviate a
heavily loaded node.
Local Persistence
The Cassandra system relies on the local file
system for data persistence.
Hossien Sadrizadeh
Slides From 29 To 65
Implementation Details
• The Cassandra process on a single machine needs
the following abstractions:
  • Partitioning module.
  • Cluster membership and failure detection module.
  • Storage engine module.
• Each of these modules has been implemented from
the ground up using Java.
• Each of these modules relies on an event-driven
substrate where the message processing pipeline and the
task pipeline are split into multiple stages along the
lines of SEDA (Staged Event-Driven
Architecture).
SEDA¹
• SEDA combines threads and event-based
programming models to manage:
  • Concurrency.
  • I/O.
  • Scheduling.
  • Resource management needs of Internet services.
• In SEDA, applications consist of:
  • A network of event-driven stages.
  • Each stage connected by explicit queues.
• SEDA is intended to support massive concurrency
demands and simplify the construction of well-conditioned services.
¹ SEDA: An Architecture for Well-Conditioned, Scalable Internet Services (Matt Welsh, David Culler, and Eric Brewer)
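The idea of stages connected by explicit queues can be shown with a two-stage pipeline. This is a minimal sketch, assuming one worker thread per stage and trivial handlers (`str.strip`, `str.upper`) in place of real request processing; the `stage` helper and `STOP` sentinel are hypothetical names.

```python
import queue
import threading

STOP = object()                          # sentinel that shuts a stage down


def stage(handler, inbox, outbox):
    """Run one event-driven stage: pull events, handle them, pass results on."""
    def loop():
        while True:
            event = inbox.get()
            if event is STOP:
                outbox.put(STOP)         # propagate shutdown to the next stage
                return
            outbox.put(handler(event))

    worker = threading.Thread(target=loop)
    worker.start()
    return worker


q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    stage(str.strip, q_in, q_mid),       # stage 1: clean up the raw request
    stage(str.upper, q_mid, q_out),      # stage 2: process it
]
for request in [" get k1 ", " put k2 "]:
    q_in.put(request)
q_in.put(STOP)
for worker in workers:
    worker.join()

results = []
while (item := q_out.get()) is not STOP:
    results.append(item)
print(results)
```

Each queue is a natural place to observe load and apply back-pressure, which is the property a thread-per-request design lacks.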
SEDA (Continued)
Threaded server design:
Each incoming request is dispatched to a
separate thread, which processes the
request and returns a result to the client.
Edges represent control flow between
components.
Routing
• All system control messages rely on UDP¹-based
messaging, while the application-related
messages for replication and request routing
rely on TCP².
• The request routing modules are implemented
using a certain state machine.
¹ UDP: User Datagram Protocol (a connectionless protocol).
² TCP: Transmission Control Protocol (a connection-oriented protocol).
What Happens When a Read/Write Request
Arrives at a Node in the Cluster?
Partitioning¹
• In Cassandra, the total data managed by the cluster is
represented as a circular space or ring.
• The ring is divided up into ranges equal to the number
of nodes, with each node being responsible for one or
more ranges of the overall data.
• Before a node can join the ring, it must be assigned a
token.
• The token determines the node’s position on the ring
and the range of data it is responsible for.
¹ http://www.datastax.com/docs/0.8/cluster_architecture
Partitioning – Single Data Center (Continued)¹
Consider a cluster with 4 nodes, where the
row keys managed by the
cluster are numbers in the
range of 0 to 100.
Each node is assigned a token
that represents a point in this
range.
In this simple example, the
token values are 0, 25, 50,
and 75.
¹ http://www.datastax.com/docs/0.8/cluster_architecture
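The four-token example can be turned into a lookup that answers which node owns a given row key. The sketch assumes that a node owns the keys after its predecessor's token up to and including its own token, and that keys past the last token wrap around to the first; the function name `owner_token` is hypothetical.

```python
import bisect

TOKENS = [0, 25, 50, 75]                 # one token per node, as in the example


def owner_token(key, tokens=TOKENS):
    """Return the token of the node that owns `key`.

    Assumption: a node owns the range (predecessor token, own token], and
    keys beyond the last token wrap around to the first node.
    """
    index = bisect.bisect_left(tokens, key)
    return tokens[index % len(tokens)]


for key in (0, 10, 25, 26, 80, 100):
    print(key, "->", owner_token(key))
```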
Partitioning – Replica Placement (Continued)¹
• In multi-data center deployments, replica placement is
calculated per data center.
• Additional replicas in the same data center are placed
by walking the ring clockwise until the first
node in another rack is reached.
¹ http://www.datastax.com/docs/0.8/cluster_architecture
Partitioning – Multi Data Center (Continued)¹
The goal is to ensure that
the nodes for each data
center have token
assignments that evenly
divide the overall range.
¹ http://www.datastax.com/docs/0.8/cluster_architecture
About Client Requests
• All nodes in Cassandra are peers.
• A client read/write request can go to any node in the
cluster.
• When a client connects to a node and issues a
read/write request, that node serves as the coordinator
(a proxy) for that particular operation.
• The job of the coordinator is to act between the client
application and the nodes (replicas) that own the data
being requested.
About Client Requests (Continued)
• The coordinator sends the write request to all replicas
that own the row being written.
• As long as all replica nodes are up and available,
they will get the write regardless of the consistency
level specified by the client.
• The write consistency level determines how many
replica nodes must respond with a success
acknowledgement in order for the write to be
considered successful.
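The relationship between "send to every replica" and "succeed once enough replicas acknowledge" can be sketched as follows. The replicas here are hypothetical callables standing in for network calls, and the function names are illustrative only; the sketch sends sequentially, whereas a real coordinator would contact replicas in parallel.

```python
def coordinate_write(replicas, mutation, required_acks):
    """Send `mutation` to every replica; succeed if enough of them acknowledge.

    `replicas` is a list of callables returning True on success. They are
    hypothetical stand-ins for requests sent to replica nodes.
    """
    acks = 0
    for send in replicas:
        try:
            if send(mutation):
                acks += 1
        except ConnectionError:
            pass                         # a replica that is down does not acknowledge
    return acks >= required_acks


def up(mutation):
    return True


def down(mutation):
    raise ConnectionError("replica unavailable")


replication_factor = 3
quorum = replication_factor // 2 + 1     # a majority of the replicas
print(coordinate_write([up, up, down], {"tid1": b"hi"}, quorum))
print(coordinate_write([up, up, down], {"tid1": b"hi"}, replication_factor))
```

With one of three replicas down, a quorum write still succeeds while a write that requires all replicas does not.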
An Example of a Write
For example, in a single data center 10-node cluster with a
replication factor of 3, an incoming write will go to all 3 nodes
that own the requested row.
[Figure: ring diagram with nodes numbered 1 to 12, a Client, and the three replicas R1, R2, and R3]
Replication Factor & Replication In Cassandra
• Replication factor: the total number of replicas
across the cluster is often referred to as the
replication factor.
• A replication factor of 1 means that there is only
one copy of each row, and a replication factor of 2
means two copies of each row.
• Replication: the process of storing copies of
data on multiple nodes.
Replication Factor In Cassandra
Replication is the process of storing
copies of data on multiple nodes.
Commit Log
• The Cassandra system relies on the local file system
for data persistence.
• We have a dedicated disk on each machine for the
commit log.
• The write into the in-memory data structure is
performed only after a successful write into the
commit log.
Commit Log & In-Memory Structure
• Cassandra first writes data to a commit
log (for durability), and then to an in-memory
table structure called a memtable.
• A write is successful when:
  1. First, it is written to the commit log.
  2. Second, it is written to the memtable in memory.
• Writes are batched in memory and
periodically written to disk to a persistent
table structure called an SSTable.
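The ordering of the write path (log first, memory second, periodic flush) can be sketched with ordinary files. Everything here is an assumption for illustration: the `ToyStore` class, the JSON file formats, and the flush threshold of two entries are not Cassandra's real formats or settings.

```python
import json
import os
import tempfile


class ToyStore:
    def __init__(self, directory, flush_threshold=2):
        self.directory = directory
        self.flush_threshold = flush_threshold
        self.memtable = {}
        self.sstables = []               # paths of flushed, never-modified files
        self.log = open(os.path.join(directory, "commit.log"), "a")

    def write(self, key, value):
        # 1. Append to the commit log and force it to disk for durability.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only then update the in-memory table.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Dump the memtable as a sorted file that is never changed afterwards.
        path = os.path.join(self.directory, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as out:
            json.dump(sorted(self.memtable.items()), out)
        self.sstables.append(path)
        self.memtable = {}


store = ToyStore(tempfile.mkdtemp())
for key, value in [("b", 2), ("a", 1), ("c", 3)]:
    store.write(key, value)
store.log.close()
print(store.sstables, store.memtable)
```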
Structure of the Commit Log
• Every commit log has a header which is
basically:
  • A bit vector with a fixed size.
  • The size of the bit vector is more than the number of column
  families.
• These bit vectors are kept per commit log and are also
held in memory.
Write Operation Into The Commit Log
• The write operation into the commit log can either
be in normal mode or in fast sync mode.
• In the fast sync mode the writes to the commit log
are buffered (if the machine crashes, some data
may be lost).
Implementation of the Commit Log (Continued)
• Traditional databases are not designed to handle
high write throughput.
• Cassandra turns writes to disk into sequential writes,
thus maximizing disk write throughput.
• Since the files dumped to disk are never
changed, no locks need to be taken while
reading them. As a result, the Cassandra server is
practically lockless for read/write operations.
When Should We Delete The Commit Log?
In any logging system,
we need a mechanism
to purge commit log
entries.
Question:
Is there any difference
between deleting a commit log
and deleting the entries of
a commit log?
Implementation of the Index
• The Cassandra system indexes all data based on the
primary key.
• The data file on disk is broken down into a
sequence of blocks.
• Each block:
  • Contains at most 128 keys.
  • Is demarcated by a block index.
• The block index captures the relative offset of a
key within the block and the size of its data.
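A block index of this shape (relative offset plus data size per key) is enough to jump straight to one value without scanning the block. The sketch below uses an in-memory byte buffer in place of a data file; the function names and the dictionary-based index are assumptions for illustration.

```python
import io

KEYS_PER_BLOCK = 128


def write_block(out, items):
    """Write up to 128 (key, data) pairs and return the block index."""
    if len(items) > KEYS_PER_BLOCK:
        raise ValueError("a block holds at most 128 keys")
    start = out.tell()
    index = {}
    for key, data in items:
        index[key] = (out.tell() - start, len(data))   # (relative offset, size)
        out.write(data)
    return index


def read_value(source, block_start, index, key):
    """Seek directly to one key's data using the block index."""
    offset, size = index[key]
    source.seek(block_start + offset)
    return source.read(size)


data_file = io.BytesIO()                 # stands in for a data file on disk
index = write_block(data_file, [("k1", b"alpha"), ("k2", b"be"), ("k3", b"gamma")])
print(index)
print(read_value(data_file, 0, index, "k2"))
```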
Layout of a Sample Block
Structure of a block and its index, demarcated in memory
Implementation of the Index (Continued)
• When an in-memory data structure (block) is
dumped to disk, a block index is generated
and its offsets are written out to disk as indices.
• This index is also held in memory for fast
access.
What Happens When a Typical Read
Takes Place?
What Should We Do When the Number
of Files on Disk Increases?¹
• Over time the number of data files will increase on disk.
• We perform a compaction process, very much like the
Bigtable system.
• It merges multiple files into one; essentially a merge sort on a
cluster of sorted data files.
• Periodically a compaction process is run to compact all
related data files into one big file.
¹ The research team
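The "merge sort on sorted data files" step can be sketched with the standard library's k-way merge. This is an illustration under assumptions: each file is modelled as a list of (key, timestamp, value) tuples already sorted by key and timestamp, and the rule that the newest timestamp wins for a repeated key is an assumption of this sketch.

```python
import heapq


def compact(*sorted_files):
    """Merge sorted runs of (key, timestamp, value) into one sorted run.

    When a key appears more than once, only the entry with the highest
    timestamp is kept (an assumption of this sketch).
    """
    merged = []
    for key, timestamp, value in heapq.merge(*sorted_files):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, timestamp, value)   # later timestamps arrive last
        else:
            merged.append((key, timestamp, value))
    return merged


old_file = [("a", 1, "x"), ("c", 1, "y")]
new_file = [("a", 2, "z"), ("b", 2, "w")]
print(compact(old_file, new_file))
```

Because every input is already sorted, the merge reads each file sequentially once, which keeps compaction friendly to disk throughput.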
Practical Experiences
• In the process of designing Cassandra, we gained a lot
of useful experience, which was very beneficial for us.
• We experimented with various implementations of
failure detectors. As the size of the cluster grows,
the time taken to detect a failure increases.
• Most applications only require atomic operations per
key per replica, but some applications need
secondary indexes (because most developers
are used to working with an RDBMS).
• Cassandra is a completely decentralized
(distributed) system.
Ganglia /ģæŋ.lia/
• Older monitoring tools are no longer needed, because the
Cassandra system is well integrated with Ganglia
(a distributed monitoring tool)¹.
• Ganglia is a scalable distributed system monitoring
tool for high-performance computing systems such
as clusters.
• The strategy uses a distributed tree structure that
enables organizations to monitor an arbitrarily
large number of clusters while placing bounds on
the required processing load.
¹ Matthew L. Massie, Brent N. Chun, and David E. Culler. The Ganglia distributed monitoring system
Ganglia (Continued)
Ganglia is comprised of two components:
• Gmon, a local-area monitoring system.
• Gmeta, a wide-area system.
[Figure: Ganglia local and wide area monitor
interaction. Gmon runs on each
cluster node; Gmeta can fail over
between nodes.]
Gmon uses UDP multicast.
Gmon communicates with its Gmeta
counterpart using
XML streams sent over TCP
connections.
Facebook Inbox Search
• What is the problem?
  • Millions of messages are sent every day on Facebook.
  • Messages are stored in different data centers.
  • How to handle indexing all of this information for Inbox
  search?
Facebook Inbox Search
• For inbox search we have to make a list of all
messages per user that have been exchanged
between the sender and recipients.
• There are two kinds of search features:
• Term search.
• Search interaction.
Term Search / Search Interaction
• Term search:
  • Key = user ID.
  • Super column = the words that make up the
  message.
• Search interaction:
  • Key = user ID.
  • Super column = the recipients' IDs.
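The two layouts can be pictured as nested maps keyed by user ID and then by super column. This is a sketch under assumptions: the columns inside each super column are taken to be message identifiers, which the slide does not spell out, and the names `term_index`, `interaction_index`, and `index_message` are hypothetical.

```python
from collections import defaultdict

# user ID -> word (super column) -> message IDs (assumed to be the columns)
term_index = defaultdict(lambda: defaultdict(set))
# user ID -> recipient ID (super column) -> message IDs (assumed to be the columns)
interaction_index = defaultdict(lambda: defaultdict(set))


def index_message(user_id, message_id, recipients, text):
    """Record one message under both search layouts."""
    for word in text.lower().split():
        term_index[user_id][word].add(message_id)
    for recipient in recipients:
        interaction_index[user_id][recipient].add(message_id)


index_message("u1", "m1", ["u2"], "Lunch on Friday")
index_message("u1", "m2", ["u3"], "friday deadline")
print(sorted(term_index["u1"]["friday"]))        # term search
print(sorted(interaction_index["u1"]["u2"]))     # search interaction
```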
An Actual Example
• The current system stores about 50 TB of data on a
150-node cluster.
• This data is spread out between east
and west coast data centers.
• Some of the measurements we produced are in the
following table.
What Work Remains for the Future?
Future work includes:
• Adding compression.
• Secondary index support.
Cassandra Goals(Conclusion)
• High scalability.
• High performance.
• Throughput.
• Response time.
• High availability.
Headline Summary
• The whole implementation is in Java.
• The UDP and TCP protocols are used for messaging and routing.
• A ring mechanism is used for clustering.
• All the nodes in the ring are peers.
• Replication is used.
• A commit log is used for persistence.
• Sequential writes give high throughput.
• All data files are broken into blocks.
• No locks are used for reads/writes.
• Compaction is used to merge the data files.
Now, Please
Ask Your Questions!