Subject: Cassandra - A Decentralized Structured Storage System
Professor: Dr. Sh. Esmaili
Students: Mr. Houshyar Mohammadi Talvar (Slides 4 to 17), Miss Hakimi (Slides 19 to 27), Mr. Hossien Sadrizadeh (Slides 29 to 65)
Date: June 6th 2012 (Thursday, 25th Khordad 1391)
Cassandra - A Decentralized Structured Storage System, Azad Kurdistan University

Contents of the Presentation:
• Abstract
• Introduction
• Related Work
• Data Model
• API
• System Architecture
  • Partitioning
  • Replication
  • Membership
  • Bootstrapping
  • Scaling the Cluster
  • Local Persistence
  • Implementation Details
• Practical Experiences
• Facebook Inbox Search
• Conclusion
• Acknowledgements

Mr. Houshyar Mohammadi Talvar: Slides 4 to 17

Abstract
Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.

Introduction
Facebook runs the largest social networking platform, serving hundreds of millions of users at peak times using tens of thousands of servers located in many data centers around the world.

Related Work
Systems like Ficus and Coda replicate files for high availability at the expense of consistency. Update conflicts are typically managed using specialized conflict resolution procedures.
Related systems: Bayou, Coda, Ficus, Dynamo

Data Model
A table in Cassandra is a distributed multi-dimensional map indexed by a key. The value is a highly structured object.
Cassandra exposes two kinds of column families:
• Simple column families
• Super column families

Data Model (Continued)
KEY
ColumnFamily1    Name: MailList    Type: Simple    Sort: Name
  Name: tid1    Value: <Binary>    TimeStamp: t1
  Name: tid2    Value: <Binary>    TimeStamp: t2
  Name: tid3    Value: <Binary>    TimeStamp: t3
  Name: tid4    Value: <Binary>    TimeStamp: t4
ColumnFamily2    Name: WordList    Type: Super    Sort: Time
  Name: aloha
    C1  V1  T1
    C2  V2  T2
    C3  V3  T3
    C4  V4  T4
  Name: dude
    C2  V2  T2
    C6  V6  T6
• Column families are declared upfront.
• Super columns are added and modified dynamically.
• Columns are added and modified dynamically.

Data Model (Continued)
Any column within a column family is accessed using the convention column family : column. Any column within a column family that is of type super is accessed using the convention column family : super column : column.

API
The Cassandra API consists of the following three simple methods:
insert(table; key; rowMutation)
get(table; key; columnName)
delete(table; key; columnName)

System Architecture
The architecture of a storage system that needs to operate in a production setting is complex. In addition to the actual data persistence component, the system needs to have the following characteristics.

System Architecture (Continued)
Scalable and robust solutions for:
• load balancing
• membership and failure detection
• failure recovery
• replica synchronization
• overload handling
• state transfer
• concurrency and job scheduling
• request marshalling
• request routing
• system monitoring and alarming
• configuration management

Read
[Figure: read path. Labels: Client, Query, Result, Cassandra Cluster, Closest replica (Replica A), Digest Query, Replica B, Replica C.]

Miss Hakimi: Slides 19 to 27

Partitioning
One of the key design features of Cassandra is the ability to scale incrementally.

Partitioning and Replication
[Figure: ring with positions 0, 1/2, and 1; nodes A, B, C, D, E, and F; keys placed at h(key1) and h(key2); N=3.]

Membership
Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy gossip-based mechanism.

Membership
• The gossip protocol is used for cluster membership.
• Super lightweight, with mathematically provable properties.
• State is disseminated in O(log N) rounds, where N is the number of nodes in the cluster.
• Every T seconds, each member increments its heartbeat counter and selects one other member to send its list to.
• The receiving member merges the list with its own list.

Failure Detection
Failure detection is a mechanism by which a node can locally determine whether any other node in the system is up or down.

Accrual Failure Detector
• Valuable for system management, replication, load balancing, etc.
• Defined as a failure detector that outputs a value, PHI, associated with each process.
• Also known as adaptive failure detectors, designed to adapt to changing network conditions.
• The output value, PHI, represents a suspicion level.
• Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions.
• In Cassandra, the average time taken to detect a failure is 10-15 seconds with the PHI threshold set at 5.

Bootstrapping
When a node starts for the first time, it chooses a random token for its position in the ring.

Scaling the Cluster
When a new node is added to the system, it is assigned a token such that it can alleviate a heavily loaded node.

Local Persistence
The Cassandra system relies on the local file system for data persistence.

Hossien Sadrizadeh: Slides 29 to 65

Implementation Details
• The following abstractions are needed for the Cassandra process on a single machine:
• Partitioning module.
• Cluster membership and failure detection module.
• Storage engine module.
• Each of these modules has been implemented from the ground up using Java.
• Each of these modules relies on an event-driven substrate in which the message processing pipeline and the task pipeline are split into multiple stages, along the lines of SEDA (Staged Event-Driven Architecture).

SEDA¹
• SEDA combines aspects of thread-based and event-based programming models to manage the following needs of Internet services:
  • Concurrency.
  • I/O.
  • Scheduling.
  • Resource management.
• In SEDA, applications consist of:
  • A network of event-driven stages.
  • Explicit queues connecting the stages.
• SEDA is intended to support massive concurrency demands and simplify the construction of well-conditioned services.
¹ SEDA: An Architecture for Well-Conditioned, Scalable Internet Services (Matt Welsh, David Culler, and Eric Brewer)

SEDA (Continued)
Threaded server design: each incoming request is dispatched to a separate thread, which processes the request and returns a result to the client. Edges represent control flow between components.

Routing
• All system control messages rely on UDP¹-based messaging, while the application-related messages for replication and request routing rely on TCP².
• The request routing modules are implemented using a certain state machine.
¹ UDP: User Datagram Protocol (a connectionless protocol).
² TCP: Transmission Control Protocol (a connection-oriented protocol).

What Happens When a Read/Write Request Arrives at a Node in the Cluster?

Partitioning¹
• In Cassandra, the total data managed by the cluster is represented as a circular space, or ring.
• The ring is divided into as many ranges as there are nodes, with each node being responsible for one or more ranges of the overall data.
• Before a node can join the ring, it must be assigned a token.
• The token determines the node's position on the ring and the range of data it is responsible for.
¹ http://www.datastax.com/docs/0.8/cluster_architecture

Partitioning – Single Data Center (Continued)¹
Consider a cluster with 4 nodes in which the row keys managed by the cluster are numbers in the range 0 to 100. Each node is assigned a token that represents a point in this range. In this simple example, the token values are 0, 25, 50, and 75.

Partitioning – Replica Placement (Continued)¹
• In multi-data-center deployments, replica placement is calculated per data center.
• Additional replicas in the same data center are placed by walking the ring clockwise until the first node in another rack is reached.

Partitioning – Multi Data Center (Continued)¹
The goal is to ensure that the nodes for each data center have token assignments that evenly divide the overall range.
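
The token example above (tokens 0, 25, 50, and 75 over a key range of 0 to 100) can be sketched in a few lines of Python. This is only a minimal illustration, not Cassandra's code: the function names `evenly_spaced_tokens` and `owner_token` are hypothetical, and the ownership rule (a key belongs to the first node whose token is at or after the key, wrapping around the ring) is an assumption based on the clockwise ring walk described in these slides.

```python
from bisect import bisect_left

def evenly_spaced_tokens(node_count, range_size):
    """Return one token per node, dividing [0, range_size) evenly."""
    return [i * range_size // node_count for i in range(node_count)]

def owner_token(tokens, key):
    """Walk the ring clockwise from `key` to the first token >= key (assumed rule)."""
    ring = sorted(tokens)
    index = bisect_left(ring, key)
    return ring[index % len(ring)]  # past the last token, wrap to the first

tokens = evenly_spaced_tokens(4, 100)
print(tokens)                   # [0, 25, 50, 75]
print(owner_token(tokens, 30))  # 50
print(owner_token(tokens, 80))  # 0 (wraps around the ring)
```

Spacing the tokens evenly gives each node an equally sized range, which is the balance the slide describes for each data center.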

About Client Request
• All nodes in Cassandra are peers.
• A client read/write request can go to any node in the cluster.
• When a client connects to a node and issues a read/write request, that node serves as the coordinator (a proxy) for that particular operation.
• The job of the coordinator is to act as an intermediary between the client application and the nodes (replicas) that own the data being requested.

About Client Request (Continued)
• The coordinator sends the write request to all replicas that own the row being written.
• If all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client.
• The write consistency level determines how many replica nodes must respond with a success acknowledgement in order for the write to be considered successful.

An Example Write
For example, in a single-data-center 10-node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row.
[Figure: a client and a ring of nodes numbered 1 to 12, with replicas labelled R1, R2, and R3.]

Replication Factor & Replication in Cassandra
• Replication factor: the total number of replicas across the cluster.
• A replication factor of 1 means that there is only one copy of each row; a replication factor of 2 means two copies of each row.
• Replication: the process of storing copies of data on multiple nodes.
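
The write path described above (a coordinator forwarding the write to every replica, with the consistency level deciding how many acknowledgements are needed) can be sketched as follows. This is a hedged illustration rather than Cassandra's implementation: the function names are hypothetical, the level names `ONE`, `QUORUM`, and `ALL` and the quorum formula are assumptions not stated in the slides, and replicas are simply taken as the next nodes clockwise on the ring.

```python
def pick_replicas(ring_nodes, first_owner_index, replication_factor):
    """Take the owning node plus the next nodes clockwise (assumed simple placement)."""
    count = len(ring_nodes)
    return [ring_nodes[(first_owner_index + i) % count]
            for i in range(replication_factor)]

def required_acks(consistency_level, replication_factor):
    """Number of replica acknowledgements needed for each assumed level."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]

def write_is_successful(consistency_level, replication_factor, acks_received):
    """A write succeeds once enough replicas have acknowledged it."""
    return acks_received >= required_acks(consistency_level, replication_factor)

nodes = [f"node{i}" for i in range(1, 11)]   # the 10-node cluster from the example
print(pick_replicas(nodes, 8, 3))            # ['node9', 'node10', 'node1']
print(write_is_successful("QUORUM", 3, 2))   # True
print(write_is_successful("ALL", 3, 2))      # False
```

Note that the write is still sent to all three replicas in every case; the consistency level only changes how many acknowledgements the coordinator waits for.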

Commit Log
• The Cassandra system relies on the local file system for data persistence.
• A dedicated disk on each machine is used for the commit log.
• The write into the in-memory data structure is performed only after a successful write into the commit log.

Commit Log & In-Memory Structure
• Cassandra first writes data to a commit log (for durability), and then to an in-memory table structure called a memtable.
• A write is successful once it has been:
  1. Written to the commit log.
  2. Written to memory.
• Writes are batched in memory and periodically written to disk in a persistent table structure called an SSTable.

Structure of the Commit Log
• Every commit log has a header, which is basically a bit vector of fixed size.
• The size of the bit vector is larger than the number of column families.
• These bit vectors are per commit log and are also held in memory.

Write Operation into the Commit Log
• The write operation into the commit log can be in either normal mode or fast sync mode.
• In fast sync mode, the writes to the commit log are buffered (if the machine crashes, some data may be lost).

Implementation of the Commit Log (Continued)
• Traditional databases are not designed to handle high write throughput.
• Cassandra turns all writes to disk into sequential writes, thus maximizing disk write throughput.
• Since the files dumped to disk are never changed, no locks need to be taken while reading them. As a result, the Cassandra server is practically lockless for read/write operations.

When Should We Delete the Commit Log?
In any logging system, we need a mechanism to purge commit log entries.
Question: Is there any difference between deleting a commit log and deleting the entries of a commit log?

Implementation of the Index
• The Cassandra system indexes all data based on the primary key.
• The data file on disk is broken down into a sequence of blocks.
• Each block:
  • Contains at most 128 keys.
  • Is demarcated by a block index.
• The block index captures the relative offset of a key within the block and the size of its data.

Layout of a Sample Block
[Figure: structure of a block and its index, as demarcated in memory.]

Implementation of the Index (Continued)
• When an in-memory data structure is dumped to disk, a block index is generated and its offsets are written out to disk as indices.
• This index is also held in memory for fast access.

What Happens When a Typical Read Takes Place?

What Should We Do When the Number of Files on Disk Increases?¹
• Over time, the number of data files on disk will increase.
• We perform a compaction process, very much like the one in the Bigtable system.
• It merges multiple files into one; essentially a merge sort on a collection of sorted data files.
• Periodically, a compaction process is run to compact all related data files into one big file.
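
The compaction step above, a merge sort over several sorted data files, can be sketched as follows. This is a simplified model under stated assumptions: each "file" is an in-memory list of hypothetical `(key, timestamp, value)` tuples already sorted by key, and when the same key appears in several files the entry with the newest timestamp is kept. The function name `compact` is illustrative only.

```python
import heapq

def compact(*sorted_files):
    """Merge key-sorted files into one, keeping the newest entry per key (assumed rule)."""
    merged = heapq.merge(*sorted_files, key=lambda entry: entry[0])
    result = []
    for key, timestamp, value in merged:
        if result and result[-1][0] == key:
            # Same key seen in an earlier file: keep whichever entry is newer.
            if timestamp > result[-1][1]:
                result[-1] = (key, timestamp, value)
        else:
            result.append((key, timestamp, value))
    return result

file_a = [("a", 1, "x"), ("c", 1, "z")]
file_b = [("a", 2, "y"), ("b", 1, "w")]
print(compact(file_a, file_b))
# [('a', 2, 'y'), ('b', 1, 'w'), ('c', 1, 'z')]
```

Because every input is already sorted, the merge reads each file once from start to end, which fits the sequential I/O pattern these slides emphasize.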
¹ The research team

Practical Experiences
• While designing Cassandra, we gained a lot of useful experience that benefited us greatly.
• We experimented with various implementations of failure detectors. As the size of the cluster grows, the time taken to detect a failure increases.
• Most applications only require atomic operations per key per replica, but some applications also need to work with secondary indexes (because most developers are used to working with an RDBMS).
• Cassandra is a completely decentralized (distributed) system.

Ganglia /ˈɡæŋ.ɡli.ə/
• Older monitoring tools are no longer useful, because the Cassandra system is well integrated with Ganglia (a distributed monitoring tool)¹.
• Ganglia is a scalable distributed monitoring tool for high-performance computing systems such as clusters.
• It uses a distributed tree structure that enables organizations to monitor an arbitrarily large number of clusters while placing bounds on the required processing load.
¹ Matthew L. Massie, Brent N. Chun, and David E. Culler. The Ganglia distributed monitoring system.

Ganglia (Continued)
Ganglia comprises two components:
• Gmon, the local-area monitoring system.
• Gmeta, the wide-area monitoring system.
Ganglia local and wide area monitor interaction: Gmon runs on each cluster node; Gmeta can fail over between nodes. Gmon uses UDP multicast. Gmon communicates with its Gmeta counterpart using XML streams sent over TCP connections.

Facebook Inbox Search
• What is the problem?
• Millions of messages are sent every day on Facebook.
• Messages are stored in different data centers.
• How can all of this information be indexed for Inbox Search?

Facebook Inbox Search (Continued)
• For Inbox Search, we have to maintain a per-user index of all messages that have been exchanged between the sender and the recipients.
• There are two kinds of search features:
  • Term search.
  • Search interaction.

Term Search / Search Interaction
• Term search:
  • Key = user ID.
  • Super column = the words that make up the message.
• Search interaction:
  • Key = user ID.
  • Super column = the recipients' IDs.

An Actual Example
• The current system stores about 50 TB of data on a 150-node cluster.
• This data is spread out between east and west coast data centers.
• Some of the measurements we produced are shown in the following table.

What Work Remains for the Future?
The work that can still be done includes:
• Adding compression.
• Secondary index support.

Cassandra Goals (Conclusion)
• High scalability.
• High performance.
• Throughput.
• Response time.
• High availability.

Headline Summary
• The entire implementation is in Java.
• UDP and TCP are used for messaging and routing.
• A ring mechanism is used for clustering.
• All nodes in the ring are peers.
• Replication is used.
• A commit log is used for persistence.
• Sequential writes give high throughput.
• All files are broken into blocks.
• No locks are used for reads/writes.
• Compaction is used to merge the files.

Now, Please Ask Your Questions!