Cassandra – A Decentralized Structured Storage System
A. Lakshman and P. Malik (Facebook), SIGOPS '10
2011. 03. 18.
Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University

The Rise of NoSQL
- Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009, when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases.
- The name attempted to label the emerging class of distributed data stores that often do not attempt to provide ACID guarantees.
- Refer to http://www.google.com/trends?q=nosql

NoSQL Databases
- Based on Key-Value: memcached, Dynamo, Voldemort, Tokyo Cabinet
- Based on Column: Google BigTable, Cloudata, HBase, Hypertable, Cassandra
- Based on Document: MongoDB, CouchDB
- Based on Graph: Neo4j, FlockDB, InfiniteGraph
- (figure omitted; refer to http://blog.nahurst.com/visual-guide-to-nosql-sy)

Contents
- Introduction
  - Remind: Dynamo
  - Cassandra
- Data Model
- System Architecture
  - Partitioning
  - Replication
  - Membership
  - Bootstrapping
- Operations
  - WRITE
  - READ
  - Consistency Level
- Performance Benchmark
- Case Study
- Conclusion

Remind: Dynamo
- Distributed Hash Table
- BASE: Basically Available, Soft-state, Eventually consistent
- Client-tunable consistency/availability via the NRW configuration (see the sketch below):
  - W=N, R=1: read-optimized strong consistency
  - W=1, R=N: write-optimized strong consistency
  - W+R ≤ N: weak eventual consistency
  - W+R > N: strong consistency
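To make the NRW arithmetic concrete, here is a minimal Python sketch (an illustrative helper of my own, not code from Dynamo, Cassandra, or the paper) that classifies an N/R/W choice according to the configurations listed above:

```python
# Minimal sketch: classify a Dynamo-style N/R/W quorum configuration.
# N = number of replicas, R = replicas contacted on read, W = replicas acked on write.
# The function name and structure are illustrative, not from any real codebase.

def classify_nrw(n: int, r: int, w: int) -> str:
    """Return the consistency/availability trade-off implied by an N/R/W choice."""
    if w == n and r == 1:
        return "read-optimized strong consistency"
    if w == 1 and r == n:
        return "write-optimized strong consistency"
    if r + w > n:
        return "strong consistency (read and write quorums must overlap)"
    return "weak eventual consistency (a read may miss the latest write)"

if __name__ == "__main__":
    for n, r, w in [(3, 1, 3), (3, 3, 1), (3, 2, 2), (3, 1, 1)]:
        print(f"N={n} R={r} W={w}: {classify_nrw(n, r, w)}")
```

The intuition behind the W+R > N rule: any set of W written replicas and any set of R read replicas must share at least one node, so a read is guaranteed to see the latest acknowledged write.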
Cassandra
- Dynamo-BigTable lovechild
- Column-based data model
- Distributed Hash Table
- Tunable tradeoff: consistency vs. latency
- Properties: no single point of failure, linearly scalable, flexible partitioning and replica placement, high availability (eventual consistency)

Data Model
- A Cluster contains Key Spaces; a Key Space corresponds to a database or table space
- A Column Family corresponds to a table
- A Column is the unit of data stored in Cassandra
- Example: Column Family "User"
  - Row key "userid1": (name: Username, value: uname1), (name: Email, value: uname1@abc.com), (name: Tel, value: 123-4567)
  - Row key "userid2": (name: Username, value: uname2), (name: Email, value: uname2@abc.com), (name: Tel, value: 123-4568)
  - Row key "userid3": (name: Username, value: uname3), (name: Email, value: uname3@abc.com), (name: Tel, value: 123-4569)
- Example: Column Family "Article": (name: ArticleId, value: userid2-1), (name: ArticleId, value: userid2-2), (name: ArticleId, value: userid2-3)

System Architecture
- Partitioning
- Replication
- Membership
- Bootstrapping

Partitioning Algorithm
- Distributed Hash Table: data and servers are located in the same address space
- Consistent hashing
  - Key space partition: arrangement of the keys on the ring
  - Overlay networking: routing mechanism
- (figure: key1 is hashed onto a ring of nodes N1-N3; N2 is deemed the coordinator of key1)

Partitioning Algorithm (cont'd)
- Challenges
  - Non-uniform data and load distribution
  - Oblivious to the heterogeneity in the performance of nodes
- Solutions
  - Nodes get assigned to multiple positions in the circle (like Dynamo)
  - Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes (like Cassandra)

Replication
- Replica placement strategies: RackUnaware, RackAware, DataCenterShared
- (figure: ring of nodes A-J; the coordinator of data1 replicates it to the following nodes)

Cluster Membership
- A gossip protocol is used for cluster membership
- Super lightweight, with mathematically provable properties
- State is disseminated in O(log N) rounds
- Every T seconds each member increments its heartbeat counter and selects one other member to send its list to
- A member merges the received list with its own list

Gossip Protocol
- (figure: servers 1-3 exchange heartbeat lists over rounds t1-t6; after each merge a server keeps the most recent timestamp it has seen for every peer)

Accrual Failure Detector
- Valuable for system management, replication, and load balancing
- Designed to adapt to changing network conditions
- The output value, PHI, represents a suspicion level
- Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions
- In Cassandra the average time taken to detect a failure is 10-15 seconds with the PHI threshold set at 5
- Φ(t_now) = -log10( P(t_now - t_last) ), where P(t) = 1 - e^(-λt)

Bootstrapping
- A new node gets assigned a token such that it can alleviate a heavily loaded node
- (figure: the ring before and after a new node joins)

WRITE
- Interface
  - Simple: put(key, col, value)
  - Complex: put(key, [col:val, ..., col:val])
  - Batch
- WRITE operation (see the write-path sketch below)
  - Commit log for durability: configurable fsync, sequential writes only
  - MemTable: no disk access (no reads or seeks)
  - SSTables are final: read-only, with indexes
  - Always writable
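The write path on the WRITE slide can be sketched in a few lines of Python. This is a minimal, assumed-structure sketch (COMMIT_LOG, MEMTABLE, SSTABLES, and this put are illustrative stand-ins, not Cassandra internals): durability comes from a sequential commit-log append, the in-memory MemTable is updated without any disk reads or seeks, and a full MemTable is frozen into an immutable SSTable.

```python
# Minimal sketch of the write path described on the WRITE slide
# (illustrative stand-ins, not Cassandra's actual data structures).

COMMIT_LOG = []       # stand-in for the sequential, fsync-able commit log
MEMTABLE = {}         # in-memory writes: row key -> {column name: value}
SSTABLES = []         # flushed, read-only sorted tables
MEMTABLE_LIMIT = 2    # tiny threshold so the demo actually flushes

def put(key, column, value):
    """Simple interface from the slide: put(key, col, value)."""
    COMMIT_LOG.append((key, column, value))       # durability first, sequential append only
    MEMTABLE.setdefault(key, {})[column] = value  # no disk access, no reads or seeks
    if len(MEMTABLE) >= MEMTABLE_LIMIT:           # flush once enough rows accumulate
        flush_memtable()

def flush_memtable():
    """Freeze the current MemTable into an immutable, sorted SSTable."""
    SSTABLES.append(dict(sorted(MEMTABLE.items())))  # SSTables are final / read-only
    MEMTABLE.clear()

put("userid1", "Email", "uname1@abc.com")
put("userid2", "Email", "uname2@abc.com")  # second row key reaches the limit and flushes
print(SSTABLES)
```

Because every step is either an append or an in-memory update, a node stays "always writable"; reads must later merge the MemTable with the accumulated SSTables, which is the SSTable proliferation mentioned on the READ slide below.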
READ
- Interface
  - get(key, column)
  - get_slice(key, SlicePredicate)
  - get_range_slices(keyRange, SlicePredicate)
- READ operation
  - Practically lock-free
  - SSTable proliferation
  - Row cache
  - Key cache

Consistency Level
- The consistency level is tuned per WRITE/READ operation (a quorum sketch appears at the end of this summary):

  Level  | Write operation | Read operation
  -------|-----------------|----------------
  ZERO   | Hail Mary       | N/A
  ANY    | 1 replica       | N/A
  ONE    | 1 replica       | 1 replica
  QUORUM | (N/2)+1         | (N/2)+1
  ALL    | All replicas    | All replicas

Performance Benchmark
- Random and sequential writes: limited by bandwidth
- Facebook Inbox Search
  - Two kinds of search: term search and interactions
  - 50+ TB on a 150-node cluster

  Latency stat | Search interactions | Term search
  -------------|---------------------|------------
  Min          | 7.69 ms             | 7.78 ms
  Median       | 15.69 ms            | 18.27 ms
  Max          | 26.13 ms            | 44.41 ms

vs. MySQL with 50 GB of data
- MySQL: ~300 ms write, ~350 ms read
- Cassandra: ~0.12 ms write, ~15 ms read

Case Study
- Cassandra as the primary data store
- Datacenter- and rack-aware replication
- ~1,000,000 ops/s
- High sharding and low replication
- Inbox Search: 100 TB, 5,000,000,000 writes per day

Conclusions
- Cassandra: scalability, high performance, wide applicability
- Future work: compression, atomicity, secondary indexes
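As a companion to the Consistency Level table above, here is a minimal sketch (an illustrative helper, not the Cassandra/Thrift API) that maps each level to the number of replica acknowledgements it requires with N replicas, and checks when a write/read pairing is strongly consistent, i.e. when W + R > N:

```python
# Minimal sketch: per-operation consistency levels from the table on the
# "Consistency Level" slide. These helpers are illustrative, not a real API.

def required_acks(level: str, n: int) -> int:
    """Replica responses required for a consistency level, given N replicas."""
    table = {
        "ZERO": 0,             # write returns immediately ("Hail Mary"); N/A for reads
        "ANY": 1,              # write-only level; N/A for reads
        "ONE": 1,
        "QUORUM": n // 2 + 1,  # (N/2)+1
        "ALL": n,
    }
    return table[level.upper()]

def strongly_consistent(write_level: str, read_level: str, n: int) -> bool:
    """Strong consistency when write and read replica sets must overlap (W + R > N)."""
    return required_acks(write_level, n) + required_acks(read_level, n) > n

print(strongly_consistent("QUORUM", "QUORUM", 3))  # True: 2 + 2 > 3
print(strongly_consistent("ONE", "ONE", 3))        # False: 1 + 1 <= 3
```

QUORUM writes paired with QUORUM reads overlap for any N, which is the common way to trade a little latency for strongly consistent reads while still tolerating some replica failures.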