2011SS-03

Cassandra – A Decentralized Structured Storage System
A. Lakshman1, P. Malik1
1Facebook
SIGOPS ‘10
2011. 03. 18.
Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University
The Rise of NoSQL

Eric Evans, a Rackspace employee, reintroduced the term NoSQL in early 2009 when Johan Oskarsson of Last.fm wanted to organize an event to discuss open-source distributed databases.

The name attempted to label the emergence of a growing number of distributed data stores that often did not attempt to provide ACID guarantees.
Refer to http://www.google.com/trends?q=nosql
NoSQL Database
 Based on Key-Value
– memcached, Dynamo, Voldemort, Tokyo Cabinet
 Based on Column
– Google BigTable, Cloudata, HBase, Hypertable, Cassandra
 Based on Document
– MongoDB, CouchDB
 Based on Graph
– Neo4j, FlockDB, InfiniteGraph
[Figure: visual guide to NoSQL systems — refer to http://blog.nahurst.com/visual-guide-to-nosql-sy]
Contents
 Introduction
– Remind: Dynamo
– Cassandra
 Data Model
 System Architecture
– Partitioning
– Replication
– Membership
– Bootstrapping
 Operations
– WRITE
– READ
– Consistency Level
 Performance Benchmark
 Case Study
 Conclusion
Remind: Dynamo
 Distributed Hash Table
 BASE
– Basically Available
– Soft-state
– Eventually Consistent
 Client-tunable consistency/availability
NRW configuration:
– W=N, R=1: read-optimized strong consistency
– W=1, R=N: write-optimized strong consistency
– W+R ≤ N: weak eventual consistency
– W+R > N: strong consistency
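
Below is a minimal, illustrative sketch (not from the paper or the slides) of how the W+R vs. N rule above decides the consistency mode; the function name and example values are assumptions for illustration.

# Illustrative sketch of the Dynamo-style N/R/W rule above (names are made up).
def consistency_mode(n: int, r: int, w: int) -> str:
    """Classify an N/R/W configuration."""
    if r + w > n:
        # Every read quorum overlaps every write quorum,
        # so a read always sees the latest acknowledged write.
        return "strong consistency"
    # Quorums can miss each other; replicas only converge eventually.
    return "weak eventual consistency"

print(consistency_mode(n=3, r=1, w=3))  # W=N, R=1: read-optimized, strong
print(consistency_mode(n=3, r=3, w=1))  # W=1, R=N: write-optimized, strong
print(consistency_mode(n=3, r=1, w=1))  # W+R <= N: weak eventual consistency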
Cassandra
 Dynamo-BigTable lovechild
– Column-based data model
– Distributed Hash Table
– Tunable trade-off: consistency vs. latency
 Properties
– No single point of failure
– Linearly scalable
– Flexible partitioning, replica placement
– High availability (eventual consistency)
Data Model
 Cluster
 Key Space corresponds to a database or table space
 Column Family corresponds to a table
 Column is the unit of data stored in Cassandra
Column Family "User":
– Row key "userid1": (name: Username, value: uname1), (name: Email, value: uname1@abc.com), (name: Tel, value: 123-4567)
– Row key "userid2": (name: Username, value: uname2), (name: Email, value: uname2@abc.com), (name: Tel, value: 123-4568)
– Row key "userid3": (name: Username, value: uname3), (name: Email, value: uname3@abc.com), (name: Tel, value: 123-4569)
Column Family "Article":
– (name: ArticleId, value: userid2-1), (name: ArticleId, value: userid2-2), (name: ArticleId, value: userid2-3)
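
As a rough mental model (an assumption of this summary, not the paper's storage layout), a keyspace can be pictured as nested maps: column family → row key → columns. A sketch of the "User" example above; the "Article" row key is not shown on the slide, so "userid2" is assumed here.

# Illustrative in-memory picture of the example above:
# keyspace -> column family -> row key -> {column name: column value}
keyspace = {
    "User": {
        "userid1": {"Username": "uname1", "Email": "uname1@abc.com", "Tel": "123-4567"},
        "userid2": {"Username": "uname2", "Email": "uname2@abc.com", "Tel": "123-4568"},
        "userid3": {"Username": "uname3", "Email": "uname3@abc.com", "Tel": "123-4569"},
    },
    "Article": {
        # Rows in another column family can hold a different set of columns.
        "userid2": {"ArticleId": "userid2-1"},
    },
}

# A column value is addressed by (column family, row key, column name):
print(keyspace["User"]["userid2"]["Email"])   # -> uname2@abc.com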
System Architecture
 Partitioning
 Replication
 Membership
 Bootstrapping
Partitioning Algorithm
 Distributed Hash Table
– Data and servers are located in the same address space
– Consistent hashing
– Key space partition: how keys are arranged among nodes
– Overlay networking: routing mechanism
[Figure: nodes N1–N3 and hash(key1) mapped onto the same circular address space; N2 is deemed the coordinator of key1]
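
A minimal consistent-hashing sketch of the coordinator lookup shown in the figure; it is not Cassandra's partitioner, just an illustration using MD5 and a sorted token list (node and key names are made up).

import bisect
import hashlib

def ring_hash(s: str) -> int:
    # Keys and node names are hashed into the same circular address space.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

nodes = ["N1", "N2", "N3"]
ring = sorted((ring_hash(n), n) for n in nodes)   # token ring
tokens = [token for token, _ in ring]

def coordinator(key: str) -> str:
    """The first node clockwise from hash(key) is the coordinator for that key."""
    i = bisect.bisect_right(tokens, ring_hash(key)) % len(ring)   # wrap around
    return ring[i][1]

print(coordinator("key1"))   # some node, e.g. "N2" as in the figure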
Partitioning Algorithm (cont’d)


 Challenges
– Non-uniform data and load distribution
– Oblivious to the heterogeneity in the performance of nodes
 Solutions
– Nodes get assigned to multiple positions in the circle (like Dynamo); see the sketch below
– Analyze load information on the ring and have lightly loaded nodes move on the ring to alleviate heavily loaded nodes (like Cassandra)
[Figure: the same ring with each node assigned to multiple positions, so N1–N3 each own several smaller ranges]
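
A small extension of the previous sketch for the Dynamo-style solution (multiple positions per node); the vnode count and key sample are arbitrary assumptions, chosen only to show that load spreads more evenly.

import hashlib
from collections import Counter

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

VNODES = 64   # each physical node takes 64 positions ("virtual nodes") on the ring
ring = sorted((ring_hash(f"{node}#{v}"), node)
              for node in ["N1", "N2", "N3"] for v in range(VNODES))

owners = Counter()
for k in range(10_000):                      # sample keys
    h = ring_hash(f"key{k}")
    # First ring position clockwise from h owns the key (wrap to the start).
    owner = next((node for token, node in ring if token >= h), ring[0][1])
    owners[owner] += 1

print(owners)   # far more even than with a single token per node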
Replication
 RackUnaware
[Figure: ring of nodes A–J; data1 is stored on its coordinator and replicated to the nodes that follow it on the ring]
 RackAware
 DataCenterShared
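
A sketch of the rack-unaware placement from the Cassandra paper: the coordinator plus the next N-1 nodes clockwise on the ring hold the replicas (RackAware and DatacenterShared additionally constrain which successors qualify). The helper below and its toy ring are illustrative assumptions.

def replicas_rack_unaware(ring, key_token, n):
    """ring: list of (token, node) sorted by token; returns n replica nodes."""
    # Coordinator = first node clockwise from the key's token, then keep walking.
    start = next((i for i, (token, _) in enumerate(ring) if token >= key_token), 0)
    return [ring[(start + i) % len(ring)][1] for i in range(n)]

ring = [(10, "A"), (20, "B"), (30, "C"), (40, "D")]    # toy token ring
print(replicas_rack_unaware(ring, key_token=25, n=3))  # ['C', 'D', 'A']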
Cluster Membership
 Gossip Protocol is used for cluster membership
 Super lightweight with mathematically provable properties
 State disseminated in O(logN) rounds
 Every T seconds each member increments its heartbeat counter and selects one other member to send its list to
 A member merges the received list with its own list
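
A minimal sketch of one gossip round as described above: bump the local heartbeat, pick a random peer, and both sides keep the freshest entry per server. The data layout is an assumption made for illustration.

import random

def gossip_round(views, name):
    """views: {node: {node: heartbeat}}; one gossip round initiated by `name`."""
    me = views[name]
    me[name] = me.get(name, 0) + 1                      # increment own heartbeat
    peer = random.choice([n for n in views if n != name])
    merged = {k: max(me.get(k, 0), views[peer].get(k, 0))
              for k in set(me) | set(views[peer])}      # keep freshest entry
    views[name], views[peer] = dict(merged), dict(merged)

views = {"server1": {}, "server2": {}, "server3": {}}
for _ in range(10):                                     # a few rounds suffice
    for node in list(views):
        gossip_round(views, node)
print(views)   # every view now lists all servers with recent heartbeats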
Gossip Protocol
[Figure: gossip example — over rounds t1–t6, servers 1–3 exchange and merge their (server: last-heartbeat-time) lists until every server holds the freshest entries]
Accrual Failure Detector
 Valuable for system management, replication, load balancing
 Designed to adapt to changing network conditions
 The value output, PHI, represents a suspicion level
 Applications set an appropriate threshold, trigger suspicions, and perform appropriate actions
 In Cassandra the average time taken to detect a failure is 10–15 seconds with the PHI threshold set at 5
Φ = -log10( P_later(t_now - t_last) )
where P_later(t) = e^(-λt), the probability (for exponentially distributed heartbeat inter-arrival times) that the next heartbeat arrives more than t after the last one
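
A small sketch of the suspicion value under the exponential inter-arrival model above; the one-second mean heartbeat interval is an assumption chosen only to show where the threshold of 5 lands.

import math

def phi(t_since_last, mean_interarrival):
    """Suspicion level: -log10 of the chance a heartbeat still arrives after t."""
    p_later = math.exp(-t_since_last / mean_interarrival)
    return -math.log10(p_later)

# Assuming heartbeats about once per second, PHI crosses the threshold of 5
# after roughly 11.5 seconds of silence, in line with the 10-15 s figure above.
for t in (1, 5, 11.5, 15):
    print(t, round(phi(t, mean_interarrival=1.0), 2))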
Bootstrapping
 New node gets assigned a token such that it can alleviate a heavily loaded node
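
A toy sketch of that policy (an illustration, not Cassandra's actual token-selection code): the new node picks a token in the middle of the most heavily loaded node's range.

def bootstrap_token(ring, load):
    """ring: sorted list of (token, node); load: {node: key count}."""
    hottest = max(load, key=load.get)                        # most loaded node
    i = next(i for i, (_, node) in enumerate(ring) if node == hottest)
    prev_token = ring[i - 1][0]            # hottest node owns (prev_token, token]
    return (prev_token + ring[i][0]) // 2  # midpoint splits that range in two

ring = [(100, "N1"), (200, "N2"), (300, "N3")]
load = {"N1": 10, "N2": 80, "N3": 12}
print(bootstrap_token(ring, load))   # 150: the new node takes half of N2's range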
WRITE


 Interface
– Simple: put(key, col, value)
– Complex: put(key, [col:val, …, col:val])
– Batch
WRITE Operation


 Commit log for durability
– Configurable fsync
– Sequential writes only
 MemTable
– No disk access (no reads or seeks)
 SSTables are final
– Read-only
– Indexes
 Always writable
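
A highly simplified sketch of that write path (append-only commit log, in-memory memtable, immutable SSTables); the class and its flush policy are illustrative assumptions, not Cassandra's actual implementation.

class TinyStore:
    """Illustrative write path: commit log -> memtable -> flush to SSTable."""
    def __init__(self, memtable_limit=4):
        self.commit_log = []     # sequential, append-only writes (durability)
        self.memtable = {}       # in-memory; no disk reads or seeks on write
        self.sstables = []       # immutable, sorted, read-only tables
        self.memtable_limit = memtable_limit

    def put(self, key, col, value):
        self.commit_log.append((key, col, value))   # a configurable fsync fits here
        self.memtable.setdefault(key, {})[col] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # The memtable is written out as a new immutable, sorted SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()   # simplification: log entries are now obsolete

store = TinyStore()
store.put("userid1", "Email", "uname1@abc.com")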
READ
 Interface
– get(key, column)
– get_slice(key, SlicePredicate)
– get_range_slices(keyRange, SlicePredicate)
 READ
– Practically lock-free
– SSTable proliferation
– Row cache
– Key cache
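
A matching read-path sketch: check the memtable first, then SSTables from newest to oldest (in practice the row cache, key cache, and Bloom filters short-circuit much of this). The function and sample data are assumptions for illustration.

def get(memtable, sstables, key, column):
    """Read path sketch: memtable first, then SSTables newest-to-oldest."""
    row = memtable.get(key)
    if row and column in row:
        return row[column]
    for table in reversed(sstables):          # the newest SSTable wins
        row = table.get(key)
        if row and column in row:
            return row[column]
    return None

memtable = {"userid1": {"Email": "new@abc.com"}}
sstables = [{"userid1": {"Email": "old@abc.com", "Tel": "123-4567"}}]
print(get(memtable, sstables, "userid1", "Email"))   # -> new@abc.com
print(get(memtable, sstables, "userid1", "Tel"))     # -> 123-4567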
Consistency Level
 Tuning the consistency level for each WRITE/READ operation
Write Operation:
– ZERO: Hail Mary
– ANY: 1 replica
– ONE: 1 replica
– QUORUM: (N/2)+1 replicas
– ALL: all replicas
Read Operation:
– ZERO: N/A
– ANY: N/A
– ONE: 1 replica
– QUORUM: (N/2)+1 replicas
– ALL: all replicas
Performance Benchmark
 Random and sequential writes
– Limited by bandwidth
 Facebook Inbox Search
– Two kinds of search: Term Search and Interactions
– 50+ TB on a 150-node cluster
Latency stat   Search Interactions   Term Search
Min            7.69 ms               7.78 ms
Median         15.69 ms              18.27 ms
Max            26.13 ms              44.41 ms
vs MySQL with 50GB Data
 MySQL
– ~300 ms write
– ~350 ms read
 Cassandra
– ~0.12 ms write
– ~15 ms read
Case Study
 Cassandra as primary data store
 Datacenter and rack-aware replication
 ~1,000,000 ops/s
 high sharding and low replication
 Inbox Search
– 100 TB
– 5,000,000,000 writes per day
Conclusions
 Cassandra
– Scalability
– High performance
– Wide applicability
 Future work
– Compression
– Atomicity
– Secondary indexes