Cassandra: A DHT-Based Storage System
Vaibhav Shankar, Graduate Student, Indiana University Bloomington
Abstract— Cassandra is a structured distributed storage system initially developed by Facebook when relational database solutions proved too slow. It has since evolved into a highly scalable open source project managed by the Apache Software Foundation. In this paper, we describe the motivation behind the project, its detailed design, and some real-world case studies of Cassandra's performance. We end the survey with a section on the limitations of the system and the further work planned on this project.
Index Terms— cassandra, database, distributed hash table,
distributed systems
1. Introduction
Cassandra[1] is a distributed data storage system developed initially by Facebook engineers to facilitate large-scale distributed search in a rapidly resource-fluctuating network. It has since been adopted by the Apache Software Foundation in an effort to promote development from a more widespread community. Relational databases, largely due to their tight structures, tend to scale badly when required for large-scale distributed stores. Cassandra is an effort to effectively manage distributed data stores for maximum performance.
This paper is structured as follows. Section 2 gives a brief history of related work which led to Cassandra's development. Section 3 describes the implementation of Cassandra, with a look at the data model and structures which make it different from conventional data stores. Section 4 describes a case study of Facebook's Inbox Search and how Cassandra provides rapid search ability over such a massive data store. Section 5 describes the limitations of the system and the future work planned on the project.
2. Related Work
Distributing data for performance, availability and
durability has been widely studied in the file system and
database communities. P2P storage systems such as Bittorrent[2] were among the early proponents of this technology. P2P systems, however, typically support flat namespaces, a concept extended by distributed file systems, which typically support hierarchical namespaces. Systems like Ficus[3] and Coda[4] replicate files for high availability at the expense of consistency. Update conflicts are typically managed using specialized conflict resolution procedures. The Google File System (GFS)[5] is another distributed file system built for hosting the state of Google's internal applications. GFS uses a simple design with a single master server hosting the entire metadata (playing a role analogous to UNIX inodes), while the data is split into chunks and stored in chunk servers. However, the GFS master has since been made fault tolerant using the Chubby abstraction.
Bayou[6] is a distributed relational database system that
allows disconnected operations and provides eventual data
consistency. Among these systems, Bayou, Coda and Ficus
allow disconnected operations and are resilient to issues such
as network partitions and outages. These systems differ on
their conflict resolution procedures. All of them however,
guarantee eventual consistency. Similar to these systems,
Dynamo[7] allows read and write operations to continue even
during network partitions and resolves update conflicts using
different conflict resolution mechanisms, some client driven.
Traditional replicated relational database systems focus on
the problem of guaranteeing strong consistency of replicated
data. Although strong consistency provides the application
writer a convenient programming model, these systems are
limited in scalability and availability [8]. These systems are
not capable of handling network partitions because they
typically provide strong consistency guarantees. Cassandra tries to address these issues by offering a customizable, eventual consistency guarantee. As we will demonstrate in
the following sections, the speed and scale obtained by
Cassandra more than make up for the overhead of eventual
consistency.
3. Technical Specifications
This section is sub-divided into three parts: the first describes a generic DHT system, the second goes over the data model of Cassandra and how it differs from conventional relational database systems, and the last describes how read and write operations work in this data storage system.
3.1 Distributed Hash Tables
Distributed hash tables (DHTs) are a very well studied distributed data structure, used extensively by peer-to-peer systems and later adopted by almost all large-scale distributed data stores. We describe distributed hash tables as a simplified concept, with 'nodes' representing the systems in the distributed system participating in storage of data. Figure 1 shows the basic structure of a Chord-based DHT.
The Chord-based DHT works as follows. First, a good hashing algorithm such as SHA-1 is used to translate keys to a 160-bit address space. This address space is then divided equally into a reduced node space which defines the maximum number of participating nodes; this number is typically a power of 2. Every node is assigned an ID by applying this hashing algorithm to its IP address (or other unique identification). Every file (or data object) which needs to be stored is given an ID by hashing its key using the same scheme. The file (or data object) is then stored on the first node which succeeds the hashed location.
Fig 1: Chord based DHT

Several improvements have been made over years of distributed systems research. More complex schemes involving finger tables for faster lookup of the destination node are now available, which make DHTs very fast and favorable for managing distributed data stores.
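To make the scheme concrete, here is a minimal Python sketch of the basic Chord-style lookup described above (the node addresses and key are hypothetical, and the address space is reduced by a simple modulo rather than by a full Chord implementation):

import hashlib
from bisect import bisect_left

def chord_id(value, bits=16):
    # Hash into a reduced 2^bits address space using SHA-1.
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

# Each node derives its ID by hashing its IP address (hypothetical IPs).
nodes = sorted(chord_id(ip) for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"])

def lookup(key):
    # A key is stored on the first node whose ID succeeds its hash,
    # wrapping around the ring if necessary.
    h = chord_id(key)
    i = bisect_left(nodes, h)
    return nodes[i % len(nodes)]

print(lookup("some-row-key"))

A real Chord implementation would add finger tables so that each lookup touches only O(log N) nodes instead of consulting a global node list.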
3.2 Cassandra data model
At its core, Cassandra uses a Chord-based DHT to look up data. However, its data model is significantly different from that of conventional relational databases and is described in detail here. We build the description of the model from the bottom up, starting with atomic data structures and building up to the whole database.
Columns
The column is the lowest/smallest increment of data. It's a
tuple (triplet) that contains a name, a value and a timestamp.
Here's the interface definition of a Column:
struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}
name represents metadata about the column (e.g. 'email') and value represents the value of the corresponding property (e.g. 'foo@bar.org'). The timestamp field is used to resolve conflicts in favor of the latest copy and is generally provided by the client. This does necessitate that the participating nodes in the cluster be clock synchronized.
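As an illustration of this timestamp-based reconciliation, the following Python sketch (our own simplification, not Cassandra's actual code) resolves two conflicting copies of a column by keeping the one with the later timestamp:

from dataclasses import dataclass

@dataclass
class Column:
    name: bytes
    value: bytes
    timestamp: int  # typically microseconds, supplied by the client

def reconcile(a, b):
    # Last write wins: the copy with the later timestamp survives.
    return a if a.timestamp >= b.timestamp else b

old = Column(b"email", b"foo@bar.org", timestamp=1000)
new = Column(b"email", b"foo@baz.org", timestamp=2000)
assert reconcile(old, new).value == b"foo@baz.org"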
Column Families

A column family is a container for columns, analogous to a table in a relational system. A column family holds an ordered list of columns, which are referenced by column name.

Column families have a configurable ordering (sort) applied to the columns within each row. Out-of-the-box ordering implementations include ASCII, UTF-8, Long, and UUID (lexical or time). However, APIs are provided for writing one's own orderings as well, for more complex structures.

Note that unlike conventional relational database models, there is no relation between one column family and another per se. Concepts such as foreign keys and table joins are not implicitly provided (or required) by Cassandra's model.

Rows

In Cassandra, each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order. A row does not keep related column families together; Cassandra maintains no information about related column families. Instead, related columns that are accessed together should be kept within the same column family.

The row key is what determines which machine the data is stored on. This key is used by the hashing algorithm to determine the ID of the node on which the row is stored. Thus, for each key you can have data from multiple column families associated with it.
A JSON representation of the row key -> column families -> column structure is:

{
  "mccv": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "foo@bar.com"},
      "webSite": {"name": "webSite", "value": "http://bar.com"}
    },
    "Stats": {
      "visits": {"name": "visits", "value": "243"}
    }
  },
  "user2": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "user2@bar.com"},
      "twitter": {"name": "twitter", "value": "user2"}
    }
  }
}
Note that the key "mccv" identifies data in two different column families, "Users" and "Stats". This does not imply that data from these column families is related. The semantics of having data for the same key in two different column families is entirely up to the application. Also note that within the "Users" column family, "mccv" and "user2" have different column names defined. This is perfectly valid in Cassandra. In fact, there may be a virtually unlimited set of column names defined, which leads to fairly common use of the column name as a piece of runtime-populated data. This differs greatly from conventional RDBMS systems, which have very structured tables with exactly the same set of columns in every row. As far as application programming goes, Cassandra gives us more freedom to express data clearly, much as modern-day XML-based systems do.
Super Columns
So far we've covered "normal" columns and rows. Cassandra
also supports super columns: columns whose values are
columns; that is, a super column is a (sorted) associative
array of columns. This is perhaps one of the most powerful
concepts used by applications using Cassandra.
One can thus think of columns and super columns in terms of
maps: A row in a regular column family is basically a sorted
map of column names to column values; a row in a super
column family is a sorted map of super column names to
maps of column names to column values.
A JSON description of this layout:
{
  "mccv": {
    "Tags": {
      "cassandra": {
        "incubator": {"incubator": "http://incubator.apache.org/cassandra/"},
        "jira": {"jira": "http://issues.apache.org/jira/browse/CASSANDRA"}
      },
      "thrift": {
        "jira": {"jira": "http://issues.apache.org/jira/browse/THRIFT"}
      }
    }
  }
}
Here the column family is "Tags". We have two super
columns defined here, "cassandra" and "thrift". Within these
we have specific named bookmarks, each of which is a
column.
Just like normal columns, super columns are sparse: each row may contain as many or as few as needed; Cassandra imposes no restrictions.
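The two layouts can be summarized in Python type terms (an illustrative sketch of the shapes involved, not an actual Cassandra API):

from typing import Dict

ColumnName = bytes
ColumnValue = bytes
SuperColumnName = bytes

# A row in a regular column family: a sorted map of column names to values.
RegularRow = Dict[ColumnName, ColumnValue]

# A row in a super column family: a sorted map of super column names
# to maps of column names to values.
SuperRow = Dict[SuperColumnName, Dict[ColumnName, ColumnValue]]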
3.3 Reading and Writing to data store
Cassandra employs a very flexible mechanism to ensure
consistency of reads and writes. Figure 2 shows the basic
architecture of the storage system.
Fig 2: Cassandra I/O architecture
We describe writes first, though the options for reads are more or less the same. A write request can arrive at any node in the cluster. First, the information is written to a commit log; the write operation does not return until the commit log is written. After this, the data is written into an in-memory data structure called a 'memtable' maintained on the node. When the memtable becomes full, its contents are written to disk as an 'SSTable', very similar to Google's Bigtable. Note that these disks are available on the network and not directly attached to the current node. Replication options are provided, each of which determines the level of consistency (and to some extent the performance) of the system. Suppose we want M copies of the data to be present in the storage system. Cassandra provides options wherein we can a) send the request and hope for the best (zero acks), b) wait for one successful response, c) wait for a quorum (M/2 + 1) of responses, or d) wait for all responses. Typically, a quorum is preferred, which provides a good balance of consistency and speed.
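The acknowledgement counts behind these options can be summarized in a small Python helper (the names are ours, not Cassandra's API), where m is the desired number of copies M:

def required_acks(level, m):
    # How many replica responses the coordinator waits for.
    return {
        "zero": 0,             # send and hope for the best
        "one": 1,              # one successful response
        "quorum": m // 2 + 1,  # majority of the M replicas
        "all": m,              # every replica must respond
    }[level]

assert required_acks("quorum", 3) == 2  # with M = 3, two acks suffice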
SSTables use Bloom filters to check for a key's existence before searching whole files on disk; this comes in handy when we perform reads and do not find the data in the memtable. Reads tend to be a little slower than writes, mostly because writes are sequential appends with no seek overhead, while a read that misses the memtable may have to consult SSTables on disk.
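The idea can be illustrated with a toy Bloom filter in Python (a sketch, not Cassandra's implementation); a negative answer is always correct, so an SSTable can safely be skipped whenever its filter reports the key absent:

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from salted SHA-1 hashes of the key.
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row-key-1")
assert bf.might_contain("row-key-1")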
4. Case study – Facebook Inbox search
Facebook Inbox Search is an internal Facebook functionality wherein a per-user index is maintained of all messages that have been exchanged between the sender and the recipients of a message. Two kinds of search features are enabled: (a) term search, and (b) interactions, which, given the name of a person, returns all messages that the user might ever have sent to or received from that person. The schema consists of two column families. For query (a), the user id is the key and the words that make up the message become the super columns. Individual message identifiers of the messages that contain the word become the columns within the super column. For query (b), again the user id is the key and the recipients' ids are the super columns. For each of these super columns the individual message identifiers are the columns. A sketch of this layout is shown below.
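Expressed as nested maps (with hypothetical user, term, and message ids), the two column families look roughly like this:

# Column family (a): term search.
# Row key = user id; super columns = words; columns = message ids.
term_search = {
    "user17": {
        "vacation": {"msg101": b"", "msg245": b""},
        "flight":   {"msg101": b""},
    },
}

# Column family (b): interactions.
# Row key = user id; super columns = recipient ids; columns = message ids.
interactions = {
    "user17": {
        "user42": {"msg101": b"", "msg245": b""},
    },
}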
In order to make the searches fast, Cassandra provides certain hooks for intelligent caching of data. For instance, when a user clicks into the search bar, an asynchronous message is sent to the Cassandra cluster to prime the buffer cache with that user's index. This way, when the actual search query is executed, the search results are likely to already be in memory. The system currently stores about 50+ TB of data on a 150-node cluster. Some performance excerpts derived from the original paper are listed here:
Latency Stat    Search Interactions    Term Search
Min             7.69ms                 7.78ms
Median          15.69ms                18.27ms
Max             26.13ms                44.41ms
5. Limitations

• All data for a single row must fit (on disk) on a single machine in the cluster. Because row keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound.

• A single column value may not be larger than 2GB; this limit is hardcoded and unlikely to change.

• Cassandra has two levels of indexes: key and column. But in super column families there is a third level of subcolumns; these are not indexed, and any request for a subcolumn deserializes all the subcolumns in that supercolumn.
6. Conclusion
Cassandra is a storage system providing scalability, high
performance, and wide applicability. It has been clearly
demonstrated via this survey and its sources that Cassandra
can support a very high update throughput while delivering
low latency. Future work involves adding compression, the ability to support atomicity across keys, and secondary index support.
7. Acknowledgment
I would like to thank Dr. Judy Qiu for her guidance and assistance in choosing and finding relevant material on the Cassandra project. I would further like to thank all the Associate Instructors and fellow students in the B534 Distributed Systems class at IU Bloomington for their valuable discussion and insight into this survey.
References

[1] A. Lakshman and P. Malik. Cassandra: A Decentralized Structured Storage System.
[2] B. Cohen. BitTorrent: a new P2P app.
[3] P. Reiher, J. Heidemann, D. Ratner, G. Skinner, and G. Popek. Resolving File Conflicts in the Ficus File System.
[4] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere. Coda: A Highly Available File System for a Distributed Workstation Environment.
[5] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System.
[6] D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. H. Hauser. Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System.
[7] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's Highly Available Key-Value Store.
[8] J. Gray and P. Helland. The Dangers of Replication and a Solution.