Introduction to Multi-Data Center Operations
with Apache Cassandra and DataStax Enterprise
White Paper
BY DATASTAX CORPORATION
October 2013
Table of Contents
Abstract
Introduction
The Growth in Multiple Data Centers
A Brief Multi-Data Center Database Checklist
A Look at Apache Cassandra
Cassandra and Multiple Data Centers
Multi-Data Center Performance
Running Analytics and Search Across Multiple Data Centers
Options for Multi-Data Center Analytics and Search
A Look at DataStax Enterprise
What About the Cloud?
Managing and Monitoring Multi-Data Center Deployments
Multi-Data Center Customer Examples
Conclusion
About DataStax
Abstract
Many modern businesses serve customers all around the world, with database applications that
need to be always available even if a disaster hits a particular region. A database that easily
spans multiple data centers and/or the cloud ensures the fastest possible response times for
customers and employees who are geographically separated. A multi-data center database also
protects information from loss in the event that a single data center experiences a disaster. This
paper discusses why multi-data center databases are fast becoming the new norm for database
operations, and how Apache Cassandra™ and DataStax Enterprise can provide a smart and
agile data store that is truly location-independent.
Introduction
Many modern businesses have external-facing database applications that are growing
dramatically and that serve a geographically dispersed customer base. Numerous
companies also have workforces that are highly distributed in nature, with each employee
needing fast access to the same corporate information no matter where they happen to be
located.
A database that easily spans multiple data centers and/or the cloud ensures the fastest
possible response times (both read and write) for customers and employees who are
geographically separated. A multi-data center database also provides a number of other
benefits such as protecting information from loss in the event that a single data center
experiences a disaster.
This paper discusses why multi-data center databases are fast becoming the new norm for
database operations, along with what characteristics a database must possess to run across
many data centers and the cloud at once. It then shows how Apache Cassandra and
DataStax Enterprise can be configured to run across multiple data centers and cloud
providers, meeting the needs of those who want a smart and agile datastore that is truly
location-independent.
The Growth in Multiple Data Centers
A 2012 article in InfoWorld reported interesting statistics about the growth of multi-data
center operations. In its latest poll of data center managers, the Uptime Institute found that 80
percent of respondents had built a new data center or upgraded an existing facility within the
past five years.[1]
The same article cited another study of the North American data center market done by Digital
Realty Trust. In that study, 92 percent of respondents said their companies will definitely or
probably expand their data center space in 2012 – the highest percentage reported in six years.
This news, coupled with the fact that data centers are primarily put in place to hold (no surprise)
corporate data, makes it plain to see that the need for databases that can easily span and
interact between multiple data centers is only going to escalate – and likely at a rapid clip.
[1] “Large enterprises handing off data center builds as demand booms,” by Ann Bednarz,
InfoWorld, April 23, 2012: http://www.infoworld.com/d/data-center/large-enterprises-handing-datacenterbuilds-demand-booms-191472.
Why Multi-Data Center Datastores?
The reasons why a multi-data center datastore is needed vary. Some use cases involve just the
simple desire for a good disaster recovery plan.
But the majority of multi-data center use cases revolve around keeping one logical
database in sync across 1-N physical data centers and delivering the fastest possible
response times to the users that each data center serves.
One other factor contributing to the multi-data center discussion is big data. Those familiar with
the term “big data” normally can recite the “three V’s” of what makes up big data: velocity,
volume, and variety. However, one overlooked aspect of big data systems is “complexity,” which,
according to Gartner Inc., involves the domain of managing data across many different data
centers, time zones, geographies, and so forth.[2]
Distributing data across many different data centers and the cloud is not an easy task with
traditional databases. When one adds characteristics of data that is coming in at extremely high
rates of speed from many places, data that is of varying formats, and data that can involve
heavy volumes, the job becomes even harder.
A Brief Multi-Data Center Database Checklist
Even outside of big data environments, legacy relational databases (RDBMSs), the primary
datastores for most businesses, have traditionally provided minimal support for multi-data
centers. Other than basic replication or one-way mirroring, all RDBMS vendors lack key built-in
features needed by modern applications that require a datastore that spans many different data
centers and/or cloud geographies.
This raises the question: What are the features and capabilities that a modern database /
datastore needs to meet the demands of multi-data center operations? Does it just equate to log
shipping, mirroring between data centers, or master-slave replication – or is it something else?
Increasingly, the must-have short list from those wanting modern multi-data center capabilities
includes the following:
• The ability to span 1-N data centers, and not just two. This includes the agility to handle
  multiple cloud geo-zones as well.
• Multidirectional syncs between all participating data centers, and not just one way. Or,
  in other words, the desire to have truly location-independent, read-and-write-anywhere
  freedom.
• Built-in network intelligence, so that data is smartly transferred between data centers to
  minimize bandwidth overload and latency issues.
• The ability to support the required type of data traffic across data centers (e.g. real-time,
  analytic, search).
• Capabilities for handling big data use cases in a way where all data centers appear as
  just one logical database to an end user application.
[2] “‘Big Data’ Is Only the Beginning of Extreme Information Management,” by Beyer, et al., Gartner
Group Inc., April 7, 2011: http://www.gartner.com/id=1622715.
Pulling this off is not easy unless one starts with the right database architecture and feature set.
Traditional master-slave designs inherent in RDBMSs and some NoSQL solutions are often
impractical, because they cannot meet the requirement for true location independence.
Fortunately, Apache Cassandra possesses the right blend of technical features and big data
capabilities to handle modern multi-data center and cloud deployments.
A Look at Apache Cassandra
Apache Cassandra is a massively scalable NoSQL database. Cassandra’s technical roots can be
found at companies recognized for their ability to effectively tackle big data – Google, Amazon,
and Facebook.
Used today by numerous modern businesses to manage their critical data infrastructure,
Cassandra is known as the solution technical professionals turn to when they need a
real-time NoSQL database that delivers high performance at massive scale and stays
continuously available.
Rather than using a legacy master-slave or a manual and
difficult-to-maintain sharded design, Cassandra has a
peer-to-peer (or “masterless”) distributed “ring” architecture that is
elegant and easy to set up and maintain. In Cassandra, all nodes
are the same; there is no concept of a master node, with all
nodes communicating with each other via a gossip protocol.
Cassandra’s built-for-scale architecture means that it is
capable of handling terabytes of information and thousands
of concurrent users/operations per second across one to
many data centers as easily as it can manage much smaller
amounts of data and user traffic. It also means that, unlike other master-slave or sharded
systems, Cassandra has no single point of failure and therefore is capable of offering true
continuous availability.
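The ring idea can be illustrated with a small sketch. This is a simplified toy model and not Cassandra's actual implementation: the class name ToyRing, the function names, the node names, and the use of MD5 as a stand-in hash are all assumptions made for illustration. It shows only the core property described above: every node occupies a position on the ring, no node is special, and the nodes that hold a given key are found by walking the ring from the key's position.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (toy hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """A masterless ring: every node has a position and no node is special."""

    def __init__(self, node_names):
        if not node_names:
            raise ValueError("a ring needs at least one node")
        # Sort nodes by their ring position so the ring can be walked in order.
        self._ring = sorted((token(name), name) for name in node_names)
        self._tokens = [t for t, _ in self._ring]

    def replicas_for(self, key: str, copies: int):
        """Walk clockwise from the key's position and pick `copies` distinct nodes."""
        if copies < 1 or copies > len(self._ring):
            raise ValueError("copies must be between 1 and the number of nodes")
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(copies)]


# Hypothetical six-node ring; any node could answer this lookup identically.
ring = ToyRing(["node1", "node2", "node3", "node4", "node5", "node6"])
print(ring.replicas_for("customer:42", copies=3))
```

Because the lookup depends only on the key and the ring layout, any node can work out where a piece of data lives, which is why no master node is needed in this model.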
Cassandra and Multiple Data Centers
Cassandra’s architecture is tailor-made for multiple data centers. Its peer-to-peer design (vs.
legacy master-slave implementations) coupled with online scale-out and full redundancy that
offers no single points of failure and continuous availability make it ideal in multi-data center
environments.
Because Cassandra has a masterless architecture, all nodes are the same and all nodes offer full
read/write capabilities in a database cluster, regardless of where those nodes are physically
located. A single Cassandra ring (or database cluster) can certainly exist at just one physical
data center. However, Cassandra can easily support a single database spanning multiple data
centers, where each data center holds its own copy of the database and can have as many
nodes as needed for supporting that site:
Figure 2: A Single Cassandra Database with Multiple Data Centers
Creating a database that spans multiple data centers in Cassandra is easy and is accomplished
via the definition of a new database. Once the database software has been installed on all
machines in all participating data centers and is running, and network communication has been
established among all the nodes, a keyspace (analogous to an RDBMS database) is created
using Cassandra’s CQL language.
Within the definition of a keyspace, each data center is identified (with the ID matching
configuration parameters that have been previously set) along with the number of copies of the
data that the keyspace will hold in each data center. For example, the syntax below creates a
new keyspace named Globalbiz, with three data centers (DC1, DC2, and DC3): the first and
second each holding six copies of the data (for fault tolerance purposes) and the third data
center holding three copies:
CREATE KEYSPACE Globalbiz
WITH REPLICATION = {'class': 'NetworkTopologyStrategy',
'DC1': 6, 'DC2': 6, 'DC3': 3};
Once this command successfully executes, all data will then be automatically and transparently
replicated between all nodes in all data centers with no further work being necessary on the part
of any developer or administrator.
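When the set of data centers changes from one environment to another, it can help to generate the keyspace definition rather than type it by hand. The following is a minimal sketch that assumes the data center names and copy counts are kept in a Python dictionary. The function names build_create_keyspace and total_copies are hypothetical helpers introduced here and are not part of Cassandra or any driver; the sketch only builds the statement text and does not connect to a cluster.

```python
def build_create_keyspace(name: str, copies_per_dc: dict) -> str:
    """Return a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    if not copies_per_dc:
        raise ValueError("at least one data center is required")
    for dc, copies in copies_per_dc.items():
        if not isinstance(copies, int) or copies < 1:
            raise ValueError(f"copy count for {dc!r} must be a positive integer")
    # The replication map lists the strategy class first, then one entry per data center.
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {copies}" for dc, copies in copies_per_dc.items()]
    body = ", ".join(options)
    return "CREATE KEYSPACE " + name + "\nWITH REPLICATION = {" + body + "};"


def total_copies(copies_per_dc: dict) -> int:
    """Total number of copies of each piece of data across all data centers."""
    return sum(copies_per_dc.values())


# The Globalbiz example from the text: six copies in DC1 and DC2, three in DC3.
layout = {"DC1": 6, "DC2": 6, "DC3": 3}
print(build_create_keyspace("Globalbiz", layout))
print(total_copies(layout))  # 15 copies in total
```

The data center names passed in must still match the names configured on the nodes, as described above.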
Multi-Data Center Performance
One reason for multi-data center deployment is to keep copies of a database close to users of a
particular data center/geographic region, with the end result being faster performance for both
reads and writes.
But what about performance across data centers? Won’t updating many nodes in many different
data centers put too heavy a load on a database cluster?
To address this concern, Cassandra has built-in intelligence to send only a single data stream
from one data center to each of the others participating in a multi-data center cluster. Once the
data has reached one of the nodes in a different data center, that node then takes responsibility
for updating all other nodes in its data center that are responsible for holding that piece of data.
Figure 3: Cross-Data Center Writes in Cassandra
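The saving from this forwarding scheme can be made concrete with a small model. The sketch below is an illustration under stated assumptions rather than Cassandra's internal code: the function name plan_write and the node names are hypothetical, and the choice of the first listed node in each remote data center as the forwarding node is arbitrary, since the text does not say how that node is selected. The model counts how many messages cross between data centers for a single write.

```python
def plan_write(replicas_by_dc: dict, local_dc: str) -> dict:
    """Plan the message hops for one write under the forwarding scheme in the text.

    replicas_by_dc maps a data center name to the list of nodes holding the data.
    """
    if local_dc not in replicas_by_dc:
        raise KeyError(f"unknown local data center: {local_dc}")
    cross_dc = []  # (data center, forwarding node): one message per remote data center
    intra_dc = []  # (data center, node): hops that stay inside a data center
    for dc, nodes in replicas_by_dc.items():
        if not nodes:
            continue
        if dc == local_dc:
            intra_dc += [(dc, node) for node in nodes]
        else:
            # Assumption: the first listed node receives the single cross-DC stream.
            forwarder, rest = nodes[0], nodes[1:]
            cross_dc.append((dc, forwarder))
            intra_dc += [(dc, node) for node in rest]
    return {"cross_dc": cross_dc, "intra_dc": intra_dc}


# Hypothetical node names for the Globalbiz layout (6, 6, and 3 copies).
counts = {"DC1": 6, "DC2": 6, "DC3": 3}
replicas = {dc: [f"{dc.lower()}-n{i}" for i in range(1, n + 1)] for dc, n in counts.items()}

plan = plan_write(replicas, local_dc="DC1")
naive = sum(n for dc, n in counts.items() if dc != "DC1")
print(len(plan["cross_dc"]), "cross-data center messages instead of", naive)
```

In this model a write that starts in DC1 crosses the wide-area network twice (once per remote data center) rather than nine times (once per remote copy), with the remaining copies updated over each data center's local network.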
Running Analytics and Search Across Multiple Data Centers
In addition to managing real-time data across multiple data centers, many modern businesses
also wish to run analytic and enterprise search operations that span more than one data center.
As with real-time data, implementing cross-data center operations for analytics and search data
has proven to be no easy task.
Options for Multi-Data Center Analytics and Search
The need for multi-data center support for analytics and enterprise search has not been lost on
those developing and supporting analytics and search technology like Apache Hadoop and
Apache Solr.
Today, Apache Hadoop offers a warm standby option that can be configured to go to a different
data center. Third-party Hadoop vendors also offer solutions with one-way mirror capabilities.
For Solr, writes to Solr indexes in the community version of Solr cannot span multiple data
centers. Instead, there is only replication support to another node in a different data center via
rsync.
Both the open source versions of Hadoop and Solr and those offered by third-party
software vendors miss the mark where the criteria for operating a datastore in a multi-data
center environment are concerned. However, DataStax Enterprise, offered by DataStax, supplies
not only multi-data center support that meets the criteria suggested earlier in this paper for
real-time/online data, but also delivers the same enterprise support for running analytics and
search on Cassandra data across multiple data centers.
A Look at DataStax Enterprise
DataStax is the most trusted provider of Cassandra, employing the Apache chair of the
Cassandra project as well as most of the committers. For enterprises that want to use
Cassandra in production, DataStax supplies DataStax Enterprise Edition, which includes an
enterprise-ready version of Cassandra plus built-in security and the ability to run analytics and
enterprise search operations on Cassandra data. With DataStax Enterprise, modern businesses
get a complete big data platform that contains:
• A certified version of Cassandra that has passed DataStax’s rigorous internal
  certification process, which includes heavy quality assurance testing, performance
  benchmarking, and defect resolution.
• Integrated analytics on Cassandra data using Hadoop MapReduce, Hive, Pig, Mahout,
  and Sqoop.
• Bundled enterprise search support with Apache Solr.
• Automatic management services that transparently run and take care of many
  administration tasks without IT staff involvement.
• DataStax OpsCenter, a visual management and monitoring tool.
• Expert, 24x7x365 support.
• Certified maintenance releases and platform certification.
What About the Cloud?
Both Cassandra and DataStax Enterprise are fully cloud-enabled and capable of supporting
multiple availability zones in a cloud provider. Further, hybrid deployments are supported so that
a single cluster can span multiple on-premise installations as well as cloud-based
implementations.
Figure 4: Cassandra supports hybrid on-premise/cloud deployments
Managing and Monitoring Multi-Data Center Deployments
Administering and monitoring the performance of any distributed database system can be
challenging, especially when the database spans multiple geographical locations. However,
DataStax makes it easy to manage multi-data center databases with DataStax OpsCenter.
DataStax OpsCenter is a visual management and monitoring solution for Cassandra. Because
DataStax OpsCenter is web based, developers or administrators can easily manage and monitor
all aspects of their databases from any desktop, laptop, or tablet without installing any client
software. This includes databases that span multiple data centers and the cloud.
Figure 5: Managing a 9-node Cassandra cluster with DataStax OpsCenter
Multi-Data Center Customer Examples
Figure 6: A sample of companies and organizations using Cassandra in production
Some DataStax customers using Cassandra and DataStax Enterprise across multiple data
centers and the cloud include:
• Netflix – has over 500 nodes of Cassandra running in multiple clusters and geo-zones
  on Amazon.
• eBay – has over 200 TB in DataStax Enterprise across three data centers.
• HealthX – supports their online patient and provider portal with DataStax Enterprise
  running in multiple geographies on Amazon.
• ReachLocal – uses DataStax Enterprise in six different data centers across the world to
  support their global online advertising business.
• Pantheon Systems – uses Cassandra across multiple data centers to deliver their
  cloud-based web development platform.
• Scandit – runs Cassandra across three different data centers to support its mobile
  barcode and product scanning service.
Conclusion
Today’s successful businesses are looking for a modern database management system that
can easily span multiple data centers and handle real-time, analytic, and enterprise search
operations. Cassandra and DataStax Enterprise meet the requirements these businesses have
for multi-data center and cloud support.
To find out more about Cassandra and DataStax, and to obtain downloads of Cassandra and
DataStax Enterprise software, please visit www.datastax.com or send an email to
info@datastax.com. Note that DataStax Enterprise Edition is completely free to evaluate in
development environments, while production deployments require the purchase of a software
subscription.
About DataStax
DataStax powers the big data applications that transform business for more than 300 customers,
including startups and 20 of the Fortune 100. DataStax delivers a massively scalable, flexible
and continuously available big data platform built on Apache Cassandra™. DataStax integrates
enterprise-ready Cassandra and includes the ability to run analytics and search on Cassandra
data across multiple data centers and in the cloud.
Companies such as Adobe, Healthcare Anytime, eBay and Netflix rely on DataStax to transform
their businesses. Based in San Mateo, Calif., DataStax is backed by industry-leading investors:
Lightspeed Venture Partners, Crosslink Capital and Meritech Capital Partners. For more
information, visit www.datastax.com or follow us @DataStax.