Introduction to Multi-Data Center Operations with Apache Cassandra and DataStax Enterprise

White Paper
BY DATASTAX CORPORATION
October 2013

Table of Contents

Abstract
Introduction
The Growth in Multiple Data Centers
A Brief Multi-Data Center Database Checklist
A Look at Apache Cassandra
Cassandra and Multiple Data Centers
Multi-Data Center Performance
Running Analytics and Search Across Multiple Data Centers
Options for Multi-Data Center Analytics and Search
A Look at DataStax Enterprise
What About the Cloud?
Managing and Monitoring Multi-Data Center Deployments
Multi-Data Center Customer Examples
Conclusion
About DataStax

Abstract

Many modern businesses serve customers all around the world, with database applications that need to be always available even if a disaster hits a particular region. A database that easily spans multiple data centers and/or the cloud ensures the fastest possible response times for customers and employees who are geographically separated. A multi-data center database also protects information from loss in the event that a single data center experiences a disaster. This paper discusses why multi-data center databases are fast becoming the new norm for database operations, and how Apache Cassandra™ and DataStax Enterprise can comprise a smart and agile data store that is truly location-independent.

Introduction

Many modern businesses have external-facing database applications that are dramatically growing, and which serve a customer base that is geographically dispersed. Numerous companies also have workforces that are highly distributed in nature, with each employee needing fast access to the same corporate information no matter where they happen to be located. A database that easily spans multiple data centers and/or the cloud ensures the fastest possible response times (both read and write) for customers and employees who are geographically separated.
A multi-data center database also provides a number of other benefits, such as protecting information from loss in the event that a single data center experiences a disaster. This paper discusses why multi-data center databases are fast becoming the new norm for database operations, along with the characteristics a database must possess to run across many data centers and the cloud at once. Focus then turns to how Apache Cassandra and DataStax Enterprise can be easily configured to run across multiple data centers and cloud providers to meet the requirements of those needing a smart and agile datastore that is truly location independent.

The Growth in Multiple Data Centers

A 2012 article in InfoWorld divulged interesting statistics about the rise and growth of multiple data centers. In its latest poll of data center managers, the Uptime Institute discovered that 80 percent of respondents have built a new data center or upgraded an existing facility within the past five years [1]. The same article cited another study of the North American data center market done by Digital Realty Trust. In that study, 92 percent of respondents said their companies will definitely or probably expand their data center space in 2012, the highest percentage reported in six years.

This news, coupled with the fact that data centers are primarily put in place to hold (no surprise) corporate data, makes it plain that the need for databases that can easily span and interact between multiple data centers is only going to escalate, and likely at a rapid clip.

[1] “Large enterprises handing off data center builds as demand booms,” by Ann Bednarz, InfoWorld, April 23, 2012: http://www.infoworld.com/d/data-center/large-enterprises-handing-datacenterbuilds-demand-booms-191472.

Why Multi-Data Center Datastores?

The reasons why a multi-data center datastore is needed vary. Some use cases involve just the simple desire for a good disaster recovery plan.
But the majority of multi-data center use cases revolve around needing to keep one logical database in sync across 1-N physical data centers and to deliver the fastest possible response times to the users that each data center serves.

One other factor contributing to the multi-data center discussion is big data. Those familiar with the term “big data” normally can recite the “three V’s” of what makes up big data: velocity, volume, and variety. However, one overlooked aspect of big data systems is “complexity,” which, according to Gartner Inc., involves the domain of managing data across many different data centers, time zones, geographies, and so forth [2].

Distributing data across many different data centers and the cloud is not an easy task with traditional databases. When one adds data that is coming in at extremely high rates of speed from many places, data that is of varying formats, and data that can involve heavy volumes, the job becomes even harder.

A Brief Multi-Data Center Database Checklist

Even outside of big data environments, legacy relational databases (RDBMSs), the primary datastores for most businesses, have traditionally provided minimal support for multiple data centers. Other than basic replication or one-way mirroring, all RDBMS vendors lack key built-in features needed by modern applications that require a datastore that spans many different data centers and/or cloud geographies.

This raises the question: What are the features and capabilities that a modern database / datastore needs to meet the demands of multi-data center operations? Does it just equate to log shipping, mirroring between data centers, or master-slave replication, or is it something else?

Increasingly, the must-have short list from those wanting modern multi-data center capabilities includes the following:

• The ability to span 1-N data centers, and not just two. This includes the agility to handle multiple cloud geo-zones as well.
• Multidirectional syncs between all participating data centers, and not just one way. In other words, truly location-independent, read-and-write-anywhere freedom.
• Built-in network intelligence, so that data is smartly transferred between data centers to minimize bandwidth overload and latency issues.
• The ability to support the required type of data traffic across data centers (e.g. real-time, analytic, search).
• Capabilities for handling big data use cases in a way where all data centers appear as just one logical database to an end user application.

[2] “‘Big Data’ Is Only the Beginning of Extreme Information Management,” by Beyer, et al., Gartner Group Inc., April 7, 2011: http://www.gartner.com/id=1622715.

Pulling this off is not easy unless one starts with the right database architecture and feature set. The traditional master-slave designs inherent in RDBMSs and some NoSQL solutions are often impractical here, because they cannot meet the requirement for true location independence. Fortunately, Apache Cassandra possesses the right blend of technical features and big data capabilities to handle modern multi-data center and cloud deployments.

A Look at Apache Cassandra

Apache Cassandra is a massively scalable NoSQL database. Cassandra’s technical roots can be found at companies recognized for their ability to effectively tackle big data: Google, Amazon, and Facebook.

Used today by numerous modern businesses to manage their critical data infrastructure, Cassandra is known for being the solution technical professionals turn to when they need a real-time NoSQL database that supplies high performance at massive scale and never goes down.

Rather than using a legacy master-slave or a manual and difficult-to-maintain sharded design, Cassandra has a peer-to-peer (or “masterless”) distributed “ring” architecture that is elegant and easy to set up and maintain.
In Cassandra, all nodes are the same; there is no concept of a master node, and all nodes communicate with each other via a gossip protocol.

Cassandra’s built-for-scale architecture means that it is capable of handling terabytes of information and thousands of concurrent users/operations per second across one to many data centers as easily as it can manage much smaller amounts of data and user traffic. It also means that, unlike master-slave or sharded systems, Cassandra has no single point of failure and therefore is capable of offering true continuous availability.

Cassandra and Multiple Data Centers

Cassandra’s architecture is tailor-made for multiple data centers. Its peer-to-peer design (vs. legacy master-slave implementations), coupled with online scale-out and full redundancy that offers no single point of failure and continuous availability, makes it ideal in multi-data center environments.

Because Cassandra is a masterless architecture, all nodes are the same and all nodes offer full read/write capabilities in a database cluster, regardless of where those nodes are physically located.

A single Cassandra ring (or database cluster) can certainly exist at just one physical data center. However, Cassandra can easily support a single database spanning multiple data centers, where each data center holds its own copy of the database and can have as many nodes as needed for supporting that site:

Figure 2: A Single Cassandra Database with Multiple Data Centers

Creating a database that spans multiple data centers in Cassandra is easy and is accomplished via the definition of a new database. Once the database software has been installed and is running on all machines in all participating data centers, and network communication has been established among all the nodes, a keyspace (analogous to an RDBMS database) is created using the Cassandra Query Language (CQL).
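
To make the keyspace step more concrete, the sketch below assembles a CREATE KEYSPACE statement from a mapping of data center names to replica counts. It is a minimal illustration in Python rather than anything shipped with Cassandra: the helper name build_create_keyspace, the keyspace demo_ks, and the data center names east and west are all hypothetical, and treating a replica count larger than a data center's node count as an error is an assumption of this sketch.

```python
import re

# Hypothetical helper for illustration; not part of Cassandra or any driver.
# A plain (unquoted) CQL identifier: a letter followed by letters, digits, or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def build_create_keyspace(name, replicas_per_dc, nodes_per_dc=None):
    """Render a CREATE KEYSPACE statement that uses NetworkTopologyStrategy.

    replicas_per_dc maps a data center name to the number of copies kept there.
    nodes_per_dc (optional) maps a data center name to its node count; it is
    used only to reject replica counts larger than the data center itself
    (an assumption of this sketch).
    """
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"not a plain CQL identifier: {name!r}")
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")

    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc, copies in replicas_per_dc.items():
        if not isinstance(copies, int) or copies < 1:
            raise ValueError(f"{dc}: replica count must be a positive integer")
        if nodes_per_dc is not None:
            available = nodes_per_dc.get(dc, 0)
            if copies > available:
                raise ValueError(
                    f"{dc}: {copies} copies requested but only {available} nodes"
                )
        dc_literal = dc.replace("'", "''")  # CQL escapes a single quote by doubling it
        options.append(f"'{dc_literal}': {copies}")

    return f"CREATE KEYSPACE {name} WITH REPLICATION = {{{', '.join(options)}}};"


if __name__ == "__main__":
    # Hypothetical two-data-center layout: 4 nodes in "east", 2 nodes in "west".
    print(build_create_keyspace(
        "demo_ks",
        {"east": 3, "west": 2},
        nodes_per_dc={"east": 4, "west": 2},
    ))
```

Generating the statement from a mapping keeps the data center names in one place, which matters because each name has to match the identifier already configured for that data center on the nodes themselves.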

Within the definition of a keyspace, each data center is identified (with the ID matching configuration parameters that have been previously set) along with the number of copies of the data that the keyspace will hold in each data center. For example, the syntax below creates a new keyspace named Globalbiz, with three data centers (DC1, DC2, and DC3): the first and second each holding six copies of the data (for fault tolerance purposes) and the third data center holding three copies:

CREATE KEYSPACE Globalbiz WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DC1': 6, 'DC2': 6, 'DC3': 3};

Once this command successfully executes, all data will then be automatically and transparently replicated between all nodes in all data centers with no further work being necessary on the part of any developer or administrator.

Multi-Data Center Performance

One reason for multi-data center deployment is to keep copies of a database close to users of a particular data center/geographic region, with the end result being faster performance for both reads and writes.

But what about performance across data centers? Won’t updating many nodes in many different data centers put too heavy a load on a database cluster? To eliminate this concern, Cassandra has built-in intelligence to send only a single data stream from one data center to each of the others participating in a multi-data center cluster. Once the data has reached one of the nodes in a different data center, that node then takes responsibility for updating all other nodes in that data center that are responsible for holding that piece of data.

Figure 3: Cross-Data Center Writes in Cassandra

Running Analytics and Search Across Multiple Data Centers

In addition to managing real-time data across multiple data centers, many modern businesses also wish to run analytic and enterprise search operations that span more than one data center.
As with real-time data, implementing cross-data center operations for analytics and search data has proven to be no easy task.

Options for Multi-Data Center Analytics and Search

The need for multi-data center support for analytics and enterprise search has not been lost on those developing and supporting analytics and search technology like Apache Hadoop and Apache Solr.

Today, Apache Hadoop offers a warm standby option that can be configured to go to a different data center. Third-party Hadoop vendors also offer solutions with one-way mirror capabilities.

For Solr, writes to Solr indexes in the community version of Solr cannot span multiple data centers. Instead, there is only replication support to another node in a different data center via rsync.

Both the open source versions of Hadoop and Solr and those offered by third-party software vendors miss the mark where the criteria for operating a datastore in a multi-data center environment are concerned.

However, DataStax Enterprise, offered by DataStax, not only supplies multi-data center support that meets the criteria suggested earlier in this paper for real-time/online data, but also delivers the same enterprise support for running analytics and search on Cassandra data across multiple data centers.

A Look at DataStax Enterprise

DataStax is the most trusted provider of Cassandra, employing the Apache chair of the Cassandra project as well as most of the committers.

For enterprises that want to use Cassandra in production, DataStax supplies DataStax Enterprise Edition, which includes an enterprise-ready version of Cassandra plus built-in security and the ability to run analytics and enterprise search operations on Cassandra data.

With DataStax Enterprise, modern businesses get a complete big data platform that contains:

• A certified version of Cassandra that has passed DataStax’s rigorous internal certification process, which includes heavy quality assurance testing, performance benchmarking, and defect resolution.
• Integrated analytics on Cassandra data using Hadoop MapReduce, Hive, Pig, Mahout, and Sqoop.
• Bundled enterprise search support with Apache Solr.
• Automatic management services that transparently run and take care of many administration tasks without IT staff involvement.
• DataStax OpsCenter, a visual management and monitoring tool.
• Expert, 24x7x365 support.
• Certified maintenance releases and platform certification.

What About the Cloud?

Both Cassandra and DataStax Enterprise are fully cloud-enabled and capable of supporting multiple availability zones in a cloud provider. Further, hybrid deployments are supported so that a single cluster can span multiple on-premise installations as well as cloud-based implementations.

Figure 4: Cassandra supports hybrid on-premise/cloud deployments

Managing and Monitoring Multi-Data Center Deployments

Administering and monitoring the performance of any distributed database system can be challenging, especially when the database spans multiple geographical locations. However, DataStax makes it easy to manage multi-data center databases with DataStax OpsCenter.

DataStax OpsCenter is a visual management and monitoring solution for Cassandra. Because DataStax OpsCenter is web based, developers or administrators can easily manage and monitor all aspects of their databases from any desktop, laptop, or tablet without installing any client software. This includes databases that span multiple data centers and the cloud.

Figure 5: Managing a 9-node Cassandra cluster with DataStax OpsCenter

Multi-Data Center Customer Examples

Figure 6: A sample of companies and organizations using Cassandra in production

Some DataStax customers using Cassandra and DataStax Enterprise across multiple data centers and the cloud include:

• Netflix – has over 500 nodes of Cassandra running in multiple clusters and geo-zones on Amazon.
• eBay – has over 200 TB in DataStax Enterprise across three data centers.
• HealthX – supports their online patient and provider portal with DataStax Enterprise running in multiple geographies on Amazon.
• ReachLocal – uses DataStax Enterprise in six different data centers across the world to support their global online advertising business.
• Pantheon Systems – uses Cassandra across multiple data centers to deliver their cloud-based web development platform.
• Scandit – runs Cassandra across three different data centers to support its mobile barcode and product scanning service.

Conclusion

Today’s successful businesses are looking for a modern database management system that can easily span multiple data centers and handle real-time, analytic, and enterprise search operations. Cassandra and DataStax Enterprise meet the requirements these businesses have for multi-data center and cloud support.

To find out more about Cassandra and DataStax, and to obtain downloads of Cassandra and DataStax Enterprise software, please visit www.datastax.com or send an email to info@datastax.com.

Note that DataStax Enterprise Edition is completely free to evaluate in development environments, while production deployments require the purchase of a software subscription.

About DataStax

DataStax powers the big data applications that transform business for more than 300 customers, including startups and 20 of the Fortune 100. DataStax delivers a massively scalable, flexible and continuously available big data platform built on Apache Cassandra™.
DataStax integrates enterprise-ready Cassandra and includes the ability to run analytics and search on Cassandra data across multiple data centers and in the cloud.

Companies such as Adobe, Healthcare Anytime, eBay and Netflix rely on DataStax to transform their businesses. Based in San Mateo, Calif., DataStax is backed by industry-leading investors: Lightspeed Venture Partners, Crosslink Capital and Meritech Capital Partners. For more information, visit DataStax or follow us @DataStax.