International Journal of Engineering Trends and Technology (IJETT) - Volume 26 Number 5 - August 2015
ISSN: 2231-5381, http://www.ijettjournal.org

Apache Cassandra @ Business Field

1st Author: Miss Monika D. Khade, 2nd Author: Miss Purvaja A. Sable
Department of Computer Science & Engineering, College of Engineering & Technology, Akola

Abstract: In the world of cloud computing, one essential ingredient is a database that can accommodate a very large number of users on an on-demand basis. Distributed storage mechanisms are becoming the de facto method of data storage for the new generation of web applications used by companies such as Google, Amazon, Facebook, Twitter, Salesforce.com, Linkedin.com and Yahoo!, which process large amounts of data at petabyte scale. This paper gives a detailed overview of Apache Cassandra, a distributed database management system, in the business field. We go step by step from its introduction to its implementation and then to the settings in which it is used by major internet companies. While doing so, we also discuss its features, the tools available and ongoing research to improve its performance.

Keywords: NoSQL, Distributed Databases, Apache Cassandra, Multi Data Center, Tunable Consistency.

I. INTRODUCTION

A. What is Apache Cassandra?
Apache Cassandra is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients. Cassandra can be regarded as a distributed database system that combines technologies from Amazon Dynamo and Google BigTable. The roots of Apache Cassandra lie in Facebook's requirement for a NoSQL database. Cassandra was developed by Avinash Lakshman and Prashant Malik at Facebook to power its inbox search feature. It was designed to be a means of database storage for a distributed architecture.

Being deployable on a distributed platform, it was expected to handle data spread across geographically diverse servers and to provide seamless service with built-in fault tolerance and no single point of failure. One of the most surprising features of Cassandra is that, though it is said to share a lot of design and internal architecture details with traditional relational database management systems, it is by no means a relational database system. Cassandra provides its users with a data model that can be customized according to data storage and access requirements.

In short, Cassandra can be described as an enormously scalable, decentralized and fault-tolerant database management system which stores attribute values in a structured and indexed fashion for efficient querying using the Cassandra Query Language (CQL).

Implementing the inbox search feature was a difficult task with the existing relational database management system, because it required very high write throughput. Though possible, it was infeasible and inefficient to implement this feature for a very large number of geographically diverse users. Since Cassandra's introduction in 2008, the number of Facebook users has more than doubled, and Cassandra still delivers satisfactory performance. Beyond Facebook, thanks to its rich set of features, Cassandra has been deployed by various major internet businesses such as Netflix, Digg and Twitter, to name a few.

II. HISTORY

A. Literature Review
Apache Cassandra was initially developed at Facebook to power its Inbox Search feature by Avinash Lakshman (one of the authors of Amazon Dynamo) and Prashant Malik. It was released as an open source project on Google Code in July 2008. In March 2009, it became an Apache Incubator project. On February 17, 2010 it graduated to a top-level project. It was named after the Greek mythological prophet Cassandra.

Releases after graduation include:
• 0.6, released Apr 12 2010, added support for integrated caching and Apache Hadoop MapReduce
• 0.7, released Jan 08 2011, added secondary indexes and online schema changes
• 0.8, released Jun 2 2011, added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades
• 1.0, released Oct 17 2011, added integrated compression, leveled compaction, and improved read performance
• 1.1, released Apr 23 2012, added self-tuning caches, row-level isolation, and support for mixed SSD/spinning disk deployments
• 1.2, released Jan 2 2013, added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing
• 2.0, released Sep 4 2013, added lightweight transactions (based on the Paxos consensus protocol), triggers, and improved compactions
• 2.0.4, released Dec 30 2013, added the ability to specify which datacenters participate in a repair, client encryption support in sstableloader, and removal of snapshots of no-longer-existing column families (CFs)
• 2.1.0, released Sep 10 2014
• 2.1.6, released June 08 2015
• 2.1.7, released June 22 2015
• 2.2.0, released July 20 2015

B. Evolution of Cassandra
The Cassandra database system was born at Facebook in 2007 as a resource to handle the inbox search feature, which demanded high scalability and real-time usage. It was published as an open source project on Google Code in 2008 and later became a top-level Apache project. Since its release it has undergone numerous changes and additions. The version released in 2010 added support for integrated caching. The 2011 versions added support for the Cassandra Query Language (CQL) and for upgrades with no server downtime, a capability required by many real-time websites such as Facebook, Google and Amazon.
As part of a more recent revision, it gained more advanced features such as support for SSDs, self-tuning caches and row-level isolation.

Fig. 1 Evolution of Cassandra

III. CASSANDRA - A NOSQL

A. NoSQL - A Database
Apache Cassandra is a type of NoSQL database. NoSQL is a class of database management systems (DBMS). NoSQL stands for "Not only SQL". It does not use SQL as its query language. NoSQL has a distributed, fault-tolerant architecture. There is no fixed schema (formally described structure) and there are no joins (typical in databases operated with SQL). A join is an expensive operation that combines records from two or more tables into one set. Joins require strong consistency and fixed schemas; the lack of these requirements makes NoSQL databases more flexible. NoSQL is not a replacement for an RDBMS but complements it.

A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism to store and retrieve data other than the tabular relations used in relational databases. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts of data. The primary objective of a NoSQL database is simplicity of design, horizontal scaling, and finer control over availability. NoSQL databases use different data structures from relational databases, which makes some operations faster in NoSQL. The suitability of a given NoSQL database depends on the problem it must solve.

From a business standpoint, considering a NoSQL or "Big Data" environment has been shown to provide a clear competitive advantage in numerous industries. In the "age of data", this is compelling information, as the importance of data is summed up by the saying "if your data isn't growing then neither is your business".

Fig. 2 NoSQL Database

B. Architecture of NoSQL Web
A NoSQL database environment is, simply put, a non-relational and largely distributed database system that enables rapid, ad-hoc organization and analysis of extremely high-volume, disparate data types. NoSQL databases are sometimes referred to as cloud databases, non-relational databases, Big Data databases and a myriad of other terms, and were developed in response to the sheer volume of data being generated, stored and analyzed by modern users (user-generated data) and their applications (machine-generated data).

In general, NoSQL databases have become the first alternative to relational databases, with scalability, availability, and fault tolerance being key deciding factors. They go well beyond the more widely understood legacy relational databases (such as Oracle, SQL Server and DB2) in satisfying the needs of modern business applications. This technology is typically characterized by a very flexible and schema-less data model, horizontal scalability, distributed architectures, and the use of languages and interfaces that are "not only" SQL.

Fig. 3 NoSQL Family Tree

IV. CASSANDRA ARCHITECTURE

A. Cassandra Database Architecture
The architecture of Cassandra greatly contributes to its ability to scale, perform, and offer continuous availability. Cassandra was built from the ground up with the understanding that hardware and system failures can and do occur. This translates into Cassandra managing and protecting data differently from a traditional RDBMS. Rather than using a legacy master-slave design or a manual and difficult-to-maintain sharded design, Cassandra has a peer-to-peer distributed architecture that is much more elegant and is easy to set up and maintain. In Cassandra, all nodes are the same; there is no concept of a master node, and all nodes communicate with each other via a gossip protocol.

Cassandra's built-for-scale architecture means that it is capable of handling petabytes of information and thousands of concurrent users/operations per second (across multiple data centers) as easily as it can manage much smaller amounts of data and user traffic. It also means that, unlike other master-slave or sharded systems, Cassandra has no single point of failure and therefore is capable of offering true continuous availability.

Fig. 4 Architecture of Apache Cassandra

B. Dataflow model
Apache Cassandra is not just another relational database on the market. Instead of using the relational model, it uses a key-value map to store its data. The structure can be explained, more or less, as in the following picture:

Cassandra is essentially a hybrid between a key-value and a column-oriented (or tabular) database. A column family (called a "table" since CQL 3) resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, a value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.

Each key in Cassandra corresponds to a value which is an object. Each key has values as columns, and columns are grouped together into sets called column families. Thus, each key identifies a row with a variable number of elements. These column families can then be considered as tables. A table in Cassandra is a distributed multi-dimensional map indexed by a key. Furthermore, applications can specify the sort order of columns within a Super Column or Simple Column family.
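The column-family model described above can be illustrated with a small in-memory sketch. This is not Cassandra's implementation or API: the class names Column and ColumnFamily and the methods insert and get_row are hypothetical, chosen only for illustration. The sketch shows three properties from the text: each row is identified by a row key, rows in the same column family need not share the same columns, and every column carries a timestamp. As simplifying assumptions, a repeated write to the same column keeps the value with the newest timestamp, and columns are returned sorted by name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Column:
    """One column: a name, a value, and a client-supplied timestamp."""
    name: str
    value: Any
    timestamp: int


@dataclass
class ColumnFamily:
    """Toy column family (hypothetical): row key -> {column name -> Column}."""
    name: str
    rows: Dict[str, Dict[str, Column]] = field(default_factory=dict)

    def insert(self, row_key: str, name: str, value: Any, timestamp: int) -> None:
        # Rows are created on first write; no schema is declared up front.
        row = self.rows.setdefault(row_key, {})
        current = row.get(name)
        # Assumption for this sketch: the newest timestamp wins.
        if current is None or timestamp >= current.timestamp:
            row[name] = Column(name, value, timestamp)

    def get_row(self, row_key: str) -> List[Tuple[str, Any]]:
        # Return (name, value) pairs sorted by column name.
        row = self.rows.get(row_key, {})
        return [(c.name, c.value) for c in sorted(row.values(), key=lambda c: c.name)]


users = ColumnFamily("users")
users.insert("alice", "email", "alice@example.com", timestamp=1)
users.insert("alice", "city", "Akola", timestamp=2)
users.insert("bob", "email", "bob@example.com", timestamp=1)  # bob has no "city" column
users.insert("alice", "email", "alice@new.example.com", timestamp=3)
users.insert("alice", "email", "stale@example.com", timestamp=2)  # older write is ignored

print(users.get_row("alice"))  # [('city', 'Akola'), ('email', 'alice@new.example.com')]
print(users.get_row("bob"))    # [('email', 'bob@example.com')]
```

In this sketch the two rows hold different sets of columns, which a fixed relational schema would not allow without null values. A real cluster additionally partitions rows across nodes by row key, which the sketch does not attempt to model.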
Cluster:
Cassandra is designed to be distributed over several nodes/machines. A cluster consists of several nodes, although a single-node cluster, such as one running on a personal computer, is also possible.

Keyspace:
A cluster consists of several keyspaces. A keyspace is the place where our data resides. A keyspace can have several Column Families or Super Column Families.

Column Family and Super Column Family:
Both a Column Family and a Super Column Family are collections of rows, just as a table is a collection of rows in a relational database.

Row:
A row consists of columns: key-value columns for a row in a Column Family, or Super Columns for a row in a Super Column Family.

Super Column:
A sort of container of sub-columns (which are of type Key-value Column).

Key-value Column:
The most basic data structure in Cassandra, in which the actual data is stored as bytes. Its behavior is much like that of a hash map in Java.

Fig. 5 Cassandra Data Model

Apache Cassandra in a nutshell is an open source, peer-to-peer, decentralized, easily scalable, fault-tolerant, highly available, eventually consistent, schema-free, column-oriented distributed database. In a master/slave setup, the master node can have far-reaching effects if it goes offline. By contrast, Cassandra has a peer-to-peer distribution model, such that any given node is structurally identical to any other node; that is, there is no "master" node that acts differently from a "slave" node. The aim of Cassandra's design is overall system availability and ease of scaling. The Cassandra data model comprises keyspaces (something like a database in relational databases) and column families (tables).

Cassandra defines a column family to be a logical division that associates similar data. The basic Cassandra data structures are the column, which is a name/value pair with a client-supplied timestamp of when it was last updated, and the column family, which is a container for rows that have similar, but not identical, column sets. There is no need to store a value for every column every time a new entity is stored. A cluster is a container for keyspaces, typically a single keyspace. A keyspace is the outermost container for data in Cassandra, but it is perfectly fine to create as many keyspaces as the application needs. A column family is a container for an ordered collection of rows, each of which is itself an ordered collection of columns.

V. INSTALLATION

A. Environment
Cassandra requires the following environment variables to be set:
• JAVA_HOME - The path location of your Java Virtual Machine (JVM) installation
• CLASSPATH - A path containing all of the required Java class files (.jar)
• CASSANDRA_CONF - Directory containing the Cassandra configuration files

For convenience, Cassandra uses an include file, cassandra.in.sh, to source these environment variables. It will check the following locations for this file:
• Environment setting for CASSANDRA_INCLUDE if set
• $CASSANDRA_HOME/bin
• /usr/share/cassandra/cassandra.in.sh
• /usr/local/share/cassandra/cassandra.in.sh
• /opt/cassandra/cassandra.in.sh
• $HOME/.cassandra.in.sh

Cassandra also uses the Java options set in $CASSANDRA_CONF/cassandra-env.sh. If you want to pass additional options to the Java virtual machine, such as maximum and minimum heap size, edit the options in that file rather than setting them elsewhere.

B. Installing Cassandra Locally
Steps to Implement Cassandra:
This document aims to provide a few easy-to-follow steps that take the first-time user from installation to running a single-node Cassandra, with an overview of how to configure a multi-node cluster. Cassandra is meant to run on a cluster of nodes, but will run equally well on a single machine.
This is a handy way of getting familiar with the software while avoiding the complexities of a larger system.

1) Step 1: Prerequisites and Connecting to the Community
Cassandra requires the most stable version of Java 7 or 8 you can deploy, preferably the Oracle/Sun JVM. Cassandra also runs on OpenJDK and the IBM JVM. (It will NOT run on JRockit, which is only compatible with Java 6.)
The best way to ensure you always have up-to-date information on the project, releases, stability, bugs, and features is to subscribe to the users mailing list (subscription required) and participate in the #cassandra channel on IRC.

2) Step 2: Download Cassandra
Download links for the latest stable release can always be found on the website.
Users of Debian or Debian-based derivatives can install the latest stable release in package form; see DebianPackaging for details.
Users of RPM-based distributions can get packages from DataStax.
If you are interested in building Cassandra from source, please refer to the How to build page.
For more details about miscellaneous builds, please refer to the Cassandra versions and builds page.

3) Step 3: Basic Configuration
The Cassandra configuration files can be found in the conf directory of binary and source distributions. If you have installed Cassandra from a deb or rpm package, the configuration files will be located in /etc/cassandra.

4) Step 4: Start Cassandra
And now for the moment of truth: start up Cassandra by invoking 'bin/cassandra -f' from the command line. The service should start in the foreground and log gratuitously to the console. Assuming you don't see messages with scary words like "error" or "fatal", or anything that looks like a Java stack trace, everything should be working.
Press "Control-C" to stop Cassandra.
If you start up Cassandra without the "-f" option, it will run in the background.
You can stop the process by killing it, using 'pkill -f Cassandra'.

VI. FEATURES
The main features of Apache Cassandra for business use are given below:

A. Decentralized
Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master, as every node can service any request.

B. Supports replication and multi data center replication
Replication strategies are configurable. Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra's distributed architecture are specifically tailored for multiple-data center deployment, for redundancy, and for failover and disaster recovery.

C. Scalability
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

D. Fault-tolerant
Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

E. Tunable consistency
Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle.

F. MapReduce support
Cassandra has Hadoop integration, with MapReduce support. There is also support for Apache Pig and Apache Hive.

G. Query language
Cassandra introduces CQL (Cassandra Query Language), a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC), Python (DBAPI2), Node.JS (Helenus), Go (gocql) and C++.

VII. SECURITY
Cassandra provides these security features to the open source community.

A. Client-to-node encryption
Cassandra includes an optional, secure form of communication from a client machine to a database cluster.
Client-to-server SSL ensures that data in flight is not compromised and is securely transferred back and forth from client machines.

B. Authentication based on internally controlled login accounts/passwords
Administrators can create users who can be authenticated to Cassandra database clusters using the CREATE USER command. Internally, Cassandra manages user accounts and access to the database cluster using passwords. User accounts may be altered and dropped using the Cassandra Query Language (CQL).

C. Object permission management
Once a user is authenticated into a database cluster using internal authentication, the next security issue to be tackled is permission management: what can the user do inside the database? Authorization capabilities for Cassandra use the familiar GRANT/REVOKE security paradigm to manage object permissions.

D. Enabling JMX authentication
The default settings for Cassandra make JMX accessible only from localhost. If you want to enable remote JMX connections, change the LOCAL_JMX setting in cassandra-env.sh and enable authentication and/or SSL. After you enable JMX authentication, ensure that tools that use JMX, such as nodetool and DataStax OpsCenter, are configured to use authentication.

VIII. APPLICATIONS

Fig. 6 Various applications using Cassandra

A. Facebook
Facebook moved off its pre-Apache Cassandra deployment in late 2010 when it replaced Inbox Search with the Facebook Messaging platform. In 2012, Facebook began using Apache Cassandra in its Instagram unit. Facebook uses its internally developed Cassandra and has the largest known cluster in operation, of around 150 nodes.

B. Netflix: An Example of Succeeding in the Cloud with Cassandra
With more than 25 million members worldwide, Netflix, Inc. (Nasdaq: NFLX) is the world's leading Internet subscription service for enjoying movies and TV shows. Netflix allows its members to instantly watch unlimited movies and TV episodes streaming over the Internet to computers and TVs.

C. Digg
"The recent V4 relaunch is 100% Cassandra. We are running on multiple clusters internally; our largest one is 40 nodes spanning multiple datacenters. We also have 1 core committer on staff."

D. Twitter
"We're using Cassandra in production for a bunch of things at Twitter."

E. SoftwareProjects
"SoftwareProjects uses 20 Cassandra nodes across 3 datacenters to power the eCommerce platform for 3,000 businesses. We use Cassandra to store real-time purchase data and provide various stats for our customers. Cassandra is our primary data store."

F. eBay
eBay has Cassandra supporting multiple applications with rings spanning several data centers.

G. Rackspace
Rackspace uses Cassandra for a variety of internal needs. Several Rackspace employees are also core committers and contributors on the project.

H. Ooyala
Ooyala uses Cassandra as the backing store for a near-realtime video analytics platform, which allows publishers to analyze and optimize the performance of their online video content.

I. Despegar
Despegar uses a Cassandra cluster to store user sessions of its hotel booking site, and also as a persistent cache of flight itineraries.

J. SimpleGeo
"We use Cassandra as our core datastore for providing location-based services and products. We run Cassandra in multiple availability zones within Amazon EC2."

Developing Cassandra Applications
The primary difference developers will find when developing applications against Cassandra versus RDBMSs is the data model. Cassandra uses a Google Bigtable model, which provides more flexibility than a relational design and can more easily store structured, semi-structured, and unstructured data.

Because NoSQL databases like Cassandra do not support operations like SQL joins, data tends to be highly denormalized.
While such a thing (wide rows) is normally a problem for an RDBMS, Cassandra provides exceptional performance for objects with many thousands of columns. The primary container of data is a keyspace, which is like a database in an RDBMS. Inside a keyspace are one or more column families, which are like relational tables but are more fluid and dynamic in structure. Column families have from one to many thousands of columns, and both primary and secondary indexes on columns are supported.

In Cassandra, objects are created, data is inserted and manipulated, and information is queried via CQL, the Cassandra Query Language, which looks nearly identical to SQL. Developers coming from the relational world will be right at home with CQL and will use standard commands (e.g., INSERT, SELECT) to interact with objects and data stored in Cassandra.

Companies running their applications on Apache Cassandra have realized benefits which have directly improved their business. Cassandra addresses many of the big data challenges that might arise: massive scalability, an always-on architecture, high performance, strong security, and ease of management, to name a few.

Businesses have successfully deployed Apache Cassandra in their environments for various types of applications and use cases. Many companies have deployed and benefited from Apache Cassandra, including large companies such as Apple, Comcast, Instagram, Spotify, eBay, Rackspace, Netflix, and many more. The larger production environments have petabytes of data in clusters of over 75,000 nodes. Cassandra is available under the Apache 2.0 license.
Some of the application use cases that Cassandra excels in include:
• Real-time, big data workloads
• Time series data management
• High-velocity device data consumption and analysis
• Media streaming management (e.g., music, movies)
• Social media (i.e., unstructured data) input and analysis
• Online web retail (e.g., shopping carts, user transactions)
• Real-time data analytics
• Online gaming (e.g., real-time messaging)
• Software as a Service (SaaS) applications that utilize web services
• Online portals (e.g., healthcare provider/patient interactions)
• Most write-intensive systems

IX. CONCLUSION
In this paper, a detailed study was made to understand the features and working of Cassandra. We also explained Cassandra with the help of its architecture. Cassandra is popular among NoSQL databases and can be used for applications requiring fast writes and high availability. NoSQL databases are not "one size fits all": each NoSQL class addresses specific data storage and processing requirements. We have thus presented the features of the Cassandra distributed database management system and the benefits of using it in real-world enterprise applications. Cassandra is an excellent choice for businesses which are looking for features such as high availability, consistency, low downtime, fault tolerance, and high scalability in terms of both users and data. These features have been empirically tested by the creators of Cassandra on production data spread across geographically diverse areas, with volumes of several terabytes. With continuous development going on under the Apache Software Foundation, Cassandra is expected to evolve into one of the most prominent and powerful distributed database management systems.

ACKNOWLEDGMENT
We would like to thank Prof. Sonali Dhule, Assistant Professor of Computer Science and Engineering at College of Engineering and Technology, Sant Gadge Baba Amravati University, for providing us with the idea to write a survey paper on Apache Cassandra @ Business Field. We also thank IJETT for providing this template.