CS 440 Database Management Systems
NoSQL & NewSQL, Cont'd.
Some slides due to Magda Balazinska

Scaling by partitioning & replication
• Partition the data across machines
• Replicate the partitions
  – Good:
    • read queries can be spread across the replicas
  – Bad:
    • replicas must be kept consistent after write queries
  – Ugly:
    • difficult to scale transactions
      – two-phase commit is expensive
    • difficult to scale complex operations

NoSQL: Not Only SQL / Not relational
• Goals
  – highly scalable data management system
  – flexible data model: records may follow different schemas
• They are willing to give up
  – Complex queries
    • e.g., no joins
  – ACID guarantees
    • weaker versions, e.g., eventual consistency
  – Multi-object transactions
• Not all NoSQL systems give up all of these properties

NoSQL key features
• Scale simple operations horizontally
  – key lookups
  – reads and writes of one record or a small number of records
  – simple selections
• Replicate/distribute data over many servers
• Simple call-level interface (contrast w/ SQL)
• Weaker concurrency model than ACID
• Efficient use of distributed indexes and RAM
• Flexible schema

Different types of NoSQL
Taxonomy based on the data models:
• Key-value stores
  – e.g., Dynamo, Project Voldemort, Memcached
• Document stores
  – e.g., SimpleDB, CouchDB, MongoDB
• Extensible record stores
  – e.g., Bigtable, HBase, Cassandra
• NewSQL: new type of RDBMSs
  – e.g., Megastore, VoltDB

Key-Value stores: features
• Data model: (key, value) pairs
  – values are binary objects
  – no further schema
• Operations
  – insert, delete, and lookup operations on keys
  – no operations across multiple data items
• Consistency
  – replication with eventual consistency
    • e.g., vector clocks in Dynamo
  – goal: NEVER reject a write (rejecting writes is bad for business!)
  – multiple versions, with conflict resolution during reads

Key-Value stores: features
• Use replication to provide fault tolerance
• Quorum replication in Dynamo
  – Each update creates a new version of an object
  – Vector clocks track causality between versions
  – Parameters:
    • N = number of copies (replicas) of each object
    • R = minimum number of nodes that must participate in a successful read
    • W = minimum number of nodes that must participate in a successful write
  – Quorum condition: R + W > N

Key-Value stores: internals
• Only a primary index: lookup by key
  – No secondary indexes!
• Data remains in main memory
  – Most systems also offer a persistence option
• Some offer ACID transactions, others do not
  – Multiversion concurrency control or locking

Multiversion Concurrency Control
• Idea: let writers make a "new" copy while readers use an appropriate "old" copy:
  – Main segment: current versions of DB objects
  – Version pool: older versions that may be useful for some active readers
• Readers are always allowed to proceed.
  – But they may be blocked until the writer commits.

Multiversion CC (Cont'd.)
• Each version of an object has its writer's TS as its WTS, and the TS of the Xact that most recently read this version as its RTS.
• Versions are chained backward; we can discard versions that are "too old to be of interest".
• Each Xact is classified as a Reader or a Writer.
  – A Writer may write some object; a Reader never will.
  – An Xact declares whether it is a Reader when it begins.

Reader Xact
• For each object to be read:
  – Find the newest version with WTS < TS(T).
    (Start with the current version in the main segment and chain backward through earlier versions, from new to old along the WTS timeline.)
• Assuming that some version of every object exists from the beginning of time, Reader Xacts are never restarted.
  – However, they might block until the writer of the appropriate version commits.
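To make the reader rule above concrete, here is a minimal sketch in Python. It illustrates only the version-selection step on this slide, not the code of any particular system; the Version record and the name read_version are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: object
    wts: int                      # TS of the Xact that wrote this version
    rts: int                      # TS of the Xact that most recently read it
    older: Optional["Version"]    # backward chain into the version pool

def read_version(current: Version, ts: int) -> Version:
    """Reader rule: start at the current version in the main segment and
    chain backward until we find the newest version with WTS < TS(T)."""
    v = current
    while v is not None and v.wts >= ts:
        v = v.older
    if v is None:
        # The slides assume some version of every object exists from the
        # beginning of time, so this case never arises there.
        raise RuntimeError("no suitable version found")
    v.rts = max(v.rts, ts)        # record the latest reader of this version
    return v
```

A Writer uses the same rule when it reads; the additional RTS(V) check it performs before writing appears on the next slide.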
Writer Xact
• To read an object, a Writer follows the reader protocol.
• To write an object:
  – Find the newest version V s.t. WTS(V) < TS(T).
  – If RTS(V) < TS(T), T makes a copy CV of V, with a pointer back to V, with WTS(CV) = TS(T) and RTS(CV) = TS(T).
    (The write is buffered until T commits; other Xacts can see the TS values but cannot read version CV.)
  – Else, reject the write.

Check out DynamoDB!
http://aws.amazon.com/dynamodb/

Different types of NoSQL
Taxonomy based on the data models:
• Key-value stores
  – e.g., Dynamo, Project Voldemort, Memcached
• Document stores
  – e.g., SimpleDB, CouchDB, MongoDB
• Extensible record stores
  – e.g., Bigtable, HBase, Cassandra
• NewSQL: new type of RDBMSs

Document stores
• A "document" is a pointer-less object
  – e.g., XML, JSON
  – nested or not
  – schema-less
• relational vs. semi-structured (document) data model?
• They may have secondary indexes.
• Scalability
  – Replication (e.g., SimpleDB, CouchDB – the entire database is replicated)
  – Sharding (MongoDB)
  – Both

Amazon SimpleDB (1/3)
• Partitioning
  – Data is partitioned into domains: queries run within a domain
  – Domains appear to be the unit of replication; each is limited to 10 GB
  – Can use domains to manually create parallelism
• Data model / schema
  – No fixed schema
  – Objects are defined with attribute-value pairs

Amazon SimpleDB (2/3)
• Indexing
  – Automatically indexes all attributes
• Support for writing
  – PUT and DELETE items in a domain
• Support for querying
  – GET by key
  – Selection + sort:
    SELECT output_list FROM domain_name [where expression] [sort_instructions] [limit limit]
  – A simple form of aggregation: count

Amazon SimpleDB (3/3)
• Availability and consistency
  – Data is stored redundantly across multiple servers
  – It takes time for an update to propagate to all locations
    • Eventually consistent: an immediate read might not show the change
  – Choose between a consistent read and an eventually consistent read

Different types of NoSQL
Taxonomy based on the data models:
• Key-value stores
  – e.g., Dynamo, Project Voldemort, Memcached
• Document stores
  – e.g., SimpleDB, CouchDB, MongoDB
• Extensible record stores
  – e.g., Bigtable, HBase, Cassandra
• NewSQL: new type of RDBMSs

Extensible record stores
• Data model is rows and columns
• Typical access: by row ID, column ID, and timestamp
• Scalability by splitting rows and columns over nodes
  – Rows: sharding on the primary key
  – Columns: "column groups" indicate which columns should be stored together (e.g., a customer name/address group, a financial-info group, a login-info group)

Google Bigtable
• Distributed storage system
• Designed to store structured data
• Scales to thousands of servers
• Stores up to several hundred terabytes (maybe even petabytes)
• Performs backend bulk processing
• Performs real-time data serving
• To scale, Bigtable has a limited set of features

Bigtable data model
• Sparse, multidimensional sorted map
  (row:string, column:string, time:int64) → string
• Columns are grouped into families
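A minimal sketch of the data model just described: a map sorted by (row, column, timestamp) that returns string values. The class TinyBigtable and its methods are illustrative assumptions for this example, not Bigtable's or HBase's actual API.

```python
import bisect

class TinyBigtable:
    """Toy sorted map keyed by (row, column_family:qualifier, timestamp).
    Timestamps are negated in the key so newer versions sort first."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._vals = {}    # key -> value (a string, as in the slides)

    def put(self, row: str, column: str, ts: int, value: str) -> None:
        key = (row, column, -ts)
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def read_row(self, row: str) -> dict:
        """Return the newest value of every column in one row."""
        out = {}
        i = bisect.bisect_left(self._keys, (row, "", float("-inf")))
        while i < len(self._keys) and self._keys[i][0] == row:
            r, col, neg_ts = self._keys[i]
            out.setdefault(col, self._vals[(r, col, neg_ts)])  # first hit = newest version
            i += 1
        return out
```

Because all keys for one row are adjacent in the sorted order, a single row can be read or written as a unit; this locality underlies the row-atomic access and tablet partitioning described on the next slide.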
Bigtable key features
• Reads/writes of data under a single row key are atomic
  – Only single-row transactions!
• Data is stored in lexicographic order
  – Improves data access locality
  – Horizontally partitioned into tablets
  – Tablets are the unit of distribution and load balancing
• Column families are the unit of access control
• Data is versioned (old versions are garbage collected)
  – Ex: the most recent three crawls of each page, with their timestamps

Bigtable API
• Data definition
  – Creating/deleting tables or column families
  – Changing access-control rights
• Data manipulation
  – Writing or deleting values
  – Looking up values from individual rows
  – Iterating over a subset of the data in the table
    • Can select on rows, columns, and timestamps

HBase
• Open-source implementation of Bigtable: http://hbase.apache.org/

Different types of NoSQL
Taxonomy based on the data models:
• Key-value stores
  – e.g., Dynamo, Project Voldemort, Memcached
• Document stores
  – e.g., SimpleDB, CouchDB, MongoDB
• Extensible record stores
  – e.g., Bigtable, HBase, Cassandra
• NewSQL: new type of RDBMSs

Scalable RDBMS: NewSQL
• RDBMSs that offer sharding
• Key difference:
  – NoSQL systems make it difficult or impossible to perform large-scope operations and transactions (to ensure performance), while scalable RDBMSs do not preclude these operations – users pay a price only when they need them.
• Megastore, VoltDB, MySQL Cluster, Clustrix, ScaleDB

Megastore
• Implemented over Bigtable, used within Google
• Megastore is a layer on top of Bigtable
  – Transactions that span nodes
  – A database schema defined in a SQL-like language
  – Hierarchical paths that allow some limited joins
• Megastore is made available through the Google App Engine Datastore

VoltDB
• Main-memory RDBMS: no disk I/O, no buffer management!
• Sharded across a shared-nothing cluster
  – One transaction = one stored procedure
  – So both the data and the processing are partitioned
• Transaction processing
  – SQL execution is single-threaded within each shard
  – Avoids all locking and latching overheads
  – (a toy sketch of this partition-per-thread idea appears at the end of these slides)
• Synchronous multi-master replication for high availability
  – Multiple nodes may propagate updates
  – Different from master/slave replication

Application 1
• Web application that needs to display lots of customer information; the user data is rarely updated, and when it is, you know when it changes because updates go through the same interface.

Application 2
• Department of Motor Vehicles: look up objects by multiple fields (driver's name, license number, birth date, etc.); "eventual consistency" is OK, since updates are usually performed at a single location.

Application 3
• eBay-style application. Cluster customers by country; separate the rarely changed "core" customer information (address, email) from frequently updated info (current bids).

Application 4
• Everything else (e.g., a serious DMV application)

Criticism (from Stonebraker, CACM 2011)
• No ACID = no interest from enterprises
  – Screwing up mission-critical data is a no-no
• A low-level query language is death
  – A return to the days before SQL
• NoSQL means No Standards
  – One (typical) large enterprise has 10,000 databases. These need accepted standards.
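Finally, the toy sketch referenced from the VoltDB slide: each transaction is a stored procedure routed to the single partition that owns its key, and each partition executes its procedures serially, so no locks or latches are needed. The class names and the hash partitioning here are illustrative assumptions, not VoltDB's actual API.

```python
class Partition:
    """One shard: owns a slice of the data and runs its transactions serially."""
    def __init__(self):
        self.data = {}

    def execute(self, procedure, key, *args):
        # Single-threaded execution per partition: no locking or latching needed.
        return procedure(self.data, key, *args)

class Cluster:
    def __init__(self, num_partitions: int):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def run(self, procedure, key, *args):
        # Route the whole transaction (one "stored procedure") to the owning partition.
        p = self.partitions[hash(key) % len(self.partitions)]
        return p.execute(procedure, key, *args)

# Example single-key "stored procedures".
def deposit(data, key, amount):
    data[key] = data.get(key, 0) + amount
    return data[key]

def balance(data, key):
    return data.get(key, 0)

cluster = Cluster(num_partitions=4)
cluster.run(deposit, "acct:42", 100)
print(cluster.run(balance, "acct:42"))   # -> 100
```

The point of the sketch is the design choice, not the code: because data and processing are co-partitioned, a single-partition transaction never coordinates with other shards, which is where VoltDB's claimed elimination of locking and latching overhead comes from.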