Wasef: Incorporating Metadata into NoSQL Storage Systems
Ala' Alkhaldi, Indranil Gupta, Vaijayanth Raghavan, Mainak Ghosh
Department of Computer Science, University of Illinois at Urbana-Champaign
Distributed Protocols Research Group: http://dprg.cs.uiuc.edu

NoSQL Storage Systems
• Growing quickly: projected to be a $3.4B industry by 2018
• Fast reads and writes: several orders of magnitude faster than MySQL and other relational databases
• Easier to manage
• Support CRUD operations on data (Create, Read, Update, Delete)
• Many companies use them to run critical infrastructure: Google, Facebook, Yahoo!, and many others
• Many open-source NoSQL databases: Apache Cassandra, Riak, MongoDB, etc.

The Need for Metadata
• Though easier to manage than RDBMSs, NoSQL systems still have many pain points
• Today, system administrators need to:
  • Parse flat files in system logs, e.g., to debug behavior
  • Manually count token ranges, e.g., during node decommissioning
• Many of these pain points could be alleviated if a metadata system were available
• Metadata can also enable new features not possible today, e.g., data provenance

Metadata
• Metadata = essential information about a {system, table, row}, excluding the data itself
  • E.g., for a table: its columns, and the history of previously deleted columns
• We argue that metadata should be treated as a first-class citizen in NoSQL storage systems
• We present Wasef, the first metadata collection system for NoSQL storage systems
• We integrate Wasef into Apache Cassandra, the most popular NoSQL storage system
• The metadata-enabled Cassandra is called W-Cassandra
• Available for free download at: http://dprg.cs.uiuc.edu/downloads

The Wasef System
• Wasef is a metadata management system for NoSQL data stores
• Wasef is guided by five design principles; it should:
  1. Store metadata cleanly
  2. Make metadata accessible via clean APIs
  3. Be modular and integrate with the underlying NoSQL functionality, without changing other data APIs
  4. Provide flexibility in the granularity at which metadata is collected
  5. Be efficient and collect only the minimal metadata required

Wasef Architecture
• Registry = a list of (object, operation) pairs specifying which operation triggers metadata collection for which object
• Log = the metadata itself
• Metadata needs to be easy to query and access:
  • Stored as system tables, where available from the underlying NoSQL store
  • Uses the underlying NoSQL store's CRUD operations for metadata
• APIs are provided to clients and to use cases

Wasef APIs
Internal API
• Registry.add(target, operation)
• Registry.delete(target, operation)
• Registry.query(target, operation)
• Log.add(target, operation, timestamp, value)
• Log.delete(target, operation, startTime, endTime)
• Log.query(target, operation, startTime, endTime)
• "target": the name of the database entity for which metadata is being collected
  • Uses a systematic dotted naming convention, e.g., <KeySpaceName.Table.RowID.Column>
• "operation": the operation which, when invoked by any client, triggers collection of metadata for this target
  • Also uses a systematic naming convention; examples: column add, row insert, truncate table
External API
• Wrappers around the Internal API; convenience functions

W-Cassandra: Incorporating Wasef into Cassandra (v1.2.x)
Supported metadata targets and operations:

Target  | Identifier                      | Operations             | Collected Metadata
Schema  | Name                            | Alter, Drop            | Old and new names, replication map
Table   | Name                            | Alter, Drop, Truncate  | Column family name, new and old properties (e.g., column names, types, ...)
Row     | Partitioning keys               | Insert, Update, Delete | Key names, affected columns, TTL, ...
Column  | Clustering keys and column name | Insert, Update, Delete | Key names, affected columns, TTL, ...
Node    | Node ID                         | On request             | Token ranges
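To make the Internal API above concrete, here is a minimal Java sketch of how a Cassandra operation path could consult the Registry and append to the Log. The class and interface shapes are hypothetical (only the call names and parameters come from the API slide), and the return types of the query calls are assumptions; this is not the actual Wasef implementation.

```java
import java.util.List;

public final class WasefHook {

    // Hypothetical interfaces mirroring the Internal API above; the return types
    // of the query calls are assumptions, since the slides do not specify them.
    interface Registry {
        void add(String target, String operation);
        void delete(String target, String operation);
        boolean query(String target, String operation);
    }

    interface Log {
        void add(String target, String operation, long timestamp, String value);
        void delete(String target, String operation, long startTime, long endTime);
        List<String> query(String target, String operation, long startTime, long endTime);
    }

    // Dotted naming convention from the slides, e.g. "School.Teacher.John.address".
    static String target(String keyspace, String table, String rowId, String column) {
        return keyspace + "." + table + "." + rowId + "." + column;
    }

    // Called from an operation's code path (e.g., a row update): metadata is
    // collected only if the (target, operation) pair was registered.
    static void maybeLog(Registry registry, Log log,
                         String target, String operation, String value) {
        if (registry.query(target, operation)) {
            log.add(target, operation, System.currentTimeMillis(), value);
        }
    }
}
```

A row-update path, for example, would call maybeLog(registry, log, target("School", "Teacher", "John", "address"), "Update_Row", updateSummary), which corresponds to the example registry and log contents shown on the next two slides.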
W-Cassandra: Registry Table
Takeaways:
• A separate row for each object stores all triggering operations for that object
• This makes lookups easy during an operation

Schema of the "registry" table (in CQL):
  create table registry(
    target text,
    operation text,
    primary key( target, operation ));
(target is the partitioning key; operation is the clustering key)

Example registry contents: the target School.Teacher is registered for the operations AlterCF_Add and Truncate; the target School.Teacher.John is registered for Delete_Row and Update_Row. The registry stores no values beyond the (target, operation) pairs.

W-Cassandra: Log Table
Takeaways:
• All metadata for a given object is stored as columns within one row
• Entries are ordered by insertion time, so querying all metadata for one object is fast

Schema of the "log" table (in CQL):
  create table log(
    target text,
    operation text,
    time bigint,
    client text,
    value text,
    primary key( target, operation, time, client ));
(target is the partitioning key; operation, time, and client are clustering keys)

Example log contents:
Target              | Operation-Time-Client        | Value
School.Teacher      | AlterCF_Add-1509051314-admin | {col_name: address, col_type: text, compaction_class: SizeTieredCompactionStrategy}
School.Teacher      | AlterCF_Add-2009051414-admin | {col_name: mobile, col_type: text, compression_sstable: DefaultCompressor}
School.Teacher.John | Update_Row-1510051314-admin  | {col_name: address, col_old_val: null, col_new_val: 'Urbana,IL', ttl: 432000}
School.Teacher.John | Update_Row-2010051414-admin  | {col_name: mobile, col_old_val: null, col_new_val: '55555', ttl: 432000}

Use Case 1: Flexible Column Drop
• Cassandra JIRA issue 3919: when a column is deleted, its data doesn't go away
  • Re-adding a new, empty column with the same name still leaves the old data available for querying!
• Wasef allows us to address this JIRA issue and build a new flexible column drop feature
• The flexible column drop feature is akin to the "Trash Bin" in today's OSs:
  • When a column is dropped, it is no longer available for querying
  • However, the column is not deleted immediately
  • The sysadmin has a grace period in which to "rescue" the dropped column
  • Or the sysadmin can explicitly delete the column for good
• Lifecycle: Original schema → (first column drop) → Tentative drop, which deletes the schema only → either (add column, within the grace period) back to the original schema, or (second column drop, or grace period expires) → Permanent drop, which deletes both schema and data
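The drop lifecycle above can be captured by a small state machine. The following Java sketch is hypothetical: the class, state names, and the 5-day grace period (chosen to match the 432000-second TTL in the example log entries) are assumptions, not the W-Cassandra implementation.

```java
import java.util.concurrent.TimeUnit;

public final class FlexibleColumnDrop {

    // Assumed grace period; 5 days matches the 432000-second TTL shown in the
    // example log entries above, but the slides do not fix a value.
    private static final long GRACE_PERIOD_MS = TimeUnit.DAYS.toMillis(5);

    enum State { ACTIVE, TENTATIVELY_DROPPED, PERMANENTLY_DROPPED }

    static final class ColumnRecord {
        State state = State.ACTIVE;
        long droppedAtMs; // set when the column is tentatively dropped
    }

    // First drop: remove the column from the schema only; its data stays on disk.
    // In W-Cassandra this is also where a Log entry would record the drop.
    static void tentativeDrop(ColumnRecord col, long nowMs) {
        col.state = State.TENTATIVELY_DROPPED;
        col.droppedAtMs = nowMs;
    }

    // Re-adding the column within the grace period "rescues" schema and data.
    static boolean rescue(ColumnRecord col, long nowMs) {
        if (col.state == State.TENTATIVELY_DROPPED
                && nowMs - col.droppedAtMs < GRACE_PERIOD_MS) {
            col.state = State.ACTIVE;
            return true;
        }
        return false;
    }

    // Second drop or an explicit permanent drop: delete both schema and data.
    static void permanentDrop(ColumnRecord col) {
        col.state = State.PERMANENTLY_DROPPED;
        // Data files for the column would be purged here.
    }

    // Periodic check: once the grace period expires, the drop becomes permanent.
    static void expireIfNeeded(ColumnRecord col, long nowMs) {
        if (col.state == State.TENTATIVELY_DROPPED
                && nowMs - col.droppedAtMs >= GRACE_PERIOD_MS) {
            permanentDrop(col);
        }
    }
}
```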
Use Cases 2 and 3
• Use Case 2: Automated node decommissioning
  • When a node is decommissioned, today the sysadmin needs to manually check the ranges of tokens (keys)
  • W-Cassandra automates this checking process
• Use Case 3: Data provenance
  • Today, NoSQL systems do not support tracking the provenance of data items:
    1. Where did this data item come from?
    2. How was this data item generated/modified?
  • Wasef tracks both (for requested objects)

Evaluation on AWS: System Throughput
Setup:
• AWS cluster (6 machines), EC2 m1.large instances
• YCSB heavy workload from the clients
• 12 GB of data
• 1M operations per run
• The plot shows the maximum achievable throughput
Result: Wasef lowers throughput by only 9%

Latency Results
• Compared to Cassandra, Wasef:
  • Affects read latency by only 3%
  • Affects update latency by 15% (this can be optimized further)
• Latencies are not affected by metadata size (up to 8% of the data size)

Scalability With Cluster Size
Setup:
• Increase the cluster size from 2 to 10 servers
• Also proportionally increase the dataset size and client load: {2 GB of data, 25 threads} per server
• Each point is the average of 1M operations
Result: Wasef's overhead is only about 10% and rises slowly with cluster size

Use Case: Column Drop
Setup:
• Customized client
• 4 nodes
• 8 GB dataset
• Each bar is the average of 500 drop operations
Result: Dropping a column is 5% slower (and sometimes faster)
Note: The Wasef implementation is correct, while Cassandra 1.2's is not

Summary
• Wasef is the first system to support metadata as a first-class citizen in NoSQL data stores
  • Modular, flexible, queryable, minimally intrusive
• W-Cassandra
  • We augmented Cassandra 1.2.x with Wasef
  • Implemented three use-case scenarios: flexible column drop, automated node decommissioning, and data provenance
• Performance
  • Incurs low overheads on throughput and latency
  • Scales well with cluster size, workload, data size, and metadata size
• Code is available for download from the Distributed Protocols Research Group: http://dprg.cs.uiuc.edu

Backup Slides

Related Work
Wasef is not:
1. A database catalog (structural metadata)
  • A catalog describes database entities and the hierarchical relationships between them
  • Wasef instead collects descriptive and administrative metadata
2. ZooKeeper, Chubby, or Tango (standalone metadata services)
  • Wasef is a subsystem of the NoSQL data store that collects metadata during system operations
3. Amazon S3, Azure Cloud Store, or Google Cloud Data Store
  • In these systems metadata can be associated with stored objects, but it is limited in size (tens of KB) and metadata operations are inflexible
  • Wasef treats metadata like any other system data
4. Trio, a data provenance system for RDBMSs
  • Scalability is a big issue
Collecting metadata in NoSQL data stores is a relatively new field.

Use Case 2: Node Decommissioning
Setup:
• 4 nodes
• 4 GB dataset
• Token ranges per node increased from 64 to 256
Result: The average overhead is 1.5%, and the overhead is smaller at larger data sizes

Scalability With Metadata Size
• Update and read latencies are largely independent of the size of the metadata

2. Verification Tool for the Node Decommissioning Operation
• Node decommissioning from the cluster: nodetool decommission
  • A critical operation when the replication factor is one
  • Cannot be verified in the standard version of Cassandra
• How the tool works:
  • During node decommissioning, store the new replicas for the token ranges in the Log table (target: node IP; metadata: decommission)
  • To verify: nodetool decommission -verify <decommissioned node IP>
  • The token ranges are retrieved from the log and checked for existence in the system
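A sketch of the verification step behind nodetool decommission -verify might look like the following Java fragment. The helper types (MetadataLog, Ring, TokenRange) are hypothetical stand-ins for W-Cassandra internals; only the idea of reading the recorded token ranges from the Log table and checking that each is still served by a live node comes from the slide above.

```java
import java.util.List;

public final class DecommissionVerifier {

    interface MetadataLog {
        // Assumed wrapper over Log.query(target, operation, startTime, endTime),
        // with target = node IP and operation = "decommission".
        List<TokenRange> decommissionedRanges(String nodeIp);
    }

    interface Ring {
        // Assumed view of the current cluster ring.
        List<String> liveReplicasFor(TokenRange range);
    }

    record TokenRange(long start, long end) {}

    // Returns true if every token range handed off by the decommissioned node
    // is now served by at least one live node.
    static boolean verify(MetadataLog log, Ring ring, String decommissionedNodeIp) {
        boolean ok = true;
        for (TokenRange range : log.decommissionedRanges(decommissionedNodeIp)) {
            if (ring.liveReplicasFor(range).isEmpty()) {
                System.out.printf("Range (%d, %d] has no live replica%n",
                                  range.start(), range.end());
                ok = false;
            }
        }
        return ok;
    }
}
```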
3. Providing Data Provenance
• Data provenance: the history of an item, including its source, derivation, and ownership
• Provenance increases the value of an item because it proves its authenticity and reproducibility (e.g., by documenting the workflow of a scientific experiment)
• Wasef provides data provenance by design. It collects:
  • The target's full name
  • The operation name
  • The timestamp
  • The authenticated session owner's name
  • The results (depending on the operation)
• Provenance data is treated like client data (it can be queried, searched, replicated, ...)
• Garbage collection is not supported

Experiments
• We modified Cassandra to incorporate Wasef
• We ran our system on AWS (Amazon Web Services)
• Settings:
  • EC2 m1.large instances to evaluate our W-Cassandra system
  • Each instance has 2 virtual CPUs (4 ECUs), 7.5 GB of RAM, and 480 GB of ephemeral disk storage, and runs Ubuntu 12.04 64-bit
• Workload: YCSB (Yahoo! Cloud Serving Benchmark)
  • Heavy workload (50% reads, 50% updates) with a Zipfian distribution; the client runs on a separate machine
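To illustrate how the provenance fields listed in the backup slide above map onto the Log API introduced earlier, here is a minimal Java sketch. The helper class and the way the session owner and result are packed into the value field are assumptions; only the Log.add/Log.query calls and the five collected fields come from the slides, and the query return type is likewise assumed.

```java
import java.util.List;

public final class ProvenanceExample {

    interface Log {
        void add(String target, String operation, long timestamp, String value);
        List<String> query(String target, String operation, long startTime, long endTime);
    }

    // Record a provenance entry when an operation completes: target full name,
    // operation name, timestamp, authenticated session owner, and result.
    static void record(Log log, String target, String operation,
                       String sessionOwner, String result) {
        long now = System.currentTimeMillis();
        // Packing owner and result into the value field is an assumption made
        // for this sketch; the real entry layout is not specified in the slides.
        String value = "owner=" + sessionOwner + ";result=" + result;
        log.add(target, operation, now, value);
    }

    // Answer "how was this item generated/modified?" for a row over a time window.
    static List<String> history(Log log, String rowTarget, long from, long to) {
        return log.query(rowTarget, "Update_Row", from, to);
    }
}
```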