Supporting Multi-row Distributed Transactions with Global Snapshot Isolation Using Bare-bones HBase
Chen Zhang, Hans De Sterck
University of Waterloo

Outline
- Introduction
- General Background
- Snapshot Isolation (SI)
- HBase
- System Design
- Transactional SI Protocol
- System Performance
- Future Work

General Background (1)
- Database transactions are widely used by websites, analytical programs, etc.
- Snapshot isolation (SI) has been adopted by major DBMSs for high throughput
- No solution exists for easily replicating and scaling a traditional DBMS on clouds
- Column-oriented data stores have proven scalable on clouds (BigTable, HBase). However, they do not support multi-row distributed transactions out of the box

General Background (2)
- Google recently published a paper at OSDI'10 (Oct. 4; submission deadline May 7) about their “Percolator” system, which runs on top of BigTable and provides multi-row distributed transactions with SI
- Our paper describes an approach for multi-row distributed transactions with SI on top of HBase; it turns out that Google’s system has many design elements similar to ours

Snapshot Isolation (1)
Snapshot Isolation (SI)
- For a transaction T1 that starts at timestamp ts1:
  - T1 is given the database snapshot up to ts1, and T1 can do reads/writes on its own snapshot independently
  - When T1 commits, it checks whether any other transactions have committed conflicting data updates. If not, T1 commits

Snapshot Isolation (2)
Strong SI vs SI
- Strong SI requires every transaction T to see the most up-to-date snapshot of data
- SI requires every transaction T to see a consistent snapshot, which can be any snapshot taken earlier than T’s start timestamp
[Figure: timestamp ordering of transactions T1, T2, T3 with start timestamps S1, S2, S3 and commit timestamps C1, C2, C3]

HBase
- HBase is a column-oriented data store
- A single global database-like table view
- Multi-version data distinguished by timestamp
- A data table is horizontally split into row regions, and each region is hosted by a region server
- HBase guarantees atomic reads/writes of a single row

Outline
- Introduction
- General Background
- Snapshot Isolation (SI)
- HBase
- System Design
- Transactional SI Protocol
- System Performance
- Future Work

Design-Overview (1)
General Design Objective
- No deployment of extra programs; inherits HBase properties
  - Scalability, fault tolerance, high throughput, access transparency, etc.
- Non-intrusive to user data and easy to adopt
  - No modification to existing user data
- Implement a client library to manage transactions at the client side autonomously; no server-side changes
  - Transactions put their own information into the global tables
  - Meanwhile, they query those tables for information about other transactions to determine whether to commit or abort

Design-Overview (2)
General SI Protocol
[Figure: timestamp ordering of transactions T1, T2, T3 with start timestamps S1, S2, S3 and commit timestamps C1, C2, C3]
- Every transaction, when it commits, obtains a unique, strictly increasing commit timestamp, which determines the order between transactions and is used to enforce SI
- Every transaction commits successfully by inserting a row in the Committed table
- Every transaction, when it starts, reads the commit timestamp of the most recently committed transaction and uses that as its start timestamp

Design-Overview (3)
Simplified Protocol Walkthrough
- Get start timestamp S, a snapshot of the Committed table
- T reads/writes versions of data identified by S
- When T tries to commit:
  - T checks for conflicting updates committed by other transactions by scanning the Committed table
  - T writes a row into the Precommit table to indicate its attempt to commit
  - T checks for conflicting commit attempts by scanning the Precommit table
  - If both checks return no conflict, T proceeds to commit by atomically inserting one row into the Committed table

Design-SI Protocol
- For read-only transactions:
  - Only need to obtain a start timestamp and read the correct version of data from the snapshot
  - No need to do Precommit/Commit
- For update transactions:
  - Get start timestamp ts
  - Read/write
  - Precommit
  - Commit

Design-SI Protocol For Read-only Transaction Ti
- Get start timestamp Si and maintain DataSet DS in memory
  - Data read: {(L1, data1), …}
- To read the data item at L1:
  - If L1 is in DS, read from DS. Otherwise:
  - Query the Version table and get C1
  - Scan the Committed table and get the most recent transaction Ci that updates the data to version V; update the Version table with Ci
  - Use V to read the data and add (L1, data) to DS if necessary

Design-SI Protocol For Update Transaction Ti (1)
- Get start timestamp Si and maintain DataSet DS
  - Data read/written: {(L1, data1), …}
- Read the data item at L1 (same as the read-only case)
- Write
  - Write directly to the data tables with unique timestamp Wi

Design-SI Protocol For Update Transaction Ti (2)
Precommit
- Get precommit label Pi
- Scan the Committed table over the range [Si+1, ∞) for conflicting commits with an overlapping writeset. If no conflicts, proceed
- Add a row Pi to the Precommit table. Scan the full range of the Precommit table for other rows that have an overlapping writeset and either nothing under the “Committed” column or a value under the “Committed” column larger than Si. If no conflicts, proceed

Design-SI Protocol For Update Transaction Ti (3)
Commit
- Get commit timestamp Ci
- Add a row Ci to the Committed table with the data items in the writeset as columns (HBase atomic row write)
- Add “Ci” to row Pi in the Precommit table

Design-Timestamp Mechanism
- For each transaction Ti, four labels/timestamps are used: start timestamp Si, write timestamp Wi, precommit label Pi, and commit timestamp Ci

Design-Timestamp Mechanism
- Issue globally unique, incremental timestamps/labels by using the HBase atomic incrementColumnValue method on a single HBase table

Design-Obtain Start Timestamp
- For example, at the time T1 starts, there is a gap for C2 in the Committed table
- The snapshot for T1 is C3, which includes L1 with version W1 and L2 with version W3
- Before T1 commits, C2 appears. The snapshot of T1 should have included L1 with version W2
- Use the CommittedIndex table to store a recent snapshot

Outline
- Introduction
- General Background
- Snapshot Isolation (SI)
- HBase
- System Design
- Transactional SI Protocol
- System Performance
- Future Work

Performance (1)
- Test the basic timestamp/label issuing mechanism using a single HBase table

Performance (2)
- Test the necessity of using the Version table to minimize the range of the Committed table that must be scanned to find the most recent data version

Performance (3)
- Compare SI read performance with bare-bones HBase read performance

Performance (4)
- Compare SI write performance with bare-bones HBase write performance

Future Work
- Support strong SI without blocking reads, providing high throughput
- Add a mechanism for handling straggling/failed transactions
- Explore and experiment with application scenarios

General Background (3)
Similarities Compared with Percolator
- Both support ACID transactions and guarantee snapshot isolation by using the multi-version data support of the underlying column store
- Both are implemented as a client library rather than as server-side middleware
- Both dispense globally unique and well-ordered timestamps from a central location
- Both share some similar protocols for the commit process

General Background (4)
Differences Compared with Percolator
- Percolator focuses on analytical workloads that tolerate large latency; our system focuses on random data access with high throughput and low latency for web applications
- Percolator achieves strong SI but reads may block, sacrificing throughput for data freshness; our system achieves SI and does not block reads, sacrificing data freshness for high throughput
- Percolator requires modification to existing user data; our system uses a separate set of tables, which is non-intrusive to user data
- Percolator relies on BigTable single-row atomic transactions, which are not supported by HBase

Questions

Thank you!
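
Backup-Commit Protocol Sketch

The precommit and commit steps from the Design-SI Protocol slides can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' client library: plain Python dictionaries stand in for the Committed and Precommit tables, two in-memory counters stand in for labels issued through incrementColumnValue, and the names GlobalTables, Transaction, try_commit, and demo are hypothetical. Removing the Precommit row on abort is also an assumption, since the slides leave failed-transaction handling to future work. Because the sketch is single-threaded, gaps in the Committed table cannot occur, so it simply uses the largest commit timestamp as the start timestamp.

```python
import itertools


class GlobalTables:
    """In-memory stand-ins for the global tables (hypothetical, not the HBase API)."""

    def __init__(self):
        # Two counters play the role of atomically incremented counter cells.
        self._precommit_labels = itertools.count(1)
        self._commit_timestamps = itertools.count(1)
        self.committed = {}  # Ci -> frozenset of written locations
        self.precommit = {}  # Pi -> {"writeset": frozenset, "committed": Ci or None}

    def next_precommit_label(self):
        return next(self._precommit_labels)

    def next_commit_timestamp(self):
        return next(self._commit_timestamps)


class Transaction:
    def __init__(self, tables):
        self.tables = tables
        # Start timestamp Si: commit timestamp of the most recently committed
        # transaction (0 when nothing has committed yet).
        self.start_ts = max(tables.committed, default=0)
        self.writeset = set()

    def write(self, location):
        # Only the written location matters for conflict detection here.
        self.writeset.add(location)

    def try_commit(self):
        t = self.tables
        if not self.writeset:
            return True  # read-only transactions skip Precommit/Commit
        ws = frozenset(self.writeset)

        # Precommit, check 1: scan Committed over [Si + 1, infinity).
        if any(ci > self.start_ts and ws & other
               for ci, other in t.committed.items()):
            return False

        # Precommit, check 2: add row Pi, then scan the full Precommit table for
        # other overlapping rows that are uncommitted or committed after Si.
        pi = t.next_precommit_label()
        t.precommit[pi] = {"writeset": ws, "committed": None}
        conflict = any(
            label != pi
            and ws & row["writeset"]
            and (row["committed"] is None or row["committed"] > self.start_ts)
            for label, row in t.precommit.items()
        )
        if conflict:
            del t.precommit[pi]  # assumption: an aborting transaction cleans up
            return False

        # Commit: insert row Ci into Committed, then record Ci in row Pi.
        ci = t.next_commit_timestamp()
        t.committed[ci] = ws
        t.precommit[pi]["committed"] = ci
        return True


def demo():
    tables = GlobalTables()
    t1, t2 = Transaction(tables), Transaction(tables)  # same snapshot
    t1.write("L1")
    t2.write("L1")
    t2.write("L2")
    first = t1.try_commit()   # no conflicts, commits
    second = t2.try_commit()  # L1 was committed after t2's snapshot
    t3 = Transaction(tables)  # starts after t1's commit
    t3.write("L1")
    third = t3.try_commit()
    return [first, second, third]
```

In demo(), t1 and t2 start from the same snapshot and both write L1, so whichever commits second finds the other's row in the Committed table and aborts, while t3 starts after t1's commit and can update L1 again.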