Exploiting Distributed Version Concurrency in a Transactional Memory Cluster
Kaloian Manassiev, Madalin Mihailescu and Cristiana Amza
University of Toronto, Canada

Transactional Memory Programming Paradigm
- Each thread executing a parallel region:
  - Announces the start of a transaction
  - Executes operations on shared objects
  - Attempts to commit the transaction
- If there is no data race, the commit succeeds and the operations take effect
- Otherwise the commit fails, the operations are discarded, and the transaction is restarted
- Simpler than locking!

Transactional Memory
- Used in multiprocessor platforms
- Our work: the first TM implementation on a cluster
- Supports both SQL and parallel scientific applications (C++)

TM in a Multiprocessor Node
(diagram: T1 and T2 are both active; T1 reads A while T2 writes to its own copy of A)
- Multiple physical copies of data
- High memory overhead

TM on a Cluster
Key Idea 1: Distributed Versions
- Different versions of data arise naturally in a cluster
- A new version is created on a different node; the other nodes read their own versions
(diagram: one node writes a new version while the other nodes read their local versions)

Exploiting Distributed Page Versions
- Distributed Transactional Memory (DTM)
(diagram: versions v0 through v3 spread across node memories mem0, mem1, mem2, ..., memN, connected by the network, with transactions txn0 through txnN running on the nodes)
Key Idea 2: Concurrent "Snapshots" Inside Each Node
(diagram, three build steps: within one node, Txn0 runs against snapshot v1 and Txn1 against snapshot v2 of the same pages; each transaction reads the page versions belonging to its own snapshot)

Distributed Transactional Memory
- A novel fine-grained distributed concurrency control algorithm
- Low memory overhead
  - Exploits distributed versions
- Supports multithreading within the node
- Provides 1-copy serializability

Outline
- Programming Interface
- Design
  - Data access tracking
  - Data replication
  - Conflict resolution
- Experiments
- Related Work and Conclusions

Programming Interface
- init_transactions()
- begin_transaction()
- allocate_dtmemory()
- commit_transaction()
- TM variables need to be declared explicitly

Data Access Tracking
- DTM traps reads and writes to shared memory by either one of:
  - Virtual memory protection
    - Classic page-level memory protection technique
  - Operator overloading in C++
    - Trapping reads: conversion operator
    - Trapping writes: assignment operators (=, +=, ...) and increment/decrement (++/--)

Data Replication
(diagrams, build steps for an update transaction T1 over pages 1 through n:)
- Twin creation: on its first write to a page (Wr p1, Wr p2), T1 creates a twin (a copy) of that page
- Diff creation: at commit, T1 diffs each written page against its twin
- Broadcast of the modifications at commit: with the latest version at 7, the diffs are broadcast as version 8
- The other nodes enqueue the diffs, update their latest-version counter to 8, and acknowledge receipt
T1 Commits
(diagram: with version 8 acknowledged by all nodes, the latest version becomes 8 everywhere and T1 commits)

Lazy Diff Application
(diagrams, build steps: each page keeps a queue of version-tagged diffs; T2, running at snapshot version 2, reads pages P1 and P2 and applies only the queued diffs up to v2; T3, running at snapshot version 8, reads page PN and applies the queued diffs up to v8)

Waiting Due to Conflict
(diagram: T3 at version 8 also needs page P2, whose older version T2 is still using; T3 waits until T2 commits)

Transaction Abort Due to Conflict
(diagram: T3 at version 8 reads P2, forcing the page forward to v8; the older version that T2's snapshot requires is no longer available, so a conflict is declared and T2 aborts)
Write-Write Conflict Resolution
- Can be done in two ways:
  - Executing all updates on a master node, which enforces a serialization order, OR
  - Aborting the local update transaction upon receiving a conflicting diff flush
- More on this in the paper

Experimental Platform
- Cluster of dual AMD Athlon computers
  - 512 MB RAM
  - 1.5 GHz CPUs
  - RedHat Fedora Linux OS

Benchmarks for Experiments
- TPC-W e-commerce benchmark
  - Models an on-line book store
  - Industry-standard workload mixes:
    - Browsing (5% updates)
    - Shopping (20% updates)
    - Ordering (50% updates)
  - Database size of ~600 MB
- Hash-table micro-benchmark (in paper)

Application of DTM for E-Commerce
(diagram: customers connect over the Internet via HTTP to web servers, which call app servers via RPC; the app servers issue SQL against the database tier)
- We use a Transactional Memory Cluster as the DB tier
(diagram: cluster architecture)

Implementation Details
- We use MySQL's in-memory HEAP tables
  - RB-tree main-memory index
  - No transactional properties; provided by inserting TM calls
- Multiple threads running on each node

Baseline for Comparison
- State-of-the-art conflict-aware protocol for scaling e-commerce on clusters
- Coarse-grained (per-table) concurrency control (USITS'03, Middleware'03)

Throughput Scaling
(chart: throughput in WIPS versus number of slave replicas, 0 through 8, for the Ordering, Shopping, and Browsing mixes)

Fraction of Aborted Transactions

  # of slaves   Ordering   Shopping   Browsing
  1             1.15%      1.44%      0.63%
  2             0.35%      2.27%      1.34%
  4             0.07%      1.70%      2.37%
  6             0.02%      0.41%      2.07%
  8             0.00%      0.22%      1.59%

Comparison (browsing)
(chart: throughput in WIPS versus number of replicas, DTM versus the conflict-aware baseline)

Comparison (shopping)
(chart: throughput in WIPS versus number of replicas, DTM versus the conflict-aware baseline)

Comparison (ordering)
(chart: throughput in WIPS versus number of replicas, DTM versus the conflict-aware baseline)

Related Work
- Distributed concurrency control for database applications
  - Postgres-R(SI), Wu and Kemme (ICDE'05)
  - Ganymed, Plattner and Alonso (Middleware'04)
- Distributed object stores
  - Argus ('83), QuickStore ('94), OOPSLA'03
- Distributed Shared Memory
  - TreadMarks, Keleher et al. (USENIX'94)
  - Tang et al. (IPDPS'04)

Conclusions
- New software-only transactional memory scheme on a cluster
- Fine-grained distributed concurrency control
- Both strong consistency and scaling
- Exploits distributed versions, low memory overheads
- Improved throughput scaling for e-commerce web sites

Questions?

Backup Slides

Example Program

  #include <dtm_types.h>

  typedef struct Point {
      dtm_int x;
      dtm_int y;
  } Point;

  init_transactions();
  for (int i = 0; i < 10; i++) {
      begin_transaction();
      Point *p = allocate_dtmemory();
      p->x = rand();
      p->y = rand();
      commit_transaction();
  }

Query Weights
(chart: fractions of reads and writes per query type for the Ordering, Shopping, and Browsing mixes, with and without an index: Ord Idx (0.35), Shp Idx (0.1), Brw Idx (0.03), Ord No Idx (0.26), Shp No Idx (0.07), Brw No Idx (0.02))

Decreasing the Fraction of Aborts
(chart: fraction of aborts, ranging from 1.34% to 2.83%, for master-plus-slaves configurations M+2S through M+8S, each with and without the conflict-reduction optimization)

Micro-benchmark Experiments
(chart: throughput x1000 versus number of machines, 1 through 10, for update fractions of 1%, 5%, 10%, 15%, and 20%)

Micro-benchmark Experiments (with read-only optimization)
(chart: throughput x1000 versus number of machines, 1 through 10, read-only optimization versus base)

Fraction of Aborts

  # of machines   1    2      4      6      8      10
  % aborts        0    0.57   1.69   2.94   4.05   5.08