PROGRAMMING SUPPORT AND ADAPTIVE CHECKPOINTING FOR HIGH-THROUGHPUT DATA SERVICES WITH LOG-BASED RECOVERY

Jingyu Zhou (Shanghai Jiao Tong Univ), Jiesheng Wu (Microsoft), Caijie Zhang (Google Inc.), Hong Tang (Yahoo!), Tao Yang (UC Santa Barbara)
DSN 2010

Backgrounds
• Many large-scale data-mining and offline applications at Google, Yahoo, Microsoft, Ask.com, etc. require
  • High data parallelism/throughput
  • Data persistence
  • But not so stringent availability
• E.g., URL property service (UPS) at the Ask.com search offline mining platform
  • Hundreds of app. modules access UPS

Examples of high-throughput data services for web mining/search
[Figure: Internet Web documents, crawlers, document DBs (10-50 billion URLs), data mining jobs, and a data/info service, e.g. URL property service, at 100K-500K/s]

Existing Approaches for High Performance and Persistence
• Database systems
  • Support general features but suffer from high overhead, which limits their performance
  • Need more machine resources
• Related work and well-known techniques for high availability
  • Data replication
  • Log-based recovery
  • Checkpointing

Challenges and Focus of this work
• System design with careful selection and integration of fault-tolerant techniques for high-throughput computing
  • Trade off availability: allow some downtime
  • Low cost: logging/checkpoint
  • Fine-grained for minimum service disruption
  • Local data recovery. Periodic remote backup.
• Programming support
  • Lightweight, simplifying construction of robust data services

SLACH: Selective Logging & Adaptive CHeckpointing
• Targeted data services
  • Request-driven thread model
  • In-memory objects
  • Data independence
  • Similar to key-value stores in BigTable/Dynamo, but higher throughput

Architecture of SLACH

Main Techniques
• Selective operation logging
  • Only log write operations: (oid, op_type, parameters, timestamp)
  • Write-ahead log, i.e., write the log record first, then apply the operation
• Object-level checkpoint to avoid service disruptions, with adaptive load control
  • Ckpt. objects one by one; still allow concurrent access to other objects
  • Perform checkpointing when load is low to amortize the cost of checkpointing
• Lightweight API while supporting legacy code

Object-level Checkpoints

Adaptive Checkpointing Control
• Goal is to balance ckpt. cost and recovery speed
  • Ckpt. less frequently -> larger logs -> lengthy recovery
  • Ckpt. too often -> higher overhead
• Ideally,
  • High server load -> ckpt. less frequently
  • Low server load -> ckpt. more frequently
• Adjust between a Low Watermark (LW) & High Watermark (HW) of service loads
  • load_curr = α × load_prev + (1 − α) × sample

Adaptive Checkpointing Frequency
• Ckpt. threshold between LB and UB
  • LB, UB are log size parameters, determined by app.
• Threshold = LB + F(load) × (UB − LB), where F(load) is a load-dependent factor in [0, 1]

SLACH Programming Support
• Application developers
  • Call SLACH function log() to log an object operation
  • Define 3 callback functions: 1) what to checkpoint (call SLACH's ckpt() for each selected object), 2) recover one object from a checkpoint, 3) replay a log operation
• SLACH
  • Provides functions log() and ckpt()
  • Calls the user's checkpoint callback function during checkpointing
  • Calls the user's recover function during checkpoint recovery
  • Calls the user's replay function when recovering from a log

SLACH API for Applications

    class SLACH::API {
    public:
        /* register ckpt. policy and parameters */
        void register_policy(const Policy& p);
        /* log one write operation */
        void log(int64_t obj_id, int op, ...);
        /* checkpoint one object */
        void ckpt(int64_t obj_id, const void* addr, uint32_t size);
    };

SLACH Interface

    class SLACH::Application {
        …
    protected:
        /* application checkpoint callback function */
        virtual void ckpt_callback() = 0;
        /* callback of loading one object checkpoint */
        virtual void load_one_callback(int64_t obj_id, const void* addr,
                                       uint32_t size) = 0;
        /* callback of replaying one operation log */
        virtual void replay_one_callback(int64_t obj_id, int op,
                                         const para_vec& args) = 0;
    };

An Example: Application-level code

    struct Item {
        double price;
        int quantity;
    };

    class MyService : public SLACH::Application {
    private:
        Item obj[1000];                /* application objects being accessed */
        SLACH::API slach_;             /* SLACH API */
        static const int OP_PRICE = 0; /* an op type */
    public:
        void update_price(int id, double p) {
            /* log the selected object update operation, then apply it */
            slach_.log(id, OP_PRICE, &p, sizeof(p));
            obj[id].price = p;
        }
        /* callback functions follow */

An Example: Call-back functions

        /* SLACH calls this user function during checkpointing. */
        void ckpt_callback() {
            for (int i = 0; i < 1000; i++)
                slach_.ckpt(i, &obj[i], sizeof(obj[i]));
        }

        /* SLACH calls this when recovering an object from a checkpoint
           (memcpy needs <cstring>). */
        void load_one_callback(int64_t id, const void* p, uint32_t size) {
            memcpy(&obj[id], p, size);
        }

        /* SLACH calls this when recovering an object by log replaying. */
        void replay_one_callback(int64_t id, int op, const para_vec& args) {
            switch (op) {
            case OP_PRICE:
                obj[id].price = *static_cast<const double*>(args[0].second);
                break;
            // ...
            }
        }
    };

SLACH Implementation and Applications
• Part of the Ask.com middleware infrastructure in C++ for the data mining and search offline platform
• Application samples:
  • UPS (URL property service) for recording properties of all URLs crawled/collected
  • HIS (Host information service) for recording properties of all hosts crawled on the web
  • 20-80% of write traffic
• Running on a cluster of hundreds of machines
• In production for the last 3 years
• Significantly reduced development time (from 1-2 months to a few days)

Characteristics of UPS/HIS

Perf. characteristics of UPS/HIS per partition:

         Data     Max. Read     Max. Write
  UPS    1.9GB    110K Req/s    56K Req/s
  HIS    2.1GB    58K Req/s     16K Req/s

Parameters for adaptive ckpt. control:

  Parameter   Meaning              UPS              HIS
  α           Moving avg.          0.8              0.8
  LB/UB       Lower/upper bound    1M-8M entries    0.3M-1.8M
  LW/HW       Low/high watermark   20%-85%          35%-85%
  β           Scaling              3                6
  w           Sampling window      5s               5s

Evaluation
• Impact of logging overhead
• System behavior during checkpointing
• Effectiveness of adaptive checkpoint control
• Performance comparison of hash table implementations using SLACH and BerkeleyDB

Evaluation Setting
• Benchmarks
  • UPS (URL property service)
  • HIS (Host-level property service)
  • Persistent Hash Table (PHT)
• Metric: throughput loss percent
  • LossPercent = (1 − SuccessfulRequests / TotalRequests) × 100
• Hardware: a 15-node cluster, gigabit link

Selective Logging Overhead of UPS
• Base: logging is disabled
• Log: selective logging is enabled
• Negligible impact when server load < 40%

System Performance During Checkpointing (100% server load)
• During ckpt, 8.9% throughput drop
• During ckpt, 57.6% increase in response time

Effectiveness of Adaptive Threshold Controller: Performance Comparison in UPS
• Among fixed threshold policies, 8M has lower runtime overhead (less frequent ckpt)
• Adaptive approach has performance comparable to the fixed policy of 8M

Effectiveness of Threshold Controller: Recovery Speed
• Fixed threshold -> fixed log size -> same recovery time
• Adaptive approach: small log for light load (less recovery time), large log for higher load (more recovery time)

PHT vs. Berkeley DB
• SLACH is better for all value sizes, because
  1. BDB incurs more per-operation overhead
  2. BDB involves more disk I/Os
• For 30-B values, SLACH is 5.3 times higher
• SLACH ckpt has less overhead
  1. BDB ckpt is not async
  2. SLACH fuzzy ckpt still allows access

Conclusions
• SLACH contributions
  • A lightweight programming framework for very high-throughput, persistent data services
    • Simplifies application construction while meeting reliability demands
    • Selective logging to enhance performance
  • System design with careful integration of multiple techniques
    • Dynamically adjusts ckpt. frequency to meet throughput demands
    • Fine-grained ckpt without service disruptions
  • Evaluation of the integrated scheme in production applications

Data and Failure Models
• Data independence and object-oriented access model
  • Key-value store as in Dynamo/BigTable, but with much higher throughput demand per machine
  • Each object is a contiguous memory block
  • Middleware infrastructure can handle noncontiguous ones
• Fail-stop
  • Focus on local recovery due to app. failures
  • OS/hardware failures can be dealt with by remote ckpt. (implemented, but not in the scope of this paper)
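
Sketch: adaptive checkpoint threshold

To make the adaptive checkpointing control slides concrete, here is a minimal C++ sketch of the two formulas given there: the moving-average load and the threshold between LB and UB. The slide text does not give the definition of F(load), so load_factor() below is a hypothetical stand-in (a clamped ramp between LW and HW, with β used as an exponent purely for illustration). The struct and function names are also assumptions, not part of the SLACH API, and triggering a checkpoint once the log reaches the threshold is one reading of the slides.

```cpp
#include <algorithm>
#include <cmath>

// Parameters named after the slides. Loads are fractions (0.20 = 20%),
// log-size bounds are counted in log entries.
struct CkptParams {
    double alpha;  // moving-average weight
    double lw, hw; // low / high watermark of server load
    double lb, ub; // lower / upper bound of the log-size threshold
    double beta;   // scaling parameter
};

// Moving average from the slide:
// load_curr = alpha * load_prev + (1 - alpha) * sample.
double smooth_load(const CkptParams& p, double load_prev, double sample) {
    return p.alpha * load_prev + (1.0 - p.alpha) * sample;
}

// ASSUMPTION: hypothetical F(load). It is 0 at or below LW, 1 at or above
// HW, and a power ramp in between. The original definition is not in the
// slide text and may differ.
double load_factor(const CkptParams& p, double load) {
    double x = (load - p.lw) / (p.hw - p.lw);
    x = std::max(0.0, std::min(1.0, x)); // clamp to [0, 1]
    return std::pow(x, p.beta);
}

// Threshold from the slide: LB + F(load) * (UB - LB).
double ckpt_threshold(const CkptParams& p, double load) {
    return p.lb + load_factor(p, load) * (p.ub - p.lb);
}

// ASSUMPTION: a checkpoint is started once the log holds at least
// `threshold` entries, so a low load (small threshold) checkpoints sooner.
bool should_checkpoint(const CkptParams& p, double load, double log_entries) {
    return log_entries >= ckpt_threshold(p, load);
}
```

With the UPS parameters from the table (α = 0.8, LW/HW = 20%/85%, LB/UB = 1M/8M entries, β = 3), this stand-in yields a threshold of 1M entries at or below 20% load and 8M entries at or above 85% load, which matches the stated intent of checkpointing more often when the server is lightly loaded.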