S3: A Secure Scalability Service for Dynamic Content

Bruce Maggs
Carnegie Mellon University and Akamai Technologies
Joint work with Charlie Garrod and Amit Manjhi, and Natassa Ailamaki, Phil Gibbons, Todd Mowry, Chris Olston, and Anthony Tomasic.

The number of requests a website receives is unpredictable

[Figure: CNN.com page views/day (in millions), usual day vs. 9/11*]
• CNN, NY Times, ABC News unavailable from 9-10 AM (Eastern Time)
• Content providers’ dilemma: how many resources to provision?
• Need on-demand scalability

Content Delivery Network (CDN) Solution

[Figure: CNN.com page views/day (in millions) on a normal day, on 12-Sep-01, and on Election day (Nov 2), 2004; bars annotated with page sizes of 50k and 1.2k]
• Page was 1.2k instead of 50k on 12 Sep 01
• Used Akamai on Election day
Source: http://www.tcsa.org/lisa2001/cnn.txt
        http://www.akamai.com/en/html/about/press/press479.html

Typical Web-Site Architecture

[Figure: users send a request to the home server (web server, app server, DB); the app server executes code and accesses the DB; a response returns to the users]

CDN Architecture

[Figure: users, CDN nodes, Internet core, content providers]
• CDNs excel at delivering static content.

Advantages of CDNs

• Large infrastructure handles load spikes
• Clients charged on a per-usage basis
  – no need to guess what resources to provision
• Moves data closer to end-users
  – decreases latency and increases throughput

CDN Application Services

[Figure: users, Internet, DB]
• CDNs can also run applications
• but for data-intensive dynamic applications… the database server becomes the bottleneck!
Methods to scale the database component

• In-house database scalability: [DBCache, DBProxy, MTCache, NEC Cache Portal]
  – Must provision for peak load
• Database outsourcing: Database as a service [Hacigumus+ ICDE ’02, SIGMOD ’02]
  – Have to cede control of data
• Database Scalability Service (DBSS): Shared infrastructure that caches applications’ data [INRIA/LIP6, CIDR ’05, SIGMOD ’06, ICDE ’07]

S3 Database Scalability Service

• CDN-like proxy nodes cache results of database queries
  – reduces load on central database servers
• All database updates sent to central server
  – clients don’t cede ownership of their data
• Uses publish/subscribe system to maintain data consistency
  – avoids additional load at the central server
• Content provider may encrypt database requests/responses to protect sensitive data

Database Scalability Service

[Figure labels: users, Content Delivery Network, DBSS, Internet, home server databases]

Database Scalability Service

[Figure labels: users, Internet, Web and application servers, DBSS, home server databases]

Database Scalability Service

[Figure labels: client apps, DBSS, Internet, home server databases]

Outline

• Need for on-demand scalability
• S3 invalidation mechanism
• Security-scalability tradeoff
• Reducing latency

Addressing consistency

• TTL is wasteful:
  – Often refresh cached data unnecessarily (workloads dominated by reads)
  – Must set TTL=0 for strong consistency!
• Solution: update or invalidate cached data only when affected by updates
  – Naïve approach: home organizations notify proxy servers of relevant updates (not scalable)
  – Our approach: fully distributed, proxy-to-proxy update notification mechanism

Distributed Consistency Mechanism

[Figure: users send an update to a proxy node; update notifications travel between proxy nodes through the multicast environment]
• Distributed app-level multicast environment, e.g., Scribe
• Forward all updates to backend home servers

Configuring Multicast Channels

• Key observation: Web applications typically interact with DB via a small, fixed set of query/update templates (usually 10-100)
• Example:
    SELECT qty FROM inv WHERE id = ?
    UPDATE inv SET qty = ? WHERE id = ?
• Templates: natural way to configure channels
• Options: Channel-by-query or Channel-by-update

Channel-by-Query Option

• One channel per query template Q: C(Q)
  – Begin caching result(s) of query template Q: subscribe to C(Q)
  – Evict only query result for Q: unsubscribe from C(Q)
  – Issue update: determine which query templates Q1, …, Qn are affected; send notification on each C(Qi)
• Few subscriptions/cached result
• Many invalidation notifications/update
• Conflicts determined lazily (upon update)

Channel-by-Update Option

• One channel per update template U: C(U)
  – Begin caching result(s) of query template Q: determine which update templates U1, …, Un apply; subscribe to each C(Ui)
  – Evict only query result for Q: unsubscribe from all C(Ui) above
  – Issue update using template U: send notification on C(U)
• Many subscriptions/cached result
• Few invalidation notifications/update
• Conflicts determined eagerly (when caching Q)

Parameter-Specific Channels

• Optimization: consider parameter bindings supplied at runtime … for example:
• Q5: SELECT qty FROM inv WHERE id = ?
  – When issued with id = 29, create extra parameter-specific channel C(5, 29)
  – Subscribe to both C(5) and C(5, 29)
• Upon update:
  – If update affects a single item with id = X, send notification on channel C(5, X)
    • Saves work if X ≠ 29
  – Updates affecting multiple items sent to C(5)

S3 Prototype

• Tomcat as proxy web server/servlet container
• Proxy database cache written in Java
• Queries: access cached data when possible
  – Cache JDBC query results (i.e., materialized views)
  – Index results by JDBC query representation
• MySQL4 as back-end database
• Updates: sent to back-end database
• Invalidation notifications delivered via Scribe
• Experiments on Emulab (Utah) – Thanks!

Benchmark Applications

• Bookstore (TPC-W, from UW-Madison)
  – Online bookseller, a standard web benchmark
  – Changed the popularity of books
• Auction (RUBiS, from Rice)
  – Modeled after eBay
• Bulletin board (RUBBoS, from Rice)
  – Modeled after Slashdot
Benchmarks model popular websites

Selective: cache queries only if subscribed to parameter-dependent groups

Impact of Cooperative Caching

[Figure: throughput (WIPS) of NoProxy, NoCache, SimpleCache, and Ferdinand on the bookstore browsing mix, the bookstore shopping mix, and auction]

Outline

• Need for on-demand scalability
• S3 invalidation mechanism
• Security-scalability tradeoff
• Reducing latency

Guaranteeing security in a DBSS setting

• Limit ability to observe an application’s data by:
  – DBSS administrator
  – Unauthorized application through the DBSS
• Security-scalability tradeoff in the DBSS setting
• Analyzing the code helps in managing this tradeoff

A simple solution for guaranteeing security

• Outsource database scalability
  – Home server: master copies of all data; handles updates directly
• No query execution on the DBSS
  – DBSS caches query results (read-only), kept consistent by invalidation
• All data passing through the DBSS can be encrypted: query, update, query results

A Simple Example

toys (toy_id, toy_name):
  11  Barbie
  15  GI Joe

Q1: SELECT toy_id FROM toys WHERE toy_name="GI Joe"
U1: DELETE FROM toys WHERE toy_id=5

[Figure: DBSS cache in front of the home server database, shown for two cases]
• Nothing is encrypted: the DBSS sees the cached result Q1: toy_id=15, so U1 causes no invalidations
• Results are encrypted: the DBSS sees only an opaque result for Q1, so U1 invalidates it

More encryption leads to more invalidations

Challenge: providing scalability while guaranteeing security

• When updates occur, DBSS needs to invalidate
• Application faces a dilemma in what data to encrypt (secure)
  – More encryption: conservative invalidation (security)
  – Less encryption: precise invalidation (scalability)
• Security-scalability tradeoff

Opportunity for managing the tradeoff

• Not all data is equally sensitive
• Data sensitivity:
  – Completely insensitive (bestsellers list): don’t care
  – Moderately sensitive (inventory records, customer records): care but worried about scalability impact
  – Extremely sensitive (credit card information): secure at all costs
• But for most data, nontrivial to assess:
  1. Data sensitivity
  2. Scalability impact of securing the data

Key Insight: arbitrary queries and updates not possible

    function get_toy_id ($toy_name) {
      $template:="SELECT toy_id FROM toys WHERE toy_name=?";
      $query:=attach_to_template ($template, $toy_name);
      execute ($query);
      …
    }

Given templates: can statically identify data not needed for precise invalidation

Data not useful for invalidation: examples

Example 1:
  Q1: SELECT toy_id FROM toys WHERE toy_name=?
  Q2: SELECT toy_name FROM toys WHERE toy_id=?
  No data is needed for precise invalidation

Example 2:
  Q1: SELECT toy_id FROM toys WHERE toy_name=?
  U1: DELETE FROM toys WHERE toy_id=?
  Query parameters are not needed for precise invalidation (the query result is needed though)

Security without hurting scalability

• Data not needed for invalidation: can secure “for free” (without hurting scalability)
• Scalability Conscious Security Approach [SIGMOD ’06]
• As a result, the tradeoff has to be managed only over the remaining data

Sample experiment: methodology

• Scalability: max # concurrent users with acceptable response times
• Security: # templates with encrypted results
[Figure: users, CDN and DBSS, home server, with link latencies of 5 ms and 100 ms]
• California Privacy Law determined sensitive data
• Non-transactional invalidation
• Start with a cold cache

Benchmark Applications

• Bookstore (TPC-W, from UW-Madison)
  – Online bookseller, a standard web benchmark
  – Changed the popularity of books
• Auction (RUBiS, from Rice)
  – Modeled after eBay
• Bulletin board (RUBBoS, from Rice)
  – Modeled after Slashdot
Benchmarks model popular websites

Security-Scalability Tradeoff

Q1: SELECT toy_id FROM toys WHERE toy_name=?
Q2: SELECT qty FROM toys WHERE toy_id=?
Q3: SELECT cust_name FROM customers WHERE cust_id=?
U1: DELETE FROM toys WHERE toy_id=5

Level       Template  Parameters  Query result  Invalidations
Blind       X         X           X             All Q1, Q2, Q3
Template    visible   X           X             All Q1, Q2
Statement   visible   visible     X             All Q1; Q2 with toy_id=5
View        visible   visible     visible       Q1 with toy_id=5; Q2 with toy_id=5

X denotes encrypted; all other entries are visible. Security is highest at Blind, scalability at View.

Magnitude of Security-Scalability tradeoff

[Figure: scalability (number of concurrent users supported) at the View, Statement, Template, and Blind levels for the Auction, Bboard, and Bookstore benchmark applications]

Security Results

[Figure: query data that can be encrypted “for free” in Auction, Bboard, and Bookstore: parameters and result, result only, or nothing]

Security Results in Detail

• Auction: The historical record of user bids was not exposed
• Bboard: The rating users give one another based on the quality of their posting
• Bookstore: Book purchase association rules discovered by the vendor
  – customers who purchase book A also purchase book B

Scalability Conscious Security Approach (SCSA) to managing the tradeoff

[Figure: scalability (number of concurrent users supported) vs. security (number of query templates with encrypted results), with points for nothing encrypted, SCSA, and everything encrypted]
1. Easy to either get good scalability or good security
2. SCSA presents a shortcut to manage the tradeoff

Outline

• Need for on-demand scalability
• S3 invalidation mechanism
• Security-scalability tradeoff
• Reducing latency

Contributors to User Latency

[Figure: Traditional architecture (web server, app server, database): the request and response each incur high latency, for a single HTTP request. DBSS architecture (CDN, DBSS, database): the high-latency link carries multiple database requests.]

Sample Web Application Code

    function find_comments ($user_id) {
      $template:="SELECT from_id, body FROM comments WHERE to_id=?"
      $query:=attach_to_template ($template, $user_id)
      $result:=execute ($query)
      foreach ($row in $result)
        print (get_body ($row), get_name (get_id ($row)))
    }

(N+1) queries are issued because:
• Convenient for programmers to abstract database values
• No effect in the traditional setting
Found many examples in the benchmark applications

Reducing User Latency in a DBSS Setting

Transformations to reduce number of round-trips
1. Group execution of queries: MERGING transformation
2. Overlap execution of queries: NONBLOCKING transformation
[Figure: web application code (procedural program with embedded SQL) is rewritten into transformed code (transformed program and SQL)]
Holistic transformations using source-to-source compilers

The MERGING Transformation

Example (www.ebay.com): names of users who have posted comments about John
1. Find user_ids who have made comments (1 query)
2. For each user_id, find name of the user (N queries)
[Figure: Content Delivery Network and Database Scalability Service, separated from the database by a high-latency link]

The MERGING Transformation

Find names of users who have commented about John, merging steps 1 and 2 into one query:
    SELECT from_id, u.name FROM comments, users u WHERE from_id = u.id AND to_id = ?
Assuming constant cache hit rate, the number of round-trips to the database decreases by a factor of (N+1)

The NONBLOCKING Transformation

Example (www.amazon.com): John’s home page
1. Greet user
2. Get names of related books
[Figure: Content Delivery Network and Database Scalability Service, separated from the database by a high-latency link]
Issue queries concurrently to reduce latency

Applicability of the Transformations

[Figure: % of dynamic runtime interactions to which MERGING, NONBLOCKING, or EITHER applies, for Auction, Bboard, and Bookstore]
Either transformation applies to 25% (Auction), 75% (Bboard), and 50% (Bookstore) of dynamic runtime interactions

BBOARD Application: Impact on Latency

[Figure: average latency in ms, split into database, DBSS-DB, and client-DBSS latency, with no transformations vs. both transformations]
Overall latency decreases by 38%; the DBSS-DB latency decreases by 65%

Impact of Latency on Scalability

[Figure: latency vs. simultaneous users supported; the reduced latency curve crosses the scalability threshold at a higher user count than the original latency curve]
Reducing latency improves scalability

Effect of the Transformations on Scalability

[Figure: scalability (number of concurrent users supported)]
Applying both transformations yields the best scalability
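
Illustration: the MERGING transformation on the find_comments pattern

The (N+1)-query pattern and its merged form can be compared in a small, self-contained sketch. This is not the source-to-source compiler from the talk. It only contrasts the two query shapes, using Python's built-in sqlite3 module as a stand-in for the home database. The schema, the sample rows, and the helper names (make_db, names_n_plus_one, names_merged) are assumptions made for illustration, and "round trips" are simply counted per executed query.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database (hypothetical schema and rows)."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE comments (from_id INTEGER, to_id INTEGER, body TEXT);
        INSERT INTO users VALUES (1, 'John'), (2, 'Alice'), (3, 'Bob');
        INSERT INTO comments VALUES (2, 1, 'great seller'), (3, 1, 'fast shipping');
        """
    )
    return db


def names_n_plus_one(db, user_id):
    """Original pattern: one query for the comments, then one per commenter."""
    round_trips = 1
    rows = db.execute(
        "SELECT from_id, body FROM comments WHERE to_id = ?", (user_id,)
    ).fetchall()
    names = []
    for from_id, _body in rows:
        round_trips += 1  # one extra query for each comment row
        (name,) = db.execute(
            "SELECT name FROM users WHERE id = ?", (from_id,)
        ).fetchone()
        names.append(name)
    return sorted(names), round_trips


def names_merged(db, user_id):
    """MERGING: a single join replaces the N+1 separate queries."""
    rows = db.execute(
        "SELECT from_id, u.name FROM comments, users u "
        "WHERE from_id = u.id AND to_id = ?",
        (user_id,),
    ).fetchall()
    return sorted(name for _from_id, name in rows), 1


if __name__ == "__main__":
    db = make_db()
    print(names_n_plus_one(db, 1))  # N = 2 comments, so 3 round trips
    print(names_merged(db, 1))      # same names, 1 round trip
```

With two comments about John, the original pattern issues three queries and the merged form issues one, which matches the factor of (N+1) stated on the MERGING slide. In a DBSS setting every cache miss crosses the high-latency DBSS-DB link, so the saving matters more there than in the traditional architecture.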