Towards Understanding Modern Web Traffic
Sunghwan Ihm and Vivek S. Pai
Google Inc. / Princeton University

Web Changes and Growth
  From simple static documents to complex rich media applications
    Heavy client-side interactions (e.g., Ajax)
  Traffic increase
    Social networking, file-sharing, and video streaming sites
  Trends expected to continue
    Applications migrated to the Web
    A de facto standard interface of cloud services

Sunghwan Ihm, Princeton University

Understanding Changes
  Goal: shape system design by better understanding the traffic optimization opportunities
    Improve response times
    Understand caching effectiveness
    Design intermediary systems: firewalls, security analyzers, and reporting/monitoring systems

Challenges
  Tracking changes
    Requires a large-scale data set spanning many years, collected under the same conditions
  Web page analysis
    Requires new analysis techniques suitable for dynamic Web pages with client-side interactions (e.g., Ajax)
  Redundancy and caching
    Requires full content instead of simple access logs for assessing implications of content-based caching
  We address these challenges by
    1. Analyzing large-scale data with full content
    2. Developing a new Web page analysis technique

CoDeeN Traffic
  CoDeeN content distribution network (CDN)
    http://codeen.cs.princeton.edu/
    A semi-open globally distributed open proxy on 500+ PlanetLab nodes
    Running since 2003
    30+ million requests per day

Data Collection
  [Figure: data collection diagram with the labels User, Browser Cache, Local Proxy Cache, CoDeeN Cache, WAN, Origin Web Server; collected data: Access Logs and Full Content]
  Assume local proxy caches
  1. Access logs (all requests, but limited information)
     URL, Timestamp, Content-Length, Content-Type, Referer, etc.
  2. Full content (cache misses)
     Header + body

Data Set
  5 years: from 2006 to 2010
    Focus on one month (April) per year
    Full content data only for 2010
  Total volume per month
    3.3~6.6 TB
    280~460 million requests
    240~360K unique client IPs (40~60% /8 nets)
    168~187 countries and regions
    820K~1.2 million servers
  Focus on US, CN, FR, BR: 100M+ requests / 1TB+ / 100K+ users

Analysis Outline
  1. High-level analysis
  2. Page-level analysis    [Access Logs]
  3. Caching analysis       [Full Content]

1. High-Level Analysis
  Q: What has changed over five years?
    Connection speed
    NAT usage
    Max # concurrent browser connections
    Content type
    Object size
    Traffic share of Web sites

Content Type
  [Figure: content type share in % requests; US, 2006 to 2010, X and Y both log-scale; types shown: html, css, xml, javascript, image, video, flv video, octet, other; other panels include FR and (c) BR; caption fragment: "We observe growth of Flash video, JavaScript, ..."]
  A sharp increase of Ajax: JavaScript / CSS / XML
  A sharp increase of Flash video (FLV) (<5% to 25%)

Traffic Share of Web Sites
  Increase in video sites' traffic
  Increase in ad networks and analytics sites' requests (~12%)
    Ad networks market growth
  Most accessed sites by users: search / analytics
    google.com, baidu.com, google-analytics.com
    % user share increasing, tracking up to 65%

2. Page-Level Analysis
  Q: How have Web pages changed?
    New page detection heuristic
    Initial page characteristics
      Page size / # of embedded objects / latency
    Page load latency simulation
    Entire page characterization

Page Detection Problem
  Given a set of access logs, detect the page boundaries
  [Figure: requests on a time axis, grouped into pages; each page has a main object followed by embedded objects]
  # of embedded objects, page size, time, etc.
  Challenge: previous approaches from the 1990s are a poor fit and inaccurate for modern Web traffic

Previous Approach #1: Time-based
  Check idle time between requests
  If within a threshold (e.g., 1 second), they belong to the same page
  Misclassifies client-side interactions (Ajax) with longer idle time as pages

Previous Approach #2: Type-based
  Check file extension / content type
  Regard every html object as a main object
  Misclassifies frames/iframes within a page as separate pages

Algorithm
  (slide callouts: Ajax, frames/iframes)
  1. Group logs into streams by Referer field
  2. Consider all html objects as main object candidates (Type-based)
  3. Ignore those with no children (embedded objects)
  4. Apply idle time among the candidates for finalizing selection (Time-based)

Validation
  Ground truth: browse Alexa's top 100 sites
    Visit about 10 pages per site
    Record Web page URLs (main objects)
    Total 1197 pages
  Precision = # correct pages found / # total pages found
  Recall = # correct pages found / # total correct pages

Validation Result
  [Figure: recall (Y axis, 0 to 1) vs. precision (X axis, 0 to 1) for Time, Type, Time+Type, and StreamStructure; point labels include 1 sec, 4~24, 19~30, and 26~33]
  StreamStructure outperforms other approaches
  Robust to the idle time parameter selection
  Figure 8: Precision and recall: StreamStructure outperforms previous page detection algorithms, simultaneously achieving high precision and recall.
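A minimal Python sketch of the four algorithm steps and the precision/recall definitions above. It is an illustration under stated assumptions, not the authors' implementation: the LogEntry fields, the function names, the restriction to a single client's logs, and the way idle time is measured (gap since the previous request of any kind) are all hypothetical choices made here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LogEntry:
    """One access-log record; field names are hypothetical."""
    timestamp: float              # seconds
    url: str
    content_type: str
    referer: Optional[str] = None


def detect_main_objects(logs: List[LogEntry], idle_time: float = 1.0) -> List[str]:
    """Return the URLs selected as page main objects, in time order."""
    ordered = sorted(logs, key=lambda e: e.timestamp)

    # Step 1: relate requests through the Referer field (parent URL -> child URLs).
    children = defaultdict(list)
    for entry in ordered:
        if entry.referer:
            children[entry.referer].append(entry.url)

    mains = []
    prev_ts = None
    for entry in ordered:
        # Steps 2 and 3: html objects that have at least one child are candidates.
        is_candidate = "html" in entry.content_type and entry.url in children
        # Step 4 (assumed form): keep a candidate only if it follows an idle period.
        idle_ok = prev_ts is None or entry.timestamp - prev_ts >= idle_time
        if is_candidate and idle_ok:
            mains.append(entry.url)
        prev_ts = entry.timestamp
    return mains


def precision_recall(found: List[str], truth: List[str]) -> Tuple[float, float]:
    """Precision = correct / found, recall = correct / ground truth."""
    found_set, truth_set = set(found), set(truth)
    correct = len(found_set & truth_set)
    precision = correct / len(found_set) if found_set else 0.0
    recall = correct / len(truth_set) if truth_set else 0.0
    return precision, recall


# Made-up example: two real pages, one iframe, one Ajax call, one childless html.
example_logs = [
    LogEntry(0.0, "http://a.com/", "text/html"),
    LogEntry(0.1, "http://a.com/style.css", "text/css", "http://a.com/"),
    LogEntry(0.2, "http://a.com/frame.html", "text/html", "http://a.com/"),
    LogEntry(0.3, "http://a.com/ad.png", "image/png", "http://a.com/frame.html"),
    LogEntry(5.0, "http://a.com/api", "text/xml", "http://a.com/"),
    LogEntry(9.0, "http://a.com/page2.html", "text/html", "http://a.com/"),
    LogEntry(9.1, "http://a.com/img.png", "image/png", "http://a.com/page2.html"),
    LogEntry(9.2, "http://a.com/empty.html", "text/html", "http://a.com/page2.html"),
]
example_mains = detect_main_objects(example_logs)
print(example_mains)
print(precision_recall(example_mains, ["http://a.com/", "http://a.com/page2.html"]))
```

In this made-up trace the iframe (frame.html) is rejected by the idle-time check, the Ajax call is never a candidate because it is not html, and the childless html object is dropped, which are the cases the Time-based and Type-based approaches each get wrong on their own.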
StreamStructure is also robust to the idle time parameter selection.

Identifying Initial Page Loads
  [Figure: a page visit split into the Initial Page Load followed by Client-side Interactions (e.g., Ajax)]
  Initial page: user-perceived latency, traffic/revenue of Web sites
  Apply Time-based approach, but DNS lookup or browser processing time can vary significantly
  Use Google Analytics beacon
    JavaScript collecting various client-side information
    Fires when documents are loaded
  40-60% of traffic occurs after initial page loads

Initial Page Size and # Objects
  Initial pages become increasingly complex
  US: about 2x increase
    2006: 69 KB / 6 objects
    2010: 133 KB / 12 objects

Initial Page Load Latency
  Median latency dropped in 2009 and 2010
    Increased # of browser concurrent connections
    Reduced per-object latency from improved caching behavior / client bandwidth

3. Caching Analysis
  Q: Implications for caching?
    URL popularity
    Caching effectiveness
    Required cache storage size
    Impact of aborted transfers

Two Caching Approaches
  HTTP Object-based Approach
    Whole object
    HTTP-cacheable only
    Previously reported cache hit rate: 35~50%
    Byte hit rate usually much less
  Content-based Approach
    Cache smaller chunks instead of objects
    Protocol independent
    Effective for uncacheable content as well
    WAN accelerators, storage/file systems

Ideal Cache Hit Rate
  HTTP object-based: 17~28%
    Mainly effective for JavaScript and image
  Content-based: 42~51% with 128-byte chunks (1.8~2.5x)
    Effective for any content type
  Growth of tail that hurts caching

Origins of Redundancy
  [Figure: Origins of Redundancy (%), US, 128-byte chunks, broken down by content type (all, html, javascript, xml, css, image, octet, audio, video); categories: object-hit, intra-URL, aliasing, inter-URL; callouts: content updates, aborted]
  Most of the additional savings come from redundancy
    across different versions (intra-URL)
    across different objects (inter-URL)

Required Cache Storage Size
  [Figure: byte hit rate (%) vs. cache storage size (bytes, 100M to 200G and INF) for MRC, 1 KB, and HTTP; panel (b) CN; annotation "CN: 218GB"]
  1-KB outperforms 128-B w/ metadata overhead
  MRC: Multi-Resolution Chunking (USENIX'10)
  A large cache with MRC provides 2x the byte hit rate
  Increases working set size (fully downloaded)
  Large cache storage highly desirable

Conclusions
  Analyzed five years of real Web traffic with over 70,000 users
  Observed a rise of Ajax and Flash video, and search engine / analytics sites tracking 65% of users
  Developed StreamStructure
    Half of the traffic occurs due to client-side interactions after initial page loads
    Pages have become increasingly complex
  Content-based caching with large cache storage highly desirable
    2x larger byte hit rate, aborted transfers

Thank You
  sihm@cs.princeton.edu
  http://www.cs.princeton.edu/~sihm/
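The difference between the two caching approaches in the caching analysis can be illustrated with a small Python sketch. It is a simplified model, not the measurement code behind the reported numbers: it assumes an unlimited cache, fixed-size 128-byte chunks (a stand-in for whatever chunking a real content-based cache uses), SHA-1 digests as chunk names, and an object cache that hits only when the same URL returns an identical body. The function names and the sample responses are hypothetical.

```python
import hashlib
from typing import List, Tuple

Response = Tuple[str, bytes]  # (URL, response body)


def object_byte_hit_rate(responses: List[Response]) -> float:
    """Whole-object model: a hit needs the same URL with an identical body."""
    seen = set()
    hit_bytes = total_bytes = 0
    for url, body in responses:
        key = (url, hashlib.sha1(body).digest())
        total_bytes += len(body)
        if key in seen:
            hit_bytes += len(body)
        seen.add(key)
    return hit_bytes / total_bytes if total_bytes else 0.0


def chunk_byte_hit_rate(responses: List[Response], chunk_size: int = 128) -> float:
    """Content-based model: cache fixed-size chunks by digest, ignoring the URL."""
    seen = set()
    hit_bytes = total_bytes = 0
    for _url, body in responses:
        for start in range(0, len(body), chunk_size):
            chunk = body[start:start + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            total_bytes += len(chunk)
            if digest in seen:
                hit_bytes += len(chunk)
            seen.add(digest)
    return hit_bytes / total_bytes if total_bytes else 0.0


# Made-up traffic: an object, an updated version of it, and an alias under another URL.
version1 = b"A" * 128 + b"B" * 128 + b"C" * 128 + b"D" * 128
version2 = b"A" * 128 + b"B" * 128 + b"C" * 128 + b"E" * 128
sample = [
    ("/news", version1),
    ("/news", version2),         # content update: redundancy within one URL
    ("/mirror/news", version1),  # same bytes under a different URL
]
object_rate = object_byte_hit_rate(sample)
chunk_rate = chunk_byte_hit_rate(sample)
print(object_rate, chunk_rate)
```

In this toy input the whole-object model gets no hits, because the second response changed and the third uses a different URL, while the chunk model reuses every chunk it has already seen. That mirrors the intra-URL and inter-URL redundancy discussed under Origins of Redundancy, though the actual ratios in real traffic depend on the workload.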