Web Caching Elliot Jaffe Presentation for The Seminar on Database and Internet Hebrew University, Fall 2002 Agenda Caching: Why, Where, How, What Some empirical data: Zipf’s Law Content Delivery Networks Bibliography Why cache? Number of unique pages: 800M < X < 2.2B Number of unique web sites: 8,500,000 static pages: %30 - %40 pages revisited: %80 expected hit-rate: %24 - %32 Why cache? Bandwidth Latency Performance = Response Time Server Load Failure Redundancy Where Reverse Reverse Reverse Proxy Reverse Proxy Proxy Proxy Local ISP Content Content Content Content Server Server Server Server cache cdn L4 Switch cache Intranet cache cache cache Browser Browser Browser Data Center ISP cdn Hot-potato routing Get traffic off of your network as soon as possible Bounces traffic around the internet Increases chance of dropped packet Increases latency Destination You are here How: Types of Caches Simple Proxy Transparent Proxy Reverse Proxy Adaptive Caching Push Caching Active Caching Streaming Caches How: Simple Proxy Harvest/Squid Provide web content for a fixed user base Standalone operation May be transparent Commodity product/technology Easy to get 90% correct How: Transparent Proxy No client configuration Violates end-to-end paradigm Client thinks it is talking directly to server Server thinks it is talking to cache Implemented as Pass-through unit L4 switch How: Reverse Proxy Designed to offload duties from one or more specific servers Data size is limited to size of static content on the server Challenge is fast, disk-less operation Cache consistency is easy Single point of failure How: Adaptive Caching ISP Level caching Cooperating multiple distributed caches Operate as a cache-mesh based on content demand Multicast for group membership (GCS) Content Routing Protocol sends request to the appropriate cache within the mesh How: Push Caching Send the data out proactively Content Delivery Networks Paid for by data providers More on this later! How: Active Caching Use an applet inside of the cache to customize dynamic pages on the fly How do you identify dynamic pages? Where does the custom data come from? Who is going to pay for this service? How: Streaming Caches What about streaming content Movies Audio Proprietary streaming protocols Challenge is to maintain Quality of content and service Who pays for this? What: Content and Protocols Mostly Static Content HTML XML GIF AVI EXE Etc. What: Content and Protocols HTTP 1.0 Basic protocol Send Request based on fix number of verbs GET HEAD POST Receive response, meta-data, content What: Content and Protocols HTTP Request Request = Simple-Request | Full-Request Simple-Request = "GET" SP Request-URI CRLF Full-Request = Request-Line ; * ( General-Header ; | Request-Header ; | Entity-Header ) ; CRLF [ Entity-Body ] What: Content and Protocols Example: GET /pub/www/index.html HTTP/1.0 Response: HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 Date: Sat, 19 Oct 2002 05:46:53 GMT Expires: Sun, 20 Oct 2002 16:00:00 GMT Content-Length: 2291 Content-Type: text/html Cache-control: private What: Content and Protocols Example “if-modified-since”: GET /pub/www/index.html HTTP/1.0 If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT Response: HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 Date: Thu, 13 Jul 2000 05:46:53 GMT Expires: Sun, 20 Oct 2002 16:00:00 GMT Content-Length: 2291 Content-Type: text/html Cache-control: private What: Content and Protocols Example “if-modified-since”: GET /pub/www/index.html HTTP/1.0 If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT Response: HTTP/1.1 304 Not Modified Basic caching algorithm Pages may be Fresh: up-to-date Expired: current date > expiration date Stale: “old” Basic caching algorithm - #2 If (page is in the cache) if ( page is expired or stale ) Get from server - if-modified-since If not modified, Get from cache Get from cache Else Get from Server Basic caching algorithm - #3 If cache has space Store the file Else 1. Delete expired from cache 2. Delete stale from cache 3. Delete LRU from cache 4. Delete largest/smallest from cache? Agenda Caching: Why, Where, How, What Some empirical data: Zipf’s Law Content Delivery Networks Bibliography Zipf’s law Zipf’s law: The frequency of an event P as a function of rank i is a power law function: Pi = Ω / iα where α ≤ 1 Zipf’s law Observed to be true for Frequency of written words in English texts Population of cities Income of a company as a function of rank Zipf’s law and web access For a given server, page access by rank follows Zipf’s law Web requests from a fixed population of users follows Zipf’s law 0.64 < α < 0.83 Observations Top %1 of all documents account for %20 - %35 of proxy requests Top %10 account for %45 - %55 of requests It takes %25 to %40 of all documents to account for %70 of requests It takes %70 to %80 of all documents to account for %90 of requests Observations Observations For an infinite sized cache, the hit-ratio for a web-proxy grows in a log-like fashion as a function of the client population of the proxy and the number of requests seen by the proxy. Observations The hit-ratio of a web cache grows in a log- like fashion as a function of the cache size. Observations Locality of Reference The probability that a document will be referenced k requests after it was last referenced is roughly proportional to 1/k. Observations - NOT There is very little correlation between access frequency and document size There is no correlation between access frequency and the change rate of a document No single web server contributes to most of the popular pages Zipf’s Law and Caching Discussion How does this help in cache design? Are there any business implications? Agenda Caching: Why, Where, How, What Some empirical data: Zipf’s Law Content Delivery Networks Bibliography CDN “Traditional” CDN Dirty Secrets P2P content delivery systems Why use a CDN? Reverse Reverse Reverse Proxy Reverse Proxy Proxy Proxy Local ISP Content Content Content Content Server Server Server Server cache cdn L4 Switch cache Intranet cache cache cache Browser Browser Browser Data Center ISP cdn What is CDN? Content Deliver Networks = PUSH PUSH = Prefetch CDN Mechanisms DNS redirection Complete Partial URL rewrite A DNS-redirecting CDN DNS redirector example.com ? B Client Network Model HTTP server A HTTP B server GET http://example.com/foo http://example.com/foo HTTP server Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt C Original server CDN DNS Full Redirection (Semi)automatic mechanism to replicate original site on CDN servers Replace original DNS entry with enhanced DNS server that uses knowledge of network and server load to direct clients to appropriate CDN server TTL on DNS entries are very short Adero, NetCaching, IntelliDNS CDN DNS Partial Redirection Statically modify selected URL’s within pages to point to CDN service Replicate selected objects to CDN service Redirect clients of selected URL’s using enhanced DNS server that uses knowledge of network and server load Akamai, Digital Island, MirrorImage, SolidSpeed, Speedera CDN rewrite Modify pages at the origin server on the fly Change embedded URL’s based on up-to- date knowledge of the network and CDN server loads Does not require additional DNS lookups Fasttide, Clearway Measuring a CDN’s performance Two papers K.L.Johnson,J.F.Carr,M.S.Day,and M.F.Kaashoek,”The measured performance of content distribution networks,”in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop,(Lisbon,Portugal),May 2000. B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and Performance of Content Distribution Networks” in ACM SIGCOMM INTERNET MEASUREMENT WORKSHOP 2001. The measured performance of content distribution networks Client Actions R: Resolve domain name F: Fetch content Ordinary client use of CDN: RF Instead of doing (RF)+ we do R+ then F+ This allows us to compare the server chosen to some other servers that could have been chosen, over a large number of fetches. Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt The measured performance of content distribution networks Procedure R+: Collect a set of servers by repeated DNS queries to a variety of name servers over a number of hours F+: Fetch a particular piece of content from each member of the set, measuring latency Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt The measured performance of content distribution networks Important Details Interleaved fetches Fetch1 at server1, fetch1 at server2, etc. Not fetch1 at server1, fetch2 at server1, etc. Unmeasured fetch before measured fetch Avoids cache misses Measure only HTTP fetch latency CDN not penalized for cost of DNS resolution Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt The measured performance of content distribution networks: Looking at these graphs Note: log plot of latency Gray line: cumulative distribution at one server Red line: cumulative distribution at all servers Blue line: cumulative distribution at CDN Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt The measured performance of content distribution networks Cumulative Distribution Right way to look at this data Want to understand frequency and magnitude of bad choices Consistent = vertical Fast = to the left Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt The measured performance of content distribution networks Results Akamai does a better job than Digital Island Neither does a particularly good job of selecting the optimal server Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt The measured performance of content distribution networks What’s wrong with this study? Focus is on choice of server Cost of DNS is explicitly excluded How does this relate to client performance? Measuring a CDN’s performance Two papers K.L.Johnson,J.F.Carr,M.S.Day,and M.F.Kaashoek,”The measured performance of content distribution networks,”in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop,(Lisbon,Portugal),May 2000. B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and Performance of Content Distribution Networks” in ACM SIGCOMM INTERNET MEASUREMENT WORKSHOP 2001. On the Use and Performance of Content Distribution Networks Focus is on client perceived performance Build canonical web page with images from CDN server On the Use and Performance of Content Distribution Networks If each CDN serves different content, then how did they create comparable pages? Size matters! Select images of (almost) identical sizes from each of the CDN services On the Use and Performance of Content Distribution Networks Step 1: For services using only DNS redirection, get an IP address from the DNS server For services using rewriting, get the page and extract the CDN content server from the page Amortize DNS lookup time over all images in this page On the Use and Performance of Content Distribution Networks Step 2: Download all the images from the IP address of the identified server Throw this data away The purpose is to make sure that there are no cache misses On the Use and Performance of Content Distribution Networks Step 3: Download all the images from the IP address of the identified server just like a browser would (4 in parallel) Repeat every 30 minutes over a period of 24 hours with a 10 minute jitter On the Use and Performance of Content Distribution Networks Results On the Use and Performance of Content Distribution Networks Four Conclusions 1. Forcing a DNS lookup in the critical path of resource retrieval, does not generally result in better server choices 2. The download time from a previously selected server is often better than from the download time from the newly selected server 3. CDN servers are generally not loaded so frequent DNS lookup is not helpful 4. It makes sense for CDNs to increase the DNS TTL given to a client unless the servers are known to be loaded On the Use and Performance of Content Distribution Networks Is this a better study? More detailed results Relates to observed performance A good marketing white paper What did we learn? Dirty Secrets of the CDN world CDNs are tremendously underutilized CDNs are over-architected The value of a CDN is its remote presence in the ISP. Not in its ability to load balance Remember the ISP Interconnect? P2P content delivery systems PUSH content to the leaf nodes Content Manager Server other leaf nodes from the edges Kontiki 2 1 client client 3 P2P CDN Four Challenges 1. Aggregate input streams 2. Deal with unstable peers 3. Manage Malicious peers 4. Who really pays for this? P2P Caching? Discussion: Is this a good idea? What are the issues? Where is the payback? Agenda Caching: Why, Where, How, What Some empirical data: Zipf’s Law Content Delivery Networks Bibliography Bibliography Gray, Shenoy, Rules of Thumb in Data Engineering 1999, Revised March 2000. Microsoft Research MS-TR-99-100 Berners-Lee, Fielding, Frystyk, Hypertext Transfer Protocol -- HTTP/1.0, IETF RFC 1945, http://www.w3.org/Protocols/rfc1945/rfc1945 Fielding, Gettys, Mogul, Frystyk, Masinter, Leach, Berners-Lee, Hypertext Transfer Protocol -- HTTP/1.1, ftp://ftp.isi.edu/in-notes/rfc2616.txt Greg Barish and Katia Obraczka. World Wide Web Caching: Trends and Techniques. IEEE Communications, May 2000. http://www.isi.edu/people/katia/cache-survey.pdf. Breslau, Cao, Fan, Phillips, Shenker, Web Caching and Zipf-like distributions: Evidence and Implications, IEEE Infocom 1999 K.L.Johnson,J.F.Carr,M.S.Day,and M.F.Kaashoek,”The measured performance of content distribution networks,”in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop,(Lisbon,Portugal),May 2000. www.terena.nl/conf/wcw/Proceedings/S4/S4-1.pdf B. Krishnamurthy,C. Wills,Y. Zhang, “On the Use and Performance of Content Distribution Networks” in ACM SIGCOMM INTERNET MEASUREMENT WORKSHOP 2001. http://www.icir.org/vern/imw-2001/imw2001-papers/10.pdf Bibliography “Zipf Distribution of Web Site Popularity”, http://www.useit.com/alertbox/zipf.html S. Gribble, E. Brewer, “System Design Issues for Internet Middleware Services: Deductions from a Large Client Trace”, Proceedings of the USENIX Symposium on Internet Technologies and Systems Monterey,California,December 1997 “The Internet is a little bit broken”, http://www.internap.com/about/theproblem.html “Reliable Internet Connectivity with BGP, Chapter 7, Influencing Entrance Selection”, http://www.bgpbook.com/archpolicyenter.html A. Cockburn, B. McKenzie, “What do Web Users Do? An Empirical Analysis of Web Use”, http://www.cosc.cantebury.ac.nz/~andy/papers/ijhcsAnalysis.pdf