Replication, Concluded and New Trends Part I Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems April 24, 2008 Reminders Project demos Friday May 9 Demo must be on mod cluster, at least 4 machines Email to schedule early next week Please aim to have “code complete” by May 1 or so – integration testing is important! Project report and final code due Monday May 12 Includes experiments showing scalability of querying, crawling, or indexing Final exam May 12, 6-8PM, here in Towne 315 2 Recall: Central Issues in Replication What to replicate Where to replicate How to maintain consistency (and how fresh data needs to be) How to route requests to replicas 3 Choosing Which Replica to Use Round-robin Simply allocate each request to the next server, incrementing next at each point Does this make sense over the entire Internet? Load balancing Requires some sort of load-monitoring code at each server (e.g., how many threads running, queues available, etc.) Server feeds this information into the coordinator What are the dangers of doing this? Topologically aware Tries to allocate requests to the “nearest” server, according to network performance This is what content distribution networks like Akamai aim for 4 Physically Mapping the Requests Internal routing (transparent to client) One machine masquerades as the server It forwards requests to a particular machine This is commonly done by search engines, CNN, etc. Domain Name Server tricks When a client looks up the server, it gets a different DNS address than other machines do – thus it transparently talks to someone else Typically, content distribution networks do this Let’s look at Akamai as an example… 5 Akamai: A Real-World Content Distribution Network Goal is to produce replicas of data from “big” web sites e.g., images from CNN; QuickTime movie trailers from Apple Similar services from Cisco, Digital Island, Exodus, etc. Basic model: Providers pay Akamai to distribute their content Akamai asks ISPs if they can install boxes in their networks 18,000 servers in 1,000 networks in 69 countries HTTP responses include a “base” container, plus references to embedded content from “nearby” Akamai boxes Idea: HTML changes frequently; images, videos do not Newer service, EdgeComputing, also creates JSP pages in the end nodes 6 Bird’s Eye View of Akamai Supplying QuickTime Content QT Server Akamai infrastructure Replica 123.45.67.89 Client 3: QUERY qt.akamai.tech.net 4: IP 123.45.67.89 (“nearby” Akamai box) Akamai DNS 7 What Akamai’s Secret Algorithms Do Replica maintenance Not every data item needs to be replicated everywhere – determine what items should be placed at which replicas This is based on ideas from peer-to-peer that we’ll discuss later in the course Estimating the closest replica This is incredibly hard – how to figure out which replica is the fastest to access – unless a replica at every ISP Studies have shown that Akamai and other CDNs aren’t perfect [Johnson et al.] But it’s still typically “good enough” Today Akamai also tries to deliver some kinds of dynamic content, using database-like technology 8 Replication, Summarized Replication is generally controlled by the content producer – the server Considerations about where to place replicas for max effect, how to maintain correct semantics of the application We’ve seen Akamai as a real example Can also have something similar, but client-sidedriven, potentially with support from intermediate points in the network… 9 Caching Caching is, in essence, a lazy, request-driven version of replication Generally needs to be fully transparent Implemented over standard HTTP Replication can often use proprietary protocols However: servers want control over what gets cached Why is this important? Caching can be done at either endpoint, or somewhere in between… 10 Why Caching Works on the Web Graphs: www.useit.com/alertbox/zipf.html Web tends to have a Zipfian distribution of requests Flat in log-log scale A few items are very frequent, many are somewhat frequent, a huge number are infrequent (“heavy tail”) 11 Where Caching is Done on the Web Typically, the browser does its own local caching Netscape, IE, Mozilla, etc. maintain a large (10s of MB) cache of documents and images Some organizations do their own caching using a proxy server Large ISPs, AOL, many companies At the server-side, certain objects are frequently cached in memory to speed up responses 12 Proxy Servers At an organizational level, may route all requests through a gateway Sometimes this is a firewall, other times not Sometimes it’s application-transparent, other times it must be specified “Proxy server” is a middleman Takes requests from client – makes requests of server Reads response from server – forwards to client May perform shared caching of requests Very common for large ISPs, businesses, etc. 13 Cache Hits and Misses Misses are due to several possible factors: First-time request – compulsory miss Uncacheable item (e.g., HTTPS, item marked as uncacheable) Item expired Insufficient space in cache – capacity miss (Structure of cache means item will be evicted – conflict miss) 14 Cache Replacement Considerations Cost of fetching Cost of storage How often it’s been used Probability it will be accessed again When last modified When likely to expire 15 Cache Item Replacement Algorithms Algorithms look similar to those in other disciplines (OS virtual memory, microprocessor caching): Least Recently Used Least Frequently Used Others Size – remove largest item in cache Hybrid of LFU/LRU and size etc. This problem is easier than in other disciplines – why? 16 Summary Replication and caching are very common in today’s web Improve performance Replication also provides greater availability Replication is generally done with server knowledge; caching is done (roughly) transparently Both rely on frequent requests 17 The Future Let’s take a brief look at current trends – what might be a few years down the pike… Let’s step back and look at the fundamental assumptions we’ve been making in our distributed architectures What basic assumptions do we make about our nodes in any of our basic architectures (client-server, P2P, etc.)? What basic assumptions do we make about the data? What basic assumptions do we generally make about distributing computation? 18 Smaller Systems What lies in the future? RF tags, cameras, camera phones, temperature sensors, etc. … all interconnected by some form of networking Sensor networks: the latest rage in distributed systems research http://robotics.eecs.berkeley.edu/~pister/SmartDust/ 19 What Can We Do with Sensor Networks? Environmental monitoring: temperature in different parts of a building air quality monitoring equipment and staff in hospitals etc. Law enforcement: Video feeds and anomalous behavior (J. Shi) Research studies: Study ocean temperature, currents Monitor status of eggs in endangered birds’ nests Fun: Record sporting events or performances from every angle (video & audio) New immersive environments 20 Why They’re Hard Many, many devices Power and resource constraints Most of these devices are wireless, tiny, battery-powered Can only transmit data every so often!! High rate of failure and error Use redundancy to overcome this Very limited intelligence Many sensors can’t run sophisticated code Can we use the large amount of parallelism to compensate? Very local knowledge Know about a few nodes within proximity 21 The Problems of Focus Languages: how do we express what we want to do with sensor networks Surprisingly effective: subset of SQL for monitoring data from relatively simple sensors Why would this be? Robustness: need to combine info from many sensors to account for individual errors Routing: need to aggregate data in a power-efficient way Streams: data is an infinitely long sequence – how do we deal with that? Summarization data structures (data is roughly according to this distribution) Operations over “sliding windows” Again, SQL is the basis of a lot of work! 22 Will Sensor Networks Make It to the Real World? A definite “yes”… … because they’re already deployed: RFIDs, cameras, traffic monitors But they aren’t yet general-purpose Implications on privacy? We don’t yet understand how to write applications for sensor networks Want to insulate the programmer from low-level considerations Want to satisfy performance and resource constraints What is the right level of describing an application? Web services? Something else? Can database-style languages get us most of the way? 23 Sensor Net Research at Penn (Ives, Guha, Lee, Loo; Mihaylov, Liu, Jacob) The Internet is now heavily based on “streaming” data, remote devices that do sensing Sometimes it’s “motes”, other times it’s routers, monitoring software on servers, etc. It’s very complicated to program for all of these devices Can we build apps that let us integrate and monitor relevant data, without worrying about device specifics? The key idea: use query languages (think XQuery or SQL) as the basic way of requesting sensor data Extend with ideas from data integration, to support heterogeneous sensors, combining sensor data with databases, etc. 24 ASPEN: Sensors as Distributed Data Goal: extensible monitoring of streaming data sources Programmer specifies computation, and the processing and data “flow” to where it’s most effective A smart optimizer knows about devices and connectivity Programming is data-centric, not device-centric Everything is abstracted as tables View of the system continuously refreshed Basic Approach 1/5and location details from programmer – Hide physical connectivity group data sources into abstract relations Video(lat, long,time,frame) Mic(lat, long,time,sample) Represent each sensor as the source of a stream of time-varying tuples Basic Approach 2/5 Video(lat, long,time,frame) Mic(lat, long,time,sample) (385300,770200,1,―) ,(385300,770200,2, ―) ,… (385301,770201,1,) , (385301,770201,2,) , … (385302,770201,1,) , (385302,770201,2,) , … (385302,770200,1,―) ,(385302,770200,2, ┘) ,… (385303,770201,1,) , (385303,770201,2, ), … (385300,770201,1,―) ,(385300,770201,2, ┘) ,… (385301,770202,1, ┘),(385301,770202,2, ┐) ,… (385300,770200,1,―) ,(385300,770200,2, ―) ,… (385301,770202,1,), (385301,770202,2,) , … (385301,770202,1, ) , (385301,770202,2,) , … (385302,770202,1,) , (385302,770202,2,) , … Basic Approach 3/5on properties of the data, Support queries based independent of the devices “Show me all of the video frames between [38°53.01’,77°02.01’] and [38°53.03’,77°02.01’] with a ” “How many video frames with a are also near a microphone sample with sound?” … Can also combine with lookups in tables to do data integration e.g., “Show me video frames with a that fall within the coordinates of the conference room in RoomTable?” e.g., “Find the ssn of Bob Smith, use this to look up his transponder ID, and show me video near him” BasicSupport Approach logical views 4/5 – “abstract sensors” integrating data from different types of lower-level sensors AVObservations(lat, long,time,frame,sample) :video(lat,long,time,frame), mic(lat2,long2,time,sample) where dist(lat,long,lat2,long2) < 5m and sample > ― and frame > (385300,770200,1,―), (385300,770200,2, ―),… (385301,770201,1,), (385301,770201,2,) (385302,770200,1,―), (385302,770200,2, ┘) ,… (385302,770201,1,), (385302,770201,2,) (385300,770201,1,―), (385300,770201,2, ┘) ,… (385303,770201,1,), (385303,770201,2, ) (385301,770202,1, ┘), (385301,770202,2, ┐) ,… (385301,770202,1,), (385301,770202,2,) (385300,770200,1,―), (385300,770200,2, ―) ,… (385301,770202,1, ), (385301,770202,2,) (385302,770202,1,), (385302,770202,2,) BasicSupport Approach logical views 5/5 – “abstract sensors” integrating data from different types of lower-level sensors AVObservations(lat, long,time,frame,sample) :video(lat,long,time,frame), mic(lat2,long2,time,sample) where dist(lat,long,lat2,long2) < 5m and sample > ― and frame > (385303,770201,1,, ┘), (385301,770202,1,, ┘), (385303,770201,2,, ┘), (385303,770201,2,, ┘), (385303,770201,2, , ┐), … (385302,770200,2, ┘) ,… (385300,770201,2, ┘) ,… (385303,770201,1,), (385303,770201,2, ) (385301,770202,1, ┘), (385301,770202,2 ┐) ,… (385301,770202,1, ), Challenges We Are Addressing Data integration has been based on static data Adapt mappings, queries to stream data, including timing, synchronization, link properties, … Optimization of queries is hard in the simplest case, and here we need to do it in distributed fashion with limited knowledge Distribute computation to the network, and to the devices with the “right” position and “right” capabilities 31 Next Time Bigger systems: The grid and cloud computing Higher-level networks: Semantics and the Web 32