Caching - SEAS - University of Pennsylvania

Replication, Concluded
and New Trends Part I
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 24, 2008
Reminders
 Project demos Friday May 9
 Demo must be on mod cluster, at least 4 machines
 Email to schedule early next week
 Please aim to have “code complete” by May 1 or so –
integration testing is important!
 Project report and final code due Monday May 12
 Includes experiments showing scalability of querying,
crawling, or indexing
 Final exam May 12, 6-8PM, here in Towne 315
2
Recall: Central Issues in Replication
 What to replicate
 Where to replicate
 How to maintain consistency
(and how fresh data needs to be)
 How to route requests to replicas
3
Choosing Which Replica to Use
Round-robin
 Simply allocate each request to the next server, incrementing next at
each point
 Does this make sense over the entire Internet?
Load balancing
 Requires some sort of load-monitoring code at each server (e.g., how
many threads running, queues available, etc.)
 Server feeds this information into the coordinator
 What are the dangers of doing this?
Topologically aware
 Tries to allocate requests to the “nearest” server, according to
network performance
 This is what content distribution networks like Akamai aim for
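The three selection policies above can be sketched in a few lines. This is only an illustration: the class and function names, and the idea that load and latency arrive as simple dictionaries, are assumptions and not part of any particular product.

```python
class RoundRobin:
    """Allocate each request to the next server, incrementing next each time."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.next = 0

    def pick(self):
        server = self.servers[self.next]
        self.next = (self.next + 1) % len(self.servers)
        return server


def least_loaded(load_reports):
    """Load balancing: pick the server with the smallest reported load.

    load_reports maps server name -> load figure (e.g., running threads),
    as fed to the coordinator by each server's monitoring code.
    """
    return min(load_reports, key=load_reports.get)


def nearest(latency_ms):
    """Topologically aware: pick the server with the lowest measured latency."""
    return min(latency_ms, key=latency_ms.get)
```

Note that least_loaded and nearest are only as good as their inputs: stale load reports or latency estimates lead to poor choices.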
4
Physically Mapping the Requests
Internal routing (transparent to client)
 One machine masquerades as the server
 It forwards requests to a particular machine
 This is commonly done by search engines, CNN, etc.
Domain Name Server tricks
 When a client looks up the server’s name, DNS returns a different IP
address than other clients get – thus the client transparently
talks to a different machine
 Typically, content distribution networks do this
 Let’s look at Akamai as an example…
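A toy version of the DNS trick, as a sketch only. The tables, the hostname, and the prefix-matching rule are all hypothetical, and the addresses come from reserved documentation ranges. A real authoritative DNS server typically sees the address of the client's resolver, not of the client itself, so this simplifies the picture.

```python
# Hypothetical table: hostname -> {client network prefix -> "nearby" replica IP}
REPLICAS = {
    "www.example.com": {
        "198.51.100.": "192.0.2.10",
        "203.0.113.": "192.0.2.20",
    },
}

# Hypothetical fallback: the origin server for each hostname
ORIGIN = {"www.example.com": "192.0.2.1"}


def resolve(hostname, client_ip):
    """Return the IP address that this particular client should be given."""
    for prefix, replica_ip in REPLICAS.get(hostname, {}).items():
        if client_ip.startswith(prefix):
            return replica_ip
    # No nearby replica known: fall back to the origin (None if unknown host)
    return ORIGIN.get(hostname)
```

Two clients asking for the same name can therefore receive different answers, which is what makes the redirection transparent to them.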
5
Akamai: A Real-World
Content Distribution Network
 Goal is to produce replicas of data from “big” web sites
 e.g., images from CNN; QuickTime movie trailers from Apple
 Similar services from Cisco, Digital Island, Exodus, etc.
 Basic model:
 Providers pay Akamai to distribute their content
 Akamai asks ISPs if they can install boxes in their networks
 18,000 servers in 1,000 networks in 69 countries
 HTTP responses include a “base” container, plus references to
embedded content from “nearby” Akamai boxes
 Idea: HTML changes frequently; images, videos do not
 Newer service, EdgeComputing, also creates JSP pages in the end
nodes
6
Bird’s Eye View of Akamai
Supplying QuickTime Content
[Diagram: Client, QT Server, Akamai DNS, Akamai infrastructure, and a
Replica at 123.45.67.89. Labeled steps: 3: QUERY qt.akamai.tech.net;
4: IP 123.45.67.89 (“nearby” Akamai box)]
7
What Akamai’s Secret Algorithms Do
Replica maintenance
 Not every data item needs to be replicated everywhere – determine
what items should be placed at which replicas
 This is based on ideas from peer-to-peer that we’ll discuss later in the
course
Estimating the closest replica
 This is incredibly hard – how to figure out which replica is the fastest
to access – unless there is a replica at every ISP
 Studies have shown that Akamai and other CDNs aren’t perfect
[Johnson et al.]
 But it’s still typically “good enough”
Today Akamai also tries to deliver some kinds of dynamic
content, using database-like technology
8
Replication, Summarized
 Replication is generally controlled by the content
producer – the server
 Considerations about where to place replicas for
max effect, how to maintain correct semantics of the
application
 We’ve seen Akamai as a real example
 Can also have something similar, but client-side-driven, potentially
with support from intermediate points in the network…
9
Caching
Caching is, in essence, a lazy, request-driven
version of replication
 Generally needs to be fully transparent
 Implemented over standard HTTP
 Replication can often use proprietary protocols
 However: servers want control over what gets cached
 Why is this important?
 Caching can be done at either endpoint, or somewhere in
between…
10
Why Caching Works on the Web
Graphs: www.useit.com/alertbox/zipf.html
Web tends to have a Zipfian distribution of requests
 Roughly a straight line in log-log scale
 A few items are very frequent, many are somewhat
frequent, a huge number are infrequent (“heavy tail”)
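To see why a Zipfian request distribution favors caching, the sketch below computes the fraction of requests that would hit a cache holding only the most popular items. It assumes an idealized Zipf law in which the item of rank r is requested with probability proportional to 1/r^s. The function name and the exponent s = 1 are illustrative assumptions.

```python
def zipf_hit_rate(num_items, cache_size, s=1.0):
    """Fraction of requests served by caching the cache_size most popular items.

    Assumes the item of rank r has request probability proportional to 1 / r**s.
    """
    weights = [1.0 / rank ** s for rank in range(1, num_items + 1)]
    return sum(weights[:cache_size]) / sum(weights)
```

Under these assumptions, caching the top 1% of one million items (zipf_hit_rate(1_000_000, 10_000)) serves roughly two thirds of all requests. The heavy tail accounts for the rest.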
11
Where Caching is Done on the Web
 Typically, the browser does its own local caching
 Netscape, IE, Mozilla, etc. maintain a large (10s of MB)
cache of documents and images
 Some organizations do their own caching using a
proxy server
 Large ISPs, AOL, many companies
 At the server-side, certain objects are frequently
cached in memory to speed up responses
12
Proxy Servers
 At an organizational level, may route all requests
through a gateway
 Sometimes this is a firewall, other times not
 Sometimes it’s application-transparent, other times it
must be specified
 “Proxy server” is a middleman
 Takes requests from client – makes requests of server
 Reads response from server – forwards to client
 May perform shared caching of requests
 Very common for large ISPs, businesses, etc.
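The middleman role can be sketched as follows. This is a minimal illustration, not a real HTTP proxy: fetch_from_server is a hypothetical callable standing in for the request to the origin server, and expiry and cacheability are ignored.

```python
class CachingProxy:
    """Takes requests from clients, makes requests of the server on a miss."""

    def __init__(self, fetch_from_server):
        # fetch_from_server: hypothetical callable, url -> response body
        self.fetch_from_server = fetch_from_server
        self.cache = {}  # shared by all clients behind the proxy

    def handle(self, url):
        if url in self.cache:
            return self.cache[url]  # served without contacting the server
        response = self.fetch_from_server(url)
        self.cache[url] = response
        return response
```

Because the cache is shared, a page fetched for one client is served locally to every later client that asks for it.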
13
Cache Hits and Misses
 Misses are due to several possible factors:
 First-time request – compulsory miss
 Uncacheable item
(e.g., HTTPS, item marked as uncacheable)
 Item expired
 Insufficient space in cache – capacity miss
 (Structure of cache means item will be evicted – conflict
miss)
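A rough way to tell these cases apart in code, assuming a fully associative cache, so there are no conflict misses. The function and its arguments are hypothetical: cache maps each stored URL to its expiry time, and ever_cached is the set of URLs that were stored at some point.

```python
def classify_request(url, cache, ever_cached, now, cacheable=True):
    """Label a request as a hit or as one of the miss types listed above."""
    if not cacheable:
        return "uncacheable"
    if url in cache:
        # Stored, but only usable if it has not passed its expiry time
        return "hit" if cache[url] > now else "expired"
    if url in ever_cached:
        # It was stored once and has since been evicted for lack of space
        return "capacity miss"
    return "compulsory miss"
```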
14
Cache Replacement Considerations
 Cost of fetching
 Cost of storage
 How often it’s been used
 Probability it will be accessed again
 When last modified
 When likely to expire
15
Cache Item Replacement Algorithms
 Algorithms look similar to those in other disciplines
(OS virtual memory, microprocessor caching):
 Least Recently Used
 Least Frequently Used
 Others
 Size – remove largest item in cache
 Hybrid of LFU/LRU and size
 etc.
 This problem is easier than in other disciplines –
why?
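As one concrete instance, Least Recently Used can be sketched with Python's collections.OrderedDict. The class below is illustrative: capacity counts items, not bytes, so it ignores the size considerations listed above.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used item once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest use first, most recent use last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used
```

A size-aware or hybrid policy would change only the eviction step, for example by removing the largest item instead of the oldest.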
16
Summary
 Replication and caching are very common in today’s
web
 Improve performance
 Replication also provides greater availability
 Replication is generally done with server knowledge;
caching is done (roughly) transparently
 Both rely on frequent requests
17
The Future
 Let’s take a brief look at current trends – what
might be a few years down the pike…
 Let’s step back and look at the fundamental
assumptions we’ve been making in our distributed
architectures
 What basic assumptions do we make about our nodes in
any of our basic architectures (client-server, P2P, etc.)?
 What basic assumptions do we make about the data?
 What basic assumptions do we generally make about
distributing computation?
18
Smaller Systems
 What lies in the future?
 RF tags, cameras, camera
phones, temperature sensors,
etc.
 … all interconnected by
some form of networking
 Sensor networks: the latest
rage in distributed systems
research
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
19
What Can We Do with Sensor
Networks?
 Environmental monitoring:
 temperature in different parts of a building
 air quality
 monitoring equipment and staff in hospitals
 etc.
 Law enforcement:
 Video feeds and anomalous behavior (J. Shi)
 Research studies:
 Study ocean temperature, currents
 Monitor status of eggs in endangered birds’ nests
 Fun:
 Record sporting events or performances from every angle (video &
audio)
 New immersive environments
20
Why They’re Hard
 Many, many devices
 Power and resource constraints
 Most of these devices are wireless, tiny, battery-powered
 Can only transmit data every so often!!
 High rate of failure and error
 Use redundancy to overcome this
 Very limited intelligence
 Many sensors can’t run sophisticated code
 Can we use the large amount of parallelism to compensate?
 Very local knowledge
 Know about a few nodes within proximity
21
The Problems of Focus
 Languages: how do we express what we want to do with
sensor networks
 Surprisingly effective: subset of SQL for monitoring data from
relatively simple sensors
 Why would this be?
 Robustness: need to combine info from many sensors to
account for individual errors
 Routing: need to aggregate data in a power-efficient way
 Streams: data is an infinitely long sequence – how do we deal
with that?
 Summarization data structures (“the data roughly follows this
distribution”)
 Operations over “sliding windows”
 Again, SQL is the basis of a lot of work!
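A sliding-window operation over an unbounded stream can be sketched with a generator. Here the window holds the most recent readings from a single sensor, and the median is used so that one erroneous reading does not dominate. The function name and the choice of median are illustrative assumptions.

```python
from collections import deque
from statistics import median


def sliding_median(stream, window):
    """Yield the median of the last `window` readings after each new one."""
    recent = deque(maxlen=window)  # old readings fall out automatically
    for reading in stream:
        recent.append(reading)
        yield median(recent)
```

For example, list(sliding_median([20, 21, 99, 22, 21], 3)) gives [20, 20.5, 21, 22, 22]: the spurious 99 never appears in the output. Only `window` readings are kept in memory, however long the stream runs.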
22
Will Sensor Networks Make It
to the Real World?
 A definite “yes”…
 … because they’re already deployed: RFIDs, cameras, traffic monitors
 But they aren’t yet general-purpose
 Implications on privacy?
 We don’t yet understand how to write applications for
sensor networks
 Want to insulate the programmer from low-level considerations
 Want to satisfy performance and resource constraints
 What is the right level of describing an application? Web services?
Something else?
 Can database-style languages get us most of the way?
23
Sensor Net Research at Penn
(Ives, Guha, Lee, Loo; Mihaylov, Liu, Jacob)
 The Internet is now heavily based on “streaming” data,
remote devices that do sensing
 Sometimes it’s “motes”, other times it’s routers, monitoring software
on servers, etc.
 It’s very complicated to program for all of these devices
 Can we build apps that let us integrate and monitor relevant
data, without worrying about device specifics?
 The key idea: use query languages (think XQuery or SQL) as
the basic way of requesting sensor data
 Extend with ideas from data integration, to support heterogeneous
sensors, combining sensor data with databases, etc.
24
ASPEN: Sensors as Distributed Data
 Goal: extensible monitoring of streaming data
sources
 Programmer specifies computation, and the processing
and data “flow” to where it’s most effective
 A smart optimizer knows about devices and connectivity
 Programming is data-centric, not device-centric
 Everything is abstracted as tables
 View of the system continuously refreshed
Basic Approach 1/5
 Hide physical connectivity and location details from programmer –
group data sources into abstract relations
Video(lat, long, time, frame)
Mic(lat, long, time, sample)
Basic Approach 2/5
 Represent each sensor as the source of a stream of time-varying tuples
Video(lat, long, time, frame)
Mic(lat, long, time, sample)
[Figure: each sensor emits a tuple stream. Video sensors produce tuples
such as (385301,770201,1,frame), (385301,770201,2,frame), …; microphones
produce tuples such as (385300,770200,1,sample), (385300,770200,2,sample), …
The frame and sample values were images on the original slide.]
Basic Approach 3/5
 Support queries based on properties of the data,
independent of the devices
 “Show me all of the video frames between
[38°53.01’,77°02.01’] and [38°53.03’,77°02.01’] with a [image omitted]”
 “How many video frames with a [image omitted] are also near a
microphone sample with sound?”
 … Can also combine with lookups in tables to do data
integration
 e.g., “Show me video frames with a [image omitted] that fall within
the coordinates of the conference room in
RoomTable?”
 e.g., “Find the ssn of Bob Smith, use this to look up his
transponder ID, and show me video near him”
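The first query can be sketched as a filter over Video tuples. This assumes coordinates are encoded as on the slides, e.g. 38°53.01’ as 385301. The Video tuple type, the function name, and the frame_matches predicate are hypothetical; the predicate stands in for the image condition, which did not survive in the slide text.

```python
from typing import NamedTuple


class Video(NamedTuple):
    lat: int
    long: int
    time: int
    frame: object


def frames_in_box(video_stream, lat_min, lat_max, long_min, long_max,
                  frame_matches=lambda frame: True):
    """Yield Video tuples inside the bounding box whose frame matches."""
    for v in video_stream:
        in_box = lat_min <= v.lat <= lat_max and long_min <= v.long <= long_max
        if in_box and frame_matches(v.frame):
            yield v
```

The query names only properties of the data (position, frame content). Nothing in it refers to a particular camera.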
Basic Approach 4/5
 Support logical views – “abstract sensors” integrating
data from different types of lower-level sensors
AVObservations(lat, long, time, frame, sample) :-
    video(lat, long, time, frame), mic(lat2, long2, time, sample)
    where dist(lat, long, lat2, long2) < 5m and sample > ― and frame > [image omitted]
[Figure: the Video and Mic tuple streams from the previous slide, shown as
inputs to the view. The frame and sample values were images on the
original slide.]
Basic Approach 5/5
 Support logical views – “abstract sensors” integrating
data from different types of lower-level sensors
AVObservations(lat, long, time, frame, sample) :-
    video(lat, long, time, frame), mic(lat2, long2, time, sample)
    where dist(lat, long, lat2, long2) < 5m and sample > ― and frame > [image omitted]
[Figure: the view’s output is a stream of joined tuples of the form
(lat, long, time, frame, sample), e.g. (385303,770201,1,frame,sample),
(385303,770201,2,frame,sample), …, shown alongside the remaining input
tuples. The frame and sample values were images on the original slide.]
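The AVObservations view can be read as a join between the Video and Mic streams. Below is a minimal nested-loop sketch over finite batches of tuples, not the ASPEN implementation. The dist function, the distance threshold, and the is_loud and frame_matches predicates are hypothetical parameters standing in for the conditions of the view.

```python
def av_observations(video, mic, dist, max_dist, is_loud, frame_matches):
    """Join video and mic tuples that share a timestamp and are close together.

    video: iterable of (lat, long, time, frame) tuples
    mic:   list of (lat2, long2, time, sample) tuples (scanned once per frame)
    """
    for lat, long_, time, frame in video:
        if not frame_matches(frame):
            continue
        for lat2, long2, time2, sample in mic:
            same_time = time == time2
            close = dist(lat, long_, lat2, long2) < max_dist
            if same_time and close and is_loud(sample):
                yield (lat, long_, time, frame, sample)
```

On real streams the join would run over sliding windows instead of whole batches. Deciding where in the network to evaluate it is the optimizer’s job.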
Challenges We Are Addressing
 Data integration has been based on static data
 Adapt mappings, queries to stream data, including timing,
synchronization, link properties, …
 Optimization of queries is hard in the simplest case,
and here we need to do it in distributed fashion with
limited knowledge
 Distribute computation to the network, and to the devices with the
“right” position and “right” capabilities
31
Next Time
 Bigger systems: The grid and cloud computing
 Higher-level networks: Semantics and the Web
32