Understanding and Improving Modern Web Traffic Caching

Towards Understanding
Modern Web Traffic
Sunghwan Ihm and Vivek S. Pai
Google Inc. / Princeton University
Web Changes and Growth



Simple static documents -> complex rich media
applications
 Heavy client-side interactions (e.g., Ajax)
Traffic increase
 Social networking, file-sharing, and video
streaming sites
Trends expected to continue
 Applications migrated to the Web
 A de facto standard interface of cloud services
Sunghwan Ihm, Princeton University
Understanding Changes

Goal: shape system design by better
understanding the traffic optimization
opportunities

Improve response times

Understand caching effectiveness

Design intermediary systems: firewalls,
security analyzers, and reporting/monitoring
systems
Challenges

Tracking changes
 Requires large-scale data set spanning many years
collected under the same conditions
Web page analysis
 Requires new analysis techniques suitable for
dynamic Web pages with client-side interactions
(e.g., Ajax)
Redundancy and caching
 Requires full content instead of simple access logs
for assessing implications of content-based caching
We address these challenges by
1. Analyzing large-scale data with full content
2. Developing a new Web page analysis technique
CoDeeN Traffic


CoDeeN content distribution network (CDN)
 http://codeen.cs.princeton.edu/
A semi-open globally distributed open proxy on
500+ PlanetLab nodes

Running since 2003

30+ million requests per day
Data Collection
[Diagram: user (browser cache), local proxy cache, CoDeeN cache, WAN, origin Web server; access logs and full content are collected at CoDeeN]
Assume local proxy caches
 1. Access logs (all requests, but limited info.)
 URL, Timestamp, Content-Length, Content-Type, Referer, etc.
 2. Full content (cache-misses)
 Header + body

Data Set


5 years: from 2006 to 2010
 Focus on one month (April) per year
 Full content data only for 2010
Total volume per month
 3.3~6.6 TB
 280~460 million requests
 240~360K unique client IPs (40~60% /8 nets)
 168~187 countries and regions
 820K~1.2 million servers
Focus on US, CN, FR, BR:
 100M+ requests / 1TB+ / 100K+ users
Analysis Outline
1. High-level analysis (access logs)
2. Page-level analysis (access logs)
3. Caching analysis (full content)
1. High-Level Analysis

Q: What has changed over five years?

Connection speed

NAT usage

Max # concurrent browser connections

Content type

Object Size

Traffic share of Web sites
Content Type
[Figure: % requests by content type (html, image, css, xml, javascript, flv video, octet, other), 2006 to 2010, X and Y log-scale; panels include US, FR, and (c) BR]

A sharp increase of Ajax: JavaScript / CSS / XML
A sharp increase of Flash video (FLV)
(<5% -> 25%)
Traffic Share of Web Sites



Increase in video sites’ traffic
Increase in ad networks and analytics sites’
requests (~12%)
 Ad networks market growth
Most accessed site by users
 search / analytics
 google.com, baidu.com, google-analytics.com
 % user share increasing, tracking up to 65%
2. Page-Level Analysis

Q: How have Web pages changed?

New page detection heuristic

Initial page characteristics
 Page size / # of embedded objects / latency

Page load latency simulation

Entire page characterization
Page Detection Problem

Given a set of access logs, detect the page
boundaries
[Diagram: timeline of requests, each page consisting of a main object followed by its embedded objects]
# of embedded objects, page size, time, etc.
Challenge: previous approaches from 1990s are
a poor fit, inaccurate for modern Web traffic
Previous Approach #1:
Time-based
Check idle time between requests
 If within a threshold (e.g. 1 second), they belong
to the same page


Misclassify client-side interactions (Ajax)
with longer idle time as pages
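
A minimal sketch of the time-based idea, assuming requests from one client are reduced to a list of timestamps in seconds. The function name and the sample data are hypothetical and only illustrate why a late Ajax request gets counted as a new page.

```python
from typing import List


def split_pages_by_idle_time(timestamps: List[float],
                             threshold: float = 1.0) -> List[List[float]]:
    """Group request timestamps into pages using an idle-time threshold."""
    pages: List[List[float]] = []
    for ts in sorted(timestamps):
        # A request within `threshold` seconds of the previous one
        # is assigned to the same page; otherwise it starts a new page.
        if pages and ts - pages[-1][-1] <= threshold:
            pages[-1].append(ts)
        else:
            pages.append([ts])
    return pages


# Three requests load the page; an Ajax call follows 5.5 seconds later.
requests = [0.0, 0.2, 0.5, 6.0]
pages = split_pages_by_idle_time(requests)
print(len(pages))  # the Ajax call is reported as a second "page"
```

The Ajax call at t=6.0 ends up in its own group, which is the misclassification named above.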
Previous Approach #2:
Type-based
Check file extension / content type
 Regard every html object as a main object


Misclassify frames/iframes within a page as
separate pages
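
A minimal sketch of the type-based idea, assuming each log entry is a dict with hypothetical keys `url` and `content_type`. The example URLs are made up.

```python
from typing import Dict, List


def main_objects_by_type(log: List[Dict[str, str]]) -> List[str]:
    """Treat every text/html response as the main object of a page."""
    mains = []
    for entry in log:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = entry["content_type"].split(";")[0].strip().lower()
        if media_type == "text/html":
            mains.append(entry["url"])
    return mains


log = [
    {"url": "http://example.com/", "content_type": "text/html; charset=utf-8"},
    {"url": "http://example.com/ad_frame.html", "content_type": "text/html"},
    {"url": "http://example.com/logo.png", "content_type": "image/png"},
]
mains = main_objects_by_type(log)
print(mains)
```

Both the real page and the embedded frame are returned, so the frame is counted as a separate page.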
Algorithm
[Slide callouts: Ajax, frames/iframes]
1. Group logs into streams by Referer field
2. Consider all html objects as main object
candidates (Type-based)
3. Ignore those with no children (embedded
objects)
4. Apply idle time among the candidates for
finalizing selection (Time-based)
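
The four steps can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes the log belongs to a single client, uses one Referer map in place of full stream construction, and the `Entry` fields, function name, and URLs are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    ts: float           # request time in seconds
    url: str
    referer: str        # empty string when the Referer header is absent
    content_type: str


def detect_main_objects(log: List[Entry], idle_time: float = 1.0) -> List[str]:
    entries = sorted(log, key=lambda e: e.ts)
    # Step 1 (simplified): relate requests through the Referer field.
    children = defaultdict(int)
    for e in entries:
        if e.referer:
            children[e.referer] += 1
    mains = []
    prev_ts = None
    for e in entries:
        is_html = e.content_type.startswith("text/html")            # step 2
        has_children = children.get(e.url, 0) > 0                    # step 3
        idle_before = prev_ts is None or e.ts - prev_ts >= idle_time  # step 4
        if is_html and has_children and idle_before:
            mains.append(e.url)
        prev_ts = e.ts
    return mains


A, B = "http://example.com/a", "http://example.com/b"
log = [
    Entry(0.0, A, "", "text/html"),
    Entry(0.1, "http://example.com/style.css", A, "text/css"),
    Entry(0.2, "http://example.com/frame.html", A, "text/html"),  # iframe
    Entry(5.0, "http://example.com/api", A, "application/json"),  # Ajax
    Entry(10.0, B, A, "text/html"),
    Entry(10.1, "http://example.com/img.png", B, "image/png"),
]
mains = detect_main_objects(log)
print(mains)
```

In this toy log the iframe is rejected because nothing refers to it, and the Ajax call is rejected because it is not html, so only the two real pages remain.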
Validation



Ground truth: browse Alexa’s top 100 sites
 Visit about 10 pages per site
 Record Web page URLs (main objects)
 Total 1197 pages
Precision
 # correct pages found / # total pages found
Recall
 # correct pages found / # total correct pages
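
The two metrics can be computed as below. This is a small sketch that assumes detected pages and ground-truth pages are both sets of main-object URLs; the sample sets are made up and are not the validation data.

```python
from typing import Set, Tuple


def precision_recall(found: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of detected pages against ground truth."""
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


truth = {"a", "b", "c", "d"}   # pages actually visited
found = {"a", "b", "x"}        # pages reported by a detector
p, r = precision_recall(found, truth)
print(p, r)
```

Here two of the three reported pages are correct, and two of the four visited pages were found.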
Validation Result
[Figure 8: Precision and recall of the Time, Type, Time+Type, and StreamStructure approaches]
StreamStructure outperforms previous page detection
algorithms, simultaneously achieving high precision
and recall
 Robust to the idle time parameter selection

Identifying Initial Page Loads
[Diagram: initial page load followed by client-side interactions (e.g., Ajax)]
Initial page: user-perceived page -> user-perceived
latency -> traffic/revenue of Web sites
40-60% of traffic occurs after initial page loads
Apply Time-based approach, but DNS lookup or
browser processing time can vary significantly

Use Google Analytics beacon
 JavaScript collecting various client-side info.
 Fires when documents are loaded
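
A rough sketch of using the beacon as a marker. It assumes the beacon is a request to `__utm.gif` on a google-analytics.com host and that everything up to the first beacon counts as the initial page load; both are simplifying assumptions for illustration, and the function names and URLs are hypothetical.

```python
from typing import List, Tuple
from urllib.parse import urlparse

Request = Tuple[float, str]  # (timestamp in seconds, URL)


def is_ga_beacon(url: str) -> bool:
    """Heuristic check for a Google Analytics beacon request."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    on_ga_host = (host == "google-analytics.com"
                  or host.endswith(".google-analytics.com"))
    return on_ga_host and parsed.path.endswith("/__utm.gif")


def split_at_beacon(requests: List[Request]) -> Tuple[List[Request], List[Request]]:
    """Split one page's requests into (initial load, later interactions)."""
    reqs = sorted(requests)
    for i, (_ts, url) in enumerate(reqs):
        if is_ga_beacon(url):
            return reqs[:i + 1], reqs[i + 1:]
    return reqs, []  # no beacon seen: cannot separate


page = [
    (0.0, "http://example.com/"),
    (0.3, "http://example.com/app.js"),
    (0.9, "http://www.google-analytics.com/__utm.gif?utmwv=1"),
    (7.5, "http://example.com/api/update"),
]
initial, later = split_at_beacon(page)
print(len(initial), len(later))
```

Unlike a fixed idle-time threshold, the split point here follows the page's own load event.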
Initial Page Size and # Objects
Initial pages become increasingly complex
 US: about 2x increase
 2006: 69 KB / 6 objects
 2010: 133 KB / 12 objects

Initial Page Load Latency
Median latency dropped in 2009 and 2010
 Increased # of browser concurrent connections
 Reduced per-object latency from improved
caching behavior / client bandwidth

3. Caching Analysis

Q: Implications for caching?

URL popularity

Caching effectiveness

Required cache storage size

Impact of aborted transfers
Two Caching Approaches


HTTP Object-based Approach
 Whole object
 HTTP-cacheable only
 Previously reported cache hit rate: 35~50%
 Byte hit rate usually much less
Content-based Approach
 Cache smaller chunks instead of objects
 Protocol independent
 Effective for uncacheable content as well
 WAN accelerators, storage/file systems
Ideal Cache Hit Rate
HTTP object-based: 17~28%
 Mainly effective for JavaScript and image
Content-based: 42~51% with 128-byte chunks
(1.8~2.5x)
 Effective for any content type
 Growth of tail that hurts caching
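
The gap between the two approaches can be illustrated with a toy comparison under an infinite cache. This sketch makes several assumptions: it ignores HTTP cacheability headers, it uses fixed-size 128-byte chunks rather than content-defined chunk boundaries, and the function names and sample responses are hypothetical.

```python
import hashlib
from typing import List, Tuple

Response = Tuple[str, bytes]  # (URL, response body)


def object_byte_hit_rate(responses: List[Response]) -> float:
    """Hit only when the same URL returns exactly the same bytes again."""
    seen = {}
    hit = total = 0
    for url, body in responses:
        total += len(body)
        digest = hashlib.sha1(body).digest()
        if seen.get(url) == digest:
            hit += len(body)
        seen[url] = digest
    return hit / total if total else 0.0


def chunk_byte_hit_rate(responses: List[Response], chunk_size: int = 128) -> float:
    """Hit whenever a chunk's content was seen before, under any URL."""
    seen = set()
    hit = total = 0
    for _url, body in responses:
        for i in range(0, len(body), chunk_size):
            chunk = body[i:i + chunk_size]
            total += len(chunk)
            digest = hashlib.sha1(chunk).digest()
            if digest in seen:
                hit += len(chunk)
            else:
                seen.add(digest)
    return hit / total if total else 0.0


v1 = bytes(range(256)) + b"B" * 128   # original version, 384 bytes
v2 = bytes(range(256)) + b"C" * 128   # updated version, last chunk changed
responses = [("/news", v1), ("/news", v2), ("/copy", v1)]
object_rate = object_byte_hit_rate(responses)
chunk_rate = chunk_byte_hit_rate(responses)
print(object_rate, chunk_rate)
```

The object cache misses all three responses, while the chunk cache reuses the unchanged part of the updated version and the whole aliased copy.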

Origins of Redundancy
[Figure: Origins of redundancy (%), US, 128-byte chunks, by content type (all, html, javascript, xml, css, image, octet, audio, video); categories: object-hit, intra-URL, aliasing, inter-URL; callouts: content updates, aborted]
Most of additional savings from the redundancy
 across different versions (intra-URL)
 across different objects (inter-URL)
Required Cache Storage Size
[Figure: Byte hit rate (%) vs. cache storage size (bytes), from 100M to infinite, for MRC, 1 KB, and HTTP; panel (b) CN: 218GB]
1-KB outperforms 128-B w/ metadata overhead
 MRC: Multi-Resolution Chunking (USENIX’10)
A large cache with MRC provides 2x the byte hit rate
 Increases working set size
 Large cache storage highly desirable
Conclusions




Analyzed five years of real Web traffic with over
70,000 users
Observed a rise of Ajax and Flash video, search
engine / analytics sites tracking up to 65% of users
Developed StreamStructure
 Half of the traffic occurs due to client-side
interactions after initial page loads
 Pages have become increasingly complex
Content-based caching with large cache storage
highly desirable
 2x larger byte hit rate, aborted transfers
Thank You
sihm@cs.princeton.edu
http://www.cs.princeton.edu/~sihm/