REDUNDANCY IN NETWORK TRAFFIC: FINDINGS AND IMPLICATIONS

Ashok Anand, Chitra Muthukrishnan, Aditya Akella (University of Wisconsin, Madison)
Ramachandran Ramjee (Microsoft Research Lab, India)
Redundancy in network traffic
 Popular objects, partial content matches, headers
 Redundancy elimination (RE) for improving network efficiency
   Application layer object caching: Web proxy caches
   Recent protocol independent RE approaches: WAN optimizers, de-duplication, WAN backups, etc.
Protocol independent RE
(Figure: RE devices at both ends of a WAN link suppress repeated content)
 Message granularity: packet or object chunk
 Different RE systems operate at different granularity
RE applications
 Enterprises and data centers
   Accelerate WAN performance
 As a primitive in network architecture
   Packet Caches [SIGCOMM 2008]
   Ditto [MobiCom 2008]
Protocol independent RE in enterprises
 Globalized enterprise dilemma
   Centralized servers (data centers): simple management, but a hit on performance
   Distributed servers: direct requests to the closest servers, but complex management
 RE gives the benefits of both worlds
   Deployed in network middle-boxes (WAN optimizers) at both ends of the WAN link
   Accelerates WAN traffic while keeping management simple
 RE also used for accelerating WAN backup applications
Recent proposals for protocol independent RE
 Packet caches [SIGCOMM 2008]: RE on all routers
   RE deployment on ISP access links to improve capacity
   Reduces the load of web content on ISP access links (e.g., between an ISP and a university)
 Enterprises: improve the effective capacity of access links
 Ditto [MobiCom 2008]: use RE on nodes in wireless mesh networks to improve throughput
Understanding protocol independent RE systems
 Currently little insight into these RE systems
   How far are these RE techniques from optimal? Are there other, better schemes?
   When is network RE most effective?
   Do end-to-end RE approaches offer performance close to network RE?
   What fundamental redundancy patterns drive the design and bound the effectiveness?
 Important for effective design of current systems as well as future architectures, e.g., Ditto, packet caches
Large scale trace-driven study
 First comprehensive study
   Traces from multiple vantage points
   Focus on packet-level redundancy elimination
 Performance comparison of different RE algorithms
   Average bandwidth savings
   Bandwidth savings in peak and 95th percentile utilization
   Impact on burstiness
 Origins of redundancy
   Intra-user vs. inter-user
   Different protocols
 Patterns of redundancy
   Distribution of match lengths
   Hit distribution
   Temporal locality of matches
Data sets
 Enterprise packet traces (3 TB) with payload
   11 enterprises: small (10-50 IPs), medium (50-100 IPs), large (100+ IPs)
   2 weeks of traffic
   Protocol composition: HTTP 20-55% (vs. 64% in Spring et al.); file sharing 25-70%, reflecting centralization of servers
 UW Madison packet traces (1.6 TB) with payload
   10,000 IPs; trace collected at the campus border router
   Outgoing /24 web server traffic
   2 different periods of 2 days each
   Protocol composition: incoming HTTP 60%; outgoing HTTP 36%

Evaluation methodology
 Emulate a memory-bound (500 MB - 4 GB) WAN optimizer
   Entire cache resides in DRAM (packet-level RE)
   Emulate only redundancy elimination (WAN optimizers perform other optimizations as well)
 Deployment across both ends of access links
   Enterprise to data center
   All traffic from University to one ISP
 Replay packet trace
 Compute bandwidth savings as (saved bytes / total bytes)
   Includes packet headers in total bytes
   Includes overhead of shim headers used for encoding
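
As a back-of-the-envelope illustration of this metric, here is a minimal sketch; the per-match shim size and the packet representation are illustrative assumptions, not the study's actual implementation:

```python
# Minimal sketch of the savings metric: saved bytes / total bytes,
# charging headers and the encoding shim against the savings.
# SHIM_BYTES and the packet fields are illustrative assumptions.
SHIM_BYTES = 4  # hypothetical per-match shim header

def bandwidth_savings(packets):
    """packets: list of (header_bytes, payload_bytes, matched_bytes, num_matches)."""
    total = 0
    saved = 0
    for header_bytes, payload_bytes, matched_bytes, num_matches in packets:
        total += header_bytes + payload_bytes
        # Matched payload bytes are replaced by shim headers; headers are never saved.
        saved += matched_bytes - num_matches * SHIM_BYTES
    return 100.0 * saved / total

# Example: a 1500-byte packet with a 40-byte header and one 1000-byte
# match costs one shim, so savings = (1000 - 4) / 1500 ~ 66%.
print(bandwidth_savings([(40, 1460, 1000, 1)]))
```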
Large scale trace-driven study
 Performance comparison of different RE algorithms
 Origins of redundancy
 Patterns of redundancy
   Distribution of match lengths
   Hit distribution
Redundancy elimination algorithms
 Redundancy suppression across different packets (uses history)
   MODP (Spring et al.)
   MAXP (new algorithm)
 Data compression only within packets (no history)
   GZIP and other variants
MODP
 Spring et al. [SIGCOMM 2000]
 Compute fingerprints with Rabin fingerprinting over a sliding window of the packet payload
 Value sampling: sample those fingerprints whose value is 0 mod p
 Look up sampled fingerprints in a fingerprint table, which points to previously seen payloads in the packet store
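
The value-sampling step can be sketched as follows; a polynomial rolling hash stands in for Rabin fingerprinting, and the window size and sampling period are illustrative assumptions:

```python
# Sketch of MODP-style fingerprint selection (not the authors' code).
# A polynomial rolling hash stands in for Rabin fingerprinting;
# WINDOW and P are illustrative parameters.
WINDOW = 64   # bytes covered by each fingerprint
P = 32        # sampling period: keep fingerprints with value 0 mod P
BASE = 257
MOD = (1 << 61) - 1

def sampled_fingerprints(payload: bytes):
    """Yield (offset, fingerprint) pairs selected by value sampling."""
    if len(payload) < WINDOW:
        return
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(payload):
        if i >= WINDOW:
            h = (h - payload[i - WINDOW] * top) % MOD  # slide the window
        h = (h * BASE + b) % MOD
        if i >= WINDOW - 1 and h % P == 0:
            yield (i - WINDOW + 1, h)

# Selected fingerprints are looked up in a table mapping
# fingerprint -> (packet id, offset) into the packet store.
```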
MAXP
 Similar to MODP; only the selection criterion changes
 MODP: sample fingerprints whose value is 0 mod p
   Can leave long regions of the payload with no representative fingerprint
 MAXP: choose fingerprints that are local maxima (or minima) over each p-byte region
   Gives a uniform selection of fingerprints
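
A sketch of the local-maxima selection, reusing per-offset fingerprints like those from the previous sketch; the naive O(n*p) scan is for clarity, and a real implementation would amortize it:

```python
# Sketch of MAXP-style selection (illustrative, not the authors' code):
# an offset is chosen if its fingerprint is the maximum among the P
# offsets on either side, so every P-byte region has a representative
# fingerprint (unlike value sampling, which can leave gaps).
def local_maxima(fingerprints, p=32):
    """fingerprints: list of per-offset hash values; returns chosen offsets."""
    chosen = []
    n = len(fingerprints)
    for i, v in enumerate(fingerprints):
        lo, hi = max(0, i - p), min(n, i + p + 1)
        if all(v >= fingerprints[j] for j in range(lo, hi) if j != i):
            chosen.append(i)
    return chosen
```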
Optimal
 Approximate upper bound on optimal savings
   Store every fingerprint in a Bloom filter
   Identify a fingerprint match if the Bloom filter contains the fingerprint
 Low false positive rate for the Bloom filter: 0.1%
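
A minimal Bloom filter sketch for this kind of upper-bound experiment; the bit-array size and hash construction are illustrative assumptions:

```python
# Minimal Bloom filter sketch for the upper-bound experiment
# (illustrative parameters; not the authors' implementation).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 27, num_hashes=10):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes indices from salted SHA-256 digests.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(bytes([k]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: bytes):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))
```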
Comparison of MODP, MAXP and optimal
(Figure: bandwidth savings (%) vs. fingerprint sampling period p = 128, 64, 32, 16, 8, 4, for MODP, MAXP, and optimal)
 MAXP outperforms MODP by 5-10% in most cases
   MAXP samples fingerprints uniformly; MODP loses due to non-uniform clustering of fingerprints
 A new RE algorithm which performs better than classical MODP
Comparison of different RE algorithms
(Figure: bandwidth savings (%) for GZIP, (10 ms)->GZIP, MAXP, and MAXP->(10 ms)->GZIP across Small, Medium, Large, Univ/24, and Univ-out; "->" means "followed by")
 GZIP offers 3-15% benefit
 (10 ms buffering) -> GZIP increases the benefit by up to 5%
 MAXP significantly outperforms GZIP, offering 15-60% bandwidth savings
 MAXP -> (10 ms) -> GZIP further enhances the benefit by up to 8%
 A combination of RE algorithms can be used to enhance bandwidth savings
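
To make the "(10 ms) -> GZIP" combination concrete, here is a sketch that buffers packets for the interval and compresses each batch as one stream, exposing cross-packet redundancy to the compressor; the buffering interval and input format are illustrative assumptions:

```python
# Sketch of buffered compression, "(10 ms) -> GZIP": packets arriving
# within one buffering interval are compressed as a single zlib stream,
# letting the compressor exploit cross-packet redundancy inside the
# window. Illustrative, not the authors' code.
import zlib

BUFFER_SECONDS = 0.010  # 10 ms buffering interval

def buffered_gzip(packets):
    """packets: iterable of (timestamp, payload_bytes); yields compressed batches."""
    batch, batch_start = [], None
    for ts, payload in packets:
        if batch_start is None:
            batch_start = ts
        if ts - batch_start > BUFFER_SECONDS and batch:
            yield zlib.compress(b"".join(batch))
            batch, batch_start = [], ts
        batch.append(payload)
    if batch:
        yield zlib.compress(b"".join(batch))

# MAXP -> (10 ms) -> GZIP would feed the RE-encoded packets
# (matched regions replaced by shims) into this same buffering stage.
```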
Large scale trace-driven study
 Performance study of different RE algorithms
 Origins of redundancy
 Patterns of redundancy
   Distribution of match lengths
   Hit distribution
Origins of redundancy
 Different users accessing the same content, or the same content being accessed repeatedly by the same user?
 Middle-box deployments can eliminate bytes shared across users
 How much sharing across users occurs in practice?
 INTER-USER: sharing across users
   (a) INTER-SRC  (b) INTER-DEST  (c) INTER-NODE
 INTRA-USER: redundancy within the same user
   (a) INTRA-FLOW  (b) INTER-FLOW
(Figure: flows between enterprise and data-center middleboxes illustrating each category)
Study of composition of redundancy
(Figure: contribution to savings (%) broken down into inter-src, inter-node, inter-dst, inter-flow, and intra-flow for Small, Medium, Large, UIn, UOut, and UOut/24)
 90% of savings is across destinations (inter-dst) for UOut/24
 For UIn/UOut, 30-40% of savings is due to intra-user redundancy
 For enterprises, 75-90% of savings is due to intra-user redundancy
Implication: end-to-end RE as a promising alternative
(Figure: enterprise and data-center deployment with RE at end hosts instead of middleboxes)
 End-to-end RE as a compelling design choice
   Similar savings
   Deployment requires just a software upgrade
   Middle-boxes are expensive
   Middle-boxes may violate end-to-end semantics
Large scale trace-driven study
 Performance study of different RE algorithms
 End-to-end RE versus network RE
 Patterns of redundancy
   Distribution of match lengths
   Hit distribution
Match length analysis
 Do most of the savings come from full packet matches?
   If so, the simple technique of indexing full packets would be good enough
 For partial packet matches, what should the minimum window size be?
Match length analysis for enterprise
(Figure: match length distribution and contribution to savings (%), across bins of match lengths in bytes)
 70% of the matches are less than 150 bytes and contribute 20% of the savings
 10% of the matches are full-packet matches and contribute 50% of the savings
 Need to index small chunks of size <= 150 bytes for maximum benefit
Hit distribution
 Contributors of redundancy
   Few pieces of content repeated many times: a small packet store would be sufficient
   Many pieces of content repeated a few times: a large packet store is needed
Zipf-like distribution for chunk matches
 Chunk ranking: unique chunk matches sorted by their hit counts
 A straight line on the log-log plot shows the Zipfian distribution
 Similar to web page access frequency
 How much do popular chunks contribute to savings?
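
As a rough illustration of why a Zipf-like hit distribution implies the 80/20 behavior shown on the next slide, this sketch ranks synthetic Zipf-distributed hit counts (not the trace data) and measures the share of the top 20%:

```python
# Illustrative check of a Zipf-like hit distribution on synthetic data
# (not the trace data from the study).
def top_fraction_share(hit_counts, fraction=0.2):
    """Share of total hits contributed by the top `fraction` of chunks."""
    ranked = sorted(hit_counts, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# Synthetic Zipf-ish hit counts: hit(rank) ~ 1/rank.
hits = [int(10000 / rank) for rank in range(1, 5001)]
print(f"top 20% of chunks -> {top_fraction_share(hits):.0%} of hits")
```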
Savings due to hit distribution
 80% of savings come from 20% of chunks
 Need to index 80% of chunks for the remaining 20% of savings
 Diminishing returns from cache size
Savings vs. cache size
(Figure: savings (%) vs. cache size (MB) for small, medium, and large enterprises)
 Small packet caches (250 MB) provide a significant percentage of the savings
 Diminishing returns from increasing packet cache size beyond 250 MB
Conclusion
 First comprehensive study of protocol independent RE systems
 Key results
   15-60% savings using protocol independent RE
   A new RE algorithm which performs 5-10% better than the Spring et al. approach
   Zipfian distribution of chunk hits; small caches are sufficient to extract most of the redundancy
   End-to-end RE solutions are promising alternatives to memory-bound WAN optimizers for enterprises
Questions?
Thank you!
Backup slides
Peak and 95th percentile savings
(Figure: savings (%) vs. measurement timescale (seconds, log scale from 1 to 100,000) for mean, median, 95th percentile, and peak savings)
Effect on burstiness
 Wavelet-based multi-resolution analysis
   Energy plot: higher energy means more burstiness
   Compared with uniform compression
 Results
   Enterprise: no reduction in burstiness; peak savings lower than average savings
   University: reduction in burstiness; positive correlation of link utilization with redundancy
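
For readers unfamiliar with the technique, here is a minimal Haar-wavelet sketch of per-timescale energy for a traffic byte-count series; the normalization is illustrative and the study's exact analysis may differ:

```python
# Sketch of multi-resolution (Haar wavelet) energy analysis of a
# traffic time series; higher energy at a scale means more burstiness
# at that timescale. Illustrative only.
def haar_energies(series):
    """series: byte counts per interval (length a power of two).
    Returns mean squared detail-coefficient energy per scale."""
    energies = []
    approx = list(series)
    while len(approx) > 1:
        details = [(approx[2 * i] - approx[2 * i + 1]) / 2
                   for i in range(len(approx) // 2)]
        approx = [(approx[2 * i] + approx[2 * i + 1]) / 2
                  for i in range(len(approx) // 2)]
        energies.append(sum(d * d for d in details) / len(details))
    return energies  # index 0 = finest timescale

# Comparing energies of the original vs. RE-compressed series shows
# whether RE reduced burstiness at each timescale.
```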

Redundancy across protocols
 Large enterprise:

    Protocol        Volume (%)   Redundancy (%)
    HTTP            16.8         29.5
    SMB             45.46        21.4
    LDAP            4.85         44.33
    Src code ctrl   17.96        50.32

 University download:

    Protocol        Volume (%)   Redundancy (%)
    HTTP            58           12.49
    DNS             0.22         21.39
    RTSP            3.38         2
    FTP             0.04         16.93