Got Predictability?
Experiences with FT Middleware
Tudor Dumitraş
Priya Narasimhan
Carnegie Mellon University
Who Needs Predictability?

Service-level agreements

Problem determination, fingerpointing

Self-management, autonomic computing
FT-middleware protects the critical parts of IT infrastructures

Higher predictability requirements
© 2007 Tudor Dumitraş
Got Predictability? Experiences with Fault-Tolerant Middleware
Predictability of Fault-Tolerant Middleware

Faults are inherently unpredictable

What about the fault-free case?

Reportedly, max (response time) >> average (response time)
Empirical Data Collected

MEAD Trace: micro-benchmark (client-server)
    Middleware for Embedded Adaptive Dependability
    Fault-Tolerant CORBA implementation
    1200 configurations
FTDS Trace: 7 macro-benchmarks (3-tier applications)
    Developed during Fault-Tolerant Distributed Systems class
    Enterprise applications: online gaming, e-commerce
    Use CORBA or EJB
    336 configurations
Available at: http://www.ece.cmu.edu/~tdumitra/FT_traces/
Fault-Free vs. Faulty Unpredictability (MEAD Trace)
[Figure: Average Recovery Time and Max Fault-Free Latency, on a Recovery Time [s] axis (0 to 2), vs. Number of Clients (1 to 22)]
Fault-Free vs. Faulty Unpredictability (FTDS Trace)
[Figure: Recovery Time [s] (axis 0 to 1.2, with one value annotated 13.6 s) per Project (1 to 7). Legend: Fault Detection & Fail-over, Fault Detection, Fail-over, Request Processing, Max Fault-Free Latency]
Outline

Can we predict the maximum latency of FT middleware?

When do high latencies occur and how high are they?

How common are the high latencies?

Do most requests have bounded latencies?
MEAD Architecture
[Diagram: replicated clients (C) and servers (R). On each host the stack is: Client or Server, CORBA, Replicator (modified system calls), Group Communication, Host OS, Networking]
The Replicator
    Interface to application / CORBA
    Tunability (tunable mechanisms): replication style, #replicas
    Replicated state
    Interface to Group Communication
Active Replication: all replicas process requests
Passive Replication: primary replica processes requests
Applications from the FTDS Trace
1. Su-Duel-Ku: Competitive Sudoku
2. Blackjack: Online casino
3. FTEX: Electronic stock exchange
4. eJBay: Online auctioning
5. Mafia: Online game
6. Park’n Park: Parking-lot management
7. Ticket Center: Online ticketing
Middleware labels on the slide: EJB, CORBA
Architecture of FTDS Applications
Example of Unpredictability
[Figure: two panels. Left: Latency [μs] (scale ×10^4) over Time [s] (0 to 35). Right: PDF (scale ×10^-4) of Latency [μs] (scale ×10^4)]
Maximum latency can be orders of magnitude larger than the average
Unpredictability in the MEAD Trace
[Figure: Average latency [μs] (log scale, 10^3 to 10^7) over two axes whose labels did not survive extraction (ticks 16 to 65536 and 0 to 5000)]
Unpredictability in the MEAD Trace
[Figure: two panels, Maximum latency [μs] and Average latency [μs] (both log scale, 10^3 to 10^7), each over two axes whose labels did not survive extraction (ticks 16 to 65536 and 0 to 5000)]
Average and Maximum Latency
[Figure: two plots of Maximum latency [s] against Average latency [s]. Left: MEAD, linear axes (0 to 4 s). Right: MEAD, SuDuelKu, FTEX, Park’n Park and Ticket Center, logarithmic axes]
Statistical Analysis of Unpredictability
[Figure: Latency [μs] plot (scale ×10^4) and PDF (scale ×10^-4) of Latency [μs], annotated with the mean, the threshold $\text{Mean} + 3\sigma$, and the maximum z-score]
$Z_{\max} = \dfrac{\text{Max} - \text{Mean}}{\sigma}$
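The two statistics plotted on the following slides, the maximum z-score and the percentage of outliers, can be sketched as below. This is a minimal illustration and not the authors' analysis code: the function name `outlier_stats` is hypothetical, the use of the population standard deviation is an assumption (the slides do not say which estimator was used), and the outlier cutoff of the mean plus three standard deviations is read from the slide's annotation.

```python
import statistics


def outlier_stats(latencies, k=3.0):
    """Return (maximum z-score, percentage of outliers) for one experiment.

    Assumptions: population standard deviation; an outlier is any latency
    above mean + k * sigma, with k = 3 as annotated on the slide.
    """
    mean = statistics.fmean(latencies)
    sigma = statistics.pstdev(latencies)
    if sigma == 0:
        # All latencies identical: no spread, hence no outliers.
        return 0.0, 0.0
    z_max = (max(latencies) - mean) / sigma
    threshold = mean + k * sigma
    outliers = sum(1 for x in latencies if x > threshold)
    return z_max, 100.0 * outliers / len(latencies)


# Made-up example: 99 fast requests and one very slow one (values in microseconds).
z_max, pct_outliers = outlier_stats([100] * 99 + [10_000])
```

A single slow request among a hundred is enough to push the maximum z-score far above the outlier cutoff while the outlier percentage stays at 1%.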
Correlation with Message Size (MEAD)
[Figure: Percentage of outliers (0% to 1.5%) and Maximum z-score (0 to 300) vs. Size of reply messages [bytes] (16, 256, 4096, 16384, 65536)]
Time in Kernel and User Mode (MEAD)

25% kernel mode: 16 KB and 64 KB
10% kernel mode: 16 B, 256 B and 4 KB
Number and Size of Outliers (FTDS)
[Figure: Percentage of outliers (0% to 3%) and Maximum z-score (0 to 150) per FTDS Project: SuDuelKu, Blackjack, FTEX, eJBay, Mafia, Park’n Park, Ticket Center]
Correlation with Number of Clients (FTDS)
[Figure, SuDuelKu: Percentage of outliers (0% to 6%) and Maximum z-score (0 to 60) vs. Clients (1, 4, 7, 10)]
Correlation with Request Rate (FTDS)
[Figure, FTEX: Percentage of outliers (0% to 6%) and Maximum z-score (0 to 60) vs. Request rate [req/s] (5 to 25)]
Outlier Distribution (MEAD)
[Figure: number of Experiments (0 to 1200) vs. Outliers per Experiment (0% to 6%)]
Outlier Distribution (Comparison)
[Figure: Probability Density (0 to 1) of Outliers per Experiment (0% to 6%) for Ticket Center, eJBay, Park’n Park, Blackjack, FTEX, Mafia and SuDuelKu]
Isolating the Unpredictability (MEAD)
The “haircut” effect of removing 1% of the highest latencies
The Magical 1%
Unpredictability seems to be confined to 1% of the remote invocations.
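The "haircut" of removing the highest 1% of latencies can be illustrated with a short sketch. The function name `haircut` and the sample numbers are hypothetical; the only thing taken from the slides is the idea of discarding the top 1% of measurements in an experiment and looking at what remains.

```python
def haircut(latencies, percent=1):
    """Return the latencies left after dropping the highest `percent` percent.

    Integer arithmetic is used for the cut so that the number of dropped
    samples does not depend on floating-point rounding. The count is rounded
    down, so very small experiments lose nothing.
    """
    ordered = sorted(latencies)
    drop = len(ordered) * percent // 100
    return ordered[:len(ordered) - drop]


# Made-up experiment: 990 requests at 10 ms and 10 requests at 2 s.
trace = [0.010] * 990 + [2.0] * 10
trimmed = haircut(trace)
worst_before = max(trace)      # dominated by the slow 1%
worst_after = max(trimmed)     # what remains once the top 1% is removed
```

In this made-up trace the worst case drops from 2 s to 10 ms once the top 1% is removed, which is the effect the slide describes qualitatively.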
Magical 1%
[Figure: Latency [s] per Experiment in six panels (SuDuelKu, Mafia, Blackjack, Park’n Park, FTEX, MEAD), each showing Average latency, 99th percentile and Maximum latency]
Bounds for the 99th Percentile
[Figure: for MEAD, Ticket Center, Park’n Park, Mafia, eJBay, FTEX, Blackjack and SuDuelKu, the latency range, the 99th percentiles and their confidence interval, on a Z-Scores of Latency axis (0 to 240)]
$Z_{99\%} < 10$
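One way to check a bound like the one on this slide for a single experiment is to express the 99th-percentile latency as a z-score. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the percentile uses the nearest-rank definition, and the population standard deviation is used; the slides do not specify either choice. The confidence interval shown on the slide is not reproduced here.

```python
import statistics


def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile: the smallest value with at least
    `percent` percent of the samples at or below it."""
    ordered = sorted(values)
    rank = (len(ordered) * percent + 99) // 100   # ceil(n * percent / 100)
    return ordered[max(rank, 1) - 1]


def z_score(value, values):
    """Distance of `value` from the mean, in standard deviations.
    Assumes the values are not all identical (sigma > 0)."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return (value - mean) / sigma


# Made-up experiment: 99 fast requests and one very slow one (microseconds).
latencies = [100] * 99 + [10_000]
p99 = percentile_nearest_rank(latencies, 99)
z99 = z_score(p99, latencies)
within_bound = z99 < 10   # the bound stated on the slides
```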
Trends for the 99th Percentile (MEAD)
[Figure: 99% latency (log scale, 10^3 to 10^7; unit printed as [ms]) vs. Request size [bytes] (16 to 65536) and Request rate [req/s] (0 to 5000)]
Summary

Can we predict the maximum latency of FT middleware?
    Not always; maximum usually not correlated with average
When do high latencies occur and how high are they?
    Usually not correlated with configuration parameters, OS metrics
    Comparable with recovery time after crash faults
How common are the high latencies?
    Confined to 1% of remote invocations
Do most requests have bounded latencies?
    99% of requests have a z-score < 10
Implications of the Magical 1%

Predictable maximum latencies are hard to achieve
    Cannot eliminate high latencies by carefully configuring the system
Statistical predictability is easy to achieve
    99th percentile latency bounded with high confidence
Confirmed for different
    Applications
    Programming languages
    Middleware technologies
    Replication mechanisms
    Operating systems
Not confirmed for WANs, wireless networks
Statistical predictability is relevant for many enterprise applications
Thank You!
For more information: http://www.ece.cmu.edu/~tdumitra
MEAD Trace vs. FTDS Trace
                        MEAD                                FTDS
Programming language    C++                                 Java
Middleware              CORBA                               EJB, CORBA
Tiers                   2 (client, server)                  3 (client, business logic, DB)
Replication mechanisms  ORB-level, transparent              Application-level
Recovery coordination   Distributed (group communication)   Centralized (replication manager)
Operating System        TimeSys Linux                       SUSE Linux
Environment             Isolated experiments                Shared cluster
Experimental Setup
MEAD
    Test bed
        Emulab
        100 Mb/s LAN
        Pentium III at 850 MHz
    Parameters varied
        Replication style: active, passive
        Replication degree: 1, 2, 3 replicas
        Number of clients: 1 – 22
        Think time: 0, 0.5, 2, 8, 32 ms
        Reply size: 16 B, 256 B, 4 KB, 16 KB, 64 KB
FTDS
    Test bed
        Undergraduate cluster
        100 Mb/s LAN
        Pentium IV at 2.4 GHz
    Parameters varied
        Clients: 1, 4, 7, 10
        Think time: 0, 20, 40 ms
        Reply size: original, 256 B, 512 B, 1 KB
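The parameter lists above can be expanded into a configuration grid, as sketched below. Two things here are assumptions rather than statements from the slides: that every combination of parameters was run (a full factorial design), and that the MEAD client counts step by 3 from 1 to 22, as the axis ticks of the MEAD plots suggest. Under those assumptions the grid sizes come out to the 1200 and 336 configurations reported for the two traces. The dictionary keys and the `configurations` helper are hypothetical names.

```python
from itertools import product

# Parameter values as listed on the slide; client steps for MEAD are assumed.
MEAD_SPACE = {
    "replication_style": ["active", "passive"],
    "replicas": [1, 2, 3],
    "clients": list(range(1, 23, 3)),            # 1, 4, 7, ..., 22 (assumed)
    "think_time_ms": [0, 0.5, 2, 8, 32],
    "reply_size_bytes": [16, 256, 4096, 16384, 65536],
}

FTDS_SPACE = {
    "project": ["Su-Duel-Ku", "Blackjack", "FTEX", "eJBay",
                "Mafia", "Park'n Park", "Ticket Center"],
    "clients": [1, 4, 7, 10],
    "think_time_ms": [0, 20, 40],
    "reply_size": ["original", "256 B", "512 B", "1 KB"],
}


def configurations(space):
    """Enumerate every combination of parameter values (full factorial)."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*space.values())]


mead_configs = configurations(MEAD_SPACE)   # 2 * 3 * 8 * 5 * 5 combinations
ftds_configs = configurations(FTDS_SPACE)   # 7 * 4 * 3 * 4 combinations
```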
Sources of Unpredictability
[Diagram: request and reply paths between Client and Server through the Application, ORB, Replicator and Group Communication layers, with "in" and "out" points at each layer (labels include client, server, interc_hi, interc_lo)]
Passive Replication
[Diagram: a Passively Replicated Client Object (Client Group) and a Passively Replicated Server Object (Server Group), each replica on its own ORB. Labels: Primary Replica, State, Request, Response, State Transfer]
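As a rough illustration of passive replication, the sketch below has only the primary replica process each request and then copy its state to the backups, so that a backup can take over if the primary fails. It is a toy model, not MEAD's mechanism: the class names, the counter used as application state, and the `alive` flag standing in for crash detection are all hypothetical.

```python
class Replica:
    """A toy server replica whose application state is a single counter."""

    def __init__(self, name):
        self.name = name
        self.counter = 0
        self.alive = True

    def process(self, increment):
        self.counter += increment
        return self.counter


class PassiveGroup:
    """Passive replication: the primary processes, backups receive state."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def primary(self):
        # The first live replica acts as primary (hypothetical election rule).
        for replica in self.replicas:
            if replica.alive:
                return replica
        raise RuntimeError("no live replica")

    def invoke(self, increment):
        primary = self.primary()
        reply = primary.process(increment)
        # State transfer: bring every live backup up to date.
        for replica in self.replicas:
            if replica is not primary and replica.alive:
                replica.counter = primary.counter
        return reply


group = PassiveGroup([Replica("a"), Replica("b")])
first_reply = group.invoke(5)        # processed by replica "a"
group.replicas[0].alive = False      # simulate a crash of the primary
second_reply = group.invoke(2)       # replica "b" continues from the transferred state
```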
Active Replication
[Diagram: an Actively Replicated Client (Client Group) and an Actively Replicated Server (Server Group), each replica on its own ORB. Labels: Duplicate Invocation Suppressed, Duplicate Responses Suppressed]
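With active replication every replica processes every request, so the same invocation and the same response appear more than once and the extra copies have to be suppressed. The sketch below shows one simple way to do that, by remembering message identifiers that were already delivered. It is an assumption-laden toy and not MEAD's implementation: `DuplicateFilter`, `invoke_active` and the use of a request ID as the deduplication key are hypothetical.

```python
class DuplicateFilter:
    """Accept the first message with a given ID and reject later copies."""

    def __init__(self):
        self._seen = set()

    def accept(self, message_id):
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        return True


def invoke_active(replicas, request_id, argument, reply_filter):
    """Send one request to all replicas and deliver a single reply.

    Every replica executes the request; only the first reply carrying this
    request ID passes the filter, the rest are suppressed as duplicates.
    """
    delivered = []
    for replica in replicas:
        reply = replica(argument)
        if reply_filter.accept(request_id):
            delivered.append(reply)
    return delivered


# Three deterministic replicas of the same (made-up) operation.
replicas = [lambda x: x * 2] * 3
replies = invoke_active(replicas, request_id=1, argument=21,
                        reply_filter=DuplicateFilter())
```

The example assumes deterministic replicas, so that any of the duplicate replies is as good as the first.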
Correlation with Number of Clients (MEAD)
[Figure: Percentage of outliers (0% to 1%) and Maximum z-score (0 to 400) vs. Number of clients (1 to 22)]
Minor Page Faults (MEAD)