Jockey: Guaranteed Job Latency in Data Parallel Clusters
Andrew Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca
DATA PARALLEL CLUSTERS
Predictability
Deadline
VARIABLE LATENCY
Observed variation in run time: 4.3x
Why does latency vary?
1. Pipeline complexity
2. Noisy execution environment
Cosmos
MICROSOFT’S DATA PARALLEL CLUSTERS
• CosmosStore
• Dryad
• SCOPE
DRYAD’S DAG WORKFLOW
Cosmos Cluster
Pipeline → Job → Stage → Tasks
Each job in the pipeline has a deadline
EXPRESSING PERFORMANCE TARGETS
Priorities? Not expressive enough
Weights? Difficult for users to set
Utility curves? Capture deadline & penalty
OUR GOAL
Maximize utility while minimizing resources by dynamically adjusting the allocation
Jockey
• Large clusters
• Many users
• Prior execution
JOCKEY – MODEL
f(job state, allocation) -> remaining run time
JOCKEY – CONTROL LOOP
JOCKEY – MODEL
f(job state, allocation) -> remaining run time
⇒ f(progress, allocation) -> remaining run time
JOCKEY – PROGRESS INDICATOR
Per stage: total running time + total queuing time + (# complete tasks / total tasks)
Summed across stages: Stage 1 + Stage 2 + Stage 3
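The per-stage sums above can be sketched as a small function. This is an illustrative combination, not Jockey's exact formula: the field names (`total_running`, `total_queuing`, `complete`, `total_tasks`) and the choice to weight each stage's completed-task fraction by that stage's total work are assumptions.

```python
def progress(stages):
    """Illustrative Jockey-style progress indicator (assumed formula):
    weight each stage's completed-task fraction by the total work
    (running + queuing time) observed in that stage, then normalize
    so the result is a fraction in [0, 1]."""
    done = 0.0
    total = 0.0
    for s in stages:
        work = s["total_running"] + s["total_queuing"]  # stage's total work
        done += work * (s["complete"] / s["total_tasks"])
        total += work
    return done / total if total else 0.0
```

A single number like this is what lets the control loop index into a precomputed table of predictions, regardless of how complex the job's DAG is.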
JOCKEY – CONTROL LOOP

Job model (predicted remaining run time):

              10 nodes     20 nodes     30 nodes
1% complete   60 minutes   40 minutes   25 minutes
2% complete   59 minutes   39 minutes   24 minutes
3% complete   58 minutes   37 minutes   22 minutes
4% complete   56 minutes   36 minutes   21 minutes
5% complete   54 minutes   34 minutes   20 minutes

Completion: 1%, deadline: 50 min.
Completion: 3%, deadline: 40 min.
Completion: 5%, deadline: 30 min.
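The table lookup the control loop performs can be sketched as follows, using the example numbers from the slide: given current progress and the time left until the deadline, pick the smallest allocation whose predicted remaining time still fits. The function and table names are illustrative, not Jockey's API.

```python
# Predicted remaining minutes, indexed by (percent complete, node count),
# copied from the example model table on the slide.
MODEL = {
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}

def choose_allocation(percent_complete, minutes_to_deadline):
    """Pick the smallest allocation predicted to finish before the deadline."""
    row = MODEL[percent_complete]
    for nodes in sorted(row):
        if row[nodes] <= minutes_to_deadline:
            return nodes
    return max(row)  # nothing fits: best effort with the largest allocation
```

Walking through the slide's scenarios: at 1% complete with 50 minutes left, 20 nodes suffices (40 ≤ 50); at 3% with 40 minutes left, 20 nodes still works (37 ≤ 40); at 5% with only 30 minutes left, the loop must scale up to 30 nodes (20 ≤ 30, while 20 nodes would take 34).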
JOCKEY – MODEL
f(progress, allocation) -> remaining run time
Analytic model? Machine learning? Simulator.
JOCKEY
Problem              Solution
Pipeline complexity  Use a simulator
Noisy environment    Dynamic control
Jockey in Action
• Real job
• Production cluster
• CPU load: ~80%
Jockey in Action
Initial deadline: 140 minutes
New deadline: 70 minutes
Release resources due to excess pessimism
“Oracle” allocation: total allocation-hours
Available parallelism less than allocation
Allocation above oracle
93
Evaluation
94
• Production cluster
Evaluation
95
• Production cluster
• 21 jobs
Evaluation
96
• Production cluster
• 21 jobs
• SLO met?
Evaluation
97
• Production cluster
• 21 jobs
• SLO met?
• Cluster impact?
Evaluation
98
Evaluation
Jobs which met the SLO
Missed 1 of 94 deadlines
Allocated too many resources (1.4x)
Simulator made good predictions: 80% finish before deadline
Control loop is stable and successful
Conclusion
Data parallel jobs are complex, yet users demand deadlines.
Jobs run in shared, noisy clusters, making simple models inaccurate.
Jockey: a simulator plus a control loop to meet deadlines
Questions?
Andrew Ferguson
adf@cs.brown.edu
Coauthors
• Peter Bodík (Microsoft Research)
• Srikanth Kandula (Microsoft Research)
• Eric Boutin (Microsoft)
• Rodrigo Fonseca (Brown)
Backup Slides
Utility Curves
• For single jobs, scale doesn’t matter
• For multiple jobs, use financial penalties
Jockey: Resource allocation control loop
Utility vs. run time
1. Slack
2. Hysteresis
3. Dead Zone
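The hysteresis and dead-zone techniques listed above can be sketched as a damping wrapper around the raw allocation decision. The thresholds and function shape here are illustrative assumptions, not Jockey's actual parameters, and slack (targeting a finish time earlier than the deadline) is not shown.

```python
def damped_allocation(current, desired, dead_zone=0.05, hysteresis=0.5):
    """Stabilize a control loop (illustrative thresholds, not Jockey's):
    - dead zone: ignore changes smaller than a fraction of the current
      allocation, so measurement noise does not cause churn;
    - hysteresis: move only partway toward the new target each step,
      so the loop does not oscillate around the set point."""
    if current and abs(desired - current) / current < dead_zone:
        return current  # inside the dead zone: no change
    return current + (desired - current) * hysteresis  # partial step toward target
```

For example, a jump from 100 to 103 nodes falls inside a 5% dead zone and is ignored, while a jump from 100 to 160 is taken only halfway (to 130) in one control interval.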
Cosmos: Resource sharing
• Resources are allocated with a form of fair sharing across business groups and their jobs (like Hadoop FairScheduler or CapacityScheduler)
• Each job is guaranteed a number of tokens as dictated by cluster policy; each running or initializing task uses one token, released on task completion
• A token is a guaranteed share of CPU and memory
Jockey: Progress indicator
• Can use many features of the job to build a progress indicator
• Earlier work (ParaTimer) concentrated on fraction of tasks completed
• Our indicator is very simple, but we found it performs best for Jockey’s needs
• Features: fraction of completed vertices, total vertex initialization time, total vertex run time
Comparison with ARIA
• ARIA uses analytic models, designed for 3 stages: Map, Shuffle, Reduce
• ARIA was tested on a small (66-node) cluster without a network bottleneck
• Jockey’s control loop is robust due to control-theory improvements
• We believe Jockey is a better match for production DAG frameworks such as Hive, Pig, etc.
Jockey: Latency prediction C(p, a)
• Event-based simulator
  – Same scheduling logic as actual Job Manager
  – Captures important features of job progress
  – Does not model input size variation or speculative re-execution of stragglers
  – Inputs: job algebra, distributions of task timings, probabilities of failures, allocation
• Analytic model
  – Inspired by Amdahl’s Law: T = S + P/N
  – S is remaining work on critical path, P is all remaining work, N is number of machines
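The Amdahl-inspired bound above translates directly into code. The example values below are made up for illustration; only the formula T = S + P/N and the meanings of S, P, and N come from the slide.

```python
def remaining_time(S, P, N):
    """Amdahl-style estimate from the slide: T = S + P/N, where S is the
    remaining work on the critical path (cannot be parallelized away),
    P is the total remaining work, and N is the number of machines."""
    return S + P / N

# Adding machines shrinks only the parallel term; the critical path S
# puts a floor under the achievable latency:
# remaining_time(S=10, P=120, N=12) -> 20.0
# remaining_time(S=10, P=120, N=24) -> 15.0
```

This is why the analytic model is cheap but coarse: it ignores the DAG structure, queuing, and failures that the event-based simulator captures.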
Jockey: Resource allocation control loop
• Executes in Dryad’s Job Manager
• Inputs: fraction of completed tasks in each stage, time job has spent running, utility function, precomputed values (for speedup)
• Output: number of tokens to allocate
• Improved with techniques from control theory
Jockey
offline: job profile → simulator → latency predictions
during job runtime: running job → job stats → resource allocation control loop (uses latency predictions and utility function)