Lyon-workshop-sdi

Optimization of Google Cloud
Task Processing with
Checkpoint-Restart Mechanism
Speaker: Sheng Di
Coauthors: Yves Robert, Frédéric Vivien,
Derrick Kondo, Franck Cappello
1/22
Outline
 Background of Google Cloud Task Processing
 System Overview
 Research Formulation
 Optimization of Fault-tolerance
 Optimization of the Number of Checkpoints
 Adaptive Optimization of Fault Tolerance
 Local disk vs. Shared disk
 Performance Evaluation
 Conclusion and Future Work
2/22
Background
 Google trace (released in November 2011):
 670,000 jobs, 2,500,000 tasks, 12,000 nodes
 One-month period (29 days)
 Various events, Resource request/allocation,
Job/task length, Various attributes, etc.
 The Google trace contains two types of jobs:
sequential-task jobs and Bag-of-Task jobs
 4000 application types, such as map-reduce.
 Failure events occur often for some tasks!
 Most tasks are short (a few minutes to dozens of
minutes), so task execution is sensitive to
checkpointing cost.
3/22
System Overview
 User Interface
 Receive tasks
 Resource Allocation
 Coordinate resource competition among hosts
 Task Scheduling
 Coordinate resource usage within a particular host
[Figure: system layers. Labels: User Interface (Task Parser), Resource Allocation Layer, Service Layer, Fault Tolerance, Job/Task Scheduling Layer, Virtual Machine Layer, Physical Infrastructure Layer]
4/22
System Overview (Cont’d)
 Task Processing Procedure
[Figure: job scheduling and task execution process. Labels: job submission, cloud server, queue, task scheduling, notification, resource pool, resource isolation, checkpointing, restarting and migration, process restart or migration, failed VM or service, running VM, physical node]
5/22
Research Formulation

 Analysis of Google trace:
 Task failure intervals, task length, job structure
 Equidistant checkpointing model
 Checkpointing interval for a particular task is fixed
 Task execution model (suppose k failures)
 Tw(task) = Te(task) + C(x-1) + Σk{roll-back-loss} + Σk{restart-cost}
[Figure: timeline from task entry to task exit. The task’s wall-clock time is split into productive time, checkpoint cost, roll-back loss, and restart cost]
 Objective: minimizing E(Tw(task))
 Random variable: K (# of task failure events)
 Compute optimal # of checkpoints for a Google task
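The task execution model can be sketched in a few lines of Python. This is only an illustration of the wall-clock formula on this slide; the function name, argument names, and the sample losses are hypothetical and do not come from the trace.

```python
def wall_clock_time(te, c, x, rollback_losses, restart_costs):
    """Tw = Te + C*(x-1) + sum of roll-back losses + sum of restart costs.

    te: productive length, c: cost of one checkpoint,
    x: number of checkpointing intervals (so x-1 checkpoints are taken),
    rollback_losses / restart_costs: one entry per failure (k entries each).
    """
    if x < 1:
        raise ValueError("x must be at least 1")
    if len(rollback_losses) != len(restart_costs):
        raise ValueError("need one roll-back loss and one restart cost per failure")
    return te + c * (x - 1) + sum(rollback_losses) + sum(restart_costs)


# Hypothetical task: 18 s of productive work, 3 intervals, 2 s per checkpoint,
# and k = 2 failures with made-up roll-back losses and restart costs.
tw = wall_clock_time(te=18, c=2, x=3, rollback_losses=[4, 1], restart_costs=[1, 1])
print(tw)  # 18 + 4 + 5 + 2 = 29
```

More intervals raise the C(x-1) term but shrink the work lost per failure, which is the trade-off the optimization targets.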
6/22
Optimization of the Number of
Checkpoints: New formula
 Theorem 1:
 x*: the optimal number of checkpointing intervals
 Te: task execution length (productive length)
 E(Y): task’s expected # of failures (characterized by MNOF)
 C: checkpoint cost (time increment per checkpoint)
 Formula (3): x* = sqrt(Te · E(Y) / (2C))
 Example:
 A task’s productive length is 18 seconds, C = 2 sec,
expected # of failures = 2 in its execution
 Optimal # of checkpointing intervals = sqrt(18*2/(2*2))=3
 The optimal checkpointing interval = 18/3 = 6 seconds
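The example can be checked with a short Python sketch. It assumes Formula (3) has the form x* = sqrt(Te · E(Y) / (2C)), which is what the arithmetic in the example implies. The brute-force comparison additionally assumes that each failure loses half a checkpointing interval on average; that assumption and all function names are illustrative, not taken from the slides.

```python
import math


def optimal_intervals(te, expected_failures, c):
    """Optimal number of checkpointing intervals: sqrt(Te * E(Y) / (2 * C))."""
    if te <= 0 or c <= 0 or expected_failures < 0:
        raise ValueError("te and c must be positive, expected_failures non-negative")
    return math.sqrt(te * expected_failures / (2.0 * c))


def expected_overhead(te, expected_failures, c, x):
    """Checkpoint cost plus expected roll-back loss for x intervals.

    Assumption (not from the slides): a failure loses half an interval,
    i.e. te / (2 * x), on average. Restart cost is left out because it
    does not depend on x.
    """
    return c * (x - 1) + expected_failures * te / (2.0 * x)


x_star = optimal_intervals(te=18, expected_failures=2, c=2)
interval = 18 / x_star
print(x_star, interval)  # 3.0 6.0

# Cross-check: search integer interval counts for the smallest overhead.
best_x = min(range(1, 19), key=lambda x: expected_overhead(18, 2, 2, x))
print(best_x)  # 3
```

In practice x* is rarely an integer, so a real scheduler would round it and compare the two neighbouring integer values.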
7/22
Optimization of the Number of
Checkpoints : Discussion

 Formula (3) does not depend on the failure probability
distribution, unlike Young’s formula
 Young’s formula (proposed in 1974)
 Optimal checkpoint interval: Tc = sqrt(2 · C · Tf)
 C: checkpointing cost
 Tf: mean time between failures (MTBF)
 Conditions:
 (1) Task failure intervals follow an exponential distribution
 (2) Checkpoint cost C is far smaller than the checkpoint interval Tc,
owing to the Taylor series and second-order approximation used in the derivation
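For comparison, Young’s formula is commonly written as Tc = sqrt(2 · C · Tf). A minimal Python sketch follows; the function name and the sample values are hypothetical.

```python
import math


def young_interval(c, mtbf):
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * Tf)."""
    if c <= 0 or mtbf <= 0:
        raise ValueError("c and mtbf must be positive")
    return math.sqrt(2.0 * c * mtbf)


# Hypothetical values: 2 s checkpoint cost, 100 s mean time between failures.
print(young_interval(c=2, mtbf=100))  # 20.0
```

Note that the only failure statistic this takes is MTBF, so its accuracy depends on how well MTBF can be predicted.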
8/22
Optimization of the Number of
Checkpoints : Discussion
 The exponential-distribution assumption makes
Young’s formula unsuitable for Google task processing
 [Figure: distribution of Google task failure intervals, by priority]
9/22
Optimization of the Number of
Checkpoints : Discussion
 Corollary 1: Young’s formula is a special case of our formula (3)
 It holds under two important conditions:
 Task failure intervals follow an exponential distribution
 Checkpointing cost is small
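A sketch of why the corollary is plausible, assuming Formula (3) reads x* = sqrt(Te · E(Y) / (2C)) and assuming that, under exponential failure intervals with MTBF Tf and negligible checkpointing cost, the expected number of failures is approximately Te / Tf. Both assumptions are stated here for illustration; the full proof is not on the slides.

```latex
\begin{align*}
E(Y) &\approx \frac{T_e}{T_f}
  && \text{(exponential failure intervals, small } C\text{)} \\
x^* &= \sqrt{\frac{T_e \, E(Y)}{2C}}
     \approx \sqrt{\frac{T_e^2}{2 C T_f}}
     = \frac{T_e}{\sqrt{2 C T_f}} \\
T_c &= \frac{T_e}{x^*} \approx \sqrt{2 C T_f}
  && \text{(Young's formula)}
\end{align*}
```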
10/22
Optimization of the Number of
Checkpoints : Discussion

 Our formula (3) is easier to apply than
Young’s formula in practice
 Young’s formula depends on MTBF, which
may not be easy to predict precisely, because of:
 Unsynchronized clocks across hosts
 Inevitable influence of checkpointing cost
 Significant delay of failure detection
 By contrast, MNOF is easy to record accurately
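Recording MNOF only requires counting failure events per task, with no timestamps involved. The sketch below reads MNOF as the mean number of failures per task and groups tasks by priority; the record layout and the function name are hypothetical, not the trace’s actual schema.

```python
from collections import defaultdict


def mnof_by_priority(task_records):
    """Mean number of failures per task, grouped by priority.

    task_records: iterable of (priority, failure_count) pairs,
    one pair per finished task (hypothetical layout).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for priority, failures in task_records:
        totals[priority] += failures
        counts[priority] += 1
    return {p: totals[p] / counts[p] for p in counts}


# Made-up sample: three priority-1 tasks and two priority-9 tasks.
records = [(1, 0), (1, 2), (1, 1), (9, 0), (9, 0)]
print(mnof_by_priority(records))  # {1: 1.0, 9: 0.0}
```

Because only counts are aggregated, clock skew between hosts and failure-detection delay do not affect the estimate.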
11/22
Adaptive Optimization of Chpt Positions
 Problem: what if the probability distribution of failure intervals (or
failure rates) changes over time?
 This is possible due to changeable priority ….
 Objective: To design an adaptive algorithm to dynamically suit the
changing failure rates.
 Question: Will the optimal checkpoint positions change with
decreasing remaining workload over time?
[Figure: timeline marking the current time between the Kth and (K+1)th checkpoints, with the question of whether the optimal checkpoint intervals change later on]
 Solution:
 We only need to monitor MNOF, regardless of the
decreasing remaining workload to process
(this follows from Theorem 2)
12/22
Adaptive Optimization of Fault
Tolerance (Cont’d)
 Theorem 2: relates the optimal # of checkpointing intervals
computed at the (k+1)th checkpoint position to the optimal # of
checkpointing intervals computed at the kth checkpoint position
13/22
Local disk vs. Shared disk checkpointing
 Characterization based on BLCR
 Operation time cost in setting a checkpoint
14/22
Performance Evaluation
 Experimental Setting
 We build a testbed based on the Google trace, in a cluster
with hundreds of VM instances running across 16 nodes
(16×8 cores, 16×16 GB memory, Xen 4.0, BLCR)
 We call it GloudSim (Google based cloud simulation
system) [under review by HiPC’13]
 We reproduce Google task execution as closely as
possible to the Google trace, e.g.,
 Task arrivals are based on the trace or some distribution
 Task’s memory is reproduced via Google trace
 Task’s failure events are reproduced via Google trace
 Each job is chosen from among all sample jobs in the trace
15/22
Performance Evaluation (Cont’d)
 Experimental Results
 Job’s Workload-Processing Ratio (WPR)
 Checkpointing effect with precise prediction
(on MNOF and MTBF)
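The slides do not define WPR. The sketch below assumes it is the ratio of a job’s productive (workload-processing) time to its wall-clock time, so that 1.0 means no checkpointing or failure overhead. The function name and sample values are hypothetical.

```python
def workload_processing_ratio(productive_time, wall_clock_time):
    """Assumed definition: WPR = productive time / wall-clock time."""
    if wall_clock_time <= 0:
        raise ValueError("wall_clock_time must be positive")
    if not 0 <= productive_time <= wall_clock_time:
        raise ValueError("productive_time must lie in [0, wall_clock_time]")
    return productive_time / wall_clock_time


# Made-up job: 18 s of productive work finished in 29 s of wall-clock time.
print(round(workload_processing_ratio(18, 29), 2))  # 0.62
```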
16/22
Performance Evaluation (Cont’d)
 Distribution of WPR with diff. C/R formulas
17/22
Performance Evaluation (Cont’d)
 MNOF & MTBF w.r.t. Priority in Google trace
 MNOF is stable across task lengths, while MTBF
is not (it ranges from 179 to 4199 seconds)
18/22
Performance Evaluation (Cont’d)
 Min/Avg/Max WPR with respect to diff. Priorities
 Our formula outperforms Young’s formula by 3-10%
19/22
Performance Evaluation (Cont’d)
 Wall-clock lengths of 10,000 job executions
 Conclusion: Job wall-clock lengths are often 50-100 seconds
longer under Young’s formula than under ours.
20/22
Performance Evaluation (Cont’d)
 Adaptive Algorithm vs. Static Algorithm
21/22
Conclusion and Future Work
 Selected conclusions:
 Our formula (3) outperforms Young’s formula by
3-10 percent for Google task processing
 Job wall-clock lengths are often 50-100 seconds
longer under Young’s formula than under ours.
 The worst WPR under the dynamic algorithm stays at about
0.8, compared to 0.5 under the static algorithm.
 Future work
 Extend our theorems to more cases, such as MPI over
Cloud platforms.
22/22
Thanks for your
attention!!
Contact me at:
disheng222@gmail.com
23/22