Final Exam of Tanzima Zerin Islam
School of Electrical & Computer Engineering
Purdue University, West Lafayette, IN
Date: April 8, 2013
Distributed Computing Environments
High Performance Computing (HPC):
- Projected MTBF of 3-26 minutes at exascale
- Failures: hardware, software
Grid:
- Cycle-sharing system (e.g., resources @Notre Dame, @Purdue, @Indiana U., connected over the Internet)
- Highly volatile environment
- Failure: eviction of guest jobs
Tanzima Islam (tislam@purdue.edu)
Reliable & Scalable Checkpointing Systems
1
Fault-tolerance with Checkpoint-Restart
Checkpoints are execution states
- System-level: memory state; compressible
- Application-level: selected variables; hard to compress

Example (application-level checkpoint structure):

struct ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};
Challenges in Checkpointing Systems
- HPC: scalability of checkpointing systems
- Grid: use of dedicated checkpoint servers (@Notre Dame, @Purdue, @Indiana U., over the Internet)
Contributions of This Thesis
- 2007-2009: FALCON, reliable checkpointing system in Grid [Best Student Paper Nomination, SC'09]
- 2009-2010: Compression on multi-core [2nd Place, ACM Student Research Competition '10]
- 2010-2012: MCRENGINE, scalable checkpointing system in HPC [Best Student Paper Nomination, SC'12]
- 2012-2013: MCRCLUSTER, benefit-aware clustering [unpublished, prelim]
Agenda
[MCRENGINE] Scalable checkpointing system for HPC
[MCRCLUSTER] Benefit-aware clustering
Future directions
A Scalable Checkpointing System using
Data-Aware Aggregation and Compression
Collaborators:
Kathryn Mohror, Adam Moody, Bronis de Supinski
Big Picture of HPC
Compute nodes → gateway nodes → parallel file system (shared by clusters such as Atlas and Hera)
Sources of contention:
- Network contention
- Contention for shared file system resources
- Contention from other clusters
Checkpointing in HPC
- MPI applications take globally coordinated checkpoints asynchronously
- Application-level checkpoints use a high-level data format for portability (HDF5, ADIOS, netCDF, etc.)

Checkpoint writing example: the application structure

struct ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
};

is written through a data-format API (HDF5, netCDF) and an I/O library to the parallel file system (PFS) as:

HDF5 checkpoint {
  Group "/" {
    Group "ToyGrp" {
      DATASET "Temperature" {
        DATATYPE H5T_IEEE_F32LE
        DATASPACE SIMPLE {(1024) / (1024)}
      }
      DATASET "Pressure" {
        DATATYPE H5T_STD_U8LE
        DATASPACE SIMPLE {(20,30) / (20,30)}
      }
    }
  }
}

Checkpoint writing strategies:
- N→1 (Funneled): not scalable
- N→M (Grouped): best compromise, but complex
- N→N (Direct): easiest, but contention on the PFS
Impact of Load on PFS at Large Scale
- IOR, Direct (N→N): 78MB per process
- Observations:
  (−) Large average write time
  (−) Large average read time
  → less frequent checkpointing → poor application performance
[Figure: average write time (up to ~250 s) and average read time (up to ~1400 s) vs. number of processes (N)]
What is the Problem?
Today's checkpoint-restart systems will not scale, due to:
- Increasing number of concurrent transfers
- Increasing volume of checkpoint data
Our Contributions
- Data-aware aggregation
  - Reduces the number of concurrent transfers
  - Improves compressibility of checkpoints by using semantic information
- Data-aware compression
  - Improves compression ratio by 115% compared to concatenation and general-purpose compression
- Design and development of mcrEngine
  - Grouped (N→M) checkpointing system
  - Improves checkpointing frequency and application performance
Naïve Solution: Data-Agnostic Compression
- Agnostic scheme: concatenate whole checkpoints (C1, C2) in the first phase, then compress with pGzip and write to the PFS
- Agnostic-block scheme: interleave fixed-size blocks (C1[1-B], C2[1-B], C1[B+1-2B], C2[B+1-2B], ...), then compress with pGzip and write to the PFS
Observations:
(+) Easy
(−) Low compression ratio
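The two data-agnostic schemes above can be sketched in a few lines. This is an illustrative toy, not the mcrEngine code: zlib stands in for pGzip, and the 64-byte block size is an arbitrary assumption.

```python
# Sketch of the two data-agnostic schemes (zlib stands in for pGzip).
import zlib

B = 64  # interleaving block size in bytes (illustrative choice)

def agnostic(checkpoints):
    """Agnostic scheme: concatenate whole checkpoints, then compress."""
    return zlib.compress(b"".join(checkpoints))

def agnostic_block(checkpoints, block=B):
    """Agnostic-block scheme: interleave fixed-size blocks, then compress."""
    longest = max(len(c) for c in checkpoints)
    out = bytearray()
    for off in range(0, longest, block):
        for c in checkpoints:
            out += c[off:off + block]
    return zlib.compress(bytes(out))

if __name__ == "__main__":
    # Two synthetic "checkpoints" with similar content.
    c1 = bytes(range(256)) * 16
    c2 = bytes(range(256)) * 16
    print(len(agnostic([c1, c2])), len(agnostic_block([c1, c2])))
```

Both variants treat checkpoints as opaque byte streams, which is exactly why their compression ratio stays low: similar variables from different processes rarely end up adjacent.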
Our Solution: [Step 1] Identify Similar Variables Across Processes

P0:
Group ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};

P1:
Group ToyGrp {
    float Temperature[100];
    int Pressure[10][50];
};

Variables are matched on meta-data:
1. Name
2. Data-type
3. Class: Array, Atomic

[Step 2] Merging Scheme I: Aware — concatenate similar variables:
C1.T | C2.T | C1.P | C2.P

[Step 2] Merging Scheme II: Aware-Block — interleave similar variables:
first 'B' bytes of Temperature from C1, then from C2, then the next 'B' bytes, and so on; likewise for Pressure.
[Step 3] Data-Aware Aggregation & Compression
- Aware scheme: concatenate similar variables; Aware-block scheme: interleave similar variables
- First phase: data-type aware compression of each merged buffer (e.g., FPC for the merged floating-point Temperature buffers, Lempel-Ziv for the merged Pressure buffers), with headers (C1.H, C2.H) and data (C1.D, C2.D) handled separately
- Second phase: pGzip over the concatenated output buffer (T, P, H, D), then write to the PFS
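A minimal sketch of this two-phase idea, under stated stand-ins: a byte-plane transform plays the role of the type-aware first phase (FPC/fpzip in the real system), and zlib plays the role of pGzip. The point it illustrates is the bullet above: the first phase changes the underlying data format so the general-purpose second phase sees more redundancy.

```python
# Two-phase, data-aware pipeline sketch (byte-plane transform stands in for
# FPC/fpzip; zlib stands in for pGzip).
import struct
import zlib

def byte_plane(buf, width=8):
    """Phase 1 stand-in: regroup the i-th byte of every `width`-byte value.
    Similar numeric values then place similar bytes next to each other."""
    return b"".join(buf[i::width] for i in range(width))

def data_aware_compress(merged_buffers):
    """Phase 1 per merged variable buffer, then phase 2 over the whole output."""
    phase1 = b"".join(byte_plane(b) for b in merged_buffers)
    return zlib.compress(phase1)

if __name__ == "__main__":
    # Two processes' merged "Temperature" arrays with nearby values.
    t1 = struct.pack("<1024d", *[300.0 + i * 1e-3 for i in range(1024)])
    t2 = struct.pack("<1024d", *[300.5 + i * 1e-3 for i in range(1024)])
    aware = data_aware_compress([t1 + t2])
    agnostic = zlib.compress(t1 + t2)
    print(len(aware), len(agnostic))
```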
How MCRENGINE Works
- CNC: compute node component; ANC: aggregator node component
- Rank-order groups, Grouped (N→M) transfer
- The CNC identifies "similar" variables; the ANC requests meta-data from the CNCs in its group and applies data-aware aggregation and compression
- Each aggregator merges the variable buffers (T, P, H, D) from its group, runs pGzip over the result, and writes to the PFS
[Figure: three groups of CNCs, each feeding an aggregator that writes to the PFS]
Evaluation
Applications
ALE3D – 4.8GB per checkpoint set
Cactus – 2.41GB per checkpoint set
Cosmology – 1.1GB per checkpoint set
Implosion – 13MB per checkpoint set
Experimental test-bed
LLNL’s Sierra: 261.3 TFLOP/s, Linux cluster
23,328 cores, 1.3 Petabyte Lustre file system
Compression algorithm
FPC [1] for double-precision float
Fpzip [2] for single-precision float
Lempel-Ziv for all other data-types
pGzip for general-purpose compression
Evaluation Metrics
- Effectiveness of data-aware compression
  - What is the benefit of multiple compression phases?
  - How does group size affect compression ratio?
  - Compression ratio = uncompressed size / compressed size
- Performance of mcrEngine
  - Overhead of the checkpointing phase
  - Overhead of the restart phase
Multiple Phases of Data-Aware Compression are Beneficial
- Data-agnostic double compression is not beneficial: after the first pass, the data format is non-uniform and incompressible
- Data-type aware compression improves compressibility: the first phase changes the underlying data format
[Figure: compression ratio (0-4) after the first and second phases, data-agnostic vs. data-aware, for ALE3D, Cactus, Cosmology, and Implosion]
Impact of Group Size on Compression Ratio
- Different merging schemes are better for different applications
- Larger group sizes benefit certain applications
- ALE3D: improvement of 8% from group size 2 to 32
[Figure: compression ratio vs. group size for ALE3D and Cactus, Aware-Block vs. Aware]
Data-Aware Technique Always Wins over Data-Agnostic
- The data-aware technique always yields a better compression ratio than the data-agnostic technique (98-115% better for Cactus)
[Figure: compression ratio vs. group size for ALE3D and Cactus under Aware-Block, Aware, Agnostic-Block, and Agnostic]
Summary of Effectiveness Study
Data-aware compression always wins
Reduces gigabytes of data for Cactus
Larger group sizes may improve compression ratio
Different merging schemes for different applications
Compression ratio follows course of simulation
Impact of Data-Aware Compression on Latency
- IOR with Grouped (N→M) transfer, groups of 32 processes
- Data-aware: 1.2GB, data-agnostic: 2.4GB
- Data-aware compression improves I/O performance at large scale
  - Improvement during write: 43% - 70%
  - Improvement during read: 48% - 70%
[Figure: average transfer time (s) vs. number of processes (N) for Agnostic-Write, Aware-Write, Agnostic-Read, and Aware-Read]
Impact of Aggregation & Compression on Latency
- Used IOR
- Direct (N→N): 87MB per process
- Grouped (N→M): group size 32, 1.21GB per aggregator
[Figure: average write time (up to ~250 s) and average read time (up to ~1400 s) vs. number of processes (N), N→N vs. N→M]
End-to-End Checkpointing Overhead
- 15,408 processes; group size of 32 for N→M schemes; each process takes a checkpoint
- Data-aware compression converts a network-bound operation into a CPU-bound one
[Figure: total checkpointing overhead (s), split into transfer and CPU overhead, for No Comp. and Indiv. Comp. (Direct) and No Comp., Agnostic, and Aware (Grouped), for ALE3D and Cactus; annotated reductions in checkpointing overhead of 87% and 51%]
End-to-End Restart Overhead
- Reduced overall restart overhead
- Reduced network load and transfer time
[Figure: total recovery overhead (s), split into transfer and CPU overhead, for No Comp. and Indiv. Comp. (Direct) and No Comp., Agnostic, and Aware (Grouped), for ALE3D and Cactus; annotated reductions in I/O and recovery overhead of 62%, 64%, 43%, and 71%]
Summary of Scalable Checkpointing System
Developed data-aware checkpoint compression technique
Relative improvement in compression ratio up to 115%
Investigated different merging techniques
Demonstrated effectiveness using real-world applications
Designed and developed MCRENGINE
Reduces recovery overhead by more than 62%
Reduces checkpointing overhead by up to 87%
Improves scalability of checkpoint-restart systems
Benefit-Aware Clustering of
Checkpoints from Parallel Applications
Collaborators:
Todd Gamblin, Kathryn Mohror, Adam Moody, Bronis de Supinski
Our Goal & Contributions
Goal:
Can suitably grouping checkpoints increase compressibility?
Contributions:
Design new metric for “similarity” of checkpoints
Use this metric for clustering checkpoints
Evaluate the benefit of the clustering on checkpoint storage
Different Clustering Schemes
[Figure: 16 checkpoints (1-16) partitioned three ways: random, rank-wise, and our data-aware solution]
Research Questions
How to cluster checkpoints?
Does clustering improve compression ratio?
Benefit-Aware Clustering
- Similarity metric: improvement in reduction
- Goal: minimize the total compressed size
[Figure: benefit matrix β of Cactus; 32 checkpoints × variables V1-V33, with benefit values ranging from 0.3 to 0.9]
Novel Dissimilarity Metric
Two factors determine the dissimilarity between two checkpoints:

    Δ(i, j) = (1 / β(i, j)) × Σ_{k=1}^{N} [β(i, k) − β(j, k)]²
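The metric above can be computed directly from the benefit matrix. A minimal sketch, assuming `beta` is the symmetric N×N benefit matrix over checkpoints (the example values are made up):

```python
# Sketch of the dissimilarity metric Delta(i, j) over a benefit matrix.
def dissimilarity(beta, i, j):
    """Delta(i, j) = (1 / beta(i, j)) * sum_k (beta(i, k) - beta(j, k))^2.
    A large pairwise benefit or similar benefit profiles -> small distance."""
    n = len(beta)
    profile_gap = sum((beta[i][k] - beta[j][k]) ** 2 for k in range(n))
    return profile_gap / beta[i][j]

if __name__ == "__main__":
    beta = [
        [1.0, 0.9, 0.3],
        [0.9, 1.0, 0.3],
        [0.3, 0.3, 1.0],
    ]
    # Checkpoints 0 and 1 compress well together and have similar benefit
    # profiles, so they are much closer to each other than to checkpoint 2.
    print(dissimilarity(beta, 0, 1), dissimilarity(beta, 0, 2))
```

Note how both factors act: the squared-difference sum compares the two checkpoints' benefit profiles across all others, while dividing by β(i, j) shrinks the distance for pairs that already compress well together.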
How Benefit-Aware Clustering Works
1. Sample each checkpoint's variables (double T[3000]; double V[10]; double P[5000]; double D[4000]; double R[100]) by chunking or wavelet sampling
2. Filter, keeping the large variables (T, P, D)
3. Order the variables and compute the benefit matrix β
4. Cluster checkpoints by similarity (e.g., Cluster 1: P1, P3; Cluster 2: P2, P4, P5)
Structure of MCRCLUSTER
[Figure: processes P1-P5 on compute nodes each run a pipeline of components (F, O, S, C) and forward their data to aggregators A1 and A2, which write to the PFS]
Evaluation
Application
IOR (synthetic checkpoints)
Cactus
Experimental test-bed
LLNL’s Sierra: 261.3 TFLOP/s, Linux cluster
23,328 cores, 1.3 Petabyte Lustre file system
Evaluation metric:
Macro benchmark: Effectiveness of clustering
Micro benchmark: Effectiveness of sampling
Effectiveness of MCRCLUSTER
- IOR: 32 checkpoints
- Odd processes write 0
- Even processes write: <rank> | 1234567
- 29% more compression compared to rank-wise grouping, 22% more compared to random grouping
[Figure: weighted-distance matrix of the 32 checkpoints and silhouette plot of pam(x = distance_matrix, k = num_cluster, diss = TRUE) with 3 clusters of sizes 16, 9, and 7; average silhouette width 0.91]
Effectiveness of Sampling
- X axis: each variable; Y axis: range of benefit values
- Take-away: the chunking method preserves benefit relationships the closest
[Figure: per-variable benefit-value ranges under chunking vs. wavelet-transform sampling]
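Of the three sampling methods the deck names (random, chunking, wavelet), chunking is the simplest to sketch: keep every stride-th fixed-size chunk of a variable's bytes, so the sample still looks like the original data layout. The chunk size and stride here are illustrative assumptions.

```python
# Sketch of chunk-based sampling: keep every `stride`-th fixed-size chunk,
# reducing the data volume while preserving local byte patterns.
def chunk_sample(buf, chunk=256, stride=4):
    """Return roughly len(buf)/stride bytes made of whole contiguous chunks."""
    out = bytearray()
    for off in range(0, len(buf), chunk * stride):
        out += buf[off:off + chunk]
    return bytes(out)

if __name__ == "__main__":
    data = bytes(range(256)) * 64   # a 16KB synthetic variable
    sample = chunk_sample(data)
    print(len(data), len(sample))   # the sample is ~1/4 the size
```

Because whole chunks survive intact, compressing the sample behaves much like compressing the original, which is why chunking preserves the benefit relationships better than byte-wise random sampling.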
Contributions of MCRCLUSTER
Designed similarity and distance metrics
Demonstrated significant results on synthetic data
- 22% and 29% improvement compared to random and rank-wise clustering, respectively
Future directions for a first year Ph.D. student
Study impact on real applications
Design scalable clustering technique
Applicability of My Research
Condor systems
Compression for scientific data
Conclusions
This thesis addresses:
Reliability of checkpointing-based recovery in large-scale computing
Proposed three novel systems:
FALCON: Distributed checkpointing system for Grids
MCRENGINE: “Data-Aware Compression” and scalable checkpointing
system for HPC
MCRCLUSTER: “Benefit-Aware Clustering”
Provides a good foundation for further research in this field
Questions?
Future Directions: Reliability
Reliability: Similarity-based process grouping for better
compression
Group processes based on similarity instead of rank [ongoing]
Analytical solution to group size selection
Variable streaming
Integrating mcrEngine with SCR
Future Directions: Performance
Cache usage analysis and optimization
Developed user-level tool for analyzing cache utilization [Summer’12]
Short term goals:
Apply to real applications
Automate analysis
Long-term goals:
Suggest potential code optimizations
Automate application tuning
Contact Information
Tanzima Islam (tislam@purdue.edu)
Website: web.ics.purdue.edu/~tislam
Effectiveness of mcrCluster
Backup Slides
[Backup Slide] Failures in HPC
“A Large-scale Study of Failures in High-performance
Computing Systems”, by Bianca Schroeder, Garth Gibson
[Figure 1 from the paper: (a) the breakdown of failures into root causes and (b) the breakdown of downtime into root causes, categorized as Hardware, Software, Network, Environment, Human, and Unknown. Each graph shows the breakdown for systems of type D, E, F, G, and H, and aggregate statistics across all systems (A-H).]
[Backup Slide] Failures in HPC
“Hiding Checkpoint Overhead in HPC Applications with a
Semi-Blocking Algorithm”, by Laxmikant Kalé et al.
Disparity between network bandwidth and memory size
[Backup Slides] Falcon
[Backup Slide] Breakdown of Overheads
- Performance scales with checkpoint sizes
- Lower network transfer overhead
[Figure: checkpointing overhead and recovery overhead (s) for 500MB, 946MB, and 1677MB checkpoints]
[Backup Slide] Parallel Falcon
- 67% improvement in CPU time
[Figure: checkpoint storing overhead and recovery overhead (s) for 500MB, 946MB, and 1677MB checkpoints]
[Backup Slides] mcrEngine
[Backup Slide] How to Find Similarity

P0:
Group ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
    int Humidity;
};

P1:
Group ToyGrp {
    float Temperature[50];
    short Pressure[2][6];
    double Unit;
    int Humidity;
};

Inside source code: variables are represented as members of a group; a group can be thought of as the construct "struct" in C.

Inside a checkpoint: variables are annotated with metadata, from which a hash key is generated for matching:

P0:
Var: "ToyGrp/Temperature", Type: F32LE, Array1D [1024]  →  ToyGrp/Temperature_F32LE_Array1D
Var: "ToyGrp/Pressure", Type: S8LE, Array2D [20][30]    →  ToyGrp/Pressure_S8LE_Array2D
Var: "ToyGrp/Humidity", Type: I32LE, Atomic             →  ToyGrp/Humidity_I32LE_Atomic

P1:
Var: "ToyGrp/Temperature", Type: F32LE, Array1D [50]    →  ToyGrp/Temperature_F32LE_Array1D
Var: "ToyGrp/Pressure", Type: S8LE, Array2D [2][6]      →  ToyGrp/Pressure_S8LE_Array2D
Var: "ToyGrp/Unit", Type: F64LE, Atomic                 →  ToyGrp/Unit_F64LE_Atomic (no match)
Var: "ToyGrp/Humidity", Type: I32LE, Atomic             →  ToyGrp/Humidity_I32LE_Atomic
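The matching above boils down to building a "name_type_class" key per variable and grouping variables whose keys agree across checkpoints. A minimal sketch (not the mcrEngine implementation; array lengths may differ between matches, and unmatched keys, like ToyGrp/Unit, simply stay alone):

```python
# Sketch of hash-key matching of "similar" variables across checkpoints.
from collections import defaultdict

def make_key(name, dtype, klass):
    """Build the matching key from the three metadata fields."""
    return f"{name}_{dtype}_{klass}"

def match_variables(checkpoints):
    """checkpoints: list of lists of (name, dtype, klass) tuples.
    Returns {key: [indices of checkpoints that contain it]}."""
    groups = defaultdict(list)
    for idx, variables in enumerate(checkpoints):
        for name, dtype, klass in variables:
            groups[make_key(name, dtype, klass)].append(idx)
    return dict(groups)

if __name__ == "__main__":
    p0 = [("ToyGrp/Temperature", "F32LE", "Array1D"),
          ("ToyGrp/Pressure", "S8LE", "Array2D"),
          ("ToyGrp/Humidity", "I32LE", "Atomic")]
    p1 = [("ToyGrp/Temperature", "F32LE", "Array1D"),
          ("ToyGrp/Pressure", "S8LE", "Array2D"),
          ("ToyGrp/Unit", "F64LE", "Atomic"),
          ("ToyGrp/Humidity", "I32LE", "Atomic")]
    for key, owners in match_variables([p0, p1]).items():
        print(key, owners)
```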
[Backup Slide] Compression Ratio Follows Course of Simulation
- The data-aware technique always yields better compression
[Figure: compression ratio over simulation time-steps for Cactus (0.8-2.3), Cosmology (1.3-2.3), and Implosion (1.0-6.0) under Aware-Block, Aware, Agnostic-Block, and Agnostic]
[Backup Slide] Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

Application | Total Size (GB) | Data Types (%): DF / SF / Int | Aware-Block (%) | Aware (%)
ALE3D       | 4.8             | 88.8 / ~0 / 11.2              | 6.6 - 27.7      | 6.6 - 12.7
Cactus      | 2.41            | 33.94 / 0 / 66.06             | 10.7 - 11.9     | 98 - 115
Cosmology   | 1.1             | 24.3 / 67.2 / 8.5             | 20.1 - 25.6     | 20.6 - 21.1
Implosion   | 0.013           | 0 / 74.1 / 25.9               | 36.3 - 38.4     | 36.3 - 38.8

(DF = double-precision float, SF = single-precision float, Int = integer)
References
1. M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed
Compressor for Double-Precision Floating-Point Data”.
2. P. Lindstrom and M. Isenburg, “Fast and Efficient
Compression of Floating-Point Data”.
3. L. Reinhold, “QuickLZ”.
Execution Environment: Grid
State-of-the-Art:
Checkpointing in Grid with Dedicated Storage
- Jobs submitted from a submitter machine run on grid resources (@Notre Dame, @Purdue, @Indiana U.) over the Internet; checkpoints go to a dedicated storage server
Problems:
(−) High transfer latency
(−) Contention on servers
(−) Stress on shared network resources
Research Question
Can we improve the performance of
applications by storing checkpoints on the
grid resources?
Overview of Our Solution:
Checkpointing in Grid with Distributed Storage
- Store checkpoints on the grid resources themselves (@Notre Dame, @Purdue, @Indiana U., over the Internet)
Q1. Which storage nodes?
Q2. How to balance load?
Q3. How to store & retrieve efficiently?
Constraint: all components must be user-level
Answer to Q1: Storage Host Selection
- Build a failure model for storage resources
  - Compute correlated temporal reliability based on historical data
- Rank machines based on reliability, load, and network overhead (load-based ranking addresses Q2)
- Objective function: checkpoint storing overhead − benefit from restart
- Output: (m+k) storage hosts
[Figure: timelines of a compute host and two storage hosts, each with "down" intervals]
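The ranking step can be sketched as a scoring function over candidate hosts. This is a hypothetical illustration: the linear form, the weights, and the host names are assumptions, not Falcon's actual reliability model or objective function.

```python
# Hypothetical sketch of ranking storage hosts by reliability, load, and
# network overhead, keeping the top (m + k).
def rank_hosts(hosts, m, k, w_rel=1.0, w_load=0.5, w_net=0.5):
    """hosts: list of (name, reliability, load, network_overhead), each
    factor normalized to [0, 1]. Higher score is better."""
    def score(h):
        _, reliability, load, net = h
        # Reward reliability; penalize load (Q2) and network overhead.
        return w_rel * reliability - w_load * load - w_net * net
    ranked = sorted(hosts, key=score, reverse=True)
    return [name for name, *_ in ranked[:m + k]]

if __name__ == "__main__":
    hosts = [("nd-01", 0.99, 0.8, 0.3), ("pu-07", 0.95, 0.1, 0.2),
             ("iu-03", 0.60, 0.2, 0.1), ("pu-12", 0.97, 0.3, 0.6)]
    # The highly reliable but heavily loaded host loses to lighter ones.
    print(rank_hosts(hosts, m=2, k=1))
```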
Checkpoint-Recovery Scheme
Checkpoint storing phase: original checkpoint (on disk) → compression → erasure encoding into (m+k) fragments → fragments stored on storage hosts 1 .. m+k
Recovery phase: any m fragments retrieved from the storage hosts → erasure decoding → decompression → original checkpoint
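The store/recover pipeline above can be sketched end to end for the simplest case, k = 1: compress, split into m data fragments, add one XOR parity fragment, and survive the loss of any single fragment. Falcon uses a general (m+k) erasure code; XOR parity is just its simplest instance, and zlib stands in for the compression step.

```python
# Minimal store/recover pipeline sketch: compression + (m+1)-fragment
# XOR-parity erasure coding (the k = 1 special case).
import zlib

def store(checkpoint, m):
    """Compress, split into m padded fragments, append one parity fragment."""
    comp = zlib.compress(checkpoint)
    size = -(-len(comp) // m)                    # ceiling division
    frags = [comp[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(m)]
    parity = bytes(size)
    for f in frags:
        parity = bytes(x ^ y for x, y in zip(parity, f))
    return frags + [parity], len(comp)

def recover(frags, comp_len, missing=None):
    """frags: the m+1 fragments, with at most one data fragment lost (None)."""
    data, parity = frags[:-1], frags[-1]
    if missing is not None:
        # XOR the parity with the surviving data fragments to rebuild the lost one.
        rebuilt = parity
        for i, f in enumerate(data):
            if i != missing:
                rebuilt = bytes(x ^ y for x, y in zip(rebuilt, f))
        data = data[:missing] + [rebuilt] + data[missing + 1:]
    return zlib.decompress(b"".join(data)[:comp_len])

if __name__ == "__main__":
    ckpt = b"temperature=300.1;pressure=42;" * 100
    frags, n = store(ckpt, m=4)
    frags[2] = None                               # lose one storage host
    print(recover(frags, n, missing=2) == ckpt)   # True
```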
Evaluation Setup
2 different applications with 4 input sets
MCF (SPEC CPU 2006)
TIGR (BioBench)
System-level checkpoints
Macro benchmark experiment
Average job makespan
Micro benchmark experiments
Efficiency of checkpoint and restart
Efficiency in handling simultaneous clients
Efficiency in handling multiple failures
Checkpoint Storing & Recovery Overhead
- Performance scales with checkpoint sizes
- Lower network transfer overhead
[Figure: checkpointing overhead and recovery overhead (s), split into transfer and CPU overhead, for 500MB, 946MB, and 1677MB checkpoints]
Overall Performance Comparison
- Performance improvement between 11% and 44%
[Figure: average makespan time (min) for the mcf and tigr benchmark applications under a remote dedicated server, a local dedicated server, and Falcon with distributed storage]
Summary of Reliable Checkpointing System
Developed a reliable checkpoint-recovery system
FALCON
Select reliable storage hosts
Prefer lightly loaded ones
Compress and encode
Store and retrieve efficiently
Ran experiments with FALCON in DiaGrid
Performance improvement between 11% and 44%
Checkpointing in HPC
Compute nodes → gateway nodes → parallel file system (Atlas, Hera)
Sources of contention: network contention, contention for shared file system resources, contention from other clusters for the file system
2-D vs N-D Compression
[Figure: benefit matrix of Cactus, repeated from the earlier benefit-aware clustering slide]
Challenge in Extreme-Scale: Increase in Failure-Rate
[Figure: projected Top500 performance growth from 100 Tflop/s toward 1 Eflop/s (N=1 and N=500 curves), alongside the number of cores per system, which grew from roughly 10,000 to 60,000 between 2004 and 2011]
Towards Online Clustering
- Reduce the dimension of β
- Reduce the number of variables
  - Keep representative data-types whose element counts exceed a threshold [example: 100 double-type variables cover 80% of the data]
- Reduce the amount of data
  - Sampling: random, chunking, and wavelet