GPFS

On evaluating GPFS
Research work done at HLRS by Alejandro Calderon
acaldero @ arcos.inf.uc3m.es
HPC-Europa (HLRS)

Contents:
- Short description
- Metadata evaluation
  - fdtree
- Bandwidth evaluation
  - Bonnie
  - Iozone
  - IODD
  - IOP
GPFS description
http://www.ncsa.uiuc.edu/UserInfo/Data/filesystems/index.html
General Parallel File System (GPFS) is a parallel file system package developed by IBM.

History:
- Originally developed for IBM's AIX operating system, then ported to Linux systems.

Features:
- Appears to work just like a traditional UNIX file system from the user application level.
- Provides additional functionality and enhanced performance when accessed via parallel interfaces such as MPI-IO.
- High performance is obtained by striping data across multiple nodes and disks.
- Striping is performed automatically at the block level, so all files larger than the designated block size are striped (see the sketch below).
- Can be deployed in NSD or SAN configurations.
- Clusters hosting a GPFS file system can allow other clusters at different geographical locations to mount that file system.
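To make the block-level striping idea concrete, the following minimal C sketch maps a file offset to a block index and a disk. It is not GPFS code: the 256 KB block size, the four disks, and the plain round-robin placement are assumptions used only for illustration.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustration only: assumes fixed-size blocks placed round-robin over
     * 'ndisks' storage servers, which is the general idea behind block-level
     * striping (the real GPFS placement policy is more elaborate). */
    static void locate(uint64_t offset, uint64_t block_size, int ndisks)
    {
        uint64_t block = offset / block_size;    /* file block holding this byte */
        int      disk  = (int)(block % ndisks);  /* round-robin choice of disk   */
        printf("offset %llu -> block %llu on disk %d\n",
               (unsigned long long)offset, (unsigned long long)block, disk);
    }

    int main(void)
    {
        uint64_t block_size = 256 * 1024;   /* assumed 256 KB block size */
        int      ndisks     = 4;            /* assumed number of NSDs    */

        locate(0,          block_size, ndisks);
        locate(300 * 1024, block_size, ndisks);
        locate(5ULL << 20, block_size, ndisks);
        return 0;
    }

In this simple model a file smaller than one block lands on a single disk, which matches the note above that only files larger than the designated block size are effectively striped.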
GPFS (Simple NSD Configuration)

[Figure: diagram of a simple NSD configuration]
GPFS evaluation (metadata)

fdtree
- Used for testing the metadata performance of a file system
- Creates several directories and files, in several levels (see the sketch below)
- Used on:
  - Computers: noco-xyz
  - Storage systems: local, GPFS
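The following is not the fdtree source, only a rough C sketch of the kind of metadata-only workload it generates: create a small tree of directories and empty files and report creates per second. The base path, the 5 directories, and the 3 files per directory are placeholders chosen to echo the -d 5 -f 3 runs shown below.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/time.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* Create 'nfiles' empty files inside 'dir': metadata-only operations. */
    static void create_files(const char *dir, int nfiles)
    {
        char path[4096];
        for (int i = 0; i < nfiles; i++) {
            snprintf(path, sizeof(path), "%s/file_%d", dir, i);
            int fd = open(path, O_CREAT | O_WRONLY, 0644);
            if (fd >= 0) close(fd);
        }
    }

    int main(int argc, char **argv)
    {
        const char *base = (argc > 1) ? argv[1] : "/tmp/mdtest_dir"; /* or a GPFS path */
        int ndirs  = 5;   /* directories, cf. the -d 5 runs below */
        int nfiles = 3;   /* files per directory, cf. -f 3        */
        char dir[4096];

        mkdir(base, 0755);
        double t0 = now();
        for (int d = 0; d < ndirs; d++) {
            snprintf(dir, sizeof(dir), "%s/dir_%d", base, d);
            mkdir(dir, 0755);
            create_files(dir, nfiles);
        }
        printf("%.1f creates/s\n", ndirs * (1 + nfiles) / (now() - t0));
        return 0;
    }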
fdtree [local,NFS,GPFS]
./fdtree.bash -f 3 -d 5 -o X
[Chart: operations per second (0-2500) for directory creates, file creates, file removals, and directory removals, comparing /gpfs, /tmp, and /mscratch]
fdtree on GPFS (Scenario 1)
ssh {x,...} fdtree.bash -f 3 -d 5 -o /gpfs...
Scenario 1:
- several nodes,
- several processes per node,
- different subtrees,
- many small files

[Diagram: on each node, processes P1 ... Pm each run fdtree in their own subtree]
fdtree on GPFS (scenario 1)
ssh {x,...} fdtree.bash -f 3 -d 5 -o /gpfs...
[Chart: operations per second (0-600) for directory creates, file creates, file removals, and directory removals, for the configurations 1n-1p, 4n-4p, 4n-8p, 4n-16p, 8n-8p, and 8n-16p]
fdtree on GPFS (Scenario 2)
ssh {x,...} fdtree.bash -l 1 -d 1 -f 1000 -s 500 -o /gpfs...
Scenario 2:
- several nodes,
- one process per node,
- same subtree,
- many small files

[Diagram: one process (P1 ... Px) per node, all running fdtree in the same subtree]
fdtree on GPFS (scenario 2)
ssh {x,...} fdtree.bash -l 1 -d 1 -f 1000 -s 500 -o /gpfs...
[Chart: file creates per second (0-45) vs. number of processes (1, 2, 4, 8; one per node), comparing working in the same directory with working in different directories]
Metadata cache on GPFS ‘client’


- Working in a GPFS directory with 894 entries
- ls -als needs to get each file's attributes from the GPFS metadata server

[Terminal transcript on noco186: time ls -als | wc -l is run four times in a row over the 894-entry directory; the first run takes about 0.47 s (real), immediately repeated runs take about 0.03 s, and after a short pause the time rises again to about 0.22 s]

- Within a couple of seconds, the contents of the cache seem to disappear
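The same effect can be probed directly against the POSIX API. The sketch below is an illustration, not GPFS-specific code (the directory path and the 5-second pause are placeholders): it lists a directory and lstat()s every entry, as ls -als does, and times several passes to see whether the client-side attribute cache is still warm.

    #include <stdio.h>
    #include <unistd.h>
    #include <dirent.h>
    #include <sys/stat.h>
    #include <sys/time.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* Walk 'dir' and lstat() every entry, like 'ls -als' does. */
    static int scan(const char *dir)
    {
        char path[4096];
        struct stat st;
        int n = 0;

        DIR *d = opendir(dir);
        if (!d)
            return -1;
        for (struct dirent *e; (e = readdir(d)) != NULL; )
            if (snprintf(path, sizeof(path), "%s/%s", dir, e->d_name) > 0 &&
                lstat(path, &st) == 0)
                n++;
        closedir(d);
        return n;
    }

    int main(int argc, char **argv)
    {
        const char *dir = (argc > 1) ? argv[1] : ".";   /* e.g. a GPFS directory */

        for (int i = 0; i < 3; i++) {
            double t0 = now();
            int n = scan(dir);
            printf("pass %d: %d entries in %.3f s\n", i, n, now() - t0);
            sleep(5);   /* pause to see whether the client cache is still warm */
        }
        return 0;
    }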
fdtree results

Main conclusions:
- Contention at directory level:
  - If two or more processes of a parallel application need to write data, make sure each one uses a different subdirectory of the GPFS workspace (see the sketch below)
- Better results than NFS (but lower than the local file system)
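One way to apply that recommendation from an MPI program is to give every rank its own subdirectory under the shared workspace. A minimal sketch under that assumption (the /gpfs/workspace path and file names are placeholders, not paths from the tests):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char dir[256], path[512];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank creates and uses its own subdirectory, so processes do not
         * compete on one directory's metadata. "/gpfs/workspace" is a placeholder. */
        snprintf(dir, sizeof(dir), "/gpfs/workspace/rank_%d", rank);
        mkdir(dir, 0755);                 /* EEXIST on reruns is harmless here */

        snprintf(path, sizeof(path), "%s/out.dat", dir);
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd >= 0) {
            write(fd, "hello\n", 6);
            close(fd);
        }

        MPI_Finalize();
        return 0;
    }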
GPFS performance (bandwidth)

Bonnie
- Reads and writes a 2 GB file
- Measures write, rewrite and read bandwidth (see the sketch below)
- Used on:
  - Computers: cacau1, noco075
  - Storage systems: GPFS
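Bonnie itself is not reproduced here; the sketch below only illustrates the three sequential phases it times (write, rewrite, read) on one large file. The /gpfs/tmp path and the 256 MB size are placeholders (the runs below used a 2 GB file), and on a machine with a large page cache the read phase may be served partly from memory.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define CHUNK (1 << 20)          /* 1 MB per I/O call               */
    #define FSIZE (256LL << 20)      /* 256 MB here; the runs used 2 GB */

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static void report(const char *phase, double secs)
    {
        printf("%-8s %.1f MB/s\n", phase, (FSIZE / (1024.0 * 1024.0)) / secs);
    }

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/gpfs/tmp/bonnie_like.dat";
        char *buf = malloc(CHUNK);
        memset(buf, 'x', CHUNK);

        /* Phase 1: sequential write. */
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        double t = now();
        for (long long done = 0; done < FSIZE; done += CHUNK)
            write(fd, buf, CHUNK);
        fsync(fd);
        close(fd);
        report("write", now() - t);

        /* Phase 2: rewrite (read each chunk, touch it, write it back in place). */
        fd = open(path, O_RDWR);
        t = now();
        for (long long done = 0; done < FSIZE; done += CHUNK) {
            pread(fd, buf, CHUNK, done);
            buf[0] ^= 1;
            pwrite(fd, buf, CHUNK, done);
        }
        fsync(fd);
        close(fd);
        report("rewrite", now() - t);

        /* Phase 3: sequential read. */
        fd = open(path, O_RDONLY);
        t = now();
        for (long long done = 0; done < FSIZE; done += CHUNK)
            read(fd, buf, CHUNK);
        close(fd);
        report("read", now() - t);

        free(buf);
        return 0;
    }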
Bonnie on GPFS [write + re-write]
GPFS over NFS

                write (MB/sec.)   rewrite (MB/sec.)
cacau1-GPFS          51.86              3.43
noco075-GPFS        164.69             36.35
Bonnie on GPFS [read]
GPFS over NFS

                read (MB/sec.)
cacau1-GPFS          75.85
noco075-GPFS        232.38
GPFS performance (bandwidth)

Iozone
- Writes and reads with several file sizes and access sizes (see the sketch below)
- Measures write and read bandwidth
- Used on:
  - Computers: noco075
  - Storage systems: GPFS
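As an illustration only, the following C sketch sweeps the access (record) size the way an Iozone run varies it, rewriting one test file with progressively larger records and reporting MB/s per record size. The path, the 64 MB file size, and the 4 KB to 1 MB record range are placeholders.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/gpfs/tmp/iozone_like.dat";
        const long long fsize = 64LL << 20;        /* 64 MB test file (placeholder) */
        char *buf = malloc(1 << 20);
        memset(buf, 'x', 1 << 20);

        /* Rewrite the same file with access (record) sizes from 4 KB to 1 MB. */
        for (int rec = 4 * 1024; rec <= (1 << 20); rec *= 2) {
            int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            double t0 = now();
            for (long long off = 0; off < fsize; off += rec)
                write(fd, buf, rec);
            fsync(fd);
            close(fd);
            printf("record %7d B: %.1f MB/s\n",
                   rec, (fsize / (1024.0 * 1024.0)) / (now() - t0));
        }

        free(buf);
        return 0;
    }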
Iozone on GPFS [write]
Write on GPFS

[3D surface chart: write bandwidth (0-1200 MB/s) as a function of file size (32 KB to 512 MB) and record length (4 to 16384 bytes)]
Iozone on GPFS [read]
Read on GPFS

[3D surface chart: read bandwidth (0-2500 MB/s) as a function of file size (64 KB to 512 MB) and record length (4 to 16384 bytes)]
GPFS evaluation (bandwidth)

IODD
- Evaluation of disk and networking performance by using several nodes
- A dd-like command that can be run from MPI
- Used on:
  - 2 and 4 nodes,
  - 4, 8, 16, and 32 processes (1, 2, 4, and 8 per node)
  - that write a file of 1, 2, 4, 8, 16, and 32 GB
  - using both the POSIX interface and the MPI-IO interface
How IODD works…
[Diagram: on each node, processes P1 ... Pm each write their own file of blocks a, b, ..., n]

- nodes: 2 and 4
- processes (m): 4, 8, 16, and 32 processes (1, 2, 4, and 8 per node)
- file size (n): 1, 2, 4, 8, 16, and 32 GB
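IODD itself is not reproduced here. The sketch below only illustrates the pattern described above: a dd-like writer driven from MPI, where each process writes its own file in fixed-size blocks through MPI-IO and rank 0 reports the aggregate bandwidth. The /gpfs/workspace path, the 1 MB block size, and the 1 GB per-process size are placeholders.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* dd-like parameters: block size and block count (placeholders). */
        const int       bs    = 1 << 20;       /* 1 MB per write              */
        const long long count = 1024;          /* 1024 blocks = 1 GB per rank */

        char path[256];
        snprintf(path, sizeof(path), "/gpfs/workspace/iodd_like_%d.dat", rank);

        char *buf = malloc(bs);
        memset(buf, 'x', bs);

        /* Each process writes its own file through MPI-IO. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (long long i = 0; i < count; i++)
            MPI_File_write_at(fh, i * (MPI_Offset)bs, buf, bs, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0) {
            double mb = (double)nprocs * count * bs / (1024.0 * 1024.0);
            printf("aggregate: %.1f MB/s\n", mb / (t1 - t0));
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }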
IODD on 2 nodes [MPI-IO]
GPFS (writing, 2 nodes)

[3D surface chart: write bandwidth (0-180 MB/sec.) as a function of processes per node (1, 2, 4, 8) and file size (1-32 GB)]
IODD on 4 nodes [MPI-IO]
GPFS (writing, 4 nodes)

[3D surface chart: write bandwidth (0-180 MB/sec.) as a function of processes per node (1, 2, 4, 8) and file size (1-32 GB)]
Differences by using different APIs
GPFS (writing, 2 nodes)

[Two 3D surface charts: write bandwidth as a function of processes per node (1, 2, 4, 8) and file size (1-32 GB), with MPI-IO and with POSIX; the MPI-IO chart's bandwidth axis reaches 180 MB/sec., the POSIX chart's only 70 MB/sec.]
IODD on 2 GB [MPI-IO, = directory]
GPFS (writing, 1-32 nodes, same directory)
[Chart: write bandwidth (0-160 MB/sec.) vs. number of nodes (1, 2, 4, 8, 16, 32), all nodes writing in the same directory]
IODD on 2 GB [MPI-IO, ≠ directory]
GPFS (writing, 1-32 nodes, different directories)
[Chart: write bandwidth (0-160 MB/sec.) vs. number of nodes (1, 2, 4, 8, 16, 32), each node writing in a different directory]
IODD results

Main conclusions:
- The bandwidth decreases with the number of processes per node
  - Beware of multithreaded applications with medium-to-high I/O bandwidth requirements per thread
- It is very important to use MPI-IO, because this API lets users get more bandwidth
- The bandwidth also decreases with more than 4 nodes
- With large files, metadata management seems not to be the main bottleneck
GPFS evaluation (bandwidth)

IOP
- Gets the bandwidth obtained by writing and reading in parallel from several processes
- The file size is divided by the number of processes, so each process works on an independent part of the file
- Used on:
  - GPFS through MPI-IO (ROMIO on Open MPI)
  - Two nodes writing a 2 GB file in parallel:
    - on independent files (non-shared)
    - on the same file (shared)
How IOP works…
[Diagram, file per process (non-shared): each process P1 ... Pm writes its own file]
[Diagram, segmented access (shared): processes P1 ... Pm write disjoint, contiguous segments of one shared file of size n]

- 2 nodes
- m = 2 processes (1 per node)
- n = 2 GB file size
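A minimal MPI-IO sketch of the two access patterns, under stated assumptions (the /gpfs/workspace paths, the 64 MB total size, and the 128 KB access size are placeholders; the real tests wrote 2 GB): in the non-shared case each rank opens its own file with MPI_COMM_SELF, in the shared case all ranks open one file collectively and write disjoint, contiguous segments.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    /* Write 'size' bytes in 'asize'-byte accesses starting at 'base' in 'path'. */
    static void write_region(const char *path, MPI_Comm comm,
                             MPI_Offset base, long long size, int asize)
    {
        char *buf = malloc(asize);
        memset(buf, 'x', asize);

        MPI_File fh;
        MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        for (long long off = 0; off < size; off += asize)
            MPI_File_write_at(fh, base + off, buf, asize, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(buf);
    }

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long long total = 64LL << 20;      /* total data (2 GB in the tests) */
        const long long part  = total / nprocs;  /* each rank handles one part     */
        const int       asize = 128 * 1024;      /* access size, e.g. 128 KB       */
        char path[256];

        /* Non-shared: one file per process, opened with MPI_COMM_SELF. */
        snprintf(path, sizeof(path), "/gpfs/workspace/iop_rank_%d.dat", rank);
        write_region(path, MPI_COMM_SELF, 0, part, asize);

        /* Shared (segmented): all ranks open the same file collectively and each
         * writes its own disjoint, contiguous segment of it. */
        write_region("/gpfs/workspace/iop_shared.dat", MPI_COMM_WORLD,
                     (MPI_Offset)rank * part, part, asize);

        MPI_Finalize();
        return 0;
    }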
IOP: Differences by using shared/non-shared
writing on file(s) over GPFS
[Chart: write bandwidth (0-180 MB/sec.) vs. access size (1 KB to 1 MB), comparing NON-shared and shared access]
IOP: Differences by using shared/non-shared
reading on file(s) over GPFS
[Chart: read bandwidth (0-200 MB/sec.) vs. access size (1 KB to 1 MB), comparing NON-shared and shared access]
[Two figures: GPFS writing in non-shared files, and GPFS writing in a shared file]
GPFS writing in shared file: the 128 KB magic number

[Chart: bandwidth (0-140 MB/sec) vs. access size (1 KB to 1 MB) for write, read, Rread, and Bread]
IOP results

Main conclusions:
- If several processes try to write to the same file, even in independent areas, the performance decreases
- With several independent files the results are similar across tests, but with a shared file they are more irregular
- A magic number appears: 128 KB
  - It seems that at that access size the internal algorithm changes and the bandwidth increases
Download