On evaluating GPFS
Research work done at HLRS by Alejandro Calderon
acaldero @ arcos.inf.uc3m.es -- HPC-Europa (HLRS)

Outline:
- Short description
- Metadata evaluation: fdtree
- Bandwidth evaluation: Bonnie, Iozone, IODD, IOP

GPFS description
(see http://www.ncsa.uiuc.edu/UserInfo/Data/filesystems/index.html)
- General Parallel File System (GPFS) is a parallel file system package developed by IBM.
- History: originally developed for IBM's AIX operating system, then ported to Linux systems.
- Features:
  - From the user application level it appears to work just like a traditional UNIX file system.
  - It provides additional functionality and enhanced performance when accessed through parallel interfaces such as MPI-IO.
  - GPFS obtains high performance by striping data across multiple nodes and disks. Striping is performed automatically at the block level, so all files larger than the designated block size are striped.
  - It can be deployed in NSD or SAN configurations.
  - Clusters hosting a GPFS file system can allow other clusters at different geographical locations to mount that file system.

GPFS (simple NSD configuration)
[Figure: simple NSD configuration diagram]

GPFS evaluation (metadata): fdtree
- fdtree is used for testing the metadata performance of a file system: it creates several directories and files, over several levels.
- Used on:
  - Computers: noco-xyz
  - Storage systems: local, GPFS

fdtree [local, NFS, GPFS]

    ./fdtree.bash -f 3 -d 5 -o X

[Chart: directory creates, file creates, file removals and directory removals per second on /gpfs, /tmp and /mscratch; y-axis 0-2500 operations/sec]

fdtree on GPFS (Scenario 1)

    ssh {x,...} fdtree.bash -f 3 -d 5 -o /gpfs...

Scenario 1: several nodes, several processes per node, different subtrees, many small files.
[Chart: directory creates, file creates, file removals and directory removals per second for 1n-1p, 4n-4p, 4n-8p, 4n-16p, 8n-8p and 8n-16p; y-axis 0-600 operations/sec]

fdtree on GPFS (Scenario 2)

    ssh {x,...} fdtree.bash -l 1 -d 1 -f 1000 -s 500 -o /gpfs...

Scenario 2: several nodes, one process per node, same subtree, many small files. The key difference with respect to Scenario 1 is whether the processes create their files in different subtrees or in the same one; the sketch below illustrates that access pattern.
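To make the directory-contention pattern concrete, here is a minimal MPI sketch in C (not the fdtree.bash tool used in the slides) that times the creation of many small files when every rank works in its own subdirectory versus one shared directory. The base path /gpfs/scratch, the file count and the reporting scheme are illustrative assumptions only.

    /* Minimal sketch of an fdtree-style metadata test: each rank creates
     * many small files either in its own subdirectory or in one shared
     * directory, and the aggregate create rate is reported.
     * NOTE: /gpfs/scratch is a placeholder path and the file count is
     * illustrative; this is not the fdtree.bash tool from the slides. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define NFILES 1000

    static double create_files(const char *dir, int rank)
    {
        char path[4096];
        double t0 = MPI_Wtime();
        for (int i = 0; i < NFILES; i++) {
            /* File names include the rank so they never collide. */
            snprintf(path, sizeof(path), "%s/f.%d.%d", dir, rank, i);
            int fd = open(path, O_CREAT | O_WRONLY, 0644);
            if (fd >= 0)
                close(fd);
        }
        return MPI_Wtime() - t0;
    }

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        char dir[4096];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const char *base = (argc > 1) ? argv[1] : "/gpfs/scratch";  /* placeholder */

        /* Scenario A: every rank works in its own subdirectory. */
        snprintf(dir, sizeof(dir), "%s/rank.%d", base, rank);
        mkdir(dir, 0755);
        double t_own = create_files(dir, rank);

        /* Scenario B: all ranks work in the same shared directory. */
        snprintf(dir, sizeof(dir), "%s/shared", base);
        if (rank == 0)
            mkdir(dir, 0755);
        MPI_Barrier(MPI_COMM_WORLD);
        double t_shared = create_files(dir, rank);

        /* Take the slowest rank, which bounds the aggregate rate. */
        double max_own, max_shared;
        MPI_Reduce(&t_own, &max_own, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t_shared, &max_shared, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("creates/s: own dirs %.1f, shared dir %.1f\n",
                   nprocs * NFILES / max_own, nprocs * NFILES / max_shared);

        MPI_Finalize();
        return 0;
    }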
[Chart: file creates per second for 1, 2, 4 and 8 processes (1 per node), working in the same directory versus working in different directories; y-axis 0-45 creates/sec]

Metadata cache on the GPFS 'client'
- Working in a GPFS directory with 894 entries: ls -als needs to get the attributes of each file from the GPFS metadata server.
- [Terminal output on node noco186: several consecutive runs of "time ls -als | wc -l" over the 894 entries; the real times range from 0.009 s up to 0.466 s.]
- Within a couple of seconds, the contents of the cache seem to disappear.

fdtree results: main conclusions
- Contention at the directory level: if two or more processes of a parallel application need to write data, make sure that each one uses a different subdirectory of the GPFS workspace.
- Better results than NFS (but lower than the local file system).

GPFS performance (bandwidth): Bonnie
- Bonnie writes, rewrites and reads a 2 GB file.
- Used on:
  - Computers: cacau1, noco075
  - Storage systems: GPFS

Bonnie on GPFS [write + re-write], GPFS over NFS

                    write (MB/s)   rewrite (MB/s)
    cacau1-GPFS          51.86           3.43
    noco075-GPFS        164.69          36.35

Bonnie on GPFS [read], GPFS over NFS

                    read (MB/s)
    cacau1-GPFS          75.85
    noco075-GPFS        232.38

GPFS performance (bandwidth): Iozone
- Iozone writes and reads with several file sizes and access sizes, and reports the write and read bandwidth.
- Used on:
  - Computers: noco075
  - Storage systems: GPFS

Iozone on GPFS [write]
[Chart: 3D surface of write bandwidth on GPFS (0-1200 MB/s) as a function of file size (KB) and record length (bytes)]

Iozone on GPFS [read]
[Chart: 3D surface of read bandwidth on GPFS (0-2500 MB/s) as a function of file size (KB) and record length (bytes)]

GPFS evaluation (bandwidth): IODD
- Evaluation of disk performance by using several nodes (next: disk and networking).
- IODD is a dd-like command that can be run from MPI.
- Used on: 2 and 4 nodes; 4, 8, 16 and 32 processes (1, 2, 3 and 4 per node) that write a file of 1, 2, 4, 8, 16 and 32 GB; using both the POSIX interface and the MPI-IO interface.

How IODD works
[Figure: on each node, every process P1..Pm writes its own sequence of blocks a, b, ..., n]
- nodes = 2, 4
- processes = 4, 8, 16 and 32 (1, 2, 3 and 4 per node)
- file size = 1, 2, 4, 8, 16 and 32 GB
A minimal sketch of this kind of MPI-IO writer is given after the 2-node results below.

IODD on 2 nodes [MPI-IO]
[Chart: GPFS write bandwidth (0-180 MB/s) on 2 nodes, as a function of processes per node and file size (GB)]
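For reference, a dd-like MPI-IO writer of the kind IODD implements can be sketched as follows: one file per rank, written in fixed-size blocks, with the aggregate bandwidth reported by rank 0. The output path, block size and block count are placeholder assumptions, not the actual IODD parameters; a POSIX variant would simply replace the MPI_File calls with open/write/close.

    /* Minimal sketch of a dd-like MPI-IO writer: every rank writes its own
     * file in blocks, and rank 0 reports the aggregate bandwidth.
     * NOTE: the output path, block size and block count are illustrative
     * placeholders; this is not the IODD tool evaluated in the slides. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK_SIZE (1 << 20)   /* 1 MiB per write operation   */
    #define NBLOCKS    1024        /* 1 GiB written by each rank  */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        char path[4096];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char *buf = malloc(BLOCK_SIZE);
        for (int i = 0; i < BLOCK_SIZE; i++)
            buf[i] = (char)i;

        /* One file per rank, placed under a placeholder GPFS directory. */
        snprintf(path, sizeof(path), "/gpfs/scratch/iodd.%d", rank);
        MPI_File_open(MPI_COMM_SELF, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NBLOCKS; i++)
            MPI_File_write(fh, buf, BLOCK_SIZE, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);          /* close flushes the data to GPFS */
        MPI_Barrier(MPI_COMM_WORLD);
        double t = MPI_Wtime() - t0;

        if (rank == 0)
            printf("aggregate write bandwidth: %.1f MB/s\n",
                   (double)nprocs * NBLOCKS * BLOCK_SIZE / (1024.0 * 1024.0) / t);

        free(buf);
        MPI_Finalize();
        return 0;
    }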
IODD on 4 nodes [MPI-IO]
[Chart: GPFS write bandwidth (0-180 MB/s) on 4 nodes, as a function of processes per node and file size (GB)]

Differences by using different APIs
[Charts: GPFS write bandwidth on 2 nodes as a function of processes per node and file size, with MPI-IO (scale up to 180 MB/s) versus POSIX (scale up to 70 MB/s)]

IODD on 2 GB [MPI-IO, same directory]
[Chart: GPFS write bandwidth (0-160 MB/s) when writing a 2 GB file from 1, 2, 4, 8, 16 and 32 nodes, all in the same directory]

IODD on 2 GB [MPI-IO, different directories]
[Chart: the same measurement with each process writing in a different directory]

IODD results: main conclusions
- The bandwidth decreases with the number of processes per node: beware of multithreaded applications with medium-to-high I/O bandwidth requirements for each thread.
- It is very important to use MPI-IO, because this API lets users obtain more bandwidth.
- The bandwidth also decreases with more than 4 nodes.
- With large files, metadata management seems not to be the main bottleneck.

GPFS evaluation (bandwidth): IOP
- IOP measures the bandwidth obtained by writing and reading in parallel from several processes. The file size is divided by the number of processes, so each process works on an independent part of the file.
- Used on:
  - GPFS through MPI-IO (ROMIO on Open MPI).
  - Two nodes writing a 2 GB file in parallel: on independent files (non-shared) and on the same file (shared).

How IOP works
- File per process (non-shared): each process P1..Pm writes its blocks a, b, ..., x to its own file.
- Segmented access (shared): each process writes the blocks a, b, ..., x of its own contiguous segment of a single shared file.
- Setup: 2 nodes, m = 2 processes (1 per node), n = 2 GB file size.
(A sketch of both access patterns is given at the end of this section.)

IOP: differences by using shared/non-shared writing on file(s) over GPFS
[Chart: write bandwidth (0-180 MB/s) versus access size (1 KB to 1 MB) for non-shared and shared files]

IOP: differences by using shared/non-shared reading on file(s) over GPFS
[Chart: read bandwidth (0-200 MB/s) versus access size (1 KB to 1 MB) for non-shared and shared files]

[Figures: GPFS writing in non-shared files; GPFS writing in a shared file]

GPFS writing in a shared file: the 128 KB magic number
[Chart: bandwidth (0-140 MB/s) of write, read, Rread and Bread versus access size (1 KB to 1 MB)]

IOP results: main conclusions
- If several processes try to write to the same file, even in independent areas, the performance decreases.
- With several independent files the results are similar across the tests, but with a shared file they are more irregular.
- A magic number appears: 128 KB. It seems that at that point the internal algorithm changes and the bandwidth increases.
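To make the two IOP access patterns concrete, the following sketch writes either one file per rank (non-shared) or one shared file split into contiguous per-rank segments using MPI_File_write_at. The path, total size and access size are placeholder assumptions, not the exact IOP parameters.

    /* Minimal sketch of the two IOP access patterns over MPI-IO:
     *  - non-shared: one file per rank,
     *  - shared:     one file, split into contiguous per-rank segments.
     * NOTE: path, total size and access size are illustrative placeholders;
     * this is not the IOP tool evaluated in the slides. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define TOTAL_SIZE  (2LL << 30)   /* 2 GiB split among the ranks   */
    #define ACCESS_SIZE (128 * 1024)  /* 128 KiB per write operation   */

    static double write_pattern(int shared, int rank, int nprocs, char *buf)
    {
        MPI_File fh;
        char path[4096];
        long long seg  = TOTAL_SIZE / nprocs;   /* bytes written per rank */
        long long nops = seg / ACCESS_SIZE;     /* writes per rank        */

        /* Shared: every rank opens the same file; non-shared: one per rank. */
        if (shared)
            snprintf(path, sizeof(path), "/gpfs/scratch/iop.shared");
        else
            snprintf(path, sizeof(path), "/gpfs/scratch/iop.%d", rank);
        MPI_File_open(shared ? MPI_COMM_WORLD : MPI_COMM_SELF, path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (long long i = 0; i < nops; i++) {
            /* In the shared case each rank writes inside its own segment. */
            MPI_Offset off = (shared ? rank * seg : 0) + i * ACCESS_SIZE;
            MPI_File_write_at(fh, off, buf, ACCESS_SIZE, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        }
        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        return MPI_Wtime() - t0;
    }

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char *buf = malloc(ACCESS_SIZE);
        memset(buf, 'x', ACCESS_SIZE);

        double t_nonshared = write_pattern(0, rank, nprocs, buf);
        double t_shared    = write_pattern(1, rank, nprocs, buf);

        if (rank == 0)
            printf("bandwidth MB/s: non-shared %.1f, shared %.1f\n",
                   TOTAL_SIZE / (1024.0 * 1024.0) / t_nonshared,
                   TOTAL_SIZE / (1024.0 * 1024.0) / t_shared);

        free(buf);
        MPI_Finalize();
        return 0;
    }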