A Case Study in Using Local I/O and GPFS to Improve Simulation Scalability

Vincent Bannister, G. W. Howell, C. T. Kelley, Eric Sills, Qianyi Zhang
North Carolina State University (1)

(1) Contact information: Vince Bannister, now at Microsoft, vebannis@ncsu.edu; Gary Howell and Eric Sills, HPC/ITD, NCSU, gary_howell@ncsu.edu, eric_sills@ncsu.edu; Tim Kelley, NCSU Math Department, ctk@ncsu.edu; Qianyi Zhang, NCSU Statistics Department, qzhang3@stat.unity.ncsu

Abstract

Many optimization algorithms exploit parallelism by calling multiple independent instances of the function to be minimized, and these functions in turn may call off-the-shelf simulators. The I/O load from the simulators can cause problems for an NFS file system. In this paper we explore efficient parallelization of a program in which each processor makes serial calls to a MODFLOW simulator. Each MODFLOW simulation reads input files and produces output files, so the application is "embarrassingly" parallel except for disk I/O. Substituting local scratch storage for global file storage ameliorates synchronization and contention issues. An easier solution was to use the high performance global file system GPFS instead of NFS. Compared to using local I/O, using a high performance shared file system such as GPFS requires less user effort.

Keywords: Software, Efficiency, Scalability, Parallel, High Performance File Systems, GPFS.

Introduction

MODFLOW is a large serial simulation suite, and a parallel recode would not be easy. Due to the complexity of MODFLOW, we preferred to treat it as a "black box": given an input file, MODFLOW produces the desired output. Each processor can run many such simulations, with each MODFLOW output parsed by a high level code (Matlab in our case) which writes a file with the correct inputs for the next run. On a single CPU, file I/O requires only a small fraction of the execution time. Parallel execution, with many instances of MODFLOW and Matlab simultaneously accessing a single NFS file system, produces an I/O bound problem, with file system contention resulting in poor scaling.

We solved the problem in two ways: first by using local hard drives (each node has one) for intermediate computations, and second by using the high performance shared parallel file system GPFS. Both solutions gave good performance for a moderate number of processors. This paper describes and compares the two solutions. First the problem is described in a bit more detail, then the steps taken to use local file systems instead of a global file system. Timing results are compared to those obtained by changing the file system from NFS to GPFS. Discussion and conclusions follow, and an appendix gives a SAS-based statistical analysis comparing sets of times for NFS, NFS with local file I/O, and GPFS; for each of these three cases we compare the performance of loaded and unloaded global file systems.

A Case Study in Simulation -- Using Local I/O

Our simulations used MODFLOW [1]. The MODFLOW [2] simulator creates and tears down a large number of scratch files during execution, primarily as intermediaries in the computational process. Unfortunately, without a major recode, it seems impossible to avoid use of the file system. The initial input is a large matrix of input values, passed in plaintext form in a single file. The program then distributes the data to the individual instances of the simulator, invoking individual simulations which produce individual output on each processor.
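In outline, the approach follows the pattern sketched below: partition the parameter sets across MPI ranks, let each rank drive the simulator as a black box through system() calls in its own numbered working directories, and gather the parsed results at the end. This is a minimal C/MPI illustration of the idea, not the authors' Matlab wrapper; the modflow command line, the lib/ directory of auxiliary files, and the input/output file names are placeholders.

/* Minimal sketch: each MPI rank takes a strided share of the parameter
 * sets, runs the simulator as a black box in a numbered directory, and
 * rank 0 gathers the parsed results at the end. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int njobs = 64;                 /* rows of the input matrix (placeholder) */
    double local_result = 0.0, *all = NULL;

    for (int job = rank; job < njobs; job += size) {
        char cmd[256];
        /* input_N.txt is assumed to have been written into run_N by an
         * earlier step that parsed the common input file */
        snprintf(cmd, sizeof cmd,
                 "mkdir -p run_%d && cp lib/* run_%d/ && "
                 "cd run_%d && modflow < input_%d.txt > output_%d.txt",
                 job, job, job, job, job);
        if (system(cmd) != 0)
            fprintf(stderr, "rank %d: job %d failed\n", rank, job);
        /* ... parse output_N.txt here and accumulate into local_result ... */
    }

    if (rank == 0)
        all = malloc(size * sizeof(double));
    MPI_Gather(&local_result, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    if (rank == 0) {
        /* rank 0 would write the combined, human-legible output file here */
        free(all);
    }
    MPI_Finalize();
    return 0;
}

The actual wrapper, described next, adds the bookkeeping around this skeleton: per-job directory setup, output parsing, cleanup of temporary files, and assembly of the final output file.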
A high level wrapper program (written in Matlab) makes the decisions for each new simulator run. After a set of simulations is performed on each processor, a final combined data file is formed. More specifically, each instance of the wrapper program first parses the entire input file and creates a series of directories, numbered according to the relevant job. The files needed to run the simulator are copied into each numbered directory, and each distinct input set is written to the appropriate directory. The simulator is executed via a blocking Unix system call; after MODFLOW's completion, the wrapper parses the appropriate output file and removes the temporary files associated with the run. The wrapper then checks for further jobs and, if it finds there are none, synchronizes with the remaining processes to assemble the output array and delete what remains of the directories. Once the output array is formed, the wrapper writes a common output data file in human-legible format.

This approach depends only on the availability of Unix-style system commands and an implementation of the generic MPI functions. The method requires little communication overhead; each simulation is run as a serial command, and the "scratch" files it writes can be read by a new simulation on the same processor. Limiting use of the shared file system to initiation of the simulation and construction of a common data file allows effective parallel scaling of the computation.

File System Race Conditions in NFS

The parallel code is a wrapper on the serial simulations. The wrapper initializes and sequences the serial simulations, each of which reads from an input file and creates an output file to be read by another simulation. The initial wrapper suffered from erratic NFS file system behavior, delivering only partially correct results, with some simulations aborting with input errors. The problem was that C file system calls could encounter race conditions. Even though a file has been closed (or, in some cases, even subjected to an "fsync"), the call may return before the file has been written to disk. This can result in a race condition between the file write performed by the simulator and the file read performed by the wrapper, as the write may not have completed its action when the next wrapper command (the read) is invoked. A pathological example:

…
system("Run the simulation suite");
fp = fopen("output file", "a+");
…

The read will then open and lock the file, forcing the write to wait and causing the wrapper to read an empty file, which pushes null values into the results matrix. Afterward, the write completes, generating the output as appropriate and leaving no indication of having been interrupted. If there are a number of commands between the system call to the simulator and the read performed by the wrapper, the code tends to execute correctly. However, when the two I/O calls are adjacent as in the example, the failure emerged frequently, causing as many as a fourth of the run attempts to fail or to return partial results padded with zero or null values. To resolve this issue, the processes must be forced to block until the final command of the simulator has committed its action to the file. On our NFS file system the standard "fsync" command did not reliably enforce synchronization. The following sleep(1) call did eliminate the race condition.
One solution:

…
system("Run the simulation suite");
sleep(1);
fp = fopen("output file", "a+");
…

Contention for a Shared File System

Because the environment in which we operate is shared with other processes and users, the node-to-storage link is in near continual use. Moreover, the simulations on each processor of a parallel job tend to access the shared file system simultaneously, so each process would frequently have to wait a significant amount of time to gain access to the file system. Parallel speed-ups and scalability were limited and erratic.

[Figure: number of processors vs. time for full code execution, at low network usage on the NFS share.]

Use of On Node (Local) Storage

The NC State blade center has 40 Gbyte local disks on each 2-CPU blade. Most of each local hard drive is available as a /scratch directory accessible to user-level programs, so the cluster has fairly large on-node storage and only two processors per node. The wrapper code was modified so that intermediate simulations read from and write to local storage. In the modified version, the wrapper created the individual run directories directly on the blades, ran the simulator, cleaned up the blade's temporary files, and returned the results through MPI communication. With these changes, even during an extremely high-use period, the processes did not slow noticeably once admitted to the running queue, and they performed more consistently across numerous comparable runs. As an example of a similar use of on-node storage, see Saltz et al. [3]. The code change was essentially from this:

…
system("Run the simulation suite");
fp = fopen("output file", "a+");
…

to this:

…
system("mkdir /scratch/sim");
system("cp lib/* /scratch/sim/");
chdir("/scratch/sim/");
system("Run the simulation suite");
fp = fopen("output file", "a+");
system("rm -f /scratch/sim/*");
…

The following results were typical for NFS on a relatively unloaded file system.

[Figure: number of processors vs. time in seconds, comparing the on-node file system ("On node FS") with the shared file system ("Shared FS").]

Good parallel efficiency resulted. A disadvantage of this approach is that local file systems may not be as reliable; for example, they may already be full, or the user code may leave them full, rendering them unusable for the next run or user. Using a high performance parallel file system (GPFS as opposed to NFS) was an alternative approach we evaluated.

GPFS

GPFS is a licensed commercial file system available from IBM. GPFS [4] is designed for high throughput and also to cope well with contention for the disk resource. For the following results, both the GPFS and NFS file systems used similar Apple XRaid boxes. The NFS file system was mounted with metadata caching enabled. The "Shared" runs read and wrote all files on the shared NFS file system, and the "GPFS" runs did the same on GPFS. The "On node" runs used the NFS file system for initial and final storage of data and used the local on-node hard drives as intermediate input/output scratch space for the MODFLOW simulations. As with the previous results, the times (in seconds) were averaged over 50 runs, with the 3 longest and 3 shortest times omitted.

Number of processors   On node   Shared   GPFS
16                      6.71      7.91     8.49
12                      8.46     10.96    11.03
 8                      9.69     12.96    12.18
 6                     10.3      16.1     16.2
 4                     13.26     20.7     18.6
 2                     18.9      35.7     30.6
 1                     27.2      59.3     49.3
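The timing convention used for this table and for the results that follow (50 runs per configuration, with the 3 longest and 3 shortest times discarded before averaging) can be written as a small harness. The sketch below is our illustration of that convention, not the authors' scripts; the system() command is the same placeholder used in the code fragments above.

/* Run the job N times, sort the wall-clock times, drop the 3 shortest and
 * 3 longest, and report the trimmed mean, median, and maximum. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void)
{
    enum { N = 50, TRIM = 3 };
    double t[N];

    for (int i = 0; i < N; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        system("Run the simulation suite");        /* placeholder command */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        t[i] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    qsort(t, N, sizeof t[0], cmp);
    double sum = 0.0;
    for (int i = TRIM; i < N - TRIM; i++)          /* drop 3 shortest, 3 longest */
        sum += t[i];
    printf("trimmed mean %.2f s, median %.2f s, max %.2f s\n",
           sum / (N - 2 * TRIM), 0.5 * (t[N / 2 - 1] + t[N / 2]), t[N - 1]);
    return 0;
}

The per-run listings in the next section report the individual times instead, summarized by their medians and maxima.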
Impact of File System Load on Execution Time

The above results did not differentiate between an unloaded and a loaded file system. As another experiment, we ran the same code with 16 processes. There were three factors to be tested: NFS vs. GPFS, on-node scratch space vs. the shared file system, and loaded vs. unloaded file system. Both the GPFS and NFS file systems were running on the same disk hardware (Apple XRaids). For these experiments the NFS file system was used only by the test code. The GPFS file system was shared by other users, so it was not as reliably unloaded. The program bonnie++, which tests I/O latency and throughput, was applied to load both the NFS and GPFS file systems, with local I/O going to the blade hard drives. The 20 times (in seconds) listed for each case are from successive runs, with a fresh copy of bonnie started before the first run and kept running during all 20 runs. Times from the first few minutes of the runs correspond to bonnie writing with "putc"; the "putc" writes appear to be the most disruptive of the bonnie file operations. GPFS was relatively unaffected by the running of bonnie, so we retested with 4 simultaneous copies of bonnie, verifying that bonnie was having some influence.

Running from NFS file system /amd64_share, bonnie on (max = 134.4, median = 27.5):
50.0 62.0 41.0 38.1 35.3 76.3 134.4 27.0 27.6 27.4 26.7 27.5 27.2 27.9 27.6 28.2 25.1 26.0 30.6 25.6

Running from NFS file system /amd64_share, bonnie off (max = 26.4, median = 25.1):
26.4 25.1 24.9 24.9 25.0 25.6 25.0 25.1 24.8 25.2 25.0 25.1 25.0 24.9 24.9 25.3 25.3 24.8 25.1 25.2

Running from NFS file system /amd64_share, intermediate I/O to local disks, bonnie on (max = 48.4, median = 23.8):
23.9 23.5 23.9 24.1 23.7 23.5 23.8 23.7 24.1 44.4 48.4 23.5 32.9 28.4 23.8 24.1 23.8 27.6 24.0 23.8

Running from NFS file system /amd64_share, intermediate I/O to local disks, bonnie off (max = 24.3, median = 23.7):
24.0 23.7 24.3 24.2 23.8 23.7 24.1 23.7 23.7 23.6 23.9 23.7 23.7 23.7 23.6 23.7 23.9 23.4 23.6 23.7

Running from /gpfs_share, bonnie on (max = 24.1, median = 23.6):
24.4 23.9 23.5 23.8 23.4 23.4 24.0 23.4 23.4 24.0 23.3 23.6 23.8 23.2 23.5 23.5 23.4 23.4 23.7 23.4

Running from /gpfs_share, bonnie off (max = 24.6, median = 23.7):
24.6 23.6 23.6 23.8 23.8 23.6 23.7 23.6 23.7 23.7 23.8 23.6 23.6 23.4 23.5 23.9 23.6 23.4 23.7 23.4

Running from /gpfs_share, 4 copies of bonnie on (max = 29.9, median = 27.4):
29.9 27.8 27.4 28.0 27.2 27.4 27.8 27.3 27.3 27.2 27.8 27.4 27.5 27.1 27.3 27.3 27.3 27.8 27.8 27.4
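For reference, the background load described above could be generated along the following lines; the target directories and the file size are our assumptions rather than the values used in these experiments, and only bonnie++'s standard -d (target directory) and -s (total file size in MB) options are used.

/* Illustration only: start n copies of bonnie++ against a shared file
 * system to generate background I/O load while the timed runs proceed. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1;   /* e.g. 4 for the last GPFS test */
    for (int i = 0; i < n; i++) {
        char cmd[256];
        snprintf(cmd, sizeof cmd,
                 "mkdir -p /gpfs_share/bonnie_%d && "
                 "bonnie++ -d /gpfs_share/bonnie_%d -s 8192 "
                 "> bonnie_%d.log 2>&1 &",
                 i, i, i);
        if (system(cmd) != 0)
            fprintf(stderr, "failed to launch bonnie++ copy %d\n", i);
    }
    return 0;
}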
Discussion

A statistical analysis using SAS is appended. It shows that the hypothesis that the above sets of 20 times are all drawn from the same normally distributed population can be rejected. The interested reader can refer to that analysis for more detail. Some conclusions supported by the data analysis are the following.

For both an unloaded and a loaded file system, GPFS execution times were much less variable than the NFS times. When the file system was loaded, NFS times were unpredictable, and especially in the case that NFS was also used to store intermediate results, NFS times were sometimes very poor. Putting intermediate results on local file systems improved the worst case performance of NFS, but a few runs still took almost twice as long as the average.

Using a global NFS file system for all the runs in the presence of file system contention (here artificially induced by running the I/O transfer test program bonnie++) caused a number of the simulations to require much more time. One run required more than 2 minutes, compared to a median execution time of around 27 seconds.

Using local file systems for scratch storage improved execution times, both in the median and in the maximal observed times. With bonnie on, the maximal time was 48.4 seconds and the median time was less than 24 seconds, with two observed times of more than 40 seconds and one of more than 30 seconds. Without bonnie, the median execution time changed little, but the maximal time was 24.3 seconds.

The variability of execution times was much less with GPFS. The maximal and median execution times were almost the same for the unloaded NFS storage with local scratch, for GPFS without local scratch, and even for GPFS under bonnie contention. Running four copies of bonnie simultaneously verified that bonnie was producing a GPFS load, but even in that case the maximal time increased by only about 20% and the median time by about 10%.

In the experiments here, GPFS is faster than NFS when several processes write to and read from the same file; a design difference that may explain the difference in performance is that GPFS uses distributed locking and metadata mechanisms to allow several processes to access a file simultaneously. GPFS also suffers less contention than a globally shared NFS when multiple processes simultaneously access their own files, perhaps because GPFS is designed to write to multiple disks in parallel. MPI I/O compatibility is built into GPFS; configuring NFS to work with MPI I/O, on the other hand, entailed turning off metadata caching, which had a strongly adverse impact on data throughput (the runs made here used cached metadata). For more detail about the design of GPFS, see [5].

While one test case of an untuned GPFS compared to an untuned (metadata-cached, ext3) NFS system cannot be taken as definitive, GPFS performance was certainly superior in this case. We have had other user feedback indicating that codes scale better on this particular GPFS file system than on our NFS ext3 system. From the system point of view, setting up GPFS was a significant labor, but it then helped all users. Training a user to use local storage helps one user at a time and entails significant user effort. Moreover, the user must also be persuaded to clean up after himself so that local hard drives do not fill with abandoned user files. Some users of local disks also find a significant staging time to recover their data after each run.

Other issues with local disks

Even when users do not use the disks local to a cluster node, some local disk management issues remain, for example the choice of files to store permanently on local hard drives. Operating system files and system commands likely to be accessed during execution must be present, and similarly shared libraries may need to be found on all nodes so they can be accessed during code execution. Storing other frequently accessed executables on local hard drives can avoid contention for a globally shared file system.

Conclusions

On our cluster of GigE-connected IBM blades, currently with around 500 processors and several dozen jobs running at a time, GPFS performs significantly better than NFS. Using GPFS reduced file I/O variability more than using local hard drives did. Shifting an algorithm to use GPFS is usually trivial from the user's point of view. Installing GPFS was not trivial from a system administration standpoint, and GPFS requires more file I/O servers than NFS does. The cost of licensing GPFS may be significant. Some other high performance file systems are available.
These include XFS (from SGI [6]), which has a good track record on large HPC machines and is now available in an open source Linux version; Lustre [7], perhaps less mature but with a few years of use at some DOE HPC sites and available in both supported and open source versions; and the open source PVFS [8]. The bibliography gives some references on hiding the latency [9] and modeling the performance [10] of parallel I/O and high performance file systems. Some more general references on parallel I/O are Hubovsky's thesis [11] and Sloan's recent book on Linux clusters [12].

Acknowledgements

Tim Kelley and Vince Bannister acknowledge support from NSF grant DMS-04-4537 and ARO grant W911NF-06-1-0096. G. W. Howell and Qianyi Zhang acknowledge support from the NIH Molecular Libraries Roadmap for Medical Research, Grant 1 P20 HG003900-01. The cluster was partially supported by ARO grant W911NF-04-1-0276.

Bibliography

1. M. G. McDonald and A. W. Harbaugh, "A Modular Three-Dimensional Finite-Difference Groundwater Flow Model", U.S. Geological Survey Techniques of Water-Resources Investigations, Book 6, Chapter A1, Reston, VA, 1988.
2. MODFLOW and some related software, http://water.usgs.gov/nrp/gwsoftware/modflow.html
3. Joel Saltz, Anurag Acharya, Alan Sussman, Jeff Hollingsworth, and Michael B., "Tuning the I/O Performance of Earth Science Applications", NASA Goddard, http://www.cs.umd.edu/projects/hpsl/hpio/sio/ccsf_oct95.txt, 1995.
4. "An Introduction to GPFS", IBM white paper, http://www03.ibm.com/systems/clusters/software/whitepapers/gpfs_intro.pdf, 2006.
5. Frank Schmuck and Roger Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters", Proceedings of the Conference on File and Storage Technologies (FAST '02), January 2002, pp. 231-244.
6. Philip Traubman, "Scalability and Performance in Modern File Systems", SGI white paper, http://www.sgi.com/pdfs/2668.pdf
7. "Lustre: A Scalable, High-Performance File System", Cluster File Systems white paper, http://www.lustre.org/docs/whitepaper.pdf
8. Parallel Virtual File System, Version 2, User Guide, http://www.pvfs.org/pvfs2/pvfs2-guide.html, 2003.
9. Xiaosong Ma, "Flexible and Efficient I/O Optimization for Parallel Applications", NC State CSC seminar, spring 2003, http://moss.csc.ncsu.edu/~mueller/seminar/spring03/ma.ppt
10. Yijian Wang and David Kaeli, "Modeling and Acceleration of File I/O Dominated Parallel Workloads", Northeastern University presentation to Analogic Corporation, 2005, www.ece.neu.edu/students/yiwang/Analogic.ppt
11. Rainer Hubovsky, "Dealing with Massive Data: From Parallel I/O to Grid I/O", Thesis, University of Vienna, 2003, http://www.cs.dartmouth.edu/pario/hubovsky_dictionary.pdf
12. Joseph P. Sloan, High Performance Linux Clusters with OSCAR, Rocks, openMosix & MPI, O'Reilly Press, 2005.
Appendix -- Statistical Analysis

Data description: we had seven sets of 20 times each. These were the times to perform the same set of reads and writes under the following conditions:

s1: loaded NFS global file system, not using local scratch
s2: unloaded NFS global file system, not using local scratch
s3: loaded NFS global file system, using local scratch
s4: unloaded NFS global file system, using local scratch
s5: loaded GPFS global file system, not using local scratch
s6: unloaded GPFS global file system, not using local scratch
s7: 4x load on GPFS global file system, not using local scratch

1. Test the difference between NFS and GPFS

Only data set s7, which is for GPFS, carries the 4x load. To avoid this complication, the analysis for NFS vs. GPFS is based on data sets s1 to s6 and the following model.

Model (1):  Y_t = μ + β1 D1 + β2 D2 + β3 D3 + β12 D12 + β23 D23 + ε_t

where
Y (response variable): run time;
μ: mean time when D1 = D2 = D3 = D12 = D23 = 0;
D1: NFS/GPFS dummy variable, D1 = 1 for NFS and 0 for GPFS;
D2: loaded/unloaded dummy variable, D2 = 1 for loaded and 0 for unloaded;
D3: local scratch dummy variable, D3 = 1 for local scratch and 0 for non-local scratch;
D12: interaction effect between D1 and D2, equal to D1·D2;
D23: interaction effect between D2 and D3, equal to D2·D3;
ε_t: error, assumed to be white noise.

Note that the number of observations for NFS is 80 and the number of observations for GPFS is 40. The data are therefore unbalanced, and an ANOVA table alone is not appropriate for the NFS vs. GPFS comparison. The Tukey-Kramer test, which handles unbalanced comparisons, is used instead to test the difference between GPFS and NFS. The SAS output for the Tukey-Kramer means comparison on the variable D1 (reproduced at the end of this appendix) shows that the difference between NFS and GPFS is significant.

2. Test the difference between using local scratch and not using local scratch

Model (1) can again be used as a regression model to test whether the difference between using local scratch and not using local scratch is significant. The analysis is based on data sets s1 to s6. The number of observations using local scratch is 40 and the number not using local scratch is 80, so the Tukey-Kramer test is used again for this comparison. The SAS means comparison output (also at the end of this appendix) shows that the difference between them is significant.

3. Test the difference between loaded and unloaded file systems

From the ANOVA table, the p-value for load (D2) is 0.5344, which is not small. However, the p-value for the interaction effect (D12) in the same ANOVA table is low. Therefore, the analysis of the difference between loaded and unloaded should be extended to a multiple comparison. The elements for the multiple comparison are loaded NFS, loaded GPFS, unloaded NFS, and unloaded GPFS. The output from SAS shows:

No significant difference between unloaded GPFS and unloaded NFS
No significant difference between unloaded GPFS and loaded GPFS
No significant difference between unloaded NFS and loaded GPFS
Significant difference between loaded NFS and unloaded GPFS
Significant difference between loaded NFS and loaded GPFS
Significant difference between loaded NFS and unloaded NFS

Therefore, the above four elements can be viewed as two groups: Group 1 is loaded NFS, and Group 2 is unloaded GPFS, loaded GPFS, and unloaded NFS. There is a significant difference between the groups.
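For reference (the original appendix does not reproduce the formula), the Tukey-Kramer procedure behind these comparisons declares the means of groups i and j different when

\[
\lvert \bar{y}_i - \bar{y}_j \rvert \;>\; q_{\alpha,k,\nu}\,
\sqrt{\frac{\mathrm{MSE}}{2}\left(\frac{1}{n_i}+\frac{1}{n_j}\right)},
\]

where q_{\alpha,k,\nu} is the studentized range quantile for k groups and \nu error degrees of freedom, and MSE is the error mean square from the ANOVA table below; the (1/n_i + 1/n_j) factor is what accommodates the unequal group sizes noted above.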
The SAS output for the ANOVA table and the means comparisons is reproduced below.

4. Test for the effect of load in GPFS

From the data description, there are three levels of load in GPFS. To test the effect of load in GPFS, the analysis is based on data sets s5 to s7 and uses the following regression model.

Model (2):  Y_t = μ0 + γ2 D2 + γ23 D23 + ε_t

where
Y (response variable): run time;
μ0: mean time when D2 = D23 = 0;
D2: classification variable for load, with D2 = 0 for unloaded, 1 for one copy of bonnie, and 2 for four copies of bonnie;
D3: dummy variable, D3 = 1 for local scratch and 0 for non-local scratch;
D23: interaction effect between D2 and D3, equal to D2·D3;
ε_t: error, assumed to be white noise.

The multiple comparison results from SAS show:

No significant difference between unloaded and one load
Significant difference between one load and four loads
Significant difference between unloaded and four loads

SAS output, least squares means for the load levels in GPFS (test 4):

The GLM Procedure -- Least Squares Means
load    time LSMEAN    LSMEAN Number
0       23.6800000     1
1       23.6000000     2
2       27.6000000     3

SAS output, ANOVA table for Model (1), data sets s1 to s6 (tests 1-3):

The GLM Procedure, Dependent Variable: time
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              5       3889.27942      777.85588       6.37     <.0001
Error            114      13924.78650      122.14725
Corrected Total  119      17814.06592

SAS output, Tukey-Kramer means comparison on the local scratch variable (test 2):

The GLM Procedure -- Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer
local    time LSMEAN    Pr > |t| for H0: LSMean1 = LSMean2
0        27.9962500     0.0055
1        21.0087500

SAS output, Tukey-Kramer means comparison on the NFS/GPFS variable (test 1):

The GLM Procedure -- Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer
NFS      time LSMEAN    Pr > |t| for H0: LSMean1 = LSMean2
0        20.1462500     0.0006
1        28.8587500

5. Test the difference between s3 and s5

A two-sample t-test is used to test whether the mean of s3 is different from the mean of s5. The SAS output for the test is:

The TTEST Procedure -- Statistics (variable: time)
D1           N    Lower CL Mean    Mean      Upper CL Mean    Lower CL Std Dev    Std Dev    Upper CL Std Dev    Std Err
0           20       23.46         23.6         23.74             0.2275          0.2991         0.4369         0.0669
1           20       23.634        26.945       30.256            5.3795          7.0738        10.332          1.5817
Diff (1-2)           -6.55         -3.345       -0.14             4.0915          5.0064         6.4521         1.5832

The means of the two samples are significantly different from each other.
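As a quick reader's check (this sketch is ours and is not part of the original SAS analysis), Welch's two-sample t statistic can be recomputed directly from the 20 times listed earlier for s3 (loaded NFS, intermediate I/O to local disks) and s5 (loaded GPFS):

/* Recompute the s3 vs. s5 comparison from the raw times listed in the
 * "Impact of File System Load" section, using Welch's unequal-variance t. */
#include <math.h>
#include <stdio.h>

static void stats(const double *x, int n, double *mean, double *var)
{
    double s = 0.0, ss = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    *mean = s / n;
    for (int i = 0; i < n; i++)
        ss += (x[i] - *mean) * (x[i] - *mean);
    *var = ss / (n - 1);
}

int main(void)
{
    /* s3: loaded NFS, intermediate I/O to local disks */
    double s3[20] = { 23.9, 23.5, 23.9, 24.1, 23.7, 23.5, 23.8, 23.7, 24.1,
                      44.4, 48.4, 23.5, 32.9, 28.4, 23.8, 24.1, 23.8, 27.6,
                      24.0, 23.8 };
    /* s5: loaded GPFS */
    double s5[20] = { 24.4, 23.9, 23.5, 23.8, 23.4, 23.4, 24.0, 23.4, 23.4,
                      24.0, 23.3, 23.6, 23.8, 23.2, 23.5, 23.5, 23.4, 23.4,
                      23.7, 23.4 };
    double m3, v3, m5, v5;
    stats(s3, 20, &m3, &v3);
    stats(s5, 20, &m5, &v5);
    double se = sqrt(v3 / 20 + v5 / 20);   /* std. error of the difference */
    double t  = (m3 - m5) / se;            /* Welch's t (unequal variances) */
    printf("mean s3 = %.3f, mean s5 = %.3f, se = %.4f, t = %.2f\n",
           m3, m5, se, t);
    return 0;
}

Compiled with -lm, this reproduces the means (26.945 and 23.600) and the standard error of the difference (1.5832) shown in the TTEST output above, giving a t of about 2.1, consistent with the reported significance.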