Exploiting Local Data in Parallel Array I/O on a Practical Network of Workstations

Y. Cho, M. Winslett, M. Subramaniam*, Y. Chen, S. Kuo, K. E. Seamons†
Computer Science Department, University of Illinois, Urbana, Illinois 61801
{ycho,winslett}@cs.uiuc.edu

* Presently affiliated with Oracle Corporation.
† Presently affiliated with Transarc Corporation.

Abstract

A cost-effective way to run a parallel application is to use existing workstations connected by a local area network such as Ethernet or FDDI. In this paper, we present an approach for parallel I/O of multidimensional arrays on small networks of workstations with a shared-media interconnect, using the Panda I/O library. In such an environment, the message passing throughput per node is lower than the throughput obtainable from a fast disk, and it is not easy for users to determine the configuration that will yield the best I/O performance. We introduce an I/O strategy that exploits local data to reduce the amount of data that must be shipped across the network, present experimental results, analyze the results using an analytical performance model, and use the model to predict the best choice of I/O parameters. Our experiments show that the new strategy yields a factor of 1.2-2.1 speedup in response time compared to the Panda version originally developed for the IBM SP2, depending on the array sizes, distributions, and compute and I/O node meshes. Further, the performance model predicts the results within a 13% margin of error.

1 Introduction

Workstation clusters are emerging as an economical way to run parallel applications. Despite their attractive features, these clusters still have some limits when compared to massively parallel processors. First, usually fewer nodes are available, limiting the scale of experiments. Second, there are no widely available implementations of portable libraries like PVM or MPI that have been highly tuned for clusters. Third, the popular FDDI and Ethernet are slower than the interconnects of massively parallel processors [Keeton95]. For example, with a single sender and receiver on an FDDI HP workstation cluster, we obtained 3.33 MB/sec on average out of the peak FDDI bandwidth of 12.5 MB/sec using MPICH 1.0.11 [MPICH]1. As the number of senders and receivers increases, bandwidth per node degrades although the aggregate bandwidth increases. Thus interconnect speed, not disk speed, will often be the limiting factor for parallel I/O performance.

PIOUS [Moyer94], VIP-FS [Harry95] and VIPIOS [Brezany96] provide parallel I/O on workstation clusters. However, each of these lacks one or more of the features desired for parallel scientific applications running on a network of workstations: collective I/O, special consideration for low message passing bandwidth, better utilization of limited resources, and minimized data transfer over the network. Our own Panda parallel I/O library2 provided collective I/O but lacked the other desired features. Thus we set out to provide them, by addressing three important issues: sharing the limited number of nodes between I/O and computation; making an optimal choice of the exact I/O nodes, to save network bandwidth and to reduce concurrent access to the network; and choosing the other parameters relevant to I/O, such as the logical topology of compute and I/O nodes (hereafter called the compute mesh and I/O mesh respectively) and the array distribution in memory and on disk. To help in choosing parameter values and to ease performance analysis, we developed an accurate analytical performance model.
In the remainder of the paper, we begin by briefly introducing Panda in section 2. In section 3, we discuss "part-time" I/O, present an algorithm that optimizes I/O node selection, and show the performance results. In section 4 we present a simple abstract performance model, analyze the results, and show how our performance model is used to select the configuration with the shortest response time for our running examples. Related work is described in section 5, and we conclude the paper in section 6.

1 Throughput for passing smaller messages on similar HP platforms connected by an isolated FDDI gets up to 11 MB/sec (http://onet1.external.hp.com/netperf/NetperfPage.html). The difference can be attributed to MPI overhead and a non-isolated network.
2 http://drl.cs.uiuc.edu/panda/.

2 Panda Array I/O Library

Panda is a parallel I/O library for multidimensional arrays. Its original design targeted SPMD applications running on distributed memory architectures, with arrays distributed across multiple processors and with the processes fairly closely synchronized at I/O time. Panda supports HPF-style BLOCK, CYCLIC(K), and * data distributions across the multiple compute nodes on which Panda clients are running. Panda's approach to high-performance I/O in this environment is called server-directed I/O [Seamons95].

Figure 1: Different array data distributions in memory and on disk provided by Panda (clients are compute nodes 0-3; servers are I/O nodes 0 and 1).

Figure 1 shows a 2D array distributed (BLOCK, BLOCK) across 4 compute nodes arranged in a 2x2 mesh. Each piece of the distributed array is called a compute chunk, and each compute chunk resides in the memory of one compute node. The I/O nodes are also arranged in a mesh, and the data can be distributed across them using the HPF-style directives BLOCK and *. CYCLIC(K) is only supported on disk when the distributions in memory and on disk are identical, called natural chunking [Seamons94] in Panda. (We have not found any scientists yet who need a CYCLIC distribution on disk.) The array distribution on disk can also be radically different from that in memory. For instance, the array in Figure 1 has a (BLOCK, *) distribution on disk. Using this distribution, the resulting data files can be concatenated together to form a single array in row-major or column-major order, which is particularly useful if the array is to be sent to a workstation for postprocessing with a visualization tool, as is often the case.

Each I/O chunk resulting from the distribution chosen for disk will be buffered and sent to (or read from) local disk at one I/O node, and that I/O node is in charge of reading, writing, gathering, and scattering the I/O chunk. For example, in Figure 1, during a Panda write operation I/O node 0 gathers compute chunks from compute nodes 0 and 1, reorganizes them into a single I/O chunk, and writes it to disk. In parallel, I/O node 1 gathers, reorganizes, and writes its own I/O chunk. For a read operation, the reverse process is used.

During I/O, Panda divides large I/O chunks into smaller pieces, called subchunks, in order to obtain better file system performance and to keep I/O node buffer space requirements low. For a write operation, a Panda server will repeatedly gather and write subchunks, one by one. More precisely, for each I/O subchunk, the server determines which clients own a piece of the subchunk ("overlapping clients") and the indices of the overlapping subarray at each client.
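To make the overlap computation concrete, here is a minimal sketch (not Panda source code) of how the overlapping clients of one I/O subchunk could be found for a 2D array with a (BLOCK, BLOCK) distribution in memory, by intersecting index ranges; the function names and the half-open range convention are our own assumptions.

```python
# Illustrative sketch: find which compute nodes ("overlapping clients") own
# part of an I/O subchunk, assuming a (BLOCK, BLOCK) in-memory distribution
# on a rows x cols compute mesh.

def block_bounds(extent, nblocks, i):
    """Half-open index range [lo, hi) of the i-th BLOCK piece of `extent`."""
    size = -(-extent // nblocks)          # ceiling division
    lo = i * size
    return lo, min(lo + size, extent)

def overlapping_clients(array_shape, compute_mesh, subchunk):
    """Return {client_rank: (row_range, col_range)} for the clients whose
    compute chunks intersect `subchunk`, given as ((r0, r1), (c0, c1))."""
    (rows, cols), ((r0, r1), (c0, c1)) = compute_mesh, subchunk
    overlaps = {}
    for pr in range(rows):
        br0, br1 = block_bounds(array_shape[0], rows, pr)
        for pc in range(cols):
            bc0, bc1 = block_bounds(array_shape[1], cols, pc)
            rr = (max(r0, br0), min(r1, br1))   # intersect row ranges
            cc = (max(c0, bc0), min(c1, bc1))   # intersect column ranges
            if rr[0] < rr[1] and cc[0] < cc[1]:
                overlaps[pr * cols + pc] = (rr, cc)
    return overlaps

# The 8x8 example of Figure 1: I/O node 0 stores rows 0-3 under (BLOCK, *),
# so its subchunk overlaps compute nodes 0 and 1.
print(overlapping_clients((8, 8), (2, 2), ((0, 4), (0, 8))))
# -> {0: ((0, 4), (0, 4)), 1: ((0, 4), (4, 8))}
```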
The server sends each overlapping client a message (the "schema message") requesting the relevant subarray. Clients respond to schema messages by sending the corresponding array portions in a single message (whether or not the data was contiguous at the client). With a fine-grained data distribution, a client may have several overlapping subarrays for a single subchunk. To keep message traffic low, the server requests all overlapping subarrays from a client in a single message, and receives them in a single response message.

The experiments described in this paper extend Panda 2.1 [Subraman96], which uses nonblocking communication between clients and servers to avoid any possible deadlock in part-time I/O (discussed in the following section) and to overlap communication and computation.

3 Part-time I/O with Optimal I/O Nodes

A network of workstations usually has only a limited number of nodes, and scientists often want to run their applications on as many of them as possible [Kuo96]. It would be wasteful to dedicate nodes to I/O, leaving them idle during computation. Panda 2.1 supports an alternative I/O strategy, part-time I/O, in which there are no dedicated I/O nodes. Instead, some of the compute nodes become I/O nodes at I/O time, and return to computation after finishing the I/O operation [Subraman96]. When a client reaches a collective I/O call, it may temporarily become a server. In the rest of the paper we use the terms "I/O nodes" and "part-time I/O nodes" interchangeably.

As implemented in Panda 2.1, part-time I/O chooses the first k of the compute nodes as I/O nodes, regardless of array distribution or compute node mesh. A similar strategy has been taken in other libraries that provide part-time I/O [Nieplocha96]. This provides acceptable performance on a high-speed, switch-based interconnect whose processors are homogeneous with respect to I/O ability, as on the SP2 [Subraman96]. On a shared-media interconnect like FDDI or Ethernet, however, we found that performance is generally unsatisfactory and tends to vary widely with the exact choice of compute and I/O node meshes. To see the source of the problem, consider the example in Figure 1. With the naive selection of compute nodes 0 and 1 as I/O nodes, compute node 1 needs to send its local chunk to compute node 0 and gather a subchunk from compute nodes 2 and 3. This incurs extra message passing that is unnecessary if compute node 2 doubles as an I/O node instead of compute node 1.

3.1 I/O Node Selection Strategy

In an environment where the interconnect will clearly be the bottleneck for I/O, as is the case for the HP workstation cluster with an FDDI interconnect, the single most important optimization we can make is to minimize the amount of remote data transfer. In this section we show how to choose I/O nodes in a manner that will minimize the number of array elements that must be shipped across the network during I/O. More precisely, suppose we are given a target number m of part-time I/O nodes, the current distribution of data across processor memories, and a desired distribution of data across I/O nodes.
The data distribution in memory need not be an HPF-style distribution; for example, some nodes may have significantly more data than others, as might happen during an adaptive mesh-refinement computation or when some nodes are significantly more powerful than others, as might be the case in a workstation cluster.3 Similarly, the distribution on disk need not be HPF-style; we need only know which of the m I/O nodes is to receive each byte of data. Under these assumptions, we show how to choose the I/O nodes from among the set of n ≥ m compute nodes so that remote data transfer is minimized.

3 If the data were very unevenly distributed across compute nodes and/or I/O nodes, or if different I/O nodes had different capabilities, then the optimal choice of I/O nodes might need to consider not only minimal remote data transfer but also other costs at each node, to avoid potential load imbalance that could adversely affect performance. To be accurate in this case, the LogGP analytical model used in section 4 would need to be extended to model these costs (disk access time, local copying, etc.).

We begin by forming an m x n array called the I/O matrix (M), where each row represents one of the I/O nodes and each column represents one of the n compute nodes. The (i, j) entry of the I/O matrix, M(i, j), is the total number of array elements that the i-th I/O node will have to gather from the j-th compute node, which can be computed from the array size, the in-memory distribution, and the target disk distribution. In Panda, every node involved in an array I/O operation has access to array size and distribution information, so M can be generated at run time. For example, if the memory and disk distributions are the HPF-style distributions shown in Figure 1, the I/O matrix consists of 2 rows and 4 columns. For an 8 x 8 array of 1-byte elements, distributed as in Figure 1, M is:

    M = [ 16  16   0   0 ]
        [  0   0  16  16 ]

Given the I/O matrix, the goal of choosing the m I/O nodes that will minimize remote data transfer can be formalized as the problem of choosing m matrix entries M(i_1, j_1), ..., M(i_m, j_m) such that no two entries lie in the same column or row (i_k ≠ i_l and j_k ≠ j_l for 1 ≤ k < l ≤ m) and the sum M(i_1, j_1) + ... + M(i_m, j_m) is maximal. We can simplify the problem by traversing the I/O matrix and selecting the m largest entries from each row. We call the result the simplified I/O matrix (M'). The proposition below shows that only the simplified I/O matrix need be considered when searching for a solution.

Proposition 1. An assignment of part-time I/O nodes that minimizes remote data transfer can be found among the entries in the simplified I/O matrix.

Proof 1. Suppose we have an assignment of part-time I/O nodes that minimizes remote data transfer and includes an entry M(i, j) that is not part of the simplified I/O matrix (the 'old solution'). Then the simplified matrix contains m entries M(i, j_k) ≥ M(i, j), 1 ≤ k ≤ m, none of which have been chosen as part of the old solution. Thus if we can substitute one of the M(i, j_k), call it M(i, l), for M(i, j), creating a new solution, the amount of remote data transfer will be at least as low as before. However, the substitution can only be performed if we can find a choice of M(i, l) such that no other entry in the new solution is from the i-th row or from the l-th column. Since the old solution can contain only one entry from row i, we can safely substitute M(i, l) for M(i, j) without introducing a row conflict.
Suppose, however, that for each of the m possible choices of l we find that the new solution contains two entries from column l. Then the old solution has m entries from columns other than j, which is impossible. We conclude that a solution can be found by considering only the simplified I/O matrix.

A matching in a graph is a maximal subset of the edges of the graph such that no two edges share the same endpoint. Given a simplified I/O matrix M', the problem of assigning I/O nodes is equivalent to a weighted version of the matching problem in a bipartite graph G. Consider every row (I/O node) and column (compute node) of M' to be a vertex of G, and each entry M'(i, j) to be the weight of the edge connecting vertices i and j. We wish to find the matching of G with the largest possible sum of weights. The optimal solution can be obtained by the Hungarian Method [Papadimitriou82] in O(m^3) time, where m is the number of part-time I/O nodes. Figure 2 shows the simplified I/O matrix M' and the corresponding bipartite graph G for M; for the example above, every entry of M' is 16.

Figure 2: Simplified I/O matrix M' and the corresponding bipartite graph G (I/O nodes 0 and 1 connected to compute nodes c0-c3). The original indices for the entries in M' are separately recorded. The edges of G in thicker lines are the matching selected to minimize the amount of remote data transfer.

Proposition 2. An assignment of m part-time I/O nodes that minimizes remote data transfer can be determined from the simplified I/O matrix in O(m^3) time.

Proof 2. Follows directly from the preceding discussion.

The preceding discussion assumes that the goal is to optimize the choice of I/O nodes for a single write request. With small modifications, the same approach can be used to optimize the choice of I/O nodes for a sequence of read and write requests, including any initial bulk data loading. In this case, the I/O matrix must contain the total number of bytes transferred between each client and server, which can be obtained from a description of the anticipated data distributions, meshes, and numbers and types of read and write requests. As in the case of a single I/O request, uneven data distributions or unequal node capabilities may mean that the optimization decision should consider additional factors beyond minimal data transfer. In addition, if optimal I/O nodes are to be used for persistent arrays, the I/O nodes need to maintain directory information regarding the arrays stored, as Panda does in its "schema" files.
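As an illustration of the selection strategy, the sketch below builds the I/O matrix for the Figure 1 example and then chooses the I/O nodes by solving the resulting assignment problem. It is a minimal sketch rather than Panda's implementation: the matrix is computed directly for (BLOCK, BLOCK) in memory and (BLOCK, *) on disk, and SciPy's linear_sum_assignment, a Hungarian-style solver, stands in for the O(m^3) matching step; for this small example it operates on M directly rather than on the simplified matrix M'.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def io_matrix(array_shape, compute_mesh, n_io):
    """M[i, j] = elements that I/O node i must gather from compute node j,
    assuming (BLOCK, BLOCK) in memory on a rows x cols mesh and (BLOCK, *)
    on disk across n_io I/O nodes."""
    R, C = array_shape
    rows, cols = compute_mesh
    M = np.zeros((n_io, rows * cols), dtype=np.int64)
    row_block = -(-R // rows)      # rows per compute chunk (ceiling division)
    col_block = -(-C // cols)      # columns per compute chunk
    io_block = -(-R // n_io)       # rows per I/O chunk under (BLOCK, *)
    for pr in range(rows):
        r0, r1 = pr * row_block, min((pr + 1) * row_block, R)
        for pc in range(cols):
            ncols = min((pc + 1) * col_block, C) - pc * col_block
            for i in range(n_io):
                lo, hi = i * io_block, min((i + 1) * io_block, R)
                shared_rows = max(0, min(r1, hi) - max(r0, lo))
                M[i, pr * cols + pc] += shared_rows * ncols
    return M

M = io_matrix((8, 8), (2, 2), n_io=2)
print(M)                                    # [[16 16  0  0]
                                            #  [ 0  0 16 16]]

# Pick one distinct compute node per I/O node so that the total amount of
# data already local to the chosen nodes is as large as possible (an
# assignment problem, equivalent to the weighted matching described above).
io_rows, clients = linear_sum_assignment(M, maximize=True)
print(dict(zip(io_rows.tolist(), clients.tolist())))   # e.g. {0: 0, 1: 2}
```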
3.2 Performance Results

The experiments were conducted on an 8-node HP 9000/735 workstation cluster running HP-UX 9.07 on each node, connected by FDDI. Each node has 144 MB of main memory and two local disks, and each I/O node used a 4 GB local disk. At the time of our experiments, the disks were 45-90% full depending on the node. We measured the file system throughput for write operations by averaging 8 trials of writing a 128 MB file using 1 MB application write requests4: 5.96 MB/sec for the least occupied disk and 5.63 MB/sec for the fullest disk, with standard deviations of 0.041 and 0.153 respectively. For the message passing layer, we used MPICH 1.0.11 and obtained an average message passing bandwidth per node of 3.2, 2.9 or 2.3 MB/sec when there are 1, 2 or 4 pairs of senders and receivers, respectively, for Panda's common message sizes (32-512 KB). The cluster is not fully isolated from other networks, so we ran our experiments when no other user job was executing on the cluster. All the experimental results shown are the average of 3 or more trials, and error bars show a 95% confidence interval for the average.

4 Panda I/O nodes also write in 1 MB units, the subchunk size used in all experiments.

We used all 8 nodes as compute nodes in our experiments, as that configuration is probably most representative of scientists' needs. The in-memory distribution was (BLOCK, BLOCK), and we tested performance using the two obvious compute node meshes for a 2D array on 8 nodes, 2x4 and 4x2. We chose a 2D instead of a 3D array so that we could examine the performance impact of changing the compute node mesh (there is only one obvious mesh, 2x2x2, for a 3D array distributed (BLOCK, BLOCK, BLOCK)). We used 2, 4, or 8 part-time I/O nodes, while increasing the array size from 4 MB (1024x1024) to 16 MB (2048x2048) and 64 MB (4096x4096). For the disk distribution, we used either (BLOCK, *) or (*, BLOCK) to show the effect of a radically different distribution. We present results for writes; reads are similar.

Figures 3, 4, 5 and 6 compare the time to write an array using an optimal choice of I/O nodes or using the first k nodes as I/O nodes ("fixed" I/O nodes). A group of 6 bars is shown for each number of I/O nodes. Each pair of bars shows the response time to write an array of the given size using fixed I/O nodes and optimal I/O nodes, respectively. The use of optimal I/O nodes reduces array output time by at least 19% across all combinations of array sizes and meshes, except for the case of 8 I/O nodes, for which the fixed and optimal I/O nodes are identical. In Figures 3 and 4, different configurations perform very differently. For instance, for a 4x2 compute node mesh (Figure 4) with 4 I/O nodes, the change to optimal I/O nodes halves response time. Moving from 2 to 4 I/O nodes gives a superlinear speedup when optimal I/O nodes are used, but only in the 4x2 mesh; the 2x4 mesh sees only a factor of 1.5 speedup. The reasons for these differences will be explored in the next section.

Figure 3: Panda response time for writing an array using fixed or optimal I/O nodes. Memory mesh: 2x4. Memory distribution: (BLOCK, BLOCK). Disk mesh: n x 1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

Figure 4: Panda response time for writing an array using fixed or optimal I/O nodes. Memory mesh: 4x2. Memory distribution: (BLOCK, BLOCK). Disk mesh: n x 1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

Figure 5: Panda response time for writing an array using fixed or optimal I/O nodes. Memory mesh: 2x4. Memory distribution: (BLOCK, BLOCK). Disk mesh: 1 x n, where n is the number of I/O nodes. Disk distribution: (*, BLOCK).
Figure 6: Panda response time for writing an array using fixed or optimal I/O nodes. Memory mesh: 4x2. Memory distribution: (BLOCK, BLOCK). Disk mesh: 1 x n, where n is the number of I/O nodes. Disk distribution: (*, BLOCK).

Figures 5 and 6 show the results obtained with a (*, BLOCK) distribution on disk. In Figures 3 and 4, the use of additional I/O nodes always improved response time. However, for the 2x4 compute node mesh in Figure 5, response time increases when we move from 4 to 8 I/O nodes, unless the I/O node selection is optimized. Similarly, fixed part-time I/O slows down when we move from 2 to 4 I/O nodes for a 4x2 compute node mesh in Figure 6.

The performance results in Figures 7 and 8 show the Panda response time to write out data with a fine-grained memory distribution, using optimal I/O nodes. The same arrays are used as in the earlier figures, but with a (BLOCK, CYCLIC(K)) memory distribution, for K in {32, 64, 128}. For a given number of I/O nodes and array size, each group of three bars shows the performance for the different choices of K. The performance is very close to that of the (BLOCK, BLOCK) distribution with optimal I/O nodes shown in Figures 3 and 4, with a small increase in response time. The increase is due to the extra overhead of computing the overlaps for CYCLIC distributions, and to the smaller granularity of contiguous data to be sent to I/O nodes, which makes remote data transfer more expensive. The cyclic block size K does not impact performance significantly in Figures 7 and 8. Changing K does not affect the amount of local data at each I/O node, which is the same as for the (BLOCK, BLOCK) distribution and is a primary determinant of performance. Further, each I/O node has the same number of overlapping clients for each chunk as it did for the (BLOCK, BLOCK) distribution, so message-passing costs are much the same.

We repeated the same experiments while distributing the in-memory array (CYCLIC(K), BLOCK); the results are shown in Figures 9 and 10. Compared to Figures 7 and 8, there is a noticeable performance degradation. For the previous configurations, at each I/O node either 1/2 (4x2 mesh with 4 or 8 I/O nodes) or 1/4 (all other cases) of the data was local. Now the amount of local data has dropped to 1/8, with four exceptions in the 4x2 mesh (1/2 local: 4 MB, K = 128, 8 I/O nodes; 1/4 local: 16 MB, K = 128, 8 I/O nodes; 4 MB, K = 64, 8 I/O nodes; 4 MB, K = 128, 4 I/O nodes). An examination of the latter configurations in Figure 10 shows how increasing the amount of local data, by increasing K or decreasing the array size, significantly impacts performance. Moreover, when only 1/8 of the data is local, each I/O node must communicate with every other node to gather each chunk, aggravating network contention and increasing wait times. Distributing the in-memory array (CYCLIC(K), CYCLIC(K)) gave performance close to that of (CYCLIC(K), BLOCK).

We reran the same tests on the same cluster, but using a 10 Mb/s Ethernet, and observed similar general trends. Panda provided lower I/O throughput than with FDDI, as expected, but also suffered severely when there were many overlapping clients.
For instance, with a 2x4 compute node mesh, a (BLOCK, BLOCK) memory distribution, and a (BLOCK, *) data distribution on disk, each I/O node has 4 overlapping clients, and doubling the number of I/O nodes reduces response time by less than 10%, even when using optimal I/O nodes. The Ethernet uses random exponential backoff on packet collisions, which means that the Ethernet degrades less gracefully than FDDI in the presence of high network contention in our experiments.

Figure 7: Panda response time for writing a finely-distributed array using optimal I/O nodes. Memory mesh: 2x4. Memory distribution: (BLOCK, CYCLIC(K)). Disk mesh: n x 1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

Figure 8: Panda response time for writing a finely-distributed array using optimal I/O nodes. Memory mesh: 4x2. Memory distribution: (BLOCK, CYCLIC(K)). Disk mesh: n x 1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

Figure 9: Panda response time for writing a finely-distributed array using optimal I/O nodes. Memory mesh: 2x4. Memory distribution: (CYCLIC(K), BLOCK). Disk mesh: n x 1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

Figure 10: Panda response time for writing a finely-distributed array using optimal I/O nodes. Memory mesh: 4x2. Memory distribution: (CYCLIC(K), BLOCK). Disk mesh: n x 1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

The experiments described in this section show that even with optimal I/O nodes, performance is very dependent not only on the amount of local data transfer, but also on the compute node mesh chosen, the array distribution on disk, and the number of I/O nodes. To obtain acceptable performance, the user needs help from Panda in predicting I/O performance. Also, the use of local data makes it hard to tell how well Panda utilizes the available network bandwidth. In the next section, we present a model for Panda's performance with FDDI, to be used for these purposes.

4 Performance Model and Analysis

In previous work, we developed a performance model for Panda on the IBM SP2, where message passing contention is not an issue [Chen96]. However, in a workstation cluster whose disks can outrun the interconnect, message passing contention and throughput are the key elements to model for performance prediction, so the old model is of no assistance. As the interpretation of the initial performance results obtained from the cluster was not obvious, we developed a new performance model from a priori principles (i.e., not depending on the experimental results in any way), and performed the experiments presented in section 3 to validate the model.
In this section we describe our model for gathering and scattering a subchunk in Panda using part-time I/O on FDDI, show that it gives very accurate performance predictions, and show how we use it in performance analysis.

4.1 The LogP and LogGP Models

We build on the LogP model [Culler93] and its extension LogGP [Alexandrov95], due to their simplicity and their ability to model the overlapping of communication and computation. LogP and LogGP model a parallel application using the following parameters:

L (latency): the delay incurred in communicating a message containing one word from source to destination.
o (overhead): the length of time that a processor is engaged in the transmission or reception of each message; processors cannot perform any other operation during this time.
g (gap): the minimum time interval between consecutive transmissions or receptions of a short message at a process.
P: the number of processes.
G (gap per byte): the time per byte to transmit or receive a long message.

Thus, if a sender transmits a k-byte message at time 0, it will be received by the receiver at time o + G(k - 1) + L + o. If two consecutive messages of size k are to be sent out, the first byte of the second message enters the network at o + G(k - 1) + max(o, g).

Measurements of the parameter values on the FDDI HP workstation cluster were done at the MPI level, and the results are shown in Table 1. The overhead o was measured as the time required for a process to post a non-blocking MPI send to another process. For the value of g, a process sends another process 500 one-word messages using MPI's non-blocking send operation, and the receiver sends an acknowledgement once it has received all the messages; dividing the total time by the number of messages gives g. G is measured in the same way but using larger messages. G is measured for two different message sizes: G1 for Panda schema messages (68 bytes for a 2D array) and G2 for overlapping subchunk data messages (usually hundreds of KB). Measurement of L is discussed in the next section.

Table 1: LogGP parameter values. G1 is the gap per byte for a medium-size message and G2 is the gap per byte for a large message.

    o = 235.4 µs    g = 300 µs    G1 = 4.5 µs    G2 = 0.30 µs

4.2 Abstract Performance Model and Validation

The FDDI network used for our experiments is a shared-media network, with arbitration done by a token moving in a fixed circle around the network. A node has to grab the token before it sends out any message, so our latency parameter must include the time spent waiting for the token when there are several waiting senders. If l is the raw network latency when the network is contention free and w is the raw network bandwidth, the adjusted latency L(s, n) when there are s senders each trying to send an n-byte message can be formulated as follows: L(1, n) = l; L(2, n) = (l + n/w) + l; ...; L(s, n) = (s - 1)(l + n/w) + l. In our performance model, we use this maximum latency L(s, n) instead of the average latency, because Panda performance is determined by the slowest node. The raw network bandwidth w is 100 Mb/s for FDDI, and the raw network latency l is taken from [Banks93] as 10 µs.

With the adjusted latency, the time for s senders to each send a k-byte message is T1 = o + L(s, k) + G(k - 1). For s processes each to send n k-byte messages, it takes time Tn = o + L(s, nk) + nG(k - 1) + (n - 1) max(g, o).
In a more general form, the time for a process to send n1 messages and receive n2 messages of size k, when there are s processes including itself, each of which sends at most n3 k-byte messages, is

    T_n1,n2 = o + L(s, n3·k) + n1·G(k - 1) + (n1 - 1)·max(g, o) + n2·G(k - 1) + n2·max(g, o)    (1)

Table 2 shows the parameters necessary to model Panda's subchunk gather operations when part-time I/O is used. Scatter operations are the opposite of gathers and can be modeled using the same parameters. For a gather operation, the I/O nodes request remote chunks by sending schema messages to the appropriate remote overlapping clients, then copy a chunk from the local client if there is a local overlap, and finally receive the requested chunk data from the corresponding compute nodes. The model for gathering a subchunk assumes that each I/O node has the same number of overlapping clients and that overlapping subchunks are of roughly equal size. When this is not the case, the model needs to be refined by including additional parameters.

Table 2: Panda parameters necessary to model a gather/scatter operation when part-time I/O is used.

    I           Number of I/O nodes
    C           Number of nodes which send chunk data to a remote I/O node concurrently
    N           Size of an overlapping subchunk
    Schema_len  Size of a schema message
    I_schema    Number of I/O nodes which actually make data requests to remote compute nodes
    Req_proc    Maximum number of data requests that a compute node processes
    Req_send    Maximum number of remote overlapping clients among the I/O nodes
    Req_rec     Maximum number of schema messages received by an I/O node
    Local       1 if all I/O nodes currently have a local overlap, otherwise 0

The total time that Panda spends gathering subchunks for an I/O request for a single array is

    T_total = (array size / subchunk size) · (T_subchunk / I)    (2)

The total time to gather a subchunk at an I/O node is the sum of the time to send schema messages and the time to receive chunk data from compute nodes:

    T_subchunk = T_schema + T_data    (3)

T_schema is the time for I_schema processes to send Req_send messages of size Schema_len, plus the time to receive Req_rec schema messages, and can be computed using Equation 1:

    T_schema = o + L_schema + (Req_send + Req_rec)·G1·(Schema_len - 1) + (Req_send + Req_rec - 1)·max(o, g)    (4)

T_data is the time for all part-time I/O nodes to send Req_rec and receive Req_send messages of size N each, when there are a total of C processes each trying to send Req_proc messages of size N, plus the time to do a local copy if the I/O nodes have a local overlapping subchunk:

    T_data = o + L_data + (Req_rec + Req_send)·G2·(N - 1) + (Req_rec + Req_send - 1)·max(o, g) + Local·Memcpy(N)    (5)

L_schema and L_data can be computed using the adjusted latency described earlier. The values of the parameters in Table 2 are easily determined given the array distributions and the meshes for compute and I/O nodes. Parameter values for the different configurations are presented separately in Table 3.

Table 4 shows the actual and predicted Panda response times to write an array under different compute and I/O node meshes. Arrays are distributed (BLOCK, BLOCK) in memory and (BLOCK, *) or (*, BLOCK) on disk. Across all different array sizes, compute node meshes, and I/O node meshes, the predicted time is accurate within a 13% margin of error.
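As a check on how the pieces fit together, the sketch below evaluates the model numerically in Python. It is a minimal sketch under stated assumptions: the LogGP constants come from Table 1, the Panda parameters in the example call are illustrative placeholders rather than Table 3 values, the arguments to the adjusted latency follow the pattern of Equation 1 (the paper does not spell out L_schema and L_data explicitly), and Memcpy(N) is replaced by an assumed flat memory-copy bandwidth.

```python
# Numerical sketch of Equations 2-5 with the measured LogGP constants.
o = 235.4e-6       # send/receive overhead, seconds
g = 300e-6         # gap between short messages, seconds
G1 = 4.5e-6        # gap per byte, schema-sized messages
G2 = 0.30e-6       # gap per byte, large data messages
l_raw = 10e-6      # raw FDDI latency
w_raw = 100e6 / 8  # raw FDDI bandwidth, bytes/sec

def adjusted_latency(s, n):
    """Worst-case token-wait latency for s concurrent senders of n bytes."""
    return (s - 1) * (l_raw + n / w_raw) + l_raw

def t_subchunk(C, N, schema_len, I_schema, req_send, req_rec, req_proc,
               local, memcpy_bw=80e6):
    # Equation 4: schema messages sent to remote clients plus those received.
    t_schema = (o + adjusted_latency(I_schema, req_send * schema_len)
                + (req_send + req_rec) * G1 * (schema_len - 1)
                + (req_send + req_rec - 1) * max(o, g))
    # Equation 5: data messages exchanged, plus an optional local copy.
    t_data = (o + adjusted_latency(C, req_proc * N)
              + (req_rec + req_send) * G2 * (N - 1)
              + (req_rec + req_send - 1) * max(o, g)
              + local * (N / memcpy_bw))
    return t_schema + t_data                      # Equation 3

def t_total(array_bytes, subchunk_bytes, I, **params):
    # Equation 2: the I I/O nodes gather their subchunks in parallel.
    return (array_bytes / subchunk_bytes) * t_subchunk(**params) / I

# Hypothetical configuration: 64 MB array, 1 MB subchunks, 4 I/O nodes,
# one remote overlapping client per I/O node, 512 KB overlapping subchunks.
print(t_total(64 * 2**20, 2**20, I=4, C=4, N=512 * 1024, schema_len=68,
              I_schema=4, req_send=1, req_rec=1, req_proc=1, local=1))
```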
4.3 Analysis

We now use the validated performance model to explain the performance trends in the earlier experiments. From Equation 2, the term T_subchunk/I determines Panda response time for an I/O request, because the array size and subchunk size are fixed. T_subchunk, in turn, depends on T_schema and T_data as in Equation 3. But the schema messages are very short compared to the data messages and contribute very little to T_subchunk, so we focus instead on T_data/I. In Equation 5 for T_data, the dominant factor that determines performance variations between configurations is Req_rec + Req_send, so the analysis below focuses on the value of this factor, divided by I. Table 3 shows Panda parameter values for various configurations using the default subchunk size, 1 MB, and fixed or optimal I/O nodes. We use the Panda parameters with subscripts fix and opt to distinguish the values for fixed and optimal part-time I/O nodes.

We are now ready to explain the different speedups as we move from 2 to 4 optimal I/O nodes in the 2x4 and 4x2 compute meshes in Figures 3 and 4. If we evaluate the coefficient (Req_rec_opt + Req_send_opt)/I for both meshes, the value for the 2x4 compute node mesh changes from 1.5 to 1.0, whereas for the 4x2 mesh it drops from 0.75 to 0.25. This coefficient reflects the maximum amount of data transferred from remote compute nodes. Thus the decrease in remote data transfer by a factor of 3 largely explains the superlinear speedup for the 4x2 compute node mesh when we increase the number of I/O nodes.

The model also explains the performance with fixed I/O nodes when the (*, BLOCK) distribution is used on disk, as in Figures 5 and 6. For a 2x4 compute mesh, the model predicts that T_subchunk/I is 0.25, 0.14, and 0.16 seconds when I is 2, 4, and 8, respectively. Thus the model correctly predicts that 4 I/O nodes yield the shortest response time. To understand the slowdown when we move from 4 to 8 I/O nodes, we examine the change in the value of the coefficient (Req_rec_fix + Req_send_fix)/I: it increases by 50%, and it has its minimum value when there are 4 I/O nodes. The slowdown caused by increasing the number of fixed I/O nodes from 2 to 4 for a 4x2 compute node mesh can be explained similarly.

The value of T_subchunk/I can also predict the mesh and the number of I/O nodes that yield the shortest response time: the configuration with the smallest T_subchunk/I yields the shortest response time, assuming all I/O nodes have the same disk performance. The 4x2 compute node mesh with 8 I/O nodes has the smallest value of T_subchunk/I for the (BLOCK, *) distribution on disk, which matches the experimental results. For the (*, BLOCK) distribution on disk, a 2x4 compute mesh with either 4 or 8 I/O nodes has the smallest value of T_subchunk/I. The experimental results show improved performance for 8 I/O nodes, but the improvement is within the margin of error of the model.

The minimization of remote data transfer also simplifies the communication pattern, allowing better utilization of the peak message passing bandwidth. We use the performance model to determine how much time Panda spends in remote data transfer, and then examine how close it comes to the peak MPI bandwidth when I/O node selection is optimized. Figure 11 shows the fraction of peak MPI throughput attained for remote data transfer in Panda. First, we divided Panda's predicted throughput by the MPI throughput measured earlier; the result shows that around 60% of the available throughput is used with fixed I/O nodes and 80% with optimal I/O nodes. However, on a shared-media network, if two or more nodes start to send messages of a given size at the same time, they will complete their transfers at different times.
Since Panda's performance is limited by the slowest node, in practice Panda's performance will be determined by the bandwidth obtained by the node experiencing the most latency. Dividing Panda's predicted throughput for remote data transfer by the average MPI throughput assuming maximum latency gives the second graph, in which Panda attains 85-98% of the average MPI throughput, except with a 4x2 compute mesh, 8 I/O nodes, and a (*, BLOCK) distribution on disk. The problem there is that 4 I/O nodes have identical overlapping clients, so that overlapping chunk transfer is serialized.

5 Related Work

Our work is related to previous work in chunking multidimensional arrays on secondary storage and in parallel I/O on workstation clusters. [Sarawagi94] is an effort to enhance the POSTGRES DBMS to support large multidimensional arrays. Chunking is used to reduce the number of disk blocks to be fetched for a given access pattern in read-intensive applications. [Seamons94] describes chunking in Panda as well as its interface for write-intensive applications on massively parallel platforms. Technology transfer from the Panda project has helped lead to an implementation of multidimensional array chunking, coupled with compression, in the HDF scientific data management package [Velamparampil97].

PIOUS [Moyer94] is pioneering work in parallel I/O on a workstation cluster. PIOUS is a parallel file system with a Unix-style file interface; coordinated access to a file is guaranteed using transactions. Our strategies for collective I/O and for slow shared-media networks could be built on top of PIOUS. VIP-FS [Harry95] provides a collective I/O interface for scientific applications running in parallel and distributed environments. Its assumed-request strategy is designed for distributed systems where the network is a potentially congested shared medium; it reduces the number of I/O requests made by all the compute nodes involved in a collective I/O operation, in order to reduce congestion. In our experience with FDDI and Ethernet, the time saved by minimizing remote data transfer generally outweighs the potential savings from reducing I/O request messages. VIPIOS [Brezany96] is a design for a parallel I/O system to be used in conjunction with Vienna Fortran. VIPIOS exploits logical data locality in the mapping between servers and application processes and physical data locality between servers and disks, which is similar to our approach of exploiting local data. Our approach adds an algorithm that guarantees minimal remote data access during I/O, quantifies the savings, and uses an analytical model to explain other performance trends.

Our research is also related to better utilization of resources on massively parallel processors, e.g., using part-time I/O. Disk Resident Arrays [Nieplocha96] is an I/O library for out-of-core computation on distributed memory platforms. For I/O operations, a fixed set of compute nodes is selected as part-time I/O nodes; thus on an interconnect like FDDI, DRA could probably benefit from consideration of array distribution, meshes, and access patterns. [Kotz95] exploits the I/O nodes in a MIMD multiprocessor as compute nodes for those applications which can tolerate I/O interrupts. The results from their simulation study show that I/O nodes are mostly underutilized, with 80-97% availability for computation, except for some complicated file-request patterns. Like our work, this is an effort to utilize as many nodes as possible for computation.
However, their assumption was that I/O nodes and compute nodes are separate and that the I/O nodes are utilized by another application, whereas our notion of part-time I/O assumes there is no distinction between compute nodes and I/O nodes and that one application can use the entire node. One advantage of our assumption is that I/O nodes can be chosen flexibly, giving customized service to each application and increasing the amount of local data transfer. Jovian-2 [Acharya96] allows all nodes to run both an application thread and an I/O server thread in its peer-to-peer configuration. Each node can make an I/O request to another, so the processing of an I/O request can be delayed by the computation performed by the requested node. Their remedy for slow I/O performance was to make use of the global I/O requirement information available from the application and to prefetch and cache data to minimize the I/O time. Also, in contrast to the other works cited in this section, [Acharya96] argues that collective I/O is not as efficient as I/O to a single distributed file using an independent file pointer for each compute node, if the application code is appropriately organized. We prefer to leave the application code as is, and concentrate on handling high-level I/O requests efficiently.

Previous work from the Panda group also targets part-time I/O. [Subraman96] describes Panda 2.1's implementation of part-time I/O and gives basic performance results. [Kuo96] experimented with part-time I/O for a real application on the SP2, and found that part-time I/O can also be used efficiently for an application which requires offloading its data to Unitree when the application is finished, if the rest of the compute nodes, which are not involved in offloading the data, can be released right away.

6 Summary and Conclusions

For a workstation cluster where message passing delivers low bandwidth to the end user due to software overhead and shared media, message passing is more likely than disk speed to be the bottleneck for parallel I/O. This paper has shown that in such an environment, proper selection of parameters like the compute node mesh, array distribution, and number of I/O nodes has a large impact on I/O performance. We experimented with the Panda parallel I/O library on a workstation cluster connected by FDDI and Ethernet, two currently popular LANs. On a small cluster, it is not practical to dedicate nodes to I/O, so we used a "part-time I/O" strategy so that compute nodes could double as I/O nodes. Another important issue, given a relatively slow interconnect, is the choice of I/O nodes so as to minimize both message traffic across the network and waiting time. We introduced an algorithm that chooses I/O nodes so as to guarantee the minimum amount of data movement across the network, and found that it gave speedups of 1.2-2.1 when compared with the naive choice of I/O nodes on an 8-node HP workstation cluster using FDDI. If optimal I/O nodes are to be used for persistent arrays, the I/O nodes need to maintain directory information regarding the arrays stored, as Panda does in its "schema" files.

Even with the optimal choice of I/O nodes, Panda's performance is very sensitive to the compute node mesh and array distribution. The ability to predict performance is important if we are to guide the user in the selection of the mesh, the array distribution on disk, and the number of I/O nodes. We introduced an analytical performance model that accurately predicts performance on an FDDI interconnect, and used it to select the optimal values for these parameters.
The performance model can be reused for different interconnects, such as ATM or Myrinet [Boden95], if values for the relevant message passing parameters are measured and the latency is appropriately modeled. This study is the first step toward a version of Panda that can automatically make optimal choices of meshes and numbers of I/O nodes, for scientists who want to be relieved of that task. In the future, we also plan to study the performance of Panda on a larger cluster with a faster, switch-based interconnect, Myrinet, and to explore alternative I/O strategies for that platform.

7 Acknowledgements

This research was supported by an ARPA Fellowship in High Performance Computing administered by the Institute for Advanced Computer Studies at the University of Maryland, by NSF under PYI grant IRI 89 58582, and by NASA under NAGW 4244 and NCC5 106. We thank John Wilkes for providing access to the HP workstation cluster at HP Labs in Palo Alto and for valuable comments on our initial manuscript draft. We also thank Milon Mackey for system information describing the HP cluster, and all the users at HPL who suffered through our heavy use of the cluster and its Ethernet connection. We also thank David Kotz at Dartmouth College for providing access to the FLEET lab.

References

[Acharya96] A. Acharya, M. Uysal, R. Bennett, A. Mendelson, M. Beynon, J. Hollingsworth, J. Saltz and A. Sussman, Tuning the Performance of I/O-Intensive Parallel Applications, Proceedings of the Fourth Annual Workshop on I/O in Parallel and Distributed Systems, pages 15-27, May 1996.
[Alexandrov95] A. Alexandrov, M. Ionescu, K. Schauser and C. Scheiman, LogGP: Incorporating Long Messages into the LogP Model - One Step Closer Towards a Realistic Model for Parallel Computation, Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95-105, 1995.
[Banks93] D. Banks and M. Prudence, A High-Performance Network Architecture for a PA-RISC Workstation, IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, pages 191-202, February 1993.
[Boden95] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic and W. Su, Myrinet: A Gigabit-per-Second Local Area Network, IEEE Micro, 15(1), pages 29-36, February 1995.
[Brezany96] P. Brezany, T. A. Mueck and E. Schikuta, A Software Architecture for Massively Parallel Input-Output, Proceedings of the Third International Workshop PARA'96, LNCS, Springer Verlag, Lyngby, Denmark, August 1996.
[Chen96] Y. Chen, M. Winslett, S. Kuo, Y. Cho, M. Subramaniam and K. Seamons, Performance Modeling for the Panda Array I/O Library, Proceedings of Supercomputing '96, November 1996.
[Culler93] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser and E. Santos, LogP: Towards a Realistic Model of Parallel Computation, Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, May 1993.
[Harry95] M. Harry, J. Rosario and A. Choudhary, VIP-FS: A Virtual, Parallel File System for High Performance Parallel and Distributed Computing, Proceedings of the Ninth International Parallel Processing Symposium, April 1995.
[Keeton95] K. K. Keeton, T. E. Anderson and D. A. Patterson, LogP Quantified: The Case for Low-Overhead Local Area Networks, Hot Interconnects III, August 1995.
[Kotz95] D. Kotz and T. Cai, Exploring the Use of I/O Nodes for Computation in a MIMD Multiprocessor, Proceedings of the IPPS '95 Third Annual Workshop on I/O in Parallel and Distributed Systems, pages 78-89, 1995.
[Kuo96] S. Kuo, M. Winslett, K. E. Seamons, Y. Chen, Y. Cho and M. Subramaniam, Application Experience with Parallel Input/Output: Panda and the H3expresso Black Hole Simulation on the SP2, Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[Moyer94] S. Moyer and V. Sunderam, PIOUS: A Scalable Parallel I/O System for Distributed Computing Environments, Computer Science Technical Report CSTR-940302, Department of Math and Computer Science, Emory University, November 1994.
[MPI] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, June 1995.
[MPICH] W. Gropp, E. Lusk and A. Skjellum, A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, http://www.mcs.anl.gov/mpi/mpicharticle/paper.html, July 1996.
[Nieplocha96] J. Nieplocha and I. Foster, Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core Computation, Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 196-204, October 1996.
[Papadimitriou82] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall Inc., 1982.
[Sarawagi94] S. Sarawagi and M. Stonebraker, Efficient Organization of Large Multidimensional Arrays, Proceedings of the International Conference on Data Engineering, pages 328-336, 1994.
[Seamons94] K. E. Seamons and M. Winslett, Physical Schemas for Large Multidimensional Arrays in Scientific Computing Applications, Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management, Charlottesville, Virginia, pages 218-227, September 1994.
[Seamons95] K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak and M. Winslett, Server-Directed Collective I/O in Panda, Proceedings of Supercomputing '95, November 1995.
[Subraman96] M. Subramaniam, High Performance Implementation of Server Directed I/O, M.S. Thesis, Department of Computer Science, University of Illinois, 1996.
[Velamparampil97] G. Velamparampil, Data Management Techniques to Handle Large Data Arrays in HDF, M.S. Thesis, Department of Computer Science, University of Illinois, 1997.

Figure 11: Time that Panda spends in remote data transfer as a percentage of the average and worst-case MPI bandwidth. The two panels show predicted Panda message passing throughput divided by (a) the average MPI bandwidth and (b) the average MPI bandwidth assuming maximum latency. Each legend entry is of the form x-y-z, where x is a compute mesh, y is an I/O node mesh, and z is r for (BLOCK, *) or c for (*, BLOCK).
Table 3: Panda parameter values (I_schema, Req_send, Req_rec, Local, Req_proc, C, and N) for a 2D array distributed (BLOCK, BLOCK) in memory across 2x4 and 4x2 compute node meshes, for fixed and optimal I/O nodes with 2, 4, or 8 I/O nodes. The columns r and c represent the values when the (BLOCK, *) and (*, BLOCK) distributions are used on disk, respectively. The 4x2 compute mesh with 2 I/O nodes and a (BLOCK, *) distribution on disk has two parameter lists because the I/O nodes gather from two different sets of clients: initially, I/O node 0 gathers a subchunk from compute nodes 0 and 1 while I/O node 1 gathers one from compute nodes 4 and 5; next, I/O node 0 gathers a subchunk from compute nodes 2 and 3 while I/O node 1 gathers one from nodes 6 and 7. The 2x4 compute node mesh with 2 I/O nodes and a (*, BLOCK) distribution on disk is similar.

    4 MB array
    compute mesh   I/O nodes   measured   predicted
    2x4            2 (r)       1.05       0.98
    2x4            4 (r)       0.76       0.72
    2x4            2 (c)       1.07       1.04
    2x4            8 (c)       0.44       0.45
    4x2            2 (r)       1.03       1.04
    4x2            4 (r)       0.56       0.63
    4x2            4 (c)       0.71       0.77
    4x2            8 (c)       0.66       0.65

    16 MB array
    compute mesh   I/O nodes   measured   predicted
    2x4            2 (r)       3.52       3.22
    2x4            4 (r)       2.30       2.31
    2x4            2 (c)       3.36       3.14
    2x4            8 (c)       1.46       1.34
    4x2            2 (r)       3.42       3.32
    4x2            4 (r)       1.52       1.33
    4x2            4 (c)       2.40       2.08
    4x2            8 (c)       2.08       1.97

    64 MB array
    compute mesh   I/O nodes   measured   predicted
    2x4            2 (r)       13.41      12.43
    2x4            4 (r)       8.28       8.66
    2x4            2 (c)       12.94      12.51
    2x4            8 (c)       4.72       4.40
    4x2            2 (r)       12.88      12.74
    4x2            4 (r)       5.41       4.83
    4x2            4 (c)       8.70       7.88
    4x2            8 (c)       8.01       7.79

Table 4: Panda response time in seconds to write a 2D array distributed (BLOCK, BLOCK) in memory, using optimal I/O nodes. In the second column, "r" and "c" stand for the (BLOCK, *) and (*, BLOCK) distributions on disk, respectively. Results are shown only for the cases where the optimal I/O nodes are different from the fixed I/O nodes.