Exploiting Local Data in Parallel Array I/O on a Practical Network of Workstations
Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S. Kuo, K. E. Seamons†
Computer Science Department
University of Illinois, Urbana, Illinois 61801
{ycho,winslett}@cs.uiuc.edu
Abstract
A cost-effective way to run a parallel application is to use existing workstations connected by a local area network such as Ethernet or FDDI. In this paper, we present an approach for parallel I/O of multidimensional arrays on small networks of workstations with a shared-media interconnect, using the Panda I/O library. In such an environment, the message passing throughput per node is lower than the throughput obtainable from a fast disk, and it is not easy for users to determine the configuration that will yield the best I/O performance. We introduce an I/O strategy that exploits local data to reduce the amount of data that must be shipped across the network, present experimental results, analyze the results using an analytical performance model, and predict the best choice of I/O parameters. Our experiments show that the new strategy yields a factor of 1.2–2.1 speedup in response time compared to the Panda version originally developed for the IBM SP2, depending on the array sizes, distributions, and compute and I/O node meshes. Further, the performance model predicts the results within a 13% margin of error.
1 Introduction
Workstation clusters are emerging as an economical way to run parallel applications. Despite their attractive features, these clusters still have some limits when compared to massively parallel processors. First, usually fewer nodes are available, limiting the scale of experiments. Second, there are no widely available implementations of portable libraries like PVM or MPI that have been highly tuned for clusters. Third, the popular FDDI and Ethernet are slower than the interconnects of massively parallel processors [Keeton95]. For example, with a single sender and receiver on an FDDI HP workstation cluster, we obtained 3.33 MB/sec on average out of the peak FDDI bandwidth of 12.5 MB/sec using MPICH 1.0.11 [MPICH]¹. As the number of senders and receivers increases, bandwidth per node degrades although the aggregate bandwidth increases. Thus interconnect speed, not disk speed, will often be the limiting factor for parallel I/O performance.
* Presently affiliated with Oracle Corporation.
† Presently affiliated with Transarc Corporation.
PIOUS [Moyer94], VIP-FS [Harry95], and VIPIOS [Brezany96] provide parallel I/O on workstation clusters. However, each of these lacks one or more of the features desired for parallel scientific applications running on a network of workstations: collective I/O, special considerations for low message passing bandwidth, better utilization of limited resources, and minimized data transfer over the network. Our own Panda parallel I/O library² provided collective I/O but lacked the other desired features. Thus we set out to provide them, by addressing three important issues: sharing the limited number of nodes between I/O and computation; making an optimal choice of the exact I/O nodes to save network bandwidth and to reduce concurrent access to the network; and choosing other parameters relevant to I/O, such as the logical topology of compute and I/O nodes (hereafter called the compute mesh and I/O mesh respectively), the array distribution in memory, and the array distribution on disk. To help in choosing parameter values and to ease performance analysis, we developed an accurate analytical performance model.
In the remainder of the paper, we begin by briefly introducing Panda in section 2. In section 3, we discuss "part-time" I/O, present an algorithm that optimizes I/O node selection, and show the performance results. In section 4 we present a simple abstract performance model, analyze the results, and show how our performance model is used to select the configuration with the shortest response time for our running examples. Related work is described in section 5 and we conclude in section 6.
1 Throughput for passing smaller messages on similar HP platforms connected by an isolated FDDI gets up to 11 MB/sec (http://onet1.external.hp.com/netperf/NetperfPage.html). The difference can be attributed to MPI overhead and a non-isolated network.
2 http://drl.cs.uiuc.edu/panda/.
2 Panda Array I/O Library
Panda is a parallel I/O library for multidimensional arrays. Its original design was intended for SPMD applications running on distributed memory architectures, with arrays distributed across multiple processors, that are fairly closely synchronized at I/O time. Panda supports HPF-style BLOCK, CYCLIC(K), and * data distributions across the multiple compute nodes on which Panda clients are running. Panda's approach to high-performance I/O in this environment is called server-directed I/O [Seamons95].
Figure 1: Different array data distributions in memory and on disk provided by Panda. (The figure shows the clients, i.e. compute nodes, holding the in-memory distribution and the servers, i.e. I/O nodes, holding the on-disk distribution.)
Figure 1 shows a 2D array distributed (BLOCK, BLOCK) across 4 compute nodes arranged in a 2×2 mesh. Each piece of the distributed array is called a compute chunk, and each compute chunk resides in the memory of one compute node. The I/O nodes are also arranged in a mesh, and the data can be distributed across them using the HPF-style directives BLOCK and *. CYCLIC(K) is only supported on disk when the distributions in memory and on disk are identical, called natural chunking [Seamons94] in Panda. (We have not found any scientists yet who need a CYCLIC distribution on disk.)
The array distribution on disk can also be radically different from that in memory. For instance, the array in Figure 1 has a (BLOCK, *) distribution on disk. Using this distribution, the resulting data files can be concatenated together to form a single array in row-major or column-major order, which is particularly useful if the array is to be sent to a workstation for postprocessing with a visualization tool, as is often the case. Each I/O chunk resulting from the distribution chosen for disk will be buffered and sent to (or read from) local disk at one I/O node, and that I/O node is in charge of reading, writing, gathering, and scattering the I/O chunk. For example, in Figure 1, during a Panda write operation I/O node 0 gathers compute chunks from compute nodes 0 and 1, reorganizes them into a single I/O chunk, and writes it to disk. In parallel, I/O node 1 gathers, reorganizes, and writes its own I/O chunk. For a read operation, the reverse process is used.
During I/O, Panda divides large I/O chunks into smaller pieces, called subchunks, in order to obtain better file system performance and keep I/O node buffer space requirements low. For a write operation, a Panda server will repeatedly gather and write subchunks, one by one. More precisely, for each I/O subchunk, the server determines which clients own a piece of the subchunk ("overlapping clients") and the indices of the overlapping subarray at each client. The server sends each overlapping client a message (the "schema message") requesting the relevant subarray. Clients respond to schema messages by sending the corresponding array portions in a single message (whether or not the data was contiguous at the client). With a fine-grained data distribution, a client may have several overlapping subarrays for a single subchunk. To keep message traffic low, the server requests all overlapping subarrays from a client in a single message, and receives them in a single response message. The experiments described in this paper extend Panda 2.1 [Subraman96], which uses non-blocking communication between clients and servers to avoid any possible deadlock in part-time I/O (discussed in the following section) and to overlap communication and computation.
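To make this gather step concrete, here is a minimal sketch (our own Python illustration with hypothetical helper names, not Panda's API) of the overlap computation behind a schema message: given one I/O subchunk and the clients' in-memory regions, it reports which clients overlap the subchunk and which subarray each would be asked to send.

```python
def overlapping_clients(subchunk, client_regions):
    """For one I/O subchunk, return {client_rank: (r0, r1, c0, c1)}: the
    half-open index ranges of the subarray each overlapping client would
    be asked for in a schema message."""
    sr0, sr1, sc0, sc1 = subchunk
    overlaps = {}
    for rank, (cr0, cr1, cc0, cc1) in enumerate(client_regions):
        r0, r1 = max(sr0, cr0), min(sr1, cr1)
        c0, c1 = max(sc0, cc0), min(sc1, cc1)
        if r0 < r1 and c0 < c1:          # keep only non-empty intersections
            overlaps[rank] = (r0, r1, c0, c1)
    return overlaps

# Figure 1 setting: an 8x8 array, (BLOCK, BLOCK) over a 2x2 compute mesh.
clients = [(0, 4, 0, 4), (0, 4, 4, 8), (4, 8, 0, 4), (4, 8, 4, 8)]
# I/O node 0's first subchunk under (BLOCK, *) covers rows 0-3, all columns:
print(overlapping_clients((0, 4, 0, 8), clients))
# -> {0: (0, 4, 0, 4), 1: (0, 4, 4, 8)}: compute nodes 0 and 1 overlap.
```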
3 Part-time I/O with Optimal I/O Nodes
A network of workstations usually has only a limited number of nodes, and scientists often want to run their applications on as many of them as possible [Kuo96]. It would be wasteful to dedicate nodes to I/O, leaving them idle during computation. Panda 2.1 supports an alternative I/O strategy, part-time I/O, where there are no dedicated I/O nodes. Instead, some of the compute nodes become I/O nodes at I/O time, and return to computation after finishing the I/O operation [Subraman96]. When a client reaches a collective I/O call, it may temporarily become a server. In the rest of the paper we will use the terms "I/O nodes" and "part-time I/O nodes" interchangeably.
As implemented in Panda 2.1, part-time I/O chooses the first k of the compute nodes as I/O nodes, regardless of array distribution or compute node mesh. A similar strategy has been taken in other libraries that provide part-time I/O [Nieplocha96]. This provides acceptable performance on a high speed, switch-based interconnect whose processors are homogeneous with respect to I/O ability, as on the SP2 [Subraman96]. On a shared-media interconnect like FDDI or Ethernet, however, we found that performance is generally unsatisfactory and tends to vary widely with the exact choice of compute and I/O node meshes.
To see the source of the problem, consider the example in Figure 1. With the naive selection of compute nodes 0 and 1 as I/O nodes, compute node 1 needs to send its local chunk to compute node 0 and gather a subchunk from compute nodes 2 and 3. This incurs extra message passing that is unnecessary if compute node 2 doubles as an I/O node instead of compute node 1.
3.1 I/O Node Selection Strategy
In an environment where the interconnect will clearly be the bottleneck for I/O, as is the case for the HP workstation cluster with an FDDI interconnect, the single most important optimization we can make is to minimize the amount of remote data transfer. In this section we show how to choose I/O nodes in a manner that will minimize the number of array elements that must be shipped across the network during I/O. More precisely, suppose we are given a target number m of part-time I/O nodes, the current distribution of data across processor memories, and a desired distribution of data across I/O nodes. The data distribution in memory need not be an HPF-style distribution; for example, some nodes may have significantly more data than others, as might happen during an adaptive mesh-refinement computation or when some nodes are significantly more powerful than others, as might be the case in a workstation cluster.³ Similarly, the distribution on disk need not be HPF-style; we need only know which of the m I/O nodes is to receive each byte of data. Under these assumptions, we show how to choose the I/O nodes from among the set of n ≥ m compute nodes so that remote data transfer is minimized.
We begin by forming an m×n array called the I/O matrix (M), where each row represents one of the m I/O nodes and each column represents one of the n compute nodes. The (i, j) entry of the I/O matrix, M(i, j), is the total number of array elements that the ith I/O node will have to gather from the jth compute node, which can be computed from the array size, the in-memory distribution, and the target disk distribution. In Panda, every node involved in an array I/O operation has access to the array size and distribution information, so M can be generated at run time.
For example, if the memory and disk distributions are the HPF-style distributions shown in Figure 1, the I/O matrix consists of 2 rows and 4 columns. For an 8×8 array of 1-byte elements, distributed as in Figure 1,

    M = [ 16  16   0   0
           0   0  16  16 ]
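As a concrete illustration of how M can be generated at run time, the sketch below (ours, not Panda code; it assumes plain BLOCK decompositions along each dimension) intersects every I/O node's on-disk region with every compute node's in-memory region and reproduces the matrix above.

```python
def block_bounds(n, parts, idx):
    """Start/end (exclusive) of block idx when n elements are split into `parts` blocks."""
    base, rem = divmod(n, parts)
    start = idx * base + min(idx, rem)
    return start, start + base + (1 if idx < rem else 0)

def block_block_regions(rows, cols, mesh):
    """Rectangular region (r0, r1, c0, c1) owned by each node of a (BLOCK, BLOCK) mesh.
    A (BLOCK, *) distribution over n I/O nodes is just the mesh (n, 1)."""
    pr, pc = mesh
    return [block_bounds(rows, pr, i) + block_bounds(cols, pc, j)
            for i in range(pr) for j in range(pc)]

def overlap(a, b):
    """Number of array elements shared by two rectangles."""
    (ar0, ar1, ac0, ac1), (br0, br1, bc0, bc1) = a, b
    return max(0, min(ar1, br1) - max(ar0, br0)) * max(0, min(ac1, bc1) - max(ac0, bc0))

def io_matrix(rows, cols, compute_mesh, io_mesh):
    """M[i][j] = elements I/O node i must gather from compute node j."""
    compute = block_block_regions(rows, cols, compute_mesh)
    servers = block_block_regions(rows, cols, io_mesh)
    return [[overlap(s, c) for c in compute] for s in servers]

# Figure 1 example: 8x8 array, 2x2 compute mesh, (BLOCK, *) disk mesh = 2x1.
print(io_matrix(8, 8, (2, 2), (2, 1)))   # [[16, 16, 0, 0], [0, 0, 16, 16]]
```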
Given the I/O matrix, the goal of choosing the m I/O nodes that will minimize remote data transfer can be formalized as the problem of choosing m matrix entries M(i_1, j_1), ..., M(i_m, j_m) such that no two entries lie in the same row or column (i_k ≠ i_l and j_k ≠ j_l, for 1 ≤ k < l ≤ m) and the sum Σ_{1 ≤ k ≤ m} M(i_k, j_k) is maximal. We can simplify the problem by traversing the I/O matrix and selecting the m largest entries from each row. We call the result the simplified I/O matrix (M′). The proposition below shows that only the simplified I/O matrix need be considered when searching for a solution.
Proposition 1. An assignment of part-time I/O nodes that minimizes remote data transfer can be found among the entries of the simplified I/O matrix.
Proof 1. Suppose we have an assignment of part-time I/O nodes that minimizes remote data transfer and includes an entry M(i, j) that is not part of the simplified I/O matrix (the "old solution"). Then the simplified matrix contains m entries M(i, j_k) ≥ M(i, j), 1 ≤ k ≤ m, none of which has been chosen as part of the old solution. Thus if we can substitute one of the M(i, j_k), call it M(i, l), for M(i, j), creating a new solution, the amount of remote data transfer will be at least as low as before. However, the substitution can only be performed if we can find a choice of M(i, l) such that no other entry in the new solution is from the ith row or from the lth column. Since the old solution can contain only one entry from row i, we can safely substitute M(i, l) for M(i, j) without introducing a row conflict. Suppose, however, that for each of the m possible choices of l we find that the new solution contains two entries from column l. Then the old solution has m entries from columns other than j, which is impossible. We conclude that a solution can be found by considering only the simplified I/O matrix.
3 If the data were very unevenly distributed across compute nodes and/or I/O nodes, or if different I/O nodes had different capabilities, then the optimal choice of I/O nodes might need to consider not only minimal remote data transfer but also other costs at each node, to avoid potential load imbalance that could adversely affect performance. To be accurate in this case, the LogGP analytical model used in section 4 would need to be extended to model these costs (disk access time, local copying, etc.).
A matching in a graph is a maximal subset of the edges of the graph such that no two edges share an endpoint. Given a simplified I/O matrix M′, the problem of assigning I/O nodes is equivalent to a weighted version of the matching problem in a bipartite graph G. Consider every row (I/O node) and column (compute node) of M′ to be a vertex of G, and each entry M′(i, j) to be the weight of the edge connecting vertices i and j. We wish to find the matching of G with the largest possible sum of weights. The optimal solution can be obtained by the Hungarian Method [Papadimitriou82] in O(m³) time, where m is the number of part-time I/O nodes.
Figure 2 shows the simplified I/O matrix M′ and the corresponding bipartite graph G for the matrix M above:

    M′ = [ 16  16
           16  16 ]

Figure 2: Simplified I/O matrix M′ and the corresponding bipartite graph G. The original column indices of the entries in M′ are recorded separately. The edges of G drawn with thicker lines form the matching selected to minimize the amount of remote data transfer.
Proposition 2. An assignment of m part-time I/O nodes that minimizes remote data transfer can be determined from the simplified I/O matrix in O(m³) time.
Proof 2. Follows directly from the preceding discussion.
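One way to carry out this optimization, sketched here under our own assumptions rather than taken from Panda's source, is to hand the I/O matrix to an off-the-shelf Hungarian-style solver such as SciPy's linear_sum_assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def choose_io_nodes(io_matrix):
    """Assign each I/O node slot (row) to a distinct compute node (column)
    so that the total number of locally available elements is maximized,
    i.e. remote data transfer is minimized."""
    m = np.asarray(io_matrix)
    rows, cols = linear_sum_assignment(m, maximize=True)
    return {int(i): int(j) for i, j in zip(rows, cols)}

# The Figure 1 matrix: the best part-time I/O nodes are one compute node
# from {0, 1} and one from {2, 3}, each keeping 16 elements local.
M = [[16, 16, 0, 0],
     [0, 0, 16, 16]]
print(choose_io_nodes(M))   # e.g. {0: 0, 1: 2}
```

Restricting the search to the simplified m×m matrix M′, as Propositions 1 and 2 justify, yields an equally good assignment while keeping the solver's cost at O(m³).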
The preceding discussion assumes that the goal is to optimize the choice of I/O nodes for a single write request. With small modifications, the same approach can be used to optimize the choice of I/O nodes for a sequence of read and write requests, including any initial bulk data loading. In this case, the I/O matrix must contain the total number of bytes transferred between each client and server, which can be obtained from a description of the anticipated data distributions, meshes, and numbers and types of read and write requests. As in the case of a single I/O request, uneven data distributions or unequal node capabilities may mean that the optimization decision should consider additional factors beyond minimal data transfer. In addition, if optimal I/O nodes are to be used for persistent arrays, the I/O nodes need to maintain directory information regarding the arrays stored, as Panda does in its "schema" files.
3.2 Performance Results
Figure 3: Panda response time for writing an array using fixed or optimal I/O nodes. Memory mesh: 2×4. Memory distribution: (BLOCK, BLOCK). Disk mesh: n×1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).
The experiments were conducted on an 8-node HP 9000/735 workstation cluster running HP-UX 9.07 on each node, connected by FDDI. Each node has 144 MB of main memory and two local disks, and each I/O node used a 4 GB local disk. At the time of our experiments, the disks were 45–90% full depending on the node. We measured the file system throughput for write operations by averaging 8 trials of writing a 128 MB file using 1 MB application write requests⁴: 5.96 MB/sec for the least occupied disk and 5.63 MB/sec for the fullest disk, with standard deviations of 0.041 and 0.153 respectively. For the message passing layer, we used MPICH 1.0.11 and obtained an average message passing bandwidth per node of 3.2, 2.9, or 2.3 MB/sec when there are 1, 2, or 4 pairs of senders and receivers, respectively, for Panda's common message sizes (32–512 KB). The cluster is not fully isolated from other networks, so we did our experiments when no other user job was executing on the cluster. All the experimental results shown are the average of 3 or more trials, and error bars show a 95% confidence interval for the average.
4 Panda I/O nodes also write in 1 MB units, the subchunk size in all experiments.
We used all 8 nodes as compute nodes in our experiments, as that configuration is probably most representative of scientists' needs. The in-memory distribution was (BLOCK, BLOCK), and we tested performance using the two obvious compute node meshes for a 2D array on 8 nodes: 2×4 and 4×2. We chose a 2D instead of a 3D array so that we could examine the performance impact of changing the compute node mesh (there is only one obvious mesh, 2×2×2, for a 3D array distributed (BLOCK, BLOCK, BLOCK)). We used 2, 4, or 8 part-time I/O nodes, while increasing the array size from 4 MB (1024×1024) to 16 MB (2048×2048) and 64 MB (4096×4096). For the disk distribution, we used either (BLOCK, *) or (*, BLOCK) to show the effect of a radically different distribution. We present results for writes; reads are similar.
Figures 3, 4, 5 and 6 compare the time to write an array using an optimal choice of I/O nodes or using the first k nodes as I/O nodes ("fixed" I/O nodes). A group of 6 bars is shown for each number of I/O nodes. Each pair of bars shows the response time to write an array of the given size using fixed I/O nodes and optimal I/O nodes respectively. The use of optimal I/O nodes reduces array output time by at least 19% across all combinations of array sizes and meshes, except for the case of 8 I/O nodes, for which the fixed and optimal I/O nodes are identical.
In Figures 3 and 4, different configurations perform very differently. For instance, for a 4×2 compute node mesh (Figure 4) with 4 I/O nodes, the change to optimal I/O nodes halves response time. Moving from 2 to 4 I/O nodes gives a superlinear speedup when optimal I/O nodes are used, but only in the 4×2 mesh; the 2×4 mesh sees only a factor of 1.5 speedup. The reasons for these differences will be explored in the next section.
Figure 4: Panda response time for writing an array using fixed or optimal I/O nodes. Memory mesh: 4×2. Memory distribution: (BLOCK, BLOCK). Disk mesh: n×1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).
Figure 5: Panda response time for writing an array using fixed or optimal I/O nodes. Memory mesh: 2×4. Memory distribution: (BLOCK, BLOCK). Disk mesh: 1×n, where n is the number of I/O nodes. Disk distribution: (*, BLOCK).
Figure 6: Panda response time for writing an array using fixed or optimal I/O nodes. Memory mesh: 4×2. Memory distribution: (BLOCK, BLOCK). Disk mesh: 1×n, where n is the number of I/O nodes. Disk distribution: (*, BLOCK).
Figures 5 and 6 show the results obtained with a (*, BLOCK) distribution on disk. In Figures 3 and 4, the use of additional I/O nodes always improved response time. However, for the 2×4 compute node mesh in Figure 5, response time increases when we move from 4 to 8 I/O nodes, unless the I/O node selection is optimized. Similarly, fixed part-time I/O slows down when we move from 2 to 4 I/O nodes for a 4×2 compute node mesh in Figure 6.
The performance results in Figures 7 and 8 show the Panda response time to write out data with a fine-grained memory distribution, using optimal I/O nodes. The same arrays are used as in the earlier figures, but with a (BLOCK, CYCLIC(K)) memory distribution, for K ∈ {32, 64, 128}. For a given number of I/O nodes and array size, each group of three bars shows the performance for different choices of K. The performance is very close to that of the (BLOCK, BLOCK) distribution with optimal I/O nodes shown in Figures 3 and 4, with a small increase in response time. The increase is due to the extra overhead of computing the overlaps for CYCLIC distributions, and to the smaller granularity of contiguous data to be sent to I/O nodes, which makes remote data transfer more expensive. The cyclic block size K does not impact performance significantly in Figures 7 and 8. Changing K does not affect the amount of local data at each I/O node, which is the same as for the (BLOCK, BLOCK) distribution and is a primary determinant of performance. Further, each I/O node has the same number of overlapping clients for each chunk as it did for the (BLOCK, BLOCK) distribution, so message-passing costs are much the same.
We repeated the same experiments while distributing the in-memory array (CYCLIC(K), BLOCK); the results are shown in Figures 9 and 10. Compared to Figures 7 and 8, there is a noticeable performance degradation. For the previous configurations, at each I/O node either 1/2 (4×2 mesh with 4 or 8 I/O nodes) or 1/4 (all other cases) of the data was local. Now the amount of local data has dropped to 1/8, with four exceptions in the 4×2 mesh (1/2 local: 4 MB, K = 128, 8 I/O nodes; 1/4 local: 16 MB, K = 128, 8 I/O nodes; 4 MB, K = 64, 8 I/O nodes; 4 MB, K = 128, 4 I/O nodes). An examination of the latter configurations in Figure 8 shows how increasing the amount of local data, by increasing K or decreasing the array size, significantly improves performance. Moreover, when only 1/8 of the data is local, each I/O node must communicate with every other node to gather each chunk, aggravating network contention and increasing wait times. Distributing the in-memory array (CYCLIC(K), CYCLIC(K)) gave performance close to that of (CYCLIC(K), BLOCK).
We reran the same tests on the same cluster, but using a 10 Mb/s Ethernet, and observed similar general trends. Panda provided lower I/O throughput than with FDDI, as expected, but also suffered severely when there were many overlapping clients. For instance, with a 2×4 compute node mesh, a (BLOCK, BLOCK) memory distribution, and a (BLOCK, *) data distribution on disk, each I/O node has 4 overlapping clients, and doubling the number of I/O nodes reduces response time by less than 10%, even when using optimal I/O nodes. The Ethernet uses random exponential backoff on packet collisions, which means that the Ethernet degrades less gracefully than FDDI in the presence of high network contention in our experiments.
Figure 7: Panda response time for writing a finely-distributed array using optimal I/O nodes. Memory mesh: 2×4. Memory distribution: (BLOCK, CYCLIC(K)). Disk mesh: n×1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

Figure 8: Panda response time for writing a finely-distributed array using optimal I/O nodes. Memory mesh: 4×2. Memory distribution: (BLOCK, CYCLIC(K)). Disk mesh: n×1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

Figure 9: Panda response time for writing a finely-distributed array using optimal I/O nodes. Memory mesh: 2×4. Memory distribution: (CYCLIC(K), BLOCK). Disk mesh: n×1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).

Figure 10: Panda response time for writing a finely-distributed array using optimal I/O nodes. Memory mesh: 4×2. Memory distribution: (CYCLIC(K), BLOCK). Disk mesh: n×1, where n is the number of I/O nodes. Disk distribution: (BLOCK, *).
The experiments described in this section show that even with optimal I/O nodes, performance depends heavily not only on the amount of local data transfer, but also on the compute node mesh chosen, the array distribution on disk, and the number of I/O nodes. To obtain acceptable performance, the user needs help from Panda in predicting I/O performance. Also, the use of local data makes it hard to tell how well Panda utilizes the available network bandwidth. In the next section, we present a model for Panda's performance with FDDI, to be used for these purposes.
4 Performance Model and Analysis
In previous work, we developed a performance model for Panda on the IBM SP2, where message passing contention is not an issue [Chen96]. However, in a workstation cluster whose disks can outrun the interconnect, message passing contention and throughput are the key elements to model for performance prediction, so the old model is of no assistance. As the interpretation of the initial performance results obtained from the cluster was not obvious, we developed a new performance model from a priori principles (i.e., not depending on the experimental results in any way), and performed the experiments presented in section 3 to validate the model. In this section we describe our model for gathering and scattering a subchunk in Panda using part-time I/O on FDDI, show that it gives very accurate performance predictions, and show how we use it in performance analysis.
4.1 The LogP, LogGP Model
We build on the LogP model [Culler93] and its extension LogGP [Alexandrov95] due to its simplicity and its ability to model the overlapping of communication and computation. LogP and LogGP model a parallel application using the following parameters:
L (latency): the delay incurred in communicating a message containing one word from source to destination.
o (overhead): the length of time that a processor is engaged in the transmission or reception of each message. Processors cannot perform any other operation during this time.
g (gap): the minimum time interval between consecutive transmissions or receptions of a short message at a process.
P: the number of processes.
G (gap per byte): the time per byte to transmit or receive a long message.
Thus, if a sender transmits a k-byte message at time 0, it will be received at time o + G(k−1) + L + o. If two consecutive messages of size k are to be sent out, the first byte of the second message enters the network at time o + G(k−1) + max(o, g). Measurements of the parameter values on the FDDI HP workstation cluster were done at the MPI level, and the results are shown in Table 1.

    o          g        G1        G2
    235.4 µs   300 µs   4.5 µs    0.30 µs

Table 1: LogGP parameter values. G1 is the gap per byte for a medium-size message and G2 is the gap per byte for a large message.

The overhead o was measured as the time required for a process to post a non-blocking MPI send
to another process. For the value of g, a process sends to another process 500 one-word messages using MPI's non-blocking send operation, and the receiver sends an acknowledgement once it has received all the messages. Dividing the total time by the number of messages gives g. G is measured in the same way but using larger messages, for two different message sizes: G1 for Panda schema messages (68 bytes for a 2D array) and G2 for overlapping subchunk data messages (usually hundreds of KB). Measurement of L is discussed in the next section.
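As a rough consistency check (our own arithmetic, assuming the Table 1 values are in microseconds), the measured overhead and per-byte gap alone already place large-message throughput near the per-node MPI bandwidth reported in section 1:

```python
# Measured LogGP parameters from Table 1, in microseconds.
o  = 235.4        # per-message send/receive overhead
G2 = 0.30         # gap per byte for large data messages

k = 256 * 1024                       # a typical Panda data message, 256 KB
t = o + G2 * (k - 1) + o             # latency (~10 us) omitted as negligible
print(round(k / t, 2), "MB/s")       # ~3.3 MB/s, matching the measured MPI rate
```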
4.2 Abstract Performance Model and Validation
The FDDI used for our experiments is a shared-media network, with arbitration done by a token moving in a fixed circle around the network. A node has to grab the token before it sends out any message. Our latency parameter must therefore include the time spent waiting for the token when there are several waiting senders. If l is the raw network latency when the network is contention free and w is the raw network bandwidth, the adjusted latency when there are s senders trying to send an n-byte message, L(s, n), can be formulated as follows: L(1, n) = l; L(2, n) = (l + n/w) + l; ...; L(s, n) = (s−1)(l + n/w) + l. In our performance model, we will use the maximum latency, L(s, n), instead of the average latency, because Panda performance is determined by the slowest node. The raw network bandwidth w for FDDI is 100 Mb/s, and the raw network latency l is taken from [Banks93] as 10 µs.
With the adjusted latency, the time for s senders to each send a k-byte message is T1 = o + L(s, k) + G(k−1). For s processes each to send n k-byte messages, it takes time Tn = o + L(s, nk) + n G(k−1) + (n−1) max(g, o). In a more general form, the time for a process to send n1 messages and receive n2 messages of size k, when there are s processes including itself, each of which sends at most n3 k-byte messages, is

    T(n1, n2) = o + L(s, n3 k) + n1 G(k−1) + (n1 − 1) max(g, o)
                + n2 G(k−1) + n2 max(g, o)                            (1)
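For reference, a direct transcription of the adjusted latency and Equation (1) into code (our restatement, with times in microseconds and the Table 1 values) looks as follows; the example call at the end uses made-up message counts purely for illustration.

```python
o, g = 235.4, 300.0      # measured MPI overhead and gap (Table 1), microseconds
G1, G2 = 4.5, 0.30       # gap per byte: schema messages vs. large data messages
l, w = 10.0, 12.5        # raw FDDI latency (us) and bandwidth (bytes/us = 12.5 MB/s)

def adjusted_latency(s, n):
    """L(s, n): the slowest of s concurrent senders of n-byte messages waits
    for the other s-1 transfers plus one raw latency on the FDDI ring."""
    return (s - 1) * (l + n / w) + l

def T(n1, n2, s, n3, k, G):
    """Equation (1): time for a process to send n1 and receive n2 k-byte
    messages while s processes each send at most n3 such messages."""
    return (o + adjusted_latency(s, n3 * k)
            + n1 * G * (k - 1) + (n1 - 1) * max(g, o)
            + n2 * G * (k - 1) + n2 * max(g, o))

# Example: an I/O node sends 1 and receives 3 data messages of 256 KB while
# 4 processes each send at most one such message: roughly 0.38 s.
print(T(1, 3, 4, 1, 256 * 1024, G2) / 1e6, "s")
```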
Table 2 shows the parameters needed to model Panda's subchunk gather operations when part-time I/O is used. Scatter operations are the opposite of gathers and can be modeled using the same parameters. For a gather operation, I/O nodes request remote chunks by sending schema messages to the appropriate remote overlapping clients; then each I/O node copies a chunk from the local client if there is a local overlap. Next, the I/O nodes receive the requested chunk data from the corresponding compute nodes. The model for gathering a subchunk assumes that each I/O node has the same number of overlapping clients and that overlapping subchunks are of roughly equal size. When this is not the case, the model needs to be refined by including additional parameters.
    I            Number of I/O nodes
    C            Number of nodes which send chunk data to a remote I/O node concurrently
    N            Size of an overlapping subchunk
    Schema_len   Size of a schema message
    I_schema     Number of I/O nodes which actually make data requests to remote compute nodes
    Req_proc     Maximum number of data requests that a compute node processes
    Req_send     Maximum number of remote overlapping clients among the I/O nodes
    Req_rec      Maximum number of schema messages received by an I/O node
    Local        1 if all I/O nodes currently have a local overlap, otherwise 0

Table 2: Panda parameters necessary to model a gather/scatter operation when part-time I/O is used.
The total time that Panda spends gathering subchunks for an I/O request for a single array is

    T_total = (array_size / (subchunk_size × I)) × T_subchunk        (2)

The total time to gather a subchunk at an I/O node is the sum of the time to send schema messages and the time to receive chunk data from compute nodes:

    T_subchunk = T_schema + T_data        (3)
T_schema is the time for I_schema processes to send Req_send messages of size Schema_len plus the time to receive Req_rec schema messages, and can be computed using Equation 1:

    T_schema = o + L_schema
               + (Req_send + Req_rec) G1 (Schema_len − 1)
               + (Req_send + Req_rec − 1) max(o, g)        (4)
T_data is the time for all part-time I/O nodes to send Req_rec and receive Req_send messages of size N each, when there are a total of C processes each trying to send Req_proc messages of size N, plus the time to do a local copy if the I/O node has a local overlapping subchunk:

    T_data = o + L_data
             + (Req_rec + Req_send) G2 (N − 1)
             + (Req_rec + Req_send − 1) max(o, g)
             + Local × Memcpy(N)        (5)
L_schema and L_data can be computed using the adjusted latency described earlier. The values for the parameters in Table 2 are easily determined given the array distributions and the meshes for compute and I/O nodes. Parameter values for different configurations are presented separately in Table 3. Table 4 shows the actual and predicted Panda response times to write an array under different compute and I/O node meshes. Arrays are distributed (BLOCK, BLOCK) in memory and (BLOCK, *) or (*, BLOCK) on disk. Across all the different array sizes, compute node meshes, and I/O node meshes, the predicted time is accurate within a 13% margin of error.
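Composing Equations (2) through (5) gives a small response-time predictor. The sketch below is our reading of the model, not Panda's own code: the arguments passed to the adjusted latency for L_schema and L_data follow the descriptions above, the memcpy cost per byte is an assumed placeholder, and the example parameter values are hypothetical.

```python
def predict_gather_time(array_bytes, subchunk_bytes, I, C, N, schema_len,
                        I_schema, req_proc, req_send, req_rec, local,
                        o=235.4, g=300.0, G1=4.5, G2=0.30, l=10.0, w=12.5,
                        memcpy_us_per_byte=0.005):
    """Predicted time (microseconds) to gather all subchunks of one array,
    composed from Equations (2)-(5).  Sizes are in bytes; the memcpy cost
    per byte is an assumed placeholder, not a measured value."""
    def adjusted_latency(s, n):            # section 4.2
        return (s - 1) * (l + n / w) + l

    # Equation (4): schema messages sent and received for one subchunk.
    t_schema = (o + adjusted_latency(I_schema, req_send * schema_len)
                + (req_send + req_rec) * G1 * (schema_len - 1)
                + (req_send + req_rec - 1) * max(o, g))

    # Equation (5): remote chunk data for one subchunk, plus a local copy.
    t_data = (o + adjusted_latency(C, req_proc * N)
              + (req_rec + req_send) * G2 * (N - 1)
              + (req_rec + req_send - 1) * max(o, g)
              + local * memcpy_us_per_byte * N)

    t_subchunk = t_schema + t_data                              # Equation (3)
    return (array_bytes / (subchunk_bytes * I)) * t_subchunk    # Equation (2)

# Hypothetical configuration: 64 MB array, 1 MB subchunks, 4 I/O nodes,
# 4 concurrent senders, 256 KB overlapping subchunks, 68-byte schema messages.
t = predict_gather_time(64 * 2**20, 2**20, I=4, C=4, N=256 * 2**10,
                        schema_len=68, I_schema=4, req_proc=1,
                        req_send=3, req_rec=3, local=1)
print(round(t / 1e6, 2), "s")
```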
4.3 Analysis
We now use the validated performance model to explain the performance trends seen in the earlier experiments. From Equation 2, the term T_subchunk / I determines Panda response time for an I/O request, because the array size and subchunk size are fixed. T_subchunk, in turn, depends on T_schema and T_data as in Equation 3. But the schema messages are very short compared to the data messages and contribute very little to T_subchunk, so we focus instead on T_data / I. In Equation (5) for T_data, the dominant factor that determines performance variations between configurations is Req_rec + Req_send, so the analysis below focuses on the value of this factor divided by I.
Table 3 shows Panda parameter values for various configurations using the default subchunk size, 1 MB, and fixed or optimal I/O nodes. We will use the Panda parameters with subscripts fix and opt to distinguish the values for fixed and optimal part-time I/O nodes.
We are now ready to explain the different speedups as we move from 2 to 4 optimal I/O nodes in the 2×4 and 4×2 compute meshes in Figures 3 and 4. If we evaluate the coefficient (Req_rec_opt + Req_send_opt) / I for both meshes, the value for the 2×4 compute node mesh changes from 1.5 to 1.0, whereas for the 4×2 mesh it drops from 0.75 to 0.25. This coefficient highlights the maximum amount of data transferred from remote compute nodes. Thus the decrease in remote data transfer by a factor of 3 largely explains the superlinear speedup for the 4×2 compute node mesh when we increase the number of I/O nodes.
The model also explains the performance with fixed I/O nodes when the (*, BLOCK) distribution is used on disk, as in Figures 5 and 6. For a 2×4 compute mesh, the model predicts that T_subchunk / I is 0.25, 0.14, and 0.16 seconds when I is 2, 4, and 8, respectively. Thus the model correctly predicts that 4 I/O nodes yield the shortest response time. To understand the slowdown when we move from 4 to 8 I/O nodes, we examine the change in the value of the coefficient (Req_rec_fix + Req_send_fix) / I: it increases by 50%, and it has its minimum value when there are 4 I/O nodes. The slowdown caused by increasing the number of fixed I/O nodes from 2 to 4 for a 4×2 compute node mesh can be explained similarly.
The value of T_subchunk / I can also predict the mesh and the number of I/O nodes that yield the shortest response time. The configuration with the smallest T_subchunk / I yields the shortest response time, assuming all I/O nodes have the same disk performance. The 4×2 compute node mesh with 8 I/O nodes has the smallest value of T_subchunk / I for the (BLOCK, *) distribution on disk, which matches the experimental results. For the (*, BLOCK) distribution on disk, a 2×4 compute mesh with either 4 or 8 I/O nodes has the smallest value of T_subchunk / I. The experimental results show improved performance for 8 I/O nodes, but the improvement is within the margin of error of the model.
Minimizing remote data transfer also simplifies the communication pattern, making better use of the peak message passing bandwidth. We use the performance model to determine how much time Panda spends in remote data transfer and then examine how close it comes to the peak MPI bandwidth when I/O node selection is optimized.
Figure 11 shows the fraction of peak MPI throughput attained for remote data transfer in Panda. First, we divided Panda's predicted throughput by the MPI throughput measured earlier; the result shows that around 60% of the available throughput is used with fixed I/O nodes and 80% with optimal I/O nodes. However, on a shared-media network, if two or more nodes start to send messages of a given size at the same time, they will complete the transfers at different times. Since Panda's performance is limited by the slowest node, in practice Panda's performance will be determined by the bandwidth obtained by the node experiencing the most latency. Dividing Panda's predicted throughput for remote data transfer by the average MPI throughput assuming maximum latency gives the second graph, in which Panda attains 85–98% of the average MPI throughput, except with a 4×2 compute mesh, 8 I/O nodes, and a (*, BLOCK) distribution on disk. The problem in that case is that 4 I/O nodes have identical overlapping clients, so the overlapping chunk transfers are serialized.
5 Related Work
Our work is related to previous work on chunking multidimensional arrays on secondary storage and on parallel I/O on workstation clusters.
[Sarawagi94] is an effort to enhance the POSTGRES DBMS to support large multidimensional arrays. Chunking is used to reduce the number of disk blocks to be fetched for a given access pattern in read-intensive applications. [Seamons94] describes chunking in Panda as well as Panda's interface for write-intensive applications on massively parallel platforms. Technology transfer from the Panda project has helped lead to an implementation of multidimensional array chunking, coupled with compression, in the HDF scientific data management package [Velamparampil97].
PIOUS [Moyer94] is pioneering work in parallel I/O on workstation clusters. PIOUS is a parallel file system with a Unix-style file interface; coordinated access to a file is guaranteed using transactions. Our strategies for collective I/O and for slow shared-media networks could be built on top of PIOUS.
VIP-FS [Harry95] provides a collective I/O interface for scientific applications running in parallel and distributed environments. Its assumed-request strategy is designed for distributed systems where the network is a potentially congested shared medium. It reduces the number of I/O requests made by all the compute nodes involved in a collective I/O operation, in order to reduce congestion. In our experience with FDDI and Ethernet, the time saved by minimizing remote data transfer generally outweighs the potential savings from reducing I/O request messages.
VIPIOS [Brezany96] is a design for a parallel I/O system to be used in conjunction with Vienna Fortran. VIPIOS exploits logical data locality in mapping between servers and application processes and physical data locality between servers and disks, which is similar to our approach of exploiting local data. Our approach adds an algorithm that guarantees minimal remote data access during I/O, quantifies the savings, and uses an analytical model to explain other performance trends.
Our research is also related to better utilization of resources on massively parallel processors, e.g., using part-time I/O. Disk Resident Arrays [Nieplocha96] is an I/O library for out-of-core computation on distributed memory platforms. For I/O operations, a fixed set of compute nodes is selected as part-time I/O nodes; thus on an interconnect like FDDI, DRA could probably benefit from consideration of array distributions, meshes, and access patterns.
[Kotz95] exploits the I/O nodes of a MIMD multiprocessor as compute nodes for applications that can tolerate I/O interrupts. The results from their simulation study show that I/O nodes are mostly underutilized, with 80–97% availability for computation, except for some complicated file-request patterns. Like our work, this is an effort to utilize as many nodes as possible for computation. But their assumption was that I/O nodes and compute nodes are separate and that I/O nodes are utilized by another application, whereas our notion of part-time I/O assumes there is no distinction between compute nodes and I/O nodes and that one application can use the entire node. One advantage of our assumption is that I/O nodes can be chosen flexibly, giving customized service to each application and increasing the amount of local data transfer.
Jovian-2 [Acharya96] allows all nodes to run both an application thread and an I/O server thread in its peer-to-peer configuration. Each node can make an I/O request to another, so the processing of an I/O request can be delayed by the computation performed by the requested node. Their remedy for slow I/O performance was to make use of global I/O requirement information available from the application and to prefetch and cache data to minimize I/O time. Also, in contrast to the other works cited in this section, [Acharya96] argues that collective I/O is not as efficient as I/O to a single distributed file using an independent file pointer for each compute node, if the application code is appropriately organized. We prefer to leave the application code as is, and concentrate on handling high-level I/O requests efficiently.
Previous work from the Panda group also targets part-time I/O. [Subraman96] describes Panda 2.1's implementation of part-time I/O and gives basic performance results. [Kuo96] experimented with part-time I/O for a real application on the SP2, and found that part-time I/O can also be used efficiently for an application that must offload its data to Unitree when the application finishes, provided the compute nodes not involved in offloading the data can be released right away.
6 Summary and Conclusions
For a workstation cluster where message passing delivers low bandwidth to the end user due to software overhead and shared media, message passing is more likely than disk speed to be the bottleneck for parallel I/O. This paper has shown that in such an environment, proper selection of parameters like the compute node mesh, array distribution, and number of I/O nodes has a large impact on I/O performance.
We experimented with the Panda parallel I/O library on a workstation cluster connected by FDDI and Ethernet, two currently popular LANs. On a small cluster, it is not practical to dedicate nodes to I/O, so we used a "part-time I/O" strategy in which compute nodes double as I/O nodes. Another important issue, given a relatively slow interconnect, is the choice of I/O nodes so as to minimize both message traffic across the network and waiting time. We introduced an algorithm that chooses I/O nodes so as to guarantee the minimum amount of data movement across the network, and found that it gave speedups of 1.2–2.1 when compared with the naive choice of I/O nodes on an 8-node HP workstation cluster using FDDI. If optimal I/O nodes are to be used for persistent arrays, the I/O nodes need to maintain directory information regarding the arrays stored, as Panda does in its "schema" files.
Even with the optimal choice of I/O nodes, Panda's performance is very sensitive to the compute node mesh and array distribution. The ability to predict performance is important if we are to guide the user in the selection of the mesh, the array distribution on disk, and the number of I/O nodes. We introduced an analytical performance model that accurately predicts performance on an FDDI interconnect, and used it to select the optimal values for these parameters. The performance model can be reused for different interconnects, such as ATM or Myrinet [Boden95], if values for the relevant message passing parameters are measured and the latency is appropriately modeled.
This study is the first step toward a version of Panda that can automatically make optimal choices of meshes and numbers of I/O nodes, when scientists want to be relieved of that task. In the future, we also plan to study the performance of Panda on a larger cluster with a faster interconnect, the switch-based Myrinet, and to explore alternative I/O strategies for that platform.
7 Acknowledgements
This research was supported by an ARPA Fellowship in High Performance Computing administered by the Institute for Advanced Computer Studies at the University of Maryland, by NSF under PYI grant IRI 89 58582, and by NASA under NAGW 4244 and NCC5 106.
We thank John Wilkes for providing access to the HP workstation cluster at HP Labs in Palo Alto and for valuable comments on our initial manuscript draft. We also thank Milon Mackey for system information describing the HP cluster, and all the users at HPL who suffered through our heavy use of the cluster and its Ethernet connection. We also thank David Kotz at Dartmouth College for providing access to the FLEET lab.
References
[Acharya96] A. Acharya, M. Uysal, R. Bennett, A.
Mendelson, M. Beynon, J. Hollingsworth, J.
Saltz and A. Sussman, Tuning the Performance
of I/O-Intensive Parallel Applications, Proceedings of the Fourth Annual Workshop on I/O in
Parallel and Distributed Systems, pages 15-27,
May 1996.
[Alexandrov95]
A. Alexandrov, M. Ionescu, K. Schauser, and
C. Scheiman, LogGP: Incorporating Long Messages into the LogP Model - One Step Closer
Towards a Realistic Model for Parallel Computation, Proceedings of the 7th Annual ACM
Symposium on Parallel Algorithms and Architectures, pages 95-105, 1995.
[Banks93] D. Banks and M. Prudence, A High-Performance Network Architecture for a PA-RISC Workstation, IEEE Journal on Selected Areas in Communications, Vol. 11, No. 2, pages 191-202, February 1993.
[Boden95] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic and W. Su, Myrinet:
A Gigabit-per-Second Local Area Network,
IEEE Micro, 15(1), pages 29-36, February 1995.
[Brezany96] P. Brezany, T. A. Mueck and E. Schikuta,
A Software Architecture for Massively Parallel
Input-Output, Proceedings of the Third International Workshop PARA'96, LNCS Springer
Verlag, Lyngby, Denmark, August 1996.
[Chen96] Y. Chen, M. Winslett, S. Kuo, Y. Cho, M.
Subramaniam and K. Seamons, Performance
Modeling for the Panda Array I/O Library,
Proceedings of Supercomputing 96, November
1996.
[Culler93] D. Culler, R. Karp, D. Patterson, A. Sahay,
K. E. Schauser and E. Santos, LogP: Towards a
Realistic Model of Parallel Computation, Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, May 1993.
[Harry95] M. Harry, J. Rosario and A. Choudhary, VIP-FS: A Virtual, Parallel File System for High Performance Parallel and Distributed Computing, Proceedings of the Ninth International Parallel Processing Symposium, April 1995.
[Keeton95] K. K. Keeton, T. E. Anderson and D. A.
Patterson, LogP Quantified: The Case for Low-Overhead Local Area Networks, Hot Interconnects III, August 1995.
[Kotz95] D. Kotz and T. Cai, Exploring the Use of
I/O Nodes for Computation in a MIMD Multiprocessor, Proceedings of the IPPS '95 Third
Annual Workshop on I/O in Parallel and Distributed Systems, pages 78-89, 1995.
[Kuo96] S. Kuo, M. Winslett, K. E. Seamons, Y. Chen,
Y. Cho, M. Subramaniam, Application Experience with Parallel Input/Output: Panda and
the H3expresso Black Hole Simulation on the
SP2, Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[Moyer94] S. Moyer and V. Sunderam, PIOUS: A
Scalable Parallel I/O System for Distributed
Computing Environments, Computer Science
Technical Report CSTR-940302, Department of
Math and Computer Science, Emory University,
November 1994.
[MPI] Message Passing Interface Forum, MPI: A
Message-Passing Interface Standard, June
1995.
[MPICH] W. Gropp, E. Lusk and A. Skjellum, A
High-Performance, Portable Implementation of
the MPI Message Passing Interface Standard,
http://www.mcs.anl.gov/mpi/mpicharticle/
paper.html, July 1996.
[Nieplocha96] J. Nieplocha, I. Foster, Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core Computation, Proceedings of the Sixth
Symposium on the Frontiers of Massively Parallel Computation, pages 196-204, October 1996.
[Papadimitriou82] C. Papadimitriou and K. Steiglitz,
Combinatorial Optimization: Algorithms and
Complexity, Prentice-Hall Inc., 1982.
[Sarawagi94] S. Sarawagi and M. Stonebraker, Efficient Organization of Large Multidimensional Arrays, Proceedings of the International Conference on Data Engineering, pages 328-336, 1994.
[Seamons94] K. E. Seamons and M. Winslett, Physical Schemas for Large Multidimensional Arrays in Scientific Computing Applications, Proceedings of the 7th International Working Conference on Scientific and Statistical Database Management, Charlottesville, Virginia, pages 218-227, September 1994.
[Seamons95] K. E. Seamons, Y. Chen, P. Jones, J.
Jozwiak and M. Winslett, Server-Directed Collective I/O in Panda, Proceedings of Supercomputing '95, November 1995.
[Subraman96] M. Subramaniam, High Performance Implementation of Server Directed I/O, M.S. Thesis, Department of Computer Science, University of Illinois, 1996.
[Velamparampil97] G. Velamparampil, Data Management Techniques to Handle Large Data Arrays
in HDF, M.S. Thesis, Department of Computer
Science, University of Illinois, 1997.
Figure 11: Time that Panda spends on remote data transfer as a percentage of the average and worst-case MPI bandwidth. Each legend entry is of the form x-y-z, where x is a compute mesh, y is an I/O node mesh, and z is r for (BLOCK, *) or c for (*, BLOCK). The first panel divides Panda's predicted message passing throughput by the average MPI bandwidth; the second divides it by the average MPI bandwidth assuming maximum latency.

Table 3: Panda parameter values (I_schema, Req_send, Req_rec, Local, Req_proc, C, and N) for a 2D array distributed (BLOCK, BLOCK) in memory across 2×4 and 4×2 compute nodes, with fixed or optimal I/O node selection and 2, 4, or 8 I/O nodes. The columns r and c represent the values when the (BLOCK, *) and (*, BLOCK) distributions are used on disk, respectively. The 4×2 compute mesh with 2 I/O nodes and a (BLOCK, *) distribution on disk has two parameter lists because the I/O nodes gather from two different sets of clients: initially, I/O node 0 gathers a subchunk from compute nodes 0 and 1 and I/O node 1 gathers one from compute nodes 4 and 5; next, I/O node 0 gathers a subchunk from compute nodes 2 and 3 and I/O node 1 gathers one from nodes 6 and 7. The 2×4 compute node mesh with 2 I/O nodes and a (*, BLOCK) distribution on disk is similar.
    4 MB array
    compute mesh   I/O nodes   measured   predicted
    2×4            2 (r)       1.05       0.98
    2×4            4 (r)       0.76       0.72
    2×4            2 (c)       1.07       1.04
    2×4            8 (c)       0.44       0.45
    4×2            2 (r)       1.03       1.04
    4×2            4 (r)       0.56       0.63
    4×2            4 (c)       0.71       0.77
    4×2            8 (c)       0.66       0.65

    16 MB array
    compute mesh   I/O nodes   measured   predicted
    2×4            2 (r)       3.52       3.22
    2×4            4 (r)       2.30       2.31
    2×4            2 (c)       3.36       3.14
    2×4            8 (c)       1.46       1.34
    4×2            2 (r)       3.42       3.32
    4×2            4 (r)       1.52       1.33
    4×2            4 (c)       2.40       2.08
    4×2            8 (c)       2.08       1.97

    64 MB array
    compute mesh   I/O nodes   measured   predicted
    2×4            2 (r)       13.41      12.43
    2×4            4 (r)       8.28       8.66
    2×4            2 (c)       12.94      12.51
    2×4            8 (c)       4.72       4.40
    4×2            2 (r)       12.88      12.74
    4×2            4 (r)       5.41       4.83
    4×2            4 (c)       8.70       7.88
    4×2            8 (c)       8.01       7.79

Table 4: Panda response time in seconds to write a 2D array distributed (BLOCK, BLOCK) in memory, using optimal I/O nodes. In the second column, "r" and "c" stand for the (BLOCK, *) and (*, BLOCK) distributions on disk, respectively. Results are shown only for the cases where the optimal I/O nodes differ from the fixed I/O nodes.