Using Active NVRAM for I/O Staging

Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan
Georgia Institute of Technology, Atlanta, GA, USA
sudarsun@gatech.edu, ada@cc.gatech.edu, schwan@cc.gatech.edu

Dejan Milojicic, Vanish Talwar
HP Labs, Palo Alto, CA, USA
dejan.milojicic@hp.com, vanish.talwar@hp.com
ABSTRACT
As HPC machines move toward the exascale, scaling the I/O performance of applications is a well-known problem. A closely related problem is how to efficiently analyze and extract useful information from I/O data, i.e., data post-processing. With the advent of nonvolatile memory (NVM) technologies like SSD, PCM, and Memristor, research has focused on improving file system performance and on optimizations that overcome disk latencies. At the other end, there has been extensive work on 'DataStaging' or 'in situ' I/O processing, where I/O data is moved from computational cores to memory buffers on dedicated data processing or staging nodes over high performance I/O channels. The I/O data is processed on these nodes before being written to persistent storage such as disks. However, issues with such approaches include (1) the limitation that they cannot easily analyze temporal data relationships or characteristics embedded in multiple simulation output steps, due to the limited aggregate memory capacity of staging nodes, and (2) the need to 'right size' such staging memory, sometimes even for single output/checkpoint steps when data volumes are large. Failing to properly allocate staging memory buffers (2) can cause applications to block and severely degrade the performance improvements sought by the extensive parallelization efforts undertaken by application developers. The limitation posed by (1) can degrade the utility of the staging approach as seen by end users.
This paper explores an alternative solution to the 'right memory sizing' issue for staging I/O. In this solution, memory scaling avoids the cost and power constraints that DRAM imposes on machine designers by instead using active NVRAM (nonvolatile memory) to enhance the memory capacities of compute and staging nodes. Active NVRAMs are node-local NVRAMs embedded with a low power system-on-chip compute element. We propose a mechanism in which each physical node has an additional
active NVRAM component to stage I/O and to apply simple data analytics operations to the I/O data. While such node-local data storage provides an obvious I/O acceleration, our experimental results show the effectiveness of our approach in addressing the 'right memory sizing' issue through efficient I/O data processing. We also discuss the overheads of using an Active NVRAM based approach for I/O staging.
Categories and Subject Descriptors
D.4.2 [Operating Systems]: Storage Management; D.4.8
[Operating Systems]: Performance
General Terms
Design, Performance
Keywords
Non-volatile memory, DataStaging, Post processing
1. INTRODUCTION
As we move toward exascale computing systems, there
will be increased performance bottlenecks due to limited I/O
bandwidths and the costs of data movement from where it
is produced to where it is stored. As a result, it will likely
no longer be possible to analyze the output data produced
by large-scale simulations via offline, workflow-based methods that first move massively sized data to disk, then re-fetch it for repeated post-processing, analysis, and visualization. These facts are leading researchers to develop in-situ methods for post-processing in which data is reduced, re-organized, analyzed, and visualized on compute nodes ('inline', or on additional 'helper' cores), on staging nodes ('DataStaging'), or both (hybrid staging), before moving it
to longer term storage [3, 17]. With the former, a set of
compute node cores is dedicated to running analysis functions and with the latter, a dedicated set of staging nodes is
responsible for collecting I/O data from compute nodes and
asynchronously post processing it before writing it to disk.
In either case, memory resources are needed for intermediate
data buffering or storage, or to retain results obtained from
earlier output steps, where there are notable penalties when
such memory is not properly sized. More specifically, inadequate memory resources may cause the application to block
because the data have not yet been evicted from its output
buffers, and it may make certain analyses infeasible because
of insufficient space for history maintained from earlier time
steps. Another issue, of course, is the processing applied
to the data, where again, inadequate CPU resources may
cause bottlenecks that eventually result in the application
blocking on its I/O.
Figure 1 shows the general system architecture of a DataStaging framework, where a small set of nodes is dedicated to servicing the I/O needs of a large number of compute nodes. Data is moved from compute nodes to staging nodes using RDMA transport and stored in the staging nodes' DRAM (volatile memory) for processing before being flushed to disk. Even though powerful compute cores are used to process data in both the inline and staging approaches, substantial I/O staging memory is still needed for data post-processing. Failure to 'right size' staging memory can cause applications to block, thereby severely degrading the intended performance improvements. This paper explores how a new technology, active NVRAM, can be used to cope with these issues, or more precisely, to 'right size' the memory available for post-processing on compute and staging nodes.
Active NVRAM is non-volatile memory coupled with a
low power compute element (see Figure 2) attached to a
server or compute nodes. Active NVRAM [14] is based
on a system-on-a-chip architecture providing high memory
bandwidth to hosts, where each active NVRAM has its own
run time. NVRAM technologies such as Memristor [2] and
PCM [10] offer 100X faster read-write performance and have
the endurance of about a million writes. They are scalable
in terms of storage density, with the ability to store multiple bits per cell, and they consume low energy (approx. 5-10
pJ/bit to write and no refresh required) when compared to
DRAM. This makes them a suitable replacement for disks on
common compute or server nodes, perhaps relegating disks
to back end storage farms or other long term storage solutions. It also makes them viable candidates for extending
the capacities of byte-addressable memory on future exascale machines. While NVRAM provides the data persistence required for I/O data (e.g., checkpoints), the active compute element embedded with the NVRAM enables simple I/O data processing.
In this paper, we propose a system design where each physical node has an additional Active NVRAM component. The Active NVRAMs buffer I/O data and process I/O data chunks asynchronously, without affecting application performance. Section 3 discusses the system architecture in detail. Our initial analysis with active NVRAM showed high I/O performance gains over an NFS-mounted disk when using NVRAM for I/O buffering, and earlier work has already focused on such I/O throughput advantages. In this paper we focus specifically on the effectiveness of using active NVRAM for I/O staging and for simple post-processing operations (e.g., sorting) that run on the 'less powerful' low power processors associated with the NVRAM. We argue that, by using Active NVRAM, the memory sizing issue described earlier can be substantially alleviated. Our initial experiments show that the use of active NVRAM can help balance computational vs. memory/buffering needs for data, and that it can lead to more flexibility in the ratio of compute nodes to staging nodes used in the system (e.g., F. Zheng et al. [17] use a 128:1 configuration).
Figure 1: DataStaging (compute cores move I/O data over RDMA to an I/O node, where it is processed before being written to remote storage/disk)

Figure 2: Architecture - Active NVRAM

Our technical contributions include the following:
• Addressing the 'right memory sizing' issue: an analysis of the 'right memory sizing' issue in DataStaging mechanisms, and the use of an Active NVRAM based approach to I/O staging to address it.
• Application evaluation: using a real-world HPC application to evaluate our approach against the DataStaging mechanism and quantifying the benefits and implications of our Active NVRAM based approach.
2. RELATED WORK
Using NVRAMs as an I/O device for HPC applications was first proposed by Caulfield et al. [8] for an architecture called Moneta. Moneta was built as a hardware emulator to understand the benefits of using different NVRAMs such as SSD and PCM. Akel et al. [4] extended the Moneta system with a real PCM device to understand the performance implications of using NVRAMs. Dong et al. [9] further proposed using NVRAMs for HPC application checkpointing; the authors proposed a 3D PCM architecture to improve the overall I/O bandwidth for applications and showed that bandwidths of up to 30 GB/s can be achieved. The scalable checkpoint library of Moody et al. [12] achieves high I/O performance by using node-local storage (disks or SSDs) for checkpointing. Recent work on high performance file systems such as Panasas [16] and PLFS [7] aims to improve I/O performance. Panasas achieves high I/O performance by classifying data (e.g., metadata vs. actual data) and storing the classes on different hardware with object storage semantics. All of the above research focused on improving
just the I/O throughput. Our work also relies on node-local data storage and uses NVRAM to improve I/O, but it differs from prior work by not only improving I/O throughput through node-local NVRAM but also improving the post-processing (in-situ processing) performance of I/O. Active NVRAM is based on the idea of Nanostores, which aims to reduce data movement across systems.

Figure 3: DataStaging - I/O impact

Several studies have addressed in-situ processing of I/O data. To our knowledge, Ma et al. [11] were among the first to propose in-situ processing of I/O data before it is written to disk. Their approach uses the same compute cores, dedicating a few additional cycles to I/O processing. Post-processing examples cited by the authors include data filtering and compression. This mechanism avoids several copies of the data and also reduces it before moving it to remote storage. While this method can be useful for low-frequency I/O applications and for cloud-based I/O with slow data movement channels (e.g., Gigabit Ethernet), using it to process checkpoint data can severely degrade application performance [15] because of frequent pauses in application execution.

This work is inspired by our group's earlier work on DataStaging [3]. H. Abbasi et al. [3] discuss the need for efficient in-situ processing using the DataStaging framework and further describe the benefits and overheads of DataStaging. As shown in Figure 3, DataStaging incurs minimal I/O overhead when the ratio of compute to staging nodes is small, but as the ratio grows, the I/O impact becomes substantial due to memory buffering bottlenecks on staging nodes and data movement overheads, as described in Section 1. The authors address these I/O impact issues by proposing a set of I/O scheduling policies for moving data to staging memory. While scheduling methods can reduce the I/O impact, the policies are application dependent. We predict that I/O overheads can become higher on future exascale systems, where power constraints may limit the number of staging I/O nodes. F. Zheng et al. [17] also use dedicated I/O staging nodes specifically for post-processing by holding data in memory. Our initial analysis [15], using I/O-intensive benchmarks for cloud and HPC applications, evaluated the effectiveness of different post-processing applications for the 'on-compute core' (inline) approach and the Active NVRAM approach. That analysis showed that, for highly compute intensive data processing code, using only a less powerful active compute element may not be sufficient.
3. SYSTEM ARCHITECTURE
3.1 NVRAM
Future storage class memories like PCM and Memristor are byte addressable. This paper assumes that NVRAMs would be connected to the memory bus (see Figure 2) and placed in parallel with DRAM in the system memory hierarchy. Hiding them behind DRAM, by using them only as a buffer cache, limits the use cases and the application-level performance benefits gained from NVRAMs. The bandwidth of such memory devices would be limited by the system hardware interface. While it is true that using NVRAM as memory would increase device wear, prior research such as start-gap wear leveling [13] projects NVRAM lifetimes of roughly 4 to 30 years in server platforms. We also assume that such storage class memories would be segmented at page-level granularity. Our emulator partitions NVRAM into a data region and a persistent structure region. The data region holds process-level I/O data, whereas the persistent structure region holds the data structures that maintain process-level metadata across application sessions. Section 5 discusses how persistent structures are used in our framework.
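To make the emulation strategy concrete, the following minimal C sketch shows how NVRAM could be emulated with a memory-mapped file split into a persistent structure region and a data region, as described above. The backing file path, region sizes, and fixed split point are illustrative assumptions, not details of the actual emulator.

/*
 * Sketch: emulate NVRAM with a memory-mapped file and carve it into a
 * persistent structure region (metadata) and a data region (process data).
 * Sizes and the file path are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NVRAM_SIZE (1UL << 30)   /* 1 GB of emulated NVRAM (assumed size) */
#define META_SIZE  (16UL << 20)  /* persistent structure region (assumed) */

int main(void)
{
    int fd = open("/tmp/nvram.img", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, NVRAM_SIZE) != 0) {
        perror("nvram backing file");
        return 1;
    }

    /* The mapping persists across application sessions via the file. */
    void *nvram = mmap(NULL, NVRAM_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (nvram == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    void *persistent_region = nvram;                      /* metadata     */
    void *data_region       = (char *)nvram + META_SIZE;  /* process data */

    printf("persistent region at %p, data region at %p\n",
           persistent_region, data_region);

    munmap(nvram, NVRAM_SIZE);
    close(fd);
    return 0;
}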
3.2 Active compute element
The active compute elements in our design are low power cores capable of post-processing and massaging I/O data with simple operations. In our experiments, we emulate an active compute element with a dedicated 2.33 GHz Intel core, which is close to Intel Atom cores in terms of processing capability. While these cores are not efficient for extremely compute intensive data processing (e.g., high resolution data visualization or image compression), previous research such as FAWN [6] has shown that distributed data processing scales well on low power cores. This motivates the use of Active NVRAMs for in-situ data analysis. We believe that, with our proposed approach, horizontal scaling (i.e., each node having an Active NVRAM component) can be achieved and efficient distributed I/O data operations can be performed. Our results in Section 7 discuss the tradeoffs of using active compute elements.
4. DRIVING APPLICATION
We use GTC (Gyrokinetic Toroidal Code) [1] as the driving application for our analysis. GTC is a three-dimensional Particle-In-Cell code for studying micro-turbulence in magnetic confinement fusion. We chose GTC because it is I/O intensive, scalable, and has been used extensively in prior DataStaging research. As an example I/O processing workload, we built a distributed parallel merge sort application. We use the checkpoint data of the GTC application as the I/O source and sort it. Sorting is a commonly used data analytics operation in high performance computing [17]. Our parallel merge sort application consists of two steps: (1) a node-local data sort, followed by (2) a distributed global sort implemented using the MPI framework. The experimental evaluation in Section 7 analyzes the performance implications of running such post-processing codes on Active NVRAM.

Figure 4: Design

struct chunk {
    uint64_t   uid;          /* unique chunk identifier                    */
    uint64_t   order_id;     /* checkpoint time step or virtual time stamp */
    void      *phys_addr;    /* address of the chunk's data in NVRAM       */
    size_t     length;       /* total bytes of data                        */
    int        commit_flag;  /* has the data been committed?               */
    struct ops *op;          /* set of operations to apply to the chunk    */
};

Figure 5: Chunk structure
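As an illustration of the two-step sort described above, the following sketch performs a node-local sort followed by a global step over MPI. It is a simplified stand-in (the global step gathers all sorted chunks at rank 0 and re-sorts them), with hypothetical data sizes; it is not the authors' implementation.

/* Sketch: local sort per process, then a simple global sort over MPI. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Step 1: each process sorts its local chunk (e.g., checkpoint data). */
    const int n = 1 << 16;                      /* hypothetical chunk size */
    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        local[i] = (double)rand() / RAND_MAX;
    qsort(local, n, sizeof(double), cmp_double);

    /* Step 2: global sort; here, sorted chunks are gathered at rank 0 and
       merged by a final sort, standing in for a distributed merge.        */
    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)n * nprocs * sizeof(double));
    MPI_Gather(local, n, MPI_DOUBLE, all, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        qsort(all, (size_t)n * nprocs, sizeof(double), cmp_double);
        printf("globally sorted %d values\n", n * nprocs);
        free(all);
    }

    free(local);
    MPI_Finalize();
    return 0;
}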
5. SYSTEM DESIGN AND EMULATION
Since most of the currently available NVRAM prototypes, such as PCM, cannot be used directly with existing memory controllers, we use our extended, generic, user-level NVRAM framework (emulator) to understand the benefits of Active NVRAM. While the major design goal of the framework has been to support a wide range of applications, we extended it to support the I/O needs of HPC applications. Our design assumes that applications are aware of the existence of NVRAM, and our library provides applications with a set of NVRAM interfaces for writing I/O data. We first describe the design and the basic interfaces provided to applications, followed by a set of optimizations for checkpointing and multi-process support. The main design goals of this NVRAM architecture are to support
• chunk and byte level data access
• data persistence
• minimal application modification.
NVRAM data region. Each NVRAM in our design is divided into a data region and a persistent structure region (see Figure 2). An application can contain several processes, and each process has its own data compartment in the data region, similar to a process address space. An application process reads and writes data to its compartment in chunks. A chunk is a sequential stream of data of any length (unless restricted by the OS). As shown in Figure 4, a process uses the nvmalloc() call to allocate a data chunk in its data compartment. Once a data chunk is allocated, a process can access it like any regular memory variable, which makes chunks byte addressable. To emulate persistence, the framework's current implementation internally uses memory-mapped files, but its generalized design also permits the use of any memory-mapped storage device. When using memory-mapped files, the bandwidth offered by the NVRAM interface depends on the total number of free memory pages available. Further details about the NVRAM data structures and interfaces are given below.
NVRAM persistent structure region. Each NVRAM has a persistent structure region consisting primarily of process-level data structures that maintain persistence information (a process allocation table). For every chunk allocated by a process, corresponding chunk metadata is recorded in the process table entry. The chunk metadata is represented by a chunk structure, as shown in Figure 5. It contains a unique identifier and a virtual version timestamp indicating the chunk's creation time. The chunk structure also has a commit flag. Unless the application calls nvcommit(), this flag is not set, indicating that the data has not been committed by the application and can be deleted during garbage collection. In our current design, we assume it is the application's responsibility to commit the data after allocating and writing to it. Figure 4 gives an example of an application writing checkpoint data to NVRAM. Our current transactional semantics provide basic persistence guarantees, and we plan to investigate this further in the future.
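The following sketch illustrates how an application might allocate, write, and commit a checkpoint chunk using this interface. The nvmalloc() and nvcommit() calls are named in the text, but their signatures, the operation identifier, and the helper function shown here are assumptions made for illustration only.

/* Sketch: write a checkpoint chunk to NVRAM and commit it. */
#include <stddef.h>
#include <string.h>

/* Assumed library interface (hypothetical signatures). */
extern void *nvmalloc(size_t length, unsigned long order_id);
extern int   nvcommit(void *chunk, int op_id);

#define OP_SORT 1   /* hypothetical operation id: sort this chunk */

void write_checkpoint(const double *particles, size_t n, unsigned long step)
{
    /* Allocate a chunk in this process's NVRAM data compartment; the
       chunk is byte addressable once allocated. */
    double *chunk = nvmalloc(n * sizeof(double), step);
    if (chunk == NULL)
        return;

    /* Write checkpoint data directly into NVRAM (no DRAM buffering). */
    memcpy(chunk, particles, n * sizeof(double));

    /* Commit: sets the commit flag, records the requested operation, and
       enqueues the chunk metadata for the active compute element. */
    nvcommit(chunk, OP_SORT);
}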
Data processing with Active element. One of our major design goals is to make the active compute element completely transparent to the application under execution. Communication between the computational cores and the active compute element is avoided, and data synchronization bottlenecks are kept to a minimum. The active compute element and the general compute cores run separate processes, in a producer-consumer fashion. When the application commits data chunks to NVRAM, it also updates the operation structure field in the chunk metadata (see Figure 5). This operation structure indicates how the data chunk needs to be processed asynchronously by the active compute element. Our framework maintains a global active queue, and when an application commits a data chunk, the chunk metadata is also added to this global queue. Note that active queues are process independent; since multiple processes update the queue, it is protected by a spinlock. A commit guarantees an atomic, failsafe update to NVRAM and to the global metadata queue, after which the active compute element can start post-processing the data. The active compute element uses this queue to iterate through the chunk metadata added to NVRAM and uses the operation structure to determine the set of operations to apply to the data chunk. It uses the physical address field of the chunk metadata structure to fetch and process data from the process's data compartment.
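A minimal sketch of this producer-consumer interaction is shown below: committed chunk metadata is pushed onto a spinlock-protected queue, and the active compute element drains the queue and applies the requested operation. The queue layout, the use of a pthread spinlock, and the dispatch through a function pointer are illustrative assumptions; the text only specifies a spinlock-protected global metadata queue shared by multiple processes (which in the real design would live in NVRAM/shared memory rather than in a single process, as here).

/* Sketch: producer-consumer active queue for committed chunk metadata. */
#include <pthread.h>
#include <stddef.h>

struct chunk_md {
    struct chunk_md *next;
    void  *phys_addr;                    /* chunk data in data compartment */
    size_t length;
    void (*op)(void *data, size_t len);  /* operation to apply, e.g. sort  */
};

static struct chunk_md   *active_queue_head;
static pthread_spinlock_t active_queue_lock;

void active_queue_init(void)
{
    pthread_spin_init(&active_queue_lock, PTHREAD_PROCESS_PRIVATE);
}

/* Producer side: called after nvcommit() has made the chunk durable. */
void active_queue_push(struct chunk_md *md)
{
    pthread_spin_lock(&active_queue_lock);
    md->next = active_queue_head;
    active_queue_head = md;
    pthread_spin_unlock(&active_queue_lock);
}

/* Consumer side: run by the active compute element, asynchronously to the
   application processes. A real implementation would block or back off
   instead of spinning when the queue is empty. */
void *active_element_loop(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_spin_lock(&active_queue_lock);
        struct chunk_md *md = active_queue_head;
        if (md)
            active_queue_head = md->next;
        pthread_spin_unlock(&active_queue_lock);

        if (md && md->op)
            md->op(md->phys_addr, md->length);
    }
    return NULL;
}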
I/O Staging optimizations. Data ordering: Exascale projections indicate that the number of cores per physical node will increase to around 128 and that system failure rates will approach once every two hours. This implies that application checkpointing frequency is bound to increase several-fold. Currently, applications write checkpoint data to disk, where it is later used by post-processing applications such as visualization. Before data is presented to these applications,
the data layout needs to be reorganized [17], which is time consuming. As checkpoint frequency increases, such overheads are bound to multiply, affecting the performance of time sensitive post-processing applications such as real-time visualization. Our design takes this into consideration and gives applications the flexibility to order I/O data when it is committed to NVRAM. When an application commits data to NVRAM, it can set an ordering value in the metadata structure. As shown in Figure 4, the process allocation table maintains each allocation of a process in a binary tree, with each tree node corresponding to a chunk metadata entry. The chunk metadata is ordered by the order_id field shown in Figure 5. If order_id is not set, our Active NVRAM library uses the virtual timestamp to order and update the chunk in the allocation tree. While the performance measurements in this paper do not quantify the benefits, we plan to investigate the role of the active compute element in such data reorganization tasks.
I/O Buffering: In our initial design, we used DRAM to buffer I/O data until the application invoked nvcommit(), after which the data was written to NVRAM, in order to reduce wear-out of the NVRAM cells. While this is not an issue in terms of data reliability, it can hurt highly memory intensive HPC applications by reducing the overall memory available to them. To avoid such overheads, our current design does not use any DRAM buffers. As indicated by our performance measurements in Section 7, this causes a small I/O overhead due to the slower reads and writes issued directly to NVRAM.

Figure 6: Post processing performance (post-processing time for the NVRAM and Staging approaches vs. I/O data size in GB; I/O staging performs better for smaller data sizes, while active NVRAM performs better for large I/O data sizes)

6. PERFORMANCE MODEL
Our experiments compare the performance of the Active NVRAM approach with that of the DataStaging approach. To quantify the performance of Active NVRAM, we propose an evaluation model based primarily on the I/O impact on simulation (application) execution time.

Active NVRAM: App. run time (sec) ≈ C + OT, where OT (sec) = T_IO + γ
  C    - application compute time
  OT   - in-situ processing overheads
  T_IO - time to read from / write to the slower memory
  γ    - NVRAM contention and lock overheads (sec)
  Pr   - I/O data processing time (sec)

As specified in the model above, the application run time depends on the I/O time to NVRAM, given by T_IO. Our experiments in Section 7 discuss the time to read from and write to this slower memory. γ specifies the overhead due to NVRAM contention between threads. As we move toward the exascale, with increasing numbers of cores and application threads, contention effects can become substantial; we plan to investigate them and to propose methods to address them in future work. For the staging approach, the model can be formulated as

DataStaging: App. run time (sec) ≈ C + OT, where OT (sec) = T_B + T_C
  T_B - application execution block time
  T_C - communication channel overheads

As discussed earlier, T_B is primarily due to insufficient staging memory, which causes the application to pause until enough staging memory is available to buffer its I/O data. Parallel applications use interconnects for fast MPI-based messaging and communication. The DataStaging mechanism uses the same interconnects for moving I/O data from compute nodes to staging nodes. This results in interconnect contention (T_C), causing significant interference, as reported by H. Abbasi et al. [3].
7. EXPERIMENTAL EVALUATION
Experimental results were obtained on a 64-core (16-node) cluster at HP with 6 GB of memory per node and a 20 Gb/s InfiniBand interconnect. To emulate node-local active NVRAM, we dedicate one core in each node to active NVRAM processing, while the other cores are used for application computation. For all experiments, each core runs one MPI process. For the dedicated I/O staging approach, the ratio of compute cores to staging cores is 3:1. In all experiments, the number of Active NVRAM process cores and staging cores is identical, and we use the MVAPICH library with InfiniBand support. Measurements use the GTC [1] application, with the application's restart data (checkpoints) as the I/O data source. Our experiments
• evaluate the effectiveness of Active NVRAM in addressing the 'right memory sizing' issue and analyze its benefits and implications by comparing it with DataStaging, and
• discuss the overheads of using Active NVRAM, with a distributed sort as the post-processing application.
7.1 Post processing performance
Figure 6 illustrates the tradeoffs of using node-local NVRAM, with a distributed sort as the post-processing example. Experiments were performed with 36 processes executing one hundred iterations of the application. We keep the checkpoint interval constant (10 iterations) and vary the total amount of I/O data written by the application. As mentioned earlier, the bandwidth of NVRAM under our memory-mapped file emulator depends on the total free system memory available. The memory (DRAM) to NVRAM copy bandwidth is around 360 MB/s. While this value is close to some of the first real hardware prototypes [5], the bandwidth of production-level hardware is expected to be higher. The y-axis in Figure 6 shows the total time for post-processing. In the case of DataStaging, data is moved from the compute node
to a staging node queue using InfiniBand-based RDMA and then post-processed asynchronously.

Figure 7: DataStaging - Memory pressure (parallel merge sort time vs. I/O output time step; chunk size ~33 MB, 3 staging nodes; annotations mark overheads due to memory pressure/page swapping and the GTC simulation completion time)

Figure 8: Post processing time vs. I/O interval (application run time and post-processing time for the NVRAM and Staging approaches; the staging approach is better by ~15% for less frequent I/O writes, while Active NVRAM performs better as the frequency of I/O writes increases)
Observations: As seen in Figure 6, for small data sizes the I/O staging approach performs around 6% better than the NVRAM approach. With increasing I/O data size, the performance of staging diminishes because the rate at which data flows into the staging node queue is higher than the rate at which it can be processed, which substantially reduces the total available memory. With a further increase in I/O data and reduction in staging node free memory, page swapping increases, and this memory pressure (see Figure 7) directly impacts the time to process the data chunks emitted by the application. This is particularly serious for the sort operation, as it uses intermediate buffers when processing data. In the case of NVRAM, although each chunk takes longer to process than in the staging approach (due to its comparatively wimpy processors and low NVRAM bandwidth), for larger data sizes improved performance is realized by avoiding the 'right memory sizing' issues. Therefore, an important insight from this work is that storage class memories like NVRAM can be effective in extending the memory capacity of compute and staging nodes. Further, these power and cost efficient ways to extend compute or staging node memories can both reduce the memory pressure caused by in-situ data analytics and enable new online analytics that span multiple output steps or require substantial intermediate data.
7.2 I/O Frequency vs. Application Performance
Figure 8 analyzes the performance of the DataStaging and Active NVRAM approaches by varying the checkpoint (I/O) frequency of the GTC application and shows the impact on application run time. The x-axis indicates the number of compute iterations after which data is written to the staging nodes or to Active NVRAM. The top two curves show the post-processing time and the bottom curves show application execution time.
Observation: When the I/O interval is large (around 20 iterations), the time to post-process data is almost the same as the application run time for both the staging and Active NVRAM approaches. With increasing output frequency, I/O staging performs better than Active NVRAM in terms of post-processing, again due to our NVRAM implementation's relatively higher processing latency and lower bandwidth compared to IB-based I/O staging. However, when the checkpoint frequency is increased further, the performance of staging begins to degrade due to memory scarcity, as discussed throughout this paper. In contrast, the post-processing performance curve of Active NVRAM remains flat and depends directly on data size. Similarly, there is minimal impact on application run time for both approaches at low I/O frequencies, but for higher frequency I/O with DataStaging we noticed an 8-10% increase in run time, primarily attributed to staging nodes blocking when pulling I/O data.

Operation        Staging (sec)   NVRAM (sec)
Local sort       4.38            6.03
Global sort      4.76            6.14
Data movement    0.02            0.0902

Table 1: Post processing microbenchmark
7.3 Distributed Sort - Microbenchmarks
A distributed merge sort consists of a local sort followed by a global sort. Table 1 compares the performance of each sort step at finer granularity. The I/O chunk size written by each process is around 33 MB. At our experimental scale, since the data moved from compute to staging nodes at each checkpoint step is small relative to the total available network bandwidth (InfiniBand), there is very little overhead for moving data to staging nodes, whereas the data fetch costs of NVRAM post-processing are higher in our test configuration. This explains why, for data intensive operations like sorting, Active NVRAM processing is slower than I/O staging. This may change when scaling to many thousands of cores. Further, a subject of our future work will be the comparative energy costs of moving data to Active NVRAM vs. across a high performance interconnect. Our future work will also take NVRAM latency into consideration and use static instrumentation techniques to measure the total reads and writes to NVRAM.
8. CONCLUSIONS
This paper explores the problem of memory ’right sizing’
for I/O DataStaging. It proposes an alternative solution
for doing so, termed ’Active NVRAM’, which permits node
local I/O on both compute and staging nodes. We implement and evaluate an active NVRAM framework in which
data analysis can be done in NVRAM asynchronously with
the application’s execution. We use the GTC fusion modeling code [1] to evaluate post processing performance for
the common example of data sorting. We further evaluate
the tradeoffs of using active NVRAM for different I/O data
sizes, by comparing it with an approach using dedicated I/O
staging nodes. Preliminary evaluations show the benefits of using active NVRAM when staging nodes have insufficient memory resources to hold, post-process, and flush output data, which affects post-processing and application performance. The key outcome of this work is its demonstration of the potential benefits of using NVRAM to 'right size' the memory available for in-situ data analysis on staging nodes. Also assessed are limitations of NVRAM use due to relatively low per-node access bandwidths. At larger scales, however, the aggregate bandwidth of node-local NVRAM will likely outperform the relatively costly cross-section bandwidth offered by high end interconnects, and additional gains may come from the energy efficient nature of future NVRAM hardware. Our future work will address some of these limitations by improving our emulated NVRAM implementation and by investigating these problems at larger scales. In addition, we will consider other post-processing actions, ranging from efficient methods for data filtering and reduction, to computationally expensive analysis techniques like compression, to those that expand data volumes, like indexing.
9. ACKNOWLEDGMENTS
The authors would like to thank the PDAC program committee reviewers for their helpful comments. We gratefully
acknowledge the technical support and guidance provided
by researchers at Georgia Tech and HP Labs in Palo Alto.
10. REFERENCES
[1] Gyrokinetic Toroidal Code.
http://phoenix.ps.uci.edu/GTC.
[2] Memristor. http://www.hpl.hp.com/news/2008/
apr-jun/memristor.html.
[3] H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky,
K. Schwan, and F. Zheng. Datastager: scalable data
staging services for petascale applications. In HPDC,
New York, NY, 2009. ACM.
[4] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta,
and S. Swanson. Onyx: a prototype phase change
memory storage array. In Proceedings of the 3rd
USENIX conference on Hot topics in storage and file
systems, HotStorage’11, Berkeley, CA, USA, 2011.
USENIX Association.
[5] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta,
and S. Swanson. Onyx: a prototype phase change
memory storage array. HotStorage, Berkeley, CA,
USA, 2011. USENIX Association.
[6] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: a fast array of wimpy nodes. In SOSP, New York, NY, USA, 2009. ACM.
[7] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate. PLFS: a checkpoint filesystem for parallel applications. In SC, New York, NY, USA, 2009. ACM.
[8] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In MICRO, Washington, DC, USA, 2010. IEEE Computer Society.
[9] X. Dong, N. Muralimanohar, N. Jouppi, R. Kaufmann, and Y. Xie. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, New York, NY, USA, 2009. ACM.
[10] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase change memory as a scalable DRAM alternative. In ISCA. ACM, 2009.
[11] K.-L. Ma, C. Wang, H. Yu, and A. Tikhonov. In-situ processing and visualization for ultrascale simulations. Journal of Physics: Conference Series, 78(1), 2007.
[12] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In SC, Washington, DC, USA, 2010. IEEE Computer Society.
[13] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali. Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. In MICRO 42, New York, NY, USA, 2009. ACM.
[14] P. Ranganathan. From microprocessors to nanostores: Rethinking data-centric systems. Computer, 44:39-48.
[15] S. Kannan, D. Milojicic, V. Talwar, A. Gavrilovska, K. Schwan, and H. Abbasi. Using active NVRAM for cloud I/O. In OpenCirrus Summit, Atlanta, GA, 2011.
[16] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou. Scalable performance of the Panasas parallel file system. In FAST, 2008.
[17] F. Zheng, H. Abbasi, C. Docan, J. Lofstead, et al. PreDatA - preparatory data analytics on peta-scale machines. In IPDPS, Atlanta, GA, 2010.