Using Active NVRAM for I/O Staging

Sudarsun Kannan (sudarsun@gatech.edu), Ada Gavrilovska (ada@cc.gatech.edu), Karsten Schwan (schwan@cc.gatech.edu)
Georgia Institute of Technology, Atlanta, GA, USA

Dejan Milojicic (dejan.milojicic@hp.com), Vanish Talwar (vanish.talwar@hp.com)
HP Labs, Palo Alto, CA, USA

ABSTRACT
As HPC machines move toward the exascale, scaling the I/O performance of applications is a well-known problem. A closely related problem is how to efficiently analyze and extract useful information from I/O data, i.e., data post-processing. With the advent of non-volatile memory technologies (NVMs) like SSD, PCM, and Memristor, research has focused on improving file system performance and on optimizations that overcome disk latencies. At the other end, there has been extensive focus on 'DataStaging' or 'in situ' I/O processing, in which I/O data is moved from computational cores to the memory buffers of dedicated data processing or staging nodes over high performance I/O channels. The I/O data is processed on these nodes before being written to persistent storage such as disk. However, issues with such approaches include (1) the limitation that they cannot easily analyze temporal data relationships or characteristics embedded in multiple simulation output steps, due to the limited aggregate memory capacity of staging nodes, and (2) the need to 'right size' such staging memory, sometimes even for single output/checkpoint steps when data volumes are large. Failing to properly allocate staging memory buffers (2) can cause applications to block and severely degrade the performance improvements sought by the extensive parallelization efforts undertaken by application developers. The limitation posed by (1) can degrade the utility of the staging approach as seen by end users. This paper explores an alternative solution to the 'right memory sizing' issue for staging I/O. In this solution, memory scaling avoids the cost and power constraints imposed on machine designers by the use of DRAM, by instead using active NVRAM (non-volatile memory) to enhance the memory capacities of compute and staging nodes. Active NVRAMs are node-local NVRAMs embedded with a low power system-on-chip compute element. We propose a mechanism in which each physical node has an additional active NVRAM component to stage I/O and apply simple data analytics operations over the I/O data. While such node-local data storage provides an obvious I/O acceleration, our experimental results show the effectiveness of our approach in addressing the 'right memory sizing' issue through efficient I/O data processing. We also discuss the overheads of using an Active NVRAM based approach for I/O staging.
Categories and Subject Descriptors
D.4.2 [Operating Systems]: Storage Management; D.4.8 [Operating Systems]: Performance

General Terms
Design, Performance

Keywords
Non-volatile memory, DataStaging, Post processing

1. INTRODUCTION
As we move toward exascale computing systems, there will be increased performance bottlenecks due to limited I/O bandwidths and the costs of moving data from where it is produced to where it is stored. As a result, it will likely no longer be possible to analyze the output data produced by large-scale simulations via offline, workflow-based methods that first move massively sized data to disk, then re-fetch it for repeated post-processing, analysis, and visualization. These facts are causing researchers to develop in-situ methods for post-processing in which data is reduced, re-organized, analyzed, and visualized on compute nodes ('inline', or on additional 'helper' cores), on staging nodes ('DataStaging'), or both (hybrid staging), before moving it to longer term storage [3, 17]. With the former, a set of compute node cores is dedicated to running analysis functions; with the latter, a dedicated set of staging nodes is responsible for collecting I/O data from compute nodes and asynchronously post-processing it before writing it to disk. In either case, memory resources are needed for intermediate data buffering or storage, or to retain results obtained from earlier output steps, and there are notable penalties when such memory is not properly sized. More specifically, inadequate memory resources may cause the application to block because data has not yet been evicted from its output buffers, and they may make certain analyses infeasible because of insufficient space for history maintained from earlier time steps. Another issue, of course, is the processing applied to the data, where again, inadequate CPU resources may cause bottlenecks that eventually result in the application blocking on its I/O.

Figure 1 shows the general system architecture of a DataStaging framework, in which a small set of nodes is dedicated to servicing the I/O needs of a much larger number of compute nodes. Data is moved from compute nodes to staging nodes over an RDMA transport and stored in the staging nodes' DRAM (volatile memory) for processing before being flushed to disk. Even though powerful compute cores are used to process data in both the inline and staging approaches, substantial staging memory is still needed for data post-processing. Failure to 'right size' staging memory can cause applications to block, thereby severely degrading the intended performance improvements.

This paper explores how a new technology, active NVRAM, can be used to cope with such differences or, more precisely, to 'right size' the memory available for post-processing on compute and staging nodes. Active NVRAM is non-volatile memory coupled with a low power compute element (see Figure 2) attached to a server or compute node. Active NVRAM [14] is based on a system-on-a-chip architecture providing high memory bandwidth to hosts, where each active NVRAM has its own runtime. NVRAM technologies such as Memristor [2] and PCM [10] offer 100X faster read-write performance and an endurance of about a million writes. They are scalable in terms of storage density, with the ability to store multiple bits per cell, and they consume little energy compared to DRAM (approx. 5-10 pJ/bit to write, with no refresh required).
This makes them a suitable replacement for disks on common compute or server nodes, perhaps relegating disks to back end storage farms or other long term storage solutions. It also makes them viable candidates for extending the capacities of byte-addressable memory on future exascale machines. While NVRAM provides the data persistence required for I/O data (e.g., checkpointing), the active compute element embedded with the NVRAM helps with simple I/O data processing. In this paper, we propose a system design in which each physical node has an additional Active NVRAM component. The Active NVRAMs buffer I/O data and process I/O data chunks asynchronously, without affecting application performance. Section 3 discusses the system architecture in detail.

Our initial analysis with active NVRAM showed high I/O performance gains over an NFS-mounted disk when using NVRAM for I/O buffering, and earlier work has already focused on the I/O throughput advantages. In this paper we focus specifically on the effectiveness of using active NVRAM for I/O staging and simple post-processing operations (e.g., sorting) using the 'less powerful' low power processors associated with the NVRAM. We argue that, by using Active NVRAM, the memory sizing issue described earlier can be substantially alleviated. Our initial experiments show that the use of active NVRAM can help balance computational vs. memory/buffering needs for data, and it can lead to more flexibility in the ratio of the total number of compute nodes to staging nodes used in the system (e.g., F. Zheng et al. [17] use a 128:1 configuration).

Figure 1: DataStaging (compute cores move I/O data over RDMA to an I/O node, where it is processed before being written to remote disk storage)
Figure 2: Architecture - Active NVRAM

Our technical contributions include the following:
- Addressing the 'right memory sizing' issue - An analysis of the 'right memory sizing' issue in DataStaging mechanisms, and the use of an Active NVRAM approach for I/O staging to address it.
- Application evaluation - Using a real world HPC application to evaluate our approach against the DataStaging mechanism, quantifying the benefits and implications of our Active NVRAM based approach.

2. RELATED WORK
Using NVRAMs as an I/O device for HPC applications was first proposed by M. Caulfield et al. [8] for an architecture called Moneta. Moneta was built as a hardware emulator to understand the benefits of using different NVRAMs such as SSD and PCM. A. Akel et al. [4] extended the Moneta system with a real PCM device to understand the performance implications of using NVRAMs. X. Dong et al. [9] further proposed using NVRAMs for HPC application checkpointing. The authors proposed a 3D PCM architecture to improve the overall I/O bandwidth for applications and showed that bandwidths of up to 30 GB/s can be achieved. The scalable checkpoint library [12] achieves high I/O performance by using node-local storage disks (or SSDs) for checkpointing. Recent work on high performance file systems such as Panasas [16] and PLFS [7] aims to improve I/O performance. Panasas achieves high I/O performance by classifying data (e.g., metadata vs. actual data) and storing the classes on different hardware with object storage semantics. All of the above research focused on improving just the I/O throughput.
" + ,-)./00/ Figure 3: DataStaging- I/O impact just the I/O throughput. Our work too relies on node local data storage and uses NVRAM to improve I/O but differs from other work by not only improving I/O throughput by using node local NVRAM but also increasing the post processing performance (in-situ processing) of I/O. Active NVRAM is based on the idea of Nanostores, which aims to reduce data movement across systems. Several studies have been done for in-situ data processing of I/O data. To our knowledge, K.L. Ma et al. [11] were one of the first to propose in-situ processing of I/O data before writing to disk. Their approach was to use the same compute core by dedicating few additional cycles for I/O processing also. Some of the post processing examples cited by the authors were data filtering and compression. This mechanism avoids several copies of the data, and also reduces it before moving them to a remote storage. While this method can be useful for low frequency I/O applications, and for cloud based I/O with slow data movement channel (i.e., Gigabit Ethernet), using them for data processing on checkpoint data can completely degrade the application performance [15] because of frequent pauses in application execution. This work is inspired by our group’s earlier work on DataStaging [3]. H. Abbasi et al. [3] discuss the need for efficient insitu processing using the DataStaging framework. The work further describes the benefits and overheads of DataStaging. As shown in Figure 3, DataStaging has a minimum I/O overhead when the ratio of ’compute to staging nodes’ is small. But with increasing ratio, the I/O impact becomes substantial due to memory buffering bottlenecks of staging nodes and data movement overheads as described in Section 1. The authors address the I/O impact issues by proposing a set of I/O scheduling policies to move data to staging memory. While scheduling methods can reduce the I/O impact, the policies are application dependent. We predict, the I/O overheads can become higher for future exascale systems where power factor can limit the number of staging I/O nodes. F. Zheng et al. [17] too use dedicated I/O staging nodes specifically for post processing by holding data in memory. Our initial analysis [15] using I/O intensive benchmarks for Cloud and HPC applications evaluated effectiveness of different post processing applications 3. SYSTEM ARCHITECTURE 3.1 NVRAM Future storage class memories like PCM and Memristors are byte addressable. This paper assumes that, NVRAMs would be connected to the memory bus (see Figure 2), and placed in parallel to DRAM in the system memory hierarchy. Hiding them behind DRAM limits the use cases and application level performance benefits gained by using NVRAMs, by just using them as buffer cache. The bandwidth of such memory devices would be dependent/limited by the system hardware interface. While it is true that using NVRAM as memory nodes would increase the wear and tear of such devices, prior research like start gap wear leveling [13] project life time of NVRAMs close to 4 - 30 years in server platforms. Also we assume that such storage class memories would be segmented in page level granularity. Our emulator partitions NVRAM to an data region and persistent structure region. Data region has the process level I/O data whereas persistent structure region has data structures required which maintains process level meta-data across application sessions. Section 5 discusses how persistent structures are used in our framework. 
3.2 Active compute element
The active compute elements in our design are low power cores capable of post-processing and massaging I/O data with simple operations. In our experiments, we emulate the active compute element with a dedicated 2.33 GHz Intel core, which is close to Intel Atom cores in terms of processing capability. While these cores are not efficient for extremely compute intensive data processing (e.g., high resolution data visualization, image compression), previous research such as FAWN [6] has shown high scalability for distributed data processing with low power cores. This motivates the use of Active NVRAMs for in-situ data analysis. We believe that with our proposed approach, horizontal scaling (i.e., each node having an Active NVRAM component) can be achieved and efficient distributed I/O data operations can be performed. Our results in Section 7 discuss the tradeoffs of using active compute elements.

4. DRIVING APPLICATION
We use GTC (Gyrokinetic Toroidal Code) [1] as the driving application for our analysis. GTC is a three-dimensional particle-in-cell code for studying micro-turbulence in magnetic confinement fusion. We chose GTC because it is I/O intensive, scalable, and has been used extensively in prior DataStaging related research. As an I/O processing example, we built a distributed parallel merge sort application. We use the checkpoint data of the GTC application as the I/O source and sort it. Sorting is a commonly used data analytics operation in high performance computing [17]. Our parallel merge sort consists of two steps: (1) a node-local data sort, followed by (2) a distributed global sort implemented using the MPI framework; a simplified sketch appears below. Experimental evaluations in Section 7 analyze the performance implications of using such post-processing codes with Active NVRAM.

Figure 4: Design
Figure 5: Chunk structure -
struct chunk {
    uid          - unique id
    order_id     - chkpt time step or virtual time stamp
    phys_addr    - address in NVRAM
    length       - total bytes of data
    commit_flag  - has data been committed?
    struct ops *op - set of operations on a chunk
}
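To make the two-step structure concrete, the following sketch shows a distributed sort in the spirit of the post-processing example: each rank sorts its local chunk, then a global ordering is produced with MPI collectives. The gather-based merge at rank 0 and the fixed-size integer records are simplifying assumptions for illustration, not the paper's actual implementation, which operates on GTC checkpoint data.

```c
/* Sketch of the two-step distributed sort: a node-local sort followed by a
 * global merge over MPI. Gathering to rank 0 is a simplification of the
 * distributed global sort described above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 4  /* records per rank, kept tiny for illustration */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Step 1: node-local sort of this rank's chunk of "checkpoint" data. */
    int local[LOCAL_N];
    srand(rank + 1);
    for (int i = 0; i < LOCAL_N; i++)
        local[i] = rand() % 1000;
    qsort(local, LOCAL_N, sizeof(int), cmp_int);

    /* Step 2: global sort, here reduced to a gather plus a final sort. */
    int *all = NULL;
    if (rank == 0)
        all = malloc((size_t)size * LOCAL_N * sizeof(int));
    MPI_Gather(local, LOCAL_N, MPI_INT, all, LOCAL_N, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        qsort(all, (size_t)size * LOCAL_N, sizeof(int), cmp_int);
        printf("globally sorted %d records\n", size * LOCAL_N);
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```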
5. SYSTEM DESIGN AND EMULATION
Since most currently available NVRAM prototypes, such as PCM, cannot be used directly with existing memory controllers, we use our extended, generic user-level NVRAM framework (emulator) to understand the benefits of Active NVRAM. While the framework's major design goal has been to support a wide range of applications, we extended it to support the I/O needs of HPC applications. Our design assumes that applications are aware of the existence of NVRAM; our library provides applications with a set of NVRAM interfaces for writing I/O data. We first describe the design and the basic interfaces provided to the application, followed by a set of optimizations supporting checkpointing and multiple processes. The main design goals of this NVRAM architecture are to support
- chunk and byte level data access,
- data persistence,
- minimal application modification.

NVRAM data region. Each NVRAM in our design is divided into a data region and a persistent structure region (see Figure 2). An application can contain several processes, and each process has its own data compartment in the data region, similar to a process address space. An application process can read and write data to its compartment in chunks. A chunk is a sequential stream of data of any length (unless restricted by the OS). As shown in Figure 4, a process uses the nvmalloc() call to allocate a data chunk in its data compartment. Once a data chunk is allocated, a process can access it like any regular memory variable, which makes chunks byte addressable. To emulate persistence, the framework's current implementation internally uses memory mapped files, but its generalized design also permits the use of any memory mapped storage device. When using memory mapped files, the bandwidth offered by the NVRAM interface depends on the total number of free memory pages available. Further details about the NVRAM data structures and interfaces are described below.

NVRAM persistent structure region. Each NVRAM has a persistent structure region primarily consisting of process level data structures that maintain persistence information (the process allocation table). For every chunk allocated by a process, corresponding chunk metadata is recorded in the process table entry. The chunk metadata is represented by the chunk structure shown in Figure 5. It holds a unique identifier and a virtual version timestamp indicating the chunk's creation time. The chunk structure also has a commit flag. Unless the application calls nvcommit(), this flag is not set, indicating that the data has not been committed by the application and can be deleted during garbage collection. In our current design, it is the application's responsibility to commit the data after allocating and writing to it. Figure 4 gives an example of an application writing checkpoint data to NVRAM; a hypothetical code sketch of this pattern is shown below. Our current transactional semantics provide basic persistence guarantees, which we plan to investigate further in the future.
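The paper names nvmalloc() and nvcommit() but does not give their signatures; the sketch below assumes plausible signatures (a chunk name, a length, an order_id, and an operation tag at commit time) purely to illustrate the allocate-write-commit pattern described above. The stub bodies are toy stand-ins, not the framework's implementation.

```c
/* Hypothetical sketch of the allocate/write/commit pattern used to write one
 * checkpoint to Active NVRAM. Only the call names nvmalloc()/nvcommit() come
 * from the paper; signatures, OP_SORT, and the stubs are assumptions. */
#include <stdlib.h>
#include <string.h>

#define OP_SORT 1   /* operation the active element should apply to the chunk */

/* Toy stand-ins so the sketch is self-contained; the real library would back
 * these with the emulated NVRAM data and persistent structure regions. */
static void *nvmalloc(size_t length, const char *chunk_name)
{
    (void)chunk_name;
    return malloc(length);
}

static int nvcommit(void *chunk, size_t length, unsigned order_id, int op)
{
    /* Would atomically set the commit flag, record order_id and the operation
     * in the chunk metadata, and append that metadata to the active queue. */
    (void)chunk; (void)length; (void)order_id; (void)op;
    return 0;
}

/* Allocate a chunk, fill it like ordinary memory, then commit it so the
 * active compute element can post-process it asynchronously. */
int write_checkpoint(const double *particles, size_t n, unsigned step)
{
    size_t bytes = n * sizeof(double);
    double *chunk = nvmalloc(bytes, "gtc_restart");
    if (chunk == NULL)
        return -1;
    memcpy(chunk, particles, bytes);   /* chunks are byte addressable */
    return nvcommit(chunk, bytes, step, OP_SORT);
}
```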
Data processing with the Active element. One of our major design goals is to make the active compute element completely transparent to the application under execution. Communication between the computational cores and the active compute element is avoided, and data synchronization bottlenecks are kept to a minimum. The active compute element and the general compute cores execute different processes, in a producer-consumer fashion. When the application commits data chunks to NVRAM, it also updates the operation structure field in the chunk metadata (see Figure 5). This operation structure indicates how the data chunk should be processed asynchronously by the active compute element. Our framework maintains a global active queue; when the application commits a data chunk, the chunk metadata is also added to this global queue. Note that active queues are process independent. Since multiple processes update the queue, it is protected by a spinlock. A commit guarantees an atomic, failsafe update to NVRAM and to the global metadata queue, after which the active compute element can start post-processing the data. The active compute element uses this queue to iterate through the chunk metadata added to NVRAM and uses the operation structure to determine the set of operations to apply to the data chunk. It then uses the physical address field of the chunk metadata structure to fetch and process data from the process data compartment.
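The producer-consumer interaction above can be summarized with a small sketch of the consumer side. The queue layout, lock primitive, and function names are illustrative assumptions; the paper specifies only that a spinlock-protected global queue of committed chunk metadata drives the active element.

```c
/* Sketch of the active compute element's consumer loop: drain the global
 * active queue of committed chunk metadata and apply the requested operation.
 * Structure layout and names are illustrative assumptions. */
#include <pthread.h>
#include <stddef.h>

struct chunk_meta {
    unsigned long uid;
    unsigned long order_id;   /* checkpoint step or virtual timestamp        */
    void *phys_addr;          /* location in the process data compartment    */
    size_t length;
    int op;                   /* e.g., OP_SORT                               */
    struct chunk_meta *next;
};

static struct chunk_meta *active_queue;   /* global active queue             */
static pthread_spinlock_t queue_lock;     /* initialized with pthread_spin_init()
                                             before the loop starts          */

/* Placeholder for the post-processing operation (e.g., the local sort). */
static void apply_operation(int op, void *data, size_t length)
{
    (void)op; (void)data; (void)length;
}

void active_element_loop(volatile int *running)
{
    while (*running) {
        pthread_spin_lock(&queue_lock);
        struct chunk_meta *m = active_queue;   /* pop one committed chunk     */
        if (m != NULL)
            active_queue = m->next;
        pthread_spin_unlock(&queue_lock);

        if (m == NULL)
            continue;                          /* nothing to post-process     */

        /* Fetch the chunk from NVRAM via its physical address and apply the
         * operation named in its metadata, asynchronously to the producer. */
        apply_operation(m->op, m->phys_addr, m->length);
    }
}
```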
I/O Staging optimizations.
Data ordering: Exascale projections suggest that the number of cores per physical node will increase to around 128, and that system failure rates will approach one failure every two hours. This implies that application checkpointing frequency is bound to increase several fold. Currently, applications write checkpoint data to disk, where it is later used by post-processing applications such as visualization. Before data is presented to these applications, the data layout needs to be reorganized [17], which is time consuming. With increasing checkpoint frequency, such overheads are bound to multiply, affecting the performance of time sensitive post-processing applications like real time visualization. Our design takes this into consideration and provides flexibility for the application to order I/O data when it is committed to NVRAM. When an application commits data to NVRAM, it can set an ordering value in the metadata structure. As shown in Figure 4, the process allocation table maintains each allocation of a process in a binary tree, with each node in the tree corresponding to a chunk metadata entry. The chunk metadata is ordered by the order_id field shown in Figure 5. If order_id is not set, our Active NVRAM library uses the virtual timestamp to order and update the chunk in the allocation tree. While our performance measurements in this paper do not quantify the benefits, we plan to investigate the role of the active compute element in such data reorganization tasks.

I/O Buffering: In our initial design, we used DRAM to buffer I/O data until the application invoked nvcommit(), after which the data was written to NVRAM to reduce wear-out of the NVRAM cells. While this is not an issue in terms of data reliability, it can affect the performance of highly memory intensive HPC applications by reducing the overall memory availability. To avoid such overheads, our current design does not use any memory buffers. As indicated by our performance measurements in Section 7, this causes a small I/O overhead due to slow reads and writes directly to NVRAM.

6. PERFORMANCE MODEL
Our experiments compare the performance of the Active NVRAM approach with the DataStaging approach. To quantify the performance of Active NVRAM, we propose an evaluation model, based primarily on the I/O impact on simulation (application) execution time.

Active NVRAM:
App. run time (sec) ≈ C + O_T
O_T (sec) = T_IO + γ
where
C    - application compute time
O_T  - in-situ processing overheads
T_IO - time to read from / write to the slower memory
γ    - NVRAM contention and locking overhead (sec)
P_r  - I/O data processing time (sec)

As specified in the model above, the application run time depends on the I/O time to NVRAM, given by T_IO. Our experiments in Section 7 discuss the time to read from and write to the slower memory. γ captures the overhead due to NVRAM contention between threads. Moving toward exascale, with increasing numbers of cores and application threads, contention effects can become substantial; we plan to investigate them and propose methods to address them in future work. For the staging approach, the model can be formulated as

DataStaging:
App. run time (sec) ≈ C + O_T
O_T (sec) = T_B + T_C
where
T_B - application execution block time
T_C - communication channel overheads

As discussed earlier, T_B is primarily due to insufficient staging memory, which causes the application to pause until sufficient staging memory is available for buffering I/O data. Parallel applications use interconnects for fast MPI based messaging and communication. The DataStaging mechanism uses the same interconnects for moving I/O data from compute nodes to staging nodes. This results in interconnect contention (T_C), causing significant interference, as reported by H. Abbasi et al. [3].
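The qualitative comparison in Section 7 can be summarized with the model above as a simple break-even condition. The inequality below is our restatement of that comparison, not an equation taken from the paper.

```latex
% Break-even condition implied by the two models (our restatement):
% the Active NVRAM approach wins whenever its in-situ overhead stays below
% the staging overhead for the same output step.
\[
  T_{IO} + \gamma \;<\; T_B + T_C
\]
% For small data volumes and low checkpoint frequency, T_B is negligible and
% staging is faster; as data size or I/O frequency grows, memory pressure on
% the staging nodes inflates T_B and the inequality favors Active NVRAM.
```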
7. EXPERIMENTAL EVALUATION
Experimental results were obtained on a 64-core (16-node) cluster at HP with 6 GB of memory per node and a 20 Gb/s InfiniBand interconnect. To emulate node-local active NVRAM, we dedicate one core in each node to active NVRAM processing, while the other cores are used for application computation. For all experiments, each core runs one MPI process. For the dedicated I/O staging approach, the ratio of compute cores to staging cores is 3:1. In all experiments, the number of Active NVRAM processing cores and staging cores is identical, and we use the MVAPICH library with InfiniBand support. Measurements use the GTC [1] application, with the application's restart data (checkpoints) as the I/O data source. Our experiments
- evaluate the effectiveness of Active NVRAM in addressing the 'right memory sizing' issue and analyze the benefits and implications by comparing it with DataStaging,
- discuss the overheads of using Active NVRAM, with distributed sort as a post-processing application.

7.1 Post processing performance
Figure 6 validates the tradeoffs of using node-local NVRAM, with a distributed sort as the post-processing example. Experiments were performed with 36 processes executing one hundred iterations of the application. We keep the checkpoint interval constant (10 iterations) and vary the total amount of I/O data written by the application. As mentioned earlier, the bandwidth of NVRAM using our memory mapped file emulator depends on the total available free system memory. The memory (DRAM) to NVRAM copy bandwidth is around 360 MB/s. While this value is close to some of the first real hardware prototypes [5], the bandwidth of production level hardware is expected to be higher. The y-axis in Figure 6 indicates the total time for post-processing. In the case of DataStaging, data is moved from the compute node to a staging node queue using InfiniBand-based RDMA and then post-processed asynchronously.

Figure 6: Post processing performance (for smaller data sizes, I/O staging performs better; for large I/O data sizes, active NVRAM performs better)
Figure 7: DataStaging - Memory pressure (GTC simulation completion time per I/O output time step; chunk size ~33 MB, 3 staging nodes; overheads due to memory pressure/page swapping)
Figure 8: Post processing time vs. I/O interval (with less frequent I/O writes the staging approach is better by ~15%; with increasing frequency of I/O writes, Active NVRAM performs better)

Observations: As seen in Figure 6, for small data sizes the I/O staging approach performs around 6% better than the NVRAM approach. With increasing I/O data size, the performance of staging diminishes because the rate of inflow of data into the staging node queue is higher than the rate at which that data can be processed, which results in a substantial reduction of the total available memory. With a further increase in I/O data and reduction in staging node free memory, page swapping increases, and this memory pressure (see Figure 7) directly impacts the time to process the data chunks emitted by the application. This is particularly serious for the sort operation, as it uses intermediate buffers when processing data. In the case of NVRAM, although the time to process each chunk is higher than with the staging approach (due to its comparatively wimpy processors and low NVRAM bandwidth), for larger data sizes improved performance is realized by avoiding the 'right memory sizing' related issues. Therefore, an important insight from this work is that storage class memories like NVRAM can be effective in extending the memory capacity of compute and staging nodes. Further, these power and cost efficient ways to extend compute or staging node memories can both reduce the level of memory pressure caused by in-situ data analytics and enable new online analytics that span multiple output steps or require substantial intermediate data.

7.2 I/O Frequency vs. Application Performance
Figure 8 analyzes the performance of the DataStaging and Active NVRAM approaches by varying the checkpoint (I/O) frequency of the GTC application and shows the impact on application run time. The x-axis in the graph indicates the number of compute iterations after which data is written to staging nodes or to active NVRAM. The top two curves show the post-processing time, and the curves at the bottom show the application execution time.
Observation: When the I/O interval is large (around 20 iterations), the time to post-process data is almost the same as the application run time for both the staging and Active NVRAM approaches. With increasing output frequency, I/O staging performs better than Active NVRAM in terms of post-processing, again due to our NVRAM implementation's relatively higher processing latency and lower bandwidth compared to IB-based I/O staging. However, when the checkpoint frequency is increased further, the performance of staging begins to degrade due to memory scarcity, as discussed throughout this paper. In contrast, the post-processing performance curve of Active NVRAM remains constant and depends directly on data size. Similarly, there is minimal impact on application run time for both approaches at low I/O frequencies, but for higher frequency I/O with DataStaging we noticed an 8-10% increase in run time, primarily attributed to the application blocking while staging nodes pull I/O data.

7.3 Distributed Sort - Microbenchmarks
A distributed merge sort consists of a local sort followed by a global sort. Table 1 compares the performance of each sort step at a finer granularity. The I/O chunk size written by each process is around 33 MB.

Table 1: Post processing microbenchmark
Operation       Staging (sec)   NVRAM (sec)
Local sort      4.38            6.03
Global sort     4.76            6.14
Data movement   0.02            0.0902

For our scale of experimentation, since the data moved from compute to staging nodes for each checkpoint step is relatively small compared to the total available network bandwidth (InfiniBand), there is very little overhead for moving data to staging nodes, whereas the data fetch costs of NVRAM post-processing are higher in our test configuration. This explains why, for data intensive operations like sorting, Active NVRAM processing is slower than I/O staging. This may change when scaling to many thousands of cores. Further, a subject of our future work will be to consider the comparative energy costs when data is moved to Active NVRAM vs. across a high performance interconnect. Our future work will also take NVRAM latency into consideration and use static instrumentation techniques to measure the total reads and writes to NVRAM.
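As a rough summary of the per-step gap, the following arithmetic (ours, simply summing the three steps reported in Table 1) restates the table; it is not an additional measurement.

```latex
% Total post-processing time summed over the three steps of Table 1 (ours):
\[
  T_{staging} = 4.38 + 4.76 + 0.02 \approx 9.2\,\mathrm{s}, \qquad
  T_{NVRAM}   = 6.03 + 6.14 + 0.09 \approx 12.3\,\mathrm{s}
\]
% i.e., at this scale the Active NVRAM path is roughly 1.3x slower per step,
% consistent with the discussion above.
```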
8. CONCLUSIONS
This paper explores the problem of memory 'right sizing' for I/O DataStaging. It proposes an alternative solution for doing so, termed 'Active NVRAM', which permits node-local I/O on both compute and staging nodes. We implement and evaluate an Active NVRAM framework in which data analysis can be done in NVRAM asynchronously with the application's execution. We use the GTC fusion modeling code [1] to evaluate post-processing performance for the common example of data sorting. We further evaluate the tradeoffs of using active NVRAM for different I/O data sizes by comparing it with an approach using dedicated I/O staging nodes. Preliminary evaluations show the benefits of using active NVRAM when staging nodes have insufficient memory resources to hold, post-process, and flush output data, thereby affecting post-processing and application performance. The key outcome of this work is its demonstration of the potential benefits of using NVRAM to 'right size' the memory available for in-situ data analysis on staging nodes. Also assessed are the limitations of NVRAM use due to relatively low per-node access bandwidths. At larger scales, however, the aggregate high bandwidth of node-local NVRAM will likely outperform the relatively costly cross-section bandwidth offered by high end interconnects, and additional gains may come from the energy efficient nature of future NVRAM hardware. Our future work will address some of these limitations by improving our emulated NVRAM implementation and by investigating these problems at larger scales. In addition, we will consider other post-processing actions, ranging from efficient methods for data filtering and reduction, to computationally expensive analysis techniques like compression, to those that expand data volumes like indexing.

9. ACKNOWLEDGMENTS
The authors would like to thank the PDAC program committee reviewers for their helpful comments. We gratefully acknowledge the technical support and guidance provided by researchers at Georgia Tech and HP Labs in Palo Alto.

10. REFERENCES
[1] Gyrokinetic Toroidal Code. http://phoenix.ps.uci.edu/GTC.
[2] Memristor. http://www.hpl.hp.com/news/2008/apr-jun/memristor.html.
[3] H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky, K. Schwan, and F. Zheng. DataStager: Scalable data staging services for petascale applications. In HPDC, New York, NY, USA, 2009. ACM.
[4] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and S. Swanson. Onyx: A prototype phase change memory storage array. In Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems (HotStorage '11), Berkeley, CA, USA, 2011. USENIX Association.
[5] A. Akel, A. M. Caulfield, T. I. Mollov, R. K. Gupta, and S. Swanson. Onyx: A prototype phase change memory storage array. In HotStorage, Berkeley, CA, USA, 2011. USENIX Association.
[6] D. G. Andersen, J. Franklin, M. Kaminsky, A. Phanishayee, L. Tan, and V. Vasudevan. FAWN: A fast array of wimpy nodes. In SOSP, New York, NY, USA, 2009. ACM.
[7] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate. PLFS: A checkpoint filesystem for parallel applications. In SC, New York, NY, USA, 2009. ACM.
[8] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In MICRO, Washington, DC, USA, 2010. IEEE Computer Society.
[9] X. Dong, N. Muralimanohar, N. Jouppi, R. Kaufmann, and Y. Xie. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems.
In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), New York, NY, USA, 2009. ACM.
[10] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase change memory as a scalable DRAM alternative. In ISCA. ACM, 2009.
[11] K.-L. Ma, C. Wang, H. Yu, and A. Tikhonov. In-situ processing and visualization for ultrascale simulations. Journal of Physics: Conference Series, 78(1), 2007.
[12] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In SC, Washington, DC, USA, 2010. IEEE Computer Society.
[13] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali. Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling. In MICRO 42, New York, NY, USA, 2009. ACM.
[14] P. Ranganathan. From microprocessors to nanostores: Rethinking data-centric systems. Computer, 44:39-48.
[15] S. Kannan, D. Milojicic, V. Talwar, A. Gavrilovska, K. Schwan, and A. Hassan. Using Active NVRAM for cloud I/O. In Open Cirrus Summit, Atlanta, GA, 2011.
[16] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou. Scalable performance of the Panasas parallel file system. In FAST, 2008.
[17] F. Zheng, H. Abbasi, C. Docan, J. Lofstead, S. Klasky, et al. PreDatA - Preparatory Data Analytics on Peta-Scale Machines. In IPDPS, Atlanta, GA, 2010.