Dynamic Active Storage for High Performance I/O
Chao Chen (chao.chen@ttu.edu)
4.02.2012
UREaSON

Outline
- Background
- Active Storage
- Issues/challenges
- Dynamic Active Storage
- Prototyping and evaluation
- Conclusion and future work

Background
- Applications in geographic information systems, climate science, astrophysics, high-energy physics, etc. are becoming more and more data intensive.
  - NASA's Shuttle Radar Topography Mission (10 TB)
  - FLASH: Buoyancy-Driven Turbulent Nuclear Burning (75 TB to 300 TB)
  - Climate science (10 TB to 355 TB)
- Efficient tools are needed to store and analyze these data sets.

Background
- CN: compute nodes, dedicated to processing (sum, subtraction, multiplication, etc.).
- SN: storage nodes, dedicated to storing the data.
- Moving data between them is very time consuming: I/O operations dominate system performance.
[Figure: compute nodes CN1..CNn connected to storage nodes SN1..SNm over a network; the application and analysis kernel run on the compute nodes and issue I/O requests for data stored on the storage nodes' disks]

Active Storage
- Active Storage was proposed to mitigate this issue and has attracted intensive attention.
- It moves appropriate computations near the data, onto the storage nodes.
[Figure: the application on the compute node sends an I/O request; the analysis kernel runs on the storage node against the data on disk and returns only the result, so the network bandwidth cost is reduced]

Active Storage
Two well-known prototypes:
- Felix et al. proposed the first prototype, based on Lustre.
  - Supports only limited, simple operations.
  - Lacks a flexible method for adding processing kernels.
[Figure: Lustre-based architecture with NAL, a user-space processing component, OST, ASOBD, OBDfilter, ext3, and ASDEV]

Active Storage
- Son et al. proposed another prototype, based on PVFS.
  - A more sophisticated prototype, built on MPI.
  - Users can register their own processing kernels.
[Figure: PVFS-based architecture: application clients 1..n use the parallel file system API and an Active Storage API over an interconnection network; servers 1..n host the registered kernels, disks, and GPUs]

Issues/Challenges
- Existing studies do not consider data dependence.
- Dependence commonly exists among data accesses.
- Example: the flow-direction and flow-accumulation operations in terrain analysis.
[Fig. 1: examples of SFD (single flow direction) and MFD (multiple flow direction) over a latitude/longitude grid]

Issues/Challenges
- Dependence has a great impact on performance.
[Figure: execution time (s) of TS and AS versus data size (24 GB to 60 GB) for a SUM operation (no dependence) and a flow-routing operation (with dependence)]
- Question: is every operation suitable to be offloaded to the storage nodes?

Data Dependence
- A terrain map is split into stripes 1..N; each stripe is 64 KB in PVFS.
- Under round-robin distribution, the neighbors s1..s8 of an element s can fall into stripes o, p, and q, which are stored on different servers a, b, and c, each running its own analysis kernel (see the sketch after the architecture overview).
- Possible bandwidth cost: 2 times the file size.
[Figure: terrain map stripes, a possible data distribution across servers a, b, and c, and the resulting cross-server accesses]

Dynamic Active Storage
A Dynamic Active Storage (DAS) prototype is proposed. It:
- predicts the I/O bandwidth cost before an active I/O request is accepted;
- dynamically determines which operations are beneficial to offload to and process on the storage nodes;
- introduces a new data layout method.

DAS System Architecture
[Figure: DAS system architecture, with the new components highlighted]
Key components:
1. Bandwidth prediction
2. Data distribution calculation (layout optimizer)
3. Kernel features
4. Local I/O API
5. Processing kernels
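To make the striping arithmetic behind the data-dependence problem (and the bandwidth-prediction component listed above) concrete, here is a minimal sketch in Python. It is an illustration, not code from the DAS prototype: the element size, grid shape, and function names are assumptions; only the 64 KB stripe size and the round-robin placement come from the slides.

```python
# Minimal sketch (not the DAS code base): locate a grid cell's stripe and
# storage node under PVFS-style round-robin striping, then check whether the
# 3x3 neighborhood used by flow-direction stays on a single node.
# Assumed parameters: 8-byte cells, row-major layout.

STRIPE_SIZE = 64 * 1024   # bytes; PVFS stripe size from the slides
ELEM_SIZE   = 8           # bytes per cell (assumption)

def node_of(row, col, ncols, num_nodes):
    """Return the storage node holding cell (row, col) under round-robin striping."""
    offset = (row * ncols + col) * ELEM_SIZE   # byte offset in the file
    stripe = offset // STRIPE_SIZE             # which stripe the cell falls in
    return stripe % num_nodes                  # round-robin stripe placement

def neighborhood_is_local(row, col, ncols, num_nodes):
    """True if the 3x3 stencil around (row, col) maps to a single storage node."""
    nodes = {node_of(row + dr, col + dc, ncols, num_nodes)
             for dr in (-1, 0, 1) for dc in (-1, 0, 1)}
    return len(nodes) == 1

# Example: a 4096-column terrain map on 4 storage nodes. One row is 32 KB here,
# so a stripe holds only two rows and essentially every interior cell pulls
# neighbors from another node; that cross-node traffic is what DAS predicts.
print(node_of(100, 200, 4096, 4))              # -> 2
print(neighborhood_is_local(100, 200, 4096, 4))  # -> False
```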
Bandwidth Prediction
Given the dependence pattern (e.g., data elements i, j, and k separated by a fixed stride), we can calculate the data locations and estimate the bandwidth cost in advance. Under round-robin striping, the location of the i-th element is

  L(i) = floor((i * E) / Stripe_size) mod D

where:
- i, j, k: the i-th, j-th, and k-th data elements
- E: data element size
- D: number of storage nodes
- L: location (storage node) of a data element
- Stripe_size: the parallel file system striping parameter

Bandwidth Prediction
- If Formula 1 holds, i.e., L(i) = L(j) = L(k), then all dependent data is located on the same storage node, and the offload request is accepted.
- Otherwise, the operation would cost about twice the bandwidth of the file size, and the active I/O request should be rejected.

Issues/Challenges
On the other hand, it is common for successive operations to share the same data access pattern in terrain analysis and image processing.
- For example, flow-direction is always followed by flow-accumulation in terrain analysis.
- Flow-direction generates an intermediate image/map that flow-accumulation consumes.

Layout Optimizer
A new data distribution method is introduced:
- Adopt a suitable data distribution to store the intermediate image/data.
- Ensure no (or little) data dependency for successive operations such as flow-accumulation.
- The round-robin pattern is discarded; each storage node stores k successive stripes.
- Two copies of each boundary data stripe are stored on two successive storage nodes (see the sketch after the performance results below).
[Figure: the normal round-robin layout of stripes 1..N versus the new layout, in which stripes l, m, n reside on server a and stripes o, p, q on server b, with the boundary stripes copied between the two servers to avoid data transfer]

Layout Optimizer
New formulas: with the block layout, what the prototype needs to do is calculate suitable values for k, D, and Stripe_size.

Evaluation
- Platform: Hrothgar cluster
- Number of nodes: 24, 36, 48, 60
- Evaluated operations: flow-routing, flow-accumulation, and a 2D Gaussian filter
- Data set sizes: 24 GB, 36 GB, 48 GB, and 60 GB
- Evaluated schemes: TS (traditional storage), NAS (normal active storage), DAS (the proposed prototype)

Impact of Data Dependence
[Figure: performance impact of data dependence: execution time (s) versus data size (24 GB to 60 GB) for flow-routing, flow-accumulation, and the Gaussian filter under NAS and TS]
The execution time of the NAS scheme is compared with that of the TS scheme.

Performance Improvement
[Figure: comparison of the execution time of NAS, TS, and DAS for flow-routing, flow-accumulation, and the Gaussian filter (24 GB data, 24 nodes)]
- About 30% improvement vs. TS
- About 60% improvement vs. NAS
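Returning to the layout optimizer: here is a minimal sketch, in Python, of the block placement idea (k successive stripes per node, with boundary stripes replicated on the neighboring node). It is an assumption-laden illustration rather than the DAS implementation; the function name, the boundary parameter, and the choice to replicate exactly one stripe per block edge are made up here for clarity.

```python
# Minimal sketch of the DAS-style block layout (structure and names are assumptions):
# instead of round-robin, give each storage node k successive stripes, and replicate
# each block's boundary stripes on the neighboring node so that a stencil which
# straddles a block boundary can still be evaluated locally.

def block_layout(num_stripes, num_nodes, k, boundary=1):
    """Map each stripe index to the list of nodes holding a copy of it."""
    placement = {s: [] for s in range(num_stripes)}
    for s in range(num_stripes):
        primary = (s // k) % num_nodes              # k successive stripes per node
        placement[s].append(primary)
        pos = s % k                                  # position inside the block
        if pos < boundary and s >= k:                # first stripe(s) of a block:
            placement[s].append(((s // k) - 1) % num_nodes)   # copy on previous node
        if pos >= k - boundary and (s // k + 1) * k < num_stripes:  # last stripe(s):
            placement[s].append(((s // k) + 1) % num_nodes)   # copy on next node
    return placement

# Example: 12 stripes, 3 nodes, k = 4. Stripes 3 and 4 (the boundary between the
# blocks on node 0 and node 1) each end up on two nodes.
for stripe, nodes in block_layout(12, 3, 4).items():
    print(stripe, nodes)
```

With this placement, the stripes on either side of a block boundary are available on a single node, which is what lets a dependent operation such as flow-accumulation run on the storage nodes without the extra transfers measured for NAS above.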
Scalability Analysis
[Figure: scalability with a varied number of nodes: execution time (s) versus number of nodes (24 to 60) for flow-routing, flow-accumulation, and the Gaussian filter under DAS and TS]
Comparison of execution time as the number of nodes increases: all execution times decrease by about 15% for every 12 additional nodes.

Scalability Analysis
[Figure: scalability with a varied data set size: execution time (s) versus data size (24 GB to 60 GB) for flow-routing, flow-accumulation, and the Gaussian filter under NAS, DAS, and TS]
When the data size increases by 12 GB, the execution time increases by about 15% for DAS and by about 30% for NAS and TS.

Bandwidth Improvement
[Figure: normalized sustained bandwidth versus data size (24 GB to 60 GB) for NAS, DAS, and TS]
Compared to TS, DAS sustains about 1.8 times the bandwidth, while NAS sustains only about 0.7 times.

Conclusion and Future Work
- Data dependence has a great impact on the performance of Active Storage.
- DAS is introduced to address this challenging issue.
- Future work: resource contention.

References
1. R. Ross, R. Latham, M. Unangst, and B. Welch. Parallel I/O in Practice. Tutorial at the ACM/IEEE Supercomputing Conference, 2009.
2. J. F. O'Callaghan and D. M. Mark. The Extraction of Drainage Networks from Digital Elevation Data. Computer Vision, Graphics, and Image Processing, 28:323-344, 1984.
3. J. Piernas, J. Nieplocha, and E. J. Felix. Evaluation of Active Storage Strategies for the Lustre Parallel File System. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007.
4. E. J. Felix, K. Fox, K. Regimbal, and J. Nieplocha. Active Storage Processing in a Parallel File System. In 6th LCI International Conference on Linux Clusters: The HPC Revolution, Chapel Hill, North Carolina, 2005.
5. S. W. Son, S. Lang, P. Carns, R. Ross, and R. Thakur. Enabling Active Storage on Parallel I/O Software Stacks. In 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), 2010.
etc.

Thank you