HPIS3: Towards a High-Performance Simulator for Hybrid Parallel I/O and Storage Systems

Bo Feng, Ning Liu, Shuibing He, Xian-He Sun
Department of Computer Science, Illinois Institute of Technology, Chicago, IL
Email: {bfeng5, nliu8}@hawk.iit.edu, {she11, sun}@iit.edu
Speaker: Bo Feng

Outline
• Introduction
• Related Work
• Design and Implementation
• Experiments
• Conclusions and Future Work

To Meet the High I/O Demands
1. PFS (parallel file systems)
2. SSD (solid-state drives)
[Chart: bandwidth (MB/s) vs. request size (KB) for sequential and random accesses on HDD and SSD (HDD-seq, HDD-ran, SSD-seq, SSD-ran)]

HPIS3: Hybrid Parallel I/O and Storage System Simulator
• Parallel discrete event simulator
• A variety of hardware and software configurations
• Hybrid settings: Buffered-SSD, Tiered-SSD, ...
• HDD and SSD latency and bandwidth under parallel file systems
• Efficient and high-performance

Related Work
HPIS3 is a co-design tool for hybrid parallel I/O and storage systems; related efforts include:
• S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems [1]
• A Cost-Aware Region-Level Data Placement Scheme for Hybrid Parallel I/O Systems [2]
• On the Role of Burst Buffers in Leadership-Class Storage Systems [3]
• iBridge: Improving Unaligned Parallel File Access with Solid-State Drives [4]
• More...

[1] S. He, X.-H. Sun, and B. Feng, "S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems," in Proceedings of the International Conference on Distributed Computing Systems (ICDCS), 2014.
[2] S. He, X.-H. Sun, B. Feng, X. Huang, and K. Feng, "A Cost-Aware Region-Level Data Placement Scheme for Hybrid Parallel I/O Systems," in Proceedings of the 2013 IEEE International Conference on Cluster Computing (CLUSTER), 2013.
[3] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn, "On the Role of Burst Buffers in Leadership-Class Storage Systems," in Proceedings of the 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), 2012.
[4] X. Zhang, K. Liu, K. Davis, and S. Jiang, "iBridge: Improving Unaligned Parallel File Access with Solid-State Drives," in Proceedings of the 2013 IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS), 2013.

Design Overview
• Platform: ROSS
• Target: PVFS
• Architecture overview:
  - Application I/O workloads
  - PVFS clients, modeled as client LPs
  - PVFS servers, modeled as server LPs
  - Storage devices, modeled as drive LPs
• Note: LP is short for logical process. LPs act like real processes in the system and are synchronized by the Time Warp protocol.
[Diagram: client nodes connected to PVFS server nodes, each backed by an HDD or SSD]

File, Queue and PVFS Client Modeling
• File requests and request queue modeling
  - A request is modeled as the tuple <file_id, length, file_offset>
  - State variables define the queues
• PVFS client modeling
  - Striping mechanism (see the striping sketch after the event-flow slide below)
[Diagram: file requests striped across file servers in three configurations (a), (b), (c)]

PVFS Server Modeling
• Connected with clients and drives
[Diagram: a server node linking client nodes to its storage device (SSD or HDD)]

Event flow in HPIS3: a write example
[Diagram: write event flow across client, server, and drive LPs. Client requests: write, end. Client- and server-side events: FWRITE_init, FWRITE_inspect_attr, prelude, FWRITE_io_datafile_setup, FWRITE_positive_ack, FWRITE_datafile_post_msgpairs, FWRITE_start_flow, FWRITE_datafile_complete_operations, FWRITE_completion_ack, FWRITE_cleanup, release, FWRITE_io_getattr, FWRITE_terminate. Drive events: HW_START, HW_READY, WRITE_ACK, WRITE_END.]
• Write event flow for HDD: single-queue effect
• Write event flow for SSD: multi-queue effect
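To connect the write event flow above with the ROSS-based design, the fragment below sketches how a server LP might react to a start-flow message by scheduling an HW_START event at its drive LP. This is a sketch rather than HPIS3 source: the message layout, the event-kind enum, the state struct, and the NET_DELAY_MEAN constant are assumptions made for illustration, and the surrounding ROSS boilerplate (LP type registration, mapping, tw_init/tw_run) is omitted; only tw_event_new, tw_event_data, tw_event_send, and tw_rand_exponential are standard ROSS calls.

```c
#include <stdint.h>
#include "ross.h"

/* Hypothetical event kinds, loosely named after the slide's event flow. */
enum ev_kind { EV_FWRITE_START_FLOW, EV_HW_START, EV_WRITE_END };

/* Assumed message and server-LP state layouts (illustrative only). */
typedef struct { enum ev_kind kind; uint64_t bytes; tw_lpid src; } svr_msg;
typedef struct { tw_lpid drive_gid; long pending; } svr_state;

#define NET_DELAY_MEAN 0.05  /* assumed mean server-to-drive delay */

/* Forward event handler of a server LP: a start-flow message is queued and
 * forwarded to the attached drive LP as an HW_START event. */
static void svr_event(svr_state *s, tw_bf *bf, svr_msg *m, tw_lp *lp)
{
    (void)bf;
    switch (m->kind) {
    case EV_FWRITE_START_FLOW: {
        s->pending++;
        tw_stime delay = tw_rand_exponential(lp->rng, NET_DELAY_MEAN);
        tw_event *e = tw_event_new(s->drive_gid, delay, lp);
        svr_msg *out = (svr_msg *) tw_event_data(e);
        out->kind  = EV_HW_START;
        out->bytes = m->bytes;
        out->src   = lp->gid;
        tw_event_send(e);
        break;
    }
    case EV_WRITE_END:
        /* Drive finished: a completion ack back to the client LP would go here. */
        s->pending--;
        break;
    default:
        break;
    }
}
```

In a full Time Warp model, a matching reverse handler would undo these state changes (for example, decrement pending and roll back the RNG draw) so that optimistic rollbacks remain consistent.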
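The striping mechanism noted on the client-modeling slide can be made concrete with a small, self-contained example. The program below splits one <file_id, length, file_offset> request into per-server sub-requests under a round-robin layout; the 64 KB strip size, the four-server count, and the stripe_request() helper are assumptions for this sketch, not values taken from HPIS3 or a particular PVFS configuration.

```c
#include <stdio.h>
#include <stdint.h>

/* Request tuple, mirroring <file_id, length, file_offset>. */
struct io_request {
    uint64_t file_id;
    uint64_t length;
    uint64_t file_offset;
};

#define STRIP_SIZE  (64 * 1024)  /* assumed strip size */
#define NUM_SERVERS 4            /* assumed number of file servers */

/* Split one client request into per-server sub-requests under
 * round-robin striping (hypothetical helper, illustration only). */
static void stripe_request(const struct io_request *req)
{
    uint64_t off = req->file_offset;
    uint64_t remaining = req->length;

    while (remaining > 0) {
        uint64_t strip    = off / STRIP_SIZE;
        uint64_t in_strip = off % STRIP_SIZE;
        uint64_t chunk    = STRIP_SIZE - in_strip;
        if (chunk > remaining)
            chunk = remaining;

        printf("file %llu: %llu bytes at offset %llu -> server %llu\n",
               (unsigned long long) req->file_id,
               (unsigned long long) chunk,
               (unsigned long long) off,
               (unsigned long long) (strip % NUM_SERVERS));

        off += chunk;
        remaining -= chunk;
    }
}

int main(void)
{
    struct io_request req = { 1, 200 * 1024, 96 * 1024 };
    stripe_request(&req);
    return 0;
}
```

In the simulator, each such sub-request would correspond to an event delivered to the matching server LP.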
Storage Device Modeling: HDD vs. SSD (1)
[Diagram: in the HDD model a write is serialized through a single chain, FWRITE_start_flow -> HW_START -> HW_READY -> WRITE_ACK -> WRITE_END -> FWRITE_completion_ack; in the SSD model one FWRITE_start_flow/HW_START fans out into several concurrent HW_READY -> WRITE_ACK -> WRITE_END chains before FWRITE_completion_ack.]

Storage Device Modeling: HDD vs. SSD (2)
• HDD model: start-up time, seek time, data transfer time
• SSD model: start-up time, FTL mapping time, GC time
• Workload categories: sequential read (SR), sequential write (SW), random read (RR), random write (RW); a service-time sketch follows below
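As a rough illustration of how the cost components above might combine into per-request service times, the sketch below adds up start-up, seek, and transfer terms for the HDD model, and start-up, FTL-mapping, occasional GC, and transfer terms for the SSD model. All constants are placeholder values chosen for readability, and the simple additive form is an assumption; they are not the calibrated parameters or exact formulas used in HPIS3.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed placeholder parameters (milliseconds, MB/s); not HPIS3's calibration. */
#define HDD_STARTUP_MS   0.1
#define HDD_SEEK_MS      8.5     /* average seek + rotational delay for a random access */
#define HDD_XFER_MBPS    120.0
#define SSD_STARTUP_MS   0.05
#define SSD_FTL_MAP_MS   0.02    /* FTL mapping cost per request */
#define SSD_GC_PROB      0.02    /* probability that a write triggers garbage collection */
#define SSD_GC_MS        5.0     /* GC penalty when triggered */
#define SSD_XFER_MBPS    600.0

/* HDD: start-up + (seek if the access is random) + data transfer. */
static double hdd_service_ms(double kb, int is_random)
{
    double xfer = kb / 1024.0 / HDD_XFER_MBPS * 1000.0;
    return HDD_STARTUP_MS + (is_random ? HDD_SEEK_MS : 0.0) + xfer;
}

/* SSD: start-up + FTL mapping + occasional GC (writes only) + data transfer. */
static double ssd_service_ms(double kb, int is_write)
{
    double xfer = kb / 1024.0 / SSD_XFER_MBPS * 1000.0;
    double gc = (is_write && ((double)rand() / RAND_MAX) < SSD_GC_PROB) ? SSD_GC_MS : 0.0;
    return SSD_STARTUP_MS + SSD_FTL_MAP_MS + gc + xfer;
}

int main(void)
{
    double kb = 64.0;  /* example 64 KB request */
    printf("HDD random write : %.3f ms\n", hdd_service_ms(kb, 1));
    printf("HDD seq    write : %.3f ms\n", hdd_service_ms(kb, 0));
    printf("SSD random write : %.3f ms\n", ssd_service_ms(kb, 1));
    return 0;
}
```

The sequential/random and read/write distinctions in the sketch mirror the SR/SW/RR/RW workload categories listed above.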
Hybrid PVFS I/O and Storage Modeling
[Diagram: in the Buffered-SSD setting, SSD-servers sit between the application and the HDD-servers as a buffer layer in front of the HDD storage; in the Tiered-SSD setting, SSD-servers form an SSD tier and HDD-servers form an HDD tier, and both tiers serve application I/O directly.]
• S is short for SSD-server, a server node equipped with an SSD.
• H is short for HDD-server, a server node equipped with an HDD.

Experimental setup
65-node SUN Fire Linux cluster:
• CPU: Quad-Core AMD Opteron 2376 x 2
• Memory: 4 x 2 GB DDR2, 333 MHz
• Network: 1 Gbps Ethernet
• Storage: HDD: Seagate SATA II, 250 GB, 7200 RPM; SSD: OCZ PCI-E X4, 100 GB
• OS: Linux kernel 2.6.28.10
• File system: OrangeFS 2.8.6
• 32 nodes are used throughout our experiments in this study

Benchmark and Trace tool
• IOR: sequential read/write and random read/write
• IOSIG: collected traces are replayed to trigger simulation events

Simulation Validity
[Chart: throughput (MB/s) vs. transfer size (4 KB to 16384 KB) for measured and simulated random workloads on HDD-servers and SSD-servers (H-ran, S-ran, H-sim-ran, S-sim-ran).]
• 8 clients, 4 HDD-servers, 4 SSD-servers
• Lowest error rate is 2%
• Average error rate is 11.98%

Simulation Performance Study
[Chart: simulator running time (sec) and event rate (events/sec) as the number of processes grows from 2 to 256.]
• 32 physical nodes
• 2048 simulated clients
• 1024 simulated servers
• Number of processes from 2 to 256

Case study: Tiered-SSD Performance Tuning
[Bar chart: throughput (MB/s) for the 4HDD, 4SSD, Original, and Tuned configurations; values shown are 86.053, 99.92, 240.96, and 275.67 MB/s.]
• 16 clients, 64 KB random requests
• 4 HDD-servers + 4 SSD-servers
• Performance is boosted by about 15% with the original setting
• Performance is boosted by about 140% with the tuned setting

Conclusions and Future Work
• HPIS3 simulator: a hybrid parallel I/O and storage simulation system
  - Models of PVFS clients, servers, HDDs and SSDs
  - Validated against benchmarks: the minimum error rate is 2% and the average is about 11.98% in IOR tests
  - Scalable: number of processes from 2 to 256
  - Showcase of Tiered-SSD settings under PVFS
  - Useful for finding optimal settings
  - Useful for self-tuning at runtime
• Future work
  - More evaluation of Tiered-SSD vs. Buffered-SSD
  - Improve accuracy with more detailed models
  - Client-side settings and more

Thank you
Questions?
Bo Feng, bfeng5@hawk.iit.edu