HPIS3: Towards a High-Performance Simulator for Hybrid Parallel I/O and Storage Systems
Bo Feng, Ning Liu, Shuibing He, Xian-He Sun
Department of Computer Science
Illinois Institute of Technology, Chicago, IL
Email: {bfeng5, nliu8}@hawk.iit.edu, {she11, sun}@iit.edu
Outline
• Introduction
• Related Work
• Design and Implementation
• Experiments
• Conclusions and Future Work
To Meet the High I/O Demands
1. PFS (parallel file systems)
2. SSD (solid-state drives)
[Figure: bandwidth (MB/sec) vs. request size (KB) for sequential and random access on HDD (HDD-seq, HDD-ran) and SSD (SSD-seq, SSD-ran).]
HPIS3: Hybrid Parallel I/O and Storage System Simulator
• Parallel discrete-event simulator
• A variety of hardware and software configurations
• Hybrid settings
  • Buffered-SSD
  • Tiered-SSD
  • …
• HDD and SSD latency and bandwidth under parallel file systems
• Efficient and high-performance
[Figure: events exchanged among the simulator's logical processes.]
Related Work
Co-design tool for hybrid parallel I/O and storage systems
• S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems [1]
• A Cost-Aware Region-Level Data Placement Scheme for Hybrid Parallel I/O Systems [2]
• On the Role of Burst Buffers in Leadership-Class Storage Systems [3]
• iBridge: Improving Unaligned Parallel File Access with Solid-State Drives [4]
• More…
[1] S. He, X.-H. Sun, and B. Feng, “S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems,” in Proceedings of International Conference on Distributed Computing Systems (ICDCS), 2014.
[2] S. He, X.-H. Sun, B. Feng, X. Huang, and K. Feng, “A Cost-Aware Region-Level Data Placement Scheme for Hybrid Parallel I/O Systems,” in Proceedings of 2013 IEEE International Conference on Cluster Computing (CLUSTER), 2013.
[3] N. Liu, J. Cope, P. Carns, C. Carothers, R. Ross, G. Grider, A. Crume, and C. Maltzahn, “On the Role of Burst Buffers in Leadership-Class Storage Systems,” in Proceedings of 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), 2012.
[4] X. Zhang, K. Liu, K. Davis, and S. Jiang, “iBridge: Improving Unaligned Parallel File Access with Solid-State Drives,” in Proceedings of the 2013 IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS), 2013.
Design Overview
• Platform: ROSS
• Target: PVFS
• Architecture overview: application I/O workloads → PVFS clients → PVFS servers → storage devices
  • Client LPs
  • Server LPs
  • Drive LPs
• Note: LP is short for logical process. LPs act like real processes in the system and are synchronized by the Time Warp protocol.
[Figure: architecture overview — application I/O workloads drive PVFS client LPs, which send requests to PVFS server node LPs; each server node LP is backed by a storage device LP (SSD or HDD).]
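To make the LP decomposition concrete, the sketch below lays out hypothetical C structs for the message exchanged among LPs and for the client, server, and drive LP states. The type and field names are illustrative assumptions, not HPIS3's actual ROSS data structures; in the real simulator these LPs are ROSS logical processes synchronized by Time Warp.

/* A minimal sketch of the message and LP-state layout implied by the three LP
 * kinds above. Type and field names are hypothetical illustrations only. */
#include <stdint.h>
#include <stdio.h>

typedef enum { DRIVE_HDD, DRIVE_SSD } drive_kind;

/* One simulation event/message exchanged among LPs. */
typedef struct {
    int      event_type;   /* e.g. a write-path event such as HW_START */
    uint64_t file_id;      /* file the request belongs to              */
    uint64_t offset;       /* file offset in bytes                     */
    uint64_t length;       /* request length in bytes                  */
    int      src_lp;       /* LP that sent the event                   */
} io_msg;

typedef struct { int outstanding; } client_state;                    /* client LP */
typedef struct { int queued; } server_state;                         /* server LP */
typedef struct { drive_kind kind; double busy_until; } drive_state;  /* drive LP  */

int main(void) {
    io_msg m = { .event_type = 0, .file_id = 1, .offset = 0,
                 .length = 64 * 1024, .src_lp = 0 };
    printf("example request: file %llu, %llu bytes at offset %llu\n",
           (unsigned long long)m.file_id, (unsigned long long)m.length,
           (unsigned long long)m.offset);
    return 0;
}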
File, Queue and PVFS Client Modeling
• File request and request queue modeling
  • A request is a tuple <file_id, length, file_offset>
  • State variables define the queues
• PVFS client modeling
  • Striping mechanism: a request is split across the file servers (see the sketch below)
[Figure (a)-(c): a file request split into sub-requests and queued across the file servers.]
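As a concrete illustration of the striping mechanism, the sketch below splits one <file_id, length, file_offset> request into per-server sub-requests, assuming PVFS-style round-robin striping; the 64KB stripe size and 4 servers are example values, not HPIS3 defaults.

/* Split a file request into per-server sub-requests under round-robin striping. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t file_id;
    uint64_t length;       /* bytes */
    uint64_t file_offset;  /* bytes */
} file_request;

/* Print the sub-request that each stripe of `req` maps to. */
static void stripe_request(file_request req, uint64_t stripe_size, int nservers) {
    uint64_t pos = req.file_offset;
    uint64_t end = req.file_offset + req.length;
    while (pos < end) {
        uint64_t stripe     = pos / stripe_size;            /* global stripe index   */
        int      server     = (int)(stripe % nservers);     /* round-robin placement */
        uint64_t stripe_end = (stripe + 1) * stripe_size;   /* end of this stripe    */
        uint64_t sub_len    = (end < stripe_end ? end : stripe_end) - pos;
        printf("file %llu: server %d gets %llu bytes at file offset %llu\n",
               (unsigned long long)req.file_id, server,
               (unsigned long long)sub_len, (unsigned long long)pos);
        pos += sub_len;
    }
}

int main(void) {
    file_request req = { .file_id = 42, .length = 256 * 1024, .file_offset = 192 * 1024 };
    stripe_request(req, 64 * 1024, 4);   /* 64KB stripes over 4 file servers */
    return 0;
}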
PVFS Server Modeling
• Server LPs are connected with client LPs and drive LPs
[Figure: a server node LP connecting multiple client LPs to its local storage device LP (SSD or HDD).]
Event flow in HPIS3: a write example
• Client requests: write
• Client events: prelude, release, end
• Server events: FWRITE_init, FWRITE_inspect_attr, FWRITE_io_getattr, FWRITE_io_datafile_setup, FWRITE_positive_ack, FWRITE_datafile_post_msgpairs, FWRITE_start_flow, FWRITE_datafile_complete_operations, FWRITE_completion_ack, FWRITE_cleanup, FWRITE_terminate
• Drive events: HW_START, HW_READY, WRITE_ACK, WRITE_END
• Write event flow for HDD: single-queue effect
• Write event flow for SSD: multi-queue effect
[Figure: the write event flow across client, server, and drive LPs; a simplified server-side dispatch is sketched below.]
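The sketch below shows one way a server LP could dispatch the write-path events named above. The event names come from this slide; the chaining order and handler structure are a simplified illustration, not HPIS3's actual state machine.

/* Sketch of a server-LP dispatcher for the write path; ordering is illustrative. */
#include <stdio.h>

typedef enum {
    FWRITE_init, FWRITE_inspect_attr, FWRITE_io_getattr,
    FWRITE_io_datafile_setup, FWRITE_positive_ack,
    FWRITE_datafile_post_msgpairs, FWRITE_start_flow,
    FWRITE_datafile_complete_operations, FWRITE_completion_ack,
    FWRITE_cleanup, FWRITE_terminate
} server_event;

static const char *event_name[] = {
    "FWRITE_init", "FWRITE_inspect_attr", "FWRITE_io_getattr",
    "FWRITE_io_datafile_setup", "FWRITE_positive_ack",
    "FWRITE_datafile_post_msgpairs", "FWRITE_start_flow",
    "FWRITE_datafile_complete_operations", "FWRITE_completion_ack",
    "FWRITE_cleanup", "FWRITE_terminate"
};

/* Handle one server event and return the next event in the write flow. */
static server_event server_handle(server_event ev) {
    switch (ev) {
    case FWRITE_init:                         return FWRITE_inspect_attr;
    case FWRITE_inspect_attr:                 return FWRITE_io_getattr;
    case FWRITE_io_getattr:                   return FWRITE_io_datafile_setup;
    case FWRITE_io_datafile_setup:            return FWRITE_positive_ack;   /* ack the client */
    case FWRITE_positive_ack:                 return FWRITE_datafile_post_msgpairs;
    case FWRITE_datafile_post_msgpairs:       return FWRITE_start_flow;
    case FWRITE_start_flow:                   return FWRITE_datafile_complete_operations; /* drive handles HW_START..WRITE_END */
    case FWRITE_datafile_complete_operations: return FWRITE_completion_ack;
    case FWRITE_completion_ack:               return FWRITE_cleanup;
    default:                                  return FWRITE_terminate;
    }
}

int main(void) {
    for (server_event ev = FWRITE_init; ev != FWRITE_terminate; ev = server_handle(ev))
        printf("%s\n", event_name[ev]);
    return 0;
}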
Storage Device Modeling: HDD vs. SSD (1)
[Figure: server-drive write event flows. In the HDD model, FWRITE_start_flow triggers HW_START and the HW_READY/WRITE_ACK/WRITE_END sequences are serviced one after another (single-queue effect); in the SSD model, several sequences proceed concurrently (multi-queue effect). In both cases FWRITE_completion_ack is returned when the flow finishes. A sketch of the resulting completion-time difference follows.]
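A minimal sketch of the single-queue vs. multi-queue effect, assuming a fixed per-request service time and four SSD channels (both hypothetical example values, not HPIS3 parameters): the HDD model drains one queue serially, while the SSD model assigns each request to the earliest-free channel.

/* Contrast serial (HDD) and channel-parallel (SSD) request completion times. */
#include <stdio.h>

#define NREQ      4
#define NCHANNEL  4

int main(void) {
    const double service_ms = 1.0;   /* per-request service time (assumed) */

    /* HDD: single queue, requests complete back to back. */
    double hdd_free = 0.0;
    for (int i = 0; i < NREQ; i++) {
        hdd_free += service_ms;
        printf("HDD request %d finishes at %.1f ms\n", i, hdd_free);
    }

    /* SSD: each request goes to the earliest-free channel. */
    double chan_free[NCHANNEL] = {0};
    for (int i = 0; i < NREQ; i++) {
        int best = 0;
        for (int c = 1; c < NCHANNEL; c++)
            if (chan_free[c] < chan_free[best]) best = c;
        chan_free[best] += service_ms;
        printf("SSD request %d finishes at %.1f ms (channel %d)\n",
               i, chan_free[best], best);
    }
    return 0;
}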
Storage Device Modeling: HDD vs. SSD (2)
• HDD cost components: start-up time, seek time, data transfer time
• SSD cost components: start-up time, FTL (flash translation layer) mapping time, GC (garbage collection) time
• Four access patterns are modeled (a simple additive cost sketch follows):

             Read  Write
  Sequential  SR    SW
  Random      RR    RW
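The sketch below turns these cost components into simple additive service-time functions for the four access patterns. All constants (seek time, FTL lookup, GC probability, bandwidths) are hypothetical placeholders, not HPIS3's calibrated parameters.

/* Additive per-request service-time models for HDD and SSD (assumed constants). */
#include <stdbool.h>
#include <stdio.h>

typedef enum { SEQ_READ, SEQ_WRITE, RAND_READ, RAND_WRITE } access_t;

/* HDD: start-up + seek (paid only for random access) + transfer. */
static double hdd_service_us(access_t a, double kb) {
    const double startup_us = 50.0;      /* controller overhead (assumed)        */
    const double seek_us    = 8500.0;    /* average seek + rotation (assumed)    */
    const double mb_per_s   = 120.0;     /* sequential bandwidth (assumed)       */
    bool random = (a == RAND_READ || a == RAND_WRITE);
    double transfer_us = kb / 1024.0 / mb_per_s * 1e6;
    return startup_us + (random ? seek_us : 0.0) + transfer_us;
}

/* SSD: start-up + FTL mapping + amortized GC (writes only) + transfer. */
static double ssd_service_us(access_t a, double kb) {
    const double startup_us = 20.0;      /* assumed                               */
    const double ftl_us     = 30.0;      /* mapping lookup/update (assumed)       */
    const double gc_us      = 2000.0;    /* cost of one GC pass (assumed)         */
    const double gc_prob    = 0.02;      /* fraction of writes triggering GC (assumed) */
    const double mb_per_s   = (a == SEQ_READ || a == RAND_READ) ? 280.0 : 250.0;
    bool write = (a == SEQ_WRITE || a == RAND_WRITE);
    double transfer_us = kb / 1024.0 / mb_per_s * 1e6;
    return startup_us + ftl_us + (write ? gc_prob * gc_us : 0.0) + transfer_us;
}

int main(void) {
    printf("64KB random write: HDD %.1f us, SSD %.1f us\n",
           hdd_service_us(RAND_WRITE, 64), ssd_service_us(RAND_WRITE, 64));
    return 0;
}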
Hybrid PVFS I/O and Storage Modeling
• S is short for SSD-Server, which is a server node with an SSD.
• H is short for HDD-Server, which is a server node with an HDD.
[Figure: two hybrid settings. In the buffered-SSD setting, SSD-Servers sit between the application and the HDD-Servers as a buffer layer; in the tiered-SSD setting, SSD-Servers and HDD-Servers form an SSD tier and an HDD tier side by side. A configuration sketch follows.]
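One possible way to describe the two hybrid layouts to a simulator is sketched below; the struct and field names are hypothetical and only illustrate how SSD-Servers and HDD-Servers take different roles in the buffered-SSD and tiered-SSD settings, not HPIS3's actual configuration format.

/* Hypothetical server-layout descriptors for the two hybrid settings. */
#include <stdio.h>

typedef enum { DRIVE_HDD, DRIVE_SSD } drive_t;
typedef enum { ROLE_STORAGE, ROLE_BUFFER, ROLE_SSD_TIER, ROLE_HDD_TIER } role_t;

typedef struct {
    int     id;
    drive_t drive;
    role_t  role;
} server_cfg;

int main(void) {
    /* Buffered-SSD: SSD-Servers buffer requests in front of HDD-Servers. */
    server_cfg buffered[] = {
        {0, DRIVE_SSD, ROLE_BUFFER},  {1, DRIVE_SSD, ROLE_BUFFER},
        {2, DRIVE_HDD, ROLE_STORAGE}, {3, DRIVE_HDD, ROLE_STORAGE},
    };
    /* Tiered-SSD: SSD-Servers and HDD-Servers form two storage tiers. */
    server_cfg tiered[] = {
        {0, DRIVE_SSD, ROLE_SSD_TIER}, {1, DRIVE_SSD, ROLE_SSD_TIER},
        {2, DRIVE_HDD, ROLE_HDD_TIER}, {3, DRIVE_HDD, ROLE_HDD_TIER},
    };
    printf("buffered: %zu servers, tiered: %zu servers\n",
           sizeof buffered / sizeof *buffered, sizeof tiered / sizeof *tiered);
    return 0;
}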
Experimental setup
65-node SUN Fire Linux cluster:
• CPU: Quad-Core AMD Opteron(tm) Processor 2376 × 2
• Memory: 4 × 2GB DDR2, 333MHz
• Network: 1 Gbps Ethernet
• Storage: HDD: Seagate SATA II 250GB, 7200RPM; SSD: OCZ PCI-E X4 100GB
• OS: Linux kernel 2.6.28.10
• File system: OrangeFS 2.8.6
• 32 nodes used throughout the experiments in this study
Benchmark and Trace tool
• IOR
  • Sequential read/write
  • Random read/write
• IOSIG
  • Collected I/O traces are replayed to trigger simulation events
Simulation Validity
• 8 clients
• 4 HDD-servers, 4 SSD-servers
• Lowest error rate is 2%
• Average error rate is 11.98%
[Figure: measured vs. simulated random-access throughput (MB/sec) for HDD-servers (H-ran, H-sim-ran) and SSD-servers (S-ran, S-sim-ran) as transfer size grows from 4KB to 16384KB.]
Simulation Performance Study
• 32 physical nodes
• 2048 clients, 1024 servers
• Number of simulator processes scaled from 2 to 256
[Figure: simulator running time (sec) and event rate (events/sec) as the number of processes grows from 2 to 256.]
Case study: Tiered-SSD Performance Tuning
• 16 clients
• 64K random requests
• 4 HDD-servers + 4 SSD-servers
• The original setting boosts performance by about 15%
• The tuned setting boosts performance by about 140%
[Figure: throughput (MB/sec) of 4HDD (86.053), 4SSD (275.67), the original tiered setting (99.92), and the tuned tiered setting (240.96).]
Conclusions and Future Work
• HPIS3 simulator: a hybrid parallel I/O and storage simulation system
  • Models of PVFS clients, servers, HDDs and SSDs
  • Validated against benchmarks: the minimum error rate is 2% and the average is about 11.98% in IOR tests
  • Scalable: number of processes from 2 to 256
  • Showcase of tiered-SSD settings under PVFS
  • Useful for finding optimal settings
  • Useful for self-tuning at runtime
• Future work
  • More evaluation of tiered-SSD vs. buffered-SSD
  • Improve accuracy with more detailed models
  • Client-side settings and more
Thank you
Questions?
Bo Feng
bfeng5@hawk.iit.edu