Life Sciences and Meteorology: High-Performance Computing Solutions and Success Stories
Ling Weicai (凌巍才), HPC Product Technical Consultant, Dell (China) Co., Ltd.
Confidential – Global Marketing

Agenda
• Life-science HPC solutions
  – GPU-accelerated solutions
  – High-performance storage solutions
• WRF V3.3 (a meteorology application): testing and tuning on the Dell R720 server
  – gcc compilers
  – Intel compilers
• Success stories

Life-Science HPC: GPU Solutions

In the life sciences, many users adopt GPU-accelerated solutions: CPU + GPU computing on an HPCC GPU heterogeneous platform.

Dell Server Options with GPU Support (2012, 12th-generation servers)

External solutions (PowerEdge C servers paired with the C410x expansion chassis) versus internal solutions (R720, T620):

Metric                  | C6220+C410x | C6220+C410x | C6145+C410x | C6145+C410x | T620     | R720
GPU:socket ratio        | 1:1         | 2:1         | 1:1         | 2:1         | 2:1      | 1:1
Total system boards     | 8           | 4           | 4           | 2           | 1        | 1
Total HICs              | 8           | 4           | 8           | 4           | 0        | 0
IB capable              | Yes         | Yes         | Yes         | Yes         | Yes*     | Yes
Total GPUs              | 16          | 16          | 16          | 16          | 4        | 2
Per-GPU B/W             | 8           | 4           | 8           | 4           | 4        | 16
MSRP (M2075)            | $117,000    | $86,900     | $114,000    | $85,250     | $19,000  | $13,000
Power envelope (est.)   | 5.525 kW    | 4.118 kW    | 5.030 kW    | 3.802 kW    | —        | —
Theoretical GFLOPS      | TBD         | TBD         | 9,326       | 8,932       | 2,431    | 1,401
Est. GFLOPS             | TBD         | TBD         | 2,891       | 1,697       | TBD      | TBD
GFLOPS per rack U       | TBD         | TBD         | 413         | 339         | 486      | 701
$/GFLOPS                | TBD         | TBD         | 39          | 50          | 8        | 9
Rack size (U)           | 7           | 5           | 7           | 5           | 5        | 2
GPUs per rack U         | 2.3         | 3.2         | 2.3         | 3.2         | 0.8      | 1.0

GPU Expansion Chassis (external GPU solution): Dell PowerEdge C410x
PCIe expansion chassis connecting 1–8 hosts to 1–16 PCIe devices.
Great for HPC, including universities, oil & gas, biomedical research, design, simulation, mapping, visualization, rendering, and gaming.
• 3U chassis, 19" wide, 143 pounds
• PCIe modules: 10 front, 6 rear
• PCI form factors: HH/HL and FH/HL
• Up to 225 W per module
• PCIe inputs: 8 PCIe x16 iPass ports
• PCIe fan-out options: x16 to 1, 2, 3, or 4 slots
• GPUs supported: NVIDIA M1060, M2050, M2070 (TBD)
• Thermals: high-efficiency 92 mm fans; N+1 fan redundancy
• Management: on-board BMC; IPMI 2.0; dedicated management port
• Power supplies: 4 × 1400 W hot-plug, high-efficiency PSUs; N+1 power redundancy
• Services vary by region: IT Consulting, Server and Storage Deployment, Rack Integration (US only), Support Services
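The derived metrics in the server-options table above follow from simple arithmetic on the per-column figures. A quick sketch, using the C6145+C410x 2:1 column (est. 1,697 GFLOPS, $85,250 MSRP, 16 GPUs, 5U) to reproduce its density and cost figures:

```shell
# Recompute the derived metrics for the C6145 + C410x (2:1) column.
awk 'BEGIN {
  est_gflops = 1697; msrp = 85250; rack_u = 5; gpus = 16
  printf "GFLOPS per rack U: %d\n", est_gflops / rack_u   # 339
  printf "$/GFLOPS: %d\n", msrp / est_gflops              # 50
  printf "GPUs per rack U: %.1f\n", gpus / rack_u         # 3.2
}'
```

The same arithmetic reproduces the other columns (e.g. R720: 1,401/2U ≈ 701 GFLOPS/U, $13,000/1,401 ≈ 9 $/GFLOPS).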
PowerEdge C410x PCIe Modules
• Serviceable PCIe module ("taco") capable of supporting any half-height/half-length (HH/HL) or full-height/half-length (FH/HL) card
• FH/FL cards supported with an extended PCIe module
• Future-proofed for next generations of NVIDIA and AMD ATI GPU cards
• Module elements: power connector for the GPGPU card, LED, and a board-to-board connector for x16 Gen PCIe signals and power

PowerEdge C410x Configurations
• Enables HPC applications to optimize the cost/performance equation off a single x16 host connection. Each host's x16 HIC connects through an iPass cable to a PCIe switch in the C410x, which fans out to 1–4 GPUs:
  – 1 GPU per x16: 8 GPUs in 7U (one C410x + two C6100)
  – 2 GPUs per x16: 16 GPUs in 7U (one C410x + two C6100)
  – 3 GPUs per x16: 12 GPUs in 5U (one C410x + one C6100)
  – 4 GPUs per x16: 16 GPUs in 5U (one C410x + one C6100)
• GPU/U ratios assume a PowerEdge C6100 host with 4 servers per 2U chassis

Flexibility of the PowerEdge C410x
• Ratios up to 8:1 per host are possible with dual x16 HICs: each HIC's iPass cable feeds a PCIe switch in a separate C410x, each switching out to 4 GPUs

PowerEdge C6100 Configurations: "2:1 Sandwich"
Summary
• One Dell C410x (16 GPUs) between two C6100 (8 nodes)
• One x16 slot per node connects to 2 GPUs
• 7U total; 16 GPUs; 8 nodes (2 GPUs per board)
Details
• Two C6100: 8 system boards; 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host; single-port x16 HIC (iPass)
• Single C410x: 16 GPUs (fully populated); PCIe x8 per GPU
• Total space = 7U
Note: this configuration is equivalent to using the C6100 with the NVIDIA S2050, but denser.

PowerEdge C6100 Configurations: "4:1 Sandwich"
Summary
• One Dell C410x (16 GPUs) + one C6100 (4 nodes)
• One x16 slot per node connects to 4 GPUs
• 5U total; 16 GPUs; 4 nodes (4 GPUs per board)
Details
• One C6100: 4 system boards; 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host; single-port x16 HIC (iPass)
• Single C410x: 16 GPUs (fully populated); PCIe x4 per GPU
• Total space = 5U

PowerEdge C6100 Configurations: "8:1 Sandwich" (possible future development)
Summary
• Two Dell C410x (32 GPUs) + one C6100 (4 nodes)
• One x16 slot per node connects to 8 GPUs
• 8U total; 32 GPUs; 4 nodes (8 GPUs per board)
Details
• One C6100: 4 system boards; 2S Westmere, 12 DIMM slots, QDR IB, up to 6 drives per host; single-port x16 HIC (iPass)
• Two C410x: 32 GPUs (fully populated); PCIe x2 per GPU
• Total space = 8U
• See later table for metrics

PowerEdge C6145 Configurations: "8:1 Sandwich" (5U of rack space)
Summary
• One Dell C410x (16 GPUs) + one C6145 (2 nodes)
• Two to four HIC slots per node connect to the 16 GPUs
• 5U total; 16 GPUs; 2 nodes (8 GPUs per board)
Details
• One C6145: 2 system boards; 4S Magny-Cours, 32 DIMM slots, QDR IB, up to 12 drives per host; 3 × single-port x16 HIC (iPass) + 1 × single-port onboard x16 HIC (iPass)
• One C410x: 16 GPUs (fully populated); PCIe x4–x8 per GPU
• Total space = 5U

PowerEdge C6145 Configurations: "16:1 Sandwich" (8U of rack space)
Summary
• Two Dell C410x (32 GPUs) + one C6145 (2 nodes)
• Four HIC slots per node connect to 16 GPUs
• 8U total; 32 GPUs; 2 nodes (16 GPUs per board)
Details
• One C6145: 2 system boards; 4S Magny-Cours, 32 DIMM slots, QDR IB, up to 12 drives per host; 3 × single-port x16 HIC
(iPass) + 1 × single-port onboard x16 HIC (iPass)
• Two C410x: 32 GPUs (fully populated); PCIe x4 per GPU
• Total space = 8U

PowerEdge C410x Block Diagram
[Diagram: the 16 GPUs fan in through level-2 switches (2 × 4) and a level-1 switch (1 × 8) to eight x8 host connections.]

C410x BMC Console Configuration Interface
[Screenshot of the C410x BMC web console.]

Servers Supporting the GPU Expansion Chassis: HIC/C410x Support Matrix
• Dell external GPU solution support:
  – A Hardware Interface Card (HIC) in a PCIe slot connects to external GPU(s) in the C410x
  – Dell "slot validates" NVIDIA interface cards to verify power, thermals, etc.

Server           | Planned Support | C410x Support Date
C6100            | Yes             | Now
C6105            | RTS+            | Now – BIOS 1.7.1 or later
C6145            | RTS             | Now
C1100            | Yes             | Now
Precision R5500  | Yes             | Now – disable SSC in BIOS
R710             | Yes             | Now
M610x            | Yes             | Now
R410             | Yes             | Now
R720             | RTS             | RTS
R720xd           | RTS             | RTS
R620             | RTS             | RTS
C6220            | RTS             | RTS

Life-Science Application Test: GPU-HMMER
[Chart: GPU-HMMER CPU vs. GPU wall-clock time.]
• One GPU (C410x/C6100) vs. CPU: speedups of 1.8×, 2.7×, 2.8×, and 2.9× for HMM lengths 415, 983, 1419, and 2293.

GPU:Host Scaling: GPU-HMMER
[Chart: GPU-HMMER wall-clock time vs. HMM length for each configuration.]
• Across HMM lengths 415–2293: 1.8× with one GPU, 3.6× with two, and 7.2× with four (C410x/C6100); an internal 2-x16 configuration with two GPUs also reaches 3.6×.

GPU:Host Scaling: NAMD (STMV benchmark)
[Chart: NAMD steps/second per configuration.]
• Steps/second: CPU 0.10; C410x/C6100 with 1, 2, and 4 GPUs: 0.47 (4.7×), 0.82 (8.2×), 1.52 (15.2×); internal 2-x16 with 2 GPUs: 0.95 (9.5×).

GPU:Host Scaling: LAMMPS LJ-Cut
[Chart: LAMMPS LJ wall-clock time vs. particle count.]
• Speedups over 256,000 to 1,000,188 particles: 8.5× (1 GPU), 13.5× (2 GPUs), and 14.4× (4 GPUs) on C410x/C6100; 14.0× for internal 2-x16 (2 GPUs).

Life-Science Storage Solutions
[Chart: growth rates of compute and data capacity in the life sciences.]

The Lustre Parallel File System
Key Lustre components:
1. Clients (compute nodes): the "users" of the file system, where applications run — the Dell HPC cluster
2. Metadata Server (MDS): holds metadata information
3. Object Storage Servers (OSS): provide the back-end storage for users' files; additional OSS units increase throughput linearly

InfiniBand (IPoIB) NFS Performance: Sequential Reads
[Chart: NSS IPoIB sequential-read throughput (KB/s) vs. threads (1–32 nodes).]
• Peaks:
  – NSS Small: 1 node doing IO (fairly level until 4 nodes)
  – NSS Medium: 4 nodes doing IO (not much drop-off)
  – NSS Large: 8 nodes doing IO (good performance over the range)

InfiniBand (IPoIB) NFS Performance: Sequential Writes
[Chart: NSS IPoIB sequential-write throughput (KB/s) vs. threads (1–32 nodes).]
• Peaks:
  – NSS Small: 1 node doing IO (steady drop-off to 16 nodes)
  – NSS Medium: 2 nodes doing IO (good performance for up to 8 nodes)
  – NSS Large: 4 nodes doing IO (good performance over the range)

WRF V3.3 Application Testing and Tuning

Dell Test Environment
• Dell R720
  – CPU: 2 × Intel Sandy Bridge E5-2650
  – Memory: 8 × 8 GB (64 GB total)
  – Hard disk: 2 × 300 GB 15k rpm (RAID 0)
• BIOS settings
  – Disable HT
  – Memory optimized
  – High performance enabled (max power)
• OS: Red Hat Enterprise Linux 6.3

gcc Test
• gcc, gfortran, g++
• zlib 1.2.5
• HDF5 1.8.8
• netCDF 4
• WRF V3.3

Test Results
• Output: wrf simulated 2011-11-30 through 2011-12-05 in a wall time of 13h 9m 53s:
  – wrf.exe starts at: Sun Apr 29 09:35:36 CST 2012
  – …
  – wrf: SUCCESS COMPLETE WRF
  – wrf.exe completed at: Sun Apr 29 22:45:29 CST 2012

Configuration File (configure.wrf excerpt)
  # Settings for x86_64 Linux, gfortran compiler with gcc (smpar)
  DMPARALLEL      = 1
  OMPCPP          = -D_OPENMP
  OMP             = -fopenmp
  OMPCC           = -fopenmp
  SFC             = gfortran
  SCC             = gcc
  CCOMP           = gcc
  DM_FC           = mpif90 -f90=$(SFC)
  DM_CC           = mpicc -cc=$(SCC)
  FC              = $(SFC)
  CC              = $(SCC) -DFSEEKO64_OK
  LD              = $(FC)
  RWORDSIZE       = $(NATIVE_RWORDSIZE)
  PROMOTION       = # -fdefault-real-8 # uncomment manually
  ARCH_LOCAL      = -DNONSTANDARD_SYSTEM_SUBR
  CFLAGS_LOCAL    = -w -O3 -c -DLANDREAD_STUB
  LDFLAGS_LOCAL   =
  CPLUSPLUSLIB    =
  ESMF_LDFLAG     = $(CPLUSPLUSLIB)
  FCOPTIM         = -O3 -ftree-vectorize -ftree-loop-linear -funroll-loops
  FCREDUCEDOPT    = $(FCOPTIM)
  FCNOOPT         = -O0
  FCDEBUG         = # -g $(FCNOOPT)
  FORMAT_FIXED    = -ffixed-form
  FORMAT_FREE     = -ffree-form -ffree-line-length-none
  FCSUFFIX        =
  BYTESWAPIO      = -fconvert=big-endian -frecord-marker=4
  FCBASEOPTS_NO_G = -w $(FORMAT_FREE) $(BYTESWAPIO)
  FCBASEOPTS      = $(FCBASEOPTS_NO_G) $(FCDEBUG)
  MODULE_SRCH_FLAG =
  TRADFLAG        = -traditional
  CPP             = /lib/cpp -C -P
  AR              = ar
  ARFLAGS         = ru
  M4              = m4 -G
  RANLIB          = ranlib
  CC_TOOLS        = $(SCC)

wrf.out
  …
  WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS = 16
  WRF TILE   1 IS 1 IE 250 JS   1 JE  10
  WRF TILE   2 IS 1 IE 250 JS  11 JE  20
  WRF TILE   3 IS 1 IE 250 JS  21 JE  30
  WRF TILE   4 IS 1 IE 250 JS  31 JE  39
  WRF TILE   5 IS 1 IE 250 JS  40 JE  48
  WRF TILE   6 IS 1 IE 250 JS  49 JE  57
  WRF TILE   7 IS 1 IE 250 JS  58 JE  66
  WRF TILE   8 IS 1 IE 250 JS  67 JE  75
  WRF TILE   9 IS 1 IE 250 JS  76 JE  84
  WRF TILE  10 IS 1 IE 250 JS  85 JE  93
  WRF TILE  11 IS 1 IE 250 JS  94 JE 102
  WRF TILE  12 IS 1 IE 250 JS 103 JE 111
  WRF TILE  13 IS 1 IE 250 JS 112 JE 120
  WRF TILE  14 IS 1 IE 250 JS 121 JE 130
  WRF TILE  15 IS 1 IE 250 JS 131 JE 140
  WRF TILE  16 IS 1 IE 250 JS 141 JE 150
  WRF NUMBER OF TILES = 16
  …
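The 13h 9m 53s wall time reported in the test results can be checked directly from the start and completion timestamps in the run log. A small sketch using GNU date (the timestamps are taken from the log above and treated as naive local times, ignoring the CST zone):

```shell
# Elapsed wall time between wrf.exe start and completion (from the run log).
start=$(date -u -d "2012-04-29 09:35:36" +%s)
end=$(date -u -d "2012-04-29 22:45:29" +%s)
elapsed=$(( end - start ))
printf '%dh %dm %ds\n' $(( elapsed / 3600 )) $(( elapsed % 3600 / 60 )) $(( elapsed % 60 ))
# prints "13h 9m 53s"
```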
System Resource Analysis: CPU (mpstat -P ALL)
  Linux 2.6.32-257.el6.x86_64 (r720)   04/29/2012   _x86_64_   (16 CPU)

  04:06:40 PM  CPU  %usr   %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %idle
  04:06:40 PM  all  85.27  0.00   2.62  0.01     0.00  0.00   0.00    0.00    12.10
  04:06:40 PM    0  85.71  0.00   2.58  0.01     0.00  0.00   0.00    0.00    11.69
  04:06:40 PM    1  85.05  0.00   2.77  0.05     0.00  0.04   0.00    0.00    12.09
  04:06:40 PM    2  85.26  0.00   2.69  0.00     0.00  0.00   0.00    0.00    12.05
  04:06:40 PM    3  85.24  0.00   2.65  0.01     0.00  0.00   0.00    0.00    12.10
  04:06:40 PM    4  87.36  0.00   1.90  0.00     0.00  0.00   0.00    0.00    10.73
  04:06:40 PM    5  84.97  0.00   2.70  0.00     0.00  0.00   0.00    0.00    12.33
  04:06:40 PM    6  85.23  0.00   2.64  0.00     0.00  0.00   0.00    0.00    12.13
  04:06:40 PM    7  84.97  0.00   2.71  0.00     0.00  0.00   0.00    0.00    12.32
  04:06:40 PM    8  85.33  0.00   2.60  0.00     0.00  0.00   0.00    0.00    12.06
  04:06:40 PM    9  85.32  0.00   2.57  0.00     0.00  0.00   0.00    0.00    12.11
  04:06:40 PM   10  84.88  0.00   2.77  0.00     0.00  0.00   0.00    0.00    12.35
  04:06:40 PM   11  84.93  0.00   2.69  0.00     0.00  0.00   0.00    0.00    12.38
  04:06:40 PM   12  85.16  0.00   2.62  0.00     0.00  0.00   0.00    0.00    12.21
  04:06:40 PM   13  85.00  0.00   2.69  0.00     0.00  0.00   0.00    0.00    12.31
  04:06:40 PM   14  84.91  0.00   2.75  0.00     0.00  0.00   0.00    0.00    12.34
  04:06:40 PM   15  85.02  0.00   2.65  0.00     0.00  0.00   0.00    0.00    12.33

System Resource Analysis: Memory (free)
               total       used       free     shared    buffers     cached
  Mem:      65895488   32823072   33072416          0      38220   26885024
  -/+ buffers/cache:    5899828   59995660
  Swap:     66027512          0   66027512

System Resource Analysis: IO (iostat)
  Device:   tps     Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
  sda       9.01    125.71      2063.47     3096354   50823660
  dm-0      0.64    12.63       1.99        311170    49016
  dm-1      0.01    0.10        0.00        2576      0
  dm-2      258.17  112.05      2061.48     2759698   50774616

System Resource Analysis: HDD (df)
  Filesystem                    1K-blocks      Used  Available Use% Mounted on
  /dev/mapper/vg_r720-lv_root    51606140   5002372   43982328  11% /
  tmpfs                          32947744        88   32947656   1% /dev/shm
  /dev/sda1                        495844     37433     432811   8% /boot
  /dev/mapper/vg_r720-lv_home   458559680  58258760  377007380  14% /home
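The mpstat output shows WRF keeping all 16 cores roughly 85% busy in user space, with only ~2.6% system time and ~12% idle. As a sanity check, the utilization columns of any row should sum to 100%; a quick awk sketch using the "all" row's figures:

```shell
# %usr + %sys + %iowait + %idle for the "all" row of the mpstat output above
# (the remaining columns are all 0.00).
awk 'BEGIN { printf "%.2f\n", 85.27 + 2.62 + 0.01 + 12.10 }'
# prints "100.00"
```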
Intel Test

Intel links
• http://software.intel.com/en-us/articles/building-the-wrf-with-intel-compilers-on-linux-and-improving-performance-on-intel-architecture/
• http://software.intel.com/en-us/articles/wrf-and-wps-v311-installation-bkm-with-inter-compilers-and-intelr-mpi/
• http://www.hpcadvisorycouncil.com/pdf/WRF_Best_Practices.pdf

Intel Compiler Flags

Intel Tuning
http://software.intel.com/en-us/articles/performance-hints-for-wrf-on-intel-architecture/
1. Reducing MPI overhead:
  • -genv I_MPI_PIN_DOMAIN omp
  • -genv KMP_AFFINITY=compact
  • -perhost
2. Improving cache and memory bandwidth utilization:
  • numtiles = X
3. Using Intel Math Kernel Library (MKL) DFT for polar filters:
  • Depending on the workload, Intel MKL DFT may provide up to a 3× speedup in simulation speed
4. Speeding up computation by reducing precision:
  • -fp-model fast=2 -no-prec-div -no-prec-sqrt

Case Studies
• Beijing Genomics Institute (BGI)
• Tsinghua University School of Life Sciences

Success References in Life Science
• Domestic (China)
  – Beijing Genome Institute (BGI)
  – Tsinghua University Life Institute
  – Beijing Normal University
  – Jiang Su Tai Cang Life Institute
  – The 4th Military Medical University
  – …
• International
  – David H. Murdock Research Institute
  – Virginia Bioinformatics Institute
  – University of Florida (speeds up memory-intensive gene research)
  – UCSF
  – National Center for Supercomputing Applications
  – …

Thank you!