Overview of Extreme-Scale Software Research in China

Depei Qian
Sino-German Joint Software Institute (JSI)
Beihang University
China-USA Computer Software Workshop
Sep. 27, 2011

Outline
  Related R&D efforts in China
  Algorithms and Computational Methods
  HPC and e-Infrastructure
  Parallel programming frameworks
  Programming heterogeneous systems
  Advanced compiler technology
  Tools
  Domain specific programming support

Related R&D efforts in China
  NSFC
    Basic algorithms and computable modeling for high performance scientific computing
    Network based research environment
    Many-core parallel programming
  863 program
    High productivity computer and Grid service environment
    Multicore/many-core programming support
    HPC software for earth system modeling
  973 program
    Parallel algorithms for large scale scientific computing
    Virtual computing environment

Algorithms and Computational Methods

NSFC's Key Initiative on Algorithm and Modeling
  Basic algorithms and computable modeling for high performance scientific computing
  8-year program, launched in 2011
  180 million Yuan funding
  Focused on
    Novel computational methods and basic parallel algorithms
    Computable modeling for selected domains
    Implementation and verification of parallel algorithms by simulation

HPC & e-Infrastructure

863's key projects on HPC and Grid
  "High productivity Computer and Grid Service Environment"
    Period: 2006-2010
    940 million Yuan from the MOST and more than 1B Yuan matching money from other sources
  Major R&D activities
    Developing PFlops computers
    Building up a grid service environment (CNGrid)
    Developing Grid and HPC applications in selected areas

CNGrid GOS Architecture
  [Figure: layered architecture of CNGrid GOS. Labels recovered from the figure:]
  Tools and applications: GSML Workshop, command-line tools, IDE, debugger, compiler, GSML Composer, GSML Browser, HPCG App & Mgmt Portal, Gsh & cmd tools, VegaSSH, System Mgmt Portal, other domain specific applications
  Core, System and App Level Services: GOS Library (Batch, Message, File, etc.), GOS System Call (resource mgmt, Agora mgmt, user mgmt, Grip mgmt, etc.), HPCG Backend, Axis Handlers for Message Level Security, CA Service, metainfo mgmt, File mgmt, BatchJob mgmt, Account mgmt, MetaSchedule, Message Service, Dynamic DeployService, DataGrid, GridWorkflow, DB Service, Work Flow Engine
  System: Agora (Security, Resource Space, Res AC & Sharing, User Mgmt, Agora Mgmt), Grip (Grip Instance Mgmt, Grip Runtime), Core (Res Mgmt, Naming, ServiceController, Other RController)
  Hosting Environment: Tomcat (5.0.28) + Axis (1.2 rc2), J2SE (1.4.2_07, 1.5.0_07), OS (Linux/Unix/Windows), PC Server (Grid Server)
  Side labels: Grid Portal, Gsh+CLI, GSML Workshop and Grid Apps; Tomcat (Apache) + Axis, GT4, gLite, OMII; Java J2SE; other third-party software and tools

Abstractions
  Grid community: Agora
    Persistent information storage and organization
  Grid process: Grip
    Runtime control

CNGrid GOS deployment
  CNGrid GOS deployed on 11 sites and some application Grids
  Support heterogeneous HPCs: Galaxy, Dawning, DeepComp
  Support multiple platforms: Unix, Linux, Windows
  Using public network connection, enable only HTTP port
  Flexible client
    Web browser
    Special client
    GSML client

  [Figure: map of CNGrid sites. Per-site data, all with IPv4/v6 access:]
  CNIC: 150 TFlops, 1.4 PB storage, 30 applications, 269 users all over the country
  Tsinghua University: 1.33 TFlops, 158 TB storage, 29 applications, 100+ users
  IAPCM: 1 TFlops, 4.9 TB storage, 10 applications, 138 users
  Shandong University: 10 TFlops, 18 TB storage, 7 applications, 60+ users
  GSCC: 40 TFlops, 40 TB storage, 6 applications, 45 users
  SSC: 200 TFlops, 600 TB storage, 15 applications, 286 users
  XJTU: 4 TFlops, 25 TB storage, 14 applications, 120+ users
  HUST: 1.7 TFlops, 15 TB storage
  SIAT: 10 TFlops, 17.6 TB storage
  USTC: 1 TFlops, 15 TB storage, 18 applications, 60+ users
  HKU: 20 TFlops, 80+ users

CNGrid: resources
  11 sites
  >450 TFlops
  2900 TB storage
  Three PF-scale sites will be integrated into CNGrid soon

CNGrid: services and users
  230 services
  >1400 users
    China Commercial Aircraft Corp
    Bao Steel
    automobile
    institutes of CAS
    universities
    …

CNGrid: applications
  Supporting >700 projects
    973, 863, NSFC, CAS Innovative, and Engineering projects

Parallel programming frameworks

JASMIN: A parallel programming framework
  [Figure: application codes are separated into special and common models, stencils, and algorithms; the common parts are extracted into a library; data dependencies form the data structures; the library supports parallel computing models, communications, and load balancing on the target computers.]
  Also supported by the 973 and 863 projects

Basic ideas
  Hide the complexity of programming millions of cores
  Integrate the efficient implementations of parallel fast numerical algorithms
  Provide efficient data structures and solver libraries
  Support software engineering for code extensibility

Basic ideas (cont'd)
  [Figure: with JASMIN, application codes written in a serial programming style run on personal computers, TeraFlops clusters, and PetaFlops MPPs.]

JASMIN
  J parallel Adaptive Structured Mesh INfrastructure
  http://www.iapcm.ac.cn/jasmin, 2010SR050446, 2003-now
  Supported computations: structured grid, unstructured grid, particle simulation
  Application areas: inertial confinement fusion, global climate modeling, CFD, material simulations, …

JASMIN
  User provides: physics, parameters, numerical methods, expert experiences, special algorithms, etc.
  User interfaces: component-based parallel programming models (C++ classes)
  Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc.
  HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc.
  JASMIN V. 2.0
    Architecture: multilayered, modularized, object-oriented
    Codes: C++/C/F90/F77 + MPI/OpenMP, 500,000 lines
    Installation: personal computers, clusters, MPP

Numerical simulations on TianHe-1A
  Codes      # CPU cores     Codes               # CPU cores
  LARED-S    32,768          RH2D                1,024
  LARED-P    72,000          HIME3D              3,600
  LAP3D      16,384          PDD3D               4,096
  MEPH3D     38,400          LARED-R             512
  MD3D       80,000          LARED Integration   128
  RT3D       1,000
  Simulation duration: several hours to tens of hours.

Programming heterogeneous systems

GPU programming support
  Source-to-source translation
  Runtime optimization
  Mixed programming model for multi-GPU systems

S2S translation for GPU
  A source-to-source translator, GPU-S2S, for GPU programming
  Facilitates the development of parallel programs on GPUs by combining automatic mapping and static compilation

S2S translation for GPU (cont'd)
  Insert directives into the source program
    Guide implicit calls of the CUDA runtime libraries
    Enable the user to control the mapping from the homogeneous CPU platform to the GPU's streaming platform
  Optimization based on runtime profiling
    Take full advantage of the GPU according to the application characteristics by collecting runtime dynamic information
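The idea of collecting runtime information to find the functions worth mapping to the GPU can be illustrated with a small function-level profiler. This is only a sketch of the general technique in Python, not GPU-S2S itself: the class `FunctionProfiler`, its `kernels` method, and the 50% time-share threshold are all hypothetical names and values chosen for illustration.

```python
import functools
import time
from collections import defaultdict


class FunctionProfiler:
    """Hypothetical function-level profiler (illustration only)."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock                  # injectable so runs can be made deterministic
        self.elapsed = defaultdict(float)   # function name -> accumulated inclusive seconds

    def instrument(self, func):
        """Wrap func so that every call adds its wall time to self.elapsed."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = self.clock()
            try:
                return func(*args, **kwargs)
            finally:
                self.elapsed[func.__name__] += self.clock() - start
        return wrapper

    def kernels(self, share=0.5):
        """Names of functions that account for at least `share` of the measured time."""
        total = sum(self.elapsed.values())
        if total == 0:
            return []
        return sorted(name for name, t in self.elapsed.items() if t / total >= share)


if __name__ == "__main__":
    prof = FunctionProfiler()

    @prof.instrument
    def initialize(n):
        return [[float(i + j) for j in range(n)] for i in range(n)]

    @prof.instrument
    def multiply(a, b):
        n = len(a)
        return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    m = initialize(60)
    multiply(m, m)
    print("candidate computing kernels:", prof.kernels())
```

Times are inclusive, so a wrapped function that calls another wrapped function is counted in both; a real tool would need to separate the two.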
The GPU-S2S architecture
  [Figure: software stack. Layer of software productivity: PGAS programming model, MPI message transfer model, Pthread thread model, GPU-S2S. Layer of running-time performance: profile information (performance collection, performance discovery), GPU supporting library, user standard library, calls to shared libraries. Below: operating system, GPU platform.]

Program translation by GPU-S2S
  [Figure: source code before translation (homogeneous platform program framework): homogeneous platform code with directives, user-defined part, templates library of optimized computing-intensive applications, profile library, user standard library, calls to shared libraries. GPU-S2S translates it into source code after translation (GPU streaming architecture platform program framework): GPU kernel program generated according to templates, CPU control program, general-purpose computing interface, templates library, profile library, user standard library, calls to shared libraries.]

Runtime optimization based on profiling
  [Figure: flowchart. Recoverable steps:]
  Input: homogeneous platform code (*.c, *.h); preprocessing by GPU-S2S
  First level profiling (function level)
    First level dynamic instrumentation
    Compile and run (C language compiler)
    Extract profile information: computing kernels
  Second level profiling (memory access and kernel improvement)
    Automatically insert directives; second level dynamic instrumentation
    Compile and run
    Extract profile information: data block size, shared memory configuration parameters, whether streams can be used
    Generate CUDA code containing the optimized kernel
    If no further optimization is needed: terminate
  Third level profiling (data partition), if further optimization is needed
    Third level dynamic instrumentation in the CUDA code
    Compile and run
    Extract profile information: number of streams, data size of every stream
    Generate CUDA code using streams
  Output: CUDA code (*.h, *.cu, *.c), compiled by the CUDA compiler tool into executable code on the GPU (*.o)

First level profiling
  [Figure: homogeneous platform code (allocate address space, initialization, function0 ... functionN, free address space) passes through the source-to-source compiler, which wraps each function in instrumentation to identify the computing kernels.]
  Scan and instrument the source code, get the execution time of every function, and identify the computing kernels

Second level profiling
  [Figure: the source-to-source compiler instruments each computing kernel of the homogeneous platform code to identify the memory access pattern and improve the kernels.]
  Instrument the computing kernels, extract and analyze the profile information, optimize according to the features of the application, and finally generate the CUDA code with optimized kernels

Third level profiling
  [Figure: CUDA control code (allocate address space, initialization, allocate global address space, function0 copy-in, function0 kernel, function0 copy-out, ..., free address space) with instrumentation around copy-in, kernel, and copy-out; optimization by improving the data partition.]
  Get copy time and computing time by instrumentation
  Compute the number of streams and the data size of each stream
  Generate the optimized CUDA code with streams

Matrix multiplication
  [Figure: performance comparison before and after profiling; time (ms) versus input size (1024, 2048 and 4096, 8192); series: only using global memory, memory access optimization, second level profile optimization, third level profile optimization.]
  [Figure: execution performance comparison on different platforms; time (ms) versus input size (1024 to 8192); series: three-level profile optimization, CPU.]
  The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory access optimization, and a 91% improvement over the CUDA code using only global memory for computing.

FFT (1048576 points)
  [Figure: performance comparison before and after profiling; time (ms) versus number of batches (15, 30, 45, 60); series: only using global memory, memory access optimization, second level profile optimization, third level profile optimization.]
  [Figure: execution performance comparison on different platforms; time (ms) versus number of batches (15 to 60); series: three-level profile optimization, CPU.]
  The CUDA code after three-level profile optimization achieves a 38% improvement over the CUDA code with memory access optimization, and a 77% improvement over the CUDA code using only global memory for computing.

Programming Multi-GPU systems
  The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models for this new kind of system.
  [Figure: two views of a node with CPUs, main memory, and four GPUs with their device memories. MPI view: each CPU has a private space in main memory and exchanges message data. PGAS view: main memory provides a shared space holding shared data.]
  Use message passing or shared data for communication between parallel tasks or GPUs

Mixed Programming Model
  NVIDIA GPU: CUDA
  Traditional programming model: MPI/UPC
  Combined: MPI+CUDA / UPC+CUDA
  [Figure: execution flow of a mixed program. On the host CPU: program start, device choosing, program initialization. Source data is copied in from main memory to device memory (cudaMemcpy), the computing start call launches the computing kernel on the GPU, and result data is copied out from device memory to main memory. Parallel tasks communicate through the MPI/UPC runtime, the communication interface of the upper programming model; the CUDA runtime handles the device.]

MPI+CUDA experiment
  Platform
    2 NF5588 servers, equipped with
      1 Xeon CPU (2.27 GHz), 12 GB main memory
      2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4 GB device memory)
    1 Gbit Ethernet
    RedHat Linux 5.3
    CUDA Toolkit 2.3 and CUDA SDK
    OpenMPI 1.3
    Berkeley UPC 2.1

MPI+CUDA experiment (cont'd)
  Matrix multiplication program
    Uses block matrix multiplication for UPC programming
    Data is spread across the UPC threads
    The computing kernel multiplies two blocks at a time and is implemented in CUDA
  Total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel
    Tcom: UPC thread communication time
    Tcuda: CUDA program execution time
    Tcopy: data transmission time between host and device
    Tkernel: GPU computing time

MPI+CUDA experiment (cont'd)
  Configurations: 2 servers with at most 8 MPI tasks; 1 server with 2 GPUs
  For 4096*4096, 1 MPI+CUDA task (using 1 GPU for computing) achieves a 184x speedup over the case with 8 MPI tasks.
  For small-scale data, such as 256 and 512, the execution time using 2 GPUs is even longer than using 1 GPU: the computing scale is too small, and the communication between the two tasks outweighs the reduction in computing time.
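The timing model Tsum = Tcom + Tcopy + Tkernel also explains why two GPUs can lose to one on small matrices. The sketch below evaluates that model in Python under stated assumptions: every machine parameter in `Platform` is a hypothetical placeholder rather than a measurement from the experiment, the work is assumed to split evenly across tasks, and the communication cost is approximated as a fixed latency plus the transfer of one matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical machine parameters (not measured values from the talk)."""
    gpu_flops: float = 1.0e11         # assumed sustained FLOP/s per GPU
    pcie_bytes_per_s: float = 5.0e9   # assumed host<->device bandwidth
    task_bytes_per_s: float = 1.0e9   # assumed task-to-task bandwidth
    task_latency_s: float = 1.0e-3    # assumed fixed cost of the exchange


def total_time(n: int, tasks: int, p: Platform = Platform()) -> float:
    """Tsum = Tcom + Tcopy + Tkernel for an n x n single-precision multiply
    divided evenly over `tasks` GPU tasks."""
    matrix_bytes = 4.0 * n * n
    # Tkernel: 2*n^3 floating-point operations shared by the GPUs.
    t_kernel = 2.0 * n ** 3 / (p.gpu_flops * tasks)
    # Tcopy: two inputs in, one result out, split across the GPUs.
    t_copy = 3.0 * matrix_bytes / (p.pcie_bytes_per_s * tasks)
    # Tcom: zero for a single task; otherwise latency plus one matrix exchanged.
    t_com = 0.0 if tasks == 1 else p.task_latency_s + matrix_bytes / p.task_bytes_per_s
    return t_com + t_copy + t_kernel


if __name__ == "__main__":
    for n in (256, 512, 1024, 4096):
        one, two = total_time(n, 1), total_time(n, 2)
        print(f"n={n:5d}  1 GPU: {one:.6f} s  2 GPUs: {two:.6f} s")
```

With these placeholder values the fixed Tcom term dominates at sizes 256 and 512, while at 4096 the halved Tkernel dominates; the crossover point depends entirely on the assumed parameters.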
PKU Manycore Software Research Group
  Software tool development for GPU clusters
  Software porting service
  Unified multicore/manycore/clustering programming
  Resilience technology for very large GPU clusters
  Joint project, <3k-line code, supporting Tianhe
  Advanced training program

PKU-Tianhe Turbulence Simulation
  Reached a scale 43 times higher than that achieved on the Earth Simulator
  7168 nodes / 14336 CPUs / 7168 GPUs
  FFT speed: 1.6X that of Jaguar
  Proof of feasibility of GPU speedup for large-scale systems
  [Figure: FFT comparison; series: PKUFFT (using GPUs), MKL (not using GPUs), Jaguar.]

Advanced Compiler Technology

Advanced Compiler Technology (ACT) Group at the ICT, CAS
  ACT's current research
    Parallel programming languages and models
    Optimized compilers and tools for HPC (Dawning) and multicore processors (Loongson)
  Will lead the new multicore/many-core programming support project

PTA: Process-based TAsk parallel programming model
  New process-based task construct
    With properties of isolation, atomicity and deterministic submission
  Annotate a loop into two parts, prologue and task segment
    #pragma pta parallel [clauses]
    #pragma pta task
    #pragma pta propagate (varlist)
  Suitable for expressing coarse-grained, irregular parallelism on loops
  Implementation and performance
    PTA compiler, runtime system and assistant tool (helps in writing correct programs)
    Speedup: 4.62 to 43.98 (average 27.58 on 48 cores); 3.08 to 7.83 (average 6.72 on 8 cores)
    Code changes are within 10 lines, far fewer than with OpenMP

UPC-H: A Parallel Programming Model for Deep Parallel Hierarchies
  Hierarchical UPC
    Provide multi-level data distribution
    Implicit and explicit hierarchical loop parallelism
    Hybrid execution model: SPMD with fork-join
    Multi-dimensional data distribution and super-pipelining
  Implementations on CUDA clusters and Dawning 6000 cluster
    Based on Berkeley UPC
    Enhanced optimizations such as localization and communication optimization
    Support SIMD intrinsics
    CUDA cluster: 72% of the hand-tuned version's performance, code reduction to 68%
    Multi-core cluster: better process mapping and cache reuse than UPC

OpenMP and Runtime Support for Heterogeneous Platforms
  Heterogeneous platforms consisting of CPUs and GPUs
  OpenMP extension
    Specify partitioning ratio to optimize data transfer globally
    Specify heterogeneous blocking sizes to reduce false sharing among computing devices
  Runtime support
    Multiple GPUs, or CPU-GPU cooperation, bring extra data transfer that hurts the performance gain
    Programmers need a unified data management system
    DSM system based on the specified blocking size
    Intelligent runtime prefetching with the help of compiler analysis
  Implementation and results
    On the OpenUH compiler
    Gains 1.6X speedup through prefetching on NPB/SP (class C)

Analyzers based on Compiling Techniques for MPI programs
  Communication slicing and process mapping tool
    Compiler part
    Optimized mapping tool
      Weighted graph, hardware characteristics
      Graph partitioning and feedback-based evaluation
  Memory bandwidth measuring tool for MPI programs
    PDG graph building and slicing generation
    Iteration set transformation for approximation
    Detect bursts of bandwidth requirements
  Enhance the performance of MPI error checking
    Redundant error checking removal by dynamically turning the global error checking on and off
    With the help of compiler analysis on communicators
    Integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT)

LoongCC: An Optimizing Compiler for Loongson Multicore Processors
  Based on Open64-4.2 and supporting C/C++/Fortran
  Powerful optimizer and analyzer with better performance
  Open source at http://svn.open64.net/svnroot/open64/trunk/
  SIMD intrinsic support
  Memory locality optimization
    Data layout optimization
    Data prefetching
  Load/store grouping for 128-bit memory access instructions
  Integrated with the Aggressive Auto Parallelization Optimization (AAPO) module
    Dynamic privatization
    Parallel model with dynamic alias optimization
    Array reduction optimization

Tools

Testing and evaluation of HPC systems
  A center led by Tsinghua University (Prof. Wenguang Chen)
  Developing accurate and efficient testing and evaluation tools
  Developing benchmarks for HPC evaluation
  Providing services to HPC developers and users

LSP3AS: large-scale parallel program performance analysis system
  Designed for performance tuning on peta-scale HPC systems
  Method:
    Source code is instrumented
    Instrumented code is executed, generating profiling and tracing data files
    The profiling and tracing data is analyzed and a visualization report is generated
  Instrumentation: based on TAU from the University of Oregon
  [Figure: process of performance analysis. Traditional process: source code, TAU instrumentation (measurement API), instrumented code, compiler/linker (external libraries, RDMA library), executable datafile, environment, performance datafile, profiling tools and tracing tools, visualization and analysis. Innovations: dynamic compensation, RDMA transmission and buffer management, clustering analysis based on iteration, clustering visualization based on hierarchy classification, analysis based on hierarchical clustering. Arrows show the dependency of each step.]

LSP3AS: large-scale parallel program performance analysis system
  Scalable performance data collection
    Distributed data collection and transmission: eliminate bottlenecks in network and data processing
    Dynamic compensation: reduce the influence of performance data volume
    Efficient data transmission: use Remote Direct Memory Access (RDMA) to achieve high bandwidth and low latency
  [Figure: on each compute node, user processes write to shared memory and sender threads transmit the data over RDMA to receiver threads on the I/O nodes (Lustre client or GFS), which are connected to the storage system over FC.]

LSP3AS: large-scale parallel program performance analysis system
  Analysis & visualization
    Data analysis: iteration-based clustering is used
    Visualization: clustering visualization based on hierarchy classification

SimHPC: Parallel Simulator
  Challenge for HPC simulation: performance
    Target system: >1,000 nodes and processors
    Difficult for traditional architecture simulators, e.g. Simics
  Our solution
    Parallel simulation
      Using a cluster to simulate a cluster
    Use the same node in the host system as in the target
      Advantage: no need to model and simulate detailed components, such as processor pipelines and caches
    Execution-driven, full-system simulation, supporting execution of Linux and applications, including benchmarks (e.g. Linpack)

SimHPC: Parallel Simulator (cont'd)
  Analysis
    Execution time of a process in the target system is composed of:
      Tprocess = Trun + TIO + Tready
      Trun: execution time of instruction sequences (equal to host; can be obtained in the Linux kernel)
      TIO: I/O blocking time, such as r/w files, send/recv msgs (unequal to host; needs to be simulated)
      Tready: waiting time in ready-state (unequal to host; needs to be re-calculated)
    So our simulator needs to:
      ① Capture system events
        process scheduling
        I/O operations: read/write files, MPI send()/recv()
      ② Simulate I/O and interconnection network subsystems
      ③ Synchronize timing of each application process

SimHPC: Parallel Simulator (cont'd)
  System architecture
    Application processes of multiple target nodes are allocated to one host node
      number of host nodes << number of target nodes
    Events are captured on the host node while the application is running
    Events are sent to the central node for time analysis, synchronization, and simulation
  [Figure: each host node (host Linux on the host hardware platform) runs the parallel application processes of several target nodes plus an event capture component. A central node performs event collection, control, analysis and time-axis synchronization, and architecture simulation (interconnection network simulator, disk I/O simulator), and produces the simulation results.]

SimHPC: Parallel Simulator (cont'd)
  Experiment results
    Host: 5 IBM Blade HS21 (2-way Xeon)
    Target: 32 to 1024 nodes
    OS: Linux
    App: Linpack HPL
  [Figures: simulation slowdown; simulation error test; Linpack performance for fat-tree and 2D-mesh interconnection; communication time for fat-tree and 2D-mesh interconnection.]

System-level Power Management
  Power-aware job scheduling algorithm
    Suspend a node if its idle time > threshold
    Wake up nodes if there are not enough nodes to execute jobs, while
    avoiding node thrashing between the busy and suspended states
    The algorithm is integrated into OpenPBS

System-level Power Management (cont'd)
  Power management tool
    Monitor the power-related status of the system
    Reduce runtime power consumption of the machine
    Multiple power management policies
      Manual control
      On-demand control
      Suspend-enable
      …
  [Figure: layers of power management. Policy level: power management policies. Management/interface level: power management software and interfaces. Node level: power management agent in each node (node sleep/wakeup, node on/off, CPU frequency control, fan speed control, power control of I/O equipment). Control and monitoring connect the layers.]

System-level Power Management (cont'd)
  Power management test
    On 5 IBM HS21 blades
  [Figure: power measurement system exchanging commands, status, and power data.]

  Task load          Policy      Task exec.   Power             Performance   Power
  (tasks per hour)               time (s)     consumption (J)   slowdown      saving
  20                 On-demand   3.55         1778077           5.15%         -1.66%
  20                 Suspend     3.60         1632521           9.76%         -12.74%
  200                On-demand   3.55         1831432           4.62%         -3.84%
  200                Suspend     3.65         1683161           10.61%        -10.78%
  800                On-demand   3.55         2132947           3.55%         -7.05%
  800                Suspend     3.66         2123577           11.25%        -9.34%

  Power management test for different task loads (compared to no power management)

Domain specific programming support

Parallel Computing Platform for Astrophysics
  Joint work
    Shanghai Astronomical Observatory, CAS (SHAO)
    Institute of Software, CAS (ISCAS)
    Shanghai Supercomputer Center (SSC)
  Build a high performance parallel computing software platform for astrophysics research, focusing on planetary fluid dynamics and N-body problems
  New parallel computing models and parallel algorithms are studied, validated and adopted to achieve high performance.

Architecture
  [Figure: layers of the platform. Web portal on CNGrid. Software platform for astrophysics: data processing, scientific visualization. Physical and mathematical model. Numerical methods and parallel computing model: fluid dynamics (PETSc, Aztec, improved preconditioner) and N-body problem (FFTW, GSL, SpMV, improved library for collective communication). Software development: MPI, OpenMP, Fortran, C. Hardware: 100T supercomputer, Lustre.]

PETSc Optimized (Speedup 15-26)
  Method 1: Domain decomposition ordering method for field coupling
  Method 2: Preconditioner for domain decomposition method
  Method 3: PETSc multi-physics data structure
  [Figure: left, mesh 128 x 128 x 96; right, mesh 192 x 192 x 128.]
  Computation speedup: 15-26
  Strong scalability: original code normal, new code ideal
  Test environment: BlueGene/L at NCAR (HPCA2009)

Strong Scalability on TianHe-1A
  [Figure: strong scalability results on TianHe-1A.]

CLeXML Math Library
  [Figure: library structure. Labels: task parallel, multi-core parallel, self-adaptive tuning, iterative solver, LAPACK, BLAS, FFT, computational model, CPU; techniques: self-adaptive tuning, instruction reordering, software pipelining, …]
  [Figure: BLAS2 performance, MKL vs. CLeXML.]

HPC Software support for Earth System Modeling
  Led by Tsinghua University
    Tsinghua
    Beihang University
    Jiangnan Computing Institute
    Peking University
    …
  Part of the national effort on climate change study

Earth System Model Development Workflow
  [Figure: workflow. Development wizard and editor produce source code; compiler/debugger/optimizer produce the executable; together with the (parallel) algorithm, other data, and the initial field and boundary condition, it runs in the running environment; computation output goes to result evaluation and result visualization. Supporting components: data visualization and analysis tools, data management subsystem, standard data set.]

Major research activities
  Subproject I: Efficient integration and management of massive heterogeneous data
  Subproject II: Fast visualization of massive data; analysis and diagnosis of model data
  Subproject III: MPMD program debugging, analysis and high-availability technologies
  Subproject IV: Integrated development environment (IDE) and demonstrative applications for earth system model

Expected Results
  [Figure: demonstrative applications (research on global change, model application systems) built on an integrated high performance computing environment for earth system model. Development tools: data conversion, diagnosis, debugging, performance analysis, high availability. Existing tools: compiler, system monitor, version control, editor. Resources: software standards, international resources, template library, module library. Base: high performance computers in China.]

Potential cooperation areas
  Software for exa-scale computer systems
    Power
    Performance
    Programmability
    Resilience
  CPU/GPU hybrid programming
  Parallel algorithms and parallel program frameworks
  Large scale parallel applications support
  Applications requiring ExaFlops computers

Thank you!