Efforts on Programming Environment and Tools in China's High-tech R&D Program
Depei Qian
Sino-German Joint Software Institute (JSI), Beihang University
Email: depeiq@buaa.edu.cn
August 1, 2011, CScADS tools workshop

China's High-tech Program
- The National High-tech R&D Program (863 Program) was proposed by four senior Chinese scientists and approved by former leader Mr. Deng Xiaoping in March 1986.
- It is one of the most important national science and technology R&D programs in China.
- It is now a regular national R&D program planned in five-year terms; the term just finished corresponds to the 11th five-year plan.

863 key projects on HPC and Grid
- "High performance computer and core software"
  - 4-year project, May 2002 to Dec. 2005
  - 100 million Yuan funding from the MOST, with more than 2x associated funding from local governments, application organizations, and industry
  - Outcome: China National Grid (CNGrid)
- "High productivity computer and Grid service environment"
  - Period: 2006-2010
  - 940 million Yuan from the MOST and more than 1 billion Yuan in matching money from other sources

HPC development (2006-2010)
- First phase: two 100 TFlops machines
  - Dawning 5000A for SSC
  - Lenovo DeepComp 7000 for the SC of CAS
- Second phase: three 1000 TFlops machines
  - Tianhe-1A: CPU+GPU, NUDT / Tianjin Supercomputing Center
  - Dawning 6000: CPU+GPU, ICT / Dawning / South China Supercomputing Center (Shenzhen)
  - Sunway: CPU only, Jiangnan / Shandong Supercomputing Center

CNGrid development: 11 sites
- CNIC, CAS (Beijing, major site)
- Shanghai Supercomputer Center (Shanghai, major site)
- Tsinghua University (Beijing)
- Institute of Applied Physics and Computational Mathematics (Beijing)
- University of Science and Technology of China (Hefei, Anhui)
- Xi'an Jiaotong University (Xi'an, Shaanxi)
- Shenzhen Institute of Advanced Technology (Shenzhen, Guangdong)
- Hong Kong University (Hong Kong)
- Shandong University (Jinan, Shandong)
- Huazhong University of Science and Technology (Wuhan, Hubei)
- Gansu Provincial Computing Center
The CNGrid Operation Center is based at CNIC, CAS.

CNGrid GOS architecture (summary of the architecture figure)
- Hosting environment: PC servers (grid servers) running Linux/Unix/Windows, with Tomcat 5.0.28 + Axis 1.2rc2 on J2SE 1.4.2_07/1.5.0_07; interoperates with other hosting software such as GT4, gLite and OMII.
- Core layer: naming, Agora management (security, resource space, resource access control and sharing, user management), grip runtime and grip instance management, core resource management, the service controller and other resource controllers.
- System and application level services: GOS system calls (resource, Agora, user and grip management) and the GOS library (batch, message, file, etc.); batch job, file, metainfo and account management; the metascheduler, message service, dynamic deploy service, DataGrid, GridWorkflow with its workflow engine, database service, CA service, and Axis handlers for message-level security; an HPCG backend.
- Tools and applications: the grid portal, Gsh and command-line tools, GSML browser/composer/workshop, IDE, debugger, compiler, VegaSSH, the HPCG application and management portal, the system management portal, and other domain-specific applications built with third-party software and tools.
JASMIN: a parallel programming framework
Contact: Prof. Zeyao Mo, IAPCM, Beijing, zeyao_mo@iapcm.ac.cn

Basic ideas
- JASMIN is a parallel middleware infrastructure for scientific computing: it separates common models, stencils and algorithms from special application codes, extracts the data dependencies to form its data structures, and supports parallel computing models, communications and load balancing on modern computers.
- It hides parallel programming over millions of cores and the hierarchy of parallel computers.
- It integrates efficient implementations of parallel fast numerical algorithms.
- It provides efficient data structures and solver libraries.
- It supports software engineering for code extensibility.
- The goal is to let codes written in a serial style on a personal computer scale to teraflops clusters and petaflops MPPs.

JASMIN (J parallel Adaptive Structured Mesh INfrastructure)
- http://www.iapcm.ac.cn/jasmin, software registration 2010SR050446, developed from 2003 to now.
- Application areas: inertial confinement fusion, global climate modeling, CFD, material simulations, particle modeling and simulation.
- Mesh support: structured grids (unstructured grids are also mentioned).

JASMIN architecture (V. 2.0)
- The user provides physics, parameters, numerical methods, expert experience, special algorithms, etc.
- User interfaces: component-based parallel programming models exposed as C++ classes (an illustrative sketch is given at the end of this JASMIN overview).
- Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc.
- HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory management, etc.
- Architecture: multilayered, modularized, object-oriented.
- Code base: C++/C/F90/F77 with MPI/OpenMP, about 500,000 lines.
- Installation: personal computers, clusters, MPPs.

Inertial confinement fusion applications (2004-now)
- 13 ICF application codes, 46 researchers, developed concurrently, combining different numerical methods, physical parameters and expert experience within the simulation cycle.
- JASMIN hides the parallel computing and adaptive implementations over tens of thousands of CPU cores, provides efficient data structures, algorithms and solvers, and supports software engineering for code extensibility.

Numerical simulations on TianHe-1A (simulation durations range from several hours to tens of hours)

  Code       # CPU cores    Code               # CPU cores
  LARED-S    32,768         RH2D               1,024
  LARED-P    72,000         HIME3D             3,600
  LAP3D      16,384         PDD3D              4,096
  MEPH3D     38,400         LARED-R            512
  MD3D       80,000         LARED Integration  128
  RT3D       1,000

Code evolution from 2004 to 2010
- LARED-H (2-D radiation hydrodynamics Lagrangian code): serial, single-block, without capsule (2004) -> parallel, multiblock, NIF ignition target (2010)
- LARED-R (2-D radiation transport code): serial (2004) -> parallel on 2,048 cores, scaled up by a factor of about 1000 (2010)
- LARED-S (3-D radiation hydrodynamics Eulerian code): MPI, 2-D single group, 3-D without radiation (2004) -> parallel on 32,768 cores with single-level SAMR, 2-D multi-group diffusion, 3-D multigroup radiation diffusion (2010)
- LARED-P (3-D laser plasma interaction code): MPI (2004) -> parallel on 36,000 cores with terascale numbers of particles (2010)
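The slides describe the user interface only as component-based C++ classes, so the fragment below is a hedged sketch of that programming style rather than the real JASMIN API: the class names (Patch, HeatEquationComponent) and the driver are invented for illustration. The user writes only the patch-local numerical kernel; mesh distribution, ghost-cell communication, load balancing and adaptivity would be handled by the framework.

```
// Hedged sketch only: JASMIN's real class names and interfaces are not shown
// in the slides; Patch and HeatEquationComponent are invented stand-ins for
// the component-based C++ programming style described above.
#include <cstddef>
#include <cstdio>
#include <vector>

// A patch of a structured mesh owned by one process (ghost cells omitted).
struct Patch {
  std::size_t nx, ny;
  std::vector<double> u, u_new;
  Patch(std::size_t x, std::size_t y) : nx(x), ny(y), u(x * y, 0.0), u_new(x * y, 0.0) {}
};

// The user supplies only patch-local numerics; the framework would own the
// mesh hierarchy, parallel distribution, communication and load balancing.
class HeatEquationComponent {
public:
  explicit HeatEquationComponent(double alpha) : alpha_(alpha) {}

  // Called by the framework once per patch per time step.
  void computeOnPatch(Patch& p, double dt) const {
    for (std::size_t j = 1; j + 1 < p.ny; ++j)
      for (std::size_t i = 1; i + 1 < p.nx; ++i) {
        const std::size_t k = j * p.nx + i;
        const double lap = p.u[k - 1] + p.u[k + 1] + p.u[k - p.nx] + p.u[k + p.nx] - 4.0 * p.u[k];
        p.u_new[k] = p.u[k] + alpha_ * dt * lap;
      }
    p.u.swap(p.u_new);
  }

private:
  double alpha_;  // diffusion coefficient
};

int main() {
  Patch patch(64, 64);                 // one local patch, stand-in for a distributed mesh
  patch.u[32 * 64 + 32] = 1.0;         // point source
  HeatEquationComponent heat(0.1);
  for (int step = 0; step < 100; ++step) heat.computeOnPatch(patch, 0.1);
  std::printf("center value after 100 steps: %g\n", patch.u[32 * 64 + 32]);
  return 0;
}
```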
GPU programming support and performance optimization
Contact: Prof. Xiaoshe Dong, Xi'an Jiaotong University
Email: xsdong@xjtu.edu.cn

GPU program optimization
- Three levels of GPU program optimization are addressed: the memory-access level, the kernel-speedup level, and the data-partition level.

Source-to-source translation for GPU
- A source-to-source translator, GPU-S2S, was developed to facilitate the development of parallel programs on GPUs by combining automatic mapping with static compilation.
- Directives inserted into the source program guide the implicit calling of the CUDA runtime libraries and let the user control the mapping of compute-intensive code from the homogeneous CPU platform to the GPU's streaming platform.
- Optimization is based on runtime profiling: runtime dynamic information is collected so that the GPU can be fully exploited according to the characteristics of the application.

The GPU-S2S architecture (summary of the architecture figure)
- GPU-S2S forms a software-productivity layer beneath the PGAS, MPI message-passing and Pthread thread models. It draws on profile information, a GPU supporting library, called shared libraries and the user's standard libraries, and sits above a runtime layer for performance collection and discovery on the operating system and GPU platform.

Program translation by GPU-S2S (summary of the before/after figure)
- Before translation, the source code is a homogeneous-platform program framework: user-defined code plus compute-intensive functions annotated with directives, calling shared libraries and a template library of profile-optimized compute-intensive kernels.
- After translation, the code follows the GPU streaming-architecture framework: a CPU control program plus GPU kernel programs generated from the templates, a general-purpose computing library interface, calls to shared libraries, and the user's standard libraries.

Runtime optimization based on profiling
- Profiling proceeds in three levels: first-level profiling at the function level, second-level profiling for memory access and kernel improvement, and third-level profiling for data partition.
- Workflow: the homogeneous platform code (*.c, *.h) is pretreated; dynamic instrumentation at each level extracts profile information — the computing kernel at the first level; the data block size, shared-memory configuration parameters and whether streams can be used at the second level; the number of streams and the data size of each stream at the third level. Directives are inserted automatically, CUDA code with an optimized kernel is generated (and, when further optimization is needed, CUDA code using streams), and the resulting *.cu/*.c/*.h files are compiled with the CUDA toolchain into executable GPU code.

First-level profiling
- The translator scans the source code before translation, inserts instrumentation before and after every function, computes the execution time of each function, and thereby identifies the computing kernels.

Second-level profiling
- GPU-S2S scans the code and inserts instrumentation at the corresponding places in the computing kernels, extracts profile information, analyzes the code and applies optimizations, expanding the templates according to the features of the application, and finally generates CUDA code with optimized kernels.
- Using shared memory is a general optimization here; it involves 13 configuration parameters, and performance varies with their values (a generic shared-memory kernel of this kind is sketched below).
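The slides do not show the kernels that GPU-S2S generates, so the following is only a generic illustration of the kind of shared-memory kernel that the second profiling level tunes (block size and shared-memory configuration); the tile size, kernel name and launch parameters are assumptions, not GPU-S2S output.

```
// Illustrative only: a standard shared-memory tiled matrix multiply, not code
// generated by GPU-S2S. TILE, the kernel name and the launch parameters are
// assumptions; n is assumed to be a multiple of TILE.
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  for (int t = 0; t < n / TILE; ++t) {
    // Stage one tile of A and one tile of B in shared memory to cut
    // global-memory traffic before the inner product over the tile.
    As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
    Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
    __syncthreads();
    for (int k = 0; k < TILE; ++k)
      acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();
  }
  C[row * n + col] = acc;
}

// Launch sketch:
//   dim3 block(TILE, TILE);
//   dim3 grid(n / TILE, n / TILE);
//   matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
```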
Third-level profiling
- The generated CUDA control code allocates host and global device memory, and for each computing kernel performs copy-in, kernel execution and copy-out before freeing the memory. GPU-S2S finds each computing kernel and its copy functions, inserts instrumentation at the corresponding places, and measures the copy time and the computing time. From these times it computes the number of streams and the data size of each stream, and finally generates optimized CUDA code that uses streams (a stream-based sketch appears after the experiment results below).

Verification and experiments
- Platform: a server with a 4-core Xeon CPU and 12 GB memory plus an NVIDIA Tesla C1060; Red Hat Enterprise Linux Server 5.3; CUDA 2.3.
- Test examples: matrix multiplication and the fast Fourier transform (FFT).
- Matrix multiplication (input sizes 1024-8192; versions compared: global memory only, memory-access optimization, second-level profile optimization, third-level profile optimization): the CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory-access optimization, and a 91% improvement over the CUDA code using only global memory. Execution performance was also compared against the CPU on the same input sizes.
- FFT (1,048,576 points, batch sizes 15-60): the CUDA code after three-level profile optimization achieves a 38% improvement over the CUDA code with memory-access optimization, and a 77% improvement over the CUDA code using only global memory. Execution performance was also compared against the CPU.
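As a hedged illustration of the stream pattern that the third (data-partition) level produces — splitting the input into chunks so that host-device copies overlap with kernel execution — here is a minimal sketch; the stream count, chunk size and the placeholder kernel are assumptions rather than generated code. For the asynchronous copies to actually overlap, the host buffers would need to be allocated as pinned memory (cudaMallocHost).

```
// Illustrative pattern only (not GPU-S2S output): split the input into chunks
// and use CUDA streams so host-to-device copies overlap with kernel execution.
// NUM_STREAMS, process_chunk and the chunk size are assumptions of the sketch.
#include <cuda_runtime.h>

__global__ void process_chunk(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * in[i];          // placeholder computation
}

void run_streamed(const float* h_in, float* h_out, int total) {
  const int NUM_STREAMS = 4;                 // chosen from profiled copy/compute times
  const int chunk = total / NUM_STREAMS;     // assume total divides evenly
  float *d_in, *d_out;
  cudaMalloc(&d_in,  total * sizeof(float));
  cudaMalloc(&d_out, total * sizeof(float));

  cudaStream_t streams[NUM_STREAMS];
  for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamCreate(&streams[s]);

  for (int s = 0; s < NUM_STREAMS; ++s) {
    int off = s * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    process_chunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off, chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
  }
  for (int s = 0; s < NUM_STREAMS; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaStreamDestroy(streams[s]);
  }
  cudaFree(d_in);
  cudaFree(d_out);
}
```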
Programming multi-GPU systems
- The traditional programming models, MPI and PGAS, are not directly suited to the new CPU+GPU platforms, and legacy applications cannot exploit the power of the GPUs.
- The approach is to combine a traditional programming model with the GPU-specific programming model into a mixed programming model, giving better performance on the CPU-GPU architecture and more efficient use of its computing power.
- The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models on this kind of system: MPI tasks exchange message data between private address spaces, PGAS threads share data in a partitioned shared space, and each task or thread drives its own GPU through device memory. Message passing or shared data is thus used for communication between parallel tasks and their GPUs.

Mixed programming model
- NVIDIA GPUs are programmed with CUDA; the traditional model is MPI or UPC; the combinations are MPI+CUDA and UPC+CUDA.
- Execution flow: the program starts on the host CPU, chooses a device and initializes it; the MPI/UPC runtime provides the communication interface between parallel tasks; within each task, source data is copied from host memory to device memory (cudaMemcpy), the CUDA computing kernel runs on the GPU, and the result data is copied back before the program ends.
- The primary control of the application is implemented in MPI or UPC; the computing kernels are implemented in CUDA so that the GPU accelerates the computation, and optimizing the kernels makes better use of the GPUs.
- GPU-S2S can be used to generate the computing kernel program, hiding the CPU+GPU heterogeneity and improving the portability of the application.
- Compilation: the primary control program, which declares the computing kernels, is compiled with mpicc/upcc; the computing kernel program is compiled with nvcc; the two are linked with nvcc and run with mpirun/upcrun (a minimal single-file sketch of this structure appears after the experiment summary below).

MPI+CUDA experiments
- Platform: two NF5588 servers, each with one Xeon CPU (2.27 GHz), 12 GB memory and two NVIDIA Tesla C1060 GPUs (GT200 architecture, 4 GB device memory), connected by 1 Gbit Ethernet; RedHat Linux 5.3; CUDA Toolkit 2.3 and CUDA SDK; OpenMPI 1.3; Berkeley UPC 2.1.
- Matrix multiplication: block matrix multiplication is used for the UPC version, with the data spread over the UPC threads; the computing kernel multiplies two blocks at a time and is implemented in CUDA. The total execution time is Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel, where Tcom is the UPC thread communication time, Tcuda the CUDA program execution time, Tcopy the data transfer time between host and device, and Tkernel the GPU computing time.
- Results (2 servers, at most 8 MPI tasks; 1 server with 2 GPUs): for a 4096x4096 matrix, one MPI+CUDA task using a single GPU achieves a 184x speedup over the case with 8 MPI tasks. For small scales such as 256 and 512, using two GPUs is even slower than using one: the computation is too small, and the communication between the two tasks outweighs the reduction in computing time.
- For 8192x8192 and 16384x16384 matrices, Tcuda decreases as the number of tasks increases, but Tsum for 4 tasks is larger than for 2 tasks, because the Ethernet latency between the two servers is much higher than the latency of the bus inside one server. With larger problem sizes, or a faster network such as InfiniBand, multiple nodes with multiple GPUs would still improve application performance.
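The experiment code itself is not in the slides; the following single-file sketch only illustrates the mixed-model structure described above (MPI primary control, a CUDA kernel per task, one GPU per rank). The kernel, data decomposition and names are assumptions, and for simplicity the whole file would be compiled with nvcc plus the MPI compiler wrappers rather than the split mpicc/nvcc build described above.

```
// Minimal sketch of the MPI+CUDA structure described above; the kernel,
// variable names and decomposition are illustrative, not the authors' code.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float* x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;                 // stand-in for the real block computation
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // One GPU per MPI task (assumes ranks per node <= GPUs per node).
  int ngpu = 0;
  cudaGetDeviceCount(&ngpu);
  if (ngpu > 0) cudaSetDevice(rank % ngpu);

  const int n = 1 << 20;                // local chunk owned by this rank
  float* h = (float*)malloc(n * sizeof(float));
  for (int i = 0; i < n; ++i) h[i] = (float)rank;

  float* d;
  cudaMalloc(&d, n * sizeof(float));
  cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // Tcopy
  scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);                   // Tkernel
  cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

  // Tcom: exchange or reduce results among tasks via MPI.
  float local = h[0], global = 0.0f;
  MPI_Reduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("sum of first elements = %f\n", global);

  cudaFree(d); free(h);
  MPI_Finalize();
  return 0;
}
```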
Programming support and compilers
Contact: Prof. Xiaobing Feng, ICT, CAS, Beijing, fxb@ict.ac.cn

Advanced Compiler Technology (ACT) group at ICT, CAS
- The Institute of Computing Technology (ICT), founded in 1956, is the first and a leading institute on computing technology in China.
- ACT was founded in the early 1960s and has over 40 years of experience with compilers: compilers for most of the mainframes developed in China, compiler and binary translation tools for Loongson processors, and parallel compilers and tools for the Dawning series (SMP/MPP/cluster).
- Current research: parallel programming languages and models, and optimizing compilers and tools for HPC (Dawning) and multicore processors (Loongson).

PTA model (Process-based TAsk parallel programming model)
- A new process-based task construct with the properties of isolation, atomicity and deterministic submission.
- A loop is annotated into two parts, a prologue and a task segment, using
  #pragma pta parallel [clauses]
  #pragma pta task
  #pragma pta propagate (varlist)
- Suitable for expressing coarse-grained, irregular parallelism on loops (a hedged sketch of these directives is given at the end of this ACT overview).
- Implementation and performance: a PTA compiler, runtime system and assistant tool (to help write correct programs); speedups of 4.62 to 43.98 (average 27.58) on 48 cores and 3.08 to 7.83 (average 6.72) on 8 cores; the required code changes are within 10 lines, much smaller than for OpenMP.

UPC-H: a parallel programming model for deep parallel hierarchies
- Hierarchical UPC: multi-level data distribution, implicit and explicit hierarchical loop parallelism, a hybrid execution model (SPMD with fork-join), multi-dimensional data distribution and super-pipelining.
- Implemented on CUDA clusters and the Dawning 6000 cluster, based on Berkeley UPC, with enhanced optimizations such as localization and communication optimization, and support for SIMD intrinsics.
- On a CUDA cluster it reaches 72% of the hand-tuned version's performance with the code reduced to 68%; on a multicore cluster it achieves better process mapping and cache reuse than plain UPC.

OpenMP and runtime support for heterogeneous platforms
- Targets heterogeneous platforms consisting of CPUs and GPUs.
- OpenMP extensions: specify the partitioning ratio to optimize data transfer globally, and specify heterogeneous blocking sizes to reduce false sharing among computing devices.
- Runtime support: using multiple GPUs, or CPU-GPU cooperation, brings extra data transfer that hurts the performance gain, so programmers need a unified data management system; a DSM system based on the specified blocking sizes is provided, with intelligent runtime prefetching assisted by compiler analysis.
- Implementation and results: built on the OpenUH compiler; prefetching gains a 1.6x speedup on NPB/SP (class C).

Analyzers based on compiling techniques for MPI programs
- Communication slicing and process mapping tool: the compiler part builds the PDG, generates slices and applies iteration-set transformations for approximation; the optimized mapping tool uses a weighted graph and hardware characteristics, with graph partitioning and feedback-based evaluation.
- Memory bandwidth measuring tool for MPI programs: detects bursts of bandwidth requirements.
- Enhanced MPI error checking: redundant error checking is removed by dynamically turning the global error checking on and off, helped by compiler analysis of communicators; integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT).
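The three PTA directives above are the ones named on the slide; how the clauses, prologue and propagate list are actually written is not shown, so the loop below is only a hedged sketch of the intended usage, not an official PTA example (a standard C compiler will simply ignore the pragmas and run it serially).

```
/* Hedged sketch only: the three pragmas are those listed above; the clause
 * usage, loop body and variable names are assumptions of this illustration. */
#include <stdio.h>

#define N 1024

int main(void) {
  static double result[N];

  #pragma pta parallel            /* iterations may become isolated process-based tasks */
  for (int i = 0; i < N; ++i) {
    double seed = i * 0.5;        /* prologue: cheap setup before the task body */

    #pragma pta task              /* coarse-grained task segment, executed in isolation */
    {
      double acc = 0.0;
      for (int k = 0; k < 100000; ++k)   /* stand-in for irregular, expensive work */
        acc += seed * k * 1e-9;

      #pragma pta propagate(result)      /* deterministic submission of updates */
      result[i] = acc;
    }
  }

  printf("result[0] = %f\n", result[0]);
  return 0;
}
```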
LoongCC: an optimizing compiler for Loongson multicore processors
- Based on Open64-4.2 and supporting C/C++/Fortran, with a powerful optimizer and analyzer; open source at http://svn.open64.net/svnroot/open64/trunk/
- Features: SIMD intrinsic support, memory locality optimization, data layout optimization, data prefetching, and load/store grouping for 128-bit memory access instructions.
- Integrated with an Aggressive Auto-Parallelization Optimization (AAPO) module: dynamic privatization, a parallel model with dynamic alias optimization, and array reduction optimization.

DigitalBridge: a binary translation system for Loongson multicore processors
- Fully exploits hardware features of the Loongson CPUs: return instructions are handled with a shadow stack, Eflag operations with flag patterns, and the x86 FPU is emulated with local FP registers.
- Combines static and dynamic translation: handles indirect-jump tables, handles misaligned data accesses through dynamic profiling and an exception handler, improves data locality by pool allocation, and promotes stack variables.

Software tools for high performance computing
Contact: Prof. Yi Liu, JSI, Beihang University, yi.liu@jsi.buaa.edu.cn

LSP3AS: large-scale parallel program performance analysis system
- Designed for performance tuning on peta-scale HPC systems. The overall method is the common one: the source code is instrumented by inserting specified function calls; the instrumented code is executed while performance data are collected, producing profiling and tracing data files; the data are then analyzed and a visualization report is generated.
- Instrumentation is based on TAU from the University of Oregon. Beyond the traditional process (instrumentation and measurement API, compiling and linking with external libraries, running, then applying profiling, tracing and visualization tools), LSP3AS adds dynamic compensation, RDMA-based transmission and buffer management, and clustering analysis of the resulting performance data files.
- Scalability: with on the order of ten thousand nodes in a petascale system, massive performance data are generated, transmitted and stored, so the collection structure must scale: distributed data collection and transmission eliminate bottlenecks in the network and in data processing; a dynamic compensation algorithm reduces the influence of the performance data volume; and Remote Direct Memory Access (RDMA) is used for high-bandwidth, low-latency transmission. Senders on the compute nodes pass data from the user processes through shared memory and RDMA to receiver threads on I/O nodes, which write to the storage system through Lustre or GFS clients.
- Analysis and visualization: two approaches deal with the huge amount of data: an iteration-based clustering approach from data mining for the analysis (a simple grouping sketch is given below), and clustering visualization based on hierarchical classification for the presentation.
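The slides do not spell out the clustering algorithm, so the fragment below only illustrates the general idea of iteration-based clustering: processes whose per-iteration timing vectors are close are grouped together, so that thousands of per-process profiles collapse into a few representative classes. The distance measure, threshold and sample data are assumptions.

```
// Illustrative only: groups processes whose per-iteration timing vectors are
// close (Euclidean distance below a threshold), a simple stand-in for the
// iteration-based clustering described above.
#include <cmath>
#include <cstdio>
#include <vector>

using Profile = std::vector<double>;   // per-iteration times of one process

static double distance(const Profile& a, const Profile& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(s);
}

// Assign each profile to the first cluster whose representative is within eps,
// otherwise start a new cluster.
static std::vector<int> cluster(const std::vector<Profile>& procs, double eps) {
  std::vector<Profile> reps;
  std::vector<int> label(procs.size());
  for (std::size_t p = 0; p < procs.size(); ++p) {
    int found = -1;
    for (std::size_t c = 0; c < reps.size() && found < 0; ++c)
      if (distance(procs[p], reps[c]) < eps) found = (int)c;
    if (found < 0) { reps.push_back(procs[p]); found = (int)reps.size() - 1; }
    label[p] = found;
  }
  return label;
}

int main() {
  std::vector<Profile> procs = {
    {1.0, 1.1, 1.0}, {1.0, 1.0, 1.1},   // two "fast" processes
    {2.5, 2.6, 2.4}                     // one slow outlier worth inspecting
  };
  std::vector<int> label = cluster(procs, 0.5);
  for (std::size_t p = 0; p < procs.size(); ++p)
    std::printf("process %zu -> cluster %d\n", p, label[p]);
  return 0;
}
```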
SimHPC: a parallel simulator
- The challenge for HPC simulation is performance: with target systems of more than 1,000 nodes and processors, traditional architecture simulators such as Simics struggle.
- Solution: parallel simulation, using the same kind of node in the host system as in the target — using a cluster to simulate a cluster. The basis is that HPC systems use commercial processors (and even the same blades are available for the simulator), so the execution time of an instruction sequence is the same on host and target; processes make things a little more complicated, as discussed below.
- Advantage: there is no need to model and simulate detailed components such as processor pipelines and caches. The simulator is execution-driven, performs full-system simulation, and supports the execution of Linux and applications, including benchmarks such as Linpack.
- Analysis: the execution time of a process in the target system is composed of T_process = T_run + T_IO + T_ready, where T_run is the execution time of the instruction sequences (equal to the host, obtainable from the Linux kernel), T_IO is the I/O blocking time, such as reading/writing files and sending/receiving messages (which must be simulated), and T_ready is the time spent waiting in the ready state (not equal to the host, so it must be recalculated).
- The simulator therefore has to (1) capture system events — process scheduling and I/O operations such as file reads/writes and MPI send()/recv(); (2) simulate the I/O and interconnection-network subsystems; and (3) synchronize the timing of each application process.
- System architecture: the application processes of multiple target nodes are allocated to one host node (the number of host nodes is much smaller than the number of target nodes); events are captured on the host nodes while the application runs and are sent to a central node for event collection, analysis, time-axis synchronization, and simulation of the interconnection network and disk I/O, producing the simulation results.
- Experiment results: host: 5 IBM HS21 blades (2-way Xeon); target: 32 to 1,024 nodes; OS: Linux; application: Linpack HPL. The experiments measured simulation slowdown and simulation error, and compared Linpack performance and communication time for fat-tree and 2D-mesh interconnects.

System-level power management
- Power-aware job scheduling algorithm: (1) suspend a node if its idle time exceeds a threshold; (2) wake up nodes when there are not enough nodes to execute jobs; and (3) avoid nodes thrashing between the busy and suspended states, since suspend and wakeup operations themselves consume power — do not wake up a suspended node if it has only just gone to sleep. The algorithm is integrated into OpenPBS (a sketch of the policy is given below, after the management tool).
- Power management tool: monitors the power-related status of the system and reduces its runtime power consumption, with multiple power management policies: manual control, on-demand control, suspend-enable, and others.
- The tool is layered: a policy level (the power management policies), a management/interface level (the power management software and interfaces), and a node level (a power management agent on each node handling node sleep/wakeup, node on/off, CPU frequency control, fan speed control and power control of I/O equipment), exchanging commands, status and power data for control and monitoring.
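The following is a hedged sketch of the suspend/wakeup policy described above (idle threshold, minimum sleep time to avoid thrashing); the thresholds, data structures and scheduler hook are assumptions, not the actual OpenPBS integration.

```
// Hedged sketch of the power-aware scheduling policy described above: suspend
// nodes idle longer than a threshold, wake them only when jobs lack nodes, and
// never wake a node that has just gone to sleep. All names and thresholds are
// assumptions of the sketch.
#include <vector>

enum class State { Busy, Idle, Suspended };

struct Node {
  State state = State::Idle;
  double idle_since = 0.0;       // when the node last became idle
  double suspended_at = 0.0;     // when the node was last suspended
};

struct Policy {
  double idle_threshold = 300.0;   // seconds of idleness before suspending
  double min_sleep_time = 600.0;   // do not wake a node suspended more recently than this
};

// Called periodically by the scheduler with the current time and node demand.
void manage_power(std::vector<Node>& nodes, double now, int nodes_needed, const Policy& p) {
  int available = 0;
  for (const Node& n : nodes)
    if (n.state != State::Suspended) ++available;

  for (Node& n : nodes) {
    if (n.state == State::Idle && now - n.idle_since > p.idle_threshold &&
        available > nodes_needed) {
      n.state = State::Suspended;          // enough capacity: put the node to sleep
      n.suspended_at = now;
      --available;
    } else if (n.state == State::Suspended && available < nodes_needed &&
               now - n.suspended_at > p.min_sleep_time) {
      n.state = State::Idle;               // demand exceeds capacity: wake it up
      n.idle_since = now;
      ++available;
    }
  }
}
```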
Power management test
- Run on the 5 IBM HS21 blades with a power measurement system; results are compared to running with no power management.

  Task load (tasks/hour)  Policy     Task exec. time (s)  Power consumption (J)  Performance slowdown  Power saving
  20                      On-demand  3.55                 1,778,077              5.15%                 -1.66%
  20                      Suspend    3.60                 1,632,521              9.76%                 -12.74%
  200                     On-demand  3.55                 1,831,432              4.62%                 -3.84%
  200                     Suspend    3.65                 1,683,161              10.61%                -10.78%
  800                     On-demand  3.55                 2,132,947              3.55%                 -7.05%
  800                     Suspend    3.66                 2,123,577              11.25%                -9.34%

Parallel programming platform for astrophysics
Contact: Yunquan Zhang, ISCAS, Beijing, zyq@mail.rdcps.ac.cn

Parallel computing software platform for astrophysics
- Joint work of the Shanghai Astronomical Observatory, CAS (SHAO), the Institute of Software, CAS (ISCAS), and the Shanghai Supercomputer Center (SSC).
- Goal: build a high performance parallel computing software platform for astrophysics research, focusing on planetary fluid dynamics and N-body problems; new parallel computing models and parallel algorithms are studied, validated and adopted to achieve high performance.
- Software architecture: a web portal on CNGrid sits over the platform's data processing and scientific visualization services; the physical and mathematical models are mapped onto numerical methods — PETSc and Aztec with an improved preconditioner for fluid dynamics, and FFTW, SpMV and an improved collective communication library for the N-body problem — implemented with MPI, OpenMP, Fortran, C and the GSL on a 100 TFlops supercomputer with Lustre.

PETSc optimized version 1 (speedup 4-6)
- The first PETSc optimized version for the astrophysics numerical simulation has been finished, and an early performance evaluation of the Aztec and PETSc codes was carried out on Dawning 5000A (16 to 2,048 processor cores).
- For the 80x80x50 mesh, the execution time of the Aztec program is 4-7 times that of the PETSc version, 6 times on average; for the 160x160x100 mesh it is 2-5 times, 4 times on average.

PETSc optimized version 2 (speedup 15-26)
- Method 1: a domain decomposition ordering method for field coupling; Method 2: a preconditioner for the domain decomposition method; Method 3: a PETSc multi-physics data structure (a minimal PETSc solver sketch appears at the end of this platform overview).
- On 128x128x96 and 192x192x128 meshes the computation speedup is 15-26; the strong scalability of the original code is ordinary while the new code is close to ideal. Test environment: BlueGene/L at NCAR (HPCA2009).
- Strong scalability was also measured on Dawning 5000A (Aztec vs. PETSc on the 160x160x100 mesh, 32 to 8,192 processor cores), across BlueGene/L, Dawning 5000A and DeepComp 7000 for the 192x192x128 linear solve (64 to 8,192 cores), and on TianHe-1A.

CLeXML math library
- Provides BLAS, LAPACK, FFT and a multi-core parallel iterative solver, built around task parallelism and self-adaptive tuning (a computational model for CPU self-adaptive tuning, instruction reordering, and software pipelining).
- BLAS2, BLAS3 and FFT performance has been compared against Intel MKL.
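The platform's optimized solver code is not shown in the slides; as a hedged, generic illustration of the PETSc solver-plus-preconditioner layer it builds on, the sketch below assembles a trivial 1-D Laplacian and solves it with a Krylov method under an additive-Schwarz (domain decomposition) preconditioner. PETSc 3.5+ calling conventions are assumed, and the matrix is only a stand-in for the real coupled-field systems.

```
/* Generic PETSc usage sketch (not the platform's optimized code): solve A x = b
 * with a Krylov method and an additive-Schwarz preconditioner. */
#include <petscksp.h>

int main(int argc, char** argv) {
  PetscInitialize(&argc, &argv, NULL, NULL);

  PetscInt n = 100, istart, iend;
  Mat A; Vec x, b; KSP ksp; PC pc;

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &istart, &iend);
  for (PetscInt i = istart; i < iend; ++i) {      /* 1-D Laplacian stencil */
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCASM);        /* additive Schwarz: one flavor of domain decomposition */
  KSPSetFromOptions(ksp);      /* allow run-time overrides, e.g. -ksp_type gmres */
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
  PetscFinalize();
  return 0;
}
```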
HPC software support for earth system modeling
Contact: Prof. Guangwen Yang, Tsinghua University, ygw@tsinghua.edu.cn

Earth system model development workflow
- A development wizard and editor produce source code, which the compiler, debugger and optimizer turn into an executable (parallel) algorithm; together with the initial field and boundary conditions, other data and a standard data set, the model runs in the running environment; the computation output then goes through result evaluation and result visualization, supported by data visualization and analysis tools and a data management subsystem.

Demonstrative applications and expected results
- Research on global change model application systems; development tools for data conversion, diagnosis, debugging, performance analysis and high availability; and an integrated high performance computing environment for earth system models that also draws on existing tools (compiler, system monitor, version control, editor), software standards, international resources, template and module libraries, and the high performance computers in China.

Integration and management of massive heterogeneous data
- Web-based data access portal: simplified APIs for locating model data paths, reliable metadata management with support for user-defined metadata, the DAP data access protocol, and model data queries.
- Data processing service based on 'cloud' methods: an SQL-like query interface for climate model semantics, parallel data aggregation and extraction, online and offline conversion between different data formats, and graphic workflow operations.
- Data storage service on a parallel file system: fast and reliable parallel I/O for climate modeling (an MPI-IO sketch of the underlying collective-write pattern follows this subsection) and compressed storage for earth scientific data.

Technical route (data platform layers)
- Presentation layer: shell command line, Eclipse client, web browser, C and Fortran APIs, and Web Services (REST and SOAP), feeding a request parsing engine.
- Service interface layer: the data access service (browse, transfer, query, publish, share), the data processing service (visualization, aggregation, extraction, conversion), and the data storage service (read, write, archive).
- Support layer: Hadoop, MPI, OpenDAP, the GPU CUDA SDK, data grid middleware, HDF5, pNetCDF, PIO, and a key-value storage system.
- Storage layer: a memory file system, a compressed archive file system, and the PVFS2 parallel file system toolset.
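The platform's parallel I/O is built on libraries such as pNetCDF, HDF5 and PIO (listed in the support layer above); the sketch below only illustrates the collective-write pattern that sits underneath those libraries, with the file name, block size and data invented for the example.

```
/* Minimal MPI-IO sketch of the collective-write pattern underlying libraries
 * such as pNetCDF/HDF5/PIO (not code from the platform itself). Each rank
 * writes its own contiguous block of one shared file. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 1024;                       /* doubles per rank (illustrative) */
  double* field = (double*)malloc(n * sizeof(double));
  for (int i = 0; i < n; ++i) field[i] = rank + i * 1e-6;   /* fake model data */

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "field.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

  /* Each rank writes at its own offset; the _all variant makes the write
   * collective so the MPI-IO layer and the parallel file system can aggregate. */
  MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
  MPI_File_write_at_all(fh, offset, field, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
  free(field);
  MPI_Finalize();
  return 0;
}
```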
Fast visualization and diagnosis of earth system model data
- Research topics: design and implementation of parallel visualization algorithms — parallel volume rendering that scales to hundreds of cores with efficient data sampling and composition, and parallel contour-surface algorithms for quick extraction and composition of isosurfaces; performance optimization for TB-scale data field visualization; software and hardware acceleration for graphics and imaging; and visualized representation methods for earth system models.
- Pipeline (summary of the system figure): a parallel visualization engine runs on the HPC's computing nodes over a high-speed internal bus, processing raw netCDF data (TB scale) together with preprocessed data and metadata; an OpenGL/pixel stream (on the order of gigabits per second) is delivered to graphical nodes and a graphical workstation (DMX, Chromium) driving a high-resolution renderer and a display wall for local users, with web access for remote users.

MPMD program debugging and analysis
- Topics: MPMD parallel program debugging; MPMD parallel program performance measurement and analysis; support for the efficient execution of MPMD parallel programs; and fault-tolerance technologies for MPMD parallel programs.
- The environment layers runtime support, high availability, debugging and performance analysis on top of the basic hardware and software environment.

Technical route (debugging and analysis environment layers)
- Presentation layer: an IDE integration framework, shell command line, Eclipse client and browser, with a job management UI, a debug plug-in and a performance analysis plug-in.
- Abstraction layer: abstraction services for parallel debugging, performance analysis, data collection and analysis, job and resource management, system monitoring, job control, and data representation.
- Service layer: query, instrumentation, grouping and tracking; controller plug-ins and commands; job scheduling; management middleware; reliability; performance analysis and resource management.
- Fundamental support: the operating system, file system, language environment, libraries, and the hardware (nodes and network).

Debugging and optimization IDE for earth system model programs
- A debugging window and a performance analysis window sit on an earth system model abstraction service platform providing resource management and job scheduling, a performance optimization service and a debugging service — covering scheduling, performance sampling data, debugging monitoring, program event collection, debugging replay, reliable monitoring, hierarchical scheduling, and system failure notification with fault-tolerant scheduling — on top of the earth system model MPMD program and the system execution environment.

Integrated development environment (IDE)
- A plug-in-based, expandable development platform; a template-based development supporting environment; a tool library for earth system model development; and typical earth system model applications developed using the integrated development environment.
- Plug-in integration follows the Eclipse model: the Eclipse platform (workbench with JFace and SWT, workspace, help, team, debug, platform runtime) plus the Java Development Tools (JDT) and the Plug-in Development Environment (PDE), into which your tool, their tool and other tools are integrated as plug-ins.

Encapsulation of reusable modules
- A module encapsulation specification governs high-performance reusable module units — e.g. a radiation module, time integration module, solver module, boundary layer module and coupler module — collected in a model module library.

Thank you!