PARALLEL COMPUTING: PERFORMANCE TESTING AND DEBUGGING
Amy Knowles – Midwestern State University

OUTLINE
- High-Performance Architectures
- Parallel Programming Models
- Debugging
- Performance
- Conclusion

HIGH-PERFORMANCE ARCHITECTURES
CPU (processor): the smallest unit of hardware that executes operations. How are they connected?
Common HPC architectures:
- Shared Memory
- Distributed Memory
- Distributed-Shared
- GPU
The CPUs can be on a single chip or on distinct hardware units connected via a network.

TRADITIONAL CPU: SHARED MEMORY
Shared memory: memory that may be accessed simultaneously by multiple processes, with the intent of providing communication among them.
- Advantage: changes in memory are immediately visible to ALL processes
- Disadvantages: synchronization issues; lack of scalability

TRADITIONAL CPU: DISTRIBUTED MEMORY
Distributed memory: a computer system in which each processor has its own memory.
- Advantages: highly scalable; rapid access to its own memory
- Disadvantage: requires a LOT of communication between processors

TRADITIONAL CPU: DISTRIBUTED-SHARED
Distributed-shared memory (cluster): a combination of the shared and distributed memory architectures.
- Nodes are connected through a high-speed distributed network
- Each node contains multiple processors organized as a shared memory system
- The most common architecture for high-performance computers (supercomputers) today!

GPU
GPU (graphics processing unit): consists of thousands of small, efficient cores designed to handle many tasks simultaneously.
- Advantage: extremely good at compute-intensive tasks

GPU COMPUTING

PARALLEL PROGRAMMING MODELS
- Multithreading: the algorithm is split into parts that can be executed independently (task parallelization)
  - POSIX threads (Pthreads), OpenMP
- Message passing: the program runs as multiple processes, with information passed between them as messages
  - MPI
- Hybrid: combines multithreading and message passing
  - Most common: OpenMP + MPI, MPI + CUDA

DEBUGGING & PERFORMANCE
- Debugging: relative debugging, collective checking, memory consistency errors
- Performance: MPI optimization, CPU-GPU benchmark suite

RELATIVE DEBUGGING
Relative debugging allows a programmer to locate errors by viewing the differences in relevant data structures of two versions of the same program as they execute. It was initially proposed to aid in the parallelization of sequential code.
- Why is it helpful? It lets programmers focus on small changes within data structures without having to manage the complicated flow of data in a parallel environment.
- When is it useful? When porting programs from a CPU cluster to a GPGPU hybrid.

COLLECTIVE CHECKING
Collective communication: communication passed to the entire group simultaneously.
The MPI standard imposes transitive correctness properties on collective operations, i.e., if a communication property holds for an operation between PX and PY, and the same property holds between PY and PZ, we may assume it holds between PX and PZ without an additional check.
Tree-Based Overlay Network (TBON): a tree-based virtual network of nodes and logical links built on top of an existing network in order to provide a service that the existing network does not offer; in our case, collective error checking.
Why is it useful in HPC?
- Scalability
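COLLECTIVE CHECKING: ILLUSTRATIVE EXAMPLE
A minimal sketch, added here for illustration and not taken from the cited papers: it shows the kind of collective-call mismatch a runtime collective checker is built to flag. The MPI standard requires every rank to pass the same root to MPI_Bcast; the ranks below disagree.

/* Hedged sketch of an erroneous collective call: even and odd ranks
   disagree on the root argument of MPI_Bcast.  Behavior is undefined by
   the MPI standard; a runtime collective checker can report the mismatch
   at the call site instead of letting the run hang or deliver bad data. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = (rank == 0) ? 42 : 0;

    /* BUG (intentional, for illustration): root differs across ranks. */
    int root = (rank % 2 == 0) ? 0 : 1;
    MPI_Bcast(&value, 1, MPI_INT, root, MPI_COMM_WORLD);

    printf("rank %d sees value %d\n", rank, value);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with mpirun -np 4, this program may deadlock or broadcast inconsistent data; a TBON-based checker such as the one described by Hilbrich et al. (see Sources) turns that silent failure into an explicit report.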
COLLECTIVE CHECKING: TBON

MEMORY CONSISTENCY ERRORS
One-sided communication: one process drives the data transfer without the explicit participation of the target, so the communication is non-blocking.
- Allows overlap of communication and computation
- Allows potential for greater scalability
Memory consistency errors in MPI:
- Occur when two or more accesses to shared data conflict with each other
- Lead to an undefined or erroneous state

MEMORY CONSISTENCY ERRORS: MC-CHECKER
MC-Checker detects memory consistency errors in one-sided MPI communication.
Advantages:
- Incurs low runtime overhead
- Easy to use, requiring no program modifications
- Although written for the MPI programming model, it can easily be extended to other one-sided programming models

MPI OPTIMIZATION
Methodology for optimizing MPI applications:
- MPI communication structure: can non-blocking communication be added?
- Load balancing: can the algorithm be reworked to balance the workload more evenly?
- Parameter optimization: can the library parameters be tuned?
Results:
- 30% performance gain when all three optimization methods are applied
- Applied individually, the methods showed 10%, 20%, and 5% improvement, respectively

CPU-GPU BENCHMARK SUITE
MPI & CUDA:
- MPI is used to distribute work among processors
- CUDA is used to offload heavy computation to GPUs
Communication types for CPU-GPU:
- Unpinned host memory
- Pinned host memory
- Use of CUDA-aware MPI

CPU-GPU BENCHMARK SUITE
Application performance depends on:
- Application characteristics (communication-intensive vs. computation-intensive)
- Compute capability of the nodes
When dataset sizes are small, memory-intensive applications benefit from the use of pinned memory; allocating too much pinned memory for large dataset sizes can hurt performance.

CONCLUSION
- HPC consists of three main architecture types: shared, distributed, and distributed-shared memory
- Numerous tools have been developed to debug parallel applications: memory consistency checking, collective communication checking, relative debugging
- Efficiency and optimization are key factors when developing parallel applications

SOURCES
DeRose, L., Gontarek, A., Moench, R., Vose, A., Abramson, D., Dinh, M.N., Jin, C., "Relative Debugging for a Highly Parallel Hybrid Computer System," SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015, Article No. 63.
Hilbrich, T., de Supinski, B.R., Hansel, F., Muller, M.S., Schulz, M., Nagel, W.E., "Runtime MPI Collective Checking with Tree-Based Overlay Networks," EuroMPI '13: Proceedings of the 20th European MPI Users' Group Meeting, September 15, 2013, pp. 129–134.
Chen, Z., Dinan, J., Tang, Z., Balaji, P., Zhong, H., Wei, J., Huang, T., Qin, F., "MC-Checker: Detecting Memory Consistency Errors in MPI One-Sided Applications," SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 16, 2014, p. 499.
Agarwal, T., Becchi, M., "Design of a Hybrid MPI-CUDA Benchmark Suite for CPU-GPU Clusters," PACT '14: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, August 24, 2014, pp. 505–506.
Pimenta, A., Cesar, E., Sikora, A., "Methodology for MPI Applications Autotuning," EuroMPI '13: Proceedings of the 20th European MPI Users' Group Meeting, September 15, 2013, pp. 145–146.
CPU architecture graphics taken from http://www.skirt.ugent.be/skirt/_parallel_computing.html
GPU architecture graphics taken from http://www.nvidia.com/object/what-is-gpucomputing.html#sthash.UMRk458E.dpuf