PARALLEL COMPUTING
PERFORMANCE TESTING AND DEBUGGING
AMY KNOWLES – MIDWESTERN STATE UNIVERSITY
OUTLINE
High Performance Architectures
Parallel Programming Models
Debugging
Performance
Conclusion
HIGH PERFORMANCE ARCHITECTURES
CPU (processor): smallest unit of hardware that executes operations
CPUs can be on a single chip, or on distinct hardware units connected via a network
How are they connected?
Common HPC Architectures
Shared Memory
Distributed Memory
Distributed-Shared
GPU
TRADITIONAL CPU: SHARED MEMORY
Shared Memory: memory that may be simultaneously accessed by multiple processes, with the intent of providing communication among them
Advantage:
Changes in memory are immediately visible to ALL processes
Disadvantages:
Synchronization issues (see the sketch below)
Limited scalability
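To make the synchronization issue concrete, here is a minimal shared-memory sketch in C with OpenMP (my own illustration, not from the slides): all threads update one variable that lives in shared memory, and only the reduction clause keeps the racing updates from corrupting the result.

```c
/* Minimal shared-memory sketch: several OpenMP threads update one
 * shared counter.  Without synchronization the updates would race;
 * the reduction clause restores a correct result. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;
    const long per_thread = 1000000;

    /* The reduction clause gives each thread a private copy of
     * 'counter' and combines them at the end, avoiding the race a
     * plain shared update would cause. */
    #pragma omp parallel reduction(+:counter)
    {
        for (long i = 0; i < per_thread; i++)
            counter++;
    }

    printf("threads=%d counter=%ld\n", omp_get_max_threads(), counter);
    return 0;
}
```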
TRADITIONAL CPU: DISTRIBUTED MEMORY
Distributed Memory: a computer system in which each processor has its own memory
Advantages:
Highly scalable
Rapid access to its own memory
Disadvantage:
Requires explicit communication between processors, which adds overhead (see the sketch below)
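A minimal illustration of that explicit communication, sketched in C with MPI (not from the slides): rank 0 must send a value to rank 1 as a message, because neither process can read the other's memory directly.

```c
/* Minimal distributed-memory sketch: two MPI processes exchange a
 * value explicitly, because each rank only sees its own memory.
 * Run with: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;                       /* data lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```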
TRADITIONAL CPU: DISTRIBUTED-SHARED
Distributed-Shared Memory (cluster): combination of the shared and distributed memory architectures
Nodes are connected through a high-speed distributed network
Each node contains multiple processors organized as a shared-memory system
Most common architecture for high-performance computers (supercomputers) today!
GPU
GPU (graphics processing unit): consists of thousands of small, efficient cores designed to handle multiple tasks simultaneously
Advantage:
Extremely good at data-parallel, compute-intensive tasks
GPU COMPUTING
PARALLEL PROGRAMMING MODELS
Multithreading
Algorithm is split into parts that can be executed independently (task parallelization)
POSIX threads (Pthreads)
OpenMP
Message Passing
Program runs as multiple processes, with information passed between them as messages
MPI
Hybrid
Combines multithreading and message passing (see the sketch below)
Most common: OpenMP + MPI
MPI + CUDA
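A minimal hybrid sketch in C, assuming an MPI library with thread support and an OpenMP-capable compiler (the array size and names are illustrative, not from the slides): OpenMP threads share memory within a node, and MPI combines the per-node results across nodes.

```c
/* Hybrid MPI + OpenMP sketch: one MPI process per node, OpenMP threads
 * within each process.  Each rank sums its slice of data in parallel,
 * then the partial sums are combined across ranks with MPI. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1000000   /* illustrative per-rank problem size */

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double data[N];
    for (int i = 0; i < N; i++) data[i] = 1.0;

    /* OpenMP: threads on this node share 'data' and reduce into local_sum */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N; i++)
        local_sum += data[i];

    /* MPI: combine the per-node results with a collective operation */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```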
DEBUGGING & PERFORMANCE
Debugging
Relative Debugging
Collective Checking
Memory Consistency Errors
Performance
MPI Optimization
CPU-GPU Benchmark Suite
RELATIVE DEBUGGING
Relative debugging: allows a programmer to locate errors by viewing the differences in relevant data structures of two versions of the same program as they are executing
Initially proposed to aid in the parallelization of sequential code
Why is it helpful?
Allows programmers to focus on small changes within data structures without having to manage the complicated flow of data in a parallel environment
When is it useful?
When porting programs from a CPU cluster to a hybrid GPGPU system
COLLECTIVE CHECKING
Collective communication: communication passed to the entire group of processes simultaneously (see the example below)
Correctness properties that the MPI standard imposes on collective operations are transitive
i.e., if a communication property holds for an operation between PX and PY, and the same property holds between PY and PZ, we may assume the property holds between PX and PZ without an additional check
Tree-Based Overlay Network (TBON): a tree-based virtual network of nodes and logical links, built on top of an existing network to provide a service that the existing network does not offer
In our case: collective error checking
Why is it useful in HPC? Scalability
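As a concrete example of what a collective checker inspects, here is a minimal C sketch (my own illustration, not from the cited paper): every rank in the communicator must call the collective with matching arguments.

```c
/* Minimal collective-communication sketch: every rank must call the
 * collective with matching arguments; a single rank using a different
 * root, count, or datatype is exactly the class of error a runtime
 * collective checker looks for. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int config = 0;
    if (rank == 0)
        config = 7;   /* value known only to the root */

    /* All ranks participate with the same root (0) and datatype; if one
     * rank passed, say, root 1 here, the broadcast would be erroneous
     * even though the program might appear to run. */
    MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d sees config=%d\n", rank, config);

    MPI_Finalize();
    return 0;
}
```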
COLLECTIVE CHECKING: TBON
MEMORY CONSISTENCY ERRORS
One-sided communication: a single process specifies an entire data transfer (put/get) without explicit involvement of the target; the operations are non-blocking (see the sketch below)
Allows overlap of communication and computation
Allows potential for greater scalability
Memory consistency errors in MPI
Occur when two or more accesses to shared data conflict with each other, leading to an undefined or erroneous state
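A minimal one-sided sketch in C (illustrative only, not taken from the MC-Checker paper): rank 0 puts a value directly into a window exposed by rank 1, and the window may only be accessed safely inside the synchronization epoch; an access outside it would be exactly the kind of conflicting access described above.

```c
/* Minimal MPI one-sided (RMA) sketch: rank 0 writes directly into a
 * memory window exposed by rank 1.  Accesses to the window are only
 * well-defined inside the synchronization epoch (between the fences). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = -1;                /* memory exposed through the window */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);       /* open the access/exposure epoch */
    if (rank == 0) {
        int value = 42;
        /* one-sided: rank 1 does not post a matching receive */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);       /* close the epoch; buf is now consistent */

    if (rank == 1)
        printf("rank 1 window contains %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```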
MEMORY CONSISTENCY ERRORS
MC-Checker
Detects memory consistency errors in one-sided MPI communication
Advantages
Incurs low runtime overhead
Easy to use, requiring no program modifications
Although written for the MPI programming model, it can easily be extended to other one-sided programming models
MPI OPTIMIZATION
Methodology for optimizing MPI applications
MPI communication structure
Can non-blocking communication be added? (see the sketch below)
Load balancing
Can the algorithm be reworked to balance the workload more evenly?
Parameter optimization
Can the library parameters be tuned?
Results
30% performance gain when all three optimization methods are applied
Applied individually, the methods showed 10%, 20%, and 5% improvements, respectively
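A minimal sketch in C of the first optimization step, replacing blocking sends and receives with non-blocking ones so communication overlaps with independent computation (illustrative code, not taken from the cited paper):

```c
/* Non-blocking communication sketch: start the exchange, do useful
 * work that does not depend on it, then wait for completion.  With
 * blocking MPI_Send/MPI_Recv the rank would idle during the transfer. */
#include <stdio.h>
#include <mpi.h>

#define N 4096

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sendbuf[N], recvbuf[N], work[N];
    for (int i = 0; i < N; i++) { sendbuf[i] = rank; work[i] = i; }

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    /* start the ring-style exchange with the neighboring ranks */
    MPI_Request reqs[2];
    MPI_Isend(sendbuf, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[1]);

    /* overlap: compute on data that does not depend on the messages */
    double local = 0.0;
    for (int i = 0; i < N; i++) local += work[i] * work[i];

    /* only now wait for the communication to finish */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d: local=%.1f, first received value=%.1f\n",
           rank, local, recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```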
CPU-GPU BENCHMARK SUITE
MPI & CUDA
MPI used to distribute work among processors
CUDA used to offload heavy computation to GPUs
Communication types for CPU-GPU data transfer (a sketch follows below)
Unpinned host memory
Pinned host memory
Use of CUDA-Aware MPI
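A minimal sketch in C of these communication styles, assuming an MPI build linked against the CUDA runtime (error checking omitted; the buffer size is illustrative): rank 0 stages GPU data through a pinned host buffer and sends it to rank 1, while the comments note the pageable and CUDA-aware alternatives.

```c
/* Sketch of CPU-GPU communication with MPI + CUDA (run with 2 ranks).
 * Rank 0 stages device data through a *pinned* host buffer and sends
 * it to rank 1. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_buf;                               /* device buffer on each rank */
    cudaMalloc((void **)&d_buf, N * sizeof(float));

    /* Pinned (page-locked) host buffer: faster GPU<->host transfers than
     * pageable memory from malloc(), but large pinned allocations can
     * degrade overall performance. */
    float *h_buf;
    cudaMallocHost((void **)&h_buf, N * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, N * sizeof(float));             /* stand-in for GPU results */
        cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        /* With a CUDA-aware MPI library, d_buf itself could be passed to
         * MPI_Send, skipping the explicit staging copy above. */
    } else if (rank == 1) {
        MPI_Recv(h_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
        printf("rank 1 received %d floats onto its GPU\n", N);
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```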
CPU-GPU BENCHMARK SUITE
Application performance depends on:
Application characteristics (communication-intensive vs. computation-intensive)
Compute capability of the nodes
When dataset sizes are small, memory-intensive applications benefit from the use of pinned memory
Allocating too much pinned memory for large dataset sizes can degrade performance
CONCLUSION
HPC consists of three main architecture types – Shared, Distributed, and Distributed-Shared
Numerous tools have been developed to debug parallel applications
Memory consistency errors
Collective communication checking
Relative debugging
Efficiency and optimization are key factors when developing parallel applications
SOURCES
DeRose, L., Gontarek, A., Moench, R., Vose, A., Abramson, D., Dinh, M.N., Jin, C., Relative Debugging for a Highly
Parallel Hybrid Computer System, SC ’15 Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis, November 15, 2015, Article No. 63.
Hilbrich, T., de Supinski, B.R., Hansel, F., Muller, M.S., Schulz, M., Nagel, W.E., Runtime MPI Collective Checking
with Tree-Based Overlay Networks, EuroMPI '13 Proceedings of the 20th European MPI Users' Group Meeting,
September 15, 2013, pg. 129 – 134.
Chen, Z., Dinan, J., Tang, Z., Balaji, P., Zhong, H., Wei, J., Huang, T., Qin, F., MC-Checker: Detecting Memory
Consistency Errors in MPI One-Sided Applications, SC '14 Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, November 16, 2014, pg. 499
Agarwal, T., Becchi, M., Design of a Hybrid MPI-CUDA Benchmark Suite for CPU-GPU Clusters, PACT '14
Proceedings of the 23rd international conference on Parallel architectures and compilation, August 24, 2014, pg. 505 –
506.
Pimenta, A., Cesar, E., Sikora, A., Methodology for MPI Applications Autotuning, EuroMPI '13 Proceedings of the
20th European MPI Users' Group Meeting, September 15, 2013, pg. 145 – 146.
CPU architecture graphics taken from http://www.skirt.ugent.be/skirt/_parallel_computing.html
GPU architecture graphics taken from http://www.nvidia.com/object/what-is-gpucomputing.html