PARALLEL COMPUTING
PERFORMANCE TESTING AND DEBUGGING
AMY KNOWLES – MIDWESTERN STATE UNIVERSITY
OUTLINE
High Performance Architectures
Parallel Programming Models
Debugging
Performance
Conclusion
HIGH PERFORMANCE ARCHITECTURES
CPU (processor): smallest unit of hardware that executes operations
CPUs can be on a single chip, or on distinct hardware units connected via a network
How are they connected?
Common HPC Architectures
Shared Memory
Distributed Memory
Distributed-Shared
GPU
TRADITIONAL CPU: SHARED MEMORY
Shared Memory: memory that may be simultaneously accessed by multiple processes, with the intent of providing communication among them
Advantage:
Changes in memory are immediately visible to ALL processes
Disadvantages:
Synchronization issues (see the sketch below)
Limited scalability
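To make the synchronization issue concrete, here is a minimal shared-memory sketch in C with OpenMP (my own illustration, not from the slides): all threads update one variable that lives in shared memory, and only the reduction clause keeps the racing updates from corrupting the result.

```c
/* Minimal shared-memory sketch: several OpenMP threads update one
 * shared counter.  Without synchronization the updates would race;
 * the reduction clause restores a correct result. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;
    const long per_thread = 1000000;

    /* The reduction clause gives each thread a private copy of
     * 'counter' and combines them at the end, avoiding the race a
     * plain shared update would cause. */
    #pragma omp parallel reduction(+:counter)
    {
        for (long i = 0; i < per_thread; i++)
            counter++;
    }

    printf("threads=%d counter=%ld\n", omp_get_max_threads(), counter);
    return 0;
}
```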
TRADITIONAL CPU: DISTRIBUTED MEMORY
Distributed Memory: a computer system in which each processor has its own memory
Advantages:
Highly scalable
Rapid access to its own memory
Disadvantage:
Requires explicit communication between processors, which adds overhead (see the sketch below)
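A minimal illustration of that explicit communication, sketched in C with MPI (not from the slides): rank 0 must send a value to rank 1 as a message, because neither process can read the other's memory directly.

```c
/* Minimal distributed-memory sketch: two MPI processes exchange a
 * value explicitly, because each rank only sees its own memory.
 * Run with: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;                       /* data lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```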
TRADITIONAL CPU: DISTRIBUTED-SHARED
Distributed-Shared Memory (cluster): combination of the shared and distributed memory architectures
Nodes are connected through a high-speed distributed network
Each node contains multiple processors organized as a shared-memory system
Most common architecture for high-performance computers (supercomputers) today!
GPU
GPU (graphics processing unit): consists of thousands of small, efficient cores designed to handle multiple tasks simultaneously
Advantage:
Extremely good at data-parallel, compute-intensive tasks
GPU COMPUTING
PARALLEL PROGRAMMING MODELS
Multithreading
Algorithm is split into parts that can be executed independently (task parallelization)
POSIX threads (Pthreads)
OpenMP
Message Passing
Program runs as multiple processes, with information passed between them as messages
MPI
Hybrid
Combines multithreading and message passing (see the sketch below)
Most common: OpenMP + MPI
MPI + CUDA
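A minimal hybrid sketch in C, assuming an MPI library with thread support and an OpenMP-capable compiler (the array size and names are illustrative, not from the slides): OpenMP threads share memory within a node, and MPI combines the per-node results across nodes.

```c
/* Hybrid MPI + OpenMP sketch: one MPI process per node, OpenMP threads
 * within each process.  Each rank sums its slice of data in parallel,
 * then the partial sums are combined across ranks with MPI. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1000000   /* illustrative per-rank problem size */

int main(int argc, char **argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double data[N];
    for (int i = 0; i < N; i++) data[i] = 1.0;

    /* OpenMP: threads on this node share 'data' and reduce into local_sum */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N; i++)
        local_sum += data[i];

    /* MPI: combine the per-node results with a collective operation */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```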
DEBUGGING & PERFORMANCE
Debugging
Relative Debugging
Collective Checking
Memory Consistency Errors
Performance
MPI Optimization
CPU-GPU Benchmark Suite
RELATIVE DEBUGGING
Relative debugging: allows a programmer to locate errors by viewing the differences in relevant data structures of two versions of the same program as they are executing
Initially proposed to aid in the parallelization of sequential code
Why is it helpful?
Allows programmers to focus on small changes within data structures without having to manage the complicated flow of data in a parallel environment
When is it useful?
When porting programs from a CPU cluster to a hybrid GPGPU system
COLLECTIVE CHECKING
Collective communication: communication passed to the entire group of processes simultaneously (see the example below)
Correctness properties that the MPI standard imposes on collective operations are transitive
i.e., if a communication property holds for an operation between PX and PY, and the same property holds between PY and PZ, we may assume the property holds between PX and PZ without an additional check
Tree-Based Overlay Network (TBON): a tree-based virtual network of nodes and logical links, built on top of an existing network to provide a service that the existing network does not offer
In our case: collective error checking
Why is it useful in HPC? Scalability
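As a concrete example of what a collective checker inspects, here is a minimal C sketch (my own illustration, not from the cited paper): every rank in the communicator must call the collective with matching arguments.

```c
/* Minimal collective-communication sketch: every rank must call the
 * collective with matching arguments; a single rank using a different
 * root, count, or datatype is exactly the class of error a runtime
 * collective checker looks for. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int config = 0;
    if (rank == 0)
        config = 7;   /* value known only to the root */

    /* All ranks participate with the same root (0) and datatype; if one
     * rank passed, say, root 1 here, the broadcast would be erroneous
     * even though the program might appear to run. */
    MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d sees config=%d\n", rank, config);

    MPI_Finalize();
    return 0;
}
```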
COLLECTIVE CHECKING: TBON
MEMORY CONSISTENCY ERRORS
One-sided communication: a single process specifies an entire data transfer (put/get) without explicit involvement of the target; the operations are non-blocking (see the sketch below)
Allows overlap of communication and computation
Allows potential for greater scalability
Memory consistency errors in MPI
Occur when two or more accesses to shared data conflict with each other, leading to an undefined or erroneous state
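A minimal one-sided sketch in C (illustrative only, not taken from the MC-Checker paper): rank 0 puts a value directly into a window exposed by rank 1, and the window may only be accessed safely inside the synchronization epoch; an access outside it would be exactly the kind of conflicting access described above.

```c
/* Minimal MPI one-sided (RMA) sketch: rank 0 writes directly into a
 * memory window exposed by rank 1.  Accesses to the window are only
 * well-defined inside the synchronization epoch (between the fences). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = -1;                /* memory exposed through the window */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);       /* open the access/exposure epoch */
    if (rank == 0) {
        int value = 42;
        /* one-sided: rank 1 does not post a matching receive */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);       /* close the epoch; buf is now consistent */

    if (rank == 1)
        printf("rank 1 window contains %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```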
MEMORY CONSISTENCY ERRORS
MC-Checker
Detects memory consistency errors in one-sided MPI communication
Advantages
Incurs low runtime overhead
Easy to use, requiring no program modifications
Although written for the MPI programming model, it can easily be extended to other one-sided programming models
MPI OPTIMIZATION
Methodology for optimizing MPI applications
MPI communication structure
Can non-blocking communication be added? (see the sketch below)
Load balancing
Can the algorithm be reworked to balance the workload more evenly?
Parameter optimization
Can the library parameters be tuned?
Results
30% performance gain when all three optimization methods are applied
Applied individually, the methods showed 10%, 20%, and 5% improvements, respectively
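A minimal sketch in C of the first optimization step, replacing blocking sends and receives with non-blocking ones so communication overlaps with independent computation (illustrative code, not taken from the cited paper):

```c
/* Non-blocking communication sketch: start the exchange, do useful
 * work that does not depend on it, then wait for completion.  With
 * blocking MPI_Send/MPI_Recv the rank would idle during the transfer. */
#include <stdio.h>
#include <mpi.h>

#define N 4096

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double sendbuf[N], recvbuf[N], work[N];
    for (int i = 0; i < N; i++) { sendbuf[i] = rank; work[i] = i; }

    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    /* start the ring-style exchange with the neighboring ranks */
    MPI_Request reqs[2];
    MPI_Isend(sendbuf, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[1]);

    /* overlap: compute on data that does not depend on the messages */
    double local = 0.0;
    for (int i = 0; i < N; i++) local += work[i] * work[i];

    /* only now wait for the communication to finish */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d: local=%.1f, first received value=%.1f\n",
           rank, local, recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```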
CPU-GPU BENCHMARK SUITE
MPI & CUDA
MPI used to distribute work among processors
CUDA used to offload heavy computation to GPUs
Communication types for CPU-GPU data transfer (a sketch follows below)
Unpinned host memory
Pinned host memory
Use of CUDA-Aware MPI
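A minimal sketch in C of these communication styles, assuming an MPI build linked against the CUDA runtime (error checking omitted; the buffer size is illustrative): rank 0 stages GPU data through a pinned host buffer and sends it to rank 1, while the comments note the pageable and CUDA-aware alternatives.

```c
/* Sketch of CPU-GPU communication with MPI + CUDA (run with 2 ranks).
 * Rank 0 stages device data through a *pinned* host buffer and sends
 * it to rank 1. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *d_buf;                               /* device buffer on each rank */
    cudaMalloc((void **)&d_buf, N * sizeof(float));

    /* Pinned (page-locked) host buffer: faster GPU<->host transfers than
     * pageable memory from malloc(), but large pinned allocations can
     * degrade overall performance. */
    float *h_buf;
    cudaMallocHost((void **)&h_buf, N * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, N * sizeof(float));             /* stand-in for GPU results */
        cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        /* With a CUDA-aware MPI library, d_buf itself could be passed to
         * MPI_Send, skipping the explicit staging copy above. */
    } else if (rank == 1) {
        MPI_Recv(h_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
        printf("rank 1 received %d floats onto its GPU\n", N);
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```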
CPU-GPU BENCHMARK SUITE
Application performance depends on:
Application characteristics (communication-intensive vs. computation-intensive)
Compute capability of the nodes
When dataset sizes are small, memory-intensive applications benefit from the use of pinned memory
Allocating too much pinned memory for large dataset sizes can degrade performance
CONCLUSION
HPC consists of three main architecture types – Shared, Distributed, and Distributed-Shared
Numerous tools have been developed to debug parallel applications
Memory consistency errors
Collective communication checking
Relative debugging
Efficiency and optimization are key factors when developing parallel applications
SOURCES
DeRose, L., Gontarek, A., Moench, R., Vose, A., Abramson, D., Dinh, M.N., Jin, C., Relative Debugging for a Highly
Parallel Hybrid Computer System, SC ’15 Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis, November 15, 2015, Article No. 63.
Hilbrich, T., de Supinski, B.R., Hansel, F., Muller, M.S., Schulz, M., Nagel, W.E., Runtime MPI Collective Checking
with Tree-Based Overlay Networks, EuroMPI '13 Proceedings of the 20th European MPI Users' Group Meeting,
September 15, 2013, pg. 129 – 134.
Chen, Z., Dinan, J., Tang, Z., Balaji, P., Zhong, H., Wei, J., Huang, T., Qin, F., MC-Checker: Detecting Memory
Consistency Errors in MPI One-Sided Applications, SC '14 Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, November 16, 2014, pg. 499
Agarwal, T., Becchi, M., Design of a Hybrid MPI-CUDA Benchmark Suite for CPU-GPU Clusters, PACT '14
Proceedings of the 23rd international conference on Parallel architectures and compilation, August 24, 2014, pg. 505 –
506.
Pimenta, A., Cesar, E., Sikora, A., Methodology for MPI Applications Autotuning, EuroMPI '13 Proceedings of the
20th European MPI Users' Group Meeting, September 15, 2013, pg. 145 – 146.
CPU architecture graphics taken from http://www.skirt.ugent.be/skirt/_parallel_computing.html
GPU architecture graphics taken from http://www.nvidia.com/object/what-is-gpucomputing.html