PARALLEL COMPUTING
PERFORMANCE TESTING AND DEBUGGING
AMY KNOWLES – MIDWESTERN STATE UNIVERSITY
OUTLINE
• High Performance Architectures
• Parallel Programming Models
• Debugging
• Performance
• Conclusion
HIGH PERFORMANCE ARCHITECTURES
• CPU (processor): smallest unit of hardware that executes operations
• How are they connected?
  - CPUs can be on a single chip
  - CPUs can be on distinct hardware units connected via a network
• Common HPC Architectures
  - Shared Memory
  - Distributed Memory
  - Distributed-Shared
  - GPU
TRADITIONAL CPU: SHARED MEMORY
• Shared Memory: memory that may be simultaneously accessed by multiple processes, with the intent to provide communication among them
• Advantage:
  - Changes in memory are immediately available to ALL processes
• Disadvantages:
  - Synchronization issues (illustrated in the sketch below)
  - Lack of scalability
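A minimal sketch of the shared-memory model using OpenMP (mine, not from the slides; the file name and build line are assumptions). Every thread reads and writes the same variable in shared memory, and the critical section supplies the synchronization the point above refers to:

/* shared_counter.c - shared-memory sketch (assumed file name).
 * Build (assumed): gcc -fopenmp shared_counter.c -o shared_counter */
#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;                      /* one variable, visible to every thread */

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        /* Without this critical section the increments race and the
         * final value is wrong; this is the classic shared-memory pitfall. */
        #pragma omp critical
        counter++;
    }

    printf("counter = %ld\n", counter);    /* prints 1000000 when synchronized */
    return 0;
}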
TRADITIONAL CPU: DISTRIBUTED MEMORY
• Distributed Memory: a computer system where each processor has its own memory
• Advantages:
  - Highly scalable
  - Rapid access to its own memory
• Disadvantage:
  - Requires LOTS of communication between processors (see the message-passing sketch below)
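A minimal message-passing sketch (mine, not from the slides; file name and commands are assumptions). Because each rank owns its own memory, rank 0 must explicitly send its data to rank 1:

/* send_recv.c - distributed-memory sketch (assumed file name).
 * Build/run (assumed): mpicc send_recv.c -o send_recv && mpirun -np 2 ./send_recv */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 0.0, 0.0, 0.0};

    if (rank == 0) {
        for (int i = 0; i < 4; i++) buf[i] = i + 1.0;   /* data only rank 0 owns */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f %.1f %.1f %.1f\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}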
TRADITIONAL CPU: DISTRIBUTED-SHARED
• Distributed-Shared Memory (Cluster): combination of shared and distributed memory architectures
  - Nodes are connected through a high-speed distributed network
  - Each node contains multiple processors organized as a shared memory system
• Most common architecture for high-performance computers (supercomputers) today!
GPU
• GPU (graphics processing unit): consists of thousands of small, efficient cores designed to handle multiple tasks simultaneously
• Advantage:
  - Extremely good at compute-intensive tasks
GPU COMPUTING (GPU architecture graphic; see sources)
PARALLEL PROGRAMMING MODELS
• Multithreading
  - Algorithm is split into parts that can be executed independently (task parallelization)
  - POSIX threads (Pthreads)
  - OpenMP
• Message Passing
  - Program runs as multiple processes with information passed between them as messages
  - MPI
• Hybrid
  - Combines multithreading and message passing (see the sketch below)
  - Most common: OpenMP + MPI
  - MPI + CUDA
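A minimal sketch of the hybrid OpenMP + MPI model (mine, not from the slides; file name and build line are assumptions). MPI splits the iteration space across processes, and OpenMP threads share each process's slice:

/* hybrid.c - hybrid OpenMP + MPI sketch (assumed file name).
 * Build/run (assumed): mpicc -fopenmp hybrid.c -o hybrid && mpirun -np 2 ./hybrid */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided;
    /* Request a thread level that tolerates OpenMP threads inside each rank */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* threads share this rank's slice */
    for (int i = rank; i < 1000000; i += nranks)
        local += 1.0 / (i + 1.0);

    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("partial harmonic sum = %f\n", total);

    MPI_Finalize();
    return 0;
}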
DEBUGGING & PERFORMANCE
• Debugging
  - Relative Debugging
  - Collective Checking
  - Memory Consistency Errors
• Performance
  - MPI Optimization
  - CPU-GPU Benchmark Suite
RELATIVE DEBUGGING
• Relative debugging: allows a programmer to locate errors by viewing the differences in relevant data structures of two versions of the same program as they are executing
• Initially proposed to aid in the parallelization of sequential code
• Why is it helpful?
  - Allows programmers to focus on small changes within data structures without having to manage the complicated flow of data in a parallel environment
• When is it useful?
  - Porting programs from a CPU cluster to a GPGPU hybrid (a hand-rolled sketch of the idea follows)
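DeRose et al. describe a full relative-debugging tool; the fragment below is only a hand-rolled sketch of the underlying idea, with hypothetical function names. A trusted serial version and a new parallel version of the same routine run side by side, and the key data structure is compared element by element to find the first divergence:

/* relative_check.c - hand-rolled sketch of the relative-debugging idea (assumed name).
 * Build (assumed): gcc -fopenmp relative_check.c -o relative_check -lm */
#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 1000

/* Reference (serial) version, assumed correct */
static void fill_serial(double *a) {
    for (int i = 0; i < N; i++) a[i] = sin((double)i) * 2.0;
}

/* New (parallel) version under test */
static void fill_parallel(double *a) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) a[i] = sin((double)i) * 2.0;
}

int main(void) {
    double ref[N], par[N];
    fill_serial(ref);
    fill_parallel(par);

    /* "Relative" check: compare the two versions' data structures directly */
    for (int i = 0; i < N; i++) {
        if (fabs(ref[i] - par[i]) > 1e-12) {
            printf("first divergence at index %d: ref=%g par=%g\n", i, ref[i], par[i]);
            return 1;
        }
    }
    printf("data structures match\n");
    return 0;
}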
COLLECTIVE CHECKING
• Collective communication: communication passed to the entire group simultaneously
• The MPI standard imposes transitive correctness properties on collective operations
  - i.e., if a communication property holds for an operation between PX and PY, and the same property holds between PY and PZ, we may assume that the property holds between PX and PZ without an additional check
• Tree-Based Overlay Network (TBON): a tree-based virtual network of nodes and logical links built on top of an existing network in order to implement a service that is not available in the existing network
  - In our case: collective error checking (see the sketch below)
  - Why is it useful in HPC? Scalability
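A minimal sketch of correctly matched collectives (mine, not from the cited paper; file name and commands are assumptions). A runtime collective checker verifies across ranks that calls like these agree on root, count, datatype, and reduction operation:

/* collectives.c - collective-call sketch (assumed file name).
 * Build/run (assumed): mpicc collectives.c -o collectives && mpirun -np 4 ./collectives */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int work = 0;
    if (rank == 0) work = 42;

    /* Every rank must use the SAME root (0 here). A call whose root differs
     * from rank to rank is exactly the kind of mismatch a collective checker flags. */
    MPI_Bcast(&work, 1, MPI_INT, 0, MPI_COMM_WORLD);

    int sum = 0;
    /* Likewise, the count, datatype, and MPI_SUM must match on every rank. */
    MPI_Reduce(&work, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum of broadcast values = %d\n", sum);

    MPI_Finalize();
    return 0;
}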
COLLECTIVE CHECKING: TBON (diagram of the tree-based overlay network)
MEMORY CONSISTENCY ERRORS
• One-sided communication: non-blocking communication
  - Allows overlap of communication and computation
  - Allows potential for greater scalability
• Memory consistency errors in MPI
  - Occur when two or more accesses to shared data conflict with each other
  - Lead to an undefined or erroneous state (see the one-sided sketch below)
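A minimal sketch of one-sided MPI communication (mine, not from the cited paper; file name and commands are assumptions). Rank 0 puts a value directly into rank 1's window; touching the exposed buffer between the two fences while the put is in flight would be exactly the conflicting access described above:

/* one_sided.c - one-sided MPI sketch (assumed file name).
 * Build/run (assumed): mpicc one_sided.c -o one_sided && mpirun -np 2 ./one_sided */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = -1;   /* memory exposed to remote access through the window */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the access epoch */
    if (rank == 0) {
        int value = 99;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    /* Rank 1 must NOT read or write `local` here: such an access conflicts
     * with the in-flight MPI_Put and is a memory consistency error. */
    MPI_Win_fence(0, win);                 /* close the epoch; the put is now visible */

    if (rank == 1) printf("rank 1 sees local = %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}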
MEMORY CONSISTENCY ERRORS
• MC-Checker
  - Detects memory consistency errors in one-sided MPI communication
• Advantages
  - Incurs low runtime overhead
  - Easy to use, requiring no program modifications
  - Although written for the MPI programming model, it can easily be extended to other one-sided programming models
MPI OPTIMIZATION
• Methodology for optimizing MPI applications
  - MPI communication structure: can non-blocking communication be added? (see the sketch below)
  - Load balancing: can the algorithm be re-worked to balance the workload more evenly?
  - Parameter optimization: can the library parameters be tuned?
• Results
  - 30% performance gain when all three optimization methods are applied
  - Applied individually, the methods showed 10%, 20%, and 5% improvement, respectively
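A minimal sketch of the first optimization (mine, not from the cited paper; file name and commands are assumptions): replacing blocking exchanges with non-blocking ones so computation can overlap communication. Each rank exchanges a value with its ring neighbors while working on its own data:

/* overlap.c - non-blocking overlap sketch (assumed file name).
 * Build/run (assumed): mpicc overlap.c -o overlap && mpirun -np 2 ./overlap */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int right = (rank + 1) % nranks;             /* simple ring of neighbors */
    int left  = (rank + nranks - 1) % nranks;

    double send = rank, recv = -1.0;
    MPI_Request reqs[2];

    /* Start the exchange, then compute while the messages are in flight */
    MPI_Irecv(&recv, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    double interior = 0.0;
    for (int i = 0; i < 1000000; i++)            /* stand-in for interior computation */
        interior += 1e-6;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* received value now safe to use */
    printf("rank %d: interior=%.3f value from rank %d = %.1f\n",
           rank, interior, left, recv);

    MPI_Finalize();
    return 0;
}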
CPU-GPU BENCHMARK SUITE
• MPI & CUDA
  - MPI used to distribute work among processors
  - CUDA used to offload heavy computation to GPUs
• Communication types for CPU-GPU
  - Unpinned host memory
  - Pinned host memory (see the sketch below)
  - Use of CUDA-aware MPI
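A minimal sketch contrasting the first two communication types (mine, not from the cited paper; file name, sizes, and build line are assumptions). Pageable ("unpinned") memory comes from the C allocator, while cudaMallocHost returns page-locked (pinned) memory that the GPU can DMA directly, typically speeding up cudaMemcpy transfers:

/* pinned.c - pinned vs. unpinned host memory sketch (assumed file name).
 * Build (assumed): nvcc pinned.c -o pinned   (links against the CUDA runtime) */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t n = 1 << 24;                     /* roughly 16M floats */
    const size_t bytes = n * sizeof(float);

    float *pageable = (float *)calloc(n, sizeof(float));  /* unpinned host buffer */
    float *pinned   = NULL;
    cudaMallocHost((void **)&pinned, bytes);      /* page-locked (pinned) host buffer */

    float *dev = NULL;
    cudaMalloc((void **)&dev, bytes);

    /* Same transfer, two kinds of host memory; the pinned copy avoids the
     * driver's extra staging copy. */
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev, pinned,   bytes, cudaMemcpyHostToDevice);

    printf("copied %zu bytes from pageable and pinned host buffers\n", bytes);

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}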
CPU-GPU BENCHMARK SUITE
• Application performance depends on:
  - Application characteristics (communication-intensive vs. computation-intensive)
  - Compute capability of the nodes
• When dataset sizes are small, memory-intensive applications benefit from the use of pinned memory
• Allocating too much pinned memory for large dataset sizes can degrade performance
CONCLUSION
• HPC consists of three main architecture types: Shared, Distributed, and Distributed-Shared
• Numerous tools have been developed to debug parallel applications
  - Memory consistency errors
  - Collective communication checking
  - Relative debugging
• Efficiency and optimization are key factors when developing parallel applications
SOURCES
• DeRose, L., Gontarek, A., Moench, R., Vose, A., Abramson, D., Dinh, M.N., Jin, C., "Relative Debugging for a Highly Parallel Hybrid Computer System," SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015, Article No. 63.
• Hilbrich, T., de Supinski, B.R., Hansel, F., Muller, M.S., Schulz, M., Nagel, W.E., "Runtime MPI Collective Checking with Tree-Based Overlay Networks," EuroMPI '13: Proceedings of the 20th European MPI Users' Group Meeting, September 15, 2013, pp. 129-134.
• Chen, Z., Dinan, J., Tang, Z., Balaji, P., Zhong, H., Wei, J., Huang, T., Qin, F., "MC-Checker: Detecting Memory Consistency Errors in MPI One-Sided Applications," SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 16, 2014, p. 499.
• Agarwal, T., Becchi, M., "Design of a Hybrid MPI-CUDA Benchmark Suite for CPU-GPU Clusters," PACT '14: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, August 24, 2014, pp. 505-506.
• Pimenta, A., Cesar, E., Sikora, A., "Methodology for MPI Applications Autotuning," EuroMPI '13: Proceedings of the 20th European MPI Users' Group Meeting, September 15, 2013, pp. 145-146.
• CPU architecture graphics taken from http://www.skirt.ugent.be/skirt/_parallel_computing.html
• GPU architecture graphics taken from http://www.nvidia.com/object/what-is-gpucomputing.html#sthash.UMRk458E.dpuf