Parallel Programming
Lecture 1:
Introduction
Mohamed Mead
Course Contents
❑Introduction to parallel and distributed computing
❑Parallel algorithm design
❑Shared-memory programming and OpenMP
❑HPC and MPI
❑Complexity analysis of distributed and parallel algorithms
❑Parallel searching algorithms
❑Parallel sorting algorithms
❑Parallel graph algorithms
What is parallel computing?
❑Serial or sequential computing: doing a task in sequence on a single
processor
❑Parallel computing: breaking up a task into sub-tasks and doing them in
parallel on a set of processors (often connected by a network)
➢using multiple, simultaneous computations in order to speed up solving
the overall problem
➢requires multiple processing elements that can be used simultaneously
(“parallel hardware”)
The need for parallel computing or HPC
❑Make some scientific simulations feasible within a human lifetime
❑Provide answers in real time or near real time
Parallelism in Hardware
Flynn’s taxonomy
❑ Instruction stream: a sequence of instructions executed by a processor.
❑ Data stream: a sequence of data required by an instruction
stream.
Flynn’s taxonomy
❑Single-instruction, single-data (SISD) systems –
➢ An SISD computing system is a uniprocessor machine that executes a single instruction stream operating on a single data stream. In SISD, machine instructions are processed sequentially, and computers adopting this model are popularly called sequential computers. Most conventional computers have the SISD architecture. All the instructions and data to be processed are stored in primary memory.
Flynn’s taxonomy
❑SISD (Single Instruction Single Data)
➢ Sequential execution of instructions on a single data stream
➢ One control unit and one processing unit
➢ Simple to program and understand
➢ Limited by the speed of a single processor
➢ Examples include early desktop computers and simple embedded systems
Flynn’s taxonomy
❑Single-instruction, multiple-data (SIMD) systems –
➢ An SIMD system is a multiprocessor machine capable of executing the same instruction on all its CPUs, each operating on a different data stream. Machines based on the SIMD model are well suited to scientific computing, since it involves many vector and matrix operations. The data elements of a vector can be divided into multiple sets (N sets for an N-PE system) so that each processing element (PE) processes one data set.
Flynn’s taxonomy
❑SIMD (Single Instruction Multiple Data)
➢ One instruction applied to multiple data elements simultaneously
➢ Multiple processing elements controlled by a single control unit
➢ Exploits data-level parallelism
➢ Efficient for tasks with regular data structures (matrices, vectors)
➢ Examples include vector processors and GPUs
Applications:
• Image processing
• Matrix manipulations
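A minimal sketch of SIMD in practice, using Intel's AVX intrinsics in C (an assumption for illustration: an x86 CPU with AVX support, compiled with a flag such as -mavx). One vector instruction applies the same addition to eight floats at once:

#include <immintrin.h>   /* Intel AVX intrinsics */
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* Single instruction, multiple data: one AVX add
       operates on eight float elements simultaneously. */
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);   /* prints: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}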
Flynn’s taxonomy
❑Multiple-instruction, single-data (MISD) systems –
➢ An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs, with all of them operating on the same data set.
➢ Example: Z = sin(x) + cos(x) + tan(x)
➢ The system performs different operations on the same data set. Machines built on the MISD model are not useful for most applications; a few have been built, but none is available commercially.
❑ Example: Multiple cryptography algorithms attempting to crack a single coded message.
Flynn’s taxonomy
❑Multiple-instruction, multiple-data (MIMD) systems –
➢ An MIMD system is a multiprocessor machine capable of executing multiple instructions on multiple data sets. Each PE in the MIMD model has separate instruction and data streams; therefore, machines built using this model can handle any kind of application.
➢ MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory
MIMD based on the way PEs are coupled to the main memory.
Flynn’s taxonomy
❑MIMD (Multiple Instruction Multiple Data)
➢ Multiple processors execute different instructions on different data independently
➢ Most flexible and general-purpose parallel architecture
➢ Supports both task-level and data-level parallelism
➢ Includes shared memory and distributed memory variants
➢ Examples include multi-core processors and computer clusters
Real-world Examples of Parallel Systems
❑SISD and SIMD Systems
➢ SISD examples
➢ Traditional single-core processors (early Intel x86 CPUs)
➢ Simple microcontrollers in embedded systems (Arduino Uno)
➢ Basic calculators and early personal computers
➢ SIMD examples
➢ Vector processors (Cray-1 supercomputer)
➢ Graphics Processing Units (NVIDIA GeForce, AMD Radeon)
➢ Digital Signal Processors (DSPs) in audio equipment
➢ NEON SIMD architecture in ARM processors
➢ Intel's SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions)
Real-world Examples of Parallel Systems
❑MISD and MIMD Systems
➢ MISD examples (rare in practice)
➢ Systolic arrays for matrix multiplication
➢ Some pipelined architectures in signal processing
➢ Theoretical fault-tolerant systems with redundant processing
➢ MIMD examples
➢ Multi-core processors (Intel Core i7, AMD Ryzen)
➢ Symmetric Multiprocessing (SMP) systems
➢ Distributed computing systems (Hadoop clusters)
➢ Grid computing networks (SETI@home project)
➢ Cloud computing infrastructures (Amazon EC2, Google Cloud)
Thinking, Strategy, Constraints
❑ Strategy: Partitioning!
❑ Two ways of thinking:
❑ Task-parallelism
❑ Data-parallelism
❑ Some constraints:
❑ Communication
❑ Load balancing
❑ Synchronization
Partitioning
❑ One of the first steps in designing a parallel program is to break the problem
into discrete "chunks" of work that can be distributed to multiple tasks. This
is known as decomposition or partitioning.
❑ There are two basic ways to partition computational work among parallel
tasks:
➢ Domain decomposition
➢ Functional decomposition
Partitioning
❑ Domain Decomposition: In this type of partitioning, the data
associated with a problem is decomposed. Each parallel task
then works on a portion of the data.
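A minimal sketch of domain decomposition in C (illustrative only; the sizes N and P and the contiguous block layout are assumptions for the example). The data is cut into blocks and each task owns one:

#include <stdio.h>

#define N 100   /* total data elements (assumed size)  */
#define P 4     /* number of parallel tasks (assumed)  */

int main(void) {
    /* Domain decomposition: split the data into P blocks of
       roughly N/P elements; each task works on its own block. */
    int chunk = (N + P - 1) / P;              /* ceiling of N/P */
    for (int task = 0; task < P; task++) {
        int lo = task * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;
        printf("task %d owns elements [%d, %d)\n", task, lo, hi);
    }
    return 0;
}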
Partitioning
❑ Functional Decomposition: In this approach, the focus is on the
computation that is to be performed rather than on the data
manipulated by the computation. The problem is decomposed
according to the work that must be done. Each task then performs a
portion of the overall work.
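A minimal sketch of functional decomposition using OpenMP sections in C (the three component functions are hypothetical stand-ins for independent parts of a larger computation; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>

/* Hypothetical components: each performs a different part of
   the overall computation, not a different slice of the data. */
void atmosphere_step(void) { printf("atmosphere on thread %d\n", omp_get_thread_num()); }
void ocean_step(void)      { printf("ocean on thread %d\n",      omp_get_thread_num()); }
void land_step(void)       { printf("land on thread %d\n",       omp_get_thread_num()); }

int main(void) {
    /* Functional decomposition: the work, not the data, is
       partitioned; each section is a distinct piece of it. */
    #pragma omp parallel sections
    {
        #pragma omp section
        atmosphere_step();
        #pragma omp section
        ocean_step();
        #pragma omp section
        land_step();
    }
    return 0;
}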
Communications
❑ The need for communications between tasks depends upon your problem
❑ You DO need communications
➢ Most parallel applications are not quite so simple, and they do require tasks to share data with each other.
❑ You DON'T need communications
➢ Some types of problems can be decomposed and executed in parallel
with virtually no need for tasks to share data. For example, imagine an
image processing operation where every pixel in a black and white image
needs to have its color reversed. The image data can easily be
distributed to multiple tasks that then act independently of each other
to do their portion of the work.
➢ These types of problems require little or no inter-task communication.
Communications
❑ There are a number of important factors to consider when
designing your program's inter-task communications
1- Cost of communications
➢ Inter-task communication virtually always implies overhead.
➢ Machine cycles and resources that could be used for computation are
instead used to package and transmit data.
➢ Communications frequently require some type of synchronization
between tasks, which can result in tasks spending time "waiting" instead
of doing work.
➢ Competing communication traffic can saturate the available network
bandwidth.
Communications
2- Latency vs. Bandwidth
➢ Bandwidth is the amount of data that can be communicated per second.
➢ Latency is the time it takes for a message to travel from its source to its destination.
➢ Sending many small messages can cause latency to dominate
communication overheads. Often it is more efficient to package
small messages into a larger message, thus increasing the
effective communications bandwidth.
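A back-of-the-envelope model makes the point. Assuming (illustrative numbers only, not measurements) a latency of 1 microsecond and a bandwidth of 1 GB/s, the time to send a message is roughly latency + bytes/bandwidth:

#include <stdio.h>

int main(void) {
    /* Illustrative assumptions, not measurements. */
    double latency   = 1e-6;   /* seconds per message */
    double bandwidth = 1e9;    /* bytes per second    */
    double bytes     = 1e6;    /* 1 MB of data total  */

    /* Cost model: time = latency + bytes / bandwidth. */
    double one_big  = latency + bytes / bandwidth;
    double thousand = 1000 * (latency + (bytes / 1000) / bandwidth);

    printf("1 message of 1 MB    : %f s\n", one_big);   /* ~0.001001 s */
    printf("1000 messages of 1 KB: %f s\n", thousand);  /* ~0.002 s    */
    return 0;
}

Sending the same megabyte as a thousand small messages roughly doubles the total cost, because the latency is paid a thousand times.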
Communications
3- Synchronous vs. Asynchronous Communications
➢ Synchronous communications require some type of "handshaking"
between tasks that are sharing data. This can be explicitly
structured in code by the programmer, or it may happen at a
lower level unknown to the programmer.
➢ Synchronous communications are often referred to as blocking communications, since other work must wait until the communications have completed.
Communications
➢ Asynchronous communications allow tasks to transfer data independently of one another. For example, task 1 can prepare and send a message to task 2 and then immediately begin doing other work; when task 2 actually receives the data does not matter.
➢ Asynchronous communications are often referred to as non-blocking communications, since other work can be done while the communications are taking place.
➢ Interleaving computation with communication is the single greatest benefit of using asynchronous communications.
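A minimal MPI sketch of the two styles (assuming an MPI installation; run with two processes, e.g. mpirun -np 2). Rank 0 uses a non-blocking send so it can do other work before completing the transfer; rank 1 uses a blocking receive:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int msg = 42, recv_buf = 0;
    MPI_Request req;

    if (rank == 0) {
        /* Non-blocking (asynchronous) send: returns at once,
           so computation can overlap the communication. */
        MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other useful work could happen here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the send */
    } else if (rank == 1) {
        /* Blocking receive: does not return until the data
           has actually arrived. */
        MPI_Recv(&recv_buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", recv_buf);
    }
    MPI_Finalize();
    return 0;
}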
Communications
4- Scope of Communications
➢ Knowing which tasks must communicate with each other is critical during the design stage of a parallel code. Both scopings described below can be implemented synchronously or asynchronously.
➢ Point-to-point - involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer.
➢ Collective - involves data sharing between more than two tasks,
which are often specified as being members in a common group,
or collective.
Communications
Examples: Collective Communications
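A minimal MPI sketch of two common collectives, broadcast and reduction (assuming an MPI installation; run with several processes, e.g. mpirun -np 4):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Broadcast: one task shares a value with every member
       of the group. */
    int value = (rank == 0) ? 100 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduction: every task contributes its rank, and the
       sum is collected on task 0. */
    int sum = 0;
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("broadcast value = %d, sum of ranks = %d\n", value, sum);
    MPI_Finalize();
    return 0;
}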
Communications
5- Efficiency of Communications
➢ Very often, the programmer will have a choice with regard to factors
that can affect communications performance. Only a few are
mentioned here.
➢ Which implementation for a given model should be used? Using the
Message Passing Model as an example, one MPI implementation may be
faster on a given hardware platform than another.
➢ What type of communication operations should be used? As mentioned
previously, asynchronous communication operations can improve overall
program performance.
➢ Network media - some platforms may offer more than one network for
communications. Which one is best?
Load balancing
❑ Load balancing refers to the practice of distributing work among
tasks so that all tasks are kept busy all of the time. It can be
considered a minimization of task idle time.
❑ Load balancing is important to parallel programs for performance
reasons. For example, if all tasks are subject to a barrier
synchronization point, the slowest task will determine the overall
performance.
How to Achieve Load Balance
❑ Equally partition the work each task receives
➢ For array/matrix operations where each task performs similar
work, evenly distribute the data set among the tasks.
➢ For loop iterations where the work done in each iteration is
similar, evenly distribute the iterations across the tasks.
➢ If a heterogeneous mix of machines with varying performance characteristics is being used, be sure to use some type of performance analysis tool to detect load imbalances, and adjust the work accordingly.
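A minimal OpenMP sketch of equal partitioning of loop iterations (assuming similar work per iteration; compile with -fopenmp). schedule(static) gives each thread an equal, contiguous block of iterations:

#include <omp.h>
#include <stdio.h>

#define N 16

int main(void) {
    /* Static schedule: the N iterations are divided evenly
       among the threads ahead of time. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        printf("iteration %2d done by thread %d\n",
               i, omp_get_thread_num());
    return 0;
}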
How to Achieve Load Balance
❑ Use dynamic work assignment
➢ When the amount of work each task will perform is intentionally variable, or cannot be predicted, it may be helpful to use a scheduler / task-pool approach: as each task finishes its work, it queues to get a new piece of work.
➢ It may become necessary to design an algorithm which detects
and handles load imbalances as they occur dynamically within the
code.
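A minimal OpenMP sketch of dynamic work assignment (the uneven usleep delay is an artificial stand-in for unpredictable work; compile with -fopenmp). With schedule(dynamic, 1), threads pull one iteration at a time from a shared pool, so a thread that finishes early immediately queues for more:

#include <omp.h>
#include <stdio.h>
#include <unistd.h>

#define N 16

int main(void) {
    /* Dynamic schedule: iterations form a task pool; each
       thread grabs a new one as soon as it finishes. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < N; i++) {
        usleep((i % 4) * 1000);   /* artificially uneven work */
        printf("iteration %2d done by thread %d\n",
               i, omp_get_thread_num());
    }
    return 0;
}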