PARALLEL PROGRAMMING

ALL LECTURE SLIDES
Lecture 1
Why parallel? 1
Why we need ever-increasing performance? 1
Why we’re building parallel systems? 1
The solution of parallel 1
Why we need to write parallel programs 1
Approaches to the serial problem 1
More problems + Example 1
Analysis of example #1 and #2 2
Better parallel algorithm 2
Multiple cores forming a global sum 2
How do we write parallel programs? Task Parallelism and Data Parallelism 2
Division of work: Task Parallelism and Data Parallelism 3
Coordination 3
Types of parallel systems: Shared-memory and Distributed Memory 3
Terminology of Concurrent, Parallel and Distributed computing 2
Concluding remarks of lecture 1 3
Lecture 2 & 3
The von Neumann Architecture 4
Main Memory 4
CPU(Central Processing Unit) and its parts 4
Register 4
Program Counter 4
Bus 4
Process and its components 4
Multitasking 4
Threading 4
A process and two threads figure 5
Basics of caching 5
Principle of locality: Spatial locality and Temporal locality 5
Levels of Cache. Cache hit. Cache miss 5
Issues with cache (Write-through and Write-back) 5
Cache mappings (Fully associative, Direct mapped, n-way set associative) 5
A table of a 16-line main memory to a 4-line cache 5
Caches and programs code and table 6
Virtual memory 6
Virtual page numbers 6
Page table with Virtual Address divided into Virtual Page Number and Byte Offset
TLB (Translation-lookaside buffer) 6
ILP (Instruction Level Parallelism) 6
Pipelining Example #1 6
Pipelining Example #2 7
Pipelining in general + table 7
Multiple Issues (Static multiple issue, Dynamic multiple issue) 7
Speculation 7
Hardware multithreading (SMT, Fine-grained and Coarse-grained) 7
Flynn’s Taxonomy (MISD, MIMD, SISD, SIMD) 8
SIMD and its drawbacks 8
What if we don’t have as many ALUs as data items? 8
Vector Processors in general (the pros, the cons) 8
GPUs (Graphics Processing Units) 9
MIMD 9
Shared Memory System 9
UMA multicore System 9
NUMA multicore System 9
Distributed Memory System 9
Interconnection network 9
Shared Memory Interconnects (Bus and Switch interconnects, Crossbar) 10
Distributed Memory Interconnects (Direct and Indirect interconnections) 10
Bisection width 10
Figures about “Two bisections of a ring” and “A bisection of toroidal mesh” 10
Bandwidth 11
Bisection bandwidth 11
Hypercube 11
Indirect interconnects 11
Figures about “Crossbar interconnect for distributed memory” and “Omega network” 11
A switch in an omega network 11
More definition (Latency, Bandwidth) 12
Cache coherence 12
A shared memory system with two cores and two caches 12
Snooping Cache Coherence 12
Directory Based Cache Coherence 12
The burden is on software 12
SPMD 12
Writing Parallel Programs 13
Shared Memory (Dynamic and Static Threads) 13
Nondeterminism 13
Codes about “Busy-waiting”, “message-passing”, “Partitioned Global Address Space Languages” 13
Input and Output 14
Sources of Overhead in Parallel Programs (Interprocess interaction, Idling, Excess Computation) 14
Performance Metrics for Parallel Systems: Execution Time 14
Speedup (Speedup of a parallel program, speedup bounds) 14
Superlinear Speedups 15
Performance Metrics: Efficiency 15
Efficiency of a parallel program 15
Speedups and efficiencies of a parallel program on different problem sizes 15
Effect of overhead 16
Amdahl’s Law 16
Proof for Traditional Problems 16
Proof of Amdahl’s Law and Examples 16
Limitations of Amdahl’s Law 16
Scalability 16
Taking Timing codes 17
Foster’s methodology(Partitioning, Communication, Agglomeration, Mapping) 17
Histogram Example 17
Serial program input and output 17
First two stages of Foster’s Methodology 17
Alternative definition of tasks and communication 18
Adding the local arrays 18
Concluding Remarks 18
Lecture 4
Corresponding MPI functions 18
Topologies 18
Linear model of communication overhead 18
Contention 18
One-to-all broadcast (Input, Output, Ring, Mesh, Hypercube, Algorithm) 19
All-to-one reduction (Input, Output, Algorithm) 20
All-to-all broadcast (Input, Output, Ring, Mesh, Hypercube, summary) 20-21
All-to-all reduction (Input, Output, Algorithm) 21
All-reduce (Input, Output, Algorithm) 22
Prefix Sum (Input, Output, Algorithm) 22
Scatter (Input, Output, Algorithm) 22
Gather (Input, Output, Algorithm) 23
All-to-all personalized (Input, Output, Summary, Hypercube algorithm, E-cube routing) 23-24
Hypercube Time for all topologies 24
Improved one-to-all broadcast (Scatter, All-to-all broadcast, Time analysis) 24-25
Improved all-to-one reduction (All-to-all reduction, Gather, Time Analysis) 25
Improved all-reduce (All-to-all reduction, Gather, Scatter, All-to-all broadcast, Time analysis) 25
Lecture 5 & 6
POSIX® Threads 26
Caveat 26
Hello World 1, 2 and 3 26
Compiling a Pthread program 26
Running a Pthread program 26
Global variables 26
Starting the Threads 26
pthread_t objects 26
A closer look 1 and 2 27
Function started by pthread_create 27
Running the Threads 27
Stopping the Threads 27
Serial pseudo-code 27
Using 3 Pthreads 27
Pthreads matrix-vector multiplication 27
Estimating π and a thread function for computing π 28
Busy-Waiting 28
Possible race condition 28
Pthreads global sum with busy-waiting 28
Global sum function with critical section after loop 28
Mutexes 29
Global sum function that uses a mutex 29
Issues 29
Problems with a mutex solution 30
A first attempt at sending messages using pthreads 30
Syntax of the various Semaphore functions 30
Barriers + (Using barriers to time the slowest threads and for debugging)
Busy-Waiting and a Mutex 30
Implementing a barrier with Semaphores 30
Condition Variables 31
Implementing a barrier with conditional variables 31
Linked Lists (+Linked Lists Membership) 31
Inserting a new node into a list 31
Deleting a node from a linked list 31
A Multi-Threaded Linked List 32
Simultaneous access by two threads 32
Solution #1 and #2 and their issues 32
Implementation of Member with one mutex per list node 32
Pthreads Read-Write Locks 32
Protecting our Linked List functions 33
Linked List Performance 33
Caches, Cache-Coherence, and False Sharing 33
Pthreads matrix-vector multiplication 33
Thread-Safety (+Example) 33
Simple Approach 34
The strtok function 34
Multi-threaded tokenizer 34
Running with one thread 34
Running with two threads 34
Other unsafe C library functions 34
“re-entrant” (thread safe) functions 35
Concluding remarks about lecture 5&6 35
Lecture 7 & 8
OpenMP 35
Pragmas 35
OpenMP pragmas 36
A process forking and joining two threads
clause 36
Of note… 36
Some terminology 36
In case the compiler doesn’t support OpenMP 36
The Trapezoid Rule in general 36-38
Scope and Scope in OpenMP 37
The Reduction Clause 37
Reduction Operators 38
Parallel for 38
Legal forms for parallelizable for statements 38
Caveats 38
Data dependencies 39
Estimating π with OpenMP solution 39
The default clause 39
Bubble Sort 39
Serial Odd-Even Transposition Sort 39
First OpenMP Odd-Even Sort 39
Second OpenMP Odd-Even Sort 40
Scheduling loops and their Results 40
The Schedule Clause (Default schedule, Cyclic schedule) 40
Schedule(type, chunksize, static, dynamic or guided, auto, runtime)
The Static Schedule Type 41
The Dynamic Schedule Type 41
The Guided Schedule Type 41
The Runtime Schedule Type 41
Queues 41
Message-Passing 41
Sending Messages 41
Receiving Messages 41
Termination Detection 42
Startup 42
The Atomic Directive 42
Critical Sections 42
Locks (+Using Locks in the Message-Passing Programs) 42
Some Caveats 43
Matrix-vector multiplication 43
Thread-Safety 43
Concluding Remarks about Lecture 7 & 8 43
Lecture 9 & 10
Identifying MPI processes 43
Our first MPI program 43
MPI Compilation 44
MPI Execution 44
MPI Programs 44
MPI Components 44
Basic Outline 44
Communicators 44
SPMD 44
Communication 45
Data Types 45
Message Matching 45
Receiving messages 45
status_p argument 45
How much data am I receiving? 45
The Trapezoidal Rule in MPI 45
Parallelizing the Trapezoidal Rule 46
Parallel pseudo-code 46
One trapezoid 46
Pseudo-code for a serial program 46
Tasks and communications for Trapezoidal Rule 46
First Version (1&2&3) for Trapezoidal 46
Dealing with I/O 46
Input 47
Running with 6 processes 47
Function for reading user input 47
Tree-structured communication 47
A tree-structured global sum (+an alternate one) 47
MPI_Reduce 48
Predefined reduction operators in MPI 48
Collective vs Point-to-Point Communications 48
MPI_Reduce examples 48
MPI_Allreduce 48
A global sum followed by distribution of the result 49
A butterfly structured global sum 49
Broadcast (+ A tree structured broadcast) 49
A version of Get_input that uses MPI_Bcast 49
Data distributions 49
Serial implementation of vector addition 49
Different partitions of a 12-component vector among 3 processes 49
Partitioning options (Block, Cyclic, Block-cyclic partitioning ) 50
MPI_Scatter 50
Parallel implementation of vector addition 50
MPI Gather 50
Reading and distributing a vector 50
Print a distributed vector 50
Allgather 51
Matrix-vector multiplication 51
Multiply a matrix by a vector 51
C style arrays 51
Serial Matrix-vector multiplication 51
An MPI matrix-vector multiplication function 51
Derived datatypes 51
MPI_Type create_struct 52
MPI_Get_address 52
MPI_Type_commit 52
MPI_Type_free 52
Get input function with a derived datatype 52
Elapsed parallel time 52
Elapsed serial time 52
MPI_Barrier 53
Scalability 53
Speedup formula 53
Efficiency formula 53
A parallel sorting algorithm 53
Odd-even transposition sort (+Example) 53
Serial odd-even transposition sort 54
Communications among tasks in odd-even sort 54
Parallel odd-even transposition sort (+Pseudocode) 54
Compute_partner 54
Safety in MPI programs 54
MPI_Ssend 55
Restructuring communication 55
MPI_Sendrecv 55
Safe communication with five processes 55
Parallel odd-even transposition sort with code 55
Concluding Remarks about Lecture 9&10 56
Why Parallel?
Instead of designing and building faster microprocessors, we put multiple processors on a single integrated circuit.
Why we need ever-increasing performance?
Computational power is increasing, but so are our computation problems and needs.
Problems we never dreamed of have been solved because of past increases, such as decoding the human
genome.
More complex problems are still waiting to be solved.
Why we’re building parallel systems?
Smaller transistors = faster processors.
Faster processors = increased power consumption.
Increased power consumption = increased heat.
Increased heat = unreliable processors.
The solution of parallel


Move away from single-core systems to multicore processors.
“core” = central processing unit (CPU)
Why we need to write parallel programs?


Running multiple instances of a serial program often isn’t very useful.
What you really want is for it to run faster.
Approaches to the serial problem


Rewrite serial programs so that they’re parallel.
Write translation programs that automatically convert serial programs into parallel programs.
o This is very difficult to do.
o Success has been limited.
More problems
Some coding constructs can be recognized by an automatic program generator and converted to a parallel construct.
However, it’s likely that the result will be a very inefficient program.
Sometimes the best parallel solution is to step back and devise an entirely new algorithm.
Example
After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.
Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated “master” core, which adds them to produce the final result.
Example #2
Analysis
In the first example, the master core performs 7 receives and 7 additions.
In the second example, the master core performs 3 receives and 3 additions.
The improvement is more than a factor of 2!
The difference is more dramatic with a larger number of cores.
If we have 1000 cores:
o The first example would require the master to perform 999 receives and 999 additions.
o The second example would only require 10 receives and 10 additions.
That’s an improvement of almost a factor of 100!
Better parallel algorithm
Don’t make the master core do all the work.
Share it among the other cores.
Pair the cores so that core 0 adds its result with core 1’s result.
Core 2 adds its result with core 3’s result, etc.
Work with odd and even numbered pairs of cores.
Repeat the process now with only the evenly ranked cores.
Core 0 adds result from core 2.
Core 4 adds the result from core 6, etc.
Now cores divisible by 4 repeat the process, and so forth, until core 0 has the final result (see the sketch below).
Multiple cores forming a global sum
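A plain-C sketch of this pairing pattern (an illustration, not code from the slides; it simulates p cores by keeping each core's my_sum in an array named partial, an assumption of this example):

#include <stdio.h>

#define P 8   /* number of simulated cores (an assumption for the example) */

int main(void) {
    /* partial[i] plays the role of core i's private my_sum. */
    int partial[P] = {3, 1, 4, 1, 5, 9, 2, 6};

    /* Tree-structured global sum: in each round, a core whose rank is a
       multiple of 2*step adds the value held by its partner (rank + step),
       halving the number of active cores each time.                        */
    for (int step = 1; step < P; step *= 2)
        for (int rank = 0; rank < P; rank += 2 * step)
            if (rank + step < P)
                partial[rank] += partial[rank + step];

    printf("global sum = %d\n", partial[0]);   /* core 0 holds the result */
    return 0;
}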
How do we write parallel programs?
Task parallelism
o Partition the various tasks carried out in solving the problem among the cores.
Data parallelism
o Partition the data used in solving the problem among the cores.
o Each core carries out similar operations on its part of the data.
Division of work – data parallelism
Division of work – task parallelism
Coordination
Cores usually need to coordinate their work.
Communication – one or more cores send their current partial sums to another core.
Load balancing – share the work evenly among the cores so that one is not heavily loaded.
Synchronization – because each core works at its own pace, make sure cores do not get too far ahead of the
rest
Types of parallel systems


Shared-memory
o The cores can share access to the computer’s
memory.
o Coordinate the cores by having them examine and
update shared memory locations.
Distributed-memory
o Each core has its own, private memory.
o The cores must communicate explicitly by sending
messages across a network
Terminology
Concurrent computing – a program is one in which multiple tasks can be in progress at any instant.
Parallel computing – a program is one in which multiple tasks cooperate closely to solve a problem
Distributed computing – a program may need to cooperate with other programs to solve a problem
Concluding Remarks
The laws of physics have brought us to the doorstep of multicore technology.
Serial programs typically don’t benefit from multiple cores.
Automatic parallel program generation from serial program code isn’t the most efficient approach to get high
performance from multicore computers.
Learning to write parallel programs involves learning how to coordinate the cores.
Parallel programs are usually very complex and therefore, require sound program techniques and
development.
The von Neumann Architecture
Main Memory:


This is a collection of locations, each of which is capable of storing both instructions and data.
Every location consists of an address, which is used to access the location, and the contents of the location.
Central processing unit (CPU)


Divided into two parts.
Control unit - responsible for deciding which instruction in a program should be executed. (the boss)
Arithmetic and logic unit (ALU) -responsible for executing the actual instructions. (the worker)
Register – very fast storage, part of the CPU.
Program counter – stores address of the next instruction to be executed.
Bus – wires and hardware that connects the CPU and memory
An operating system “process”


An instance of a computer program that is being executed.
Components of a process:
o The executable machine language program.
o A block of memory.
o Descriptors of resources the OS has allocated to the process.
o Security information.
o Information about the state of the process
Multitasking
Gives the illusion that a single processor system is running multiple programs simultaneously.
Each process takes turns running. (time slice)
After its time is up, it waits until it has a turn again. (blocks)
Threading
Threads are contained within processes.
They allow programmers to divide their programs into (more or less) independent tasks.
The hope is that when one thread blocks because it is waiting on a resource, another will have work to do and
can run.
A process and two threads
Basics of caching


A collection of memory locations that can be accessed in less time than some other memory locations.
A CPU cache is typically located on the same chip, or one that can be accessed much faster than ordinary
memory
Principle of locality
Accessing one location is followed by an access of a nearby location.
Spatial locality – accessing a nearby location.
Temporal locality – accessing in the near future.
Levels of Cache
Cache hit
Cache miss
Issues with cache
When a CPU writes data to cache, the value in cache may be inconsistent with the value in main memory.
Write-through caches handle this by updating the data in main memory at the time it is written to cache.
Write-back caches mark data in the cache as dirty. When the cache line is replaced by a new cache line from
memory, the dirty line is written to memory.
Cache mappings
Fully associative – a new line can be placed at any location in the cache.
Direct mapped – each cache line has a unique location in the cache to which it will be assigned.
n-way set associative – each cache line can be placed in one of n different locations in the cache. When more than one line in memory can be mapped to several different locations in cache, we also need to be able to decide which line should be replaced or evicted.
Example Table: Assignments of a 16-line
main memory to a 4-line cache
Caches and programs
Virtual memory
If we run a very large program or a program that accesses very large data sets, all of the instructions and data may not fit into main memory.
Virtual memory functions as a cache for secondary storage.
It exploits the principle of spatial and temporal locality.
It only keeps the active parts of running programs in main memory.
Swap space - those parts that are idle are kept in a block of secondary storage.
Pages – blocks of data and instructions.
o Usually these are relatively large.
o Most systems have a fixed page size that currently ranges from 4 to 16 kilobytes.
Virtual page numbers
When a program is compiled its pages are assigned virtual page numbers.
When the program is run, a table is created that maps the virtual page numbers to physical addresses.
A page table is used to translate the virtual address into a physical address.
Page Table
Virtual Address divided into Virtual Page Number and Byte Offset
Translation-lookaside buffer (TLB)
Using a page table has the potential to significantly increase each program’s overall run-time.
A special address translation cache in the processor.
It caches a small number of entries (typically 16–512) from the page table in very fast memory.
Page fault – attempting to access a valid physical address for a page in the page table but the page is only
stored on disk.
Instruction Level Parallelism (ILP)
Attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions.
Pipelining - functional units are arranged in stages.
Multiple issue - multiple instructions can be simultaneously initiated.
Pipelining Example #1: Add the floating point numbers 9.87×10⁴ and 6.54×10³
Pipelining Example #2
Assume each operation takes one nanosecond (10⁻⁹ seconds).
This for loop takes about 7000 nanoseconds: it performs 1000 floating point additions, each consisting of 7 one-nanosecond operations.
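The loop itself is not reproduced on the extracted slide; a representative loop of the kind being analyzed (the array names and the count of 1000 iterations are assumptions consistent with the 7000 ns figure above):

#include <stdio.h>

#define N 1000   /* iteration count assumed from the 7000 ns figure */

int main(void) {
    static float x[N], y[N], z[N];

    /* Each z[i] = x[i] + y[i] is one floating point addition.  If an
       addition consists of 7 one-nanosecond operations (fetch, compare
       exponents, shift, add, normalize, round, store), the un-pipelined
       loop needs roughly N * 7 = 7000 ns.                                */
    for (int i = 0; i < N; i++)
        z[i] = x[i] + y[i];

    printf("z[0] = %f\n", z[0]);
    return 0;
}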
Pipelining
Divide the floating point adder into 7 separate pieces of
hardware or functional units.
First unit fetches two operands, second unit compares
exponents, etc.
Output of one functional unit is input to the next.
One floating point addition still takes 7 nanoseconds.
But 1000 floating point additions now take 1006 nanoseconds!
Pipelined Addition Table: Numbers in the table
are subscripts of operands/results.
Multiple Issue
Multiple issue processors replicate functional units and try to
simultaneously execute different instructions in a program.
static multiple issue - functional units are scheduled at compile
time.
dynamic multiple issue – functional units are scheduled at run-time
(superscalar)
Speculation


In order to make use of multiple issue, the system must find
instructions that can be executed simultaneously.
In speculation, the compiler or the processor makes a guess about an
instruction, and then executes the instruction on the basis of the
guess.
Hardware multithreading
There aren’t always good opportunities for simultaneous execution of different threads.
Hardware multithreading provides a means for systems to continue doing useful work when the task being currently executed has stalled.
o Ex., the current task has to wait for data to be loaded from memory.
Simultaneous multithreading (SMT) - a variation on fine-grained multithreading.
Allows multiple threads to make use of the multiple functional units
Fine-grained - the processor switches between threads after each instruction, skipping threads that are
stalled.
o Pros: potential to avoid wasted machine time due to stalls.
o Cons: a thread that’s ready to execute a long sequence of instructions may have to wait to execute
every instruction.
Coarse-grained - only switches threads that are stalled waiting for a time consuming operation to complete.
o Pros: switching threads doesn’t need to be nearly instantaneous.
o Cons: the processor can be idled on shorter stalls, and thread switching will also cause delays.
Flynn’s Taxonomy
SIMD: Single instruction stream Multiple data stream
Parallelism achieved by dividing data among the processors.
Applies the same instruction to multiple data items.
Called data parallelism.
Example
What if we don’t have as many ALUs as data items?


Divide the work and process iteratively.
Ex. m = 4 ALUs and n = 15 data items.
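A small sketch of the "divide the work and process iteratively" idea for m = 4 ALUs and n = 15 data items (an illustration, not the slides' code):

#include <stdio.h>

int main(void) {
    const int m = 4;    /* ALUs       */
    const int n = 15;   /* data items */

    /* The items are processed in rounds of m; in the last round one ALU
       has nothing left to do (4 rounds * 4 ALUs = 16 slots > 15 items). */
    for (int round = 0; round * m < n; round++)
        for (int alu = 0; alu < m; alu++) {
            int item = round * m + alu;
            if (item < n)
                printf("round %d: ALU %d processes item %d\n", round, alu, item);
            else
                printf("round %d: ALU %d is idle\n", round, alu);
        }
    return 0;
}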
SIMD drawbacks
All ALUs are required to execute the same instruction, or remain idle.
In classic design, they must also operate synchronously.
The ALUs have no instruction storage.
Efficient for large data parallel problems, but not other types of more complex parallel problems.
Vector processors
Operate on arrays or vectors of data while conventional CPUs operate on individual data elements or scalars.
Vector registers: Capable of storing a vector of operands and operating simultaneously on their contents.
Vectorized and pipelined functional units: The same operation is applied to each element in the vector (or pairs of elements).
Vector instructions: Operate on vectors rather than scalars.
Interleaved memory.
o Multiple “banks” of memory, which can be accessed more or less independently.
o Distribute elements of a vector across multiple banks, to reduce or eliminate delay in loading/storing successive elements.
Strided memory access and hardware scatter/gather.
o The program accesses elements of a vector located at fixed intervals.
The Pros of Vector Processors
Fast.
Easy to use.
Vectorizing compilers are good at identifying code to exploit.
Compilers also can provide information about code that cannot be vectorized.
o Helps the programmer re-evaluate code.
High memory bandwidth.
Uses every item in a cache line.
The Cons of Vector Processors
They don’t handle irregular data structures as well as other parallel architectures.
A very finite limit to their ability to handle ever larger problems. (scalability)
Graphics Processing Units (GPUs)
Real time graphics application programming interfaces, or APIs, use points, lines, and triangles to internally represent the surface of an object.
A graphics processing pipeline converts the internal representation into an array of pixels that can be sent to a computer screen.
Several stages of this pipeline (called shader functions) are programmable. Typically just a few lines of C code.
Shader functions are also implicitly parallel, since they can be applied to multiple elements in the graphics stream.
GPUs can often optimize performance by using SIMD parallelism.
The current generation of GPUs use SIMD parallelism, although they are not pure SIMD systems.
MIMD: Multiple instruction stream Multiple data stream


Supports multiple simultaneous instruction streams operating on multiple data streams.
Typically consist of a collection of fully independent processing units or cores, each of which has its own
control unit and its own ALU.
Shared Memory System
A collection of autonomous processors is connected to
a memory system via an interconnection network.
Each processor can access each memory location.
The processors usually communicate implicitly by
accessing shared data structures.
Most widely available shared memory systems use one or more multicore processors. (multiple CPUs or cores on a single chip)
UMA multicore system


NUMA multicore system
UMA: Time to access all the memory locations will be the same for all the cores.
NUMA: A memory location a core is directly connected to can be accessed faster than a memory location that
must be accessed through another chip.
Distributed Memory System


Clusters (most popular): A collection of commodity
systems. Connected by a commodity interconnection
network.
Nodes of a cluster are individual computation units joined by a communication network.
Interconnection network


Affects performance of both distributed and shared memory systems.
Two categories:
 Shared memory interconnects
 Distributed memory interconnect
Shared memory interconnects
Bus interconnect: A collection of parallel communication wires together with some hardware that controls
access to the bus. Communication wires are shared by the devices that are connected to it. As the number of
devices connected to the bus increases, contention for use of the bus increases, and performance decreases.
Switched interconnect: Uses switches to control the routing of data among the connected devices.
Crossbar: Allows simultaneous communication among different devices. Faster than buses, but the cost of the switches and links is relatively high. (Crossbar figures below.)
A crossbar switch connecting 4
processors (Pi) and 4 memory
modules (Mj)
Configuration of internal switches
in a crossbar
Simultaneous memory accesses by the processors
Distributed memory interconnects
Two groups


Direct interconnect: Each switch is directly connected to a
processor memory pair, and the switches are connected to each
other.
Indirect interconnect: Switches may not be directly connected
to a processor.
(Figure: a direct interconnect)
Bisection width: A measure of “number of simultaneous communications” or “connectivity”.
Two bisections of a ring
A bisection of a toroidal mesh
Bandwidth: The rate at which a link can transmit data. Usually given in megabits or megabytes per second.
Bisection bandwidth: A measure of network quality. Instead of counting the number of links joining the halves, it
sums the bandwidth of the links.
Fully connected network: Each switch is directly connected to
every other switch.
Hypercube


Highly connected direct interconnect.
Built inductively:
o A one-dimensional hypercube is a fully-connected
system with two processors.
o A two-dimensional hypercube is built from two one-dimensional hypercubes by joining “corresponding” switches.
o Similarly a three-dimensional hypercube is built from
two two-dimensional hypercubes.
Hypercubes
Indirect interconnects


Simple examples of indirect networks:
o Crossbar
o Omega network
Often shown with unidirectional links and a collection of
processors, each of which has an outgoing and an incoming link,
and a switching network.
A generic indirect network
Crossbar interconnect for distributed memory
A switch in an omega network
Omega network
More definitions: Any time data is transmitted, we’re interested in how long
it will take for the data to reach its destination.


Latency: The time that elapses between the source’s beginning to
transmit the data and the destination’s starting to receive the first byte.
Bandwidth: The rate at which the destination receives data after it
has started to receive the first byte.
Cache coherence: Programmers have no control over caches and when they get updated.
A shared memory system with two cores and two caches:
Snooping Cache Coherence
The cores share a bus .
Any signal transmitted on the bus can be “seen” by all cores connected to the bus.
When core 0 updates the copy of x stored in its cache it also broadcasts this information across the bus.
If core 1 is “snooping” the bus, it will see that x has been updated and it can mark its copy of x as invalid.
Directory Based Cache Coherence


Uses a data structure called a directory that stores the status of each cache line.
When a variable is updated, the directory is consulted, and the cache controllers of the cores that have that variable’s cache line in their caches invalidate those lines.
The burden is on software
Hardware and compilers cannot keep up the pace needed.
From now on…
In shared memory programs:
o Start a single process and fork threads.
o Threads carry out tasks.
In distributed memory programs:
o Start multiple processes.
o Processes carry out tasks.
SPMD – single program multiple data
An SPMD program consists of a single executable that can behave as if it were multiple different programs through the use of conditional branches.
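One way to make the SPMD idea concrete is with MPI, which the later lectures cover; the sketch below is an illustration rather than the course's code — every process runs the same executable, and a branch on the rank makes them behave differently:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int my_rank, p;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Single program, multiple data: the conditional branch below is what
       makes process 0 act differently from the other processes.          */
    if (my_rank == 0)
        printf("Process 0 of %d: I will coordinate.\n", p);
    else
        printf("Process %d of %d: I will compute.\n", my_rank, p);

    MPI_Finalize();
    return 0;
}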
Writing Parallel Programs
1. Divide the work among the processes/threads
   (a) so each process/thread gets roughly the same amount of work
   (b) and communication is minimized.
2. Arrange for the processes/threads to synchronize.
3. Arrange for communication among processes/threads.
Shared Memory


Dynamic threads
o Master thread waits for work, forks new threads, and when threads are done, they terminate
o Efficient use of resources, but thread creation and termination is time consuming.
Static threads
o Pool of threads created and are allocated work, but do not terminate until cleanup.
o Better performance, but potential waste of system resources.
Nondeterminism
my_val = Compute_val(my_rank);
x += my_val;
Nondeterminism
Race condition
Critical section
Mutually exclusive
Mutual exclusion lock (mutex, or simply lock)
busy-waiting
Partitioned Global Address Space Languages
message-passing
Input and Output
In distributed memory programs, only process 0 will access stdin. In shared memory programs, only the
master thread or thread 0 will access stdin.
In both distributed memory and shared memory programs all the processes/threads can access stdout and
stderr.
However, because of the indeterminacy of the order of output to stdout, in most cases only a single
process/thread will be used for all output to stdout other than debugging output.
Debug output should always include the rank or id of the process/thread that’s generating the output.
Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr. So, for
example, each process/thread can open its own, private file for reading or writing, but no two
processes/threads will open the same file.
Sources of Overhead in Parallel Programs:
If I use two processors, shouldn't my program run
twice as fast?
No - a number of overheads cause degradation in
performance, like: excess computation,
communication, idling, contention.
The execution profile of a hypothetical parallel
program executing on eight processing elements.
Profile indicates times spent performing
computation (both essential and excess),
communication, and idling.
Interprocess interactions: Processors working
on any non-trivial parallel problem will need to talk to each other.
Idling: Processes may idle because of load imbalance, synchronization, or serial components.
Excess Computation: This is computation not performed by the serial version. This might be because the
serial algorithm is difficult to parallelize, or that some computations are repeated across processors to
minimize communication.
Performance Metrics for Parallel Systems: Execution Time
Serial runtime of a program is the time elapsed between the beginning and the end of its execution on a
sequential computer.
The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last
processor finishes execution.
We denote the serial runtime by TS or TSerial and the parallel runtime by TP or TParallel
Speedup
Number of cores = p
Serial run-time = Tserial
Parallel run-time = Tparallel
Tparallel = Tserial / p (linear speedup)
Speedup of a parallel program
Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements: S = Tserial / Tparallel
Speedup Bounds
Speedup can be as low as 0 (the parallel program never terminates).
Speedup, in theory, should be upper bounded by p – after all, we can only expect a p-fold speedup if we use p times as many resources.
A speedup greater than p is possible only if the processing elements spend on average less than TS / p time solving the problem.
In this case, a single processor could be time-sliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.
Superlinear Speedups
One reason for superlinearity is that the parallel version does less work than the corresponding serial algorithm.
Searching an unstructured tree for a node with a given label, 'S', on two processing elements using depth-first traversal: the two-processor version, with processor 0 searching the left subtree and processor 1 searching the right subtree, expands only the shaded nodes before the solution is found. The corresponding serial formulation expands the entire tree. It is clear that the serial algorithm does more work than the parallel algorithm.
Superlinear Speedups
Resource-based superlinearity: The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity.
Example: A processor with 64KB of cache yields an 80% hit ratio. If two processors are used, since the problem size per processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory.
If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43! (One processor averages 0.8×2 + 0.2×100 = 21.6 ns per access; each of the two processors averages 0.9×2 + 0.08×100 + 0.02×400 = 17.8 ns and handles half the accesses, giving a speedup of 2 × 21.6/17.8 ≈ 2.43.)
Performance Metrics: Efficiency
Efficiency is a measure of the fraction of time for which a processing element is usefully employed.
Mathematically, it is given by E = S / p.
Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.
Efficiency of a parallel program
Speedups and efficiencies of a parallel program
Speedups and efficiencies of a parallel program on different problem sizes (tables of speedup and efficiency)
Effect of overhead: Tparallel = Tserial / p + Toverhead
Amdahl’s Law: Unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited, regardless of the number of cores available. Thus Amdahl’s law provides an upper bound on the speedup that can be obtained by a parallel program.
Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup S achievable by a parallel computer with n processors is S ≤ 1 / (f + (1 − f)/n).
Proof for Traditional Problems
If the fraction of the computation that cannot be divided
into concurrent tasks is f, and no overhead incurs when the
computation is divided into concurrent parts, the time to
perform the computation with n processors is given by:
Tp ≥ fTs + [(1 - f ) Ts] / n
Proof of Amdahl’s Law
Using the preceding expression for Tp, the speedup is S = Ts / Tp ≤ Ts / (f·Ts + (1 − f)·Ts / n) = 1 / (f + (1 − f)/n).
The last expression is obtained by dividing both numerator and denominator by Ts. Multiplying numerator and denominator by n produces the alternate version S ≤ n / (1 + (n − 1)·f).
Example 1
95% of a program’s execution time occurs inside a loop that can be executed in parallel (while the other 5% is serial). What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
Example 2
What is the maximum speedup achievable by the program of the previous example (increasing the number of processors as much as we may prefer)?
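The slides' worked answers are not reproduced here; plugging the numbers into the bound gives roughly 5.9 on 8 CPUs and 20 in the limit of arbitrarily many processors. A small C check (an illustration):

#include <stdio.h>

/* Amdahl bound: S <= 1 / (f + (1 - f)/n), where f is the serial fraction. */
static double amdahl(double f, double n) {
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void) {
    printf("Example 1: f = 0.05, n = 8    -> S <= %.2f\n", amdahl(0.05, 8));
    printf("Example 2: f = 0.05, n -> inf -> S <= %.2f\n", 1.0 / 0.05);
    return 0;
}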
Limitation of Amdahl’s Law
Ignores communication cost.
On communications-intensive applications, it does not capture the additional communication slowdown due to network congestion.
As a result, Amdahl’s law usually overestimates the speedup achievable.
Scalability
In general, a problem is scalable if it can handle ever-increasing problem sizes.
If we can increase the number of processes/threads and keep the efficiency fixed without increasing the problem size, the problem is strongly scalable.
If we keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of
processes/threads, the problem is weakly scalable.
Taking Timings
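The timing code from this slide is not reproduced; a common wall-clock timing pattern in C is sketched below using gettimeofday (the course may instead use a macro such as GET_TIME, or omp_get_wtime / MPI_Wtime):

#include <stdio.h>
#include <sys/time.h>

/* Return the current wall-clock time in seconds. */
static double get_time(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void) {
    double start = get_time();

    /* ... code being timed (a dummy loop here) ... */
    double sum = 0.0;
    for (int i = 0; i < 1000000; i++)
        sum += i * 1e-6;

    double finish = get_time();
    printf("Elapsed time = %e seconds (sum = %f)\n", finish - start, sum);
    return 0;
}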
Foster’s methodology
1. Partitioning: divide the computation to be performed and the data operated on by the computation into small tasks. The focus here should be on identifying tasks that can be executed in parallel.
2. Communication: determine what communication needs to be carried out among the tasks identified in the previous step.
3. Agglomeration or aggregation: combine tasks and communications identified in the first step into larger tasks. For example, if task A must be executed before task B can be executed, it may make sense to aggregate them into a single composite task.
4. Mapping: assign the composite tasks identified in the previous step to processes/threads. This should be done so that communication is minimized, and each process/thread gets roughly the same amount of work.
Histogram example
Serial program - input
1. The number of measurements: data_count
2. An array of data_count floats: data
3. The minimum value for the bin containing the smallest values: min_meas
4. The maximum value for the bin containing the largest values: max_meas
5. The number of bins: bin_count
Serial program - output
1. bin_maxes : an array of bin_count floats
2. bin_counts : an array of bin_count ints
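A minimal serial sketch matching this input/output description (the sample data values and the linear search for a bin are choices made for this example, not necessarily the slides' code):

#include <stdio.h>

int main(void) {
    int   data_count = 10, bin_count = 5;
    float min_meas = 0.0f, max_meas = 5.0f;
    float data[] = {1.3f, 2.9f, 0.4f, 0.3f, 1.3f, 4.4f, 1.7f, 0.4f, 3.2f, 0.3f};

    float bin_maxes[5];
    int   bin_counts[5] = {0};
    float bin_width = (max_meas - min_meas) / bin_count;

    for (int b = 0; b < bin_count; b++)
        bin_maxes[b] = min_meas + bin_width * (b + 1);

    /* Put each measurement in the first bin whose upper bound exceeds it. */
    for (int i = 0; i < data_count; i++) {
        int b = 0;
        while (b < bin_count - 1 && data[i] >= bin_maxes[b])
            b++;
        bin_counts[b]++;
    }

    for (int b = 0; b < bin_count; b++)
        printf("bin %d (max %.1f): %d\n", b, bin_maxes[b], bin_counts[b]);
    return 0;
}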
First two stages of Foster’s Methodology
Alternative definition of tasks and communication
Adding the local arrays
Concluding Remarks:
Serial systems: The standard model of computer hardware has been the von Neumann architecture.
Parallel hardware: Flynn’s taxonomy.
Parallel software:
o We focus on software for homogeneous MIMD systems, consisting of a single program that obtains
parallelism by branching.
o SPMD programs.
Input and Output
o We’ll write programs in which one process or thread can access stdin, and all processes can access
stdout and stderr.
o However, because of nondeterminism, except for debug output we’ll usually have a single process or
thread accessing stdout
Performance
o Speedup
o Efficiency
o Amdahl’s law
o Scalability
Parallel Program Design: Foster’s methodology.
Corresponding MPI functions
Linear model of communication overhead
Point-to-point message takes time ts + twm
ts is the latency
tw is the per-word transfer time (inverse bandwidth)
m is the message size in # words
(Must use compatible units for m and tw )
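A quick numeric illustration of the model (the values of ts, tw, and m are invented for the example, not taken from the slides):

#include <stdio.h>

int main(void) {
    double ts = 10.0;    /* startup latency, in microseconds           */
    double tw = 0.01;    /* per-word transfer time, microseconds/word  */
    double m  = 1000.0;  /* message size in words                      */

    /* Linear model: a point-to-point message costs ts + tw*m.         */
    printf("point-to-point time = %.1f us\n", ts + tw * m);   /* 20.0 us */
    return 0;
}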
Contention
Assuming bi-directional links
Each node can send and receive simultaneously
Contention if link is used by more than one message
k-way contention means tw → tw /k
Topologies
One-to-all broadcast
Input:

The message M is stored locally on the root
Output:

The message M is stored locally on all processes
Ring
Mesh
Ring
Recursive doubling: double the number of active processes in each step
Mesh


Use ring algorithm on the root’s mesh row
Use ring algorithm on all mesh columns in parallel
Hypercube
Generalize mesh algorithm to d dimensions
Algorithm
The algorithms described above are identical on all three topologies
The given algorithm is not general.
Number of steps: d = log2 p
Time per step: ts + twm
Total time: (ts + twm) log2 p
In particular, note that broadcasting to p² processes is only twice as expensive as broadcasting to p processes (log2 p² = 2 log2 p)
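A serial simulation of the recursive-doubling broadcast (an illustration, not the course's listing): ranks are treated as d-bit hypercube labels, and in step i every process that already holds M passes it to the partner whose label differs in bit i.

#include <stdio.h>

int main(void) {
    const int d = 3, p = 1 << d;                 /* 8 processes          */
    int has_msg[8] = {1, 0, 0, 0, 0, 0, 0, 0};   /* only the root has M  */

    for (int i = 0; i < d; i++)                  /* d = log2 p steps     */
        for (int r = 0; r < p; r++) {
            int partner = r ^ (1 << i);          /* flip bit i of rank r */
            if (has_msg[r] && !has_msg[partner])
                has_msg[partner] = 1;            /* "send" M to partner  */
        }

    for (int r = 0; r < p; r++)
        printf("process %d has M: %d\n", r, has_msg[r]);
    return 0;
}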
All-to-one reduction
Algorithm
Analogous to the one-to-all broadcast algorithm
Analogous time (plus the time to compute a ⊕ b)
Reverse order of communications
Reverse direction of communications
Combine incoming message with local message using ⊕
All-to-all broadcast
Ring
Ring algorithm
1. left ← (me − 1) mod p
2. right ← (me + 1) mod p
3. result ← M
4. M ← result
5. for k = 1, 2, . . . , p − 1 do
6.     Send M to right
7.     Receive M from left
8.     result ← result ∪ M
9. end for
The “send” is assumed to be non-blocking
Lines 6–7 can be implemented via MPI_Sendrecv
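An MPI sketch of this ring algorithm (an illustration rather than the course's reference code; each process contributes a single int, and the result array size assumes p ≤ 64):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int me, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int left  = (me - 1 + p) % p;
    int right = (me + 1) % p;

    int msg = 10 * me;       /* this process's message M (example value) */
    int result[64];          /* gathered messages; assumes p <= 64       */
    result[me] = msg;

    /* p - 1 steps: forward the most recently received message to the
       right while receiving a new one from the left (lines 5-9 above).  */
    for (int k = 1; k < p; k++) {
        int recv;
        MPI_Sendrecv(&msg, 1, MPI_INT, right, 0,
                     &recv, 1, MPI_INT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        msg = recv;
        result[(me - k + p) % p] = recv;  /* it originated k hops to the left */
    }

    long total = 0;
    for (int r = 0; r < p; r++)
        total += result[r];
    printf("process %d gathered %d messages, sum = %ld\n", me, p, total);

    MPI_Finalize();
    return 0;
}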
Time of ring algorithm
Number of steps: p − 1
Time per step: ts + twm
Total time: (p − 1)(ts + twm)
Mesh algorithm
The mesh algorithm is based on the ring algorithm:


Apply the ring algorithm to all mesh rows in parallel
Apply the ring algorithm to all mesh columns in parallel
Time of mesh algorithm
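For reference (derived from the two ring phases just described, not copied from the slide): the row phase costs (√p − 1)(ts + twm), and the column phase, in which each node forwards the √p messages it now holds, costs (√p − 1)(ts + tw·√p·m); the total is 2ts(√p − 1) + twm(p − 1).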
Hypercube algorithm
The hypercube algorithm is also based on the ring algorithm:
For each dimension d of the hypercube in sequence, apply the ring algorithm to the 2^(d−1) links in the current dimension in parallel.
Time of hypercube algorithm
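For reference (again derived from the structure above rather than copied from the slide): in step i the exchanged messages have size 2^(i−1)·m, so the total time is the sum over i = 1, …, log2 p of (ts + 2^(i−1)·twm), which equals ts·log2 p + twm(p − 1).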
Summary

All-to-all reduction
Input:
The p2 messages Mr,k for r, k = 0, 1, . . . , p − 1
The message Mr,k is stored locally on process r
An associative reduction operator ⊕
Output:

The “sum” Mr:= M0,r ⊕ M1,r ⊕ · · · ⊕ Mp-1,r stored locally on each process r
Algorithm
Analogous to all-to-all broadcast algorithm
Analogous time (plus the time for computing a ⊕ b)
Reverse order of communications
Reverse direction of communications
Combine incoming message with part of local message using ⊕
Same transfer time (tw term), but the number of messages differs
All-reduce
Input:
The p messages Mk for k = 0, 1, . . . , p − 1
The message Mk is stored locally on process k
An associative reduction operator ⊕
Output:

The “sum” M := M0 ⊕ M1 ⊕ · · · ⊕ Mp-1 stored locally on all processes
Algorithm
Analogous to all-to-all broadcast algorithm
Combine incoming message with local message using ⊕
Cheaper since the message size does not grow
Total time: (ts + twm)log2 p
Prefix sum
Input:
The p messages Mk for k = 0, 1, . . . , p − 1
The message Mk is stored locally on process k
An associative reduction operator ⊕
Output:

The “sum” M(k) := M0 ⊕ M1 ⊕ · · · ⊕ Mk stored locally on process k for all k
Algorithm
Analogous to all-reduce algorithm
Analogous time
Locally store only the corresponding partial sum
Scatter
Input:

The p messages Mk for k = 0, 1, . . . , p − 1 stored
locally on the root
Output:

The message Mk stored locally on process k for all k
Algorithm
Analogous to one-to-all broadcast algorithm
Send half of the messages in the first step, send one quarter in the second step, and so on
More expensive since several messages are sent in each step
Total time: tslog2 p + tw (p − 1)m
Gather
Input:


The p messages Mk for k = 0, 1, . . . , p − 1
The message Mk is stored locally on process k
Output:

The p messages Mk stored locally on the root
Algorithm
Analogous to scatter algorithm
Analogous time
Reverse the order of communications
Reverse the direction of communications
All-to-all personalized
Input:
The p2 messages Mr,k for r, k = 0, 1, . . . , p − 1
The message Mr,k is stored locally on process r
Output:
The p messages Mr,k stored locally on process k for all k
Summary
The hypercube algorithm is not optimal with respect to communication volume (the lower bound is twm(p − 1))
An optimal (w.r.t. volume) hypercube algorithm
Idea: Let each pair of processes exchange messages directly
Time: (p − 1)(ts + twm)
Q: In which order do we pair the processes?
A: In step k, let process me exchange messages with process me XOR k. This can be done without contention!
An optimal hypercube algorithm
An optimal hypercube algorithm based on E-cube routing
E-cube routing
Routing from s to t := s XOR k in step k
The difference between s and t is s XOR t = s XOR (s XOR k) = k
The number of links to traverse equals the number of 1’s in the binary representation of k (the so-called Hamming distance)
E-cube routing: route through the links according to some fixed (arbitrary) ordering imposed on the dimensions
Why does E-cube work?
E-cube routing example
Hypercube for all topologies
Improved one-to-all broadcast
Time analysis
Improved all-to-one reduction
Time analysis
Improved all-reduce
All-reduce = All-to-one reduction + One-to-all broadcast
...but gather followed by scatter cancel out!
Time analysis
Processes and Threads
A process is an instance of a running (or suspended) program.
Threads are analogous to a “light-weight”process.
In a shared memory program a single process may have multiple threads of control.
POSIX® Threads
Also known as Pthreads.
A standard for Unix-like operating systems.
A library that can be linked with C programs.
Specifies an application programming interface (API) for multi-threaded programming.
Caveat: The Pthreads API is only available on POSIX® systems — Linux, MacOS X, Solaris, HPUX, …
Hello World(1)
Hello World(2)
Hello World(3)
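The three Hello World slides are not reproduced in these notes; a minimal Pthreads hello world along the same lines (the names Hello and thread_handles are choices made for this sketch) is:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

int thread_count;   /* global variable: shared by all threads */

void *Hello(void *rank) {
    long my_rank = (long) rank;   /* the rank is passed by value in a pointer */
    printf("Hello from thread %ld of %d\n", my_rank, thread_count);
    return NULL;
}

int main(int argc, char *argv[]) {
    thread_count = (argc > 1) ? atoi(argv[1]) : 4;

    pthread_t *thread_handles = malloc(thread_count * sizeof(pthread_t));

    /* Start the threads. */
    for (long thread = 0; thread < thread_count; thread++)
        pthread_create(&thread_handles[thread], NULL, Hello, (void *) thread);

    printf("Hello from the main thread\n");

    /* Stop the threads: join waits for each one to finish. */
    for (long thread = 0; thread < thread_count; thread++)
        pthread_join(thread_handles[thread], NULL);

    free(thread_handles);
    return 0;
}

With the compile command below it would typically be run as ./pth_hello 4, passing the number of threads on the command line.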
Compiling a Pthread program
gcc -g -Wall -o pth_hello pth_hello.c -lpthread
Running a Pthreads program
Global variables


Can introduce subtle and confusing bugs!
Limit use of global variables to situations in which they’re really needed.
o (Shared variables)
Starting the Threads


Processes in MPI are usually started by a script.
In Pthreads the threads are started by the program executable.
pthread_t objects
Opaque
The actual data that they store is system-specific.
Their data members aren’t directly accessible to user code.
However, the Pthreads standard guarantees that a pthread_t object does store enough information to uniquely
identify the thread with which it’s associated.
A closer look(1)
A closer look(2)
Function started by pthread_create
Prototype: void* thread_function(void* args_p);
void* can be cast to any pointer type in C.
So args_p can point to a list containing one or more values needed by thread_function.
Similarly, the return value of thread_function can point to a list of one or more values.
Running the Threads
Main thread forks and joins two threads.
Stopping the Threads
We call the function pthread_join once for each thread.
A single call to pthread_join will wait for the thread associated with the pthread_t object to complete.
Serial pseudo-code
Using 3 Pthreads
Pthreads matrix-vector multiplication
Estimating π
A thread function for computing π
Note that as we increase n, the estimate
with one thread gets better and better.
Busy-Waiting


Possible race condition
A thread repeatedly tests a condition, but, effectively,
does no useful work until the condition has the
appropriate value.
Beware of optimizing compilers, though!
Pthreads global sum with busy-waiting
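The slide's listing is missing from the extracted notes; the following is a sketch in the same spirit: the π series is split across threads, and a shared flag enforces whose turn it is to update sum. The variable names and the volatile qualifier (which addresses the "beware of optimizing compilers" caveat above) are choices made for this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Shared (global) variables. */
int           thread_count;
long long     n = 1000000;   /* number of series terms (example value)    */
volatile long flag = 0;      /* whose turn it is to update sum            */
double        sum = 0.0;     /* shared result: accumulates the pi/4 series */

void *Thread_sum(void *rank) {
    long my_rank = (long) rank;
    long long my_n = n / thread_count;
    long long my_first_i = my_n * my_rank;
    long long my_last_i  = my_first_i + my_n;
    double factor = (my_first_i % 2 == 0) ? 1.0 : -1.0;

    for (long long i = my_first_i; i < my_last_i; i++, factor = -factor) {
        while (flag != my_rank)            /* busy-wait: burn CPU until my turn */
            ;
        sum += factor / (2 * i + 1);       /* critical section                  */
        flag = (flag + 1) % thread_count;  /* pass the turn to the next thread  */
    }
    return NULL;
}

int main(int argc, char *argv[]) {
    thread_count = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t *handles = malloc(thread_count * sizeof(pthread_t));

    for (long t = 0; t < thread_count; t++)
        pthread_create(&handles[t], NULL, Thread_sum, (void *) t);
    for (long t = 0; t < thread_count; t++)
        pthread_join(handles[t], NULL);

    printf("pi estimate = %.10f\n", 4.0 * sum);
    free(handles);
    return 0;
}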
Global sum function with critical section after loop (1)
Global sum function with critical section after loop (2)
Mutexes
A thread that is busy-waiting may continually use the CPU accomplishing nothing.
Mutex (mutual exclusion) is a special type of variable that can be used to restrict access to a critical section to a single thread at a time.
Used to guarantee that one thread “excludes” all other threads while it executes the critical section.
The Pthreads standard includes a special type for mutexes: pthread_mutex_t.
When a Pthreads program finishes using a mutex, it should call pthread_mutex_destroy.
In order to gain access to a critical section a thread calls pthread_mutex_lock.
When a thread is finished executing the code in a critical section, it should call pthread_mutex_unlock.
Global sum function that uses a mutex(1)
Global sum function that uses a mutex(2)
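Again the listing itself is missing; a sketch of the mutex version of the thread function, which assumes the globals from the busy-waiting sketch above plus a pthread_mutex_t mutex initialized in main with pthread_mutex_init(&mutex, NULL):

/* Each thread accumulates into a private my_sum and only touches the
   shared sum once, inside a locked critical section.                    */
void *Thread_sum_mutex(void *rank) {
    long my_rank = (long) rank;
    long long my_n = n / thread_count;
    long long my_first_i = my_n * my_rank;
    long long my_last_i  = my_first_i + my_n;
    double factor = (my_first_i % 2 == 0) ? 1.0 : -1.0;
    double my_sum = 0.0;

    for (long long i = my_first_i; i < my_last_i; i++, factor = -factor)
        my_sum += factor / (2 * i + 1);

    pthread_mutex_lock(&mutex);     /* enter the critical section */
    sum += my_sum;
    pthread_mutex_unlock(&mutex);   /* leave the critical section */
    return NULL;
}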
Run-times (in seconds) of π programs using n = 10⁸ terms on a system with two four-core processors.
Possible sequence of events with busy-waiting and more
threads than cores.
Issues
Busy-waiting enforces the order threads access a critical section.
Using mutexes, the order is left to chance and the system.
There are applications where we need to control the order threads access the critical section.
Problems with a mutex solution
A first attempt at sending messages using pthreads
Syntax of the various semaphore functions
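The syntax summary itself is not reproduced; for reference, the POSIX declarations (from semaphore.h) are:

#include <semaphore.h>

int sem_init(sem_t *sem, int pshared, unsigned int value);  /* value = initial count       */
int sem_destroy(sem_t *sem);
int sem_wait(sem_t *sem);   /* decrement; block while the semaphore is 0   */
int sem_post(sem_t *sem);   /* increment; possibly wake a waiting thread   */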
Barriers

Synchronizing the threads to make sure that they all are at the same point in a program is called a barrier.
No thread can cross the barrier until all the threads have reached it.
Using barriers to time the slowest thread
Using barriers for debugging
Busy-waiting and a Mutex
Implementing a barrier using busy-waiting and a mutex is straightforward.
We use a shared counter protected by the mutex.
When the counter indicates that every thread has entered the critical section, threads can leave the critical section.
Implementing a barrier with semaphores
Condition Variables
A condition variable is a data object that allows a thread to
suspend execution until a certain event or condition occurs.
When the event or condition occurs another thread can
signal the thread to “wake up.”
A condition variable is always associated with a mutex.
Implementing a barrier with condition variables
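The slide's implementation is not reproduced; a common pattern is sketched below (the generation counter, which guards against spurious wakeups and lets the barrier be reused, is a choice made for this sketch):

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

int thread_count;
int counter = 0, generation = 0;
pthread_mutex_t mutex    = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond_var = PTHREAD_COND_INITIALIZER;

void Barrier(void) {
    pthread_mutex_lock(&mutex);
    int my_generation = generation;
    counter++;
    if (counter == thread_count) {
        counter = 0;
        generation++;                        /* start a new barrier round */
        pthread_cond_broadcast(&cond_var);   /* wake every waiting thread */
    } else {
        /* pthread_cond_wait releases the mutex while waiting and re-locks
           it before returning; looping on the generation handles spurious
           wakeups.                                                        */
        while (my_generation == generation)
            pthread_cond_wait(&cond_var, &mutex);
    }
    pthread_mutex_unlock(&mutex);
}

void *Thread_work(void *rank) {
    long my_rank = (long) rank;
    printf("thread %ld before the barrier\n", my_rank);
    Barrier();
    printf("thread %ld after the barrier\n", my_rank);
    return NULL;
}

int main(int argc, char *argv[]) {
    thread_count = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t *handles = malloc(thread_count * sizeof(pthread_t));
    for (long t = 0; t < thread_count; t++)
        pthread_create(&handles[t], NULL, Thread_work, (void *) t);
    for (long t = 0; t < thread_count; t++)
        pthread_join(handles[t], NULL);
    free(handles);
    return 0;
}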
Linked Lists
Linked List Membership
Deleting a node from a linked list
Inserting a new node into a list
A Multi-Threaded Linked List


In order to share access to the list, we can define head_p to be a global variable.
This will simplify the function headers for Member, Insert, and Delete, since we won’t need to pass in either
head_p or a pointer to head_p: we’ll only need to pass in the value of interest.
Simultaneous access
by two threads
Solution #1


An obvious solution is to simply lock the list any time
that a thread attempts to access it.
A call to each of the three functions can be protected by a mutex.
Issues of Sol #1
We’re serializing access to the list.
If the vast majority of our operations are calls to Member, we’ll fail to exploit this opportunity for parallelism.
On the other hand, if most of our operations are calls to Insert and Delete, then this may be the best solution
since we’ll need to serialize access to the list for most of the operations, and this solution will certainly be easy
to implement.
Solution #2


Instead of locking the entire list, we could try to lock individual nodes.
A “finer-grained” approach.
Issues of Sol #2
This is much more complex than the original Member function.
It is also much slower, since, in general, each time a node is accessed, a mutex must be locked and unlocked.
The addition of a mutex field to each node will substantially increase the amount of storage needed for the list.
Implementation of Member with
one mutex per list node (1)
Implementation of Member with
one mutex per list node(2)
Pthreads Read-Write Locks
Neither of our multi-threaded linked lists exploits the potential for simultaneous access to any node by threads
that are executing Member.
The first solution only allows one thread to access the entire list at any instant.
The second only allows one thread to access any given node at any instant.
A read-write lock is somewhat like a mutex except that it provides two lock functions.
The first lock function locks the read-write lock for reading, while the second locks it for writing.
So multiple threads can simultaneously obtain the lock by calling the read-lock function, while only one thread
can obtain the lock by calling the write-lock function.
Thus, if any threads own the lock for reading, any threads that want to obtain the lock for writing will block in
the call to the write-lock function.
If any thread owns the lock for writing, any threads that want to obtain the lock for reading or writing will
block in their respective locking functions.
Protecting our linked list functions
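The protected calls themselves are not shown; the usual pattern with a pthread_rwlock_t is sketched below (rwlock is assumed to be a global initialized in main with pthread_rwlock_init(&rwlock, NULL), and Member, Insert, and Delete are the list functions discussed above):

pthread_rwlock_rdlock(&rwlock);   /* many readers may hold the lock at once */
Member(value);
pthread_rwlock_unlock(&rwlock);

pthread_rwlock_wrlock(&rwlock);   /* writers get exclusive access           */
Insert(value);
pthread_rwlock_unlock(&rwlock);

pthread_rwlock_wrlock(&rwlock);
Delete(value);
pthread_rwlock_unlock(&rwlock);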
Linked List Performance
100,000 ops/thread: 99.9% Member, 0.05% Insert, 0.05% Delete
100,000 ops/thread: 80% Member, 10% Insert, 10% Delete
Caches, Cache-Coherence, and False Sharing
Recall that chip designers have added blocks of relatively fast memory to processors called cache memory.
The use of cache memory can have a huge impact on shared-memory performance.
A write-miss occurs when a core tries to update a variable that’s not in cache, and it has to access main
memory.
Pthreads matrix-vector multiplication
Thread-Safety

A block of code is thread-safe if it can be simultaneously
executed by multiple threads without causing problems.
Example


Suppose we want to use multiple threads to “tokenize” a file that consists of ordinary English text.
The tokens are just contiguous sequences of characters separated from the rest of the text by white-space — a
space, a tab, or a newline.
Simple approach
Divide the input file into lines of text and assign the lines to the threads in a round-robin fashion.
The first line goes to thread 0, the second goes to thread 1, . . . , the tth goes to thread t, the t+1st goes to thread 0, etc.
We can serialize access to the lines of input using semaphores.
After a thread has read a single line of input, it can tokenize the line using the strtok function.
The strtok function


The first time it’s called the string argument should be the text to be tokenized. Our line of input.
For subsequent calls, the first argument should be NULL.

The idea is that in the first call, strtok caches a pointer to string, and for subsequent calls it returns successive
tokens taken from the cached copy.
Multi-threaded tokenizer (1)
Multi-threaded tokenizer (2)
Running with one thread

It correctly tokenizes the input stream.
o Pease porridge hot.
o Pease porridge cold.
o Pease porridge in the pot
o Nine days old.
Running with two threads
What happened?
strtok caches the input line by declaring a variable to have static storage class.
This causes the value stored in this variable to persist from one call to the next.
Unfortunately for us, this cached string is shared, not private.
Thus, thread 0’s call to strtok with the third line of the input has apparently overwritten the contents of thread 1’s call with the second line.
So the strtok function is not thread-safe. If multiple threads call it simultaneously, the output may not be correct.
Other unsafe C library functions
Regrettably, it’s not uncommon for C library functions to fail to be thread-safe.
The random number generator random in stdlib.h.
The time conversion function localtime in time.h.
“re-entrant” (thread safe) functions
In some cases, the C standard specifies an alternate, thread-safe,
version of a function.
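For strtok, the thread-safe alternative is strtok_r: its extra saveptr argument replaces the hidden static state, so each caller keeps its own position. A small illustration (not taken from the slides):

#include <stdio.h>
#include <string.h>

int main(void) {
    char  line[] = "Pease porridge hot.";
    char *saveptr;                 /* per-caller state instead of a static */
    char *token = strtok_r(line, " \t\n", &saveptr);

    while (token != NULL) {
        printf("token: %s\n", token);
        token = strtok_r(NULL, " \t\n", &saveptr);
    }
    return 0;
}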
Concluding remarks
A thread in shared-memory programming is analogous to a process in distributed memory programming.
o However, a thread is often lighter-weight than a full-fledged process.
o In Pthreads programs, all the threads have access to global variables, while local variables usually are private to the thread running the function.
o When indeterminacy results from multiple threads attempting to access a shared resource such as a
shared variable or a shared file, at least one of the accesses is an update, and the accesses can result in
an error, we have a race condition.
A critical section is a block of code that updates a shared resource that can only be updated by one thread at
a time.
o So the execution of code in a critical section should, effectively, be executed as serial code.
A mutex can be used to avoid conflicting access to critical sections as well.
o Think of it as a lock on a critical section, since mutexes arrange for mutually exclusive access to a
critical section.
A semaphore is the third way to avoid conflicting access to critical sections.
o It is an unsigned int together with two operations: sem_wait and sem_post.
o Semaphores are more powerful than mutexes since they can be initialized to any nonnegative value.
A barrier is a point in a program at which the threads block until all of the threads have reached it.
A read-write lock is used when it’s safe for multiple threads to simultaneously read a data structure, but if a
thread needs to modify or write to the data structure, then only that thread can access the data structure
during the modification.
Some C functions cache data between calls by declaring variables to be static, causing errors when multiple
threads call the function.
o This type of function is not thread-safe
OpenMP




An API for shared-memory parallel programming.
MP = multiprocessing
Designed for systems in which each thread or process can potentially have access to all available memory.
System is viewed as a collection of cores or CPUs, all of which have access to main memory.
Pragmas (#pragma)



Special preprocessor instructions.
Typically added to a system to allow behaviors that
aren’t part of the basic C specification.
Compilers that don’t support the pragmas ignore
them.
36
OpenMP pragmas

# pragma omp parallel
o Most basic parallel directive.
o The number of threads that run the following structured block of code is determined by the run-time
system.
A process forking and
joining two threads
clause




Text that modifies a directive.
The num_threads clause can be added to a parallel directive.
It allows the programmer to specify the number of threads that should execute the following block.
# pragma omp parallel num_threads ( thread_count )
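A minimal sketch of a program that uses the directive and the clause (the thread count 4 is arbitrary; omp_get_thread_num and omp_get_num_threads are standard OpenMP runtime functions):

#include <stdio.h>
#include <omp.h>

int main(void) {
   int thread_count = 4;   /* illustrative; often read from the command line */
# pragma omp parallel num_threads(thread_count)
   {
      printf("Hello from thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
   }   /* implicit barrier; the extra threads terminate here */
   return 0;
}

With gcc this is typically compiled with the -fopenmp option.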
Of note…




There may be system-defined limitations on the number of threads that a program can start.
The OpenMP standard doesn’t guarantee that this will actually start thread_count threads.
Most current systems can start hundreds or even thousands of threads.
Unless we’re trying to start a lot of threads, we will almost always get the desired number of threads.
Some terminology: In OpenMP parlance the collection of threads executing the parallel block — the original thread
and the new threads — is called a team, the original thread is called the master, and the additional threads are called
slaves.
In case the compiler doesn’t support OpenMP
The trapezoid rule
Serial algorithm
A First OpenMP Version
1) We identified two types of tasks:
a) computation of the areas of individual trapezoids, and
b) adding the areas of trapezoids.
2) There is no communication among the tasks in the first collection, but each task in the first collection
communicates with task 1b.
3) We assumed that there would be many more trapezoids than cores.
So we aggregated tasks by assigning a contiguous block of trapezoids to each thread (and a single thread to each core).
37
Assignment of trapezoids to threads
Unpredictable results occur when two (or more) threads attempt to simultaneously execute: global_result += my_result
Mutual exclusion
# pragma omp critical
<- only one thread can execute the following structured block at a time
global_result += my_result ;
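A sketch of the pattern, assuming an illustrative function Local_trap(a, b, n) that computes the calling thread's share of the integral:

double global_result = 0.0;
# pragma omp parallel num_threads(thread_count)
{
   double my_result = Local_trap(a, b, n);   /* each thread handles its own block of trapezoids */
# pragma omp critical
   global_result += my_result;               /* one thread at a time updates the shared sum */
}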
Scope
In serial programming, the scope of a variable consists of those parts of a program in which the variable can be used.
In OpenMP, the scope of a variable refers to the set of threads that can access the variable in a parallel block.
Scope in OpenMP



A variable that can be accessed by all the threads in the team has shared scope.
A variable that can only be accessed by a single thread has private scope.
The default scope for variables declared before a parallel block is shared.
The Reduction Clause
38
Reduction operators



A reduction operator is a binary operation (such as
addition or multiplication).
A reduction is a computation that repeatedly applies
the same reduction operator to a sequence of operands
in order to get a single result.
All of the intermediate results of the operation should
be stored in the same variable: the reduction variable.
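A sketch of the clause applied to the same global sum (Local_trap is the same illustrative helper as above); global_result is the reduction variable and + the reduction operator:

double global_result = 0.0;
# pragma omp parallel num_threads(thread_count) reduction(+: global_result)
global_result += Local_trap(a, b, n);

Each thread adds into a private copy of global_result, and the private copies are combined into the shared variable when the parallel block ends, so no critical directive is needed.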
Parallel for



Forks a team of threads to execute the following
structured block.
However, the structured block following the parallel for
directive must be a for loop.
Furthermore, with the parallel for directive the system
parallelizes the for loop by dividing the iterations of the
loop among the threads.
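A minimal sketch, assuming n and the arrays x, y, z are already declared and initialized:

# pragma omp parallel for num_threads(thread_count)
for (i = 0; i < n; i++)
   z[i] = x[i] + y[i];   /* iterations are split among the threads; the loop variable i is private */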
Legal forms for parallelizable for statements
Caveats




The variable index must have integer or pointer type (e.g., it can’t be a float).
The expressions start, end, and incr must have a compatible type. For example, if index is a pointer, then incr
must have integer type.
The expressions start, end, and incr must not change during execution of the loop.
During execution of the loop, the variable index can only be modified by the “increment expression” in the for
statement.
39
Data dependencies
What happened?
1. OpenMP compilers don’t check for dependences among iterations
in a loop that’s being parallelized with a parallel for directive.
2. A loop in which the results of one or more iterations depend on
other iterations cannot, in general, be correctly parallelized by
OpenMP.
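As an illustration (not a slide reproduction), a Fibonacci-style loop has a loop-carried dependence, so the following produces wrong results for some thread counts:

fibo[0] = fibo[1] = 1;
# pragma omp parallel for num_threads(thread_count)   /* WRONG: iteration i needs i-1 and i-2 */
for (i = 2; i < n; i++)
   fibo[i] = fibo[i-1] + fibo[i-2];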
Estimating π
OpenMP solution #1
OpenMP solution #2
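The two solution slides are not reproduced here; a sketch in the same spirit uses the series pi = 4*(1 - 1/3 + 1/5 - 1/7 + ...), with factor declared private so each iteration recomputes its own sign from k (k, n and pi_approx are assumed to be declared):

double factor, sum = 0.0;
# pragma omp parallel for num_threads(thread_count) \
   reduction(+: sum) private(factor)
for (k = 0; k < n; k++) {
   factor = (k % 2 == 0) ? 1.0 : -1.0;   /* no dependence on the previous iteration */
   sum += factor/(2*k + 1);
}
pi_approx = 4.0*sum;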
The default clause


Lets the programmer specify the default scope for variables in a block.
With the clause default(none), the compiler will require that we specify the scope of each variable that we use in the block and that has been declared outside the block.
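For example, the pi loop above might be written with default(none), making every scope explicit (a sketch; the loop variable k is automatically private):

# pragma omp parallel for num_threads(thread_count) default(none) \
   reduction(+: sum) private(factor) shared(n)
for (k = 0; k < n; k++) {
   factor = (k % 2 == 0) ? 1.0 : -1.0;
   sum += factor/(2*k + 1);
}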
Bubble Sort
Serial Odd-Even Transposition Sort
First OpenMP Odd-Even Sort
40
Second OpenMP Odd-Even Sort
Odd-even sort with two parallel for directives and two for
directives. (Times are in seconds.)
Scheduling loops
Results
 f(i) calls the sin function i times.
 Assume the time to execute f(2i) requires approximately twice as
much time as the time to execute f(i).
 n = 10,000
1. one thread
2. run-time = 3.67 seconds.
 n = 10,000
1. two threads
2. default assignment
3. run-time = 2.76 seconds
4. speedup = 1.33
 n = 10,000
1. two threads
2. cyclic assignment
3. run-time = 1.84 seconds
4. speedup = 1.99
The Schedule Clause
Default schedule:
Cyclic schedule:
schedule ( type , chunksize )


Type can be:
o static: the iterations can be assigned to the threads before the loop is executed.
o dynamic or guided: the iterations are assigned to the threads while the loop is executing.
o auto: the compiler and/or the run-time system determine the schedule.
o runtime: the schedule is determined at run-time.
The chunksize is a positive integer.
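For example, the cyclic assignment that produced the 1.99 speedup above can be requested explicitly (f, n and sum are the illustrative names from the results slide):

# pragma omp parallel for num_threads(thread_count) \
   reduction(+: sum) schedule(static, 1)
for (i = 0; i < n; i++)
   sum += f(i);   /* iteration i is assigned to thread i % thread_count */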
41
The static Schedule Type
twelve iterations, 0, 1, . . . , 11, and three threads
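As a worked example (not a slide reproduction), the chunks are handed out to the threads in round-robin order:

schedule(static, 1): thread 0 gets iterations 0, 3, 6, 9; thread 1 gets 1, 4, 7, 10; thread 2 gets 2, 5, 8, 11.
schedule(static, 2): thread 0 gets 0, 1, 6, 7; thread 1 gets 2, 3, 8, 9; thread 2 gets 4, 5, 10, 11.
schedule(static, 4): thread 0 gets 0, 1, 2, 3; thread 1 gets 4, 5, 6, 7; thread 2 gets 8, 9, 10, 11.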
The Dynamic Schedule Type




The iterations are also broken up into chunks of chunksize consecutive iterations.
Each thread executes a chunk, and when a thread finishes a chunk, it requests another one from the run-time
system.
This continues until all the iterations are completed.
The chunksize can be omitted. When it is omitted, a chunksize of 1 is used.
The Guided Schedule Type




Each thread also executes a chunk, and when a thread finishes a chunk, it requests another one.
However, in a guided schedule, as chunks are completed the size of the new chunks decreases.
If no chunksize is specified, the size of the chunks decreases down to 1.
If chunksize is specified, it decreases down to chunksize, with the exception that the very last chunk can be
smaller than chunksize.
The Runtime Schedule Type


The system uses the environment variable OMP_SCHEDULE to determine at run time how to schedule the
loop.
The OMP_SCHEDULE environment variable can take on any of the values that can be used for a static,
dynamic, or guided schedule.
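A sketch of the usual pattern: the directive only says schedule(runtime), and the schedule is chosen in the environment when the program is launched:

# pragma omp parallel for num_threads(thread_count) \
   reduction(+: sum) schedule(runtime)
for (i = 0; i < n; i++)
   sum += f(i);   /* actual schedule taken from OMP_SCHEDULE at run time */

Setting OMP_SCHEDULE to, for example, "static,1" or "guided" before running the program then changes the schedule without recompiling.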
Queues



Can be viewed as an abstraction of a line of customers waiting to pay for their groceries in a supermarket.
A natural data structure to use in many multithreaded applications.
For example, suppose we have several “producer” threads and several “consumer” threads.
o Producer threads might “produce” requests for data.
o Consumer threads might “consume” the request by finding or generating the requested data.
Message-Passing


Each thread could have a shared message queue, and when
one thread wants to “send a message” to another thread, it
could enqueue the message in the destination thread’s
queue.
A thread could receive a message by dequeuing the message
at the head of its message queue.
Sending Messages
Receiving Messages
42
Termination Detection
Startup





When the program begins execution, a single thread,
the master thread, will get command line arguments and
allocate an array of message queues: one for each thread.
This array needs to be shared among the threads, since
any thread can send to any other thread, and hence any thread
can enqueue a message in any of the queues.
One or more threads may finish allocating their queues before some other threads.
We need an explicit barrier so that when a thread encounters the barrier, it blocks until all the threads in the
team have reached the barrier.
After all the threads have reached the barrier, all the threads in the team can proceed.
# pragma omp barrier
The Atomic Directive





Unlike the critical directive, it can only protect critical sections that consist of a single C assignment statement.
o # pragma omp atomic
Further, the statement must have one of the following forms:
o x <op>= <expression>;
o x++;
o ++x;
o x--;
o --x;
Here <op> can be one of the binary operators
o +, *, -, /, &, ^, |, <<, or >>
Many processors provide a special load-modify-store instruction.
A critical section that only does a load-modify-store can be protected much more efficiently by using this
special instruction rather than the constructs that are used to protect more general critical sections.
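A minimal sketch (counter and my_increment are illustrative shared and private variables):

# pragma omp atomic
counter += my_increment;   /* only the load-modify-store of counter is protected */

Note that only the update of the variable on the left-hand side is atomic; evaluation of the right-hand side expression is not protected.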
Critical Sections



OpenMP provides the option of adding a name to a critical directive:
o # pragma omp critical(name)
When we do this, two blocks protected with critical directives with different names can be executed
simultaneously.
However, the names are set during compilation, and we want a different critical section for each thread’s
queue.
Locks: A lock consists of a data structure and functions that allow the
programmer to explicitly enforce mutual exclusion in a critical section.
Using Locks in the Message-Passing Program
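That program is not reproduced here; a sketch of the OpenMP simple-lock calls protecting one message queue, with an illustrative Enqueue function and one omp_lock_t per queue:

omp_lock_t q_lock;
omp_init_lock(&q_lock);      /* initialize once, when the queue is created   */

omp_set_lock(&q_lock);       /* block until this queue's lock is available   */
Enqueue(queue, my_rank, mesg);
omp_unset_lock(&q_lock);     /* release so other senders/receivers can enter */

omp_destroy_lock(&q_lock);   /* free the lock when the queue is destroyed    */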
43
Some Caveats
1. You shouldn’t mix the different types of mutual exclusion for a single critical section.
2. There is no guarantee of fairness in mutual exclusion constructs.
3. It can be dangerous to “nest” mutual exclusion constructs.
Matrix-vector multiplication
Thread-Safety
Concluding Remarks










OpenMP is a standard for programming shared-memory systems.
OpenMP uses both special functions and preprocessor directives called pragmas.
OpenMP programs start multiple threads rather than multiple processes.
Many OpenMP directives can be modified by clauses.
A major problem in the development of shared memory programs is the possibility of race conditions.
OpenMP provides several mechanisms for ensuring mutual exclusion in critical sections.
o Critical directives
o Named critical directives
o Atomic directives
o Simple locks
By default most systems use a block partitioning of the iterations in a parallelized for loop.
OpenMP offers a variety of scheduling options.
In OpenMP the scope of a variable is the collection of threads to which the variable is accessible.
A reduction is a computation that repeatedly applies the same reduction operator to a sequence of operands in
order to get a single result.
Identifying MPI processes


Common practice to identify processes by nonnegative
integer ranks.
p processes are numbered 0, 1, 2, ... p-1
Our first MPI program
44
Compilation
Execution
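On typical installations the compiler wrapper is mpicc and the launcher is mpiexec, so (with illustrative file names) compilation looks like mpicc -g -Wall -o mpi_hello mpi_hello.c and execution with four processes looks like mpiexec -n 4 ./mpi_hello.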
MPI Programs




Written in C.
o Has main.
o Uses stdio.h, string.h, etc.
Need to add mpi.h header file.
Identifiers defined by MPI start with “MPI_”.
First letter following underscore is uppercase.
o For function names and MPI-defined types.
o Helps to avoid confusion.
MPI Components

MPI_Init
o Tells MPI to do all the necessary setup.

MPI_Finalize
o Tells MPI we’re done, so clean up anything allocated for this program.
Basic Outline
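A sketch of that outline as a compilable skeleton; MPI_Comm_size and MPI_Comm_rank are the standard calls for obtaining the number of processes and this process's rank:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
   int comm_sz;   /* number of processes */
   int my_rank;   /* my process rank     */

   MPI_Init(&argc, &argv);                    /* set up MPI          */
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   printf("Process %d of %d\n", my_rank, comm_sz);

   MPI_Finalize();                            /* clean up, then done */
   return 0;
}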
Communicators
Communicators
 A collection of processes that can
send messages to each other.
 MPI_Init defines a communicator
that consists of all the processes
created when the program is started.
 Called MPI_COMM_WORLD.
SPMD




Single-Program Multiple-Data
We compile one program.
Process 0 does something different.
o Receives messages and prints them while the other processes do the work.
The if-else construct makes our program SPMD.
45
Communication
Data types
Message matching
Communication
Receiving messages

A receiver can get a message without knowing:
o the amount of data in the message,
o the sender of the message,
o or the tag of the message.
status_p argument
How much data am I receiving?
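A sketch of a receive that uses the wildcards and then inspects the status object (the buffer size 100 and type MPI_DOUBLE are illustrative):

MPI_Status status;
double buf[100];
int count;

MPI_Recv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_DOUBLE, &count);   /* how many doubles actually arrived */
/* status.MPI_SOURCE and status.MPI_TAG report the sender and the tag */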
Issues with send and receive




Exact behavior is determined by the MPI implementation.
MPI_Send may behave differently with regard to buffer size, cutoffs and blocking.
MPI_Recv always blocks until a matching message is received.
Know your implementation; don’t make assumptions!
The Trapezoidal Rule in MPI
46
One trapezoid
Pseudo-code for a serial program
Parallelizing the Trapezoidal Rule
1. Partition problem solution into tasks.
2. Identify communication channels
between tasks.
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.
Parallel pseudo-code
Tasks and communications for Trapezoidal Rule
First version (1&2)
First version (3)
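A sketch of the structure of this first version, assuming h = (b-a)/n, local_n = n/comm_sz, and a serial helper Trap that integrates one subinterval (names follow common usage and are not a slide reproduction):

h = (b - a)/n;                    /* same trapezoid width on every process     */
local_n = n/comm_sz;              /* trapezoids handled by this process        */
local_a = a + my_rank*local_n*h;
local_b = local_a + local_n*h;
local_int = Trap(local_a, local_b, local_n, h);

if (my_rank != 0) {
   MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
} else {
   total_int = local_int;
   for (source = 1; source < comm_sz; source++) {
      MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      total_int += local_int;     /* process 0 adds up all the local integrals */
   }
}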
Dealing with I/O
47
Input


Running with 6 processes (unpredictable output)
Most MPI implementations only allow process 0
in MPI_COMM_WORLD access to stdin.
Process 0 must read the data (scanf) and send
to the other processes.
Function for reading user input
Tree-structured communication
1. In the first phase:
(a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and 7 sends to 6.
(b) Processes 0, 2, 4, and 6 add in the received values.
(c) Processes 2 and 6 send their new values to processes 0 and 4, respectively.
(d) Processes 0 and 4 add the received values into their new values.
2. (a) Process 4 sends its newest value to process 0.
(b) Process 0 adds the received value to its newest value
A tree-structured global sum
An alternative tree-structured global sum
48
MPI_Reduce
Predefined reduction operators in MPI
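A sketch of replacing process 0's receive loop with one collective call; MPI_SUM is one of the predefined operators (others include MPI_MAX, MPI_MIN and MPI_PROD):

MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM,
           0 /* dest_process */, MPI_COMM_WORLD);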
Collective vs. Point-to-Point Communications









All the processes in the communicator must call the same collective function.
For example, a program that attempts to match a call to MPI_Reduce on one process with a call to MPI_Recv
on another process is erroneous, and, in all likelihood, the program will hang or crash.
The arguments passed by each process to an MPI collective communication must be “compatible.”
For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call
to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash.
The output_data_p argument is only used on dest_process.
However, all of the processes still need to pass in an actual argument corresponding to output_data_p, even if
it’s just NULL.
Point-to-point communications are matched on the basis of tags and communicators.
Collective communications don’t use tags.
They’re matched solely on the basis of the communicator and the order in which they’re called.
Example 1
Multiple calls to
MPI_Reduce
Example 2




Suppose that each process calls MPI_Reduce with operator MPI_SUM, and destination process 0.
At first glance, it might seem that after the two calls to MPI_Reduce, the value of b will be 3, and the value of d
will be 6.
However, the names of the memory locations are irrelevant to the matching of the calls to MPI_Reduce.
The order of the calls will determine the matching so the value stored in b will be 1+2+1 = 4, and the value
stored in d will be 2+1+2 = 5.
MPI_Allreduce
Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger
computation.
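A sketch; the call looks like MPI_Reduce without the destination argument, because every process receives the result:

MPI_Allreduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM,
              MPI_COMM_WORLD);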
49
Broadcast: Data belonging to a single process is sent to all of the processes in the communicator.
A version of Get_input that uses MPI_Bcast
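That function is not reproduced here; the broadcast call itself looks like this sketch, with process 0 as the root:

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* afterwards every process has process 0's value of n */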
Data distributions(computing a vector sum)
Serial implementation of vector addition
Different partitions of a 12-component
vector among 3 processes
50
Partitioning options



Block partitioning
o Assign blocks of consecutive components to each process.
Cyclic partitioning
o Assign components in a round robin fashion.
Block-cyclic partitioning
o Use a cyclic distribution of blocks of components.
Scatter

Parallel implementation of vector addition
MPI_Scatter can be used in a function
that reads in an entire vector
on process 0 but only sends the needed
components to each of the other processes.
Reading and distributing a vector
Gather

Collect all of the components
of the vector onto process 0,
and then process 0 can process
all of the components.
Print a distributed vector (1)
Print a distributed vector (2)
51
Allgather


Concatenates the contents of each process’ send_buf_p and
stores this in each process’ recv_buf_p.
As usual, recv_count is the amount of data being received from
each process.
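A sketch of the three calls for a block-distributed vector of n doubles with local_n = n/comm_sz; a is the full vector (needed only on process 0 for scatter/gather) and local_a is each process's block:

/* process 0 hands out consecutive blocks of a */
MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE,
            0, MPI_COMM_WORLD);

/* process 0 collects the blocks back into a   */
MPI_Gather(local_a, local_n, MPI_DOUBLE, a, local_n, MPI_DOUBLE,
           0, MPI_COMM_WORLD);

/* every process ends up with the whole vector */
MPI_Allgather(local_a, local_n, MPI_DOUBLE, a, local_n, MPI_DOUBLE,
              MPI_COMM_WORLD);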
Matrix-vector multiplication
Multiply a matrix by a vector
C style arrays
Serial matrix-vector multiplication
An MPI matrix-vector multiplication function (1)
An MPI matrix-vector multiplication function (2)
Derived datatypes





Used to represent any collection of data items in memory by storing both the types of the items and their
relative locations in memory.
The idea is that if a function that sends data knows this information about a collection of data items, it can
collect the items from memory before they are sent.
Similarly, a function that receives data can distribute the items into their correct destinations in memory when
they’re received.
Formally, consists of a sequence of basic MPI data types
together with a displacement for each of the data types.
Trapezoidal Rule example:
52
MPI_Type_create_struct

Builds a derived datatype that consists of
individual elements that have different basic
types.
MPI_Get_address

Returns the address of the memory location referenced by location_p.
The special type MPI_Aint is an integer type that is big
enough to store an address on the system.
MPI_Type_commit

Allows the MPI implementation to optimize its internal representation of the datatype for use in
communication functions.
MPI_Type_free

When we’re finished with our new type,
this frees any additional storage used.
Get input function with a derived datatype (1)
Get input function with a derived datatype (2)
Get input function with a derived datatype(3)
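Those slides are not reproduced here; a sketch of building and using a derived type for the trapezoid inputs a, b (doubles) and n (an int), with displacements computed relative to a:

double a, b;
int n;
MPI_Aint a_addr, b_addr, n_addr;
MPI_Aint displacements[3] = {0};
int blocklengths[3] = {1, 1, 1};
MPI_Datatype types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
MPI_Datatype input_mpi_t;

MPI_Get_address(&a, &a_addr);
MPI_Get_address(&b, &b_addr);
MPI_Get_address(&n, &n_addr);
displacements[1] = b_addr - a_addr;   /* displacement of b relative to a */
displacements[2] = n_addr - a_addr;   /* displacement of n relative to a */

MPI_Type_create_struct(3, blocklengths, displacements, types, &input_mpi_t);
MPI_Type_commit(&input_mpi_t);

/* process 0 reads a, b, n; one broadcast then sends all three at once */
MPI_Bcast(&a, 1, input_mpi_t, 0, MPI_COMM_WORLD);

MPI_Type_free(&input_mpi_t);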
Elapsed parallel time
MPI_Wtime returns the number of seconds
that have elapsed since
some time in the past.
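A sketch of the usual timing pattern with MPI_Wtime; the barrier makes the processes enter the timed section together, and the reduce reports the slowest process's time (my_rank is assumed to be declared):

double local_start, local_finish, local_elapsed, elapsed;

MPI_Barrier(MPI_COMM_WORLD);
local_start = MPI_Wtime();
/* code being timed */
local_finish = MPI_Wtime();
local_elapsed = local_finish - local_start;
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
           MPI_COMM_WORLD);
if (my_rank == 0)
   printf("Elapsed parallel time = %e seconds\n", elapsed);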
Elapsed serial time


In this case, you don’t need to link
in the MPI libraries.
Returns time in microseconds elapsed
from some point in the past.
53
MPI_Barrier
Ensures that no process will return from calling it until every process in the communicator has started calling it.
Speedup
Efficiency
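Briefly, with T_serial the serial run-time, T_parallel the parallel run-time and p the number of processes, speedup is S = T_serial / T_parallel and efficiency is E = S / p = T_serial / (p * T_parallel); linear speedup corresponds to S = p, i.e. E = 1.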
Scalability



A program is scalable if the problem size can be increased at a rate such that the efficiency doesn’t decrease as
the number of processes increases.
Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be
strongly scalable.
Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number
of processes are sometimes said to be weakly scalable.
A parallel sorting algorithm
Sorting




n keys and p = comm_sz processes.
n/p keys assigned to each process.
No restrictions on which keys are assigned to which processes.
When the algorithm terminates:
o The keys assigned to each process should be sorted in (say) increasing order.
o If 0 ≤ q < r < p, then each key assigned to process q should be less than or equal to every key
assigned to process r.
Odd-even transposition sort


A sequence of phases.
Even phases: compare-swap the pairs (a[0], a[1]), (a[2], a[3]), (a[4], a[5]), . . .
Odd phases: compare-swap the pairs (a[1], a[2]), (a[3], a[4]), (a[5], a[6]), . . .
Example





Start: 5, 9, 4, 3
Even phase: compare-swap (5,9) and (4,3) getting the list 5, 9, 3, 4
Odd phase: compare-swap (9,3) getting the list 5, 3, 9, 4
Even phase: compare-swap (5,3) and (9,4) getting the list 3, 5, 4, 9
Odd phase: compare-swap (5,4) getting the list 3, 4, 5, 9
54
Serial odd-even transposition sort
Parallel odd-even transposition sort
Communications among tasks in odd-even sort
Pseudo-code
Compute_partner
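A sketch of the partner computation; ranks that fall off either end are set to MPI_PROC_NULL so that sends and receives addressed to them become no-ops (variable names are illustrative):

if (phase % 2 == 0)   /* even phase: (0,1), (2,3), ... */
   partner = (my_rank % 2 == 0) ? my_rank + 1 : my_rank - 1;
else                  /* odd phase:  (1,2), (3,4), ... */
   partner = (my_rank % 2 == 0) ? my_rank - 1 : my_rank + 1;

if (partner < 0 || partner >= comm_sz)
   partner = MPI_PROC_NULL;   /* this process idles during the phase */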
Safety in MPI programs









The MPI standard allows MPI_Send to behave in two different ways:
o it can simply copy the message into an MPI managed buffer and return,
o or it can block until the matching call to MPI_Recv starts.
Many implementations of MPI set a threshold at which the system switches from buffering to blocking.
Relatively small messages will be buffered by MPI_Send.
Larger messages will cause it to block.
If the MPI_Send executed by each process blocks, no process will be able to start executing a call to
MPI_Recv, and the program will hang or deadlock.
Each process is blocked waiting for an
event that will never happen.
A program that relies on MPI-provided buffering is said to be unsafe.
Such a program may run without problems for various sets of input, but it may hang or crash with other sets.
55
MPI_Ssend


An alternative to MPI_Send defined by the MPI standard.
The extra “s” stands for synchronous and MPI_Ssend is
guaranteed to block until the matching receive starts.
Restructuring communication
MPI_Sendrecv




An alternative to scheduling the communications
ourselves.
Carries out a blocking send and a receive in a single call.
The dest and the source can be the same or different.
Especially useful because MPI schedules the
communications so that the program won’t hang or crash.
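A sketch of the exchange step of the odd-even sort using MPI_Sendrecv: each process sends its own block of keys and receives its partner's block in one call (buffer and count names are illustrative):

MPI_Sendrecv(my_keys,   local_n, MPI_INT, partner, 0,
             recv_keys, local_n, MPI_INT, partner, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);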
Safe communication with five processes
Parallel odd-even transposition sort
56
Concluding Remarks











MPI or the Message-Passing Interface is a library of functions that can be called from C, C++, or Fortran
programs.
A communicator is a collection of processes that can send messages to each other.
Many parallel programs use the single-program multiple data or SPMD approach.
Most serial programs are deterministic: if we run the same program with the same input we’ll get the same
output.
Parallel programs often don’t possess this property.
Collective communications involve all the processes in a communicator.
When we time parallel programs, we’re usually interested in elapsed time or “wall clock time”.
Speedup is the ratio of the serial run-time to the parallel run-time.
Efficiency is the speedup divided by the number of parallel processes.
If it’s possible to increase the problem size (n) so that the efficiency doesn’t decrease as p is increased, a
parallel program is said to be scalable.
An MPI program is unsafe if its correct behavior depends on the fact that MPI_Send is buffering its input.