Basic Concepts in Parallelization
Ruud van der Pas
Senior Staff Engineer, Oracle Solaris Studio
Oracle, Menlo Park, CA, USA
IWOMP 2010 - CCS, University of Tsukuba, Tsukuba, Japan, June 14-16, 2010
Outline
❑ Introduction
❑ Parallel Architectures
❑ Parallel Programming Models
❑ Data Races
❑ Summary
Introduction
Why Parallelization ?
Parallelization is another optimization technique. The goal is to reduce the execution time. To this end, multiple processors, or cores, are used.

[Figure: the same work executed on 1 core versus 4 cores; in the ideal case, using 4 cores reduces the execution time to 1/4 of the single-core time.]
What Is Parallelization ?
"Something" is parallel if there is a certain level
of independence in the order of operations
In other words, it doesn't matter in what order
those operations are performed
sequence of machine instructions
A
collection of program statements
 An
algorithm
 The
RvdP/V1
problem you're trying to solve
Basic Concepts in Parallelization
granularity
A
Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010
6
What is a Thread ?
 Loosely said, a thread consists of a series of instructions with its own program counter ("PC") and state
 A parallel program executes threads in parallel
 These threads are then scheduled onto processors
[Figure: four threads (Thread 0 to Thread 3), each with its own program counter PC, scheduled onto processors P.]
Parallel overhead
❑ The total CPU time often exceeds the serial CPU time:
● The newly introduced parallel portions in your program need to be executed
● Threads need time sending data to each other and synchronizing ("communication")
✔ Often the key contributor, spoiling all the fun
❑ Typically, things also get worse when increasing the number of threads
❑ Efficient parallelization is about minimizing the communication overhead
Communication
[Figure: wallclock time of serial execution compared to parallel execution with and without communication.]

 Parallel, without communication: embarrassingly parallel, 4x faster; the wallclock time is 1/4 of the serial wallclock time
 Parallel, with communication: the additional communication consumes additional resources, so the program is less than 4x faster; the wallclock time is more than 1/4 of the serial wallclock time and the total CPU time increases
Load balancing
 Perfect load balancing: all threads finish in the same amount of time; no thread is idle
 Load imbalance: different threads need a different amount of time to finish their task; threads sit idle, the total wall clock time increases, and the program does not scale well

[Figure: perfect load balancing versus load imbalance, with 1, 2 or 3 threads idle.]
About scalability
[Figure: speed-up S(P) as a function of P; beyond Popt the curve deviates from the ideal line S(P) = P.]

 Define the speed-up S(P) as S(P) := T(1)/T(P)
 The efficiency E(P) is defined as E(P) := S(P)/P
 In the ideal case, S(P) = P and E(P) = P/P = 1 = 100%
 Unless the application is embarrassingly parallel, S(P) eventually starts to deviate from the ideal curve
 Past this point Popt, the application sees less and less benefit from adding processors
 Note that both metrics give no information on the actual run-time; as such, they can be dangerous to use
 In some cases, S(P) exceeds P; this is called "superlinear" behaviour, but don't count on this to happen
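As a minimal sketch (the timings below are made-up example values), both metrics can be computed directly from measured run times:

#include <stdio.h>

int main(void)
{
   double T1 = 100.0, TP = 37.0;   /* measured run times (example values) */
   int    P  = 4;                  /* number of processors used           */

   double S = T1 / TP;             /* speed-up   S(P) := T(1)/T(P) */
   double E = S / P;               /* efficiency E(P) := S(P)/P    */

   printf("S(%d) = %.2f  E(%d) = %.0f%%\n", P, S, P, E*100.0);
   return 0;
}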
Amdahl's Law
Assume our program has a parallel fraction "f". This implies the execution time can be split as T(1) := f*T(1) + (1-f)*T(1)

On P processors: T(P) = (f/P)*T(1) + (1-f)*T(1)

Amdahl's law:
S(P) = T(1)/T(P) = 1 / (f/P + 1-f)

Comments:
▶ This "law" describes the effect the non-parallelizable part of a program has on scalability
▶ Note that the additional overhead caused by parallelization and speed-up because of cache effects are not taken into account
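A small C sketch of this formula; the f values mirror the curves plotted on the next slide:

#include <stdio.h>

/* Amdahl's law: S(P) = 1 / (f/P + 1 - f) */
static double amdahl(double f, int P)
{
   return 1.0 / (f / P + (1.0 - f));
}

int main(void)
{
   double f[] = {1.00, 0.99, 0.75, 0.50, 0.25};
   for (int i = 0; i < 5; i++)
      printf("f = %.2f: S(16) = %5.2f\n", f[i], amdahl(f[i], 16));
   return 0;
}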
Amdahl's Law
[Figure: speed-up on 1-16 processors for f = 1.00, 0.99, 0.75, 0.50, 0.25 and 0.00.]

 It is easy to scale on a small number of processors
 Scalable performance however requires a high degree of parallelization, i.e. f is very close to 1
 This implies that you need to parallelize that part of the code where the majority of the time is spent
Amdahl's Law in practice
We can estimate the parallel fraction "f". Recall: T(P) = (f/P)*T(1) + (1-f)*T(1)

It is trivial to solve this equation for "f":
f = (1 - T(P)/T(1))/(1 - (1/P))

Example: T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70
f = (1-37/100)/(1-(1/4)) = 0.63/0.75 = 0.84 = 84%

Estimated performance on 8 processors is then:
T(8) = (0.84/8)*100 + (1-0.84)*100 = 26.5
S(8) = T(1)/T(8) = 3.78
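The same estimate in C, reproducing the numbers above:

#include <stdio.h>

int main(void)
{
   double T1 = 100.0, T4 = 37.0;

   /* Estimate the parallel fraction from the two measurements */
   double f  = (1.0 - T4/T1) / (1.0 - 1.0/4.0);    /* 0.84 */

   /* Predict the run time and speed-up on 8 processors */
   double T8 = (f/8.0)*T1 + (1.0 - f)*T1;          /* 26.5 */
   printf("f = %.2f  T(8) = %.1f  S(8) = %.1f\n", f, T8, T1/T8);
   return 0;
}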
Threads Are Getting Cheap .....
[Figure: elapsed time and speed-up for f = 84% on 1 to 16 cores. Using 8 cores: 100/26.5 = 3.8x faster.]
Numerical results
Consider: A = B + C + D + E

Serial Processing:
A = B + C
A = A + D
A = A + E

Parallel Processing:
Thread 0: T1 = B + C        Thread 1: T2 = D + E
Thread 0: T1 = T1 + T2      (the final sum)

☞ The roundoff behaviour is different and so the numerical results may be different too
☞ This is natural for parallel programs, but it may be hard to differentiate it from an ordinary bug ....
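A minimal C illustration of why the order matters: floating-point addition is not associative, so regrouping a sum can change the result. The values below are chosen to make the effect obvious:

#include <stdio.h>

int main(void)
{
   float b = 1.0e8f, c = 1.0f, d = -1.0e8f;

   /* Same mathematical sum, two different orders of evaluation */
   printf("(b + c) + d = %.1f\n", (b + c) + d);   /* 0.0: c is lost in b + c */
   printf("(b + d) + c = %.1f\n", (b + d) + c);   /* 1.0                     */
   return 0;
}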
Cache Coherence
Typical cache based system
[Figure: a typical cache-based system: the CPU accesses memory through an L1 cache and an L2 cache.]
Cache line modifications

[Figure: a cache line within a page of main memory is copied into a cache and modified there, creating an inconsistency with the copy in main memory.]
Caches in a parallel computer
❑ A cache line starts in memory
❑ Over time multiple copies of this line may exist

[Figure: CPUs 0-3 each hold a copy of the cache line in their caches; when one CPU modifies the line, the stale copies in the other caches are invalidated.]

Cache Coherence ("cc"):
✔ Tracks changes in copies
✔ Makes sure the correct cache line is used
✔ Different implementations possible
✔ Need hardware support to make it efficient
Parallel Architectures
Uniform Memory Access (UMA)
[Figure: several CPUs, each with a cache, plus memory and I/O, all connected to a single cache-coherent interconnect.]

❑ Also called "SMP" (Symmetric Multi Processor)
❑ Memory Access time is Uniform for all CPUs
❑ CPU can be multicore
❑ Interconnect is "cc":
● Bus
● Crossbar
❑ No fragmentation: Memory and I/O are shared resources

Pro:
✔ Easy to use and to administer
✔ Efficient use of resources

Con:
✔ Said to be expensive
✔ Said to be non-scalable
NUMA
[Figure: several nodes, each with a CPU, cache, memory and I/O, connected by an interconnect.]

❑ Also called "Distributed Memory" or NORMA (No Remote Memory Access)
❑ Memory Access time is Non-Uniform; hence the name "NUMA"
❑ Interconnect is not "cc":
● Ethernet, InfiniBand, etc. ......
❑ Runs 'N' copies of the OS
❑ Memory and I/O are distributed resources

Pro:
✔ Said to be cheap
✔ Said to be scalable

Con:
✔ Difficult to use and administer
✔ Inefficient use of resources
The Hybrid Architecture
[Figure: SMP/multicore nodes connected by a second-level interconnect.]

❑ Second-level interconnect is not cache coherent
● Ethernet, InfiniBand, etc. ....
❑ Hybrid Architecture with all Pros and Cons:
● UMA within one SMP/Multicore node
● NUMA across nodes
cc-NUMA
[Figure: SMP systems connected by a cache-coherent second-level interconnect.]

❑ Two-level interconnect:
● UMA/SMP within one system
● NUMA between the systems
❑ Both interconnects support cache coherence, i.e. the system is fully cache coherent
❑ Has all the advantages ('look and feel') of an SMP
❑ Downside is the Non-Uniform Memory Access time
Parallel Programming Models
How To Program A Parallel Computer?
❑ There are numerous parallel programming models
❑ The ones most well-known are:
● Distributed Memory (usable on any cluster and/or a single SMP)
✔ Sockets (standardized, low level)
✔ PVM - Parallel Virtual Machine (obsolete)
✔ MPI - Message Passing Interface (de-facto std)
● Shared Memory (SMP only)
✔ Posix Threads (standardized, low level)
✔ OpenMP (de-facto standard)
✔ Automatic Parallelization (compiler does it for you)
Parallel Programming Models
Distributed Memory - MPI
What is MPI?
❑ MPI stands for the "Message Passing Interface"
❑ MPI is a very extensive de-facto parallel programming API for distributed memory systems (i.e. a cluster)
● An MPI program can however also be executed on a shared memory system
❑ First specification: 1994
❑ The current MPI-2 specification was released in 1997
● Major enhancements:
✔ Remote memory management
✔ Parallel I/O
✔ Dynamic process management
More about MPI
❑ MPI has its own data types (e.g. MPI_INT)
● User defined data types are supported as well
❑ MPI supports C, C++ and Fortran
● Include file <mpi.h> in C/C++ and "mpif.h" in Fortran
❑ An MPI environment typically consists of at least:
● A library implementing the API
● A compiler and linker that support the library
● A run time environment to launch an MPI program
❑ Various implementations available
● HPC Clustertools, MPICH, MVAPICH, LAM, Voltaire MPI, Scali MPI, HP-MPI, .....
The MPI Programming Model
[Figure: a cluster of systems running six MPI processes (0-5), each with its own memory M.]
The MPI Execution Model
[Figure: all processes start; MPI_Init; the processes compute and communicate; MPI_Finalize; all processes finish.]
The MPI Memory Model
[Figure: processes P, each with private memory, connected through a transfer mechanism.]

✔ All threads/processes have access to their own, private, memory only
✔ Data transfer and most synchronization has to be programmed explicitly
✔ All data is private
✔ Data is shared explicitly by exchanging buffers
Example - Send “N” Integers
#include <mpi.h>                        /* include file               */

you = 1; him = 0;

MPI_Init(&argc, &argv);                 /* initialize MPI environment */

MPI_Comm_rank(MPI_COMM_WORLD, &me);     /* get process ID             */

if ( me == 0 ) {                        /* process 0 sends            */
   error_code = MPI_Send(&data_buffer, N, MPI_INT,
                         you, 1957, MPI_COMM_WORLD);
} else if ( me == 1 ) {                 /* process 1 receives         */
   error_code = MPI_Recv(&data_buffer, N, MPI_INT,
                         him, 1957, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
}

MPI_Finalize();                         /* leave the MPI environment  */
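(With most MPI implementations this fragment, completed with the usual variable declarations, is compiled with the implementation's compiler wrapper and launched with its process starter, e.g. mpicc prog.c -o prog followed by mpirun -np 2 ./prog; the exact command names vary per implementation.)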
Run time Behavior
Process 0                          Process 1
you = 1, him = 0                   you = 1, him = 0
me = 0                             me = 1
MPI_Send:                          MPI_Recv:
  N integers                         N integers
  destination = you = 1              sender = him = 0
  label = 1957                       label = 1957

Yes! Connection established
The Pros and Cons of MPI
❑ Advantages of MPI:
● Flexibility - Can use any cluster of any size
● Straightforward - Just plug in the MPI calls
● Widely available - Several implementations out there
● Widely used - Very popular programming model
❑ Disadvantages of MPI:
● Redesign of application - Could be a lot of work
● Easy to make mistakes - Many details to handle
● Hard to debug - Need to dig into underlying system
● More resources - Typically, more memory is needed
● Special care - Input/Output
Parallel Programming Models
Shared Memory - Automatic Parallelization
Automatic Parallelization (-xautopar)
❑ Compiler performs the parallelization (loop based)
❑ Different iterations of the loop executed in parallel
❑ Same binary used for any number of threads

for (i=0; i<1000; i++)
   a[i] = b[i] + c[i];

With OMP_NUM_THREADS=4 the iterations are divided over four threads:
Thread 0: i = 0-249, Thread 1: i = 250-499, Thread 2: i = 500-749, Thread 3: i = 750-999
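(The -xautopar option in the title is the Oracle Solaris Studio compiler flag that enables this; other compilers offer similar options.)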
Parallel Programming Models
Shared Memory - OpenMP
Shared Memory
[Figure: processors 0, 1, ..., P sharing memory modules M.]

http://www.openmp.org
Shared Memory Model
Programming Model

[Figure: threads T, each with private data, all connected to a globally shared memory.]

✔ All threads have access to the same, globally shared, memory
✔ Data can be shared or private
✔ Shared data is accessible by all threads
✔ Private data can only be accessed by the thread that owns it
✔ Data transfer is transparent to the programmer
✔ Synchronization takes place, but it is mostly implicit
About data
 In a shared memory parallel program variables have a "label" attached to them:
☞ Labelled "Private" ⇨ Visible to one thread only
✔ A change made in local data is not seen by others
✔ Example - Local variables in a function that is executed in parallel
☞ Labelled "Shared" ⇨ Visible to all threads
✔ A change made in global data is seen by all others
✔ Example - Global data
Example - Matrix times vector
#pragma omp parallel for default(none) \
        private(i,j,sum) shared(m,n,a,b,c)
for (i=0; i<m; i++)
{
   sum = 0.0;
   for (j=0; j<n; j++)
      sum += b[i][j]*c[j];
   a[i] = sum;
}

[Figure: a = b*c; row i of matrix b times vector c yields element a[i].]

With two threads and m = 10, the iterations are split as follows:

TID = 0: for (i=0,1,2,3,4)          TID = 1: for (i=5,6,7,8,9)
i = 0: sum = b[i=0][j]*c[j]         i = 5: sum = b[i=5][j]*c[j]
       a[0] = sum                          a[5] = sum
i = 1: sum = b[i=1][j]*c[j]         i = 6: sum = b[i=6][j]*c[j]
       a[1] = sum                          a[6] = sum
... etc ...
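(OpenMP directives like this one are only active when OpenMP is enabled at compile time, e.g. with -xopenmp for the Oracle Solaris Studio compilers or -fopenmp for gcc.)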
A Black and White comparison

MPI                                       OpenMP
De-facto standard                         De-facto standard
Endorsed by all key players               Endorsed by all key players
Runs on any number of (cheap) systems     Limited to one (SMP) system
"Grid Ready"                              Not (yet?) "Grid Ready"
High and steep learning curve             Easier to get started (but, ...)
You're on your own                        Assistance from compiler
All or nothing model                      Mix and match model
No data scoping (shared, private, ..)     Requires data scoping
More widely used (but ....)               Increasingly popular (CMT !)
Sequential version is not preserved       Preserves sequential code
Requires a library only                   Need a compiler
Requires a run-time environment           No special environment
Easier to understand performance          Performance issues implicit
The Hybrid Parallel Programming Model
The Hybrid Programming Model
[Figure: shared memory (processors 0..P with memory modules M) within each node; distributed memory between the nodes.]
Data Races
About Parallelism
Parallelism requires independence: there is no fixed ordering of the operations.

"Something" that does not obey this rule is not parallel (at that level ...)
Shared Memory Programming
[Figure: threads T, each with private data, reading and writing shared variables X and Y in shared memory.]

Threads communicate via shared memory
What is a Data Race?
❑ Two different threads in a multi-threaded shared memory program
❑ Access the same (=shared) memory location:
● Asynchronously, and
● Without holding any common exclusive locks, and
● At least one of the accesses is a write/store
Example of a data race
#pragma omp parallel shared(n)
   {n = omp_get_thread_num();}

[Figure: two threads both write (W) the shared variable n.]
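One way to remove this race, assuming each thread is meant to see its own thread number, is to make n private; a minimal sketch:

#include <omp.h>
#include <stdio.h>

int main(void)
{
   int n;
   #pragma omp parallel private(n)  /* each thread gets its own copy of n    */
   {
      n = omp_get_thread_num();     /* no race: each thread writes its own n */
      printf("thread %d\n", n);
   }
   return 0;
}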
Another example
#pragma omp parallel shared(x)
   {x = x + 1;}

[Figure: two threads both read and write (R/W) the shared variable x.]
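If every thread really must increment the shared x, the update has to be protected; a minimal sketch using an atomic update:

#include <stdio.h>

int main(void)
{
   int x = 0;
   #pragma omp parallel shared(x)
   {
      #pragma omp atomic    /* the read-modify-write is now indivisible */
      x = x + 1;
   }
   printf("x = %d\n", x);   /* equals the number of threads             */
   return 0;
}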
About Data Races
❑ Loosely described, a data race means that the update of a shared variable is not well protected
❑ A data race tends to show up in a nasty way:
● Numerical results are (somewhat) different from run to run
● Especially with Floating-Point data it is difficult to distinguish from a numerical side-effect
● Changing the number of threads can cause the problem to seemingly (dis)appear
✔ May also depend on the load on the system
● May only show up using many threads
A parallel loop
for (i=0; i<8; i++)
   a[i] = a[i] + b[i];

Every iteration in this loop is independent of the other iterations:

Thread 1              Thread 2
a[0]=a[0]+b[0]        a[4]=a[4]+b[4]
a[1]=a[1]+b[1]        a[5]=a[5]+b[5]
a[2]=a[2]+b[2]        a[6]=a[6]+b[6]
a[3]=a[3]+b[3]        a[7]=a[7]+b[7]
Not a parallel loop
for (i=0; i<8; i++)
   a[i] = a[i+1] + b[i];

The result is not deterministic when run in parallel!

Thread 1              Thread 2
a[0]=a[1]+b[0]        a[4]=a[5]+b[4]
a[1]=a[2]+b[1]        a[5]=a[6]+b[5]
a[2]=a[3]+b[2]        a[6]=a[7]+b[6]
a[3]=a[4]+b[3]        a[7]=a[8]+b[7]

For example, the value thread 1 reads in a[4] depends on whether thread 2 has already overwritten it.
About the experiment
❑ We manually parallelized the previous loop
● The compiler detects the data dependence and does not parallelize the loop
❑ Vectors a and b are of type integer
❑ We use the checksum of a as a measure for correctness:
● checksum += a[i] for i = 0, 1, 2, ...., n-2
❑ The correct, sequential, checksum result is computed as a reference
❑ We ran the program using 1, 2, 4, 32 and 48 threads
● Each of these experiments was repeated 4 times (a sketch of such a test program follows below)
Numerical results

threads:  1   checksums: 1953 1953 1953 1953   correct value: 1953
threads:  2   checksums: 1953 1953 1953 1953   correct value: 1953
threads:  4   checksums: 1905 1905 1953 1937   correct value: 1953
threads: 32   checksums: 1525 1473 1489 1513   correct value: 1953
threads: 48   checksums:  936 1007  887  822   correct value: 1953

Data Race In Action !
Summary
Parallelism Is Everywhere
Multiple levels of parallelism:

Granularity         Technology     Programming Model
Instruction Level   Superscalar    Compiler
Chip Level          Multicore      Compiler, OpenMP, MPI
System Level        SMP/cc-NUMA    Compiler, OpenMP, MPI
Grid Level          Cluster        MPI

Threads Are Getting Cheap