
1.1 - Introduction to Parallel Programming

Parallelism Non-negotiable
Martin Krastev, ChaosGroup
Concurrency & Parallelism
While we will mostly refer to parallelism throughout this talk, concurrency is the
underlying concept that enables parallelism; there are different schools of thought
on the subject of concurrency and parallelism, but here we adopt the notion that:
● Concurrency is the mere absence of sequentiality; we have concurrency when some of the steps of an algorithm can happen out of order without changing the final outcome. Example: A = B*C + D*F
● Parallelism is the simultaneous occurrence of such steps. Example: A = sum(B*C, D*F), with B*C and D*F evaluated at the same time (side by side on the time axis).
So, about parallel computing
● The advent of multiprogramming AKA multi-tasking
  ○ Para-virtualized, time-shared multi-tasking -- emerging in the 1960s, prominent across mainframes by the 1970s (IBM System/360)
● Single-task parallelism
  ○ Started with Ridiculously Computationally-Heavy™ tasks which had no alternative but to get well understood and parallelised somehow; often those turned out to be Embarrassingly Parallel™. Such super-tasks soon became delegated to super-computers -- purpose-built parallel machines appearing in the early 1970s (https://en.wikipedia.org/wiki/Vector_processor)
..Well
Tasks we throw at computers are getting ever heavier, but
sequential computing is not getting much faster anymore.
Yes, really
No, seriously
Help us, parallelism
You’re our only hope
First things first -- taxonomy
Established types of parallelism:
● Instruction-level parallelism (pipelining; super-scalarity, VLIW)
● Data-level parallelism (SIMD/SIMT/SPMD)
● Task-level parallelism (multi-threading, multi-tasking, distributed comp.)
Covert types:
● Memory-level parallelism
Parallel Performance Models
Merely knowing that better parallelization would increase performance is not
sufficient; we want prediction models that can set our expectations and, thus, drive
our effort.
● Amdahl’s Law (https://en.wikipedia.org/wiki/Amdahl%27s_law)
● Gustafson-Barsis’s Law (https://en.wikipedia.org/wiki/Gustafson%27s_law)
● Roofline model (https://en.wikipedia.org/wiki/Roofline_model)
Amdahl’s Law
By Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551
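For reference, the speedup curve plotted in the figure follows Amdahl’s formula, written here in the same plain notation as the Gustafson slide below:
speedup = 1 / ((1 - x) + x / processors), x : parallel portion
However many processors we add, speedup stays below 1 / (1 - x) -- the serial portion is the cap.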
Gustafson-Barsis’s Law
Let’s fix the task time and start throwing more and more work and processors at
the parallel portion of our task:
productivity = (1 - x) + processors * x,
x : parallel portion, plot axis in %
y : processors, from 1 to ∞
The larger the parallel portion of our task, the sooner we get unbounded linear
scaling of productivity!
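A minimal C++ sketch (ours, not from the slides; the function names and the assumed 95% parallel portion are just for illustration) that evaluates both laws side by side:

#include <cstdio>

// Amdahl: problem size fixed -- how much faster does the whole task get?
double amdahl(double x, double p)    { return 1.0 / ((1.0 - x) + x / p); }

// Gustafson-Barsis: task time fixed -- how much more work gets done?
double gustafson(double x, double p) { return (1.0 - x) + p * x; }

int main() {
    const double x = 0.95;  // assumed parallel portion
    for (double p : {1.0, 8.0, 64.0, 1024.0})
        std::printf("processors=%6.0f  Amdahl=%7.2f  Gustafson=%8.2f\n",
                    p, amdahl(x, p), gustafson(x, p));
    return 0;
}

With x = 0.95, Amdahl saturates near 20x no matter how many processors we add, while Gustafson keeps scaling -- the same tension the next slide jokes about.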
The moral of the story? (for the gamers among us)
For a given framerate, we can increase the resolution to
infinity!
-- Gustafson-Barsis
YET
For a given resolution, we cannot increase the framerate to
infinity.
-- Gene Amdahl
(sorry, Doom players)
Roofline Model
By Giu.natale - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49641351
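The roofline itself boils down to the standard bound (using the figure’s terms):
attainable performance = min(peak compute throughput, arithmetic intensity * peak memory bandwidth)
Kernels to the left of the ridge point are memory-bound; kernels to the right of it are compute-bound.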
Instruction-level parallelism
Lowest hanging fruit -- we feed it sequentialisms, it produces parallelism for us!
So what’s wrong with that? - We already ate it.
By the early 2000s, pipeline and super-scalarity efficiency gains had plateaued. Any
advances in that direction since then have been minor.
We’ve hit a hardware wall there.
Instruction-level parallelism
Measured in IPC (Instructions Per Clock; or its reciprocal CPI), but..
● Dominated by Data Locality (duh!)
● Even with good Data Locality, compiler (writers) can fail at reading the fine print of the CPU μarch, and kill IPC when we least expect it
● IPC is not always the decisive factor of single-core performance (sketch after this slide):
  execution_time = executed_instructions / (IPC * clock_rate)
VLIW -- a cheaper approach to super-scalarity. Perhaps too cheap?
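A small C++ sketch (ours, not from the slides; the names are made up) of why a dependency chain caps IPC and how independent accumulators restore ILP:

#include <cstddef>

// Single accumulator: every add depends on the previous one -- a serial
// dependency chain that caps IPC no matter how wide the core is.
double sum_serial(const double* a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Four independent accumulators: the adds have no dependencies between them,
// so a super-scalar core can issue several per clock (and the loop is easier
// to auto-vectorise).
double sum_ilp(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)  // leftover tail
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

Both functions do the same work; the second merely hands the scheduler independent adds to issue (note that it also reassociates the floating-point sum).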
Instruction-level parallelism
Simultaneous Multi-Threading (SMT), AKA Hyperthreading
● Another cheap way to improve ILP in super-scalar cores.
● Multiple contexts; instructions are fetched from multiple (2, 4, 8) streams.
● Improves opportunities for instruction scheduling.
● Requires more than one thread to function (duh!)
● Can give throughput improvements ranging from tens of percent down to 0%.
● Can be detrimental to latency-critical single-thread tasks! : (
Data-level parallelism
Arrays AKA Where there’s one, there are many.
“The three four-letter horsemen acronyms of DLP”
● SIMD (Single-instruction, Multiple-data)
● SIMT (Single-instruction, Multiple-threads)
● SPMD (Single-program, Multiple-data)
Data-level parallelism: SIMD
[Diagram: S = MUL(B, C), T = MUL(E, F), U = ADD(S, T) -- each instruction operates on packed 4-element vectors (lanes 0-3) in a single step.]
Data-level parallelism: SIMD
● Same work, fewer instructions (remember execution_time from ILP?)
● Found in virtually all modern CPU ISAs (but not standardized)
● Packed memory accesses: AoS -> SoA -> AoSoA; gather-scatter
● Compilers can extract DLP from loops, and generate SIMD for us; your mileage will vary. (An intrinsics sketch follows below.)
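As a concrete sketch (ours, not from the slides), one 4-wide SSE step of the diagram above; the intrinsics are standard <immintrin.h>, the function name is made up:

#include <immintrin.h>  // SSE intrinsics

// One 4-wide step of the diagram: S = MUL(B, C), T = MUL(E, F), U = ADD(S, T).
// Unaligned loads/stores are used so the pointers need not be 16-byte aligned.
void mul_mul_add4(const float* b, const float* c,
                  const float* e, const float* f, float* u) {
    __m128 S = _mm_mul_ps(_mm_loadu_ps(b), _mm_loadu_ps(c));
    __m128 T = _mm_mul_ps(_mm_loadu_ps(e), _mm_loadu_ps(f));
    _mm_storeu_ps(u, _mm_add_ps(S, T));
}

The equivalent scalar loop over u[i] = b[i]*c[i] + e[i]*f[i] is exactly the kind of thing the compiler may auto-vectorise into such instructions on its own.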
Data-level parallelism: SIMT
[Diagram: the same kernel -- S = MUL(B, C), T = MUL(E, F), U = ADD(S, T) -- executed in lock-step by the threads of a warp (threads 0..7 shown), one element per thread.]
Data-level parallelism: SIMT
● The GPU evolution of SIMD -- same kernel executed in lock-step within SIMD blocks (warps), multiplied across many warps, keeping in-flight thousands of threads; warp threads issue independent memory addresses (but strongly prefer coalescing)
● Many threads generating many memory accesses enable Memory-level parallelism (more on that in a sec)
Data-level parallelism: SPMD
[Diagram: the same kernel -- S = MUL(B, C), T = MUL(E, F), U = ADD(S, T) -- executed by independent threads (0..7 shown) in parallel but not in lock-step, each progressing at its own pace.]
Data-level parallelism: SPMD
● The CPU “answer” to SIMT -- same kernel executed in parallel (not in lock-step) on tens of cores, and (perhaps) in lock-step on the individual lanes of (ever-wider) SIMD units (see the sketch below)
● The same kernel executed on GPU and CPU does SIMT on GPUs and SPMD on CPUs.
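A minimal SPMD sketch in plain C++ (ours, not from the slides; the names are made up): every worker thread runs the same kernel over its own chunk, not in lock-step, and the inner loop is left for the compiler to turn into SIMD:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// The kernel every worker runs over its own chunk: U = B*C + E*F.
// The inner loop is a natural candidate for SIMD auto-vectorisation.
void kernel(const float* b, const float* c, const float* e, const float* f,
            float* u, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        u[i] = b[i] * c[i] + e[i] * f[i];
}

// SPMD on the CPU: the same kernel launched on several threads, each owning a
// disjoint chunk -- parallel, but not in lock-step.
void run_spmd(const float* b, const float* c, const float* e, const float* f,
              float* u, std::size_t n, unsigned workers) {
    std::vector<std::thread> pool;
    const std::size_t chunk = (n + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min<std::size_t>(std::size_t(w) * chunk, n);
        const std::size_t end   = std::min<std::size_t>(begin + chunk, n);
        pool.emplace_back(kernel, b, c, e, f, u, begin, end);
    }
    for (std::thread& t : pool) t.join();
}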
Memory-level parallelism
Postulate: in a memory-latency-bound scenario, the more memory requests we
can generate ahead of time, the higher our efficiency is.
Reality check: we’re standing before a Memory Wall (that Data Locality..)
● MLP was originally devised as a clever way for ILP to by-pass the Memory Wall; but a real breakthrough in MLP requires orders-of-magnitude higher speculative-execution rates than current OOO designs : /
● Hey, SIMT provides orders-of-magnitude more memory requests w/o speculation! There, bandwidth (not the sky) is the limit!
● Ergo, SPMD cannot exploit MLP to the extent SIMT can (a small CPU-side sketch follows below)
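A small CPU-side sketch (ours, not from the slides) of the postulate above: a single pointer chase can keep only one miss outstanding, while interleaving several independent chains keeps several misses in flight:

#include <cstddef>

struct Node { Node* next; long value; };

// One chain: each load depends on the previous node's pointer, so only a
// single cache miss can be outstanding at a time.
long walk_one(const Node* p) {
    long s = 0;
    while (p) { s += p->value; p = p->next; }
    return s;
}

// Four independent chains walked in an interleaved fashion: up to four misses
// can be in flight at once -- that overlap is memory-level parallelism.
long walk_four(const Node* a, const Node* b, const Node* c, const Node* d) {
    long s = 0;
    while (a || b || c || d) {
        if (a) { s += a->value; a = a->next; }
        if (b) { s += b->value; b = b->next; }
        if (c) { s += c->value; c = c->next; }
        if (d) { s += d->value; d = d->next; }
    }
    return s;
}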
Thread/Task-level parallelism
Simplistic view: business as usual with some synchronisation, right? Right..
Memory Consistency, Sequential Consistency and Cache Coherence, or
How to Avoid Creating Diverging Views of the Same Universe™
● The (invisible) foundation: cache-coherence hardware protocols
● Atomics & fences
● Transactional memory
..And then consider UMA vs NUMA vs distributed computing.
Cache-coherence hw protocols
State-of-data state machines:
● MSI (Modified/Shared/Invalid)
● MESI (Modified/Exclusive/Shared/Invalid)
● MOESI (Modified/Owned/Exclusive/Shared/Invalid)
..To name a few.
Concurrency control models
● Pessimistic: locks -> deadlocks : (
  Demand the world acknowledges our state before we move on.
● Optimistic: lockless -> livelocks : /
  Optimists often get more work done than pessimists.
  “Better to work in vain than to stand idle in vain!” (Bulgarian proverb)
(A minimal sketch of both approaches follows below.)
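A minimal sketch of both camps over a shared counter (ours, not from the slides), a mutex for the pessimist and a compare-exchange retry loop for the optimist:

#include <atomic>
#include <mutex>

// Pessimistic: grab the lock first, then mutate -- everyone else waits
// (and with more than one lock in play, deadlocks become possible).
std::mutex m;
long counter_locked = 0;

void add_pessimistic(long v) {
    std::lock_guard<std::mutex> guard(m);
    counter_locked += v;
}

// Optimistic: compute the new value, then try to publish it with a
// compare-exchange; on conflict, just retry (under heavy contention this can
// degenerate into livelock-ish wasted work).
std::atomic<long> counter_lockless{0};

void add_optimistic(long v) {
    long seen = counter_lockless.load(std::memory_order_relaxed);
    while (!counter_lockless.compare_exchange_weak(seen, seen + v,
                                                   std::memory_order_relaxed))
        ;  // compare_exchange_weak refreshed 'seen' -- loop and try again
}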
Atomics
When single-threaded, our code is sequentially-consistent both in value
computations and in side effects.
This is not the case when the effects of our code are observed from other threads.
Ergo, this may not be the case even when the results of our code are observed
from its own thread when running multi-threaded!
The reason is that resources shared among threads are not treated by the underlying coherence protocols the same way C-like programming languages treat them under normal (single-threaded) assumptions. We need special semantics for shared resources. Enter atomics.
Atomics
Atomic types provide atomic read-modify-writes over fundamental data types, e.g. arithmetic and compare-set/compare-exchange over integrals and pointers. They also provide guarantees for the visibility of the effect of the op to other threads (an acquire/release example follows below).
According to the C++11 memory-order model, atomic ordering can be:
● Relaxed
● Consume
● Acquire
● Release
● Acquire-release
● Sequentially-consistent
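A minimal acquire/release sketch (ours, not from the slides): the release store publishes the plain write, and the acquire load on the other thread guarantees it is seen:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                    // plain, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // release: publishes the write above
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // acquire: pairs with the release
        ;                                          // spin until published
    assert(payload == 42);                         // guaranteed visible here
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}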
Fences, AKA Barriers
On an implementation level, atomics often rely on fences. But memory fences can
be used on their own as synchronisation primitives. Fences guarantee sequential
consistency for the memory accesses subject to the fence, in accordance with those
being ‘before’ or ‘after’ the fence point; the guarantee holds for the issuing thread,
but more importantly, for any external observers (e.g. other threads). Fences can
be:
● Load/Read -- affects the loads from memory
● Store/Write -- affects the stores to memory
● Combined -- affects all memory accesses, loads and stores alike
(A fence-based variant of the previous example follows below.)
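The same publish/observe pattern, rewritten with standalone fences and relaxed atomics (ours, not from the slides): the release (store) fence on the writer pairs with the acquire (load) fence on the reader:

#include <atomic>
#include <cassert>

int data = 0;
std::atomic<int> flag{0};

void writer() {
    data = 1;                                             // ordinary store
    std::atomic_thread_fence(std::memory_order_release);  // store fence: orders 'data' before 'flag'
    flag.store(1, std::memory_order_relaxed);
}

void reader() {
    while (flag.load(std::memory_order_relaxed) == 0)
        ;                                                 // spin
    std::atomic_thread_fence(std::memory_order_acquire);  // load fence: orders 'flag' before 'data'
    assert(data == 1);
}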
Transactional memory
A typically optimistic model -- a thread requests an atomic memory transaction (marks begin/end), which may or may not succeed.
● First hw implementations appeared in mainframe/server chips, but the tech gradually makes its way to the desktop.
● Intel first introduced it in Haswell (the TSX extension) -- then discovered a bug in the implementation and disabled the feature. It was re-introduced in later Broadwell models; the originally-affected models never recovered. (A hedged RTM sketch follows below.)
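A hedged sketch of the hardware-TM programming model via Intel RTM (_xbegin/_xend/_xabort from <immintrin.h>, built with -mrtm); this is our illustration, not code from the talk, it only runs on TSX-capable parts, and real lock elision has more corner cases than shown:

#include <immintrin.h>  // RTM intrinsics: _xbegin/_xend/_xabort (build with -mrtm)
#include <atomic>

std::atomic<bool> fallback_lock{false};  // transactions can always abort: a fallback path is mandatory
long balance = 0;

void deposit(long v) {
    unsigned status = _xbegin();                    // try to open a hardware transaction
    if (status == _XBEGIN_STARTED) {
        if (fallback_lock.load(std::memory_order_relaxed))
            _xabort(0xff);                           // someone is on the slow path: don't race it
        balance += v;                                // speculative -- buffered until commit
        _xend();                                     // commit
    } else {                                         // aborted: conflict, capacity, no TSX support, ...
        while (fallback_lock.exchange(true, std::memory_order_acquire))
            ;                                        // spin until we own the fallback lock
        balance += v;
        fallback_lock.store(false, std::memory_order_release);
    }
}

Reading the fallback lock inside the transaction puts it in the read set, so a concurrent slow-path acquisition aborts the transaction -- the optimistic and pessimistic paths never run side by side.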
Let’s wrap it up for today (take-aways)
● If your task is suitable for SIMT, by God and country, write it for SIMT!
● If not, then still given DLP, SPMD might be your best bet; make use of SIMD!
● When sharing resources, optimistic concurrency control may generally be harder to implement than pessimistic, but it often performs better.
● Don’t leave ILP optimisations for last -- ILP issues could be the tip of data-locality problems -- use profilers throughout the dev cycle of your project.
● Don’t blindly trust the compiler to provide the best ILP for you -- get familiar with the architecture/μarch, so you can get the most out of the compiler.
References
[0] Karl Rupp, 40 Years of Microprocessor Trend Data https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/
[1] William Dally, Efficiency and Parallelism: The Challenges of Future Computing https://www.youtube.com/watch?v=l1ImS3gbg08
[2] Andrew Glew, MLP yes! ILP no! https://people.eecs.berkeley.edu/~kubitron/asplos98/abstracts/andrew_glew.pdf
[3] L. Ceze, J. Tuck and J. Torrellas, Are We Ready for High Memory-Level Parallelism? http://iacoma.cs.uiuc.edu/iacoma-papers/wmpi06.pdf
[4] Herb Sutter, Strong and Weak Hardware Memory Models https://herbsutter.com/2012/08/02/strong-and-weak-hardware-memory-models/
[5] Peter Sewell, C/C++11 mappings to processors https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
[6] Jeff Preshing, Acquire and Release Semantics http://preshing.com/20120913/acquire-and-release-semantics/
[7] Jeff Preshing, The Purpose of memory_order_consume in C++11 http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/