Beyond Threads:
Scalable, Composable Parallelism
with Intel® Cilk™ Plus and TBB
Jim Cownie <james.h.cownie@intel.com>
Intel
SSG/DPD/TCAR
Software & Services Group
Developer Products Division
Warwick HPC
Copyright© 2012, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
17 Feb 2012
1
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
2
Performance Trends
After ~2004 only the
number of transistors
continues to increase
We have hit limits in
• Power
• Instruction level
parallelism
• Clock speed
Single core scalar
performance is now
only growing slowly
3
But… Moore’s law is alive and well
New Intel technology generation every 2 years
Intel R&D technologies drive this pace well into the decade
Process   Year
90nm      2003
65nm      2005
45nm      2007
32nm      2009
22nm      2011
14nm      2013
10nm      2015
Figure labels: Hi-K metal-gate, 3-D Trigate, Shrink
Other figure labels: 25 nm, 15nm
We will have lots of transistors!
4
How do we use all the transistors?
Eat other system
components
• Graphics
• Memory i/f
• PCI i/f
Add cache
Replicate cores
This is a desktop part, but it has four cores, each with two
HW threads and 256-bit (8 single or 4 double precision) SIMD
FP units
Number of cores will continue to increase in the future
Data and thread parallelism are mandatory to achieve highest
performance
5
How do we use all the transistors for HPC?
• Many Integrated Core
(“MIC”, aka “Knights …”)
• >50 cache coherent
cores, 4 HW threads/core
• 512 bit vector FPU/core
• 22nm process
• Extended x86 ISA
• Linux kernel
• Fortran, C, C++, Cilk,
OpenMP, MPI, …
• Demonstrated >1 TFlop
sustained on DGEMM
Data and thread parallelism are even
more important here!
6
Exascale trends
• US government wants 1 ExaFlop in 20MW in 2018
• Critical issues
– Power (requires 300x improvement in energy efficiency!)
– Reliability
– Programmability (MPI + what?)
– Did I mention Power?
• Architecture:
– Cluster of SMP nodes
– Each node will have lots (100s..1000s?) of cores
– Each core will have wide vector units
Data and thread parallelism become more important
Homogeneous MPI parallelism won’t cut it
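The power target above can be turned into an efficiency figure with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the baseline it derives is simply what the slide's "300x" figure implies, not a measured value.

```cpp
// Energy efficiency needed to hit a performance target within a power budget,
// expressed in GFlop/s per watt.
double required_gflops_per_watt(double flops_per_second, double watts)
{
    return flops_per_second / watts / 1e9;
}

// Efficiency implied for the starting point if the target needs an
// `improvement`-fold gain (300 on the slide).
double implied_baseline_gflops_per_watt(double target_gflops_per_watt,
                                        double improvement)
{
    return target_gflops_per_watt / improvement;
}
```

For 1 ExaFlop (1e18 Flop/s) in 20 MW (20e6 W), the first function gives 50 GFlop/s per watt. Dividing by 300 suggests a starting point of roughly 0.17 GFlop/s per watt.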
7
But, didn’t we solve threading in the 1990s?
• Pthreads standard:
IEEE 1003.1c-1995
• OpenMP standard:
1997
Yes, but…
• How do I choose how many threads to use?
• How do I split up my work? Should I have one
function per thread?
• How do I debug with non-determinism?
• How do I balance load between threads?
• What happens if I call a library that also wants to
use threads?
• What happens on a new machine with more cores?
Programming with threads is HARD
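To make the questions above concrete, here is a minimal sketch of a sum written with explicit threads in standard C++. It is not taken from the talk, and the function name threaded_sum is hypothetical. The comments mark where the programmer has to choose a thread count and a work split by hand.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` using explicitly created threads.
double threaded_sum(const std::vector<double>& data, unsigned num_threads)
{
    // Decision 1: how many threads? Here 0 means "ask the hardware",
    // which says nothing about what other libraries are already running.
    if (num_threads == 0)
        num_threads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t n = data.size();

    for (unsigned t = 0; t < num_threads; ++t) {
        // Decision 2: how to split the work? A fixed, equal-sized split
        // cannot rebalance if one thread is delayed.
        const std::ptrdiff_t lo = static_cast<std::ptrdiff_t>(n * t / num_threads);
        const std::ptrdiff_t hi = static_cast<std::ptrdiff_t>(n * (t + 1) / num_threads);
        workers.emplace_back([&data, &partial, t, lo, hi] {
            // Each thread writes only its own slot, so there is no race.
            partial[t] = std::accumulate(data.begin() + lo, data.begin() + hi, 0.0);
        });
    }

    // Everyone waits for the slowest thread.
    for (std::thread& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling threaded_sum(values, 4) fixes the thread count in the source, and the value that suits one machine may not suit the next.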
8
The answer was in Seattle…
9
Scalable, Composable Parallelism
• Scalable: a single binary can exploit all the cores in
the HW it happens to be running on
– Efficiently
– Without requiring user control
Scalable software benefits from future HW
• Composable: Parallelism can be used at all levels of
SW stack (user code, library, nested library,…)
– Without over-subscription
– With parallelism exploited at each level
Composable software allows use of parallel libraries
10
What’s wrong with OpenMP?
• Parallelism is compulsory
• You know which thread you are: omp_get_thread_num()
• You know how many threads exist:
omp_get_num_threads()
• You control how work is assigned to threads:
schedule(…)
• OpenMP gives you lots of control but you end up
tuning for the current machine
OpenMP gives you too many knobs to play with!
11
What’s wrong with OpenMP?
• Static scheduling can’t handle jitter
– If one thread runs slowly (OS interrupt, more cache/TLB
misses) all threads have to wait
– With more cores jitter is more likely
• Nested parallelism is dangerous
– If OMP_NESTED=false, inner parallelism is not
exploited
– If OMP_NESTED=true, it’s easy to get exponential
over-subscription
OpenMP is not composable
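The "exponential over-subscription" can be estimated with simple arithmetic. The sketch below assumes that every thread in every enclosing parallel region opens a new team of the same size. That is the worst case for nested regions, not a statement about what any particular OpenMP runtime does. The function name is hypothetical.

```cpp
#include <cstdint>

// Number of threads active at the innermost level when each of `depth`
// nested parallel regions creates a team of `threads_per_region` threads.
std::uint64_t nested_threads(std::uint64_t threads_per_region, unsigned depth)
{
    std::uint64_t total = 1;
    for (unsigned level = 0; level < depth; ++level)
        total *= threads_per_region;  // every existing thread forks a full team
    return total;
}
```

Under that assumption, a 16-thread team nested three levels deep (user code, library, nested library) asks for 16 * 16 * 16 = 4096 threads on a machine that may have 16 hardware threads.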
12
OK, but how can I have parallelism without
threads?
Think about the parallelism in your problem
• Describe the way your problem can be broken
down into independent computations (tasks)
• Let the runtime do the hard work
– handle allocation of tasks to threads to ensure efficient
execution
– choose the number of threads to use depending on
available hardware
You don’t normally worry about register allocation,
similarly you shouldn’t worry about threads
13
Key Features of Cilk™ Plus
• Small extensions to C and C++
• Express the independent tasks in your code
• Express the vector operations in your code
• Results are deterministic
– There is a “serial elision” of the parallel code
• Formal properties
– Guaranteed memory limits: executing on n threads uses <=
n times the memory of the serial code
– Provably efficient work-stealing scheduler
• Tools support: Cilk screen, Cilk view
• Public specification with an open-source
implementation in a GCC branch
Cilk lets programmers think about their problem,
not the runtime implementation
14
Example: Fibonacci Numbers
The Fibonacci numbers are the sequence 0, 1, 1,
2, 3, 5, 8, 13, 21, 34, …, where each number is
the sum of the previous two.
Recurrence:
F0 = 0,
F1 = 1,
Fn = Fn–1 + Fn–2 for n > 1
It is named after Leonardo di Pisa (1170–1250 CE),
known as Fibonacci. Fibonacci’s 1202 book Liber
Abaci introduced the sequence to Western
mathematics, though it had previously been
discovered in India.
15
Fibonacci Execution
int fib(int n)
{
    if (n < 2) return n;
    int x = fib(n-1);
    int y = fib(n-2);
    return x + y;
}
Key idea for parallelization:
fib(n-1) and fib(n-2) can be
calculated simultaneously
Call tree for fib(4):
fib(4)
    fib(3)
        fib(2)
            fib(1)
            fib(0)
        fib(1)
    fib(2)
        fib(1)
        fib(0)
16
Nested Parallelism in Cilk™ Plus
int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n-1);  // The named child function may execute in parallel with the caller
    int y = fib(n-2);
    cilk_sync;                    // Control cannot pass here until all spawned children have returned
    return x+y;
}
Cilk keywords grant permission for parallel
execution. They do not force it.
Code with the Cilk keywords macro-ed out is a
correct serial version.
Parallelism is introduced recursively
so composability happens trivially.
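The serial elision can be tried with an ordinary C++ compiler. This is a minimal sketch that assumes no Cilk support is available, so the two keywords are defined as empty macros. The macro definitions are this sketch's assumption and are not shown on the slide.

```cpp
// Assumption: no Cilk-aware compiler is in use, so the keywords are
// macro-ed out. What remains is plain serial C++.
#define cilk_spawn
#define cilk_sync

int fib(int n)
{
    if (n < 2) return n;
    int x = cilk_spawn fib(n-1);  // expands to: int x = fib(n-1);
    int y = fib(n-2);
    cilk_sync;                    // expands to an empty statement
    return x + y;
}
```

With the macros in place, fib(10) is computed by the same recursion as the serial version on the previous slide and returns 55.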
17
So, you can handle functional code,
but what about real code?
• Recursion is hard, we’re not all Lisp programmers!
– cilk_for compiles a loop into a recursive parallel task
decomposition of the iteration space
• Real code has global variables whose update from
parallel tasks would be racy
• Races are
– Hard to detect (non-deterministic values)
– Hard to fix (need to modify every access and add locking)
Solution
• Cilk screen for detecting problems
• Reducers for removing them
Cilk™ Plus is more than just the language extensions
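The idea of turning a loop into a recursive task decomposition can be sketched in standard C++. This illustrates the shape of the decomposition only. It uses std::async, which starts a thread per split, whereas Cilk uses a work-stealing scheduler. The names recursive_for, squares and the grain size of 64 are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Applies body(i) for every i in [lo, hi) by halving the range recursively.
// Ranges of at most `grain` iterations run as an ordinary serial loop.
void recursive_for(std::size_t lo, std::size_t hi, std::size_t grain,
                   const std::function<void(std::size_t)>& body)
{
    if (hi <= lo) return;
    if (hi - lo <= std::max<std::size_t>(grain, 1)) {
        for (std::size_t i = lo; i < hi; ++i) body(i);
        return;
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    // "Spawn" the left half; it may run in parallel with the right half.
    std::future<void> left = std::async(std::launch::async,
        [lo, mid, grain, &body] { recursive_for(lo, mid, grain, body); });
    recursive_for(mid, hi, grain, body);  // the caller continues with the right half
    left.get();                           // "sync": wait for the spawned half
}

// Usage: every iteration writes a distinct element, so there is no race.
std::vector<long> squares(std::size_t n)
{
    std::vector<long> out(n);
    recursive_for(0, n, 64, [&out](std::size_t i) {
        out[i] = static_cast<long>(i) * static_cast<long>(i);
    });
    return out;
}
```

The loop body is unchanged from what a serial for loop would contain. Only the traversal of the iteration space is different.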
18
Cilk screen
• Cilk screen runs on the executable image using metadata
embedded by the compiler
– No need for a special build
• For a given input, and lock-free code, Cilk screen guarantees to
localize a race if there exists a parallel execution that could
produce results different from the serial execution
• It runs about 20 times slower than real-time
[Screenshot: Cilk screen race report, annotated with the address of the data, the location of the 1st access, the location of the 2nd access, and the backtrace of the 2nd access]
19
Reducer Hyperobjects
• A variable can be declared as a reducer over an
associative operation, e.g. multiplication, logical
AND, list concatenation, …
• Strands can update the variable as if it were an
ordinary variable, but it is maintained as a
collection of different views
• The runtime system coordinates the views and
combines them when appropriate
• When only one view remains, the underlying value
is stable and can be extracted
Example: summing reducer
Views x: 42, x: 14 and x: 33 combine to give 89
20
Reducers in Cilk™ Plus
• You can write your own reducers with any
reduction operation
• Reducers can be used (though less elegantly) in C
cilk::reducer_opadd<float> sum = 0;   // Not lexically bound to a particular loop
...
cilk_for( size_t i=1; i<n; ++i )
    sum += f(i);                      // Updates local view of sum
... = sum.get_value();                // Read final value of sum
Reducers simplify race removal
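The "collection of views" idea can be modelled without any runtime support. The sketch below uses string concatenation, an associative but non-commutative operation, and deliberately fills the views out of order to mimic strands finishing at arbitrary times. It is a model of the concept under those assumptions, not the Cilk Plus implementation. The names piece, concat_serial and concat_with_views are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-iteration work whose results are appended.
std::string piece(std::size_t i) { return std::to_string(i) + ","; }

// Serial reference: one variable, updated in iteration order.
std::string concat_serial(std::size_t n)
{
    std::string out;
    for (std::size_t i = 0; i < n; ++i) out += piece(i);
    return out;
}

// Reducer-style version: each contiguous chunk of iterations updates its
// own view, and the views are combined in iteration order afterwards.
std::string concat_with_views(std::size_t n, std::size_t num_views)
{
    if (num_views == 0) num_views = 1;
    std::vector<std::string> views(num_views);  // each starts at the identity ("")

    // Chunks are processed last-to-first to mimic strands that finish
    // in an arbitrary order.
    for (std::size_t v = num_views; v-- > 0; ) {
        const std::size_t lo = n * v / num_views;
        const std::size_t hi = n * (v + 1) / num_views;
        for (std::size_t i = lo; i < hi; ++i) views[v] += piece(i);
    }

    // Combining the views left to right reproduces the serial result,
    // because the operation is associative.
    std::string result;
    for (const std::string& view : views) result += view;
    return result;
}
```

Because the views are merged in iteration order, concat_with_views(10, 3) yields the same string as concat_serial(10), even though the chunks were filled in a different order.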
21
Vector language features, the “Plus”
• Similar to Fortran 90 vector language but
– in C/C++
– no vector temporaries introduced by the compiler
• Explicit vector expressions
    x[:] = a*x[:] + y[:];                     // Known lengths
    x[0:count] = a*x[0:count] + y[0:count];
    x[0:n:2] = a*x[0:n:2] + y[0:n:2];         // Strided
    x[i1[:]] = y[i2[:]];                      // scatter, gather
• Elemental functions
• #pragma simd to force vectorization
• Generated code is comparable with hand-coded
“intrinsics”
Explicit vector language makes it easier to exploit
SIMD instructions efficiently
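For readers unfamiliar with the array notation, the scalar loops below show what the strided and scatter/gather expressions mean. This sketch reads a section as start:length:stride, so x[0:n:2] touches n elements at indices 0, 2, 4, and so on. The function names are hypothetical, and the loops express only the meaning, not how a vectorizing compiler would implement it.

```cpp
#include <cstddef>
#include <vector>

// Scalar meaning of:
//   x[start:length:stride] = a*x[start:length:stride] + y[start:length:stride];
void axpy_section(float a, std::vector<float>& x, const std::vector<float>& y,
                  std::size_t start, std::size_t length, std::size_t stride)
{
    for (std::size_t k = 0; k < length; ++k) {
        const std::size_t i = start + k * stride;  // k-th element of the section
        x[i] = a * x[i] + y[i];
    }
}

// Scalar meaning of:  x[i1[:]] = y[i2[:]];
// Gathers from y through i2 and scatters into x through i1.
void scatter_gather(std::vector<float>& x, const std::vector<std::size_t>& i1,
                    const std::vector<float>& y, const std::vector<std::size_t>& i2)
{
    for (std::size_t k = 0; k < i1.size(); ++k)
        x[i1[k]] = y[i2[k]];
}
```

For example, with x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, axpy_section(2, x, y, 0, 2, 2) updates only x[0] and x[2], leaving x = {12, 2, 36, 4}.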
22
Performance Tuning: Cilk view output for
Cache-Oblivious Stencil
A cache-oblivious stencil
algorithm gives good
parallelism and minimizes
cache misses by using
divide-and-conquer in all
dimensions including time.
Algorithm designed by
Frigo and Strumpen in
"Cache-oblivious stencil
computations" (ICS '05)
[Figure: Cilk view plot with curves for Linear Speedup, Measured Speedup and Burdened Speedup, annotated "Available Parallelism" and the value 47.93]
23
What if I don’t want a new language?
Use Threading Building Blocks (TBB)
• Open Source (GPL) C++ template library
– Ported to many machines and OSes (not just Intel
architectures)
• Cilk-like concepts: tasks and a work-stealing runtime
• Scalable, composable (and composes with Cilk)
• Additional features beyond Cilk’s recursive parallelism
– Pipelines of tasks (tbb::pipeline)
– General dependency DAGs of tasks (tbb::flow::graph)
– Task priorities
– Parallel containers
– Memory allocator optimized for parallel use
TBB provides portable task-based parallelism in C++
without requiring new language features.
24
Summary
On modern and foreseeable processors
• Data parallelism (vectorization) matters
• Task parallelism matters
• Dealing with threads directly is too hard to be a
feasible solution
– Restricts parallelism to one level of the software stack
– Hard to scale forwards as new hardware appears
• OpenMP has trouble scaling and composing
• Tasking systems like Cilk™ Plus and TBB are
available now and are better alternatives
Check out www.cilk.com,
www.threadingbuildingblocks.org
25
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED,
BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components
and reflect the approximate performance of Intel products as measured by those tests. Any
difference in system hardware or software design or configuration may affect actual performance.
Buyers should consult other sources of information to evaluate the performance of systems or
components they are considering purchasing. For more information on performance tests and on
the performance of Intel products, reference www.intel.com/software/products.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino
Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386,
Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside,
Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel
NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel
XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium
Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon
Inside are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2011. Intel Corporation.
http://intel.com/software/products
27
Notes and References
Slide 3: Graph by Herb Sutter from http://www.gotw.ca/publications/concurrency-ddj.htm
Slide 5: More Ivy Bridge info is available from
http://www.intel.com/idf/library/pdf/sf_2011/SF11_SPCS005_101F.pdf
Slide 9: Photos taken by Jim Cownie, used with permission 
Slide 14: Information about Intel® Cilk™ Plus can be found at http://www.cilkplus.org,
along with pointers to the open source implementation.
“The implementation of the Cilk-5 multithreaded language”
(http://dl.acm.org/citation.cfm?doid=277652.277725 )
Slide 23: Cilk view is described in “The Cilkview scalability analyzer”
(http://dl.acm.org/citation.cfm?id=1810479.1810509)
“Cache oblivious stencil computations” (http://dl.acm.org/citation.cfm?id=1088197)
Slide 24: TBB is described at www.threadingbuildingblocks.org
28