Beyond Threads: Scalable, Composable Parallelism with Intel® Cilk™ Plus and TBB
Jim Cownie <james.h.cownie@intel.com>, Intel SSG/DPD/TCAR
Warwick HPC, 17 Feb 2012
Software & Services Group, Developer Products Division
Copyright © 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Slide 2: Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Slide 3: Performance Trends
After ~2004, only the number of transistors continues to increase. We have hit limits in:
• Power
• Instruction-level parallelism
• Clock speed
Single-core scalar performance is now growing only slowly.
Slide 4: But… Moore’s law is alive and well
A new Intel process-technology generation arrives every two years: 90nm (2003), 65nm (2005), 45nm (2007), 32nm (2009), 22nm (2011), 14nm (2013), 10nm (2015), with innovations such as Hi-K metal gate and 3-D Trigate transistors along the way. Intel R&D technologies drive this pace well into the decade. We will have lots of transistors!

Slide 5: How do we use all the transistors?
• Eat other system components: graphics, memory interface, PCI interface
• Add cache
• Replicate cores
This is a desktop part, but it has four cores, each with two hardware threads and 256-bit (8 single- or 4 double-precision) SIMD FP units. The number of cores will continue to increase in the future. Data and thread parallelism are mandatory to achieve the highest performance.

Slide 6: How do we use all the transistors for HPC?
Many Integrated Core (“MIC”, aka “Knights …”):
• >50 cache-coherent cores, 4 HW threads/core
• 512-bit vector FPU per core
• 22nm process
• Extended x86 ISA
• Linux kernel
• Fortran, C, C++, Cilk, OpenMP, MPI, …
• Demonstrated >1 TFlop sustained on DGEMM
Data and thread parallelism are even more important here!

Slide 7: Exascale trends
• The US government wants 1 ExaFlop in 20MW in 2018
• Critical issues:
– Power (requires a 300x improvement in energy efficiency!)
– Reliability
– Programmability (MPI + what?)
– Did I mention Power?
• Architecture: a cluster of SMP nodes; each node will have lots (100s..1000s?) of cores; each core will have wide vector units
Data and thread parallelism become more important. Homogeneous MPI parallelism won’t cut it.

Slide 8: But, didn’t we solve threading in the 1990s?
• Pthreads standard: IEEE 1003.1c-1995
• OpenMP standard: 1997
Yes, but…
• How do I choose how many threads to use?
• How do I split up my work? Should I have a function per thread?
• How do I debug with non-determinism?
• How do I balance load between threads?
• What happens if I call a library that also wants to use threads?
• What happens on a new machine with more cores?
Programming with threads is HARD.

Slide 9: The answer was in Seattle…

Slide 10: Scalable, Composable Parallelism
• Scalable: a single binary can exploit all the cores in the hardware it happens to be running on
– Efficiently
– Without requiring user control
Scalable software benefits from future hardware.
• Composable: parallelism can be used at all levels of the software stack (user code, library, nested library, …)
– Without over-subscription
– With parallelism exploited at each level
Composable software allows the use of parallel libraries.

Slide 11: What’s wrong with OpenMP?
• Parallelism is compulsory
• You know which thread you are: omp_get_thread_num()
• You know how many threads exist: omp_get_num_threads()
• You control how work is assigned to threads: schedule(…)
• OpenMP gives you lots of control, but you end up tuning for the current machine
OpenMP gives you too many knobs to play with!

Slide 12: What’s wrong with OpenMP? (continued)
• Static scheduling can’t handle jitter
– If one thread runs slowly (an OS interrupt, more cache/TLB misses), all the threads have to wait
– With more cores, jitter is more likely
• Nested parallelism is dangerous
– If OMP_NESTED=false, inner parallelism is not exploited
– If OMP_NESTED=true, it is easy to get exponential over-subscription
OpenMP is not composable.

Slide 13: OK, but how can I have parallelism without threads?
Think about the parallelism in your problem:
• Describe the way your problem can be broken down into independent computations (tasks)
• Let the runtime do the hard work:
– handle the allocation of tasks to threads to ensure efficient execution
– choose the number of threads to use depending on the available hardware
You don’t normally worry about register allocation; similarly, you shouldn’t worry about threads.
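The "describe the tasks, let the runtime schedule them" idea above can be sketched in standard C++. This is only an illustration of the task-decomposition style, not Cilk or TBB themselves: std::async stands in for a real work-stealing scheduler, which would map these tasks onto threads far more efficiently.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Recursive task decomposition: describe how the work splits into
// independent halves and let a runtime map tasks onto threads.
// std::async is a stand-in for a work-stealing scheduler here.
long sum_range(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < 1000)                          // small grain: run serially
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,   // left half: independent task
                           sum_range, std::cref(v), lo, mid);
    long right = sum_range(v, mid, hi);          // right half: run in the caller
    return left.get() + right;
}
```

Note that the code says nothing about how many threads exist or which thread runs which half; that is exactly the property the slide is after.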
Slide 14: Key Features of Cilk™ Plus
• Small extensions to C and C++
• Express the independent tasks in your code
• Express the vector operations in your code
• Results are deterministic
– There is a “serial elision” of the parallel code
• Formal properties
– Guaranteed memory limits: executing on n threads uses at most n times the memory of the serial code
– Provably efficient work-stealing scheduler
• Tool support: Cilk screen, Cilk view
• Public specification with an open-source implementation in a GCC branch
Cilk lets programmers think about their problem, not the runtime implementation.

Slide 15: Example: Fibonacci Numbers
The Fibonacci numbers are the sequence 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, …, where each number is the sum of the previous two.
Recurrence: F0 = 0, F1 = 1, Fn = Fn–1 + Fn–2 for n > 1.
The sequence is named after Leonardo di Pisa (1170–1250 CE), known as Fibonacci, whose 1202 book Liber Abaci introduced it to Western mathematics, though it had previously been discovered in India.

Slide 16: Fibonacci Execution

    int fib(int n) {
        if (n < 2) return n;
        int x = fib(n-1);
        int y = fib(n-2);
        return x + y;
    }

Key idea for parallelization: fib(n-1) and fib(n-2) can be calculated simultaneously.
Call tree for fib(4): fib(4) calls fib(3) and fib(2); fib(3) calls fib(2) and fib(1); each fib(2) calls fib(1) and fib(0).
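The fib(4) call tree on the slide has nine nodes, and the tree grows quickly with n, which is why there is so much independent work to mine. A small (standard C++) counter makes that concrete:

```cpp
// Counts the nodes in the naive Fibonacci call tree from the slide.
// Each non-leaf call contributes itself plus the trees of its two children,
// so the count satisfies calls(n) = 1 + calls(n-1) + calls(n-2).
int fib_calls(int n) {
    if (n < 2) return 1;  // fib(0) and fib(1) are leaves: one call each
    return 1 + fib_calls(n - 1) + fib_calls(n - 2);
}
```

fib_calls(4) is 9, matching the nine-node tree drawn on the slide, and the count grows exponentially in n, so even modest inputs expose abundant parallelism.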
Slide 17: Nested Parallelism in Cilk™ Plus

    int fib(int n) {
        if (n < 2) return n;
        int x = cilk_spawn fib(n-1);  // the named child function may execute
                                      // in parallel with the caller
        int y = fib(n-2);
        cilk_sync;                    // control cannot pass here until all
                                      // spawned children have returned
        return x + y;
    }

Cilk keywords grant permission for parallel execution; they do not force it. Code with the Cilk keywords macro-ed out is a correct serial version. Parallelism is introduced recursively, so composability happens trivially.

Slide 18: So, you can handle functional code, but what about real code?
• Recursion is hard; we’re not all Lisp programmers!
– cilk_for compiles a loop into a recursive parallel task decomposition of the iteration space
• Real code has global variables whose update from parallel tasks would be racy
• Races are:
– Hard to detect (non-deterministic values)
– Hard to fix (you need to modify every access and add locking)
Solution:
• Cilk screen for detecting problems
• Reducers for removing them
Cilk™ Plus is more than just the language extensions.

Slide 19: Cilk screen
• Cilk screen runs on the executable image using metadata embedded by the compiler
– No need for a special build
• For a given input, and lock-free code, Cilk screen guarantees to localize a race if there exists a parallel execution that could produce results different from the serial execution
• It runs about 20 times slower than real time
Its report shows, for each race: the address of the data, the locations of the first and second accesses, and a backtrace of the second access.
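Slide 17’s claim that code with the Cilk keywords macro-ed out is a correct serial program can be checked directly with plain C++. The empty #defines below mirror the idea of the Cilk “stub” headers (the macro names are the keywords themselves; no Cilk compiler is assumed):

```cpp
// With the keywords elided, the parallel fib is exactly the serial fib.
#define cilk_spawn
#define cilk_sync

int fib(int n) {
    if (n < 2) return n;
    int x = cilk_spawn fib(n - 1);  // after elision: an ordinary call
    int y = fib(n - 2);
    cilk_sync;                      // after elision: an empty statement
    return x + y;
}
```

This "serial elision" is what makes Cilk programs debuggable with ordinary serial tools and gives them a well-defined serial semantics.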
Slide 20: Reducer Hyperobjects
• A variable can be declared as a reducer over an associative operation, e.g. multiplication, logical AND, list concatenation, …
• Strands can update the variable as if it were an ordinary variable, but it is maintained as a collection of different views
• The runtime system coordinates the views and combines them when appropriate
• When only one view remains, the underlying value is stable and can be extracted
Example: a summing reducer, where separate strands hold private views such as x: 42, x: 14, and x: 33, which are combined into the single value 89.

Slide 21: Reducers in Cilk™ Plus
• You can write your own reducers with any reduction operation
• Reducers can be used (though less elegantly) in C

    cilk::reducer_opadd<float> sum = 0;   // not lexically bound to a particular loop
    ...
    cilk_for( size_t i=1; i<n; ++i )
        sum += f(i);                      // updates the local view of sum
    ...
    ... = sum.get_value();                // read the final value of sum

Reducers simplify race removal.
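The view-combining idea behind reducers can be illustrated by hand in standard C++ (this is an illustration of the concept only, not the Cilk runtime, which creates and merges views automatically and lazily):

```cpp
#include <thread>
#include <vector>

// Hand-rolled sketch of a summing reducer: each worker updates a private
// "view" of the sum, and the views are merged with the associative
// operation (+) once only one strand remains. No locks, no races.
long sum_0_to_n(int n, int workers) {
    std::vector<long> view(workers, 0);      // one view per worker
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&view, w, n, workers] {
            for (int i = w; i < n; i += workers)
                view[w] += i;                // update looks like an ordinary +=
        });
    for (auto& t : pool) t.join();
    long sum = 0;
    for (long v : view) sum += v;            // the reduction step
    return sum;
}
```

Because + is associative, the result is independent of how the work was divided among views, which is why reducers can give deterministic results without locking.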
Slide 22: Vector language features, the “Plus”
• Similar to the Fortran 90 vector language, but:
– in C/C++
– no vector temporaries are introduced by the compiler
• Explicit vector expressions:

    x[:] = a*x[:] + y[:];                    // known lengths
    x[0:count] = a*x[0:count] + y[0:count];
    x[0:n:2] = a*x[0:n:2] + y[0:n:2];        // strided
    x[i1[:]] = y[i2[:]];                     // scatter, gather

• Elemental functions
• #pragma simd to force vectorization
• The generated code is comparable with hand-coded “intrinsics”
An explicit vector language makes it easier to exploit SIMD instructions efficiently.

Slide 23: Performance Tuning: Cilk view output for a Cache-Oblivious Stencil
A cache-oblivious stencil algorithm gives good parallelism and minimizes cache misses by using divide-and-conquer in all dimensions, including time. The Cilk view plot shows linear speedup, measured speedup, and burdened speedup, with an available parallelism of 47.93. The algorithm was designed by Frigo and Strumpen in “Cache-oblivious stencil computations” (ICS ’05).

Slide 24: What if I don’t want a new language? Use Threading Building Blocks (TBB)
• Open-source (GPL) C++ template library
– Ported to many machines and OSes (not just Intel architectures)
• Cilk-like concepts of tasks and a work-stealing runtime
• Scalable, composable (and composes with Cilk)
• Additional features beyond Cilk’s recursive parallelism:
– Pipelines of tasks (tbb::pipeline)
– General dependency DAGs of tasks (tbb::flow::graph)
– Task priorities
– Parallel containers
– A memory allocator optimized for parallel use
TBB provides portable task-based parallelism in C++ without requiring new language features.
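The array-notation statements on slide 22 have straightforward scalar meanings; writing them out as ordinary loops (standard C++, no Cilk compiler assumed) shows the loop shapes the compiler maps onto SIMD lanes:

```cpp
#include <vector>

// Scalar equivalent of x[0:n] = a*x[0:n] + y[0:n];
void saxpy(float a, std::vector<float>& x, const std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = a * x[i] + y[i];
}

// Illustration of a strided section (Cilk array sections are written
// start:length:stride): here we touch the even-indexed elements only,
// as a stride-2 section starting at 0 would.
void saxpy_even(float a, std::vector<float>& x, const std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); i += 2)
        x[i] = a * x[i] + y[i];
}
```

The array notation expresses the same computation in one statement per loop, with no loop-carried dependences for the compiler to prove away, which is what makes vectorization reliable.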
Slide 25: Summary
On modern and foreseeable processors:
• Data parallelism (vectorization) matters
• Task parallelism matters
• Dealing with threads directly is too hard to be a feasible solution
– It restricts parallelism to one level of the software stack
– It is hard to scale forwards as new hardware appears
• OpenMP has trouble scaling and composing
• Tasking systems like Cilk™ Plus and TBB are available now and are better alternatives
Check out www.cilk.com and www.threadingbuildingblocks.org.

Slide 27: Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2011, Intel Corporation. http://intel.com/software/products

Slide 28: Notes and References
Slide 3: Graph by Herb Sutter, from http://www.gotw.ca/publications/concurrency-ddj.htm
Slide 5: More Ivy Bridge information is available from http://www.intel.com/idf/library/pdf/sf_2011/SF11_SPCS005_101F.pdf
Slide 9: Photos taken by Jim Cownie, used with permission
Slide 14: Information about Intel® Cilk™ Plus can be found at http://www.cilkplus.org, along with pointers to the open-source implementation.
Cilk itself is described in “The implementation of the Cilk-5 multithreaded language” (http://dl.acm.org/citation.cfm?doid=277652.277725)
Slide 23: Cilk view is described in “The Cilkview scalability analyzer” (http://dl.acm.org/citation.cfm?id=1810479.1810509); the stencil algorithm in “Cache-oblivious stencil computations” (http://dl.acm.org/citation.cfm?id=1088197)
Slide 24: TBB is described at www.threadingbuildingblocks.org