Intel® Threading Building Blocks

Agenda
•Overview
•Intel® Threading Building Blocks
 − Parallel Algorithms
 − Task Scheduler
 − Concurrent Containers
 − Sync Primitives
 − Memory Allocator
•Summary

Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries
Software and Services Group

Multi-Core is Mainstream
•Gaining performance from multi-core requires parallel programming
•Multi-threading is used to:
 • Reduce or hide latency
 • Increase throughput

Going Parallel
Typical Serial C++ Program → Ideal Parallel C++ Program, and the issues in between:
•Algorithms → Parallel Algorithms: developing them from scratch requires many code changes, and it often takes a threading expert to get them right
•Data Structures → Thread-safe and scalable data structures: serial data structures usually require global locks to make operations thread-safe
•Minimum of dependencies → Efficient use of synchronization primitives: too many dependencies mean expensive synchronization and poor parallel performance
•Memory Management → Scalable memory manager: the standard memory allocator is often inefficient in a multi-threaded app

Intel® Threading Building Blocks
•Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
•Concurrent Containers: common idioms for concurrent access; a scalable alternative to a serial container with a lock around it
•TBB Flow Graph – New!
•Thread Local Storage: scalable implementation of thread-local data that supports an unlimited number of TLS slots
•Task Scheduler: the engine that powers the parallel algorithms; it employs task stealing to maximize concurrency
•Miscellaneous: threads, thread-safe timers, OS API wrappers
•Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
•Memory Allocation: per-thread scalable memory manager and false-sharing-free allocators

Intel® Threading Building Blocks: Extend C++ for parallelism
•Portable C++ runtime library that does thread management, letting developers focus on proven parallel patterns
•Scalable •Composable •Flexible •Portable
Both GPL and commercial licenses are available. http://threadingbuildingblocks.org
*Other names and brands may be claimed as the property of others

Intel® Threading Building Blocks: Parallel Algorithms

Generic Parallel Algorithms
•Loop parallelization: parallel_for, parallel_reduce, parallel_scan
 >Load-balanced parallel execution of a fixed number of independent loop iterations
•Parallel algorithms for streams: parallel_do, parallel_for_each, pipeline / parallel_pipeline
 >Use for an unstructured stream or pile of work
•Parallel function invocation: parallel_invoke
 >Parallel execution of a number of user-specified functions
•Parallel sort: parallel_sort
 >Comparison sort with an average time complexity of O(N log N)

Parallel Algorithm Usage Example

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    using namespace tbb;

    // ChangeArray defines a for-loop body for parallel_for;
    // blocked_range is the TBB template representing a 1D iteration space
    class ChangeArray {
        int* array;
    public:
        ChangeArray (int* a): array(a) {}
        // As usual with C++ function objects, the main work is done inside operator()
        void operator()( const blocked_range<int>& r ) const {
            for (int i = r.begin(); i != r.end(); i++)
                Foo (array[i]);
        }
    };

    void ChangeArrayParallel (int* a, int n) {
        // A call to the template function parallel_for<Range, Body>
        // with Range = blocked_range and Body = ChangeArray
        parallel_for (blocked_range<int>(0, n), ChangeArray(a));
    }

    int main () {
        int A[N];
        // initialize array here…
        ChangeArrayParallel (A, N);
        return 0;
    }

How the range is split:

    parallel_for(Range(Data), Body(), Partitioner());

The range [Data, Data+N) is recursively split into halves, [Data, Data+N/2) and [Data+N/2, Data+N), and so on down to pieces such as [Data, Data+GrainSize); the split-off tasks remain available to thieves.

Two Execution Orders
•Depth first (stack): small space, excellent cache locality, no parallelism
•Breadth first (queue): large space, poor cache locality, maximum parallelism

Work Depth First; Steal Breadth First
•Best choice for theft: a big piece of work whose data is far from the victim thread’s hot data (its L1/L2 caches); work close to the victim’s hot data is only the second-best choice

C++0x Lambda Expression Support
The parallel_for example transforms into:

    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    using namespace tbb;

    void ChangeArrayParallel (int* a, int n) {
        // parallel_for has an overload that takes start, stop and step
        // arguments and constructs the blocked_range internally
        parallel_for (0, n, 1, [=](int i) {
            Foo (a[i]);
        });
    }

    int main () {
        int A[N];
        // initialize array here…
        ChangeArrayParallel (A, N);
        return 0;
    }

Capture variables by value from the surrounding scope to completely mimic the non-lambda implementation. Note that [&] could be used to capture variables by reference. Using lambda expressions implements MyBody::operator() right inside the call to parallel_for().
Functional parallelism has never been easier

    int main (int argc, char* argv[]) {
        spin_mutex m;
        int a = 1, b = 2;
        parallel_invoke (
            foo,               // a plain function handle: void foo(void)
            [a, b, &m]() {     // void bar(int, int, mutex), invoked via a lambda expression
                bar (a, b, m);
            },
            [&m]() {           // serial thread-safe job wrapped in a lambda expression,
                               // executed in parallel with the three other functions
                for (int i = 0; i < K; ++i) {
                    spin_mutex::scoped_lock l(m);
                    cout << i << endl;
                }
            },
            [&m]() {           // parallel job, also executed in parallel with the other functions
                parallel_for (0, N, 1, [&m](int i) {
                    spin_mutex::scoped_lock l(m);
                    cout << i << " ";
                });
            });
        return 0;
    }

Now imagine writing all this code with just plain threads.

Strongly-typed parallel_pipeline
•Call tbb::parallel_pipeline to run the pipeline stages (filters)
•Create each pipeline stage object with tbb::make_filter<InputDataType, OutputDataType>(mode, body)
•A pipeline stage mode can be serial, parallel, serial_in_order, or serial_out_of_order

    float RootMeanSquare (float* first, float* last) {
        float sum = 0;
        parallel_pipeline (/*max_number_of_tokens=*/16,
            make_filter<void, float*> (        // input: void, output: float*
                filter::serial,
                [&](flow_control& fc) -> float* {
                    if (first < last) {
                        return first++;
                    } else {
                        fc.stop ();            // stop processing
                        return NULL;
                    }
                }
            ) &
            make_filter<float*, float> (       // input: float*, output: float
                filter::parallel,
                [](float* p) { return (*p)*(*p); }
            ) &
            make_filter<float, void> (         // input: float, output: void
                filter::serial,
                [&sum](float x) { sum += x; }
            )
        );
        // sum = (*first)² + (*(first+1))² + … + (*(last-1))², computed in parallel
        return sqrt (sum);
    }

Intel® Threading Building Blocks: Task Scheduler

Task Scheduler
•The task scheduler is the engine driving Intel® Threading Building Blocks
•Manages the thread pool, hiding the complexity of native thread management
•Maps logical tasks to threads
•The parallel algorithms are built on the task scheduler interface
•The task scheduler is designed to address common performance issues of parallel programming with native threads:

 Problem → Intel® TBB approach
 •Oversubscription → one scheduler thread per hardware thread
 •Fair scheduling → non-preemptive unfair scheduling
 •High overhead → the programmer specifies tasks, not threads
 •Load imbalance → work stealing balances the load

A logical task is just a C++ class

    #include "tbb/task_scheduler_init.h"
    #include "tbb/task.h"
    using namespace tbb;

    class ThisIsATask: public task {
    public:
        task* execute () {
            WORK ();
            return NULL;
        }
    };

•Derive from the tbb::task class
•Implement the execute() member function
•Create and spawn a root task and your tasks
•Wait for the tasks to finish

Task Tree Example
(Figure: time runs downward by depth level. Thread 1 spawns a root task that calls wait_for_all() on child1 and child2; Thread 2 steals work. Yellow arrows show the creation sequence, black arrows show task dependencies.)
•Intel® TBB wait calls don’t block the calling thread; they block only the task. An Intel TBB worker thread keeps stealing tasks while waiting.

Intel® Threading Building Blocks: Concurrent Containers

Concurrent Containers
•Intel® TBB provides highly concurrent containers
 −STL containers are not concurrency-friendly: an attempt to modify them concurrently can corrupt the container
 −Wrapping a lock around an STL container turns it into a serial bottleneck and still does not always guarantee thread safety
  >STL containers are inherently not thread-safe
•Intel TBB provides fine-grained locking or lockless implementations
 −Worse single-thread performance, but better scalability
 −Can be used with the library, OpenMP*, or native threads
*Other names and brands may be claimed as the property of others

Concurrent Containers Key Features
−concurrent_hash_map<Key,T,Hasher,Allocator>
 >Models a hash table of std::pair<const Key, T> elements
−concurrent_unordered_map<Key,T,Hasher,Equality,Allocator>
 >Permits concurrent traversal and insertion (no concurrent erasure)
 >Requires no visible locking; looks similar to the STL interfaces
−concurrent_vector<T, Allocator>
 >Dynamically growable array of T: grow_by and grow_to_at_least
−concurrent_queue<T, Allocator>
 >For a single-threaded run, concurrent_queue supports regular first-in-first-out ordering
 >If one thread pushes two values and another thread pops those two values, they come out in the order they were pushed
−concurrent_bounded_queue<T, Allocator>
 >Similar to concurrent_queue, with the difference that it allows specifying a capacity. Once the capacity is reached, push waits until other elements are popped before it can continue.
−concurrent_priority_queue<T, Compare, Allocator>
 >Similar to std::priority_queue, with scalable push and pop operations

Hash-map Examples

Supported concurrent operations:

    Concurrent Ops   TBB concurrent_unordered_map   TBB concurrent_hash_map   STL map
    Traversal        Yes                            No                        No
    Insertion        Yes                            Yes                       No
    Erasure          No                             Yes                       No
    Search           Yes                            Yes                       No

STL map with a global lock:

    #include <map>
    typedef std::map<std::string, int> StringTable;

    for (std::string* p = range.begin (); p != range.end (); ++p) {
        tbb::spin_mutex::scoped_lock lock (global_lock);
        table[*p] += 1;
    }

concurrent_hash_map:

    #include "tbb/concurrent_hash_map.h"
    typedef tbb::concurrent_hash_map<std::string, int> StringTable;

    for (std::string* p = range.begin (); p != range.end (); ++p) {
        StringTable::accessor a;   // local lock
        table.insert (a, *p);
        a->second += 1;
    }

concurrent_unordered_map:

    #include "tbb/concurrent_unordered_map.h"
    typedef tbb::concurrent_unordered_map<std::string, tbb::atomic<int> > StringTable;

    for (std::string* p = range.begin (); p != range.end (); ++p) {
        table[*p] += 1;            // similar to STL, but the value is tbb::atomic<int>
    }

Intel® Threading Building Blocks: Sync Primitives

Synchronization Primitives Features
•Atomic operations
 −High-level abstractions
•Exception-safe locks
 −spin_mutex is VERY FAST in lightly contended situations; use it if you need to protect very few instructions
 −Use queuing_rw_mutex when scalability and fairness are important
 −Use recursive_mutex when your threading model requires that one thread can re-acquire a lock. All locks must be released by that thread before another one can get the lock.
 −Use a reader-writer mutex to allow non-blocking reads for multiple threads
•Portable condition variables

Example: spin_rw_mutex

    #include "tbb/spin_rw_mutex.h"
    using namespace tbb;

    spin_rw_mutex MyMutex;

    int foo () {
        // Construction of ‘lock’ acquires ‘MyMutex’ as a reader
        spin_rw_mutex::scoped_lock lock (MyMutex, /*is_writer*/ false);
        …
        if (!lock.upgrade_to_writer ()) {
            …
        } else {
            …
        }
        return 0;
        // Destructor of ‘lock’ releases ‘MyMutex’
    }

•If an exception occurs within the protected code block, the destructor automatically releases the lock if it is acquired, avoiding a deadlock
•Any reader lock may be upgraded to a writer lock; upgrade_to_writer indicates whether the lock had to be released before it was upgraded

Intel® Threading Building Blocks: Scalable Memory Allocator

Scalable Memory Allocation
•Problem
 −Memory allocation is a bottleneck in a concurrent environment: threads acquire a global lock to allocate/deallocate memory from the global heap
•Solution
 −Intel® Threading Building Blocks provides a tested, tuned, and scalable memory allocator optimized for all object sizes:
  >Manual and automatic replacement of memory-management calls
  >A C++ interface for use as an underlying allocator for C++ objects (e.g. STL containers)
  >Scalable memory pools

Memory API Calls Replacement
•Manual
 −Change your code to call Intel® TBB scalable_malloc/scalable_free instead of malloc and free
 −Use the scalable_* API to implement operators new and delete
 −Use tbb::scalable_allocator<T> as an underlying allocator for C++ objects (e.g. STL containers)
•Automatic (Windows* and Linux*)
 −Requires no code changes; just re-link your binaries against the proxy libraries
  >Linux*: libtbbmalloc_proxy.so.2 or libtbbmalloc_proxy_debug.so.2
  >Windows*: tbbmalloc_proxy.dll or tbbmalloc_debug_proxy.dll

C++ Allocator Template
•Use tbb::scalable_allocator<T> as an underlying allocator for C++ objects
•Example:

    // STL container used with the Intel® TBB scalable allocator
    std::vector<int, tbb::scalable_allocator<int> >;

Scalable Memory Pools

Allocate memory from a pool:

    #include "tbb/memory_pool.h"
    ...
    tbb::memory_pool<std::allocator<char> > my_pool;
    void* my_ptr = my_pool.malloc (10);
    void* my_ptr_2 = my_pool.malloc (20);
    …
    my_pool.recycle ();   // the destructor also frees everything

Allocate and free from a fixed-size buffer:

    #include "tbb/memory_pool.h"
    ...
    char buf[1024*1024];
    tbb::fixed_pool my_pool (buf, 1024*1024);
    void* my_ptr = my_pool.malloc (10);
    my_pool.free (my_ptr);

Scalable Memory Allocator Structure
•scalable_malloc interface, layered over pool_malloc
•Small-object support, including per-core caches
•Large-object support, including a cache
•Backend: free-space acquisition via pool callbacks, system malloc, or mmap/VirtualAlloc

Intel® TBB Memory Allocator Internals
•Small blocks
 −Per-thread memory pools
•Large blocks
 −Treat memory as “objects” of fixed size, not as ranges of address space.
  Typically several dozen (or fewer) object sizes are in active use
 −Keep released memory objects in a pool and reuse them when an object of that size is requested
 −Pooled objects “age” over time; the cleanup threshold varies for different object sizes
 −Low fragmentation is achieved using segregated free lists
The Intel TBB scalable memory allocator is designed for multi-threaded apps and optimized for multi-core.

Summary
Intel® Threading Building Blocks provides generic parallel algorithms, concurrent containers, a work-stealing task scheduler, synchronization primitives, thread-local storage, the TBB flow graph, and a scalable memory allocator.

Supplementary Links
•Commercial product web page: www.intel.com/software/products/tbb
•Open-source web portal: www.threadingbuildingblocks.org
•Knowledge base, blogs, and user forums:
 http://software.intel.com/en-us/articles/intel-threading-building-blocks/all/1
 http://software.intel.com/en-us/blogs/category/osstbb/
 http://software.intel.com/en-us/forums/intel-threading-building-blocks
•Technical articles:
 −“Demystify Scalable Parallelism with Intel Threading Building Block’s Generic Parallel Algorithms” http://www.devx.com/cplus/Article/32935
 −“Enable Safe, Scalable Parallelism with Intel Threading Building Block's Concurrent Containers” http://www.devx.com/cplus/Article/33334
•Industry articles:
 −Product Review: Intel Threading Building Blocks http://www.devx.com/go-parallel/Article/33270
 −“The Concurrency Revolution”, Herb Sutter, Dr. Dobb’s, 1/19/2005 http://www.ddj.com/dept/cpp/184401916