TBB 2.1 external presentation

Intel® Threading Building Blocks
Agenda
•Overview
•Intel® Threading Building Blocks
− Parallel Algorithms
− Task Scheduler
− Concurrent Containers
− Sync Primitives
− Memory Allocator
•Summary
Intel and the Intel logo are trademarks of Intel
Corporation in the United States and other countries
Software and Services Group
Multi-Core is Mainstream
•Gaining performance from multi-core requires parallel
programming
•Multi-threading is used to:
• Reduce or hide latency
• Increase throughput
Going Parallel
Typical Serial C++ Program | Ideal Parallel C++ Program | Issues
Algorithms | Parallel algorithms | Require many code changes when developed from scratch; it often takes a threading expert to get them right
Data structures | Thread-safe and scalable data structures | Serial data structures usually require global locks to make operations thread-safe
Dependencies | Minimum of dependencies; efficient use of synchronization primitives | Too many dependencies → expensive synchronization → poor parallel performance
Memory management | Scalable memory manager | The standard memory allocator is often inefficient in multi-threaded applications
Intel® Threading Building Blocks
• Generic Parallel Algorithms
− Efficient, scalable way to exploit the power of multi-core without having to start from scratch
• Concurrent Containers
− Common idioms for concurrent access: a scalable alternative to a serial container with a lock around it
• TBB Flow Graph – New!
• Thread Local Storage
− Scalable implementation of thread-local data that supports an unlimited number of thread-local variables
• Task Scheduler
− The engine that powers the parallel algorithms; it employs task stealing to maximize concurrency
• Threads
− OS API wrappers
• Miscellaneous
− Thread-safe timers
• Synchronization Primitives
− User-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Memory Allocation
− Per-thread scalable memory manager and false-sharing-free allocators
Intel® Threading Building Blocks
Extend C++ for parallelism
• Portable C++ runtime library that does thread
management, letting developers focus on proven
parallel patterns
• Scalable
• Composable
• Flexible
• Portable
Both GPL and commercial licenses are available.
http://threadingbuildingblocks.org
*Other names and brands may be claimed as the property of others
Intel® Threading Building Blocks
Parallel Algorithms
Generic Parallel Algorithms
• Loop parallelization
parallel_for, parallel_reduce, parallel_scan
>Load-balanced parallel execution of a fixed number of independent loop iterations
• Parallel algorithms for streams
parallel_do, parallel_for_each, pipeline / parallel_pipeline
>Use for an unstructured stream or pile of work
• Parallel function invocation
parallel_invoke
>Parallel execution of a number of user-specified functions
• Parallel sort
parallel_sort
>Comparison sort with an average time complexity of O(N log N)
Parallel Algorithm Usage Example
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

// ChangeArray class defines a for-loop body for parallel_for
class ChangeArray {
    int* array;
public:
    ChangeArray (int* a): array(a) {}
    // blocked_range: TBB template representing a 1D iteration space.
    // As usual with C++ function objects, the main work is done inside operator()
    void operator()( const blocked_range<int>& r ) const {
        for (int i=r.begin(); i!=r.end(); i++ ) {
            Foo (array[i]);
        }
    }
};

void ChangeArrayParallel (int* a, int n )
{
    // A call to the template function parallel_for<Range, Body>
    // with arguments Range → blocked_range, Body → ChangeArray
    parallel_for (blocked_range<int>(0, n), ChangeArray(a));
}

int main (){
    int A[N];
    // initialize array here…
    ChangeArrayParallel (A, N);
    return 0;
}
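parallel_for(Range(Data), Body(), Partitioner());
[Figure: the range is split recursively into subranges]
[Data, Data+N)
[Data, Data+N/2) | [Data+N/2, Data+N)
[Data, Data+N/k)
[Data, Data+GrainSize)
Subranges that are split off but not yet processed are tasks available to thieves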
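The recursive splitting in the figure can be sketched in plain standard C++. This is only an illustration under assumptions, not TBB's implementation: IndexRange, split_range and leaf_ranges are hypothetical names, the split is always in half, and the leaf ranges are merely collected rather than run as stealable tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical half-open index range [begin, end); not a TBB type.
struct IndexRange {
    std::size_t begin;
    std::size_t end;
    std::size_t size() const { return end - begin; }
};

// Halve the range until each piece holds at most `grain` iterations.
// A real scheduler would turn each half into a task that other threads
// may steal; this sketch only records the leaves from left to right.
void split_range(IndexRange r, std::size_t grain, std::vector<IndexRange>& leaves) {
    assert(grain > 0);
    if (r.size() <= grain) {
        leaves.push_back(r);
        return;
    }
    std::size_t mid = r.begin + r.size() / 2;
    split_range({r.begin, mid}, grain, leaves);
    split_range({mid, r.end}, grain, leaves);
}

// Convenience wrapper: leaf ranges for the iteration space [0, n).
std::vector<IndexRange> leaf_ranges(std::size_t n, std::size_t grain) {
    std::vector<IndexRange> leaves;
    if (n > 0) split_range({0, n}, grain, leaves);
    return leaves;
}
```

With n = 16 and a grain size of 4, the sketch produces four leaves: [0,4), [4,8), [8,12) and [12,16).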
Two Execution Orders
Depth First (stack) | Breadth First (queue)
Small space | Large space
Excellent cache locality | Poor cache locality
No parallelism | Maximum parallelism
Work Depth First; Steal Breadth First
[Figure: tasks of the victim thread, annotated with its L1 and L2 caches]
• Best choice for theft!
− big piece of work
− data far from victim's hot data
• Second best choice (labeled in the figure)
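A minimal sketch of the "work depth first; steal breadth first" policy, assuming one double-ended queue per thread protected by a single mutex. TaskDeque and its member names are hypothetical, tasks are reduced to integer ids, and TBB's actual scheduler is considerably more elaborate.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical per-thread task pool; tasks are represented by integer ids.
class TaskDeque {
public:
    // The owner pushes newly spawned tasks at the back.
    void push(int task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(task);
    }
    // Owner side: take the newest task (depth first, data still hot in cache).
    bool pop_newest(int& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    // Thief side: take the oldest task (breadth first, typically a big piece
    // of work whose data is far from the victim's hot data).
    bool steal_oldest(int& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = tasks_.front();
        tasks_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::deque<int> tasks_;
};

// Walk-through: the owner spawns tasks 1, 2, 3 (1 is the oldest).
void demo_order() {
    TaskDeque d;
    d.push(1); d.push(2); d.push(3);
    int t = 0;
    bool ok = d.steal_oldest(t);   // a thief gets task 1
    assert(ok && t == 1);
    ok = d.pop_newest(t);          // the owner gets task 3
    assert(ok && t == 3);
    (void)ok;
}
```

The design point is that the two ends of the deque serve different goals: the owner's end favors locality, the thief's end favors large, independent work.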
C++0x Lambda Expression Support
The parallel_for example will transform into:
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

void ChangeArrayParallel (int* a, int n )
{
    parallel_for (0, n, 1,
        [=](int i) {
            Foo (a[i]);
        });
}

int main (){
    int A[N];
    // initialize array here…
    ChangeArrayParallel (A, N);
    return 0;
}
• parallel_for has an overload that takes start, stop and step arguments and constructs blocked_range internally.
• [=] captures variables by value from the surrounding scope to completely mimic the non-lambda implementation. Note that [&] could be used to capture variables by reference.
• The lambda expression implements MyBody::operator() right inside the call to parallel_for().
Functional parallelism has never been easier
int main(int argc, char* argv[]) {
    spin_mutex m;
    int a = 1, b = 2;
    parallel_invoke(
        foo,
        // void function_handle(void) calling void bar(int, int, mutex),
        // implemented using a lambda expression
        [a, b, &m](){
            bar(a, b, m);
        },
        // Serial thread-safe job, wrapped in a lambda expression, that is
        // executed in parallel with the three other functions
        [&m](){
            for(int i = 0; i < K; ++i) {
                spin_mutex::scoped_lock l(m);
                cout << i << endl;
            }
        },
        // Parallel job, which is also executed in parallel with the other functions
        [&m](){
            parallel_for( 0, N, 1,
                [&m](int i) {
                    spin_mutex::scoped_lock l(m);
                    cout << i << " ";
                });
        });
    return 0;
}
Now imagine writing all this code with just plain threads
Strongly-typed parallel_pipeline
float RootMeanSquare( float* first, float* last ) {
    float sum=0;
    // Call function tbb::parallel_pipeline to run pipeline stages (filters)
    parallel_pipeline( /*max_number_of_tokens=*/16,
        // Create a pipeline stage object with
        // tbb::make_filter<InputDataType, OutputDataType>(mode, body).
        // Pipeline stage mode can be serial, parallel, serial_in_order,
        // or serial_out_of_order.
        // Stage 1: input void, output float* (get new float)
        make_filter<void,float*>(
            filter::serial,
            [&](flow_control& fc)-> float*{
                if( first<last ) {
                    return first++;
                } else {
                    fc.stop(); // stop processing
                    return NULL;
                }
            }
        ) &
        // Stage 2: input float*, output float (float*float)
        make_filter<float*,float>(
            filter::parallel,
            [](float* p){return (*p)*(*p);}
        ) &
        // Stage 3: input float, output void (sum += float^2)
        make_filter<float,void>(
            filter::serial,
            [&sum](float x) {sum+=x;}
        )
    );
    /* sum = first[0]^2 + first[1]^2 + … + last[-1]^2
       computed in parallel */
    return sqrt(sum);
}
Intel® Threading Building Blocks
Task Scheduler
Task Scheduler
• Task scheduler is the engine driving Intel® Threading Building Blocks
• Manages thread pool, hiding complexity of native thread management
• Maps logical tasks to threads
• Parallel algorithms are based on task scheduler interface
• Task scheduler is designed to address common performance issues of
parallel programming with native threads
Problem | Intel® TBB Approach
Oversubscription | One scheduler thread per hardware thread
Fair scheduling | Non-preemptive unfair scheduling
High overhead | Programmer specifies tasks, not threads
Load imbalance | Work-stealing balances load
Logical task – it is just a C++ class
#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"
using namespace tbb;

class ThisIsATask: public task {
public:
    task* execute () {
        WORK ();
        return NULL;
    }
};
• Derive from tbb::task class
• Implement execute() member function
• Create and spawn root task and your tasks
• Wait for tasks to finish
Task Tree Example
[Figure: task tree plotted over time and depth level for Thread 1 and Thread 2, showing the root task, its wait_for_all() call, child1 and child2. Yellow arrows: creation sequence. Black arrows: task dependency.]
Intel® TBB wait calls don't block the calling thread! They block the task, however. The Intel TBB worker thread keeps stealing tasks while waiting.
Intel® Threading Building Blocks
Concurrent Containers
Concurrent Containers
• Intel® TBB provides highly concurrent containers
− STL containers are not concurrency-friendly: attempts to modify them concurrently can corrupt the container
− Wrapping a lock around an STL container turns it into a serial bottleneck and still does not always guarantee thread safety
> STL containers are inherently not thread-safe
• Intel TBB provides fine-grained locking or lockless
implementations
− Worse single-thread performance, but better scalability.
− Can be used with the library, OpenMP*, or native threads.
*Other names and brands may be claimed as the property of others
Concurrent Containers Key Features
− concurrent_hash_map <Key,T,Hasher,Allocator>
>Models a hash table of std::pair <const Key, T> elements
− concurrent_unordered_map<Key,T,Hasher,Equality,Allocator>
>Permits concurrent traversal and insertion (no concurrent erasure)
>Requires no visible locking; looks similar to STL interfaces
− concurrent_vector <T, Allocator>
>Dynamically growable array of T: grow_by and grow_to_at_least
− concurrent_queue <T, Allocator>
>In a single-threaded run, concurrent_queue provides regular "first-in-first-out" ordering
>If one thread pushes two values and another thread pops those two values, they come out in the order in which they were pushed
− concurrent_bounded_queue <T, Allocator>
>Similar to concurrent_queue, except that it allows specifying a capacity. Once the capacity is reached, 'push' waits until other elements are popped before it can continue.
− concurrent_priority_queue <T, Compare, Allocator>
>Similar to std::priority_queue, with scalable pop and push operations
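The capacity behavior described for concurrent_bounded_queue can be illustrated with a small standard C++ sketch. This is not TBB's implementation: BoundedQueue is a hypothetical class built on one mutex and two condition variables, which is exactly the coarse-grained design that the TBB containers aim to improve on.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical blocking FIFO queue with a fixed capacity.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    // Blocks while the queue is full, then appends the value.
    void push(const T& value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(value);
        not_empty_.notify_one();
    }
    // Non-blocking variant: returns false instead of waiting when full.
    bool try_push(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= capacity_) return false;
        items_.push(value);
        not_empty_.notify_one();
        return true;
    }
    // Blocks while the queue is empty, then removes the oldest value.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T value = items_.front();
        items_.pop();
        not_full_.notify_one();
        return value;
    }
private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::queue<T> items_;
};
```

A producer calling push on a full queue sleeps until a consumer pops an element, which gives simple back-pressure between pipeline stages.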
Hash-map Examples
#include <map>
typedef std::map<std::string, int> StringTable;
for (std::string* p=range.begin(); p!=range.end(); ++p)
{
    tbb::spin_mutex::scoped_lock lock( global_lock );
    table[*p] += 1;
}

Concurrent Ops | TBB concurrent_unordered_map (cumap) | TBB concurrent_hash_map (chmap) | STL map
Traversal | Yes | No | No
Insertion | Yes | Yes | No
Erasure | No | Yes | No
Search | Yes | Yes | No

#include "tbb/concurrent_hash_map.h"
typedef tbb::concurrent_hash_map<std::string,int> StringTable;
for (std::string* p=range.begin(); p!=range.end(); ++p) {
    StringTable::accessor a; // local lock
    table.insert( a, *p );
    a->second += 1;
}

#include "tbb/concurrent_unordered_map.h"
typedef tbb::concurrent_unordered_map<std::string,tbb::atomic<int> > StringTable;
for (std::string* p=range.begin(); p!=range.end(); ++p) {
    table[*p] += 1; // similar to STL but value is tbb::atomic<int>
}
Intel® Threading Building Blocks
Sync Primitives
Synchronization Primitives Features
•Atomic operations
−High-level abstractions
•Exception-safe locks
−spin_mutex is very fast in lightly contended situations; use it if you need to protect very few instructions
−Use queuing_rw_mutex when scalability and fairness are important
−Use recursive_mutex when your threading model requires that one thread can re-acquire a lock. That thread must release every acquisition before another thread can get the lock.
−Use a reader-writer mutex to allow non-blocking reads by multiple threads
•Portable condition variables
Example: spin_rw_mutex
#include "tbb/spin_rw_mutex.h"
using namespace tbb;

spin_rw_mutex MyMutex;

int foo (){
    // Construction of 'lock' acquires 'MyMutex'
    spin_rw_mutex::scoped_lock lock (MyMutex, /*is_writer*/ false);
    …
    if (!lock.upgrade_to_writer ()) { … }
    else { … }
    return 0;
    // Destructor of 'lock' releases 'MyMutex'
}
•If an exception occurs within the protected code block, the destructor automatically releases the lock if it is acquired, avoiding a deadlock
•Any reader lock may be upgraded to a writer lock; the return value of upgrade_to_writer indicates whether the lock had to be released before it was upgraded (false means it was released and re-acquired)
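The exception-safety point can be demonstrated with a scoped-lock sketch in standard C++. SpinMutex below is a hypothetical stand-in written for this illustration (it only mimics the nested scoped_lock pattern and has no reader-writer or upgrade support); guarded_increment and released_after_throw are likewise assumed helper names.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical spin lock with a nested RAII guard; not a TBB class.
class SpinMutex {
public:
    class scoped_lock {
    public:
        // Constructor acquires the mutex by spinning until the flag is free.
        explicit scoped_lock(SpinMutex& m) : mutex_(m) {
            while (mutex_.flag_.test_and_set(std::memory_order_acquire)) { /* spin */ }
        }
        // Destructor releases it, including during stack unwinding.
        ~scoped_lock() { mutex_.flag_.clear(std::memory_order_release); }
        scoped_lock(const scoped_lock&) = delete;
        scoped_lock& operator=(const scoped_lock&) = delete;
    private:
        SpinMutex& mutex_;
    };
    bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }
    void unlock() { flag_.clear(std::memory_order_release); }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Increments the counter under the lock and optionally throws while holding it.
int guarded_increment(SpinMutex& m, int& counter, bool fail) {
    SpinMutex::scoped_lock lock(m);
    ++counter;
    if (fail) throw std::runtime_error("simulated failure");
    return counter;
}

// Returns true if the mutex is free again after guarded_increment() throws.
bool released_after_throw() {
    SpinMutex m;
    int counter = 0;
    try { guarded_increment(m, counter, true); } catch (const std::runtime_error&) {}
    assert(counter == 1);
    bool free_again = m.try_lock();
    if (free_again) m.unlock();
    return free_again;
}
```

Because the release lives in the guard's destructor, no code path (normal return or exception) can leave the mutex held.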
Intel® Threading Building Blocks
Scalable Memory Allocator
Scalable Memory Allocation
• Problem
− Memory allocation is a bottleneck in a concurrent environment
> Threads acquire a global lock to allocate/deallocate memory from the global heap
• Solution
− Intel® Threading Building Blocks provides a tested, tuned, and scalable memory allocator optimized for all object sizes:
> Manual and automatic replacement of memory management calls
> C++ interface to use it with C++ objects as an underlying allocator (e.g. STL containers)
> Scalable memory pools
Memory API Calls Replacement
•Manual
−Change your code to call Intel® TBB scalable_malloc/scalable_free instead of malloc and free
−Use the scalable_* API to implement operators new and delete
−Use tbb::scalable_allocator<T> as an underlying allocator for C++ objects (e.g. STL containers)
•Automatic (Windows* and Linux*)
−Requires no code changes; just re-link your binaries using the proxy libraries
Linux*: libtbbmalloc_proxy.so.2 or libtbbmalloc_proxy_debug.so.2
Windows*: tbbmalloc_proxy.dll or tbbmalloc_proxy_debug.dll
C++ Allocator Template
•Use tbb::scalable_allocator<T> as an underlying allocator for
C++ objects
•Example:
// STL container used with Intel® TBB scalable allocator
std::vector<int, tbb::scalable_allocator<int> > v;
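To show what "underlying allocator" means mechanically, here is a sketch of how any allocator template plugs into an STL container. CountingAllocator, g_allocations and allocations_for_reserve are hypothetical names used only for illustration; the allocator simply counts calls and forwards to operator new, whereas tbb::scalable_allocator would forward to the scalable memory manager.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical global counter of allocate() calls.
static std::size_t g_allocations = 0;

// Minimal C++11 allocator: the container calls allocate/deallocate
// instead of using operator new/delete directly.
template <typename T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}
    T* allocate(std::size_t n) {
        ++g_allocations;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

// Number of allocate() calls made while a vector reserves room for n ints.
std::size_t allocations_for_reserve(std::size_t n) {
    assert(n > 0);
    std::size_t before = g_allocations;
    std::vector<int, CountingAllocator<int> > v;
    v.reserve(n);
    v.push_back(42);  // fits in the reserved storage
    return g_allocations - before;
}
```

Swapping the second template argument is the only change the container declaration needs, which is why the allocator template is the least intrusive of the manual replacement options.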
Scalable Memory Pools
// Allocate memory from the pool
#include "tbb/memory_pool.h"
...
tbb::memory_pool<std::allocator<char> > my_pool;
void* my_ptr = my_pool.malloc(10);
void* my_ptr_2 = my_pool.malloc(20);
…
my_pool.recycle();
// destructor also frees everything

// Allocate and free from a fixed size buffer
#include "tbb/memory_pool.h"
...
char buf[1024*1024];
tbb::fixed_pool my_pool(buf, 1024*1024);
void* my_ptr = my_pool.malloc(10);
my_pool.free(my_ptr);
Scalable Memory Allocator Structure
[Figure: allocator layers, top to bottom]
• scalable_malloc interface
• pool_malloc layer
− small object support, incl. per-core caches
− large object support, incl. cache
• backend
− free space acquisition
• pool callbacks
• system malloc, mmap/VirtualAlloc
Intel® TBB Memory Allocator Internals
•Small blocks
−Per-thread memory pools
•Large blocks
−Treat memory as "objects" of fixed size, not as ranges of address space
>Typically several dozen (or fewer) object sizes are in active use
−Keep released memory objects in a pool and reuse them when an object of that size is requested
−Pooled objects "age" over time
>Cleanup threshold varies for different object sizes
−Low fragmentation is achieved using segregated free lists
Intel TBB scalable memory allocator is designed for multi-threaded apps and optimized for multi-core
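The idea of keeping released objects in per-size pools can be sketched as follows. This is a deliberately simplified, single-threaded illustration under assumptions: BlockCache is a hypothetical class, blocks are keyed by their exact size, and the aging and cleanup thresholds mentioned above are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical cache with one free list per object size (segregated free lists).
class BlockCache {
public:
    BlockCache() = default;
    BlockCache(const BlockCache&) = delete;
    BlockCache& operator=(const BlockCache&) = delete;
    // Return all cached blocks to the system allocator.
    ~BlockCache() {
        for (auto& entry : free_lists_)
            for (void* p : entry.second) std::free(p);
    }
    // Reuse a released block of this size if one is cached, else ask malloc.
    void* allocate(std::size_t size) {
        assert(size > 0);
        std::vector<void*>& list = free_lists_[size];
        if (!list.empty()) {
            void* p = list.back();
            list.pop_back();
            return p;
        }
        return std::malloc(size);
    }
    // Keep the block for later reuse instead of freeing it immediately.
    void release(void* p, std::size_t size) {
        free_lists_[size].push_back(p);
    }
    // Number of cached blocks of the given size.
    std::size_t cached(std::size_t size) const {
        auto it = free_lists_.find(size);
        return it == free_lists_.end() ? 0 : it->second.size();
    }
private:
    std::map<std::size_t, std::vector<void*> > free_lists_;
};
```

Because a request is only ever served from the list of its own size, a freed block is never carved up, which is one way to keep fragmentation low.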
Intel® Threading Building Blocks
• Generic Parallel Algorithms
− Efficient, scalable way to exploit the power of multi-core without having to start from scratch
• Concurrent Containers
− Common idioms for concurrent access: a scalable alternative to a serial container with a lock around it
• TBB Graph
• Thread Local Storage
− Scalable implementation of thread-local data that supports an unlimited number of thread-local variables
• Task Scheduler
− The engine that powers the parallel algorithms; it employs task stealing to maximize concurrency
• Threads
− OS API wrappers
• Miscellaneous
− Thread-safe timers
• Synchronization Primitives
− User-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
• Memory Allocation
− Per-thread scalable memory manager and false-sharing-free allocators
Supplementary Links
• Commercial Product Web Page
www.intel.com/software/products/tbb
• Open Source Web Portal
www.threadingbuildingblocks.org
• Knowledge Base, Blogs and User Forums
http://software.intel.com/en-us/articles/intel-threading-building-blocks/all/1
http://software.intel.com/en-us/blogs/category/osstbb/
http://software.intel.com/en-us/forums/intel-threading-building-blocks
• Technical Articles:
− “Demystify Scalable Parallelism with Intel Threading Building Block’s Generic Parallel Algorithms”
http://www.devx.com/cplus/Article/32935
− “Enable Safe, Scalable Parallelism with Intel Threading Building Block's Concurrent Containers”
http://www.devx.com/cplus/Article/33334
• Industry Articles:
− Product Review: Intel Threading Building Blocks
http://www.devx.com/go-parallel/Article/33270
− “The Concurrency Revolution”, Herb Sutter, Dr. Dobb’s 1/19/2005
http://www.ddj.com/dept/cpp/184401916