MKL, Performance Primitives, Threading Building Blocks

Performance Libraries
Martyn Corden
Developer Products Division
Software & Services Group
Intel Corporation
June 2010
Software & Services Group, Developer Products Division
Copyright © 2010, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Agenda
• Intel® Math Kernel Library (MKL)
• Intel® Integrated Performance Primitives (IPP)
• Intel® Threading Building Blocks
Intel® Math Kernel Library (MKL)
Contents:
• BLAS (vector & matrix computation routines)
– BLAS for sparse vectors/matrices
• LAPACK (linear algebra)
– Solvers and eigensolvers; many hundreds of routines in total
– Cluster implementation (ScaLAPACK)
• DFTs (general FFTs)
– Mixed-radix, multi-dimensional transforms
– Cluster implementation
• Sparse solvers (PARDISO, DSS and ISS)
– Out-of-core (OOC) version for huge problem sizes
• Vector Math Library (vectorized transcendental functions)
• Vector Statistical Library (random number generators)
• Optimization solvers (non-linear least squares, …)
• PDE solvers
Intel® Math Kernel Library:
a simple way to thread your application
• Many components of MKL have threaded versions
– Based on the compiler’s OpenMP runtime library
• Link threaded or non-threaded interface
– libmkl_intel_thread.a or libmkl_sequential.a
– Use the link line advisor at
http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/
– Or use -mkl with the Intel compiler
• Set the number of threads
– export MKL_NUM_THREADS or OMP_NUM_THREADS
– Call mkl_set_num_threads or omp_set_num_threads
• Optimized for different processor families
– Loads appropriate version at runtime
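Putting the bullets above together, a hypothetical build-and-run session on Linux might look like the following (the exact link line for other configurations should come from the link line advisor):

```shell
# Compile and link against threaded MKL via the Intel compiler's -mkl shortcut
ifort -mkl myprog.f -o myprog

# Request four MKL threads; MKL_NUM_THREADS takes precedence over
# OMP_NUM_THREADS inside MKL
export MKL_NUM_THREADS=4
./myprog
```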
Intel® MKL Domains and Parallelism
Where's the Parallelism?

| Domain                               | SIMD | OpenMP                | MPI        |
| BLAS 1, 2, 3                         |  X   |  X                    |            |
| FFTs                                 |  X   |  X                    |            |
| LAPACK (dense LA solvers)            |  X   |  X                    |            |
| PARDISO (sparse solver)              |      |  X (relies on BLAS 3) |            |
| VML/VSL                              |  X   |  X                    |            |
| ScaLAPACK (cluster dense LA solvers) |      |  X                    | X (hybrid) |
| Cluster FFT                          |      |                       | X          |
Intel® Integrated Performance Primitives
(Intel® IPP)
A collection of highly optimized functions for
Multimedia, Data Processing, Communications and
Embedded Applications
– “Signal Processing” in its broadest sense
Mainly for C and C++ programmers, but an API is now available for Fortran too.
Optimized for the latest Intel® multi-core processors.
•Video coding
•Audio coding
•Speech coding
•Speech recognition
•Data compression
•Cryptography
•Matrix maths
•Signal processing
•Image processing
•JPEG and JPEG2000
•Computer vision
•Image color conversion
•String processing
•Vector maths
•Realistic Rendering
Intel® Threading Building Blocks
Extend C++ for parallelism
Highlights
• A C++ runtime library that does thread management, letting developers focus on proven parallel patterns
• Scales appropriately to the number of HW threads available
• Supports nested parallelism
• The thread library API is portable across Linux, Windows, and Mac OS* platforms; the open-source community has extended support to FreeBSD*, IA Solaris* and XBox* 360
• The run-time library provides an optimally sized thread pool, task granularity, and performance-oriented scheduling
• Automatic load balancing through task stealing
• Cache efficiency and memory reuse
• Committed to compiler independence, processor independence, and OS independence
Both GPL and commercial licenses are available.
http://threadingbuildingblocks.org
Intel® Threading Building Blocks 3.0

Generic Parallel Algorithms
parallel_for(range), parallel_reduce, parallel_for_each(begin, end), parallel_do, parallel_invoke, pipeline, parallel_pipeline, parallel_sort, parallel_scan

Concurrent Containers
concurrent_hash_map, concurrent_queue, concurrent_bounded_queue, concurrent_vector, concurrent_unordered_map

Task scheduler
task_group, structured_task_group, task_scheduler_init, task_scheduler_observer

Threads
tbb_thread, thread

Thread Local Storage
enumerable_thread_specific, combinable

Synchronization Primitives
atomic; mutex; recursive_mutex;
spin_mutex; spin_rw_mutex;
queuing_mutex; queuing_rw_mutex;
reader_writer_lock; critical_section;
condition_variable;
lock_guard; unique_lock;
null_mutex; null_rw_mutex

Memory Allocation
tbb_allocator; cache_aligned_allocator; scalable_allocator; zero_allocator

Miscellaneous
tick_count
Questions?
Further Information
• http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/
• http://software.intel.com/en-us/articles/tips-for-debugging-run-time-failures-in-intel-fortran-applications/
• Intel® Debugger for Linux* (IDB):
http://software.intel.com/en-us/articles/idb-linux/
• http://software.intel.com/en-us/intel-hpc-home
• http://software.intel.com/en-us/articles/intel-compiler-professional-editions-white-papers/
• The Intel® C++ and Fortran Compiler User and Reference Guides:
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/index.htm or
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm
• The User Forums and Knowledge Base:
http://software.intel.com/en-us/forums
http://software.intel.com/en-us/articles/tools
Summary
• A comprehensive set of tools for multi-core and cluster parallelism from Intel for the x86 architecture
– Best performance on Intel architecture, and competitive performance on AMD systems
– Intel tools can be used to standardize x86 C++/Fortran development
• Our focus is on
– Best performance
– Comprehensive coverage of parallelism
– Ease of use
– Compatibility and software investment protection
Visit http://intel.com/software/products
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED,
BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR
WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR
INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components
and reflect the approximate performance of Intel products as measured by those tests. Any
difference in system hardware or software design or configuration may affect actual performance.
Buyers should consult other sources of information to evaluate the performance of systems or
components they are considering purchasing. For more information on performance tests and on
the performance of Intel products, reference www.intel.com/software/products.
Intel, the Intel logo, Itanium, Pentium, Intel Xeon, Intel Core, Intel Centrino and VTune are
trademarks or registered trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2010. Intel Corporation.
http://intel.com/software/products
Linking with Intel® MKL (contd.)
Layered model approach for better control:
– Interface layer: compiler (Intel / GNU); LP64 / ILP64 interfaces
– Threading layer: threaded (Intel or alternate OpenMP) or sequential
– Computational layer
– Run-time layer
Choose the library from each layer for linking.

Ex 1: Static linking using the Intel® Fortran compiler, BLAS, Intel® 64 processor on Linux:
$ ifort myprog.f libmkl_intel_lp64.a libmkl_intel_thread.a libmkl_core.a libiomp5.so

Ex 2: Dynamic linking with the Intel® C++ compiler on Windows:
c:\> icl mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.dll

Note: it is strongly recommended to link the run-time layer library dynamically.
Intel® MKL Threading
• There are numerous opportunities for threading:
– Level 3 BLAS ( O(n³) )
– LAPACK* ( O(n³) )
– FFTs ( O(n log n) )
– VML, VSL: depends on processor and function
• Some routines are not threaded because:
– The limiting resource is memory bandwidth
– Threading level 1 and level 2 BLAS is mostly ineffective ( O(n) )
• Threaded using OpenMP*
– With support for GCC* and Microsoft* OpenMP*
• ScaLAPACK and Cluster FFT are SMP parallel
• All of Intel® MKL is thread-safe
Threading Control in Intel® MKL
• Set an OpenMP or Intel MKL environment variable:
OMP_NUM_THREADS
MKL_NUM_THREADS
MKL_DOMAIN_NUM_THREADS
• Or call the OpenMP or Intel MKL functions:
omp_set_num_threads()
mkl_set_num_threads()
mkl_domain_set_num_threads()
• MKL_DYNAMIC / mkl_set_dynamic(): Intel® MKL decides the number of threads.
• Example: you could configure Intel MKL to run 4 threads for BLAS, but sequentially in all other parts of the library
– Environment variable:
set MKL_DOMAIN_NUM_THREADS="MKL_ALL=1, MKL_BLAS=4"
– Function calls:
mkl_domain_set_num_threads( 1, MKL_ALL );
mkl_domain_set_num_threads( 4, MKL_BLAS );
Check Intel® TBB online
www.threadingbuildingblocks.org
• Open Source license information
• Downloads
• Active users forum, developers' blogs, documentation
• Code samples, FAQ
• News and announcements
What's New in TBB 3.0
• Extended Compatibility
– Added support for Microsoft* Visual Studio* 2010
– Extended C++0x features support
– Added Microsoft* Parallel Patterns Library*-compatible classes
– Added support for Apple* Snow Leopard*
• Improved Composability and Enhanced Task Scheduler Features
– Fire-and-forget tasks for queue-like work
– Independent task scheduling for foreign threads for improved responsiveness
– Simplified management of task_group_context: it can now be created and destroyed by different threads
• New Parallel Pipeline
– The elegant new parallel_pipeline function provides a strongly-typed, lambda-friendly pipeline interface
• New Concurrent Container
– New concurrent_unordered_map, an associative container that permits concurrent insertion and traversal with no visible locking (similar to C++0x std::unordered_map)
• New Synchronization Primitives
– C++0x-based std::lock_guard, std::unique_lock, and most of std::condition_variable
– Microsoft* Parallel Patterns Library*-compatible critical_section and reader_writer_lock
• Improved Performance
– Faster thread-local storage (enumerable_thread_specific and combinable)
– The scalable memory allocator is optimized for large block allocations
Parallel Algorithm Usage Example

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

// ChangeArray defines a for-loop body for parallel_for.
// As usual with C++ function objects, the main work
// is done inside operator().
class ChangeArray {
    int* array;
public:
    ChangeArray (int* a) : array(a) {}
    void operator()( const blocked_range<int>& r ) const {
        for (int i = r.begin(); i != r.end(); i++ ) {
            Foo (array[i]);   // Foo is assumed to be defined elsewhere
        }
    }
};

void ChangeArrayParallel (int* a, int n)
{
    // A call to the template function parallel_for<Range, Body> with
    // Range = blocked_range (a TBB template representing a 1D iteration
    // space) and Body = ChangeArray.
    parallel_for (blocked_range<int>(0, n), ChangeArray(a));
}

int main () {
    int A[N];   // N is assumed to be defined elsewhere
    // initialize array here…
    ChangeArrayParallel (A, N);
    return 0;
}
C++0x Lambda Expression Support

With lambdas, the parallel_for example transforms into:

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

void ChangeArrayParallel (int* a, int n)
{
    // parallel_for has an overload that takes start, stop and step
    // arguments and constructs the blocked_range internally.
    parallel_for (0, n, 1,
        [=](int i) {
            Foo (a[i]);
        });
}

int main () {
    int A[N];
    // initialize array here…
    ChangeArrayParallel (A, N);
    return 0;
}

Capturing variables by value ([=]) from the surrounding scope completely mimics the non-lambda implementation; note that [&] could be used to capture variables by reference instead. The lambda expression implements what was MyBody::operator() right inside
the call to parallel_for().
Functional parallelism has never been easier

#include <iostream>
#include "tbb/parallel_for.h"
#include "tbb/parallel_invoke.h"
#include "tbb/spin_mutex.h"
using namespace std;
using namespace tbb;

// An already existing thread-safe function a user
// would like to be executed in parallel.
void foo() {
}

void bar(int a, int b, spin_mutex& m) {
    int c = a + b;
    spin_mutex::scoped_lock l(m);
    cout << c << endl;
}

int main(int argc, char* argv[]) {
    spin_mutex m;
    int a = 1, b = 2;
    parallel_invoke(
        foo,              // existing void (*)() passed directly
        [a, b, &m]() {    // lambda calling bar(int, int, spin_mutex&)
            bar(a, b, m);
        },
        [&m]() {          // serial thread-safe job wrapped in a lambda,
                          // executed in parallel with the other three
            for (int i = 0; i < K; ++i) {   // K assumed defined elsewhere
                spin_mutex::scoped_lock l(m);
                cout << i << endl;
            }
        },
        [&m]() {          // parallel job, itself also executed in
                          // parallel with the other functions
            parallel_for( 0, N, 1,          // N assumed defined elsewhere
                [&m](int i) {
                    spin_mutex::scoped_lock l(m);
                    cout << i << " ";
                });
        });
    return 0;
}

Now imagine writing all this code with just plain threads.