Performance Libraries
Martyn Corden
Developer Products Division, Software & Services Group
Intel Corporation
June 2010

Copyright © 2010, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.

Agenda
• Intel® Math Kernel Library (MKL)
• Intel® Integrated Performance Primitives (IPP)
• Intel® Threading Building Blocks (TBB)

Intel® Math Kernel Library (MKL) contents
• BLAS (vector & matrix computation routines)
  – BLAS for sparse vectors/matrices
• LAPACK (linear algebra)
  – Solvers and eigensolvers; many hundreds of routines in total
  – Cluster implementation (ScaLAPACK)
• DFTs (general FFTs)
  – Mixed-radix, multi-dimensional transforms
  – Cluster implementation
• Sparse solvers (PARDISO, DSS and ISS)
  – Out-of-core (OOC) version for huge problem sizes
• Vector Math Library (vectorized transcendental functions)
• Vector Statistical Library (random number generators)
• Optimization solvers (non-linear least squares, …)
• PDE solvers

Intel® Math Kernel Library: a simple way to thread your application
• Many components of MKL have threaded versions
  – Based on the compiler's OpenMP runtime library
• Link the threaded or non-threaded interface
  – libmkl_intel_thread.a or libmkl_sequential.a
  – Use the link line advisor at http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/
  – Or use -mkl with the Intel compiler
• Set the number of threads (a minimal calling sketch appears at the end of this section)
  – export MKL_NUM_THREADS or OMP_NUM_THREADS
  – Call mkl_set_num_threads or omp_set_num_threads
• Optimized for different processor families
  – Loads the appropriate version at runtime

Intel® MKL Domains and Parallelism: where's the parallelism?

  Domain                                 SIMD   OpenMP                  MPI
  BLAS 1, 2, 3                            X      X
  FFTs                                    X      X
  LAPACK (dense LA solvers)               X      X (relies on BLAS 3)
  PARDISO (sparse solver)                 X      X
  VML/VSL                                 X      X
  ScaLAPACK (cluster dense LA solvers)    X      X                       X (hybrid)
  Cluster FFT                                                            X

Intel® Integrated Performance Primitives (Intel® IPP)
A collection of highly optimized functions for multimedia, data processing, communications and embedded applications – "signal processing" in its broadest sense. Mainly for C and C++ programmers, but an API is now available for Fortran too. Optimized for the latest Intel multi-core processors.
• Video coding, audio coding, speech coding, speech recognition
• Data compression, cryptography, string processing
• Signal processing, vector maths, matrix maths
• Image processing, image color conversion, JPEG and JPEG 2000
• Computer vision, realistic rendering
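To make the IPP style concrete, below is a minimal C sketch in the signal-processing domain. The buffer length and fill values are arbitrary choices for illustration; ippsMalloc_32f, ippsSet_32f, ippsAdd_32f and ippsFree are IPP signal-processing primitives, and error handling is omitted for brevity.

  #include <stdio.h>
  #include "ipps.h"                    /* IPP signal-processing domain */

  #define LEN 1024                     /* arbitrary buffer length for illustration */

  int main(void) {
      /* IPP supplies its own aligned allocator, which helps the SIMD kernels */
      Ipp32f *src1 = ippsMalloc_32f(LEN);
      Ipp32f *src2 = ippsMalloc_32f(LEN);
      Ipp32f *dst  = ippsMalloc_32f(LEN);

      ippsSet_32f(1.0f, src1, LEN);    /* fill the inputs with constants */
      ippsSet_32f(2.0f, src2, LEN);

      /* dst[i] = src1[i] + src2[i]; IPP dispatches to the best SIMD code
         path for the processor it finds at run time */
      IppStatus st = ippsAdd_32f(src1, src2, dst, LEN);
      printf("status = %d, dst[0] = %f\n", (int)st, dst[0]);

      ippsFree(src1);
      ippsFree(src2);
      ippsFree(dst);
      return 0;
  }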
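Tying together the MKL slides earlier in this section (the threaded BLAS and the thread-count controls), here is a minimal C sketch. It assumes the program is linked against the threaded interface layer (libmkl_intel_thread) as described above; the matrix size and thread count are arbitrary choices for illustration.

  #include <stdio.h>
  #include "mkl.h"

  #define N 512                        /* arbitrary matrix dimension */

  int main(void) {
      double *A = (double*)mkl_malloc(N * N * sizeof(double), 64);
      double *B = (double*)mkl_malloc(N * N * sizeof(double), 64);
      double *C = (double*)mkl_malloc(N * N * sizeof(double), 64);
      int i;
      for (i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

      /* Equivalent to "export MKL_NUM_THREADS=4" in the environment */
      mkl_set_num_threads(4);

      /* C = 1.0*A*B + 0.0*C; DGEMM runs threaded when the threaded
         interface layer is linked, sequentially otherwise */
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  N, N, N, 1.0, A, N, B, N, 0.0, C, N);

      printf("C[0] = %f (expect %f)\n", C[0], 2.0 * N);
      mkl_free(A); mkl_free(B); mkl_free(C);
      return 0;
  }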
Intel® Threading Building Blocks: extend C++ for parallelism

Highlights
• A C++ runtime library that does thread management, letting developers focus on proven parallel patterns
• Scales appropriately to the number of HW threads available
• Supports nested parallelism
• The thread library API is portable across Linux*, Windows*, and Mac OS* platforms; the open-source community has extended support to FreeBSD*, IA Solaris* and Xbox* 360
• The run-time library provides an optimally sized thread pool, task granularity and performance-oriented scheduling
• Automatic load balancing through task stealing
• Cache efficiency and memory reuse
• Committed to:
  – compiler independence
  – processor independence
  – OS independence

Both GPL and commercial licenses are available: http://threadingbuildingblocks.org

Intel® Threading Building Blocks 3.0 components

Generic parallel algorithms:
  parallel_for(range), parallel_reduce, parallel_for_each(begin, end),
  parallel_do, parallel_invoke, pipeline, parallel_pipeline,
  parallel_sort, parallel_scan

Concurrent containers (a short usage sketch follows the links at the end of this section):
  concurrent_hash_map, concurrent_queue, concurrent_bounded_queue,
  concurrent_vector, concurrent_unordered_map

Task scheduler:
  task_group, structured_task_group, task_scheduler_init, task_scheduler_observer

Miscellaneous:
  tick_count

Threads:
  tbb_thread, thread

Thread local storage:
  enumerable_thread_specific, combinable

Synchronization primitives:
  atomic; mutex; recursive_mutex; spin_mutex; spin_rw_mutex; queuing_mutex;
  queuing_rw_mutex; reader_writer_lock; critical_section; condition_variable;
  lock_guard; unique_lock; null_mutex; null_rw_mutex

Memory allocation:
  tbb_allocator; cache_aligned_allocator; scalable_allocator; zero_allocator

Questions?

Further Information
• http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/
• http://software.intel.com/en-us/articles/tips-for-debugging-run-time-failures-in-intel-fortran-applications/
• Intel® Debugger for Linux* (IDB): http://software.intel.com/en-us/articles/idb-linux/
• http://software.intel.com/en-us/intel-hpc-home
• http://software.intel.com/en-us/articles/intel-compiler-professional-editions-white-papers/
• The Intel® C++ and Fortran Compiler User and Reference Guides:
  http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/index.htm
  http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/fortran/lin/compiler_f/index.htm
• The user forums and knowledge base:
  http://software.intel.com/en-us/forums
  http://software.intel.com/en-us/articles/tools
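Returning to the TBB component list above: a minimal sketch (assuming a C++0x-capable compiler) that combines parallel_for with a concurrent_vector, whose push_back is safe to call from many threads at once. The function f and the even-number filter are stand-ins for real work.

  #include "tbb/parallel_for.h"
  #include "tbb/blocked_range.h"
  #include "tbb/concurrent_vector.h"
  using namespace tbb;

  // Results are collected without any explicit locking:
  // concurrent_vector grows safely under concurrent push_back.
  concurrent_vector<int> results;

  int f(int i) { return i * i; }        // stand-in for real work

  void CollectParallel(int n) {
      parallel_for(blocked_range<int>(0, n),
          [](const blocked_range<int>& r) {
              for (int i = r.begin(); i != r.end(); ++i) {
                  int v = f(i);
                  if (v % 2 == 0)
                      results.push_back(v);   // thread-safe growth
              }
          });
  }

  int main() {
      CollectParallel(1000);
      return 0;
  }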
Summary
• A comprehensive set of tools for multi-core and cluster parallelism from Intel for the x86 architecture
  – Best performance on Intel architecture, and competitive performance on AMD systems
  – Intel tools can be used to standardize x86 C++/Fortran development
• Our focus is on
  – Best performance
  – Comprehensive coverage of parallelism
  – Ease of use
  – Compatibility and software investment protection

Visit http://intel.com/software/products

Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO THIS INFORMATION, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, see www.intel.com/software/products.
Intel, the Intel logo, Itanium, Pentium, Intel Xeon, Intel Core, Intel Centrino and VTune are trademarks or registered trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2010, Intel Corporation. http://intel.com/software/products

Linking with Intel® MKL (contd.)
A layered model gives better control; choose one library from each layer when linking:
• Interface layer: LP64 / ILP64 interfaces, Intel or GNU compiler conventions
• Threading layer: threaded (OpenMP) or sequential
• Computational layer
• Run-time layer (OpenMP run-time)

Ex 1: Static linking using the Intel® Fortran compiler, BLAS, an Intel® 64 processor, on Linux:
  $ ifort myprog.f libmkl_intel_lp64.a libmkl_intel_thread.a libmkl_core.a libiomp5.so

Ex 2: Dynamic linking with the Intel® C++ compiler on Windows:
  c:\> icl mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.dll

Note: it is strongly recommended to link the run-time layer library dynamically.
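As a companion to Ex 1, a fully dynamic link of the same program on Linux might look like the line below. This is a sketch: the library directory is assumed here to be $MKLROOT/lib/em64t (the 2010-era layout for Intel® 64), and the exact form varies between MKL versions; the link line advisor gives the authoritative answer.

  $ ifort myprog.f -L$MKLROOT/lib/em64t \
        -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread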
Intel® MKL Threading
• There are numerous opportunities for threading:
  – Level 3 BLAS ( O(n³) )
  – LAPACK* ( O(n³) )
  – FFTs ( O(n log n) )
  – VML, VSL (depends on processor and function)
• Some routines are not threaded because:
  – memory bandwidth is the limiting resource
  – threading level 1 and level 2 BLAS is mostly ineffective ( O(n) )
• Threaded using OpenMP*
  – With support for GCC* and Microsoft* OpenMP*
• ScaLAPACK and Cluster FFT are SMP parallel
• All of Intel® MKL is thread-safe

Threading Control in Intel® MKL
• Set an OpenMP or Intel MKL environment variable:
  OMP_NUM_THREADS
  MKL_NUM_THREADS
  MKL_DOMAIN_NUM_THREADS
• Or call the OpenMP or Intel MKL functions:
  omp_set_num_threads()
  mkl_set_num_threads()
  mkl_domain_set_num_threads()
• MKL_DYNAMIC / mkl_set_dynamic(): Intel® MKL decides the number of threads
• Example: configure Intel MKL to run 4 threads for BLAS, but sequentially in all other parts of the library
  – Environment variable:
    set MKL_DOMAIN_NUM_THREADS="MKL_ALL=1, MKL_BLAS=4"
  – Function calls:
    mkl_domain_set_num_threads( 1, MKL_ALL );
    mkl_domain_set_num_threads( 4, MKL_BLAS );

Check Intel® TBB online: www.threadingbuildingblocks.org
• Open-source license information
• Downloads, active users forum, developers' blogs, documentation
• News and announcements
• Code samples, FAQ

What's New in TBB 3.0
• Extended compatibility
  – Added support for Microsoft* Visual Studio* 2010
  – Extended C++0x feature support
  – Added Microsoft* Parallel Patterns Library*-compatible classes
  – Added support for Apple* Snow Leopard*
• Improved composability and enhanced task scheduler features
  – Fire-and-forget tasks for queue-like work
  – Independent task scheduling for foreign threads, for improved responsiveness
  – Simplified management of task_group_context: it can now be created and destroyed by different threads
• New parallel pipeline
  – The elegant new parallel_pipeline function provides a strongly typed, lambda-friendly pipeline interface (see the sketch following this slide)
• New concurrent container
  – concurrent_unordered_map, an associative container that permits concurrent insertion and traversal with no visible locking (similar to C++0x std::unordered_map)
• New synchronization primitives
  – C++0x-based std::lock_guard, std::unique_lock, and most of std::condition_variable
  – Microsoft* Parallel Patterns Library*-compatible critical_section and reader_writer_lock
• Improved performance
  – Faster thread local storage (enumerable_thread_specific and combinable)
  – The scalable memory allocator is optimized for large block allocations
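To illustrate the new parallel_pipeline mentioned above, here is a minimal sketch; the stage functions and bounds are invented for illustration. The first and last stages run serially and in order, while the middle stage runs in parallel.

  #include <iostream>
  #include "tbb/pipeline.h"
  using namespace tbb;

  // Generate 0..99 in order, square the values in parallel, sum them in order.
  int main() {
      int i = 0;
      long sum = 0;
      parallel_pipeline(
          8,                                          // max tokens in flight
          make_filter<void,int>(filter::serial_in_order,
              [&](flow_control& fc) -> int {
                  if (i >= 100) { fc.stop(); return 0; }
                  return i++;
              })
          &
          make_filter<int,int>(filter::parallel,      // middle stage is parallel
              [](int x) { return x * x; })
          &
          make_filter<int,void>(filter::serial_in_order,
              [&](int x) { sum += x; }));
      std::cout << "sum = " << sum << std::endl;
      return 0;
  }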
Parallel Algorithm Usage Example

  #include "tbb/blocked_range.h"
  #include "tbb/parallel_for.h"
  using namespace tbb;

  // N and Foo are assumed to be defined elsewhere.
  // ChangeArray defines a for-loop body for parallel_for.
  class ChangeArray {
      int* array;
  public:
      ChangeArray(int* a) : array(a) {}
      // As usual with C++ function objects, the main work is done inside operator()
      void operator()( const blocked_range<int>& r ) const {
          for (int i = r.begin(); i != r.end(); i++) {
              Foo(array[i]);
          }
      }
  };

  void ChangeArrayParallel(int* a, int n) {
      parallel_for(blocked_range<int>(0, n), ChangeArray(a));
  }

  int main() {
      int A[N];
      // initialize array here…
      ChangeArrayParallel(A, N);
      return 0;
  }

This is a call to the template function parallel_for<Range, Body> with Range = blocked_range (a TBB template representing a 1D iteration space) and Body = ChangeArray.

C++0x Lambda Expression Support
With lambda expressions, the parallel_for example transforms into:

  #include "tbb/blocked_range.h"
  #include "tbb/parallel_for.h"
  using namespace tbb;

  void ChangeArrayParallel(int* a, int n) {
      // This parallel_for overload takes start, stop and step arguments
      // and constructs the blocked_range internally.
      parallel_for(0, n, 1, [=](int i) {
          Foo(a[i]);
      });
  }

  int main() {
      int A[N];
      // initialize array here…
      ChangeArrayParallel(A, N);
      return 0;
  }

Capturing variables by value ([=]) from the surrounding scope completely mimics the non-lambda implementation; [&] could be used instead to capture variables by reference. With lambda expressions, the loop body (what would otherwise be a MyBody::operator()) is implemented right inside the call to parallel_for().

Functional parallelism has never been easier

  #include <iostream>
  #include "tbb/parallel_invoke.h"
  #include "tbb/parallel_for.h"
  #include "tbb/spin_mutex.h"
  using namespace tbb;
  using namespace std;

  const int K = 8, N = 16;      // loop bounds (values chosen for illustration)

  // foo and bar are already existing thread-safe functions
  // that a user would like to execute in parallel.
  void foo() { }

  void bar(int a, int b, spin_mutex& m) {
      int c = a + b;
      spin_mutex::scoped_lock l(m);
      cout << c << endl;
  }

  int main(int argc, char* argv[]) {
      spin_mutex m;
      int a = 1, b = 2;
      parallel_invoke(
          foo,                      // a plain void function handle
          [a, b, &m]() {            // calling bar(int, int, spin_mutex&),
              bar(a, b, m);         // implemented using a lambda expression
          },
          [&m]() {                  // a serial thread-safe job, wrapped in a
              for (int i = 0; i < K; ++i) {        // lambda, executed in parallel
                  spin_mutex::scoped_lock l(m);    // with the other three functions
                  cout << i << endl;
              }
          },
          [&m]() {                  // a parallel job, itself also executed in
              parallel_for(0, N, 1, [&m](int i) {  // parallel with the others
                  spin_mutex::scoped_lock l(m);
                  cout << i << " ";
              });
          });
      return 0;
  }

Now imagine writing all this code with just plain threads.
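One pattern from the earlier component list that combines equally well with lambdas is parallel_reduce. Below is a minimal sketch of a parallel sum: subranges are summed independently and the partial sums are then combined (the array contents are assumed to come from elsewhere).

  #include "tbb/parallel_reduce.h"
  #include "tbb/blocked_range.h"
  using namespace tbb;

  // Sum a[0..n) as a parallel reduction: each subrange produces a
  // partial sum, and partial sums are combined pairwise.
  float ParallelSum(const float* a, size_t n) {
      return parallel_reduce(
          blocked_range<size_t>(0, n),
          0.0f,                                       // identity of the reduction
          [=](const blocked_range<size_t>& r, float init) -> float {
              for (size_t i = r.begin(); i != r.end(); ++i)
                  init += a[i];
              return init;
          },
          [](float x, float y) { return x + y; }      // combine partial sums
      );
  }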