Leveraging Optimized Tools and Libraries Shuo Li Financial Services Engineering Software and Services Group Intel Corporation Agenda • Lab Step 1 Baseline • Intel® Parallel Studio XE 2013 • Lab Step 1 Using Intel Compiler • Intel® MKL • Lab Step 1 Using Intel Compiler and MKL • Summary 2 iXPTC 2013 Intel® Xeon Phi ™Coprocessor Lab Step 1 Baseline Monte Carlo European Option Pricing Monte Carlo Method? Statistical Computing method pioneered by Nicholas Metropolis Monte Carlo in Finance Phelim Boyle introduced Monte Carlo method to Quantitative Finance 4 • Simple and Repetitive algorithms • Central Limit Theorem 1. Sample a random path for S in a risk neutral world 2. Calculate the payoff from the derivative 3. Repeat steps 1 and 2 to get many sample values of the payoff from the derivative in a risk-neutral world. 4. Calculate the mean of the sample payoff 5. Discount expected payoff at risk-free rate to get an estimate of the value of the option iXPTC 2013 Intel® Xeon Phi™ Coprocessor Initial Implementation with GCC • Use GCC 4.4.6 typedef std::tr1::mt19937 ENG; // Mersenne Twister typedef std::tr1::normal_distribution<float> DIST; typedef std::tr1::variate_generator<ENG,DIST> GEN; • C/C++ TR1 Random number generator ENG • Program Files Driver.cpp Main program file MonteCarlo.h Parameter Definitions MonteCarloStepn.cpp Monte Carlo Calculations Makefile Build file 5 eng; DIST dist(0,1); GEN gen(eng,dist); for(int opt = 0; opt < OPT_N; opt++) { float VBySqrtT = VOLATILITY * sqrt(T[opt]); float MuByT = (RISKFREE - 0.5 * VOLATILITY * VOLATILITY) * T[opt]; float Sval = S[opt]; float Xval = X[opt]; float val = 0.0, val2 = 0.0; for(int pos = 0; pos < RAND_N; pos++) { float callValue = max(0.0, Sval *exp(MuByT + VBySqrtT * gen()) - Xval); val += callValue; val2 += callValue * callValue; } float exprt = exp(-RISKFREE *T[opt]); CallResult[opt] = exprt * val / (float)RAND_N; float stdDev = sqrt(((float)RAND_N * val2 - val * val)/ ((float)RAND_N * (float)(RAND_N - 1))); CallConfidence[opt] = (float)(exprt * 1.96 * stdDev / sqrtf((float)RAND_N)); } iXPTC 2013 Intel® Xeon Phi™ Coprocessor Your Mission: Make it Faster and Better • Make it fast on Intel® Xeon® Processor and Even faster on Intel® Xeon Phi™ Coprocessor • Take the full advantage the hardware resource • Tools: Intel Parallel Studio XE 2013 SP1 – Intel® C/C++ Compiler – Intel® MKL • Methodology: Stepwise Optimization Framework Let’s Get Started by typing “make” 6 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Intel® Parallel Studio XE 2013 • Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications More Cores. Wider Vectors. Performance Delivered. Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 More Cores Scaling Performance Efficiently Multicore Many-core 50+ cores Wider Vectors 128 Bits Serial Performance Task & Data Parallel Performance • Industry-leading performance from advanced compilers • Comprehensive libraries 256 Bits 512 Bits 8 Distributed Performance • Parallel programming models • Insightful analysis tools iXPTC 2013 Intel® Xeon Phi™ Coprocessor What’s New? Intel® Parallel Studio XE Intel® Compiler Cluster s& Studio Libraries XE Intel® Parallel Studio XE 2013/ Intel® Cluster Studio XE 2013 • Performance Leadership: – 3rd Generation Intel® Core™ Processors (code name “Ivy Bridge”) and future Intel® processors (code name “Haswell”) – Intel® Xeon Phi™ coprocessors – Improved C++ and Fortran performance New Product Capabilities – Latest OS: Windows* 8 Desktop, Linux* – IDE: Visual Studio 2008, 2010, 2012 and gnu tool chain – Standards: C99, selected C++11 features, almost complete Fortran 2003 support and selected features from Fortran 2008, Fortran 2008, MPI 2.2 9 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Support for Latest Intel Processors and Coprocessors Intel® Ivy Bridge microarchitecture Intel® Haswell microarchitecture Intel® Xeon Phi™ coprocessor ✔ AVX ✔ AVX2, FMA3 ✔ IMCI Intel® TBB library ✔ ✔ ✔ Intel® MKL library ✔ AVX ✔ AVX2, FMA3 ✔ Intel® MPI library ✔ ✔ ✔ Intel® VTune™ Amplifier XE† ✔ Hardware Events ✔ Hardware Events ✔ Hardware Events Intel® ✔ Memory & Thread Checks ✔ Memory & Thread ✔ Memory & Thread†† Intel® C++ and Fortran Compiler 10 Inspector XE iXPTC 2013 Intel® Xeon Phi™ Coprocessor Performance-Oriented Compiler Suites Intel® Compilers, Performance Libraries, Debugging Tools On Windows, Linux and Mac OS X Intel® C++ Composer XE 2013 • Intel® C++ Compiler XE 13.0 with Intel® Cilk™ Plus • Intel® TBB • Intel® MKL • Intel® IPP • Intel® Xeon Phi™ product family support, Linux Intel Composer XE 2013 Intel® Fortran Composer XE 2013 • Intel® Fortran Compiler XE 13.0 • Intel® MKL • Compatibility with Compaq Visual Fortran* • Fortran 2003, 2008 support • Intel® Xeon Phi™ product family support, Linux • Combines Intel C++ Composer XE and Intel® Fortran Composer XE • For Fortran developers who also want Intel C++ • Windows (requires Visual Studio) and Linux only Windows: Intel C++/Visual* C++ compatibility & integration into Microsoft* Visual Studio* Linux: Intel C++/gcc* compatibility & integration into Eclipse* CDT Mac OS X: Intel C++/gcc compatibility & integration into XCode* Environment All: Intel Fortran performance leadership, compatible with Compaq* Visual* Fortran All: Leadership performance on Intel and compatible architectures All: One Year Intel® Premier Support. Renewable Annually. Performance, Compatibility, Support 11 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Superior C++ Compiler Performance More Performance • • • • 12 Just recompile Uses Intel® AVX and Intel® AVX2 instructions Intel® Xeon Phi™ product family support, Linux: Compiler, debugger (Linux) Intel® Cilk™ Plus: Tasking and vectorization iXPTC 2013 Intel® Xeon Phi™ Coprocessor Lab Step 1: Using Intel Compiler Build Monte Carlo European Options using Intel C/C++ Compiler • Intel Compiler is fully compatible with • Intel Parallel Studio XE 2013 installed on your notebook • Just type icpc –V for test • Source environmental variables at . /opt/intel/composerxe/pkg_bin/compilervars.sh intel64 • Reissue the make command with CXX=icpc make CXX=icpc • Rerun MonteCarlo built by Intel® C/C++ Composer XE ./MonteCarlo 14 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Intel® MKL 16 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Intel® MKL Supports Intel® Xeon Phi™ Coprocessors • Intel® MKL 11.0 supports the Intel® Xeon Phi™ coprocessors. • Heterogeneous computing • Takes advantage of both multicore host and many-core coprocessors. • Optimized for wider (512-bit) SIMD instructions and threaded for many cores. • All Intel MKL functions are supported: • But optimized at different levels. Pairing highly parallel software with highly parallel hardware. 17 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Highly Optimized Functions • As of MKL 11.0 Update 2 (the latest): – BLAS Level 3, and much of Level 1 & 2 – Sparse BLAS: ?CSRMV, ?CSRMM – Some important LAPACK routines (LU, QR, Cholesky) – Fast Fourier transforms – Vector Math Library – Random number generators in the Vector Statistical Library • Broader functionality to be optimized in future update releases. 18 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Usage Models on Intel® Xeon Phi™ coprocessors • Automatic Offload • No code changes required • Automatically uses both host and target • Transparent data transfer and execution management • Compiler Assisted Offload • Explicit controls of data transfer and remote execution using compiler offload pragmas/directives • Can be used together with Automatic Offload • Native Execution • Uses the coprocessors as independent nodes • Input data and binaries are copied to targets in advance 19 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Lab 1 Using Intel Compile and MKL Using Intel® MKL Random Number Generation • Include MKL header file #include <mkl_vsl.h> • Declare a buffer to receive random numbers float random[RANd_N]; • Define a random stream descriptor data structure VSLSTREAMSTATEPTR Randomstream • Create and initialize the random streams vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEE) • Receive the Random Number in the buffer vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, random, 0.0, 1.0); • Add –mkl in your linker options link and remake the rerun MonteCarlo • Record the performance number in the Excel Worksheet 21 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Summary Summary • Using Intel® Compiler for high performance • Use Intel® MKL to accelerate Monte Carlo. 23 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Backup 25 26 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Intel® MKL Supports for Intel® Xeon Phi™ Coprocessors • Intel® MKL 11.0 supports the Intel® Xeon Phi™ coprocessors. • Heterogeneous computing • Takes advantage of both multicore host and many-core coprocessors. • Optimized for wider (512-bit) SIMD instructions and threaded for many cores. • All Intel MKL functions are supported: • But optimized at different levels. Pairing highly parallel software with highly parallel hardware. 27 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Highly Optimized Functions – As of MKL 11.0 Update 2 (the latest): • BLAS Level 3, and much of Level 1 & 2 • Sparse BLAS: ?CSRMV, ?CSRMM • Some important LAPACK routines (LU, QR, Cholesky) • Fast Fourier transforms • Vector Math Library • Random number generators in the Vector Statistical Library – Broader functionality to be optimized in future update releases. 28 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Usage Models on Intel® Xeon Phi™ Coprocessors • Automatic Offload • No code changes required • Automatically uses both host and target • Transparent data transfer and execution management • Compiler Assisted Offload • Explicit controls of data transfer and remote execution using compiler offload pragmas/directives • Can be used together with Automatic Offload • Native Execution • Uses the coprocessors as independent nodes • Input data and binaries are copied to targets in advance 29 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Automatic Offload (AO) • Offloading is automatic and transparent. • Can take advantage of multiple coprocessors. • By default, Intel MKL decides: • When to offload • Work division between host and targets • Users enjoy host and target parallelism automatically. • Users can still specify work division between host and target. (for BLAS only) 30 iXPTC 2013 Intel® Xeon Phi™ Coprocessor How to Use Automatic Offload • Using Automatic Offload is easy Set an env variable: Call a function: or mkl_mic_enable() MKL_MIC_ENABLE=1 • What if there doesn’t exist a coprocessor in the system? • Runs on the host as usual without penalty! 31 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Automatic Offload Enabled Functions • A selected set of MKL functions are AO enabled. • Only functions with sufficient computation to offset data transfer overhead are subject to AO • In 11.0.2, AO enabled functions include: • Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM, ?SYMM • LAPACK 3 amigos: LU, QR, Cholesky • Offloading happens only when matrix sizes are right • ?GEMM: M, N > 2048, K > 256 • ?SYMM: M, N > 2048 • ?TRSM/?TRMM: M, N > 3072 32 • LU: M, N > 8192 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Work Division Control in Automatic Offload Examples Notes mkl_mic_set_Workdivision( MKL_TARGET_MIC, 0, 0.5) Offload 50% of computation only to the 1st card. Examples Notes MKL_MIC_0_WORKDIVISION=0.5 Offload 50% of computation only to the 1st card. Work division settings have no effects for LAPACK functions. 33 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Compiler Assisted Offload (CAO) • Offloading is explicitly controlled by compiler pragmas or directives. • All MKL functions can be offloaded in CAO. • In comparison, only a subset of MKL is subject to AO. • Can leverage the full potential of compiler’s offloading facility. • More flexibility in data transfer and remote execution management. • A big advantage is data persistence: Reusing transferred data for multiple operations. 34 iXPTC 2013 Intel® Xeon Phi™ Coprocessor How to Use Compiler Assisted Offload • The same way you would offload any function call to the coprocessor. • An example in C: #pragma offload target(mic) \ in(transa, transb, N, alpha, beta) \ in(A:length(matrix_elements)) \ in(B:length(matrix_elements)) \ in(C:length(matrix_elements)) \ out(C:length(matrix_elements) alloc_if(0)) { sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N); } 35 iXPTC 2013 Intel® Xeon Phi™ Coprocessor How to Use Compiler Assisted Offload • An example in Fortran: !DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM !DEC$ OMP OFFLOAD TARGET( MIC ) & !DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), & !DEC$ IN( A: LENGTH( NCOLA * LDA )), & !DEC$ IN( B: LENGTH( NCOLB * LDB )), & !DEC$ INOUT( C: LENGTH( N * LDC )) !$OMP PARALLEL SECTIONS !$OMP SECTION CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, & A, LDA, B, LDB BETA, C, LDC ) !$OMP END PARALLEL SECTIONS 36 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Using AO and CAO in the Same Program • Users can use AO for some MKL calls and use CAO for others in the same program • Only supported by Intel compilers. • Work division must be set explicitly for AO. • Otherwise, all MKL AO calls are executed on the host. 37 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Native Execution • Use the coprocessor as an independent compute node. • Programs can be built to run only on the coprocessor by using the –mmic build option. – MKL function calls inside an offloaded code region executes natively. – Better performance if input data is already available on the coprocessor, and output is not immediately needed on the host side. 38 iXPTC 2013 Intel® Xeon Phi™ Coprocessor Considerations of Using Intel® MKL on Intel® Xeon Phi™ Coprocessors High level parallelism is critical in maximizing performance. • BLAS (Level 3) and LAPACK with large problem size get the most benefit. • Scaling beyond 100’s threads, vectorized, good data locality Minimize data transfer overhead when offload. • Offset data transfer overhead with enough computation. • Exploit data persistence: CAO to help! You can always run on the host if offloading does not offer better performance. 39 iXPTC 2013 Intel® Xeon Phi™ Coprocessor iXPTC 2013 Intel® Xeon Phi™ Coprocessor Value of Suites Suite Only Features • Advisor XE Parallelism Advice • C++ Performance Guide Performance Wizard • Pointer Checker Reduces memory corruption • Code Complexity Analysis Find code likely to be less reliable • Static Analysis Improved! Find Errors and Harden your Security Optimization Notice 41 intel.com/software/products Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Compiler s& Libraries What’s New in Libraries? Intel® MKL • Digital random number generator (DRNG) for improved vector statistics calculations • Automatically utilize Intel® Xeon Phi™ Coprocessors and balance compute loads between CPUs and coprocessors Intel® IPP • Enhanced image resize performance primitives • Improved IPP footprint size Intel® TBB "Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table—crowd simulation software.” • Improved usability and reliability of the Flow Graph feature • Additional C++11 Support Michaël Rouillé, CTO, Gol Ready to Use Libraries to Increase Performance Optimization Notice 42 intel.com/software/products Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.