Intel Parallel Studio XE Software Programming Tool Suite 30-3-30

Leveraging Optimized Tools and
Libraries
Shuo Li
Financial Services Engineering
Software and Services Group
Intel Corporation
Agenda
• Lab Step 1 Baseline
• Intel® Parallel Studio XE 2013
• Lab Step 1 Using Intel Compiler
• Intel® MKL
• Lab Step 1 Using Intel Compiler and MKL
• Summary
Lab Step 1 Baseline
Monte Carlo European Option Pricing
• Monte Carlo method: a statistical computing method pioneered by Nicholas Metropolis
• Monte Carlo in finance: Phelim Boyle introduced the Monte Carlo method to quantitative finance
• Simple and repetitive algorithms
• Central Limit Theorem
1. Sample a random path for S in a risk-neutral world
2. Calculate the payoff from the derivative
3. Repeat steps 1 and 2 to get many sample values of the payoff from the derivative in a risk-neutral world
4. Calculate the mean of the sample payoffs
5. Discount the expected payoff at the risk-free rate to get an estimate of the value of the option
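In symbols, these steps compute the following estimator (with N simulated paths, spot S, strike X, risk-free rate r, volatility $\sigma$ and maturity T; this restatement is added for reference and mirrors the code on the next slide):

$$S_T^{(i)} = S\,\exp\!\Big((r - \tfrac{1}{2}\sigma^2)\,T + \sigma\sqrt{T}\,Z_i\Big),\qquad Z_i \sim \mathcal{N}(0,1)$$

$$\hat{C} = e^{-rT}\,\frac{1}{N}\sum_{i=1}^{N}\max\big(S_T^{(i)} - X,\,0\big),\qquad \text{95\% CI half-width} = e^{-rT}\,\frac{1.96\,s}{\sqrt{N}}$$

where s is the sample standard deviation of the payoffs.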
Initial Implementation with GCC
• Use GCC 4.4.6
• C/C++ TR1 random number generator
• Program files:
  Driver.cpp            Main program file
  MonteCarlo.h          Parameter definitions
  MonteCarloStepn.cpp   Monte Carlo calculations
  Makefile              Build file

typedef std::tr1::mt19937 ENG;                      // Mersenne Twister
typedef std::tr1::normal_distribution<float> DIST;
typedef std::tr1::variate_generator<ENG,DIST> GEN;

ENG eng;
DIST dist(0,1);
GEN gen(eng,dist);
for(int opt = 0; opt < OPT_N; opt++)
{
    float VBySqrtT = VOLATILITY * sqrt(T[opt]);
    float MuByT = (RISKFREE - 0.5 * VOLATILITY * VOLATILITY) * T[opt];
    float Sval = S[opt];
    float Xval = X[opt];
    float val = 0.0, val2 = 0.0;
    for(int pos = 0; pos < RAND_N; pos++)
    {
        float callValue = max(0.0, Sval * exp(MuByT + VBySqrtT * gen()) - Xval);
        val  += callValue;
        val2 += callValue * callValue;
    }
    float exprt = exp(-RISKFREE * T[opt]);
    CallResult[opt] = exprt * val / (float)RAND_N;
    float stdDev = sqrt(((float)RAND_N * val2 - val * val) /
                        ((float)RAND_N * (float)(RAND_N - 1)));
    CallConfidence[opt] = (float)(exprt * 1.96 * stdDev / sqrtf((float)RAND_N));
}
Your Mission: Make it Faster and Better
• Make it fast on the Intel® Xeon® processor, and even faster on the Intel® Xeon Phi™ coprocessor
• Take full advantage of the hardware resources
• Tools: Intel Parallel Studio XE 2013 SP1
– Intel® C/C++ Compiler
– Intel® MKL
• Methodology: Stepwise Optimization Framework
Let’s Get Started by typing “make”
Intel® Parallel Studio XE 2013
• Helping Developers Efficiently Produce Fast,
Scalable and Reliable Applications
More Cores. Wider Vectors. Performance Delivered.
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
Scaling performance efficiently:
• More cores: from multicore to many-core (50+ cores)
• Wider vectors: 128-bit, 256-bit, and 512-bit SIMD
• From serial performance to task- and data-parallel performance to distributed performance
Delivered through:
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
What’s New?
Intel® Parallel Studio XE 2013/ Intel® Cluster Studio XE 2013
• Performance Leadership:
– 3rd Generation Intel® Core™ processors (code name "Ivy Bridge") and future Intel® processors (code name "Haswell")
– Intel® Xeon Phi™ coprocessors
– Improved C++ and Fortran performance
• New Product Capabilities:
– Latest OS: Windows* 8 Desktop, Linux*
– IDE: Visual Studio 2008, 2010, 2012 and the GNU tool chain
– Standards: C99, selected C++11 features, almost complete Fortran 2003 support, selected features from Fortran 2008, and MPI 2.2
Support for Latest Intel
Processors and Coprocessors
                                  Intel® Ivy Bridge          Intel® Haswell             Intel® Xeon Phi™
                                  microarchitecture          microarchitecture          coprocessor
Intel® C++ and Fortran Compiler   ✔ AVX                      ✔ AVX2, FMA3               ✔ IMCI
Intel® TBB library                ✔                          ✔                          ✔
Intel® MKL library                ✔ AVX                      ✔ AVX2, FMA3               ✔
Intel® MPI library                ✔                          ✔                          ✔
Intel® VTune™ Amplifier XE†       ✔ Hardware Events          ✔ Hardware Events          ✔ Hardware Events
Intel® Inspector XE               ✔ Memory & Thread Checks   ✔ Memory & Thread          ✔ Memory & Thread††
Performance-Oriented Compiler Suites
Intel® Compilers, Performance Libraries, Debugging Tools
On Windows, Linux and Mac OS X
Intel® C++ Composer XE 2013
• Intel® C++ Compiler XE 13.0 with Intel® Cilk™ Plus
• Intel® TBB
• Intel® MKL
• Intel® IPP
• Intel® Xeon Phi™ product family support, Linux

Intel® Fortran Composer XE 2013
• Intel® Fortran Compiler XE 13.0
• Intel® MKL
• Compatibility with Compaq Visual Fortran*
• Fortran 2003, 2008 support
• Intel® Xeon Phi™ product family support, Linux

Intel® Composer XE 2013
• Combines Intel® C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows (requires Visual Studio) and Linux only
Windows: Intel C++/Visual* C++ compatibility & integration into Microsoft* Visual Studio*
Linux: Intel C++/gcc* compatibility & integration into Eclipse* CDT
Mac OS X: Intel C++/gcc compatibility & integration into XCode* Environment
All: Intel Fortran performance leadership, compatible with Compaq* Visual* Fortran
All: Leadership performance on Intel and compatible architectures
All: One Year Intel® Premier Support. Renewable Annually.
Performance, Compatibility, Support
Superior C++ Compiler Performance
More Performance
• Just recompile
• Uses Intel® AVX and Intel® AVX2 instructions
• Intel® Xeon Phi™ product family support on Linux: compiler and debugger
• Intel® Cilk™ Plus: tasking and vectorization
Lab Step 1: Using Intel Compiler
Build Monte Carlo European Options
using Intel C/C++ Compiler
• The Intel Compiler is fully compatible with GCC
• Intel Parallel Studio XE 2013 is installed on your notebook
• Just type icpc -V to test it
• Source the environment variables:
. /opt/intel/composerxe/pkg_bin/compilervars.sh intel64
• Reissue the make command with CXX=icpc
make CXX=icpc
• Rerun MonteCarlo built by Intel® C/C++ Composer XE
./MonteCarlo
Intel® MKL
Intel® MKL Supports Intel® Xeon Phi™ Coprocessors
• Intel® MKL 11.0 supports the Intel® Xeon Phi™ coprocessors.
• Heterogeneous computing
• Takes advantage of both multicore host and many-core
coprocessors.
• Optimized for wider (512-bit) SIMD instructions and threaded
for many cores.
• All Intel MKL functions are supported:
• But optimized at different levels.
Pairing highly parallel software
with highly parallel hardware.
Highly Optimized Functions
• As of MKL 11.0 Update 2 (the latest):
– BLAS Level 3, and much of Level 1 & 2
– Sparse BLAS: ?CSRMV, ?CSRMM
– Some important LAPACK routines (LU, QR, Cholesky)
– Fast Fourier transforms
– Vector Math Library
– Random number generators in the Vector Statistical
Library
• Broader functionality to be optimized in future
update releases.
Usage Models on Intel® Xeon Phi™ coprocessors
• Automatic Offload
• No code changes required
• Automatically uses both host and target
• Transparent data transfer and execution management
• Compiler Assisted Offload
• Explicit controls of data transfer and remote execution
using compiler offload pragmas/directives
• Can be used together with Automatic Offload
• Native Execution
• Uses the coprocessors as independent nodes
• Input data and binaries are copied to targets in advance
Lab 1: Using Intel Compiler and MKL
Using Intel® MKL Random Number Generation
• Include the MKL header file: #include <mkl_vsl.h>
• Declare a buffer to receive the random numbers
float random[RAND_N];
• Define a random stream descriptor data structure
VSLStreamStatePtr Randomstream;
• Create and initialize the random stream
vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);
• Generate the Gaussian random numbers into the buffer
vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, random, 0.0, 1.0);
• Add -mkl to your compile and link options, then remake and rerun MonteCarlo (a sketch of the resulting loop follows this list)
• Record the performance number in the Excel Worksheet
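A minimal sketch of how the pieces above fit together. The helper name MonteCarloMKL is hypothetical, and the quantities that the lab's MonteCarlo.h defines (OPT_N, RAND_N, RANDSEED, RISKFREE, VOLATILITY) are passed here as parameters so the sketch is self-contained; it is not the lab's actual code:

#include <mkl_vsl.h>
#include <cmath>
#include <algorithm>

// Hypothetical helper: same Monte Carlo loop as the GCC baseline, but the
// per-sample TR1 gen() call is replaced by one vsRngGaussian() batch per option.
void MonteCarloMKL(float *CallResult, float *CallConfidence,
                   const float *S, const float *X, const float *T,
                   int OPT_N, int RAND_N, float RISKFREE, float VOLATILITY,
                   unsigned int RANDSEED)
{
    VSLStreamStatePtr Randomstream;
    vslNewStream(&Randomstream, VSL_BRNG_MT19937, RANDSEED);
    float *random = new float[RAND_N];               // buffer filled by MKL VSL

    for (int opt = 0; opt < OPT_N; opt++)
    {
        // One batch of standard normal samples per option
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, RAND_N, random, 0.0f, 1.0f);

        float VBySqrtT = VOLATILITY * std::sqrt(T[opt]);
        float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
        float val = 0.0f, val2 = 0.0f;
        for (int pos = 0; pos < RAND_N; pos++)
        {
            float callValue = std::max(0.0f, S[opt] * std::exp(MuByT + VBySqrtT * random[pos]) - X[opt]);
            val  += callValue;
            val2 += callValue * callValue;
        }
        float exprt = std::exp(-RISKFREE * T[opt]);
        CallResult[opt] = exprt * val / (float)RAND_N;
        float stdDev = std::sqrt(((float)RAND_N * val2 - val * val) /
                                 ((float)RAND_N * (float)(RAND_N - 1)));
        CallConfidence[opt] = (float)(exprt * 1.96f * stdDev / std::sqrt((float)RAND_N));
    }

    delete [] random;
    vslDeleteStream(&Randomstream);
}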
Summary
• Use the Intel® Compiler for high performance
• Use Intel® MKL to accelerate Monte Carlo
Backup
Automatic Offload (AO)
• Offloading is automatic and transparent.
• Can take advantage of multiple coprocessors.
• By default, Intel MKL decides:
• When to offload
• Work division between host and targets
• Users enjoy host and target parallelism
automatically.
• Users can still specify the work division between host and target (for BLAS only)
How to Use Automatic Offload
• Using Automatic Offload is easy
Set an environment variable: MKL_MIC_ENABLE=1
or call a function: mkl_mic_enable()
(a short C++ sketch follows below)
• What if there is no coprocessor in the system?
• The code runs on the host as usual, without penalty!
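For illustration, a minimal C++ sketch of Automatic Offload (assumptions: mkl.h exposes the AO control functions, and the matrix is large enough to pass the ?GEMM offload thresholds listed on the next slide):

#include <mkl.h>
#include <vector>

int main()
{
    const MKL_INT n = 4096;                      // large enough to be considered for offload
    std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);

    mkl_mic_enable();                            // same effect as setting MKL_MIC_ENABLE=1
    // MKL decides whether to offload and how to split the work between host and coprocessor
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A.data(), n, B.data(), n, 0.0f, C.data(), n);
    return 0;
}

Build it with the usual -mkl link option; no offload pragmas are needed.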
Automatic Offload Enabled Functions
• A selected set of MKL functions are AO enabled.
• Only functions with sufficient computation to offset data
transfer overhead are subject to AO
• In 11.0.2, AO enabled functions include:
• Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM, ?SYMM
• LAPACK 3 amigos: LU, QR, Cholesky
• Offloading happens only when matrix sizes are right
• ?GEMM: M, N > 2048, K > 256
• ?SYMM: M, N > 2048
• ?TRSM/?TRMM: M, N > 3072
• LU: M, N > 8192
Work Division Control in Automatic Offload
Examples:
• Function call: mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5) offloads 50% of the computation only to the 1st card.
• Environment variable: MKL_MIC_0_WORKDIVISION=0.5 offloads 50% of the computation only to the 1st card.
Note: work division settings have no effect for LAPACK functions.
Compiler Assisted Offload (CAO)
• Offloading is explicitly controlled by compiler
pragmas or directives.
• All MKL functions can be offloaded in CAO.
• In comparison, only a subset of MKL is subject to AO.
• Can leverage the full potential of the compiler's offloading facility.
• More flexibility in data transfer and remote
execution management.
• A big advantage is data persistence: Reusing transferred
data for multiple operations.
How to Use Compiler Assisted Offload
• The same way you would offload any function
call to the coprocessor.
• An example in C:
#pragma offload target(mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        in(C:length(matrix_elements)) \
        out(C:length(matrix_elements) alloc_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}
How to Use Compiler Assisted Offload
• An example in Fortran:
!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
!$OMP PARALLEL SECTIONS
!$OMP SECTION
      CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, &
                  A, LDA, B, LDB, BETA, C, LDC )
!$OMP END PARALLEL SECTIONS
Using AO and CAO in the Same Program
• Users can use AO for some MKL calls and use
CAO for others in the same program
• Only supported by Intel compilers.
• Work division must be set explicitly for AO.
• Otherwise, all MKL AO calls are executed on the host.
Native Execution
• Use the coprocessor as an independent compute
node.
• Programs can be built to run only on the coprocessor by using the -mmic build option (a build sketch follows this list).
– MKL function calls inside an offloaded code region execute natively.
– Better performance if input data is already available on the
coprocessor, and output is not immediately needed on the host side.
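For example, a minimal build-and-run sketch (the source file names follow the lab's naming; the output name MonteCarlo.mic is illustrative):

icpc -mmic -mkl Driver.cpp MonteCarloStepn.cpp -o MonteCarlo.mic

Copy MonteCarlo.mic (and the coprocessor-side MKL libraries) to the target, then run it there, for example over ssh.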
Considerations of Using Intel® MKL on
Intel® Xeon Phi™ Coprocessors
High-level parallelism is critical to maximizing performance.
• BLAS (Level 3) and LAPACK with large problem sizes get the most benefit.
• Scaling beyond 100s of threads, vectorized code, good data locality
Minimize data transfer overhead when offloading.
• Offset data transfer overhead with enough computation.
• Exploit data persistence: CAO can help!
You can always run on the host if offloading does not offer better performance.
Value of Suites
Suite Only Features
• Advisor XE: parallelism advice
• C++ Performance Guide: performance wizard
• Pointer Checker: reduces memory corruption
• Code Complexity Analysis: finds code likely to be less reliable
• Static Analysis (improved): find errors and harden your security
What’s New in Libraries?
Intel® MKL
• Digital random number generator (DRNG) for improved vector statistics calculations
• Automatically utilize Intel® Xeon Phi™ coprocessors and balance compute loads between CPUs and coprocessors

Intel® IPP
• Enhanced image resize performance primitives
• Improved IPP footprint size

Intel® TBB
• Improved usability and reliability of the Flow Graph feature
• Additional C++11 support

"Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table—crowd simulation software." Michaël Rouillé, CTO, Gol

Ready to Use Libraries to Increase Performance