iXPTC 2013

Scale from Intel® Xeon® Processor to
Intel® Xeon Phi™ Coprocessors
Shuo Li
Financial Services Engineering
Software and Services Group
Intel Corporation
Agenda
• A Tale of Two Architectures
• Transcendental Functions – Step 5 Lab 1/3
• Thread Affinity – Step 5 Lab 2/3
• Prefetch and Streaming Store
• Data Blocking – Step 5 Lab 3/3
• Summary
A Tale of Two Architectures
A Tale of Two Architectures
                      Intel® Xeon® processor     Intel® Xeon Phi™ Coprocessor
Sockets               2                          1
Clock Speed           2.6 GHz                    1.1 GHz
Execution Style       Multicore, out-of-order    In-order
Cores/socket          8                          Up to 61
HW Threads/Core       2                          4
Thread switching      HyperThreading             Round Robin
SIMD widths           8 SP, 4 DP                 16 SP, 8 DP
Peak GFLOPS           692 SP, 346 DP             2020 SP, 1010 DP
Memory Bandwidth      102 GB/s                   320 GB/s
L1 DCache/Core        32 KB                      32 KB
L2 Cache/Core         256 KB                     512 KB
L3 Cache/Socket       30 MB                      None
Transcendental Functions
Extended Math Unit
• Fast approximations of transcendental functions using a hardware
  lookup table, in single precision
• Minimax quadratic polynomial approximation
• Full effective 23-bit mantissa
• 1–2 cycle throughput for the 4 elementary functions
• Derived functions that build on them benefit as well (see the sketch after the table)
Elementary functions      Name        Cycles
Reciprocal                rcp()       1
Reciprocal square root    rsqrt2()    1
Exponential base 2        exp2()      2
Logarithm base 2          log2()      1

Derived functions         Built from            Cycles
Divide                    rcp(), ×              2
Square root               rsqrt(), ×            2
Power                     log2(), ×, exp2()     4
Exponent base 10          vexp223ps, ×          3
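As an illustration of how a derived function composes the base-2 primitives, here is a minimal scalar sketch (ordinary libm calls standing in for the vectorized EMU sequence; pow_via_emu is an illustrative name, not from the lab code):

    #include <math.h>
    #include <stdio.h>

    /* x^y expressed through the base-2 primitives: x^y = 2^(y * log2(x)).
     * The vectorized form maps onto the EMU exp2/log2 hardware sequence. */
    static float pow_via_emu(float x, float y)
    {
        return exp2f(y * log2f(x));
    }

    int main(void)
    {
        printf("%f vs %f\n", pow_via_emu(3.0f, 2.5f), powf(3.0f, 2.5f));
        return 0;
    }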
Use EMU Functions in Finance Algorithms
• Challenges in using exp2() and log2()
  – exp() and log() are widely used in finance algorithms
  – their base is e (= 2.718…), not 2
• The Change of Base Formula (see the sketch below)
  – exp(x) = exp2(x*M_LOG2E),  log(x) = log2(x)*M_LN2
  – M_LOG2E and M_LN2 are defined in math.h
  – Cost: 1 multiplication, and its effect on result accuracy
• In general, it works for any base b
  – expb(x) = exp2(x*log2(b)),  logb(x) = log2(x)*logb(2)
• Manage the cost of conversion
  – Absorb the multiply into other constant calculations outside the loop
  – Always convert from other bases to base 2
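A minimal scalar sketch of the change of base (M_LOG2E and M_LN2 come from math.h; the only extra work per call is the one multiply):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float x = 1.75f;
        /* expf/logf rewritten onto the base-2 primitives */
        float e_std = expf(x),  e_emu = exp2f(x * (float)M_LOG2E);
        float l_std = logf(x),  l_emu = log2f(x) * (float)M_LN2;
        printf("exp: %f %f   log: %f %f\n", e_std, e_emu, l_std, l_emu);
        return 0;
    }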
Black-Scholes can use EMU Functions
Shared constants:

    const float R      = 0.02f;
    const float V      = 0.30f;
    const float RVV    = (R + 0.5f * V * V)/V;
    const float LN2_V  = M_LN2 * (1/V);        /* EMU version */
    const float RLOG2E = -R * M_LOG2E;         /* EMU version */

Original (libm calls):

    for(int opt = 0; opt < OPT_N; opt++)
    {
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float sqrtT = sqrtf(T);
        float d1 = logf(S/X)/(V*sqrtT) + RVV*sqrtT;
        float d2 = d1 - V * sqrtT;
        CNDD1 = CNDF(d1);
        CNDD2 = CNDF(d2);
        float expRT = X * expf(-R * T);
        float CallVal = S * CNDD1 - expRT * CNDD2;
        CallResult[opt] = CallVal;
        PutResult[opt]  = CallVal + expRT - S;
    }

With EMU functions (base-2 calls):

    for(int opt = 0; opt < OPT_N; opt++)
    {
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float rsqrtT = 1/sqrtf(T);
        float sqrtT  = 1/rsqrtT;
        float d1 = log2f(S/X)*LN2_V*rsqrtT + RVV*sqrtT;
        float d2 = d1 - V * sqrtT;
        CNDD1 = CNDF(d1);
        CNDD2 = CNDF(d2);
        float expRT = X * exp2f(RLOG2E * T);
        float CallVal = S * CNDD1 - expRT * CNDD2;
        CallResult[opt] = CallVal;
        PutResult[opt]  = CallVal + expRT - S;
    }
Transcendental Functions
Step 5 Lab 1/3
Using Intel® Xeon Phi™ Coprocessors
• Use an Intel® Xeon® E5 2670 platform with an Intel® Xeon Phi™ Coprocessor
• Make sure you can invoke the Intel C/C++ Compiler
  – source /opt/intel/Compiler/2013.2.146/composerxe/pkg_bin/compilervars.sh intel64
  – Try icpc –V to print out the banner of the Intel® C/C++ compiler
• Build the native Intel® Xeon Phi™ Coprocessor application
  – Change the Makefile and use –mmic in lieu of –xAVX
• Copy the program from the host to the coprocessor
  – Find out the host name using hostname (suppose it returns esg014)
  – Copy the executable: scp ./MonteCarlo esg014-mic0:
  – Establish an execution environment: ssh esg014-mic0
  – Set the environment variable: export LD_LIBRARY_PATH=.
  – Optionally export KMP_AFFINITY="compact,granularity=fine"
  – Invoke the program: ./MonteCarlo
Step 5 Transcendental Functions
• Use the transcendental functions in the EMU
• The inner loop calls expf(MuByT + VBySqrtT * random[pos]);
• Call exp2f instead and scale the argument by a factor of M_LOG2E
• Combine that multiplication with the MuByT and VBySqrtT constants (a sketch follows below):
  float VBySqrtT = VLOG2E * sqrtf(T[opt]);
  float MuByT = RVVLOG2E * T[opt];
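A minimal sketch of the rewrite (mu_coef and vol are illustrative stand-ins for the per-option factors in the original code; the point is that M_LOG2E is folded into the loop-invariant constants, so nothing extra is left inside the inner loop):

    #include <math.h>

    void path_payoffs(float T, const float *random, float *payoff, int n,
                      float mu_coef, float vol)
    {
        /* expf(a) == exp2f(a * M_LOG2E): pre-multiply the constants once. */
        const float RVVLOG2E = mu_coef * (float)M_LOG2E;
        const float VLOG2E   = vol     * (float)M_LOG2E;

        float MuByT    = RVVLOG2E * T;        /* was: mu_coef * T        */
        float VBySqrtT = VLOG2E   * sqrtf(T); /* was: vol     * sqrtf(T) */

        for (int pos = 0; pos < n; pos++)
            payoff[pos] = exp2f(MuByT + VBySqrtT * random[pos]);
            /* was: expf(MuByT + VBySqrtT * random[pos]) */
    }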
Thread Affinity
More on Thread Affinity
• Bind the worker threads to specific processor cores/threads
• Minimizes thread migration and context switches
• Improves data locality; reduces coherency traffic
• Two components to the problem:
  – How many worker threads to create?
  – How to bind worker threads to cores/threads?
• Two ways to specify thread affinity (see the sketch below)
  – Environment variables: OMP_NUM_THREADS, KMP_AFFINITY
  – C/C++ API: kmp_set_defaults("KMP_AFFINITY=compact")
               omp_set_num_threads(244);
• Add #include <omp.h> to your source file
• Compile with –openmp
• Use libiomp5.so
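A minimal sketch of the API route, assuming the Intel compiler's omp.h (which declares the kmp_* extensions) and 244 = 61 cores × 4 hardware threads as on the coprocessor:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
    #ifdef _OPENMP
        /* Must run before the first parallel region creates the thread team. */
        kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
        omp_set_num_threads(244);   /* native coprocessor run: 61 cores x 4 threads */
    #endif

        #pragma omp parallel
        {
            #pragma omp single
            printf("team size: %d\n", omp_get_num_threads());
        }
        return 0;
    }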
The Optimal Thread Number
• Intel MIC maintains 4 hardware contexts per core
  – Round-robin execution policy
  – Requires 2 threads per core for decent performance
  – Financial algorithms take all 4 threads to reach peak
• Intel Xeon processors optionally use HyperThreading
  – Execute-until-stall execution policy
  – Truly compute-intensive kernels peak with 1 thread per core
  – Finance algorithms like HyperThreading: 2 threads per core
• Size your OpenMP application by NCORE, the number of cores (sketch below)
  – Host only:   2 x ncore (or 1 x ncore if HyperThreading is disabled)
  – MIC native:  4 x ncore
  – Offload:     4 x (ncore – 1); the OpenMP runtime avoids the core the OS runs on
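A sketch of sizing the team from the detected hardware rather than hard-coding it (assumption: on a native coprocessor run omp_get_num_procs() already reports cores × 4 hardware threads):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Native runs: use every hardware thread the runtime reports.
         * For offload, the Intel runtime itself reserves the core the
         * OS/offload daemon runs on. */
        int hw_threads = omp_get_num_procs();
        omp_set_num_threads(hw_threads);
        printf("using %d threads\n", hw_threads);
        return 0;
    }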
Thread Affinity Choices
• Intel® OpenMP* supports the following affinity types:
  – compact: assigns threads to consecutive h/w contexts on the same physical
    core, to benefit from the shared cache
  – scatter: assigns consecutive threads to different physical cores, to
    maximize access to memory
  – balanced: a blend of compact & scatter (currently only available for
    Intel® MIC Architecture)
• You can also specify an affinity modifier

  Modifier: granularity – describes the lowest level within the topology map
  at which OpenMP threads are allowed to float

  Specifier        Description
  =core            Broadest granularity level supported. Allows all the OpenMP
                   threads bound to a core to float between its thread contexts.
  =fine, =thread   The finest granularity level. Causes each OpenMP thread to
                   be bound to a single thread context.

  – Explicit setting:
    KMP_AFFINITY set to granularity=fine,proclist="1-240",explicit
• Affinity is particularly important if not all available threads are used
• Affinity type is less important under full thread subscription
Thread Affinity in Monte Carlo
• Monte Carlo can use all the threads available to a core
  – Enable HyperThreading for the Intel® Xeon® processor
  – Set –opt-threads-per-core=4 for the coprocessor build
• Affinity type is less important when all cores are used
  – Argument for compact: maximizes the benefit of shared random numbers
  – Argument for scatter: maximizes the bandwidth to memory
• However, you still have to set the thread affinity type
  – API call:
      #ifdef _OPENMP
      kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
      #endif
  – Environment variables:
      ~ $ export LD_LIBRARY_PATH=.
      ~ $ export KMP_AFFINITY="scatter,granularity=fine"
      ~ $ ./MonteCarlo
Thread Affinity
Step 5 Lab 2/3
Step 5 Thread Affinity
• Add #include <omp.h>
• Add the following lines before the very first #pragma omp:
  #ifdef _OPENMP
  kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
  #endif
• Add –opt-threads-per-core=4 to the Makefile
Prefetch and Streaming Stores
Prefetch on Intel Multicore and Manycore platforms
• Objective: move data from memory into the L1 or L2 cache in
  anticipation of CPU loads/stores
• More important on the in-order Intel Xeon Phi Coprocessor
• Less important on the out-of-order Intel Xeon Processor
• Compiler prefetching is on by default for Intel® Xeon Phi™
  coprocessors at –O2 and above
• Compiler prefetching is not enabled by default on Intel® Xeon®
  processors
  – Use the option –opt-prefetch[=n], n = 1..4
• Use the compiler reporting options to see detailed per-loop
  prefetch diagnostics
  – Use –opt-report-phase hlo –opt-report 3
Automatic Prefetches
Loop Prefetch
• Compiler-generated prefetches target memory accesses in a future
  iteration of the loop
• They target regular, predictable array and pointer accesses
Interaction with the hardware prefetcher
• The Intel® Xeon Phi™ Coprocessor has a hardware L2 prefetcher
• If software prefetches are doing a good job, hardware prefetching
  does not kick in
• References not prefetched by the compiler may get prefetched by
  the hardware prefetcher
Explicit Prefetch
• Use intrinsics (see the sketch after this list)
  – _mm_prefetch((char *) &a[i], hint);
    See xmmintrin.h for possible hints (for L1, L2, non-temporal, …)
  – But you have to specify the prefetch distance yourself
  – There are also gather/scatter prefetch intrinsics; see zmmintrin.h and
    the compiler user guide, e.g. _mm512_prefetch_i32gather_ps
• Use a pragma / directive (easier):
  – #pragma prefetch a[:hint[:distance]]
  – You specify what to prefetch, but can let the compiler figure out
    how far ahead to do it
• Use compiler switches:
  – –opt-prefetch-distance=n1[,n2]
  – Specifies the prefetch distance (how many iterations ahead) for
    compiler-generated prefetches inside loops; n1 is the distance from
    memory to L2, n2 from L2 to L1
  – BlackScholes uses –opt-prefetch-distance=8,2
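A minimal sketch of the two explicit forms (the 16-iteration distance and the _MM_HINT_T1 hint are illustrative choices, not values from the slide; #pragma prefetch is an Intel compiler directive):

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_* */

    void scale(float *a, const float *b, int n)
    {
        /* Intrinsic form: prefetch b[] a fixed number of iterations ahead. */
        for (int i = 0; i < n; i++) {
            _mm_prefetch((const char *)&b[i + 16], _MM_HINT_T1);
            a[i] = 2.0f * b[i];
        }
    }

    void scale_pragma(float *a, const float *b, int n)
    {
        /* Directive form: name the array, let the compiler pick hint and distance. */
    #pragma prefetch b
        for (int i = 0; i < n; i++)
            a[i] = 2.0f * b[i];
    }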
Streaming Store
• Avoids read-for-ownership for certain memory write operations
• Bypasses the prefetches associated with the memory read
• Use #pragma vector nontemporal (v1, v2, …) to drop a hint to the compiler
• Without streaming stores: 448 bytes read/written per iteration
• With streaming stores: 320 bytes read/written per iteration
• Relieves bandwidth pressure; improves cache utilization
• –vec-report6 displays the compiler action

    for (int chunkBase = 0; chunkBase < OptPerThread; chunkBase += CHUNKSIZE)
    {
    #pragma simd vectorlength(CHUNKSIZE)
    #pragma vector aligned
    #pragma vector nontemporal (CallResult, PutResult)
        for(int opt = chunkBase; opt < (chunkBase+CHUNKSIZE); opt++)
        {
            float CNDD1;
            float CNDD2;
            float CallVal = 0.0f, PutVal = 0.0f;
            float T = OptionYears[opt];
            float X = OptionStrike[opt];
            float S = StockPrice[opt];
            ……
            CallVal = S * CNDD1 - XexpRT * CNDD2;
            PutVal  = CallVal + XexpRT - S;
            CallResult[opt] = CallVal;
            PutResult[opt]  = PutVal;
        }
    }

Compiler diagnostics:
    bs_test_sp.c(215): (col. 4) remark: vectorization support: streaming store was generated for CallResult.
    bs_test_sp.c(216): (col. 4) remark: vectorization support: streaming store was generated for PutResult.
Data Blocking
Data Blocking
• Partition data into small blocks that fit in the L2 cache
  – Exploit data reuse in the application
  – Ensure the data remains in the cache across multiple uses
  – Using the data in cache removes the need to go to memory
  – A bandwidth-limited program may then execute at its FLOPS limit
• Simple case of 1D blocking
  – A data set of size DATA_N is used WORK_N times by 100s of threads
  – Each thread handles a piece of work and has to traverse all the data
• Without blocking
  – 100s of threads pound on different areas of DATA_N
  – The memory interconnect limits the performance

    #pragma omp parallel for
    for(int wrk = 0; wrk < WORK_N; wrk++)
    {
        initialize_the_work(wrk);
        for(int ind = 0; ind < DATA_N; ind++)
        {
            dataptr datavalue = read_data(ind);
            result = compute(datavalue);
            aggregate = combine(aggregate, result);
        }
        postprocess_work(aggregate);
    }

• With blocking
  – A cacheable BSIZE of data is processed by all 100s of threads at a time
  – Each block is read once and kept in cache for reuse until all threads
    are done with it

    for(int BBase = 0; BBase < DATA_N; BBase += BSIZE)
    {
        #pragma omp parallel for
        for(int wrk = 0; wrk < WORK_N; wrk++)
        {
            initialize_the_work(wrk);
            for(int ind = BBase; ind < BBase+BSIZE; ind++)
            {
                dataptr datavalue = read_data(ind);
                result = compute(datavalue);
                aggregate[wrk] = combine(aggregate[wrk], result);
            }
            postprocess_work(aggregate[wrk]);
        }
    }
Blocking in Monte Carlo European Options
• Each thread runs Monte Carlo using all the random numbers
• The random numbers are too big to fit in each thread's L2
  – Random number size is RAND_N * sizeof(float) = 1 MB, i.e. 256K floats
  – Each thread's share of L2 is 512KB/4 = 128KB; effectively usable: 64KB,
    or 16K floats
• Without blocking
  – Each thread makes an independent pass over the RAND_N data for all the
    options it runs
  – The interconnect is busy satisfying read requests from different threads
    at different points
  – Prefetched data from different points also saturates the memory bandwidth
• With blocking
  – The random numbers are partitioned into cacheable blocks
  – A block is brought into cache once the previous block has been processed
    by all threads
  – It remains in the threads' caches until all threads have finished their
    passes over it
  – Each thread repeatedly reuses the data in cache for all the options it
    needs to run

    const int nblocks = RAND_N/BLOCKSIZE;
    for(int block = 0; block < nblocks; ++block)
    {
        vsRngGaussian (VSL_METHOD_SGAUSSIAN_ICDF,
            Randomstream, BLOCKSIZE, random, 0.0f, 1.0f);
    #pragma omp parallel for
        for(int opt = 0; opt < OPT_N; opt++)
        {
            float VBySqrtT = VLOG2E * sqrtf(T[opt]);
            float MuByT = RVVLOG2E * T[opt];
            float Sval = S[opt];
            float Xval = X[opt];
            float val = 0.0, val2 = 0.0;
    #pragma vector aligned
    #pragma simd reduction(+:val) reduction(+:val2)
    #pragma unroll(4)
            for(int pos = 0; pos < BLOCKSIZE; pos++)
            {
                … … …
                val += callValue;
                val2 += callValue * callValue;
            }
            h_CallResult[opt] += val;
            h_CallConfidence[opt] += val2;
        }
    }
    #pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        const float val   = h_CallResult[opt];
        const float val2  = h_CallConfidence[opt];
        const float exprt = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        … … …
        h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    }
Data Blocking
Step 5 Lab 3/3
Step 5 Data Blocking
• Apply 1D data blocking to the random numbers
  – L2: 512KB per core, 128KB per thread
  – Your application can use about 64KB
  – BLOCKSIZE = 16*1024
• Move the loop that calculates the per-option data from the middle loop to
  outside the loop
• Change the inner loop bound from RAND_N to BLOCKSIZE and save the
  intermediate results per block:

    const int nblocks = RAND_N/BLOCKSIZE;
    for(int block = 0; block < nblocks; ++block){
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF,
            Randomstream, BLOCKSIZE, random, 0.0f, 1.0f);
        <<<Existing Code>>>
        // Save intermediate result here
        h_CallResult[opt]     += val;
        h_CallConfidence[opt] += val2;
    }

• Add an initialization loop before the triple-nested loops:

    #pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        h_CallResult[opt]     = 0.0f;
        h_CallConfidence[opt] = 0.0f;
    }

• Add a final loop to turn the accumulated sums into results:

    #pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        const float val    = h_CallResult[opt];
        const float val2   = h_CallConfidence[opt];
        const float exprt  = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt]  = exprt * val * INV_RAND_N;
        const float stdDev = sqrtf((F_RAND_N * val2 - val * val) * STDDEV_DENOM);
        h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    }
Run your program on Intel® Xeon Phi™ Coprocessor
• Build the native Intel® Xeon Phi™ Coprocessor application
  – Change the Makefile and use –mmic in lieu of –xAVX
• Copy the program from the host to the coprocessor
  – Find out the host name using hostname (suppose it returns esg014)
  – Copy the executable: scp MonteCarlo esg014-mic0:
  – Establish an execution environment: ssh esg014-mic0
  – Set the environment variable: export LD_LIBRARY_PATH=.
  – Invoke the program: ./MonteCarlo
Summary
Summary
• Base-2 exponential and logarithm functions are always faster than
  other bases on Intel multicore and manycore platforms
• Set thread affinity when you use OpenMP*
• Let the hardware prefetcher work for you; fine-tune your loops with
  the prefetch pragmas/directives
• Optimize your data access and rearrange your computation to work on
  data already in cache