Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors
Shuo Li, Financial Services Engineering, Software and Services Group, Intel Corporation
iXPTC 2013

Agenda
• A Tale of Two Architectures
• Transcendental Functions - Step 5 Lab 1/3
• Thread Affinity - Step 5 Lab 2/3
• Prefetch and Streaming Store
• Data Blocking - Step 5 Lab 3/3
• Summary

A Tale of Two Architectures

                      Intel® Xeon® processor   Intel® Xeon Phi™ Coprocessor
Sockets               2                        1
Clock Speed           2.6 GHz                  1.1 GHz
Execution Style       Out-of-order             In-order
Cores/Socket          8                        Up to 61
HW Threads/Core       2                        4
Thread Switching      HyperThreading           Round robin
SIMD Width            8 SP, 4 DP               16 SP, 8 DP
Peak Gflops           692 SP, 346 DP           2020 SP, 1010 DP
Memory Bandwidth      102 GB/s                 320 GB/s
L1 DCache/Core        32 kB                    32 kB
L2 Cache/Core         256 kB                   512 kB
L3 Cache/Socket       30 MB                    none

Transcendental Functions

Extended Math Unit
• Fast approximations of transcendental functions using a hardware lookup table, in single precision
• Minimax quadratic polynomial approximation
• Effectively full 23-bit mantissa accuracy
• 1-2 cycle throughput for the 4 elementary functions
• Benefits other functions that are built directly on them

Elementary functions        Name      Cycles
Reciprocal                  rcp()     1
Reciprocal square root      rsqrt()   1
Exponential base 2          exp2()    2
Logarithm base 2            log2()    1

Derived functions           Built from               Cycles
Divide                      rcp(), multiply          2
Square root                 rsqrt(), multiply        2
Power                       log2(), multiply, exp2() 4
Exponential base 10         vexp223ps, multiply      3

Use EMU Functions in Finance Algorithms
• Challenges in using exp2() and log2()
  – exp() and log() are widely used in finance algorithms
  – their base is e (= 2.718…), not 2
• The change-of-base formula
  – exp(x) = exp2(x*M_LOG2E) and log(x) = log2(x)*M_LN2; M_LOG2E and M_LN2 are defined in math.h
  – Cost: 1 multiplication, plus its effect on result accuracy
• In general, it works for any base b
  – expb(x) = exp2(x*log2(b)), logb(x) = log2(x)*logb(2)
• Manage the cost of conversion
  – Absorb the multiply into other constant calculations outside the loop
  – Always convert from other bases to base 2

Black-Scholes can use EMU Functions

Original (expf/logf):

    const float R   = 0.02f;
    const float V   = 0.30f;
    const float RVV = (R + 0.5f * V * V)/V;

    for(int opt = 0; opt < OPT_N; opt++)
    {
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float sqrtT = sqrtf(T);
        float d1 = logf(S/X)/(V*sqrtT) + RVV*sqrtT;
        float d2 = d1 - V * sqrtT;
        CNDD1 = CNDF(d1);
        CNDD2 = CNDF(d2);
        float expRT   = X * expf(-R * T);
        float CallVal = S * CNDD1 - expRT * CNDD2;
        CallResult[opt] = CallVal;
        PutResult[opt]  = CallVal + expRT - S;
    }

EMU version (exp2f/log2f):

    const float R      = 0.02f;
    const float V      = 0.30f;
    const float LN2_V  = M_LN2 * (1/V);
    const float RLOG2E = -R*M_LOG2E;
    const float RVV    = (R + 0.5f * V * V)/V;

    for(int opt = 0; opt < OPT_N; opt++)
    {
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float rsqrtT = 1/sqrtf(T);
        float sqrtT  = 1/rsqrtT;
        float d1 = log2f(S/X)*LN2_V*rsqrtT + RVV*sqrtT;
        float d2 = d1 - V * sqrtT;
        CNDD1 = CNDF(d1);
        CNDD2 = CNDF(d2);
        float expRT   = X * exp2f(RLOG2E * T);
        float CallVal = S * CNDD1 - expRT * CNDD2;
        CallResult[opt] = CallVal;
        PutResult[opt]  = CallVal + expRT - S;
    }
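The rewrite above rests entirely on the change-of-base identities exp(x) = exp2(x*M_LOG2E) and log(x) = log2(x)*M_LN2. The short, self-contained sketch below only checks those identities numerically; the test values and output formatting are illustrative, and it assumes a math library that exposes M_LOG2E and M_LN2 (as math.h does here).

    #include <math.h>
    #include <stdio.h>

    /* Sanity check of the change-of-base identities used above:
       exp(x) = exp2(x * M_LOG2E), log(x) = log2(x) * M_LN2.
       Test values are illustrative only. */
    int main(void)
    {
        const float xs[] = { 0.25f, 1.0f, 2.5f, 7.0f };
        for (int i = 0; i < 4; i++) {
            float x = xs[i];
            float e_ref = expf(x);
            float e_b2  = exp2f(x * (float)M_LOG2E);   /* one extra multiply */
            float l_ref = logf(x);
            float l_b2  = log2f(x) * (float)M_LN2;     /* one extra multiply */
            printf("x=%5.2f  expf=%g exp2f=%g  logf=%g log2f=%g\n",
                   x, e_ref, e_b2, l_ref, l_b2);
        }
        return 0;
    }

The base-2 forms are what allow the compiler to use the EMU's exp2/log2 paths on the coprocessor, which is where the savings in the Black-Scholes loop come from.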
Transcendental Functions
Step 5 Lab 1/3

Using Intel® Xeon Phi™ Coprocessors
• Use an Intel® Xeon® E5 2670 platform with an Intel® Xeon Phi™ Coprocessor
• Make sure you can invoke the Intel C/C++ Compiler
  – Try icpc -V to print out the banner of the Intel® C/C++ compiler
  – source /opt/intel/Compiler/2013.2.146/composerxe/pkg_bin/compilervars.sh intel64
• Build the native Intel® Xeon Phi™ Coprocessor application
  – Change the Makefile and use -mmic in lieu of -xAVX
• Copy the program from the host to the coprocessor
  – Find out the host name using hostname (suppose it returns esg014)
  – Copy the executable: scp ./MonteCarlo esg014-mic0:
  – Establish an execution environment: ssh esg014-mic0
  – Set the environment variable: export LD_LIBRARY_PATH=.
  – Optionally: export KMP_AFFINITY="compact,granularity=fine"
  – Invoke the program: ./MonteCarlo

Step 5 Transcendental Functions
• Use the transcendental functions in the EMU
• The inner loop calls expf(MuByT + VBySqrtT * random[pos]);
• Call exp2f instead and adjust the argument by a factor of M_LOG2E
• Combine that multiplication into the MuByT and VBySqrtT constants:

    float VBySqrtT = VLOG2E * sqrtf(T[opt]);
    float MuByT    = RVVLOG2E * T[opt];

Thread Affinity

More on Thread Affinity
• Bind the worker threads to specific processor cores/threads
• Minimizes thread migration and context switches
• Improves data locality; reduces coherency traffic
• Two components to the problem:
  – How many worker threads to create?
  – How to bind worker threads to cores/threads?
• Two ways to specify thread affinity
  – Environment variables: OMP_NUM_THREADS, KMP_AFFINITY
  – C/C++ API: kmp_set_defaults("KMP_AFFINITY=compact"); omp_set_num_threads(244);
• Add #include <omp.h> to your source file
• Compile with -openmp
• Use libiomp5.so

The Optimal Thread Number
• Intel MIC maintains 4 hardware contexts per core
  – Round-robin execution policy
  – Requires 2 threads per core for decent performance
  – Financial algorithms take all 4 threads to reach peak
• The Intel Xeon processor optionally uses HyperThreading
  – Execute-until-stall execution policy
  – Truly compute-intensive kernels peak with 1 thread per core
  – Finance algorithms like HyperThreading: 2 threads per core
• For an OpenMP application on a machine with NCORE cores, use these thread counts (a sketch combining them with the affinity API follows this slide):
  – Host only: 2 x ncore (or 1 x ncore if HyperThreading is disabled)
  – MIC native: 4 x ncore
  – Offload: 4 x (ncore-1); the OpenMP runtime avoids the core the OS runs on
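As a rough sketch of the API route from the two slides above (not the lab's Monte Carlo source): kmp_set_defaults() is the Intel OpenMP runtime extension already quoted here and has to run before the first parallel region; the count of 244 assumes a native run on a hypothetical 61-core coprocessor with 4 threads per core, so adjust it to your card.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Intel OpenMP extension: set affinity defaults before the
           runtime is initialized, i.e. before the first parallel region. */
        kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");

        /* 4 threads x 61 cores for a native run (assumed core count). */
        omp_set_num_threads(244);

        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d threads\n", omp_get_num_threads());
        }
        return 0;
    }

Build with icpc -openmp (add -mmic for a native coprocessor binary); the environment-variable route on the previous slide achieves the same thing without recompiling.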
Thread Affinity Choices
• Intel® OpenMP* supports the following affinity types:
  – Compact: assign threads to consecutive hardware contexts on the same physical core, to benefit from the shared cache
  – Scatter: assign consecutive threads to different physical cores, to maximize access to memory
  – Balanced: a blend of compact and scatter (currently only available for the Intel® MIC Architecture)
• You can also specify an affinity modifier:

  Modifier      Specifier         Description
  granularity   =core             Broadest granularity level supported. Allows all the OpenMP
                                  threads bound to a core to float between that core's thread contexts.
                =fine or =thread  The finest granularity level. Causes each OpenMP thread to be
                                  bound to a single thread context.

  The granularity modifier describes the lowest level at which OpenMP threads are allowed to float within the topology map.
• Explicit setting: KMP_AFFINITY set to granularity=fine,proclist="1-240",explicit
• Affinity is particularly important if not all available threads are used
• Affinity type is less important at full thread subscription

Thread Affinity in Monte Carlo
• Monte Carlo can take all threads available on a core
  – Enable HyperThreading for the Intel® Xeon® processor
  – Set -opt-threads-per-core=4 for the coprocessor build
• Affinity type is less important when all cores are used
  – Argument for compact: maximizes sharing of the random numbers in cache
  – Argument for scatter: maximizes the bandwidth to memory
• However, you do have to set a thread affinity type
  – API calls:

        #ifdef _OPENMP
        kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
        #endif

  – Environment variables:

        ~ $ export LD_LIBRARY_PATH=.
        ~ $ export KMP_AFFINITY="scatter,granularity=fine"
        ~ $ ./MonteCarlo

Thread Affinity
Step 5 Lab 2/3

Step 5 Thread Affinity
• Add #include <omp.h>
• Add the following lines before the very first #pragma omp:

    #ifdef _OPENMP
    kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
    #endif

• Add -opt-threads-per-core=4 to the Makefile

Prefetch and Streaming Stores

Prefetch on Intel Multicore and Manycore Platforms
• Objective: move data from memory into the L1 or L2 cache in anticipation of CPU loads/stores
• More important on the in-order Intel Xeon Phi coprocessor
• Less important on the out-of-order Intel Xeon processor
• Compiler prefetching is on by default for Intel® Xeon Phi™ coprocessors at -O2 and above
• Compiler prefetching is not enabled by default on Intel® Xeon® processors
  – Use the option -opt-prefetch[=n], n = 1..4
• Use the compiler reporting options to see detailed per-loop prefetching diagnostics
  – Use -opt-report-phase hlo -opt-report 3

Automatic Prefetches
Loop prefetch
• Compiler-generated prefetches target memory accesses in a future iteration of the loop
• They target regular, predictable array and pointer accesses
Interaction with the hardware prefetcher
• The Intel® Xeon Phi™ Coprocessor has a hardware L2 prefetcher
• If software prefetches are doing a good job, hardware prefetching does not kick in
• References not prefetched by the compiler may get prefetched by the hardware prefetcher

Explicit Prefetch
• Use intrinsics (a small sketch follows this slide)
  – _mm_prefetch((char *) &a[i], hint); see xmmintrin.h for the possible hints (L1, L2, non-temporal, …)
  – But you have to specify the prefetch distance yourself
  – There are also gather/scatter prefetch intrinsics; see zmmintrin.h and the compiler user guide, e.g. _mm512_prefetch_i32gather_ps
• Use a pragma/directive (easier):
  – #pragma prefetch a [:hint[:distance]]
  – You specify what to prefetch, but can let the compiler figure out how far ahead to do it
• Use compiler switches:
  – -opt-prefetch-distance=n1[,n2] specifies the prefetch distance (how many iterations ahead) for prefetches inside loops; n1 is the distance for prefetches from memory into L2, n2 for prefetches from L2 into L1
  – BlackScholes uses -opt-prefetch-distance=8,2
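To make the intrinsic route above concrete, here is a minimal sketch using only _mm_prefetch from xmmintrin.h. The array, the trivial loop body, and the distance of 64 elements are all hypothetical; a real distance has to be tuned to the loop's work per iteration and the memory latency being hidden.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    /* Sum a float array while software-prefetching PF_DIST elements
       ahead into L1. PF_DIST = 64 is illustrative only. */
    #define PF_DIST 64

    float sum_with_prefetch(const float *a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }

The #pragma prefetch form on the slide above expresses the same intent but lets the compiler pick the distance, which is usually the easier starting point.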
Streaming Store
• Avoids the read-for-ownership for certain memory write operations
• Bypasses the prefetch associated with that memory read
• Use #pragma vector nontemporal (v1, v2, …) to drop a hint to the compiler
• Without streaming stores: 448 bytes read/written per iteration
• With streaming stores: 320 bytes read/written per iteration
• Relieves bandwidth pressure; improves cache utilization
• -vec-report6 displays the compiler action

    for (int chunkBase = 0; chunkBase < OptPerThread; chunkBase += CHUNKSIZE)
    {
        #pragma simd vectorlength(CHUNKSIZE)
        #pragma vector aligned
        #pragma vector nontemporal (CallResult, PutResult)
        for(int opt = chunkBase; opt < (chunkBase+CHUNKSIZE); opt++)
        {
            float CNDD1; float CNDD2;
            float CallVal = 0.0f, PutVal = 0.0f;
            float T = OptionYears[opt];
            float X = OptionStrike[opt];
            float S = StockPrice[opt];
            ……
            CallVal = S * CNDD1 - XexpRT * CNDD2;
            PutVal  = CallVal + XexpRT - S;
            CallResult[opt] = CallVal;
            PutResult[opt]  = PutVal;
        }
    }

  Compiler diagnostics:
    bs_test_sp.c(215): (col. 4) remark: vectorization support: streaming store was generated for CallResult.
    bs_test_sp.c(216): (col. 4) remark: vectorization support: streaming store was generated for PutResult.

Data Blocking

Data Blocking
• Partition the data into small blocks that fit in the L2 cache
  – Exploit data reuse in the application
  – Ensure the data remains in the cache across multiple uses
  – Using the data in cache removes the need to go to memory
  – A bandwidth-limited program may then execute at its FLOPS limit
• Simple 1D case
  – Data of size DATA_N is used WORK_N times by hundreds of threads
  – Each thread handles one piece of work and has to traverse all the data
• Without blocking
  – Hundreds of threads pound on different areas of DATA_N
  – The memory interconnect limits performance
• With blocking
  – A cacheable block of BSIZE data is processed by all threads at a time
  – Each block is read once and kept being reused until all threads are done with it

  Without blocking (pseudocode; a runnable sketch follows this slide):

    #pragma omp parallel for
    for(int wrk = 0; wrk < WORK_N; wrk++)
    {
        initialize_the_work(wrk);
        for(int ind = 0; ind < DATA_N; ind++)
        {
            dataptr datavalue = read_data(ind);
            result = compute(datavalue);
            aggregate = combine(aggregate, result);
        }
        postprocess_work(aggregate);
    }

  With blocking:

    for(int BBase = 0; BBase < DATA_N; BBase += BSIZE)
    {
        #pragma omp parallel for
        for(int wrk = 0; wrk < WORK_N; wrk++)
        {
            initialize_the_work(wrk);
            for(int ind = BBase; ind < BBase+BSIZE; ind++)
            {
                dataptr datavalue = read_data(ind);
                result = compute(datavalue);
                aggregate[wrk] = combine(aggregate[wrk], result);
            }
            postprocess_work(aggregate[wrk]);
        }
    }
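The two loop nests above are pseudocode (read_data(), compute(), combine() are placeholders). The self-contained sketch below shows the same blocked structure in runnable form; the sizes, the multiply-accumulate standing in for compute()/combine(), and the per-work-item aggregate array are illustrative assumptions, not the lab code.

    #include <stdio.h>
    #include <stdlib.h>

    #define DATA_N  (1 << 20)    /* illustrative data size            */
    #define WORK_N  256          /* illustrative number of work items */
    #define BSIZE   (16 * 1024)  /* block sized to fit a thread's L2 share */

    int main(void)
    {
        float *data      = (float *)malloc(sizeof(float) * DATA_N);
        float *aggregate = (float *)calloc(WORK_N, sizeof(float));
        for (int i = 0; i < DATA_N; i++) data[i] = (float)i / DATA_N;

        /* Blocked traversal: every work item consumes one cacheable
           block before any thread moves on to the next block. */
        for (int BBase = 0; BBase < DATA_N; BBase += BSIZE) {
            #pragma omp parallel for
            for (int wrk = 0; wrk < WORK_N; wrk++) {
                float partial = 0.0f;
                for (int ind = BBase; ind < BBase + BSIZE; ind++)
                    partial += data[ind] * (wrk + 1);  /* stand-in for compute() */
                aggregate[wrk] += partial;             /* stand-in for combine() */
            }
        }

        printf("aggregate[0] = %f\n", aggregate[0]);
        free(data); free(aggregate);
        return 0;
    }

Note that the block loop stays outside the parallel region, exactly as in the slide: all threads finish a block before the next one is streamed in.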
Blocking in Monte Carlo European Options
• Each thread runs Monte Carlo using all the random numbers
• The random numbers are too big to fit in each thread's L2
  – Random number size is RAND_N * sizeof(float) = 1 MB, i.e. 256K floats
  – Each thread's share of the 512 KB L2 is 128 KB; effectively 64 KB, or 16K floats
• Without blocking:
  – Each thread makes an independent pass over the RAND_N data for all the options it runs
  – The interconnect is busy satisfying read requests from different threads at different points
  – Prefetched data from different points also saturates the memory bandwidth
• With blocking:
  – The random numbers are partitioned into cacheable blocks
  – A block is brought into cache once the previous block has been processed by all threads
  – It remains in the caches until all threads have finished their passes over it
  – Each thread repeatedly reuses the data in cache for all the options it needs to run

    const int nblocks = RAND_N/BLOCKSIZE;
    for(int block = 0; block < nblocks; ++block)
    {
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, BLOCKSIZE, random, 0.0f, 1.0f);
        #pragma omp parallel for
        for(int opt = 0; opt < OPT_N; opt++)
        {
            float VBySqrtT = VLOG2E * sqrtf(T[opt]);
            float MuByT    = RVVLOG2E * T[opt];
            float Sval = S[opt];
            float Xval = X[opt];
            float val = 0.0, val2 = 0.0;
            #pragma vector aligned
            #pragma simd reduction(+:val) reduction(+:val2)
            #pragma unroll(4)
            for(int pos = 0; pos < BLOCKSIZE; pos++)
            {
                … … …
                val  += callValue;
                val2 += callValue * callValue;
            }
            h_CallResult[opt]     += val;
            h_CallConfidence[opt] += val2;
        }
    }
    #pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        const float val   = h_CallResult[opt];
        const float val2  = h_CallConfidence[opt];
        const float exprt = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        … … …
        h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    }

Data Blocking
Step 5 Lab 3/3

Step 5 Data Blocking
• Apply 1D data blocking to the random numbers
  – L2: 512 KB per core, 128 KB per thread; your application can use about 64 KB
  – BLOCKSIZE = 16*1024
• Change the inner loop bound from RAND_N to BLOCKSIZE and loop over the blocks, saving intermediate results:

    const int nblocks = RAND_N/BLOCKSIZE;
    for(block = 0; block < nblocks; ++block)
    {
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, BLOCKSIZE, random, 0.0f, 1.0f);
        <<<Existing Code>>>
        // Save intermediate results here
        h_CallResult[opt]     += val;
        h_CallConfidence[opt] += val2;
    }

• Add an initialization loop before the triple-nested loops:

    #pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        h_CallResult[opt]     = 0.0f;
        h_CallConfidence[opt] = 0.0f;
    }

• Move the loop that computes the per-option results from the middle loop to outside the loops:

    #pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        const float val   = h_CallResult[opt];
        const float val2  = h_CallConfidence[opt];
        const float exprt = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt] = exprt*val*INV_RAND_N;
        const float stdDev = sqrtf((F_RAND_N * val2 - val * val) * STDDEV_DENOM);
        h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    }

Run your program on the Intel® Xeon Phi™ Coprocessor
• Build the native Intel® Xeon Phi™ Coprocessor application
  – Change the Makefile and use -mmic in lieu of -xAVX
• Copy the program from the host to the coprocessor
  – Find out the host name using hostname (suppose it returns esg014)
  – Copy the executable: scp MonteCarlo esg014-mic0:
  – Establish an execution environment: ssh esg014-mic0
  – Set the environment variable: export LD_LIBRARY_PATH=.
  – Invoke the program: ./MonteCarlo

Summary
• Base-2 exponential and logarithm functions are always faster than other bases on Intel multicore and manycore platforms
• Set thread affinity when you use OpenMP*
• Let the hardware prefetcher work for you; fine-tune your loops with the prefetch pragmas/directives
• Optimize your data access and rearrange your computation around the data in cache