Experiences with Xeon Phi coprocessors at the Abel compute cluster at UiO 2014-2015
Ole W. Saastad, PhD
UiO/USIT/UAV/ITF/FI
Feb. 2015

Experiences with Xeon Phi (aka MIC) at the Abel compute cluster at the University of Oslo 2014-2015.

Preface

The introduction of many integrated core processors from Intel has made massively parallel computation imperative. The motivation has been the very high compute capacity: a large number of relatively low performing cores yields a high combined performance. The theoretical performance of a single processor, using 60 cores, is over 1 Tflops/s in double precision. A typical node with Sandy Bridge processors in the Abel cluster has a theoretical performance of 166 Gflops/s per processor and delivers about 160 Gflops/s per processor when running the top500 test. Compare this to a matrix multiplication test on the Xeon Phi that clocks in at slightly less than 500 Gflops/s per processor, about 3 times the Sandy Bridge per-processor performance. All this using the same source code, just a recompile. The fraction of theoretical performance seen in real applications and real benchmarks is far lower than with Sandy Bridge. Even so the Xeon Phi is a formidable processor. The Xeon Phi is an x86-64 processor using the same Intel compilers as all other Intel processors. It is not directly binary compatible; many less fortunate legacy constructs have been phased out.

Performance is reported in two different ways: in many cases as wall clock time, which is the total time from launch to completion, and in some cases as a performance measurement of the kind jobs per unit of time. The first is of the category lower is better while the latter is higher is better. Please examine the graphs and notice which measurement type is used.

Introduction

The Intel Many Integrated Cores (MIC) architecture is developed from the idea that higher performance can be attained by using more of the transistors for computation. Most of the transistors in a modern big-core processor like Sandy Bridge do not do calculation. The idea behind the MIC architecture was to employ a rather large number of simpler cores on the chip; combined they would yield a higher performance than a smaller number of big cores. Big cores contain elaborate hardware prefetch, out-of-order execution, branch prediction and nested levels of cache, so a large fraction of the transistors do not really take part in calculation, i.e. they are not vector units or ALUs. If, on the other hand, a far higher fraction of the transistors could be exploited to do calculation, better utilization and performance could be obtained. This is what is done in MIC and in GPUs; in the GPUs an even larger number of very small cores do the calculations.

Like the GPUs, the Xeon Phi (the trade name) also comes on an extension card, figure 1. Later versions might be offered as a complete motherboard; no definite date is given.

Figure 1: Overview of coprocessor and host system.

Presently the Xeon Phi is offered as a PCIe add-on card hosted by an Intel Sandy Bridge based host. The limited PCIe bus bandwidth can represent a bottleneck. The processor is made up of a rather large number of simple x86-64 cores. The cores of the Intel MIC are based on a modified version of the P54C design used in the original Pentium. The basis of the Intel MIC architecture is to leverage the x86 legacy by creating an x86-compatible multiprocessor architecture that can utilize existing parallelisation software tools.
As with the Itanium architecture, more is left to the software, compilers and libraries. Intel provides all the compiler, math, threading and MPI libraries for the MIC architecture.

The Intel Xeon Phi coprocessor supports the same floating-point data types as the Intel Xeon processor. Single (32-bit) and double (64-bit) precision are supported in hardware; quadruple (128-bit) precision is supported through software. Extended (80-bit) precision is supported through the x87 instruction set. The same set of rounding modes is supported as for Intel Xeon processors.

Figure 2 shows the layout of the processor; all the cores, the memory controllers and the PCIe interface are located along a bidirectional ring bus. Each core has both a level one and a level two cache. The system is fully cache coherent (cc); the boxes marked TD are the "tag directories" holding the cache coherency directory for the cache lines. The L2 cache of each core is inclusive of the L1 data and instruction caches. The L2 comprises 64 bytes per way with 8-way associativity, 1024 sets, 2 banks, 32 GB (35 bits) of cacheable address range and a raw latency of 11 clocks. The L2 cache is part of the Core-Ring Interface block, which also houses the tag directory (TD). Both the L1 and L2 caches use the standard MESI protocol for maintaining the shared state among cores. See table 1 for information about the L1 and L2 caches.

Figure 2: Intel Xeon Phi chip architecture layout. Notice the ring bus as interconnect.

Parameter       L1               L2
Coherence       MESI             MESI
Size            32 KB + 32 KB    512 KB
Associativity   8-way            8-way
Line            64 bytes         64 bytes
Banks           8                8
Access Time     1 cycle          11 cycles
Policy          pseudo LRU       pseudo LRU
Duty Cycle      1 per clock      1 per clock
Ports           Read or Write    Read or Write

Table 1: Characteristics of the cache hierarchy of the MIC architecture.

The MIC architecture supports a 40-bit physical address in 64-bit mode and 4-KB and 2-MB page sizes. On a TLB miss, a four-level page table walk is performed as usual, and the INVLPG instruction works as expected. The coprocessor core implements two memory types: uncacheable (UC) and write-back (WB). No other memory type is legal or supported.

Figure 3: Multithreading architectural support in the Intel® Xeon Phi™ coprocessor.

Multithreading is a key element to hide latency and is used extensively in the MIC architecture. There are also additional reasons for keeping several threads in flight. The vector unit can only issue a vector instruction from one instruction stream every second clock cycle. This constraint requires that at least two threads per core are scheduled at the same time to fill the vector pipeline; the vector unit can execute one instruction per clock cycle if fed from two different threads. Issuing more threads helps both to hide latency and to fill the vector pipeline. The four hardware threads available help to accommodate this. Figure 3 shows an overview of the hardware threads in the early stage of execution.

Figure 4: Intel Xeon Phi Knights Corner core. Note that only pipe 0 can issue instructions to the vector unit.

Figures 4 and 5 show a detailed and a simplified layout of the compute core. There are two pipes that feed instructions into either the scalar unit, the x87 unit or the 512-bit wide vector unit. For HPC it is the vector unit that attracts special attention.

Figure 5: Simplified schematic of the MIC core.
The vector unit can hold and operate simultaneously on 8 double precision or 16 single precision floating point numbers and can provide one result per clock cycle. With a clock frequency of about 1 GHz the maximum theoretical performance of the vector unit is 8 Gflops/s in double precision. With 60 such cores this yields an aggregated floating point performance of 480 Gflops/s. Using fused multiply-add (FMA), twice this performance is theoretically possible.

Figure 6: Vector units on Intel processors, SSE (Pentium III), AVX (Sandy Bridge) and MIC-512 (MIC architecture).

Figure 6 shows the evolution of the Intel vector units since the introduction of the 128-bit SSE (successor to the MMX instructions first introduced in 1997) found in the Pentium III in 1999, through AVX introduced with Sandy Bridge in 2011, and finally the 512-bit wide vector unit found in the Knights Corner cores introduced in 2013. Wider vector instructions are highly beneficial for the vector operations frequently found in scientific applications. Vector operations with stride one map very well onto this kind of vector unit. As an example, the practical performance measured for matrix-matrix multiplication using the Math Kernel Library (MKL) is 459 Gflops/s in double precision. How much of this can be harvested in user applications is the task of the compiler and ultimately the programmer.

All the cores, caches, interfaces and memory channels are connected to the interconnect bus. This is a bidirectional ring that provides efficient transport between all the elements within the chip. Intel might change this to a mesh or something else in the future.

In addition Intel introduced the well known fused multiply-add instruction with the MIC architecture. This instruction is well known from all supercomputer architectures. It was the instruction that increased the Cray-1 performance in 1976 from 80 Mflops/s without it to 140 Mflops/s with it (interesting to note that Cray did not claim a 2x performance gain as Intel does today). The FMA instruction is very well suited for vector and matrix operations, the prime example being matrix-matrix multiplication. Fused multiply-add instructions come in two kinds, one with three arguments (FMA3) and one with four arguments (FMA4). GPU cards have supported this instruction since about 2009.

Figure 7: Fused multiply-add (FMA) instruction schematic. For FMA3 the result must be one of a, b or c. Most common is accumulation of the type a = a + b*c.

The fused multiply-add instruction is not among the most used floating point operations and is not always easy to utilize. When theoretical performance numbers are posted they always assume this instruction together with a fully loaded vector unit.

The memory subsystem is based on GDDR5 memory. GDDR5 SDRAM is high-bandwidth memory generally found on graphics cards, not computing engines. GDDR5 memory supports very high data rates in the tens of Gbit/s using multi-GHz transfer clocks. These SDRAMs also cost more per Gbit than bulk SDRAM, but you are paying for performance.

Figure 8: Layout of the coprocessor board, chip and memory. There are four memory controllers.

Since all memory appears uniform it looks like accesses are interleaved over the memory banks. A simple stream memory benchmark test shows that over 150 GiB/s is easy to obtain, twice that of a Sandy Bridge based node. This demonstrates the very high bandwidth of the GDDR5 memory.
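Kernels measured later in this report, such as stream triad and DAXPY, combine the two properties discussed in this section: unit-stride access that streams from the GDDR5 memory, and a multiply-add loop body that maps onto the 512-bit vector FMA instructions. A minimal Fortran sketch of such a loop (not taken from the report) is shown below.

   subroutine daxpy_like(y, x, a, n)
     ! y = y + a*x : the multiply-accumulate pattern the 512-bit vector unit
     ! can execute as one fused multiply-add per 8 double precision elements
     implicit none
     integer, intent(in)    :: n
     real(8), intent(in)    :: a, x(n)
     real(8), intent(inout) :: y(n)
     integer :: i
   !$omp parallel do
     do i = 1, n
        y(i) = y(i) + a*x(i)
     end do
   end subroutine daxpy_like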
Demonstrations of the usage of streaming stores from Intel emphasize the problem of saturating the interconnect ring. Again it turns out that software and programming skills are needed to exploit this new architecture fully.

Programming experiences

The common Intel programming tools all support the MIC architecture. The compilation and build process is what is known as cross compiling. This can have consequences when build scripts try to verify that the compilers work by testing whether executables can be built and run: executables built for the MIC architecture cannot be run on Sandy Bridge. Special attention is needed when dealing with such builds. For all other compilations the only thing needed is to inform the compiler that you want to compile for MIC. This is easily done with the flag "-mmic".

Programming for the so-called native model

The native model is the programming model in which the Xeon Phi coprocessor is used as a stand-alone Linux system. You log in to the BusyBox Linux and run programs just as you would on a normal compute node. Typically a directory is NFS mounted to share files with the host system. As an example, compilation of the simple stream memory bandwidth benchmark on the host can look like this:

   icc -mmic -O3 -openmp -o stream.x stream.c

The executable can be copied over to the target system and run. Any test of its execution on the host system fails, even the library check tool ldd; it will just report "not a dynamic executable", while on the target system it displays the dynamic libraries as normal.

Even if the compilers are identical, the compiler flags differ from Sandy Bridge. The most noticeable is the flag that triggers generation of the new fused multiply-add (FMA) instruction, where a=b*c+d (FMA4) and a=b*c+a (FMA3). The flag to invoke this is "-fma". Intel uses this instruction when calculating theoretical performance. With 8 double precision (DP) numbers in the vector unit and one result per clock cycle we arrive at about 16 Gflops/s per core: one multiply and one add, two floating point operations per clock tick. This number is somewhat optimistic. Whether it is possible to write programs that schedule 120 threads or more with long vector tasks capable of filling all vector pipes is an open question. Not only do the vectors need to be filled, the fused multiply-add must also make up a rather large fraction of the code. Maybe this is only possible for the top500 test? Assuming no FMA instructions we arrive at 8 Gflops/s per core, which is 480 Gflops/s using 60 cores. Still a formidable performance.

Intel MKL is also ported to and supported on the Intel MIC architecture. This makes it easy to port applications that rely on the functions within that library. Intel MPI is also ported to MIC and runs without any extra installation; just copy the bin and lib directories to the MIC, set the relevant paths, and things run and resolve with minimal problems. Compiling MPI programs with Intel MPI is just as easy as for non-MPI programs; all include files etc. are available for MIC. In short, programming and porting applications to run natively on the MIC architecture is very easy. Tuning for application performance is another matter, but this is true for all types of accelerators.
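Putting the pieces together, a complete native-model session amounts to a cross compile on the host followed by a copy and a remote run. The sketch below assumes the card is reachable as mic0 and that the MIC builds of the runtime libraries are visible on the card; the copy step and the thread settings are illustrative.

   icc -mmic -O3 -openmp -o stream.x stream.c    # cross compile on the host
   scp stream.x mic0:                            # or place it in the NFS mounted directory
   ssh mic0 "OMP_NUM_THREADS=240 KMP_AFFINITY=scatter ./stream.x"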
Programming for the offload model

The offload programming model treats the Xeon Phi as a co-processor. Part of the executable code is executed on the mic co-processor. The program is run on the host processor and is compiled for the Sandy Bridge architecture; the only difference is that some part of the executable is executed on the co-processor. The part of the code run on the co-processor can be a library function like MKL, a user written function or routine, or a region of the program. In the cases involving user written code the compiler must be instructed to generate cross compiled code for the mic architecture. This is done with compiler directives, much like OpenMP directives.

In the case of MKL offload functions very little extra is needed by the programmer, just setting the right environment variables. If the co-processor is present, MKL will automatically execute the MKL routines on the co-processor. However, only a very limited set of MKL functions is ported at the time of writing. On the other hand, MKL automatic offload does load balancing between the host processors and the co-processor.

To offload user written code or functions there is a bit more setup to be done. A number of directives and data movements must be taken care of; all of this is done with compiler directives. Load balancing must be done explicitly by the user, and the partitioning of workload between the host processors and the co-processor must be done manually. Functions can however be run concurrently, so that the host processor can work on one function while the co-processor works on another, as long as there is no shared data, since shared data would require synchronisation. This is possible to achieve, but the complexity can be quite high. The host memory and the device memory are not shared; data must be explicitly copied between the two.

Compiling is relatively easy; only a few extra flags are needed. Code to be executed on the mic is automatically generated by the compiler.

   ifort -offload-attribute-target=mic -openmp -O3 -xhost -o mxm.x mxm.F90 -mkl

Performance evaluation

Native execution

Stream memory bandwidth benchmark

Stream is a well known benchmark for measuring memory bandwidth, written by John D. McCalpin of TACC. TACC also happens to host the large supercomputer system called "Stampede", which is an accelerated system using a large array of Intel Xeon Phis. Stream can be built in several ways; it turned out that static allocation of the three vectors to operate on provided the best results. The source code illustrates how the data is allocated:

   # ifndef USE_MALLOC
   static double a[N+OFFSET], b[N+OFFSET], c[N+OFFSET];
   # else
   static volatile double *a, *b, *c;
   # endif

   #ifdef USE_MALLOC
   a = malloc(sizeof(double)*(N+OFFSET));
   b = malloc(sizeof(double)*(N+OFFSET));
   c = malloc(sizeof(double)*(N+OFFSET));
   #endif

The figure below shows the difference between using malloc and static allocation. For this reason all subsequent stream runs were done using static allocation.

Figure 9: Stream benchmark, dynamic (malloc) versus static allocation (bandwidth in GiB/s for Copy, Scale, Add and Triad). Effect of the data allocation scheme; both runs with compact placement.
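The two cases in figure 9 correspond simply to building the benchmark with or without the USE_MALLOC macro from the excerpt above; a sketch of the two build lines (output names are illustrative):

   icc -mmic -O3 -openmp -DUSE_MALLOC -o stream_malloc.x stream.c
   icc -mmic -O3 -openmp -o stream_static.x stream.c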
To achieve optimum performance on this special architecture, which lacks out-of-order execution and fancy prefetch machinery and relies on a high core count and a large number of threads, one needs to take extra care when compiling programs. There is a large range of compiler switches that can help with this. However, the switches may have a different effect on the MIC architecture than on Sandy Bridge. One such optimisation is prefetching. The MIC relies much more on software prefetching, as these cores lack an equally efficient hardware prefetcher; more is left to the programmer and the compiler.

Figure 10: Stream benchmark, effect of compiler prefetch options (bandwidth in GiB/s for Copy, Scale, Add and Triad; options used: -opt-prefetch=4 -opt-prefetch-distance=64,32).

The figure above shows the beneficial effect of providing prefetch options to the C compiler, enabling it to issue prefetch instructions in loops and other places where memory latencies impact performance. The distances, in numbers of iterations, given to the prefetcher options are found by trial and error; in most cases there is no magic number that is optimal in all cases. Additionally one might use streaming store instructions to prevent stores from writing to cache; there is no reason to pollute the cache with data that is not reused. For the tests done here it had a small effect: bandwidth for Triad went up from 134.9 to 135.2 GiB/s, hardly significant.

Placement of threads on cores is also a major performance issue. Hardware threads are grouped onto cores, within which they share resources. This sharing of resources can benefit or harm particular algorithms, depending on their behaviour. Understanding the behaviour of your algorithms will guide you in selecting the optimal affinity. Affinity can be specified in the environment (KMP_AFFINITY) or via a function call (kmp_set_affinity). There are basically three models: compact, scatter and balanced. In addition there is granularity. With granularity set to fine, each OpenMP thread is constrained to a single hardware thread. Alternatively, setting core granularity groups the OpenMP threads assigned to a core into little quartets with free rein over their respective cores. The affinity types compact and scatter either clump OpenMP threads together on as few cores as possible or spread them apart so that adjacent thread numbers land on different cores. Sometimes, though, it is advantageous to run a limited number of threads distributed across the cores in a manner that leaves adjacent thread numbers on the same cores. For this there is a newer affinity type, balanced, which does exactly that. Using the verbose affinity setting you can determine how the OpenMP thread numbers are dispersed.

Figure 11: Placement of threads onto cores using the balanced affinity setting.

Figure 12: Placement of threads onto cores using the scatter affinity setting.

Figure 13: Illustration of the compact placement model. All 60 threads are scheduled as close together as possible; four threads share a single core even if there are idle cores. Beneficial when the threads can share the L2 cache.

Figure 14: Illustration of the scatter placement model. All 60 threads are scheduled as far apart as possible, in this case one thread per core. Beneficial when memory bandwidth is required.

In addition the number of threads employed in the test is interesting. Will one thread per core saturate the interconnect ring, memory controllers or memory?
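A sketch of how thread count and placement were controlled for a native run is shown below; the values are illustrative, and the verbose modifier makes the OpenMP runtime print the resulting thread-to-core map at start-up.

   export OMP_NUM_THREADS=120
   export KMP_AFFINITY=granularity=fine,balanced,verbose
   ./stream.x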
Figure 15: Stream benchmark, effect of the core affinity setting (compact, scatter and balanced; bandwidth in GiB/s for Copy, Scale, Add and Triad). 60 threads are used in this test; for 240 threads the effect is small.

Figure 16 shows the effect on performance when using one, two, three or four threads per core. When effectively run using all options the 60 cores are capable of saturating the memory bandwidth. The stream benchmark is just copying data and does not perform any significant calculation, so the effect of more cores does not show up when all the memory bandwidth is already utilized.

Figure 16: Stream benchmark, size 4.5 GiB, affinity=scatter. Effect of the number of scheduled threads (60, 120, 180, 240) on stream performance.

FFT – MKL / FFTW interface

FFT is used in a large number of applications and needs no further introduction. One of the most common implementations is FFTW. While the FFTW implementation can be successfully built on SB it is not so easy to cross build for MIC, hence MKL with its associated FFTW interface is used to assess the scaling and performance. As a reference for performance and scaling some runs using SB and Haswell are included.

Figure 17: 2d-FFT performance (wall time in seconds versus thread count) using Sandy Bridge and Haswell processors with FFTW/MKL. SB outperforms HW in this benchmark due to a larger L2 cache (20 MB vs 12 MB), an example where cache size matters. Size of the 2d NxN array: N=20000.

Figure 18: 2d-FFT performance (wall time in seconds versus core count) using Xeon Phi and MKL with the FFTW interface. Size of the problem (NxN): N=18000. Scaling is good up to about 16 to 32 cores; beyond this scaling is poor, which limits the Xeon Phi performance as this architecture relies on strong scaling.

Figure 19: 2d-FFT scaling (speedup versus core count) comparing Sandy Bridge and Xeon Phi, MKL using the FFTW interface. The stronger scaling experienced with Xeon Phi is evident. Size: N=18000.

Figure 20: 2d-FFT performance (wall time in seconds) of Sandy Bridge compared to Xeon Phi. Sandy Bridge clearly outperforms the Xeon Phi. The good scaling shown by Xeon Phi in the figure above is not enough, as this problem does not scale perfectly. Xeon Phi would have beaten SB if it had scaled to all 240 threads, with a calculated run time of 1.67 seconds. Size as above.
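The benchmark uses the FFTW-style interface that MKL provides, so the same source can be linked against MKL for both host and MIC builds. A minimal Fortran sketch of a 2-d complex-to-complex transform through this interface is shown below; the problem size and the in-place transform are illustrative (the runs above used N=18000), and linking with -mkl as in the compile lines shown earlier is assumed.

   program fft2d_sketch
     implicit none
     include 'fftw3.f'                  ! FFTW constants, shipped with the MKL FFTW3 wrappers
     integer, parameter :: n = 8192     ! illustrative size; the report used N=18000
     integer(8) :: plan
     complex(8), allocatable :: a(:,:)
     allocate(a(n,n))
     a = (1.0d0, 0.0d0)
     ! plan and execute an in-place forward 2-d transform; the work inside MKL is
     ! threaded and controlled by OMP_NUM_THREADS and KMP_AFFINITY as for any OpenMP run
     call dfftw_plan_dft_2d(plan, n, n, a, a, FFTW_FORWARD, FFTW_ESTIMATE)
     call dfftw_execute_dft(plan, a, a)
     call dfftw_destroy_plan(plan)
   end program fft2d_sketch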
NAS kernels benchmark

The NAS benchmarks are well known. They are mostly known as MPI benchmarks, but have been rewritten in OpenMP versions and other parallel implementations. As the Xeon Phi is a general processor it can run both OpenMP threaded shared memory code and distributed memory code like MPI, hence both implementations have been tested, with most attention on the threaded version as this is the more interesting one on a cache coherent shared memory system.

For this kind of benchmark, based on real applications, all the optimisations come into play: prefetch, placement and affinity, threads per core, etc. The effect of different placements is shown in figure 21, where the three models compact, scatter and balanced are compared. The best result for each test is used; the actual number of cores may change, as behaviour changes with most parameters. The performance effect of placement is significant and care must always be taken to select the optimal affinity. Which placement model yields the best performance is not obvious. For small selected problems where all data for two or more threads can be kept in the L2 cache a compact model might be the best option. However, if those threads compete for the execution units the core might be starved, but bear in mind that the vector unit can only schedule one instruction from each thread every other cycle; at least two threads are needed to fill the vector pipeline. In addition memory bandwidth is often a limiting factor. One core has a certain bandwidth, and by spreading the threads onto many cores the total aggregated bandwidth is far larger than that of a smaller set of cores.

Figure 21: NPB benchmark, effect of processor placement using the three affinity models compact, scatter and balanced (performance relative to compact placement for BT.C, CG.C, EP.C, FT.B, IS.C, LU.C, MG.B and SP.C).

Since the simplified cores of the MIC architecture possess only modest hardware prefetch machinery, a bigger burden is placed on the programmer and ultimately the compiler. Efficient use of prefetch on a cache based system is needed to hide the very long memory latencies. Memory latency is often in excess of 100 ns (over 100 clock cycles) while the cache latency is one tenth of this, so getting data into the cache before it is needed is quite important. As there is no crystal ball or psychic inside, one must rely on clever guesswork, or as in my case trial and error. Setting the prefetch to fetch data too far into the future is counterproductive, either by exhausting the TLB or by polluting the L2 cache.

Figure 22: NPB benchmark, effect of various compiler prefetch settings (software prefetch off, compiler default, and distances 64,32 / 4,2 / 4), shown as performance relative to the compiler defaults. Selecting off leaves it all to the rather simple hardware prefetcher.

All this performance optimization is needed to harvest the power of the 60 cores with 240 hardware threads. The ultimate litmus test is the comparison with the host processor, Sandy Bridge (details about the SB node are found in the appendix).

Figure 23: Single Sandy Bridge processor performance relative to a single Xeon Phi processor for the OpenMP NPB kernels. Some selected problems outperform SB, while the majority struggle to beat Sandy Bridge.

It is evident from figure 23 that more work is needed to fully exploit the power of the MIC architecture. The NPB is made up of small kernels taken from real world applications and is believed to mimic the scientific applications in production.
EuroBen shared-memory benchmark

This is a shared memory version of the well known benchmark from 1991, originally a serial benchmark for vector supercomputers. It is biased towards raw vector performance, which often coincides with typical Fortran based programs solving problems expressed as vectors. The implementation is based on OpenMP for the threading. As this is loop and data based, scaling is an issue. One might expect the OpenMP threading to limit the scaling and hence the attainable performance; on the other hand these problems are represented by vector operations which Fortran handles very well and which map nicely onto vector units. In addition the very high memory bandwidth of the MIC architecture is beneficial. Only a few kernels have been selected; table 2 shows them. The Fourier transform is an interesting one as it exhibits poor scaling and is an example of the challenges one might encounter when porting code to Xeon Phi.

EuroBen kernel     Operation
Mod 1a Kernel 2    Vector copy – y(i) = x(i), i=1,n
Mod 1a Kernel 8    DAXPY – y(i) = y(i) + const*x(i), i=1,n
Mod 2am            Matrix multiplication – C(m,n) = A(m,l)*B(l,n)
Mod 2b             Full linear solver – Ax = b
Mod 2f             FFT – 1d, complex to complex transform

Table 2: Selected EuroBen kernels for evaluation of the Xeon Phi system.

Only the supplied Fortran source code is used in these tests. Several of the kernels could benefit from MKL, but this is not what is under test in this run; evaluation of MKL performance is another benchmark task.

Prefetch is left mostly to the programmer and the compiler, and the optimal settings can sometimes be hard to find. Figure 24 illustrates the effect of how many loop iterations into the future to prefetch. Fetching too much data might saturate the L2 cache or the TLB. At the same time it is worth noting that loop unrolling is not always productive on this architecture; use unrolling with caution.

Figure 24: EuroBen Mod2b, effect of different compiler prefetch settings (performance in Gflops/s for prefetch distances ranging from the compiler default and off up to 64,32). The distances are set in loop iterations; the first number is the prefetch distance into L2 while the second number is the distance into L1. If not given, the compiler tries to guess an optimal value.

What effect can be expected from tuning the prefetch settings with the compiler? Figure 25 shows the possible gains that can be obtained for simple benchmark kernels.

Figure 25: Effect of tuning the prefetch distance using compiler options (performance relative to the default software prefetch for the Vector copy, DAXPY, Matmul, Solver and FFT kernels). Distance is in loop iterations. The unroll option might upset the loop content and must be used with caution together with the prefetch distance setting.

Tuning of prefetch settings is a manual, tedious process which is not always easy. For some cases, like the FFT here, the process is in fact counterproductive. Care must be taken during the process, and for each step one must measure the performance. When it all fits together quite good performance increases are possible.
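The prefetch distances in figures 10, 24 and 25 are controlled with the compiler options already mentioned; a sketch of such a build line is shown below, where the source file name and the chosen distances are illustrative and the best distances were found by trial and error as described above.

   ifort -mmic -O3 -openmp -opt-prefetch=4 -opt-prefetch-distance=16,2 -o mod2b.x mod2b.f90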
Placement of the threads is also very important; the three models compact, scatter and balanced do yield different performance. The placement can sometimes be guessed, while for other problems it must be tested. Figure 26 shows the effect of the different placements on performance.

Figure 26: EuroBen, effect of process placement (performance of the compact, scatter and balanced placements relative to each other for the Vector copy, DAXPY, Matmul, Solver and FFT kernels). The best performance for each test is used; performance varies with the number of scheduled threads per core and the best values are taken.

How does the performance of the Intel Xeon Phi compare with Sandy Bridge for this kind of benchmark? The goal is always to beat SB; if not, there would be nothing to gain from porting to the Intel Xeon Phi. Figure 27 illustrates the performance difference.

Figure 27: Relative performance of the Intel Xeon Phi compared with Sandy Bridge for the tested EuroBen kernels.

For most of the kernels the Xeon Phi performs well. The FFT kernel scales poorly and consequently does not benefit from the large number of cores in the Xeon Phi. Vector copy and DAXPY perform very well; most of this is probably an effect of the very high memory bandwidth. The FFT kernel does not scale very well in this OpenMP implementation, and performance on the Xeon Phi is only a fraction of the Sandy Bridge performance. This is an example of the kind of performance issues one might come across when porting code to Xeon Phi. One possible solution is to rewrite the routine to take advantage of the MKL FFT functions, which by the way include an interface to FFTW. The mod2f kernel is, however, not easily adapted to FFTW or MKL, so this is not really an option for this kernel.

HYDRO benchmark

HYDRO is a much used benchmark in the PRACE community; it is extracted from a real code (RAMSES, a computational fluid dynamics code). Being widely used it has been ported to a number of platforms. The code exists in many versions, Fortran 90, C, CUDA and OpenCL, as well as serial, OpenMP and MPI versions of these. Some versions have been instrumented with performance counters to calculate the performance in Mflops/s. The instrumented version is a Fortran 90 version, and this version in both OpenMP and MPI variants has been used for the evaluation. How well this architecture scales using MPI, or a hybrid mode using both MPI and OpenMP, is of interest in the initial testing of HYDRO. Unfortunately the hybrid model does not report performance numbers in Mflops/s, so run time is used to measure scaling.

Figure 28: HYDRO scaling using a pure MPI model solving the 2d problem (performance in Mflops/s versus the number of ranks).

Figure 28 shows the scaling using a pure MPI implementation of HYDRO. The scaling seems quite good up to the point where 2 or more ranks are scheduled per core. The placements in this test are the defaults; better tuning of the placements might provide slightly better results.
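The pure MPI runs are launched natively on the card much like an ordinary cluster job. A sketch of such a launch is shown below; the card hostname and binary name are illustrative, and the I_MPI_MIC setting reflects a standard Intel MPI setup for native MIC runs rather than the exact invocation used here.

   export I_MPI_MIC=enable            # allow Intel MPI to start ranks on the coprocessor
   mpirun -n 60 -host mic0 ./hydro_mpi.mic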
For the hybrid models the HYDRO implementation does not yield performance numbers, so run times are taken as indicators and a scaling factor is calculated. As there are no runs with only one core, perfect scaling from 1 to 4 cores is assumed, i.e. a scaling factor of 4 for the 4-core run. Figure 29 shows the obtained scaling using 2, 3 and 4 OpenMP threads per MPI rank. Scaling is quite good up to 3 threads per rank, i.e. 180 hardware threads, but the highest performance was measured using all 240 threads.

Figure 29: HYDRO scaling (speedup versus total core count) using a hybrid MPI/OpenMP model solving the 2d problem. Runs are performed using 2, 3 or 4 OpenMP threads per MPI rank; the core count is the total number of hardware threads employed. The speedup using 4 cores is assumed to be perfect relative to a single core.

How does the Xeon Phi performance stand up to Sandy Bridge? Figure 30 shows that the performance in Mflops/s for Sandy Bridge outperforms the Xeon Phi; the difference is 23% higher performance for Sandy Bridge. It might be possible to do some more tuning of the HYDRO code to reduce this gap.

Figure 30: HYDRO, pure MPI implementation solving the 2d problem on Xeon Phi and Sandy Bridge (performance in Mflops/s).

For the hybrid case the picture is about the same. Figure 31 shows the difference in run times, where lower is better in this case. Again there is a significant difference in performance in Sandy Bridge's favour.

Figure 31: HYDRO, hybrid MPI/OpenMP implementation solving the 2d problem on Xeon Phi and Sandy Bridge. Performance is given by run times, where lower is better.

It might be possible to reduce or even close this gap with careful tuning of the HYDRO code on Xeon Phi, but currently Sandy Bridge holds the record for this benchmark.

Vienna Ab-initio Simulation Package, VASP

This well known and widely used software package is a candidate to run on any accelerated system due to its popularity. It is known to be painful to build and to run properly; it is very sensitive to compiler settings and compiler optimization, several routines must be compiled without any optimization, and some versions of the compilers must be avoided. Not an optimistic starting point for compilation for Xeon Phi, nor for optimizing it to run on many cores. The application is MPI based, so we need to run MPI on a shared memory system, which is also not a good starting point; not all shared memory communication in MPI is implemented in the most optimal way, and most probably not for Xeon Phi as this is a rather new architecture. As the only software available is Intel based, both the Intel MPI and the Intel compilers/libraries must be used.

In addition, the current release of the Intel compiler runs into some internal issues when invoking higher optimization, e.g. -O3. The compilation also takes very much longer with -O3; it seems to struggle with the high level optimizer code it uses internally. Hence not all routines could be compiled with -O3. Not a good sign for sensitive code like VASP. The O3 level does a lot of aggressive loop transformations such as fusion, block-unroll-and-jam and collapsing of IF statements; this might be the cause of the VASP failures at this level of optimization. -O2 not only gives a more stable executable, it is also faster. Convergence often seems to be upset by the highest level of optimization. Selecting a more precise or strict floating point model is a possible countermeasure, or just stick to -O2.
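In the Intel compilers such a stricter floating point model is selected with the -fp-model family of options. An illustrative makefile fragment is shown below; it is a sketch of the option in context, not the actual VASP build configuration used here.

   FC     = mpiifort -mmic
   FFLAGS = -O2 -fp-model precise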
The VASP version 5.3.3 is used in this test, with a bismuth oxide benchmark. This benchmark has been used for a considerable number of tests with our VASP installation. It is also small enough to fit in the limited memory of the Xeon Phi system. Compilation is quite simple to initiate: just change the flags somewhat to instruct the compiler to generate MIC code, and change the include and library paths to mic instead of intel64. The FFTW wrappers need to be rebuilt for the MIC architecture, which is a simple task. Once the binary executable is built, tests are run to verify that the settings etc. have produced a working executable. This is often not the case, as VASP is known to be sensitive to all kinds of very minute facets in the middleware.

The scaling of the VASP code solving the input in question is part of the initial exploration. If sufficient scaling is measured one can assume that most of the performance of the Xeon Phi can be harvested. The placement of processes using MPI is somewhat different from that of OpenMP. It looks like Intel MPI places the ranks in a scatter fashion and does not place the ranks the way the system does when running OpenMP. It is possible to prepare a special rank file and fix each rank to a certain core; this has not been done during these tests. By default the Intel mpirun enables process pinning. Figure 32 shows how the speedup changes with the number of cores used. It scales quite well up to 60 cores, which is one rank per core. When scheduling more than one rank per core the performance drops. As the MPI ranks are independent it might well be that the amount of data needed becomes too much for the L2, the TLB or even the memory bandwidth.

Figure 32: VASP scaling for the Bi2O3 input (speedup versus number of ranks/cores). This is a pure MPI run. Due to memory limitations only up to 140 ranks could be run. Perfect scaling is assumed from a single core up to 10.

The VASP code is a pure MPI code, and a hybrid model is not possible using the VASP code itself. However, the VASP application uses quite a bit of linear algebra and FFT. These functions are part of the MKL library, and this library is available in a threaded version. Consequently a semihybrid model can be built. Even though mpirun pins processes by default, this only affects the actual MPI ranks, not the OpenMP threads spawned by MKL. The placement of the threads has an impact on performance. Figure 33 shows tests run with a multithreaded MKL library and different placement models. It clearly shows that two threads per MPI rank yields the best result for this run.

Figure 33: Effect of thread placement in semihybrid VASP runs (run time in seconds for compact, scatter and balanced placements). In all cases 60 MPI ranks are started; in addition MKL is threaded and is allowed to use 2, 3 or 4 threads, which runs 2, 3 or 4 threads per core. Performance is given as run time, where lower is better.

However, tests show (figure 34) that performance is better using a serial MKL with only one MPI rank per core.

Figure 34: Comparison of VASP runs with a serial MKL and a threaded MKL. The effect of using multiple MKL threads per MPI rank seems counterproductive. Performance is again given as run time, where lower is better.
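A sketch of how such a semihybrid run can be set up is shown below; the MPI launch is unchanged and only the MKL thread count is raised. The variable names are the standard Intel MPI/MKL ones, while the exact invocation used for the runs in figure 33 is an assumption.

   export I_MPI_MIC=enable
   export MKL_NUM_THREADS=2              # 2 MKL/OpenMP threads per MPI rank
   export KMP_AFFINITY=granularity=fine,balanced
   mpirun -n 60 -host mic0 ./vasp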
For this application it has proven very hard to beat Sandy Bridge on performance. This comes as no surprise, as VASP does not scale with all inputs; the bismuth oxide input might be one of the inputs that does not scale very well. This example, figure 35, illustrates the fact that even for MPI programs the scaling of the application must continue to at least the number of cores on the Xeon Phi.

Figure 35: VASP performance, Xeon Phi versus Sandy Bridge, Bi2O3 benchmark (run time in seconds). It is very hard to attain performance comparable to that of the Sandy Bridge processor. The reason is twofold: MPI over shared memory is not always very efficient, and the VASP application exhibits poor scaling with this input.

Program MARK

Program MARK provides parameter estimates from marked animals when they are re-encountered at a later time. Program MARK computes the estimates of model parameters via numerical maximum likelihood techniques. The FORTRAN program that does this computation also determines numerically the number of parameters that are estimable in the model, and reports its guess of one parameter that is not estimable if one or more parameters are not estimable. The number of estimable parameters is used to compute the quasi-likelihood AIC value (QAICc) for the model. This application is used by bio-scientist users at Abel, making it an interesting test case.

This application was straightforward to compile and build on the MIC architecture; the Intel compilers and libraries provide all that is needed, both BLAS and LAPACK. The scaling of the application varies with the input data set. One input with suitable run times was selected for the test.

Figure 36: MARK application scaling solving the Billbig input case (wall time in seconds versus number of cores). Performance is measured in run times.

Figure 36 shows the scaling of run times when increasing the number of OpenMP threads. The optimum number of threads seems to be 2 threads per core, yielding a total thread count of 120. Quite nice scaling for a program developed years ago for Windows XP. The code is well written OpenMP Fortran code that conforms to the standard. Unfortunately the performance is inferior to that of a Sandy Bridge processor, as figure 37 shows.

Figure 37: MARK application solving the Billbig input using Xeon Phi and a standard Sandy Bridge compute node. Performance is given in run times.

This example illustrates that while it might be very easy to compile, build and install an application on a Xeon Phi system, it is not always beneficial with respect to performance.

Performance evaluation

Offload model

MKL enabled offload

Intel provides support for automatic usage of the Xeon Phi as a co-processor by means of automatic offloading of work in MKL routines. This is a very simple way to exploit the MIC co-processor: no recompile, just set some environment variables and run. Presently, unfortunately, only a very limited set of MKL functions and routines have been offload enabled.
Only the level 3 BLAS functions ?GEMM, ?TRMM and ?TRSM are offload enabled at the time of writing, which currently limits the usage to a few special cases. It should come as no surprise that the ones used in HPL are enabled. The most common of these is the dense matrix-matrix multiplication, dgemm (the s, c, d and z variants are supported). Usage is very simple, as the following example shows:

   time_start=mysecond()
   call dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
   time_end=mysecond()
   write(*,fmt=form) &
     "dgemm end, timing :",time_end-time_start," secs, ",&
     ops*1.0e-9/(time_end-time_start)," Gflops/s"

This f90 code is all it takes to do A*B => C. All the magic is done by MKL behind the scenes. Compiling is equally simple:

   ifort -o dgemm-test.x -mcmodel=medium -O3 -openmp -xAVX -mkl dgemm-test.f90

MKL is very flexible and can be instructed to use multiple threads on the host system and offload some of the work to the co-processors, the Intel Xeon Phis, or MIC for short. MKL also provides variables and functions to set the fraction of work to be offloaded from the host processor to the co-processors, in addition to automatic load balancing; some quite interesting results are shown in the figures below. The following table gives a selection of the environment variables that control the offload. Some of these are also available as functions to be called from the program that utilizes offloading.

Environment variable     Function
MKL_MIC_ENABLE           Enables automatic offload.
OFFLOAD_DEVICES          Lists the offload devices.
MKL_MIC_WORKDIVISION     Specifies the fraction of work to do on all the Intel Xeon Phi coprocessors on the system, including auto.
OFFLOAD_REPORT           Specifies the profiling report level for any offload.
MIC_LD_LIBRARY_PATH      Search path for coprocessor-side dynamic libraries.

Table 3: Relevant environment variables for offload.

Figure 38: MKL dgemm automatic offload using a single Xeon Phi card and both SB host processors (performance in Gflops/s for matrix footprints of 2288, 20599 and 57220 MiB versus the percentage of work offloaded to the mic: auto, 0, 50, 80, 90, 100).

Figure 39: MKL dgemm automatic offload using two Xeon Phi cards and both SB host processors (same footprints and offload fractions).

Figures 38 and 39 show the performance measured when performing dense matrix-matrix multiplication using MKL and automatic offloading. Nothing extra is done beyond what is shown in the Fortran code example above; only some environment variables are set to achieve this performance. It is quite remarkable how well the automatic load balancing in MKL's run time system works; it is actually quite hard to beat the automatic load partitioning between the mic processor and the host processor. Looking at the actual numbers there is reason to be really impressed. A normal compute node performs at about 320 Gflops/s when doing dgemm. With two installed Xeon Phi cards this performance clocks in at 1694 Gflops/s, or 1.7 Tflops/s per compute node (only 16 racks for a Petaflop/s).
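The environment for such an automatic offload run consists of a handful of the variables from table 3; a sketch is shown below, where the work division value is illustrative (it can also be left at auto).

   export MKL_MIC_ENABLE=1            # turn on automatic offload
   export MKL_MIC_WORKDIVISION=0.8    # fraction of the work sent to the coprocessors, or auto
   export OFFLOAD_REPORT=2            # print a per-offload profiling report
   ./dgemm-test.x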
User function offload

The Intel compilers have support for offloading user defined functions or regions inside a program onto the MIC co-processor. This is a bit more complex than just calling MKL routines. One can use regions in a program or write a complete function or subroutine to be compiled and run on the co-processor. In both cases the code marked for offload will be cross compiled for the mic architecture. The run time system will launch the code on the MIC processor, and data is exchanged with yet another set of run time functions. Many combinations are possible: overlap of data transfer, load balancing between the host processor and the co-processor, etc. However, all of this is left to the programmer. This makes the co-processor a little harder to use in production, where no or very few changes to the source code are wanted.

Offloading regions in a program

This is the simplest solution, where directives instructing the compiler to generate offload code are just inserted into the program. The following code shows an example of how a nested do loop is offloaded from the host processor to the mic co-processor.

   !dir$ offload begin target(mic) in(a,b) out(c)
   !$omp parallel do private(j,l,i)
   do j=1,n
     do l=1,n
       do i=1,n
         c(j,l)=c(j,l) + a(i,l)*b(i,l)
       enddo
     enddo
   enddo
   nt=omp_get_max_threads()
   #ifdef __MIC__
     print*, "Hello MIC threads:",nt
   #else
     print*, "Hello CPU threads:",nt
   #endif
   !dir$ end offload

The offloaded part of the code is executed on the MIC co-processor using an environment either set up by the system or by the user via environment variables. Both the number of threads and the thread placement on the mic processor can be controlled in this way. During the time the offloaded code runs on the co-processor the host processors are idle. To achieve load balancing the user must explicitly program the work partitioning.

Offloading functions or subroutines

This approach makes usage of the offloaded code simpler and is the more common way of programming. The user writes a complete function or subroutine to be offloaded. Using such a routine is straightforward; it is called just as any other function, with the only addition that the data must be handled: data transfer between the two memories must be initiated explicitly. This data transfer can overlap with other workload on the host processor, hiding the latency of the transfer. Following the example above, this piece of code can easily be put into a subroutine.

   !dir$ attributes offload : mic :: mxm, omp_get_max_threads
   subroutine mxm(a,b,c,n)
     use constants
     integer :: n
     real(r8),dimension(n,n) :: a,b,c
     integer :: i,j,l,nt
   !$omp parallel do private(j,l,i)
     do j=1,n
       do l=1,n
         do i=1,n
           c(j,l)=c(j,l) + a(i,l)*b(i,l)
         enddo
       enddo
     enddo
     nt=omp_get_max_threads()
   #ifdef __MIC__
     print*, "Hello MIC threads:",nt
   #else
     print*, "Hello CPU threads:",nt
   #endif
   end subroutine mxm

This routine will now be compiled to an object file suitable for execution on the MIC co-processor. It can be called as any other routine, but data transfer must be accommodated; the calling program needs to arrange the transfer.

   time_start=mysecond()
   !dir$ offload_transfer target(mic:0) in( a: alloc_if(.true.) free_if(.false.) )
   !dir$ offload_transfer target(mic:0) in( b: alloc_if(.true.) free_if(.false.) )
   !dir$ offload_transfer target(mic:0) in( c: alloc_if(.true.) free_if(.false.) )
   !dir$ offload target(mic:0) in(a,b: alloc_if(.false.) free_if(.false.)) &
                 out(c: alloc_if(.false.) free_if(.false.))
   call mxm(a,b,c,n)
   time_end=mysecond()

The subroutine call will block using this construct.
In order to utilize both host and co-processor resources, concurrency and synchronization need to be introduced. However, the above setup works well for testing and timing purposes. Figure 40 shows the performance measured when comparing the host CPUs and the co-processor running Fortran 90 code with OpenMP threading of three nested loops doing matrix multiplication in double precision. It is far inferior to the MKL library, but serves as an illustration of what can be expected from user Fortran 90 code. No special optimization has been performed, only compiler flags like -O3 and -mavx in addition to -openmp.

Figure 40: Matrix multiplication offloading, Fortran 90 code in double precision, comparing the host processors (Sandy Bridge, 16 threads) with the Xeon Phi mic co-processor for matrix footprints of 2288, 5149, 5859 and 6614 MiB (performance in Gflops/s). A single mic processor using 240 threads and scatter placement is used.

Load distribution and balancing, host CPUs and co-processors (single and multiple)

It is relatively straightforward to set up concurrent runs, workload distribution and ultimately load balancing between the host CPUs and the Xeon Phi mic processors. However, all of the administration is left to the programmer. Since there is no shared memory, the work and memory partitioning must be handled explicitly. Only the buffers used on the co-processors need to be transferred, as memory movement is limited by the PCIe bus bandwidth. There are mechanisms for offload data transfer and semaphores for synchronising both transfer and execution. All of this must be explicitly handled by the programmer. While each part is relatively simple, it can become quite complex when trying to partition the problem while also trying to load balance. The examples below try to illustrate this.

Filling the matrices for the co-processors:

   am0(:,:)=a(1:m,:)
   am1(:,:)=a(m+1:2*m,:)
   bm0(:,:)=b(1:m,:)
   bm1(:,:)=b(m+1:2*m,:)

Initiating the data transfer:

   !dir$ offload_transfer target(mic:0) in( am0: alloc_if(.true.) free_if(.false.) )
   !dir$ offload_transfer target(mic:0) in( bm0: alloc_if(.true.) free_if(.false.) )
   !dir$ offload_transfer target(mic:1) in( am1: alloc_if(.true.) free_if(.false.) )
   !dir$ offload_transfer target(mic:1) in( bm1: alloc_if(.true.) free_if(.false.) )

Variables for each co-processor have been declared and allocated. These are 1/3 of the size of the total matrix held in host memory. Each compute element (host SB processors, mic0 and mic1) does 1/3 of the total calculation. There is no dynamic load balancing; the split is fixed at 1/3 each.

Calling the offloaded subroutine:

   time_start=mysecond()
   !dir$ offload target (mic:0) in(am0,bm0) out(cm0) signal(s1)
   call mxm(am0,bm0,cm0,m,n)
   ! print *,"cm0",cm0(:,:)
   !dir$ offload target (mic:1) in(am1,bm1) out(cm1) signal(s2)
   call mxm(am1,bm1,cm1,m,n)
   ! print *,"cm1",cm1(:,:)
   kc=2*m+1
   !$omp parallel do private(j,l,i)
   do j=kc,n
     do l=1,n
       do i=1,n
         c(j,l)=c(j,l) + a(i,l)*b(i,l)
       enddo
     enddo
   enddo
   nt=omp_get_max_threads()
   #ifdef __MIC__
     print*, "Hello MIC threads:",nt
   #else
     print*, "Hello CPU threads:",nt
   #endif

Here we wait for the co-processors if they have not yet finished, with one semaphore for each co-processor:

   !dir$ offload_wait target(mic:0) wait(s1)
   !dir$ offload_wait target(mic:1) wait(s2)

Copying the data received from the co-processors back into the matrices in host memory:

   ! Put the parts computed on the mics into the sub matrices of c.
   c(1:m,:)=cm0(:,:)
   c(m+1:2*m,:)=cm1(:,:)
   ! c(2*m+1:3*m,:) is already in c, nothing to do.
   time_end=mysecond()

The amount of work is about evenly distributed, with just a little time spent waiting for another compute element to finish its work. In a production run the load balancing would be set up dynamically, and hence a better load balance obtained. However, this is left to the programmer and requires detailed knowledge of the workload. Again we see that the actual programming is easy, but the administration of the workload can be complex. One advantage is that we now have more memory and can attack bigger problems without having to run the offloaded parts in series with a chunk of data for each run; this must be done if the problem we want to offload exceeds the 8 GiB of currently installed memory on the Xeon Phi cards.

Figure 41 shows the aggregated performance obtained when using both SB processors and both mic processors in the Xeon Phi cards running the triple nested do loops programmed in Fortran 90.

Figure 41: Aggregated single node performance (Gflops/s) for matrix footprints from 2288 to 17944 MiB: two Sandy Bridge processors using 16 threads each plus two Xeon Phi mic co-processors, Fortran 90 code using OpenMP threading. The mic processors use 240 threads each and scatter placement.

Intel provides the tools needed to do the job, but the programmer needs to handle all the details himself. It must be noted that this is vastly simpler than doing similar programming using a GPU; here the exact same Fortran 90 code is used for both processor classes, Sandy Bridge/x86-64 and Xeon Phi/mic. Offloading is a very easy way to start accelerating your code. It might not utilize the co-processor's full potential, but nevertheless any speedup obtained with a small effort is worth harvesting. Figures 42 and 43 show a comparison between a standard compute node and an accelerated node performing naïve nested do loops in f90 to multiply two matrices. This kind of performance and speedup is about what you could expect to get in a production setting where common f90 code is compiled and run. The figures show that the memory footprint matters.

Figure 42: Comparing performance (Gflops/s) between a standard compute node with SB processors and an accelerated node with two SB processors and two Xeon Phi cards, for matrix footprints of 2288, 5149, 9155, 14305 and 17944 MiB.

Figure 43: Speedup measured comparing a standard Abel compute node with an accelerated node with two Xeon Phi cards installed, Fortran 90 simple nested do loop matrix multiplication.

The work-share balancing between the two Xeon Phi cards' memory and the host memory seems to benefit strongly from larger workloads placed on the co-processors. For a speedup of two the performance is in effect doubled. This is measured using normal f90 code with minimal changes to the actual code, just setting up load balancing and work sharing. Any programmer should achieve this kind of performance with minimal effort.
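As noted under the offload programming model, the thread count and placement used on the coprocessor side of these offload runs are controlled from the host through prefixed environment variables. The sketch below uses the standard Intel prefix mechanism; the values simply mirror the 240 threads and scatter placement quoted in the figure captions.

   export MIC_ENV_PREFIX=MIC
   export MIC_OMP_NUM_THREADS=240
   export MIC_KMP_AFFINITY=granularity=fine,scatter
   ./mxm.x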
How to reach performance close to theoretical?

Property                                    Value
Core frequency                              1.05 GHz
Number of cores                             60
Vector width                                512 bits / 8 double / 16 single precision reals
Fused Multiply Add instruction flops gain   2 times
Table 4: Properties of the MIC architecture.

Theoretical performance: 1.05·10^9 s^-1 (clock) × 60 (cores) × 8 (doubles per vector) × 2 (FMA) = 1008 Gflops/s ≈ 1 Tflops/s. Another theoretical measure might be 16.8 Gflops/s per core.

Scaling

With 60 cores and 240 threads the challenge of getting OpenMP to scale well at this core count is far from trivial. It is well known that OpenMP normally does not scale well beyond 8-16 threads without special measures. There are 60 cores, each with two pipes; one pipe can issue to both the vector and the scalar unit while the other can only issue to the scalar unit, see figure 4. Because of this a single thread cannot issue vector instructions every clock cycle; a minimum of two threads per core is needed to issue a vector instruction every cycle. This brings the number of threads the application needs to scale to up to 120, and very careful use of OpenMP is needed to scale to 120 threads. In addition comes the synchronization of the threads, which places a burden on the cache coherency machinery and can saturate the ring interconnect and the memory bandwidth. One can run MPI only or hybrid models, as MPI tends to scale better than OpenMP, but the shared memory device in MPI is generally not as efficient as one might expect, which limits scalability. For applications that do not scale very well it is still not easy to get the desired performance, see figures 33 and 30. Beating Sandy Bridge is still very hard.

Vector unit

In order to attain full performance the 512-bit vector, holding 8 double precision numbers, needs to be fully populated for each vector instruction issued. Failing to achieve this results in a performance of N/8 of peak, where N can be as low as one, yielding only 125 Gflops/s. Only carefully laid out vector and matrix data sets can easily be mapped to this kind of vector instructions, which limits the general-purpose applicability of the processor. For Fortran 90 array layouts and loop constructs this can be a good match; for less structured problems it is not.

Fused multiply add instruction

This is a special instruction for multiplication and addition/accumulation which does both the multiply and the add in one instruction in a single clock cycle (review the paragraphs on page 10 and figure 7). In practice this doubles the theoretical performance of the chip. The question remains how often one can fill a vector and use the FMA instruction at the same time. It works fine for the selected problem of matrix multiplication, which is a major part of the top500 benchmark; matrix-matrix multiplication is consequently the major benchmark used to show the outstanding performance of the MIC architecture. Both the vector unit and the FMA instruction pose a challenge for any programmer hoping to get maximum performance.
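To illustrate what the compiler needs in order to exploit both the full vector width and the FMA instruction, a minimal sketch is given below. It is an assumption about typical usage and not code from the report: a unit-stride loop whose body is a multiply followed by an add, which the compiler can map onto 8-wide double precision FMA operations. The !$omp simd directive requires a compiler with OpenMP 4.0 support; Intel's !dir$ simd is an alternative.

   ! Hedged sketch, not from the report: a loop shape the compiler can map onto
   ! 512-bit FMA instructions. With unit stride each vector instruction operates
   ! on 8 double precision values, and the multiply+add pattern becomes one FMA.
   subroutine fma_axpy(n, alpha, x, y, z)
     implicit none
     integer, intent(in)    :: n
     real(8), intent(in)    :: alpha, x(n), y(n)
     real(8), intent(inout) :: z(n)
     integer :: i
     !$omp simd
     do i = 1, n
        z(i) = z(i) + alpha*x(i)*y(i)    ! one multiply and one add per element
     end do
   end subroutine fma_axpy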
Appendix

Node configuration

Compute node     Specification
Vendor           Megware, Myriquid / Supermicro
Mainboard        Supermicro X9DRT
Processor        2 x Intel Sandy Bridge, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 8 cores
L2 cache         8-way Set-associative, 2048 kB, Write Back
L3 cache         20-way Set-associative, 20480 kB, Write Back
Memory           8 x Samsung DDR3 Registered 8192 MB, 1600 MHz
InfiniBand       Mellanox ConnectX-3 FDR
OS               CentOS release 6.4 (Final), later upgraded to 6.6
Compilers        Gcc 4.8.0, 4.8.2 and 4.9.2 / Intel 2013.x and 2015.1
MPI              Intel MPI 4.1.3 and 5.0.0
Math library     Intel MKL 2013.3 and 2015.1
Table 5: Node configuration, standard Abel compute node.

Host node        Specification
Vendor           Megware, Myriquid / Supermicro
Mainboard        Supermicro X9DRG-HF
Processor        2 x Intel Sandy Bridge, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 8 cores
L2 cache         8-way Set-associative, 1024 kB, Write Back
L3 cache         20-way Set-associative, 10240 kB, Write Back
Memory           8 x Samsung DDR3 Registered 16384 MB, 1600 MHz
InfiniBand       Mellanox ConnectX-3 FDR
Phi accelerator  2 x Xeon Phi 5110P, device 2250, 60 cores / 240 threads @ 1.05 GHz
Phi memory       8 GiB GDDR5 memory per card, 16 GiB in total
Phi OS           Busy Box, kernel 2.6.38.8-gefd324e
OS               CentOS release 6.4 (Final), later upgraded to 6.6
Compilers        Gcc 4.8.0, 4.8.2 and 4.9.2 / Intel 2013.x and 2015.1
MPI              Intel MPI 4.1.3 and 5.0.0
Math library     Intel MKL 2013.3 and 2015.1
Table 6: Node configuration, Xeon Phi accelerated Abel compute node.

References

Background:
http://www.drdobbs.com/parallel/programming-intels-xeon-phi-a-jumpstart/240144160

NVIDIA web site about applications:
http://www.nvidia.co.uk/object/gpu-computing-applications-uk.html
http://www.nvidia.co.uk/object/bio_info_life_sciences_uk.html

Porting of VASP to support GPUs:
http://www.ncbi.nlm.nih.gov/pubmed/22903247

ACEMD:
http://www.acellera.com/products/acemd/

HYDRO:
http://www.prace-ri.eu/IMG/pdf/porting_and_optimizing_hydro_to_new_platforms.pdf

MARK:
http://warnercnr.colostate.edu/~gwhite/mark/mark.htm

Notes on optimization:
http://software.intel.com/en-us/articles/step-by-step-optimizing-with-intel-c-compiler

-O1/-Os
This option enables optimizations for speed and disables some optimizations that increase code size and affect speed. To limit code size, it enables global optimization, which includes data-flow analysis, code motion, strength reduction and test replacement, split-lifetime analysis, and instruction scheduling. It also disables inlining of some intrinsics. If -O1 is specified, -Os is enabled by default. With -O1 compiler auto-vectorization is disabled. If your application is sensitive to code size, you may choose the -O1 option.

-O2
This option enables optimizations for speed and is the generally recommended optimization level. Compiler vectorization is enabled at -O2 and higher levels. With this option the compiler performs some basic loop optimizations, inlining of intrinsics, intra-file interprocedural optimization, and the most common compiler optimization techniques.

-O3
Performs the -O2 optimizations and enables more aggressive loop transformations such as fusion, block-unroll-and-jam, and collapsing of IF statements. The -O3 optimizations may not result in higher performance unless loop and memory access transformations take place, and they may slow down code in some cases compared to -O2.
The -O3 option is recommended for applications that have loops that heavily use floating-point calculations and process large data sets.

Notes on MKL:
http://software.intel.com/en-us/articles/intel-mkl-automatic-offload-enabled-functions-for-intel-xeon-phi-coprocessors
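The article linked above describes MKL's automatic offload, where selected MKL routines, dgemm among them, can use the coprocessors without any source changes. A minimal sketch of how this might be used is given below; the program, the matrix size and the assumption that the environment variable MKL_MIC_ENABLE=1 is set at run time are illustrative and not taken from the report.

   ! Hedged sketch, not from the report: with MKL automatic offload enabled
   ! (environment variable MKL_MIC_ENABLE=1) a sufficiently large dgemm call
   ! may be split automatically between the host and the Xeon Phi cards.
   program mkl_ao_example
     implicit none
     integer, parameter :: n = 8000               ! illustrative size; offload pays off for large matrices
     real(8), allocatable :: a(:,:), b(:,:), c(:,:)
     allocate(a(n,n), b(n,n), c(n,n))
     call random_number(a)
     call random_number(b)
     c = 0.0d0
     ! C := 1.0*A*B + 0.0*C through the standard BLAS interface provided by MKL
     call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
     print *, 'c(1,1) =', c(1,1)
   end program mkl_ao_example

Under these assumptions, compiling with the Intel compiler and linking with MKL (for example with the -mkl flag) is all that is needed on the source side.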