Experiences with Xeon Phi coprocessors
Abel compute cluster at UiO 2014-2015
Ole W. Saastad, PhD
UiO/USIT/UAV/ITF/FI
Feb. 2015
Experiences with Xeon Phi (aka MIC) at the Abel compute cluster at the University of Oslo, 2014-2015.
Preface
The introduction of Many Integrated Core processors from Intel has made massively parallel
computation imperative. The motivation has been the very high compute capacity: a large number
of relatively low performing cores yields a high combined performance. The theoretical performance
for a single processor, using 60 cores, is over 1 Tflops/s in double precision. A typical node with
Sandy Bridge processors in the Abel cluster has a theoretical performance of 166 Gflops/s per
processor and delivers about 160 Gflops/s per processor when running the top500 test.
Compare this to a matrix multiplication test on Xeon Phi that clocks in at slightly less than 500
Gflops/s per processor, about 3 times the Sandy Bridge node performance, all with the same source
code and just a recompile. The fraction of theoretical performance seen in real applications and
real benchmarks is far less than with Sandy Bridge. Even so, the Xeon Phi is a formidable
processor. The Xeon Phi is an x86-64 processor using the same Intel compilers as all other Intel
processors. It is not directly binary compatible; many less fortunate legacy constructs have been
phased out.
Performance is reported in two different ways: in many cases wall clock time, which is the total
time from launch to completion, and in some cases a throughput measurement such as jobs per
unit of time. For the first, lower is better, while for the latter, higher is better. Please
examine the graphs and note which measurement type is used.
Introduction
The Intel Many Integrated Core (MIC) architecture is developed on the premise that higher
performance can be attained by using more transistors for computation. Most of the transistors in a
modern big core processor like Sandy Bridge do not do calculation. The idea behind the MIC
architecture was to employ a rather large number of simpler cores on the chip. Combined they
would yield a higher performance than a smaller number of big cores. Big cores contain elaborate
hardware prefetch, out of order execution, branch prediction and nested levels of cache. A large
fraction of the transistors does not really take part in calculation, i.e. they are not vector units or ALUs.
On the other hand, if a far higher fraction of the transistors could be exploited to do calculation,
better utilization and performance could be obtained. This is what is done in MIC and GPUs; in the
GPUs an even higher number of very small cores do the calculations. Like the GPUs, the Xeon Phi
(the trade name) also comes on an extension card, figure 1. Later versions might be offered as a
complete motherboard; no definite date is given.
Figure 1: Overview of coprocessor and host system. Presently the Xeon Phi is offered as a PCIe add-on card
hosted by an Intel Sandy Bridge based host. The limited PCIe bus bandwidth can represent a bottleneck.
The processor is made up of a rather large number of simple x86-64 cores. The cores of the Intel MIC
are based on a modified version of the P54C design used in the original Pentium. The basis of the Intel
MIC architecture is to leverage the x86 legacy by creating an x86-compatible multiprocessor architecture
that can utilize existing parallelisation software tools. As with the Itanium architecture, more is left
to the software, compilers and libraries. Intel provides all the compiler, math, thread and MPI libraries
for the MIC architecture.
The Intel Xeon Phi coprocessor supports the same floating-point data types as the Intel Xeon
processor. Single (32-bit) and double (64-bit) precision are supported in hardware; quadruple
(128-bit) precision is supported through software. Extended (80-bit) precision is supported through
the x87 instruction set. The same set of rounding modes is supported as for Intel Xeon processors.
Figure 2 shows the layout of the processor: all the cores, the memory and the PCIe interfaces are located
along a bidirectional ring bus. Each core has both a level one and a level two cache. The system is fully
cache coherent (cc); the boxes marked TD are "tag memory" for the cache line coherency directory. The
per-core L2 cache is inclusive of the L1 data and instruction caches. The L2 organization comprises 64
bytes per way with 8-way associativity, 1024 sets, 2 banks, 32 GB (35 bits) of cacheable address range
and a raw latency of 11 clocks. The L2 cache is part of the Core-Ring Interface block, which also houses
the tag directory (TD). Both the L1 and L2 caches use the standard MESI protocol for maintaining the
shared state among cores. See table 1 for information about the L1 and L2 caches.
Figure 2: Intel Xeon Phi chip architecture layout. Notice the ring bus used as interconnect.
Parameter       L1              L2
Coherence       MESI            MESI
Size            32 KB + 32 KB   512 KB
Associativity   8-way           8-way
Line            64 bytes        64 bytes
Banks           8               8
Access Time     1 cycle         11 cycles
Policy          pseudo LRU      pseudo LRU
Duty Cycle      1 per clock     1 per clock
Ports           Read or Write   Read or Write
Table 1: Characteristics of the cache hierarchy of the MIC architecture.
The MIC architecture supports 40-bit physical addresses in 64-bit mode and 4-KB and 2-MB page
sizes. On a TLB miss, a four-level page table walk is performed as usual, and the INVLPG
instruction works as expected (changes to the page table sizes). The coprocessor core implements
two types of memory: uncacheable (UC) and write-back (WB). No other memory type is legal or
supported.
Figure 3: Multithreading architectural support in the Intel® Xeon Phi™ coprocessor.
Multithreading is a key element to hide latency and is used extensively in the MIC architecture. There
are also additional reasons for keeping several threads in flight. The vector unit can only issue a vector
instruction from one stream every second clock cycle. This constraint requires that at least two
threads per core are scheduled at the same time to fill the vector pipeline; the vector unit can
execute one instruction per clock cycle if fed from two different threads. Issuing more threads helps
both to hide latency and to fill the vector pipeline. The four hardware threads available help to
accommodate this. Figure 3 shows an overview of the hardware threads in the early stage of
execution.
Figure 4: Intel Xeon Phi Knights corner core. Note that only pipe 0 can issue instructions to the vector unit.
Figures 4 and 5 show a detailed and a simplified layout of the compute core. There are two pipes that feed
instructions into the scalar units, the x87 unit or the 512-bit wide vector unit. For HPC it is the
vector unit that attracts special attention.
Figure 5: Simplified schematic of the MIC core.
The vector unit can hold and operate simultaneously on 8 double precision or 16 single precision
floating point numbers and provides one result per clock cycle. Given a clock cycle of about 1 GHz,
the maximum theoretical performance of the vector unit is 8 Gflops/s. With 60 such cores this yields
an aggregated combined floating point performance of 480 Gflops/s. Using Fused Multiply Add
(FMA), twice this performance is theoretically possible.
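Spelled out with the nominal figures quoted above (a clock of about 1 GHz and 60 cores), the peak numbers follow from simple arithmetic:

8 results/cycle x 1x10^9 cycles/s = 8 Gflops/s per core
8 Gflops/s x 60 cores = 480 Gflops/s
480 Gflops/s x 2 (FMA) ≈ 1 Tflops/s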
Figure 6: Vector units on Intel processors, SSE (Pentium III), AVX (Sandy Bridge) and MIC-512 (MIC architecture).
Figure 6 shows the evolution of the Intel vector units since the introduction of the 128-bit SSE
(successor to the MMX instructions first introduced in 1997) found in the Pentium III in 1999, through
AVX introduced with Sandy Bridge in 2011, and finally the 512-bit wide vector unit found in the
Knights Corner cores introduced in 2013. Wider vector instructions are highly beneficial for the so-called
vector operations frequently found in scientific applications. Vector operations with stride one map
very well onto this kind of vector unit. As an example, the practical performance measured for
matrix matrix multiplication using the math kernel library (MKL) is 459 Gflops/s in double
precision. How much of this can be harvested in user applications is the task of the compiler
and ultimately the programmer.
All the cores, caches, interfaces and memory channels are connected to the interconnect bus.
This is a bidirectional ring that provides efficient transport between all the elements within the
chip. Intel might change this to a mesh or something else in the future.
In addition Intel introduced the well known Fused Multiply Add instruction with the MIC
architecture. This instruction is well known in all supercomputer architectures. It was the important
instruction that increased the Cray-1 performance in 1976 from 80 Mflops/s without it to 140 Mflops/s
with it (it is interesting to note that Cray did not claim a 2x performance gain, as Intel does
today). The FMA instruction is very well suited for vector and matrix operations; the prime
example is matrix matrix multiplication. Fused multiply add instructions come in two kinds, one
with three arguments (FMA3) and one with four arguments (FMA4). GPU cards have supported this
instruction since about 2009.
Figure 7: Fused multiply add (FMA) instruction schematic. For FMA3 the result
must be one of a,b or c. Most common is accumulation of type a=a+b*c.
The fused multiply add instruction is not one of the most used floating point operations and is not
always easy to utilize. When theoretical performance numbers are posted they always assume this
instruction together with a fully loaded vector unit.
The memory subsystem is based on GDDR5 memory. GDDR5 SDRAM is high-bandwidth
memory generally found on graphics cards, not computing engines. GDDR5 memory supports very
high data rates, in the tens of Gbit/s, using multi-GHz transfer clocks. These SDRAMs also cost
more per Gbit than bulk SDRAM, but you are paying for performance.
Figure 8: Layout of the coprocessor board, chip and memory. There are four memory controllers. Since all
memory appears to be equal it looks like access to the memory banks is interleaved.
A simple stream memory benchmark test shows that over 150 GiB/s is easy to obtain, twice that of a
Sandy Bridge based node. This demonstrates the very high bandwidth of the GDDR5 memory.
Demonstrations of the usage of streaming stores from Intel emphasize the problems with
saturation of the interconnect ring. Again it turns out that software and programming skills are
needed to exploit this new architecture fully.
Programming experiences
The common Intel programming tools all support the MIC architecture. The compilation and build
process are something known a cross compiling. This can have consequences when build script try
to verify that compilers work by testing if executables can be built and run. The executables built
for the MIC architecture cannot be run on Sandy Bridge. Special attention is needed when dealing
with such builds. For all other compilations the only thing needed is to inform the compiler that
you want to compile for MIC. This is easily done with a flag called “-mmic”.
Programming for the so-called native model
The native model is the programming model where the Xeon Phi coprocessor is used as a stand-alone
Linux system. You log in to the BusyBox Linux and run programs just as you would on a normal
compute node. Typically there is an NFS-mounted directory that shares files with the host system.
As an example, consider compilation of the simple stream memory bandwidth benchmark. The compilation
on the host can look like this:
icc -mmic -O3 -openmp -o stream.x stream.c
The executable can be copied over to the target system and run. Any attempt to execute it on the
host system fails; even the library check tool ldd will just report "not a dynamic executable",
while on the target system it displays the dynamic libraries as normal.
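As an illustration, a typical native session could look like the sketch below. The coprocessor host name mic0 and the library path are assumptions and depend on the local installation; the OpenMP runtime built for MIC (libiomp5.so) must be visible on the card.

scp stream.x mic0:~/
ssh mic0
export LD_LIBRARY_PATH=/opt/intel/composerxe/lib/mic:$LD_LIBRARY_PATH
./stream.x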
Even if the compilers are identical, the compiler flags are different from Sandy Bridge. The most
noticeable is the flag that triggers generation of the novel fused multiply add (FMA) instruction,
where a=b*c+d (FMA4) and a=b*c+a (FMA3). The flag to invoke this is "-fma". Intel uses this
instruction to calculate theoretical performance. With 8 double precision (DP) numbers in the vector
unit and one result per clock cycle we arrive at about 16 Gflops/s per core: one multiply and one add,
two operations per clock tick. This number is somewhat optimistic. Whether it is possible to write
programs that schedule 120 threads or more with long vector tasks capable of filling all vector pipes
is an open question. Not only does the vector unit need to be filled, the fused multiply add
instructions also need to make up a rather large fraction of the code. Maybe this is only possible for
the top500 test? Assuming no FMA instructions we arrive at 8 Gflops/s per core, which is 480 Gflops/s
using 60 cores. Still formidable performance.
Intel MKL is also ported to and supported on the Intel MIC architecture. This makes it easy to port
applications that rely on the functions within that library.
Intel MPI is also ported to MIC and runs without any extra installations. Just copy the bin and lib
directories to the MIC, set the relevant paths, and things run and resolve with minimal problems.
Compiling MPI programs using Intel MPI is just as easy as for non-MPI programs; all include
files etc. are available for MIC.
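As a sketch, building a native MPI binary only adds the -mmic flag to the usual Intel MPI compiler wrapper (the source file name here is hypothetical):

mpiifort -mmic -O3 -o hello_mpi.x hello_mpi.f90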
In short, programming and porting applications to run natively on the MIC architecture is very
easy. Tuning for application performance is another matter, but this is true for all types of
accelerators.
Programming for the offload model
The offload programming model treats the Xeon Phi just like a co-processor. Part of the executable
code is executed on the mic co-processor. The program is run on the host processor and is compiled
for Sandy-Bridge architecture. The only difference is that some part of the executable is execute on
the co-processor.
The part of the code run on the co-processor can be a library function like MKL, a user written
function or routine, or a region of the program. In the cases involving user written code the compiler
must be instructed to generate code cross compiled for the mic architecture. This is done with
compiler directives much like OpenMP directives.
In the case of the MKL offload functions very little extra is needed by the programmer, just setting the
right environment variables. If the co-processor is present, MKL will automatically execute the
MKL routines on the co-processor. However, only a very limited set of MKL functions is ported at
the time of writing. On the other hand, MKL automatic offload does load balancing between the host
processors and the co-processor.
To offload user written code or functions there is a bit more setup to be done. A number of
directives and data movements must be taken care of. All of this is done with compiler directives.
Load balancing must be done explicitly by the user; the partitioning of workload between the host
processors and the co-processor must be done manually. However, functions can be run concurrently
so that the host processor can work on one function while the co-processor works on another, as
long as there is no shared data, as this would require synchronisation. This is possible to achieve, but
the complexity can be quite high. The host memory and device memory are not shared. Data
must be explicitly copied between the two.
Compiling is relatively easy. Only a few extra flags are needed. Code to be executed on the mic is
automatically generated by the compiler.
ifort -offload-attribute-target=mic -openmp -O3 -xhost -o mxm.x mxm.F90 -mkl
Performance evaluation: Native execution
Stream memory bandwidth benchmark
Stream is a well known benchmark for measuring memory bandwidth, written by John D. McCalpin of
TACC. TACC also happens to host the large supercomputer system called "Stampede", which is an
accelerated system using a large array of Intel Xeon Phis.
Stream can be built in several ways; it turned out that static allocation of the three vectors operated
on provided the best results. The source code illustrates how the data is allocated:
# ifndef USE_MALLOC
static double
a[N+OFFSET],
b[N+OFFSET],
c[N+OFFSET];
# else
static volatile double *a, *b, *c;
# endif
#ifdef USE_MALLOC
a = malloc(sizeof(double)*(N+OFFSET));
b = malloc(sizeof(double)*(N+OFFSET));
c = malloc(sizeof(double)*(N+OFFSET));
#endif
The figure below shows the difference between using malloc and static allocation. For this reason all
subsequent stream runs were done using static allocation.
Figure 9: Effect of data allocation scheme, C malloc versus static allocation, on stream bandwidth (GiB/s) for Copy, Scale, Add and Triad. Both runs with compact placements.
To achieve optimum performance with this special architecture, which lacks out of order execution
and fancy prefetch machinery, and which uses a high core count and a large number of threads, one
needs to take extra care when compiling programs. There is a large range of compiler switches that
can help with this. However, the switches may have a different effect on the MIC architecture than
on Sandy Bridge. One such optimization is prefetching. The MIC relies much more on software
prefetching, as its cores lack the same efficient hardware prefetcher. More is left to the
programmer and the compiler.
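For the stream runs reported here the prefetch hints were passed on the command line. A sketch of such a build, using the distances quoted in figure 10 (other codes will need their own values):

icc -mmic -O3 -openmp -opt-prefetch=4 -opt-prefetch-distance=64,32 -o stream.x stream.c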
Figure 10: Effect of compiler prefetch options on the stream benchmark (compiler options used: -opt-prefetch=4 -opt-prefetch-distance=64,32).
The figure above shows the beneficial effect of providing prefetch compiler options to the C compiler,
enabling it to issue prefetch instructions in loops and other places where memory latencies
impact performance. The distances, in numbers of iterations, given to the prefetcher options are found
by trial and error; there is in most cases no magic number that will be optimal in all cases.
Additionally one might use streaming store instructions to prevent store instructions from writing to
cache. There is no reason to pollute the cache with data that is not reused. For the tests done
here it had a small effect: the bandwidth for Triad went up from 134.9 to 135.2 GiB/s, hardly
significant.
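Streaming stores can also be requested from the compiler rather than coded by hand; a hedged sketch (the exact spelling of the option may differ between compiler versions):

icc -mmic -O3 -openmp -opt-streaming-stores always -o stream.x stream.c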
Placement of threads per core is also a major performance issue. Hardware threads are grouped
together onto cores, within which they share resources. This sharing of resources can benefit or
harm particular algorithms, depending on their behavior. Understanding the behavior of your
algorithms will guide you in selecting the optimal affinity. Affinity can be specified in the
environment (KMP_AFFINITY) or via a function call (kmp_set_affinity). There are basically three
models: compact, scatter and balanced. In addition there is granularity. With granularity set to fine,
each OpenMP thread is constrained to a single hardware thread. Alternatively, setting core granularity
groups the OpenMP threads assigned to a core into little quartets with free rein over their
respective cores.
The affinity types compact and scatter either clump OpenMP threads together on as few
cores as possible or spread them apart so that adjacent thread numbers are on different cores.
Sometimes, though, it is advantageous to enable a limited number of threads distributed across the
cores in a manner that leaves adjacent thread numbers on the same cores. For this there is a newer
affinity type, balanced, which does exactly that. Using the verbose setting of KMP_AFFINITY, as
shown below, you can determine how the OpenMP thread numbers are dispersed.
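As an illustration, the affinity, granularity and verbose reporting discussed above can be requested in the environment before launching the binary (a sketch; the thread count is just an example):

export OMP_NUM_THREADS=120
export KMP_AFFINITY=granularity=fine,balanced,verbose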
Figure 11: Placements of threads onto cores using balanced affinity settings.
Figure 12: Placements of threads onto cores using scatter affinity settings.
Figure 13: Illustration of the compact placement model. All 60 threads are scheduled to be as close together
as possible. Four threads are sharing a single core even if there are idle cores. Beneficial when the threads
can share the L2 cache.
Figure 14: Illustration of scatter placement model. All 60 threads are scheduled as far apart as possible. In
this case one thread per core. Beneficial when memory bandwidth is required.
In addition, the number of threads employed in the test is interesting. Will one thread per core
saturate the interconnect ring, memory controllers or memory?
Figure 15: Effect of core affinity setting (compact, scatter, balanced) on stream bandwidth. 60 threads are used in this test. For 240 threads the effect is small.
Figure 16 shows the effect on performance of using one, two, three or four threads per core.
When run efficiently with all options, the 60 cores are capable of saturating the memory
bandwidth. The stream benchmark just copies data and does not perform any significant
calculation, so the effect of more cores does not show up when all the memory bandwidth is already
utilized.
Figure 16: Effect of the number of threads scheduled (60, 120, 180, 240) on stream performance. Size 4.5 GiB, affinity=scatter.
FFT – MKL / FFTW interface
FFT is used in a large number of applications and needs no further introduction. One of the most
common implementations is FFTW. While the FFTW implementation can be successfully built
on SB, it is not so easy to cross build for MIC, and hence MKL with its associated FFTW
interface is used to assess the scaling and performance. As a reference for performance and scaling
some runs using SB and Haswell are included.
Figure 17: 2d-FFT (FFTW/MKL) performance using Sandy Bridge and Haswell processors. SB outperforms HW in this benchmark due to its larger L2 cache (20 MB vs 12 MB), an example where cache size matters. Size of the 2d NxN array: N=20000.
Figure 18: 2d-FFT wall time using Xeon Phi and MKL with the FFTW interface. Size of the problem (NxN) is N=18000. Scaling is good up to about 16 to 32 cores; beyond this, scaling is poor, which limits the Xeon Phi performance as this architecture relies on strong scaling.
Figure 19: 2d-FFT scaling comparing Sandy Bridge and Xeon Phi (MKL using the FFTW interface). The stronger scaling experienced with Xeon Phi is evident. Size: N=18000.
Figure 20: 2d-FFT performance (wall time) of Sandy Bridge compared to Xeon Phi. Sandy Bridge clearly outperforms the Xeon Phi. The good scaling shown by Xeon Phi in the figure above is not enough, as this problem does not scale perfectly. Xeon Phi would have beaten SB if it had scaled to all 240 threads, with a calculated run time of 1.67 seconds. Size as above.
NAS Kernels benchmark
The NAS benchmarks are well known. They are mostly known as MPI benchmarks, but have been
rewritten in OpenMP versions and other parallel implementations.
As the Xeon Phi is a general processor it can run both OpenMP threaded shared memory code and
distributed memory code like MPI. Hence both implementations have been tested, with most attention
given to the threaded version as this is more interesting on a cache coherent shared memory system.
For this kind of benchmark, based on real applications, all optimizations come into play: prefetch,
placement and affinity, threads per core etc. The effect of different placements is shown in figure
21, where the three models compact, scatter and balanced are shown. The best result for each test is
compared; the actual number of cores might change as behavior changes with most parameters. The
performance effect of placement is significant and care must always be taken to select the optimal
affinity. Which placement model yields the best performance is not obvious. For small selected
problems where all data for two or more threads can be kept in the L2 cache, a compact model might
be the best option. However, if those threads compete for the execution units the core might be
starved, but bear in mind that the vector unit can only schedule one instruction from each thread
every other cycle; at least two threads are needed to fill the vector pipeline. In addition, memory
bandwidth is often a limiting factor. One core has a certain bandwidth, and by spreading the threads
onto many cores the total aggregated bandwidth is far larger than from a smaller set of cores.
Figure 21: Effect of processor placement on the NPB benchmarks (BT.C, CG.C, EP.C, FT.B, IS.C, LU.C, MG.B, SP.C) using the three affinity models compact, scatter and balanced. Performance is shown relative to compact placement.
Since the simplified cores of the MIC architecture possess only modest hardware prefetch
machinery, a bigger burden is placed on the programmer and ultimately the compiler. Efficient
usage of prefetch on a cache based system is needed to hide the very long memory latencies. Memory
latency is often in excess of 100 ns (over 100 clock cycles) while the cache latency is one tenth of this.
Getting data into the cache before it is needed is quite important. As there is no crystal ball or psychic
inside, one must rely on clever guesswork, or, as in my case, trial and error. Setting the prefetch to
fetch data too far into the future is counterproductive, either by exhausting the TLB or by
polluting the L2 cache.
Figure 22: Effect of various compiler prefetch settings on the NPB benchmarks (software prefetch off, compiler default, and distances 64,32 / 4,2 / 4), performance relative to compiler defaults. Selecting off leaves it all to the rather simple hardware prefetcher.
All this performance optimization is needed to harvest the power of the 60 cores with 240 hardware
threads. The ultimate litmus test is a comparison with the host processor, Sandy Bridge (details about
the SB are found in the appendix).
Figure 23: Single Sandy Bridge processor performance relative to a single Xeon Phi processor on the NPB OpenMP benchmarks (BT.C, CG.C, EP.C, FT.B, IS.C, LU.C, MG.B, SP.C). Some selected problems outperform SB, while the majority struggle to beat Sandy Bridge.
It is evident from figure 23 that more work is needed to fully exploit the power of the MIC
architecture. The NPB is made up of small kernels taken from real world applications and is
believed to mimic the scientific applications in production.
EuroBen shared-memory benchmark
This is a shared memory version of the well known benchmark from 1991, originally a serial
benchmark for vector supercomputers. It is biased towards raw vector performance, which often
happens to coincide with typical Fortran based programs solving vector expressed problems. The
implementation is based on OpenMP for the threading. As this is loop and data based, scaling is
an issue. One might expect that OpenMP threading will limit the scaling and hence the attainable
performance; on the other hand these problems are represented by vector operations which Fortran
handles very well and which map nicely onto vector units. In addition, the very high memory
bandwidth of the MIC architecture is beneficial.
Only a few kernels have been selected; table 2 shows them. The Fourier transform is an interesting
one, as it exhibits poor scaling and is an example of the challenges one might encounter when porting
code to Xeon Phi.
EuroBen kernel    Operation
Mod 1a Kernel 2   Vector copy – y(i) = x(i), i=1,n
Mod 1a Kernel 8   DAXPY – y(i) = y(i) + const*x(i), i=1,n
Mod 2am           Matrix multiplication – C(m,n) = A(m,l)*B(l,n)
Mod 2b            Full linear solver – Ax = b
Mod 2f            FFT – 1d, complex to complex transform

Table 2: Selected EuroBen kernels for evaluation of the Xeon Phi system.
Only the supplied Fortran source code is used in these tests. Several of the kernels could benefit
from MKL, but this is not what is under test in this run; evaluation of MKL performance is
another benchmark task.
Prefetch is left mostly to the programmer and the compiler. The optimal settings can sometimes be
hard to find. Figure 24 illustrates the effect of how many loop iterations into the future to prefetch.
Fetching too much data might saturate the L2 cache or the TLB. At the same time it is worth noting
that loop unrolling is not always productive on this architecture. Use unrolling with caution.
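A sketch of a build combining an explicit prefetch distance with unrolling disabled (the distance 16,4 is one of the values tested in figure 24; the source file name is hypothetical and flag spellings may vary between compiler versions):

ifort -mmic -O3 -openmp -opt-prefetch-distance=16,4 -unroll=0 -o mod2b.x mod2b.f90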
Figure 24: EuroBen Mod2b performance (Gflops/s) for different compiler prefetch settings (default, off, and distances 2, 4, 4,2, 4,4, 8, 12, 16, 16,2, 16,4, 16,8, 32 and 64,32). The distances are given in loop iterations; the first number is the prefetch distance into L2 while the second is the distance into L1. If not given, the compiler tries to guess an optimal value.
What effect can be expected as a result of tuning the prefetch settings with the compiler? Figure 25
shows the possible gains that can be obtained for simple benchmark kernels.
Figure 25: Effect of tuning the prefetch distance using compiler options for the EuroBen kernels (Vector Copy, DAXPY, Matmul, Solver, FFT), relative to the default software prefetch. Distance is in loop iterations. The unroll option might upset the loop content and must be used with caution together with the prefetch distance setting.
Tuning prefetch settings is a manual, tedious process which is not always easy. For some cases,
like the FFT here, the process is in fact counterproductive. Care must be taken during the
process, and for each step one must measure the performance. When it all fits together, quite good
performance increases are possible.
Placement of the threads is also very important. The three models, compact, scatter and
balanced, yield different performance. The placement can sometimes be guessed, while other
problems require it to be tested. Figure 26 shows the performance effect of the different
placements.
Figure 26: Effect of placements (compact, scatter, balanced) on the EuroBen kernels. The best performance for each test is used; performance varies with the number of scheduled threads per core.
How does the performance of the Intel Xeon Phi compare with Sandy Bridge for this kind of
benchmark? The goal is always to beat SB; if not, there would be nothing to gain from porting to the
Intel Xeon Phi. Figure 27 illustrates the performance difference.
Figure 27: Relative performance of Intel Xeon Phi compared with Sandy Bridge for the tested EuroBen kernels (Vector Copy, DAXPY, Matmul, Solver, FFT). For most of the kernels the Xeon Phi performs well. The FFT kernel scales poorly and consequently does not benefit from the large number of cores in the Xeon Phi.
Vector copy and DAXPY perform very well. Most of this is probably an effect of the very high
memory bandwidth.
The FFT kernel does not scale very well in this OpenMP implementation; performance on the Xeon
Phi is only a fraction of the Sandy Bridge performance. This is an example of the kind of
performance issues one might come across when porting code to Xeon Phi. One possible solution is
to rewrite the routine to take advantage of the MKL FFT functions, which incidentally include an
interface to FFTW. The mod2f kernel is not easily adapted to FFTW or MKL, so this is not really an
option for this kernel.
HYDRO benchmark
HYDRO is a much used benchmark in the PRACE community; it is extracted from a real code
(RAMSES, a computational fluid dynamics code). Being widely used, it has been ported to
a number of platforms. The code exists in many versions: Fortran 90, C, CUDA, OpenCL, as well as
serial, OpenMP and MPI versions of these. Some versions have been instrumented with
performance counters to calculate the performance in Mflops/s.
The instrumented version is a Fortran 90 version, and both the OpenMP and MPI variants of it have
been used for the evaluation.
How well this architecture scales using MPI, or a hybrid mode using both MPI and OpenMP, is of
interest in the initial testing of HYDRO. Unfortunately the hybrid model does not provide
performance numbers in Mflops/s, so run time is used to measure scaling.
Figure 28: HYDRO scaling (Mflops/s) using a pure MPI model solving the 2d problem, for 1 to 240 ranks.
Figure 28 shows the scaling using a pure MPI implementation of HYDRO. The scaling seems quite
good up to the point where 2 or more ranks are scheduled per core. The placements in this test are
the defaults; better tuning of the placements might provide slightly better results.
For the hybrid models the implementations of HYDRO do not yield performance numbers, so run
times are taken as indicators and a scaling factor is calculated. As there are no runs with only one
core, perfect scaling is assumed from 1 to 4 cores, i.e. a scaling factor of 4 for the 4 core run. Figure
29 shows the obtained scaling using 2, 3 and 4 OpenMP threads per MPI rank. Scaling is quite good
up to 3 threads per rank, i.e. 180 cores, but the highest performance was measured using all 240
threads.
Figure 29: Scaling using a hybrid MPI/OpenMP model solving the 2d problem. Runs are performed using 2, 3 or 4 OpenMP threads per MPI rank; the core count is the total number of cores employed. The speedup using 4 cores is assumed to be perfect relative to a single core.
How does the Xeon Phi performance stand up to Sandy Bridge? Figure 30 shows that Sandy Bridge
outperforms the Xeon Phi in Mflops/s; the difference is 23% higher performance for Sandy Bridge.
It might be possible to do some more tuning of the HYDRO code to reduce this gap.
Figure 30: HYDRO pure MPI implementation (Mflops/s) solving the 2d problem using Xeon Phi and Sandy Bridge.
For the hybrid case the picture is about the same. Figure 31 shows the difference in run times, where
lower is better in this case. Again there is a significant difference in performance in Sandy Bridge's
favor.
Figure 31: HYDRO hybrid MPI/OpenMP implementation solving the 2d problem using Xeon Phi and Sandy Bridge. Performance is given by run times where lower is better.
It might be possible to reduce or even close this gap with careful tuning of the HYDRO code on
Xeon Phi, but currently Sandy Bridge holds the record for this benchmark.
Vienna Ab-initio Simulation Package, VASP
This well known and widely used software package is a candidate to run on any accelerated system
due to its popularity. It is known to be painful to build and to run properly. It is very sensitive to
compiler settings and compiler optimization; several routines must be compiled without any
optimization settings, and some versions of the compilers must be avoided. Not an optimistic starting
point for compilation for Xeon Phi, nor for optimizing it to run on many cores. The application is MPI
based, so we need to run MPI on a shared memory system, which is also not a good starting point: not
all shared memory communications in MPI are implemented in the most optimal way, and most
probably not for Xeon Phi, as this is a rather new architecture. As the only software available is Intel
based, both the Intel MPI and the Intel compilers/libraries must be used. In addition, the current
release of the Intel compiler runs into some internal issues when invoking higher optimization, e.g.
-O3. The compilation also takes very much longer with -O3; the compiler seems to struggle with the
high level optimizer it uses internally. Hence not all routines could be compiled with -O3. Not a good
sign for sensitive code like VASP. The -O3 level does a lot of aggressive loop transformations such as
fusion, block-unroll-and-jam, and collapsing IF statements. This might be the cause of the VASP
failures at this level of optimization. -O2 not only gives a more stable executable, but also a faster one.
Convergence often seems to be upset by the highest level of optimization. Selecting a more precise or
strict floating point model is a possible countermeasure, or just stick to -O2.
VASP version 5.3.3 is used in this test, with a Bismuth oxide benchmark. This benchmark has
been used for a considerable number of tests with our VASP installation. It is also small enough to
fit in the limited memory of the Xeon Phi system.
Compilation is quite simple to initiate: just change the flags somewhat to instruct the compiler to
generate MIC code, and change include and library paths to mic instead of intel64. The FFTW
wrappers need to be rebuilt for the MIC architecture, which is a simple task. Once the binary executable
is built, tests are needed to verify that the settings etc. have produced a working executable. This is
often not the case, as VASP is known to be sensitive to all kinds of very minute facets of the middleware.
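A hypothetical link line illustrating the kind of change needed (the MKL coprocessor libraries are assumed to live under lib/mic instead of lib/intel64; the actual VASP makefile spreads this over several variables and the exact MKL library list depends on the version installed):

mpiifort -mmic -O2 -openmp -o vasp *.o -L$MKLROOT/lib/mic -lmkl_intel_lp64 -lmkl_sequential -lmkl_core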
The scaling of the VASP code solving the input in question is part of the initial exploration. If
sufficient scaling is measured, one can assume that most of the performance of the Xeon Phi can be
harvested. The placement of processes using MPI is somewhat different from that of OpenMP. It
looks like Intel MPI places the ranks in a scatter fashion and does not place the ranks the way the
system does when running OpenMP. It is possible to prepare a special rank file and pin each rank to a
certain core; this has not been done during these tests. By default the Intel mpirun enables process
pinning. Figure 32 shows how the speedup changes with the number of cores used. It scales quite well
up to 60 cores, which is one rank per core. When scheduling more than one rank per core the
performance drops. As the MPI ranks are independent, it might well be that the amount of data
needed becomes too much for the L2 cache, the TLB or even the memory bandwidth.
Figure 32: VASP scaling for the Bi2O3 input (pure MPI run). Due to memory limitations only up to 140 ranks could be run. Perfect scaling is assumed from a single core to 10.
The VASP code is a pure MPI code, and a hybrid model is not possible.
However, VASP uses quite a bit of linear algebra and FFT. These functions are part
of the MKL library, and this library is available in a threaded version; consequently a semi-hybrid
model can be built. Even if mpirun pins processes by default, this only affects the actual MPI
ranks, not the OpenMP threads spawned by MKL. The placement of the threads has an impact on
performance. Figure 33 shows tests run with a multithreaded MKL library and different placement
models. It clearly shows that two threads per MPI rank yields the best result for this run.
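A sketch of how such a semi-hybrid run can be launched, mirroring the 60 ranks with 2 MKL threads case below (the executable name and the affinity value are examples):

export MKL_NUM_THREADS=2
export KMP_AFFINITY=granularity=fine,compact
mpirun -np 60 ./vasp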
Figure 33: Effect of thread placement (compact, scatter, balanced) in semi-hybrid VASP runs. In all cases 60 MPI ranks are started; in addition MKL is threaded and allowed to use 2, 3 or 4 threads, giving 2, 3 or 4 threads per core. Performance is given in run time where lower is better.
However, tests show (figure 34) that performance is better using a serial MKL with only one MPI
rank per core.
Figure 34: Comparison of VASP (Bi2O3 input) run with a serial MKL and a threaded MKL. The effect of using multiple MKL threads per MPI rank seems counterproductive. Performance is again given in run time, where lower is better.
For this application it has proven very hard to beat Sandy Bridge on performance. This comes as
no surprise, as VASP does not scale with all inputs; the Bismuth oxide case might be one of the inputs
that does not scale very well. This example, figure 35, illustrates the fact that even for MPI programs
the scaling of the application must continue to at least the number of cores on the Xeon Phi.
Figure 35: VASP performance, Xeon Phi vs Sandy Bridge, Bi2O3 benchmark (run time in seconds). It is very hard to attain performance comparable to that of the Sandy Bridge processor. The reason is twofold: MPI over shared memory is not always very efficient, and the VASP application exhibits poor scaling with this input.
Program MARK
Program MARK provides parameter estimates from marked animals when they are re-encountered
at a later time. Program MARK computes the estimates of model parameters via numerical
maximum likelihood techniques. The FORTRAN program that does this computation also
determines numerically the number of parameters that are estimable in the model, and reports its
guess of one parameter that is not estimable if one or more parameters are not estimable. The
number of estimable parameters is used to compute the quasi-likelihood AIC value (QAICc) for the
model.
This application is used by bio-scientist users at Abel, making it an interesting test case.
The application was straightforward to compile and build on the MIC architecture; the Intel compilers
and libraries provide all that is needed, both BLAS and LAPACK.
The scaling of the application varies with the input data set. One input with suitable run
times was selected for the test.
Figure 36: MARK application scaling solving the Billbig input case, for 15 to 240 cores. Performance is measured in run times.
Figure 36 shows the scaling of run times when increasing the number of OpenMP threads. The
optimum number of threads seems to be 2 threads per core, yielding a total thread count of 120. Quite
nice scaling for a program developed years ago for Windows XP. The code is a well written
OpenMP Fortran code that conforms to the standard.
Unfortunately the performance is inferior to that of a Sandy Bridge processor, as figure 37 shows.
Figure 37: MARK application solving the Billbig input using Xeon Phi and a Sandy Bridge compute node. Performance is given in run times.
This example illustrates that even though it might be very easy to compile, build and install an
application on a Xeon Phi system, it is not always beneficial with respect to performance.
Performance evaluation: Offload model
MKL enabled offload
Intel provides support for automatic usage of the Xeon Phi as a co-processor by means of automatic
offloading of work in MKL routines. This is a very simple way to exploit the MIC co-processor:
no recompile, just set some environment variables and run. Presently, unfortunately, only a very
limited set of MKL functions and routines has been offload enabled. Only the level 3 BLAS
functions ?GEMM, ?TRMM and ?TRSM are offload enabled at the time of writing, which currently
limits the usage to a few special cases. It should come as no surprise that the ones used in HPL are
enabled. The most common of these is the dense matrix matrix multiplication, dgemm (the s, c, d and
z variants are supported). Usage is very simple, as shown in the following example:
time_start=mysecond()
call dgemm('n', 'n', N, N, N, alpha, a, N, b, N, beta, c, N)
time_end=mysecond()
write(*,fmt=form) &
"dgemm end, timing :",time_end-time_start," secs, ",&
ops*1.0e-9/(time_end-time_start)," Gflops/s"
This f90 code is all it takes to do A*B => C. All the magic is done by MKL behind the scenes.
Compiling is equally simple:
ifort -o dgemm-test.x -mcmodel=medium -O3 -openmp -xAVX -mkl dgemm-test.f90
MKL is very flexible and can be instructed to use multiple threads on the host system and offload
some of the work to the co-processors, the Intel Xeon Phis, or MIC for short. MKL also provides
variables and functions to set the fraction of work to be offloaded from the host processor to the
coprocessors, in addition to automatic load balancing; some quite interesting results are shown in
the figures below. The following table gives a selection of the environment variables that control the
offload. Some of these are also available as functions to be called from the program that utilizes
offloading.
Environment variable      Function
MKL_MIC_ENABLE            Enables Automatic Offload.
OFFLOAD_DEVICES           List of offload devices.
MKL_MIC_WORKDIVISION      Specifies the fraction of work to do on all the Intel Xeon Phi coprocessors on the system, including auto.
OFFLOAD_REPORT            Specifies the profiling report level for any offload.
MIC_LD_LIBRARY_PATH       Search path for coprocessor-side dynamic libraries.

Table 3: Relevant environment variables for offload.
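A minimal sketch of setting these variables before running an unmodified binary (the work division of 0.8, i.e. 80% of the work on the coprocessors, is just an example):

export MKL_MIC_ENABLE=1
export MKL_MIC_WORKDIVISION=0.8
export OFFLOAD_REPORT=2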
Figure 38: MKL dgemm automatic offload using a single Xeon Phi card and both SB host processors, for three matrix footprints (2288, 20599 and 57220 MiB), as a function of the percentage offloaded to the mic (auto, 0, 50, 80, 90, 100).
Figure 39: MKL dgemm automatic offload using two Xeon Phi cards and both SB host processors, for the same three matrix footprints, as a function of the percentage offloaded to the mics.
Figures 38 and 39 show the performance measured when performing dense matrix matrix multiplication
using MKL and automatic offloading. Nothing extra is done beyond what is shown in the Fortran code
example above; only some environment variables are set to achieve this performance. It is quite
remarkable how well the automatic load balancing in MKL's run time system works. It is actually
quite hard to beat the automatic load partitioning between the mic processor and the host processor.
Looking at the actual numbers there is reason to be really impressed. A normal compute node
can perform at about 320 Gflops/s when doing dgemm. With two installed Xeon Phi cards with mic
processors this performance clocks in at 1694 Gflops/s, or 1.7 Tflops/s per compute node (only 16
racks for a Petaflop/s).
User function offload
The Intel compilers have support for offloading user defined functions or regions inside a program
onto the MIC co-processor. This is a bit more complex than just calling MKL routines.
One can use regions in a program or write a complete function or subroutine to be compiled and run
on the co-processor. In both cases the code marked for offload will be cross compiled for the mic
architecture. The run time system will launch the code on the MIC processor, and data is
exchanged with yet another set of run time functions. Many combinations are possible:
overlap of data transfer, load balancing between the host processor and the co-processor etc.
However, all of this is left to the programmer. This makes the co-processor a little bit
harder to use in production, where no or very few changes to the source code are wanted.
Offloading regions in a program
This is the simplest solution, where directives instructing the compiler to generate offload code are
just inserted into the program code. The following code shows an example of how a nested do loop
is effectively offloaded from the host processor to the mic co-processor.
!dir$ offload begin target(mic) in(a,b) out(c)
!$omp parallel do private(j,l,i)
      do j=1,n
         do l=1,n
            do i=1,n
               c(j,l)=c(j,l) + a(i,l)*b(i,l)
            enddo
         enddo
      enddo
      nt=omp_get_max_threads()
#ifdef __MIC__
      print*, "Hello MIC threads:",nt
#else
      print*, "Hello CPU threads:",nt
#endif
!dir$ end offload
The offloaded part of the code is executed on the MIC co-processor using an environment either set
up by the system or by the user via environment variables. Both the number of threads and the thread
placement on the mic processor can be controlled in this way. During the time the offloaded code runs
on the co-processor, the host processors are idle. To achieve load balancing the user must
explicitly program the work partition.
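A sketch of controlling the coprocessor environment from the host, assuming the offload runtime's MIC_ prefix convention for forwarding variables to the card (the values are examples):

export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=240
export MIC_KMP_AFFINITY=granularity=fine,balanced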
Offloading functions or subroutines
This approach makes usage of offloaded code simpler and is the more common way of
programming. The user writes a complete function or subroutine to be offloaded. Using this routine
is straightforward: it is called just as any other function, with the only addition that data must be
handled. Data transfer between the two memories must be initiated explicitly; this data transfer can
overlap with other workload on the host processor, hiding the latency of the transfer.
To follow the example above, this piece of code can easily be put into a subroutine.
!dir$ attributes offload : mic :: mxm, omp_get_max_threads
      subroutine mxm(a,b,c,n)
        use constants
        integer :: n
        real(r8),dimension(n,n) :: a,b,c
        integer :: i,j,l,nt
!$omp parallel do private(j,l,i)
        do j=1,n
           do l=1,n
              do i=1,n
                 c(j,l)=c(j,l) + a(i,l)*b(i,l)
              enddo
           enddo
        enddo
        nt=omp_get_max_threads()
#ifdef __MIC__
        print*, "Hello MIC threads:",nt
#else
        print*, "Hello CPU threads:",nt
#endif
      end subroutine mxm
This routine will now be compiled to an object file suitable for execution on the MIC co-processor.
It can be called as any other routine, but data transfer must be accommodated.
The calling program needs to arrange the transfers:
time_start=mysecond()
!dir$ offload_transfer target(mic:0) in( a: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:0) in( b: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:0) in( c: alloc_if(.true.) free_if(.false.) )
!dir$ offload target(mic:0) in(a,b: alloc_if(.false.) free_if(.false.)) out(c: alloc_if(.false.) free_if(.false.))
call mxm(a,b,c,n)
time_end=mysecond()
The subroutine call will block using this construct. In order to utilize both host and co-processor
resources, concurrency and synchronization need to be introduced. However, the above setup works
very well for testing and timing purposes. Figure 40 shows the performance measured when comparing
the host cpus and the co-processor running Fortran 90 code with OpenMP threading of three nested loops
doing matrix multiplication in double precision. It is far inferior to the MKL library, but serves as an
illustration of what can be expected using user Fortran 90 code. No special optimization has been
performed, only compiler flags like -O3 and -mavx in addition to -openmp.
Figure 40: MxM offloading, Fortran 90 code, double precision. Comparing the host processors (Sandy Bridge, 16 threads) with the Xeon Phi mic co-processor for matrix footprints from 2288 to 6614 MiB. A single mic processor using 240 threads and scatter placement is used.
Load distribution and balancing, host cpus and co-processors (single and multiple)
It is relatively straightforward to set up concurrent runs, workload distribution and ultimately load
balancing between the host cpus and the Xeon Phi mic processors. However, all of the
administration is left to the programmer. Since there is no shared memory, the work and memory
partition must be explicitly handled. Only the buffers used on the co-processors need to be
transferred, as memory movement is limited by the PCIe bus bandwidth. There are mechanisms for
offload transfer and semaphores for synchronising both transfer and execution. All of this must be
explicitly handled by the programmer. While each part is relatively simple, it can become quite
complex when trying to partition the problem while load balancing. Some examples
below will try to illustrate this.
Filling the matrices for the co-processors:
am0(:,:)=a(1:m,:)
am1(:,:)=a(m+1:2*m,:)
bm0(:,:)=b(1:m,:)
bm1(:,:)=b(m+1:2*m,:)
Initiate data transfer:
!dir$ offload_transfer target(mic:0) in( am0: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:0) in( bm0: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:1) in( am1: alloc_if(.true.) free_if(.false.) )
!dir$ offload_transfer target(mic:1) in( bm1: alloc_if(.true.) free_if(.false.) )
Variables for each co-processor have been declared and allocated. These are 1/3 of the size of the
total matrices held in the host memory. Each compute element (the host SB processors, mic0 and
mic1) does 1/3 of the total calculation. There is no dynamic load balancing; the split is fixed at 1/3 each.
Calling the offloading subroutine:
      time_start=mysecond()
!dir$ offload target (mic:0) in(am0,bm0) out(cm0) signal(s1)
      call mxm(am0,bm0,cm0,m,n)
!     print *,"cm0",cm0(:,:)
!dir$ offload target (mic:1) in(am1,bm1) out(cm1) signal(s2)
      call mxm(am1,bm1,cm1,m,n)
!     print *,"cm1",cm1(:,:)
      kc=2*m+1
!$omp parallel do private(j,l,i)
      do j=kc,n
         do l=1,n
            do i=1,n
               c(j,l)=c(j,l) + a(i,l)*b(i,l)
            enddo
         enddo
      enddo
      nt=omp_get_max_threads()
#ifdef __MIC__
      print*, "Hello MIC threads:",nt
#else
      print*, "Hello CPU threads:",nt
#endif
Here we wait for the co-processors if they have not yet finished, with one semaphore for each co-processor:
!dir$ offload_wait target(mic:0) wait(s1)
!dir$ offload_wait target(mic:1) wait(s2)
Copy the data received from co-processors back in matrices on host memory:
! Put the parts computed on the mics into the sub matrices of c.
      c(1:m,:)     = cm0(:,:)
      c(m+1:2*m,:) = cm1(:,:)
!     c(2*m+1:3*m,:) is already in c, nothing to do.
time_end=mysecond()
The amount of work is about evenly distributed, with just a little time spent waiting for another
compute element to finish its work. In a production run the load balancing would be set up dynamically
and hence a better load balance obtained. However, this is left to the programmer and requires
detailed knowledge of the workload. Again we see that the actual programming is easy, but the
administration of the workload can be complex. One advantage is that we have more memory
and can attack bigger problems without having to run offload parts in series with a chunk of data for
each run. This must be done if the problem we want to offload exceeds the 8 GiB of memory currently
installed on the Xeon Phi cards.
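A minimal sketch of such chunked offloading, reusing the staging buffers and the mxm routine from the example above (nblk is hypothetical and must be chosen so that one block fits on the card; am0, bm0 and cm0 are assumed allocated with m rows):

      ! Process the matrices in nblk row blocks, one block on the card at a time.
      nblk = 4
      m = n/nblk                            ! rows per block, assuming nblk divides n
      do k = 0, nblk-1
         am0(:,:) = a(k*m+1:(k+1)*m,:)      ! stage one block of a and b
         bm0(:,:) = b(k*m+1:(k+1)*m,:)
!dir$ offload target(mic:0) in(am0,bm0) out(cm0)
         call mxm(am0,bm0,cm0,m,n)
         c(k*m+1:(k+1)*m,:) = cm0(:,:)      ! copy the result block back into c
      enddo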
Figure 41 shows the aggregated performance obtained when using both SB processors and both mic
processors in the Xeon Phi cards running the triple nested do loops programmed in Fortran 90.
[Chart: MxM offloading with load balancing. Total node performance (Gflops/s) versus the memory footprint of the matrices (2288 to 17944 MiB). Fortran 90, OpenMP, double precision.]
Figure 41: Aggregated single node performance, two Sandy Bridge processors using 16 threads each and two
Xeon Phi mic co-processors. Fortran 90 code using OpenMP threading. The mic processors use
240 threads each and scatter placement.
Intel provides the tools needed to do the job, but the programmer has to handle all the details. It
must be noted that this is vastly simpler than doing similar programming using a GPU. Here the
exact same Fortran 90 code is used for both processor classes, Sandy Bridge/x86-64 and Xeon
Phi/mic. Offloading is a very easy way to start accelerating your code. It might not utilize the
co-processors' full potential, but any speedup obtained with a small effort is worth harvesting.
Figures 42 and 43 show a comparison between a standard compute node and an accelerated node
performing naïve nested do loops in f90 to multiply two matrices. This kind of performance and
speedup is about what you could expect to get in a production setting where common f90 code is
compiled and run. The figures show that the memory footprint matters.
[Chart: Standard compute node vs. accelerated node, Fortran 90 matrix multiply code. Performance (Gflops/s) for the compute node and the accelerated node versus the memory footprint of the matrices (2288 to 17944 MiB).]
Figure 42: Comparing performance between a standard compute node with SB processors
and an accelerated node with two SB processors and two Xeon Phi cards.
[Chart: Speedup of the accelerated node versus the standard node, two Xeon Phi cards in the accelerated node. Performance speedup (from 1 to about 3) versus the memory footprint of the matrices (2288 to 17944 MiB).]
Figure 43: Speedup measured comparing a standard Abel compute node with an
accelerated node with two Xeon Phi cards installed. Fortran 90 simple nested do
loop matrix multiplication.
The workshare balancing between the two Xeon Phi cards and the host memory seems to benefit
strongly from larger workloads placed on the co-processors. At a speedup of two the node
performance is in effect doubled. This was measured using normal f90 code with minimal changes
to the actual code, just setting up the load balancing and workshare. Any programmer should be able
to achieve this kind of performance with minimal effort.
How to reach performance close to theoretical?
Property                                     Value
Core frequency                               1.05 GHz
Number of cores                              60
Vector width                                 512 bits / 8 double / 16 single precision reals
Fused Multiply Add instruction flops gain    2 times

Table 4: Properties of the MIC architecture.
Theoretical performance: 1.05 × 10^9 s^-1 (clock) × 60 cores × 8 (double precision vector lanes) × 2 (FMA) = 1008 Gflops/s, i.e. about 1 Tflops/s.
Another theoretical measure is 16.8 Gflops/s per core.
Scaling
With 60 cores and 240 threads, getting OpenMP to scale well at this core count is far from trivial.
It is well known that OpenMP normally does not scale well beyond 8-16 threads without special
measures. There are 60 cores, each with two pipelines: one can issue to both the vector unit and the
scalar unit, while the other can only issue to the scalar unit, see figure 4. Because of this a single
thread cannot issue a vector instruction on every clock cycle; a minimum of two threads per core is
needed to do so. This brings the number of threads the application needs to scale to up to 120. Very
careful use of OpenMP is required to scale to 120 threads. In addition, the synchronisation of the
threads places a burden on the cache coherency machinery, which can saturate the ring interconnect
and the memory bandwidth.
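The thread counts and the scatter placement quoted in the figure captions are controlled through environment variables. The lines below are a sketch of typical settings; the MIC_ prefixed variants used to forward settings to the offloaded regions are stated here as an assumption, not as a quote from the actual job scripts:

export OMP_NUM_THREADS=16          # threads on the host Sandy Bridge processors
export MIC_ENV_PREFIX=MIC          # forward MIC_ prefixed variables to the coprocessor
export MIC_OMP_NUM_THREADS=240     # threads on the Xeon Phi
export MIC_KMP_AFFINITY=scatter    # scatter placement over the 60 cores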
One can run MPI only or hybrid models, as MPI tends to scale better than OpenMP. The shared
memory device in MPI is, however, generally not as good as one might expect, which limits
scalability. For applications that do not scale very well it is still not easy to get the desired
performance, see figures 33 and 30. Beating Sandy Bridge is still very hard.
Vector unit
In order to attain full performance, the 512-bit vector, holding 8 double precision numbers, needs
to be fully populated for each vector instruction issued. Failing to achieve this results in a
performance of N/8 of peak, where N can be as low as one, yielding only 125 Gflops/s. Only
carefully laid out vector and matrix data sets map easily onto this kind of vector instructions, which
limits the general applicability of the processor. For Fortran 90 vector layout and loop constructs
this can be a good match; for less structured problems it is not.
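As a small sketch, the kind of loop that maps cleanly onto the 512-bit vector unit is a unit-stride loop without dependencies, so that every vector instruction operates on eight double precision lanes; the variable names are illustrative only:

      real(8) :: x(4096), y(4096), s
      integer :: i
!dir$ simd
      do i = 1, 4096
         y(i) = s*x(i) + y(i)       ! unit stride, all 8 double precision lanes filled
      enddo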
Fused multiply add instruction
This is a special instruction for multiplication and addition/accumulation which does both the
multiply and the addition in one instruction in one clock cycle (review the paragraphs on page 10
and figure 7), in practice doubling the theoretical performance of the chip. The question remains how
often one can fill a vector and use the FMA instruction at the same time. It works fine for the
selected problem of matrix multiplication, which is a major part of the top500 benchmark. The
matrix-matrix multiplication is consequently the major benchmark used to show the outstanding
performance of the MIC architecture. Both the vector unit and the FMA instruction pose a challenge
for any programmer hoping to get maximum performance.
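The multiply-accumulate pattern that the compiler can map onto an FMA instruction is exactly the innermost statement of the matrix multiplication kernel. A sketch, with illustrative names and the loops ordered so that the innermost index is unit stride in both c and a:

      do j = 1, n
         do l = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,l)*b(l,j)   ! multiply and add fused into one FMA
            enddo
         enddo
      enddo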
Appendix
Node configuration
Compute node       Specification
Vendor             Megware, Myriquid / Supermicro
Mainboard          Supermicro X9DRT
Processor          2 x Intel Sandy Bridge, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 8 cores
L2 cache           8-way Set-associative 2048 kB Write Back
L3 cache           20-way Set-associative 20480 kB Write Back
Memory             8 x Samsung DDR3 Registered 8192 MB, 1600 MHz
InfiniBand         Mellanox ConnectX-3 FDR
OS                 CentOS release 6.4 (Final), later upgraded to 6.6
Compilers          Gcc 4.8.0, 4.8.2 and 4.9.2 / Intel 2013.x and 2015.1
MPI                Intel MPI 4.1.3 and 5.0.0
Math library       Intel MKL 2013.3 and 2015.1

Table 5: Node configuration, standard Abel compute node.
Host node          Specification
Vendor             Megware, Myriquid / Supermicro
Mainboard          Supermicro X9DRG-HF
Processor          2 x Intel Sandy Bridge, Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 8 cores
L2 cache           8-way Set-associative 1024 kB Write Back
L3 cache           20-way Set-associative 10240 kB Write Back
Memory             8 x Samsung DDR3 Registered 16384 MB, 1600 MHz
InfiniBand         Mellanox ConnectX-3 FDR
Phi accelerator    2 x Xeon Phi 5110P, device 2250, 60 cores (240 threads) @ 1.05 GHz
Phi memory         8 GiB GDDR5 memory per card, 16 GiB in total
Phi OS             Busy Box kernel 2.6.38.8-gefd324e
OS                 CentOS release 6.4 (Final), later upgraded to 6.6
Compilers          Gcc 4.8.0, 4.8.2 and 4.9.2 / Intel 2013.x and 2015.1
MPI                Intel MPI 4.1.3 and 5.0.0
Math library       Intel MKL 2013.3 and 2015.1

Table 6: Node configuration, Xeon Phi accelerated Abel compute node.
References
Background:
http://www.drdobbs.com/parallel/programming-intels-xeon-phi-a-jumpstart/240144160

NVIDIA web site about applications:
http://www.nvidia.co.uk/object/gpu-computing-applications-uk.html
http://www.nvidia.co.uk/object/bio_info_life_sciences_uk.html

Porting of VASP to support GPUs:
http://www.ncbi.nlm.nih.gov/pubmed/22903247

ACEMD:
http://www.acellera.com/products/acemd/

HYDRO:
http://www.prace-ri.eu/IMG/pdf/porting_and_optimizing_hydro_to_new_platforms.pdf

MARK:
http://warnercnr.colostate.edu/~gwhite/mark/mark.htm

Notes on optimization:
http://software.intel.com/en-us/articles/step-by-step-optimizing-with-intel-c-compiler
-O1/-Os
This option enables optimizations for speed and disables some optimizations that increase code size
and affect speed. To limit code size, it enables global optimization, which includes data-flow
analysis, code motion, strength reduction and test replacement, split-lifetime analysis, and
instruction scheduling. It also disables inlining of some intrinsics. If -O1 is specified, -Os is enabled
by default. With -O1 the compiler auto-vectorization is disabled. If your application is sensitive to
code size, you may choose the -O1 option.
-O2
This option enables optimizations for speed and is the generally recommended optimization level.
Compiler vectorization is enabled at -O2 and higher levels. With this option, the compiler performs
some basic loop optimizations, inlining of intrinsics, intra-file interprocedural optimization, and the
most common compiler optimization techniques.
-O3
Performs -O2 optimizations and enables more aggressive loop transformations such as fusion,
block-unroll-and-jam, and collapsing of IF statements. The -O3 optimizations may not improve
performance unless loop and memory access transformations take place, and may in some cases
slow down code compared to -O2. The -O3 option is recommended for
applications that have loops that heavily use floating-point calculations and process large data sets.
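As a hypothetical illustration, and not a restatement of the exact build lines used for the runs in this report, an Intel compile line for the host (with offload directives) and a native build for the coprocessor could look like:

ifort -O3 -openmp mxm.f90 -o mxm              # host build, offload directives handled by the compiler
ifort -O3 -openmp -mmic mxm.f90 -o mxm.mic    # native build for the Xeon Phi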
Notes on MKL:
http://software.intel.com/en-us/articles/intel-mkl-automatic-offload-enabled-functions-for-intel-xeon-phi-coprocessors