OpenPOWER ABI Performance Improvements
Ian McIntosh
November 5, 2014
CASCON 2014
www.ibm.com
© 2006-2008 IBM Corporation
Abstract
OpenPOWER Linux ABI changes to improve performance
Many of the differences between the new OpenPOWER Linux ABI
and the older AIX and Linux on PowerPC ABIs are designed to
improve the performance of calls, parameter passing, result
returning and static memory access. They also reduce memory
usage. Some of the approaches are novel.
OpenPOWER
 OpenPOWER is a new partnership led by IBM, with 68 others.
 See www.openpowerfoundation.org and
www.en.wikipedia.org/wiki/OpenPOWER_Foundation.
 IBM will openly license PowerPC CPU designs (not just the
architecture) to their partners.
 The minimum PowerPC hardware architecture level is Power8.
 Nvidia GPU processors integrated with PowerPC speed some
programs.
OpenPOWER ABI
 The OpenPOWER Linux on PowerPC variant includes a new
ABI (Application Binary Interface).
 An ABI defines the interfaces between program components;
e.g., how calls are made, how parameters are passed, how
results are returned, what object files and debug information
look like, etc.
 This ABI is 64-bit mode only and Little Endian (LE) only.
 Some of it resembles the 64 bit Big Endian ABI and the AIX
ABI, parts resemble the 32 bit Big Endian ABI, but some
important parts are new and different, some very different.
 Most of the changes are to improve performance.
OpenPOWER ABI
 The object file format is ELF (the same as Big Endian Linux on
PowerPC), unlike AIX, which uses XCOFF.
 The debug format is DWARF4.
 The default code model is medium, with each library or
the main module (executable) limited to 4 GB (a 32 bit
displacement).
– There are also small and large code models.
– Currently the XL compilers only support the medium model.
 Most changes affect only compiler back ends not front ends,
and not user application programming.
OpenPOWER ABI - Addressing
 Pointers are 64 bits.
 For the medium code model, normally two instructions (add
immediate shifted, then load / store / add immediate) are used
to access static memory (function instructions, the GOT / TOC,
some extern variables, static variables and the constant area).
Each provides 16 bits of the displacement.
 A GOT (Global Offset Table) holds pointers to extern functions
and some extern / static data used by a module.
 A TOC (Table of Contents) contains the GOT and also may
contain directly accessible data.
 Other static memory might be accessed indirectly via the GOT
/ TOC, directly (if not visible outside the library) or via an
absolute address (main module only, and only up to 2 GB).
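The two-instruction access described above can be illustrated by splitting a 32 bit displacement into its two 16 bit immediates. This is only a sketch in C, not compiler output; the names disp_parts, split_displacement and join_displacement are hypothetical. It assumes both immediates are sign-extended by the hardware, so the upper part is adjusted by one whenever the lower part is negative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two immediates of an addis + load/store pair. */
typedef struct {
    int32_t high; /* immediate for addis (contributes high * 65536) */
    int32_t low;  /* signed 16 bit displacement of the load / store / addi */
} disp_parts;

/* Split disp so that high * 65536 + low == disp, with low in [-32768, 32767]. */
disp_parts split_displacement(int32_t disp)
{
    int64_t d = disp;
    int64_t low = ((d & 0xFFFF) ^ 0x8000) - 0x8000; /* sign-extend the low 16 bits */
    disp_parts p;
    p.low = (int32_t)low;
    p.high = (int32_t)((d - low) / 65536); /* exact: d - low is a multiple of 65536 */
    /* Displacements above 0x7FFF7FFF would need a high part that does not fit. */
    assert(p.high >= -32768 && p.high <= 32767);
    return p;
}

/* Recombine the two parts the way the address calculation does. */
int64_t join_displacement(disp_parts p)
{
    return (int64_t)p.high * 65536 + p.low;
}
```

For example, a displacement of 0x12348000 splits into a high part of 0x1235 and a low part of -32768, because the low half is treated as negative.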
OpenPOWER ABI Major Changes
 Major Changes:
– Accessing static and some extern variables is faster.
– How calls are made is normally faster.
– How parameters are passed is sometimes faster.
– How results are returned is sometimes faster.
– The minimum stack frame size is smaller.
– Code size can be smaller.
– Some of the changes take advantage of Power8 CPU
improvements like Instruction Fusion.
– There are some changes to Altivec / VMX functions.
OpenPOWER ABI – Static/Extern Addressing
 How static variables and some extern variables are
accessed is normally faster:
– Static and some extern variables can be accessed directly with
an add immediate shifted (addis) from gpr2 instruction, then a
load / store / add immediate, without first having to load a
GOT / TOC pointer to the block they’re in.
(For many loads and stores, the two instructions are fusable on
Power8.)
– The default is nopic, allowing direct access in the main module.
– In libraries, use pic (which uses the medium model). Extern
variables that could be accessed from the main module must be
accessed indirectly via the GOT / TOC, because the library
loader may move them to the main module. Using datalocal
and visibility to ensure they are not exported allows the
compiler to use direct access, which is often faster.
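As a hedged illustration of the visibility point, the sketch below uses the GCC / Clang visibility("hidden") attribute to keep a library variable from being exported. The variable and function names are hypothetical, the XL datalocal option is not shown, and whether a direct addis-based access is actually generated depends on the compiler and its options.

```c
#include <assert.h>

/* Hidden: not exported from the library, so it cannot end up in the main
   module, and the compiler may address it directly. */
__attribute__((visibility("hidden"))) long call_count = 0;

/* Default visibility: in a library built as pic, accesses may have to go
   indirectly through the GOT / TOC. */
long exported_total = 0;

long record_call(long amount)
{
    assert(amount >= 0);      /* illustrative precondition only */
    call_count += 1;          /* candidate for direct access */
    exported_total += amount; /* may be accessed indirectly */
    return exported_total;
}
```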
OpenPOWER ABI - Calls
 How calls are made is normally faster:
– Functions have both a Global Entry Point and a Local Entry
Point. The Global code typically uses two instructions (addis +
addi) to point to the GOT / TOC, then falls into the Local code.
– If the function doesn’t use the GOT / TOC then it can omit those.
– There are no Function Descriptors:
• Calls to extern functions and calls via pointers (including C++ virtual
calls) do not need to load the GOT / TOC pointer because the GEP will
construct it when it's needed. Local calls skip that, so are faster.
• Calls to nested functions via pointers do not need to load the parent’s
environment pointer because a “trampoline” will load it.
• This eliminates a load of the FD address then 2 or 3 loads from it.
– The GOT / TOC pointer only needs to be saved once per function
not once per call.
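The difference between the two entry points can be related to source code roughly as follows. This is a sketch with hypothetical function names; the comments describe which entry point a compiler following the rules above would be expected to use, not verified output from any particular compiler. The noinline attribute is a GCC / Clang extension used only to keep the calls visible.

```c
#include <assert.h>

static long counter; /* static data, so bump() needs the GOT / TOC pointer */

__attribute__((noinline)) static long bump(long step)
{
    counter += step;
    return counter;
}

/* Direct call within the same module: caller and callee share a GOT / TOC
   pointer, so the call can target the Local Entry Point. */
long bump_direct(long step)
{
    return bump(step);
}

/* Call via pointer: the target is unknown at compile time, so the call
   enters at the Global Entry Point, which sets up the GOT / TOC pointer. */
long bump_via_pointer(long (*fn)(long), long step)
{
    assert(fn != 0);
    return fn(step);
}

/* Hypothetical accessor so code outside this file can obtain the pointer. */
long (*get_bump(void))(long)
{
    return bump;
}
```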
OpenPOWER ABI - Parameters
 How parameters are passed is sometimes faster:
– When practical, value parameters are passed in registers,
including cases where AIX and BE Linux did not.
– Homogeneous floating point struct etc. parameters (with up to
8 compatible floating point members) are normally passed via
FPRs instead of GPRs or memory. That includes complex and
similar structs.
– Homogeneous vector struct etc. parameters (with up to 8 vector
members) are normally passed via VRs instead of GPRs or
memory.
– Using the right registers saves instructions, store-reload
delays (aka load-hit-store) and compiler temporaries that waste
stack space and cache locality.
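A short C sketch of what qualifies as a homogeneous floating point struct parameter, using hypothetical type and function names. The comments restate the rules above; the actual register assignment is made by the compiler and is not demonstrated by this code.

```c
#include <assert.h>

/* Four members of the same floating point type: a homogeneous struct,
   eligible to be passed via FPRs under the rules described above. */
typedef struct { double x, y, z, w; } vec4d;
static_assert(sizeof(vec4d) == 4 * sizeof(double), "vec4d has no padding");

/* Mixed member types: not homogeneous, so the FPR rule does not apply. */
typedef struct { double value; long tag; } tagged;

/* Both arguments are value parameters with 4 floating point members each. */
double dot4(vec4d a, vec4d b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

double scale_tagged(tagged t, double factor)
{
    return t.value * factor;
}
```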
OpenPOWER ABI - Results
 How results are returned is sometimes faster:
– When practical, results are returned in registers, including
cases where AIX and BE Linux did not.
– Homogeneous floating-point struct results are returned via FPRs
instead of in memory. That includes complex and similar structs.
– Homogeneous vector struct results are returned via VRs instead
of in memory.
– Small non-floating-point non-vector struct results of up to 2
doublewords are returned via GPRs instead of in memory.
– Using the right registers saves store and load instructions,
store-reload delays and compiler temps.
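The result rules can be illustrated with two small struct types. The names are hypothetical, and the comments restate the rules above rather than showing verified compiler output.

```c
#include <assert.h>
#include <stdint.h>

/* Two doublewords with no floating point or vector members: small enough
   to be returned via GPRs under the rule described above. */
typedef struct { int64_t quot; int64_t rem; } divmod_result;

/* Two doubles: a homogeneous floating point struct, returned via FPRs. */
typedef struct { double re, im; } cplx;

divmod_result divmod64(int64_t n, int64_t d)
{
    assert(d != 0);
    divmod_result r = { n / d, n % d };
    return r;
}

cplx cplx_mul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}
```

Under the older ABIs both functions would return their result through a caller-provided memory buffer, as the _mm256_mul_ps listing in this deck shows for a larger struct.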
OpenPOWER ABI – ABI Overhead
Work instructions in green, AIX ABI overhead in red. NO OpenPOWER overhead.
   |  000000                          PDEF   _mm256_mul_ps
199|                                  PROC   #retvalptr_49,left,right,gr3,gr5-gr10
  0|  0031A0  stdu     F821FF31   1   ST8U   gr1,#stack(gr1,-208)=gr1
206|  0031A4  addi     380000C0   1   LI     gr0=192
203|  0031A8  addi     38810130   1   LA     gr4=right(gr1,304)
  0|  0031AC  std      F8C10118   1   ST8    left(gr1,280)=gr6
  0|  0031B0  std      F8A10110   1   ST8    left(gr1,272)=gr5
  0|  0031B4  std      F8E10120   1   ST8    left(gr1,288)=gr7
  0|  0031B8  std      F9010128   1   ST8    left(gr1,296)=gr8
  0|  0031BC  std      F9210130   1   ST8    right(gr1,304)=gr9
202|  0031C0  addi     38A10110   1   LA     gr5=left(gr1,272)
  0|  0031C4  std      F9410138   1   ST8    right(gr1,312)=gr10
203|  0031C8  lxvd2x   7C202698   1   VLQD   vs1=right(gr4,0)
202|  0031CC  lxvd2x   7C002E98   1   VLQD   vs0=left(gr5,0)
203|  0031D0  addi     38C40010   1   AI     gr6=gr4,16
202|  0031D4  addi     38850010   1   AI     gr4=gr5,16
205|  0031D8  xvmulsp  F0400A80   1   VFM    vs2=vs0,vs1,fcr
203|  0031DC  lxvd2x   7C203698   1   VLQD   vs1=right(gr6,0)
202|  0031E0  lxvd2x   7C002698   1   VLQD   vs0=left(gr4,0)
206|  0031E4  xvmulsp  F0000A80   1   VFM    vs0=vs0,vs1,fcr
205|  0031E8  addi     388000B0   1   LI     gr4=176
205|  0031EC  stxvd2x  7C412798   1   VSTQD  result.m128_0(gr1,gr4,0)=vs2
  0|  0031F0  ori      60420000   1   XNOP
206|  0031F4  stxvd2x  7C010798   1   VSTQD  result.m128_1(gr1,gr0,0)=vs0
207|  0031F8  ld       E80100B0   1   L8     gr0=result(gr1,176)
207|  0031FC  std      F8030000   2   ST8    #retval_49(gr3,0)=gr0
207|  003200  ld       E88100B8   1   L8     gr4=result(gr1,184)
207|  003204  std      F8830008   2   ST8    #retval_49(gr3,8)=gr4
  0|  003208  ori      60420000   1   XNOP
207|  00320C  ld       E80100C0   1   L8     gr0=result(gr1,192)
207|  003210  ld       E88100C8   1   L8     gr4=result(gr1,200)
208|  003214  addi     382100D0   1   AI     gr1=gr1,208
207|  003218  std      F8030010   1   ST8    #retval_49(gr3,16)=gr0
207|  00321C  std      F8830018   1   ST8    #retval_49(gr3,24)=gr4
208|  003220  bclr     4E800020   0   BA     lr
OpenPOWER ABI – Stack Frame Size
 The minimum stack frame size is smaller:
– The minimum Parameter Save Area is 0 not 64 bytes, and two
unneeded linkage doublewords are eliminated, so the minimum
stack frame size is 32 bytes.
– AIX and BE Linux 32-bit mode minimum is 64 bytes.
– AIX and BE Linux 64-bit mode minimum is 112 bytes.
– The smaller size may allow better cache locality, deeper
recursion, or more threads.
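The frame size figures above can be checked with a little arithmetic. In this sketch the constant names are hypothetical, and the count of 6 linkage doublewords for the older 64-bit ABIs is inferred from the numbers on this slide (112 bytes minus the 64 byte Parameter Save Area).

```c
#include <assert.h>

enum {
    DOUBLEWORD = 8,
    /* Older 64-bit ABIs: 6 linkage doublewords plus a 64 byte Parameter Save Area. */
    OLD_LINKAGE_BYTES = 6 * DOUBLEWORD,
    OLD_MIN_PARAM_SAVE = 64,
    /* OpenPOWER ABI: two linkage doublewords removed, Parameter Save Area may be 0. */
    NEW_LINKAGE_BYTES = OLD_LINKAGE_BYTES - 2 * DOUBLEWORD,
    NEW_MIN_PARAM_SAVE = 0
};

static_assert(OLD_LINKAGE_BYTES + OLD_MIN_PARAM_SAVE == 112, "64-bit AIX / BE Linux minimum");
static_assert(NEW_LINKAGE_BYTES + NEW_MIN_PARAM_SAVE == 32, "OpenPOWER minimum");

/* Stack bytes consumed by `depth` nested frames of `frame_bytes` each. */
unsigned long min_stack_bytes(unsigned long depth, unsigned long frame_bytes)
{
    return depth * frame_bytes;
}
```

For 1000 nested calls of a function that needs only the minimum frame, this gives 112000 bytes under the older 64-bit ABIs and 32000 bytes under the OpenPOWER ABI.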
OpenPOWER ABI – Code Size
 Many calls take fewer instructions.
 Because the cost of a call, its parameter passing and value
returning can be lower, often inlining is less important.
Doing less inlining can reduce code size even more.
 In addition to fetching and executing fewer instructions,
smaller code size can make the instruction cache more
effective, improving performance more.
OpenPOWER ABI - Fusion
 Some of the changes take advantage of Power8 CPU
improvements like Instruction Fusion:
– For the medium code model, normally two instructions (add
immediate shifted then load / store / add immediate) are used to
access all static memory (function instructions, the GOT / TOC,
extern variables, static variables and the constant area). The
addis provides the upper 16 bits of the displacement, and the
load / store / add immediate provides the lower 16 bits.
– For the main module with a 2 GB limit the add immediate shifted
can be a load immediate shifted without using the GOT / TOC.
– The Power8 CPU usually fuses the two instructions used to
load a general purpose register into one instruction going down
the pipeline, and makes some other instruction pairs faster too.
OpenPOWER ABI - Performance
 Unchanged performance in programs with:
– Large functions / subroutines (including most of SPEC).
– Extensive inlining.
– Even for these, accessing static memory can be faster.
 Faster performance in programs with:
– Many calls to small functions.
– Many pointer calls (except to nested functions),
including many C++ virtual function calls.
– Homogeneous floating-point struct parameters and
homogeneous vector struct parameters.
– Homogeneous floating-point struct and vector struct results.
– Small non-FP non-vector results.
Discussion and Questions
?