OpenPOWER ABI Performance Improvements
IantoMcIntosh
November 5, 2014
CASCON 2014
www.ibm.com © 2006-2008 IBM Corporation

Abstract
OpenPOWER Linux ABI changes to improve performance
Many of the differences between the existing AIX and Linux on PowerPC ABIs and the new OpenPOWER Linux ABI are designed to speed up calls, parameter passing, result returning and static memory access. They also reduce memory usage. Some of the approaches are novel.

© 2007 IBM Corporation w3.ibm.com

OpenPOWER
OpenPOWER is a new partnership led by IBM, with 68 others. See www.openpowerfoundation.org and www.en.wikipedia.org/wiki/OpenPOWER_Foundation.
IBM will openly license PowerPC CPU designs (not just the architecture) to its partners.
The minimum PowerPC hardware architecture level is Power8.
Nvidia GPUs integrated with PowerPC speed up some programs.

OpenPOWER ABI
The OpenPOWER Linux on PowerPC variant includes a new ABI (Application Binary Interface).
An ABI defines the interfaces between program components, e.g., how calls are made, how parameters are passed, how results are returned, and what object files and debug information look like.
This ABI is 64-bit and Little Endian (LE) only.
Some of it resembles the 64-bit Big Endian ABI and the AIX ABI, and parts resemble the 32-bit Big Endian ABI, but some important parts are new, and some are very different.
Most of the changes are to improve performance.

OpenPOWER ABI
The object file format is ELF (the same as Big Endian Linux on PowerPC), unlike AIX’s XCOFF format.
The debug format is DWARF4.
The default code model is medium, with each library or the main module (executable) limited to 4 GB (a 32-bit displacement). There are also small and large code models. Currently the XL compilers only support the medium model.
Most changes affect only compiler back ends, not front ends, and not user application programming.

OpenPOWER ABI - Addressing
Pointers are 64 bits.
For the medium code model, normally two instructions (add immediate shifted, then load / store / add immediate) are used to access static memory (function instructions, the GOT / TOC, some extern variables, static variables and the constant area). Each instruction provides 16 bits of the displacement.
A GOT (Global Offset Table) holds pointers to extern functions and some extern / static data used by a module.
A TOC (Table of Contents) contains the GOT and may also contain directly accessible data.
Other static memory might be accessed indirectly via the GOT / TOC, directly (if not visible outside the library) or via an absolute address (main module only, and only up to 2 GB).

OpenPOWER ABI - Major Changes
– Accessing static and some extern variables is faster.
– How calls are made is normally faster.
– How parameters are passed is sometimes faster.
– How results are returned is sometimes faster.
– The minimum stack frame size is smaller.
– Code size can be smaller.
– Some of the changes take advantage of Power8 CPU improvements like Instruction Fusion.
– There are some changes to Altivec / VMX functions.

OpenPOWER ABI - Static/Extern Addressing
How static variables and some extern variables are accessed is normally faster:
– Static and some extern variables can be accessed directly with an add immediate shifted (addis) from gpr2, then a load / store / add immediate, without first having to load a GOT / TOC pointer to the block they’re in. (For many loads and stores, the two instructions are fusable on Power8.)
– The default is nopic, allowing direct access in the main module.
– In libraries, use pic (which uses the medium model).
Extern variables that could be accessed from the main module must be accessed indirectly via the GOT / TOC, because the library loader may move them to the main module. Using datalocal and visibility to ensure they are not exported allows the compiler to use direct access, which is often faster.

OpenPOWER ABI - Calls
How calls are made is normally faster:
– Functions have both a Global Entry Point (GEP) and a Local Entry Point. The Global code typically uses two instructions (addis + addi) to point to the GOT / TOC, then falls into the Local code.
– If the function doesn’t use the GOT / TOC then it can omit those two instructions.
– There are no Function Descriptors (FDs):
  • Calls to extern functions and calls via pointers (including C++ virtual calls) do not need to load the GOT / TOC pointer because the GEP will construct it when it's needed. Local calls skip that, so are faster.
  • Calls to nested functions via pointers do not need to load the parent’s environment pointer because a “trampoline” will load it.
  • This eliminates a load of the FD address, then 2 or 3 loads from it.
– The GOT / TOC pointer only needs to be saved once per function, not once per call.

OpenPOWER ABI - Parameters
How parameters are passed is sometimes faster:
– When practical, value parameters are passed in registers, including cases where AIX and BE Linux did not.
– Homogeneous floating-point struct etc. parameters (with up to 8 compatible floating-point members) are normally passed via FPRs instead of GPRs or memory. That includes complex and similar structs.
– Homogeneous vector struct etc. parameters (with up to 8 vector members) are normally passed via VRs instead of GPRs or memory.
– Using the right registers saves instructions, store-reload delays (aka load-hit-store) and compiler temporaries that waste stack space and hurt cache locality.
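
As a rough illustration of the parameter rules above, the following C sketch shows the kinds of types that would count as homogeneous floating-point aggregates. The type and function names (vec4f, cplx, vec4f_dot, cplx_mul) are made up for this example and do not come from the ABI. Which registers a given compiler actually uses should be confirmed from its generated code or listing.

```c
#include <assert.h>  /* static_assert (C11) */

/* Hypothetical example types; the names are assumptions for illustration. */

/* Four members of the same floating-point type: a homogeneous
 * floating-point aggregate, a candidate for FPRs rather than
 * GPRs or memory. */
struct vec4f { float x, y, z, w; };

/* A complex-like struct: two doubles, also homogeneous. */
struct cplx { double re, im; };

/* No padding is expected, so members can map one-to-one onto registers. */
static_assert(sizeof(struct vec4f) == 4 * sizeof(float), "unexpected padding");
static_assert(sizeof(struct cplx) == 2 * sizeof(double), "unexpected padding");

/* Both struct arguments are passed by value. */
float vec4f_dot(struct vec4f a, struct vec4f b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Struct parameters and a struct result, all by value. */
struct cplx cplx_mul(struct cplx a, struct cplx b)
{
    struct cplx r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}
```

A struct that mixes, say, a float and an int would not be homogeneous, so these rules would not apply to it.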

OpenPOWER ABI - Results
How results are returned is sometimes faster:
– When practical, results are returned in registers, including cases where AIX and BE Linux did not.
– Homogeneous floating-point struct results are returned via FPRs instead of in memory. That includes complex and similar structs.
– Homogeneous vector struct results are returned via VRs instead of in memory.
– Small non-floating-point non-vector struct results of up to 2 doublewords are returned via GPRs instead of in memory.
– Using the right registers saves store and load instructions, store-reload delays and compiler temps.

OpenPOWER ABI - ABI Overhead
Work instructions in green, AIX ABI overhead in red. NO OpenPOWER overhead.

      1| 000000                       PDEF   _mm256_mul_ps
    199|                              PROC   #retvalptr_49,left,right,gr3,gr5-gr10
      0| 0031A0 stdu     F821FF31  1  ST8U   gr1,#stack(gr1,-208)=gr1
    206| 0031A4 addi     380000C0  1  LI     gr0=192
    203| 0031A8 addi     38810130  1  LA     gr4=right(gr1,304)
      0| 0031AC std      F8C10118  1  ST8    left(gr1,280)=gr6
      0| 0031B0 std      F8A10110  1  ST8    left(gr1,272)=gr5
      0| 0031B4 std      F8E10120  1  ST8    left(gr1,288)=gr7
      0| 0031B8 std      F9010128  1  ST8    left(gr1,296)=gr8
      0| 0031BC std      F9210130  1  ST8    right(gr1,304)=gr9
    202| 0031C0 addi     38A10110  1  LA     gr5=left(gr1,272)
      0| 0031C4 std      F9410138  1  ST8    right(gr1,312)=gr10
    203| 0031C8 lxvd2x   7C202698  1  VLQD   vs1=right(gr4,0)
    202| 0031CC lxvd2x   7C002E98  1  VLQD   vs0=left(gr5,0)
    203| 0031D0 addi     38C40010  1  AI     gr6=gr4,16
    202| 0031D4 addi     38850010  1  AI     gr4=gr5,16
    205| 0031D8 xvmulsp  F0400A80  1  VFM    vs2=vs0,vs1,fcr
    203| 0031DC lxvd2x   7C203698  1  VLQD   vs1=right(gr6,0)
    202| 0031E0 lxvd2x   7C002698  1  VLQD   vs0=left(gr4,0)
    206| 0031E4 xvmulsp  F0000A80  1  VFM    vs0=vs0,vs1,fcr
    205| 0031E8 addi     388000B0  1  LI     gr4=176
    205| 0031EC stxvd2x  7C412798  1  VSTQD  result.m128_0(gr1,gr4,0)=vs2
      0| 0031F0 ori      60420000  1  XNOP
    206| 0031F4 stxvd2x  7C010798  1  VSTQD  result.m128_1(gr1,gr0,0)=vs0
    207| 0031F8 ld       E80100B0  1  L8     gr0=result(gr1,176)
    207| 0031FC std      F8030000  2  ST8    #retval_49(gr3,0)=gr0
    207| 003200 ld       E88100B8  1  L8     gr4=result(gr1,184)
    207| 003204 std      F8830008  2  ST8    #retval_49(gr3,8)=gr4
      0| 003208 ori      60420000  1  XNOP
    207| 00320C ld       E80100C0  1  L8     gr0=result(gr1,192)
    207| 003210 ld       E88100C8  1  L8     gr4=result(gr1,200)
    208| 003214 addi     382100D0  1  AI     gr1=gr1,208
    207| 003218 std      F8030010  1  ST8    #retval_49(gr3,16)=gr0
    207| 00321C std      F8830018  1  ST8    #retval_49(gr3,24)=gr4
    208| 003220 bclr     4E800020  0  BA     lr

OpenPOWER ABI - Stack Frame Size
The minimum stack frame size is smaller:
– The minimum Parameter Save Area is 0 bytes, not 64, and two unneeded linkage doublewords are eliminated, so the minimum stack frame size is 32 bytes.
– AIX and BE Linux 32-bit mode minimum is 64 bytes.
– AIX and BE Linux 64-bit mode minimum is 112 bytes.
– The smaller size may allow better cache locality, deeper recursion, or more threads.

OpenPOWER ABI - Code Size
Many calls take fewer instructions.
Because the cost of a call, its parameter passing and value returning can be lower, inlining is often less important. Doing less inlining can reduce code size even more.
In addition to fetching and executing fewer instructions, smaller code can make the instruction cache more effective, improving performance further.

OpenPOWER ABI - Fusion
Some of the changes take advantage of Power8 CPU improvements like Instruction Fusion:
– For the medium code model, normally two instructions (add immediate shifted, then load / store / add immediate) are used to access all static memory (function instructions, the GOT / TOC, extern variables, static variables and the constant area). The addis provides the upper 16 bits of the displacement, and the load / store / add immediate provides the lower 16 bits.
– For the main module with a 2 GB limit, the add immediate shifted can be a load immediate shifted, without using the GOT / TOC.
– The Power8 CPU usually fuses the two instructions used to load a general purpose register into one instruction going down the pipeline, and makes some other instruction pairs faster too.

OpenPOWER ABI - Performance
Unchanged performance in programs with:
– Large functions / subroutines (including most of SPEC).
– Extensive inlining.
– Even for these, accessing static memory can be faster.

Faster performance in programs with:
– Many calls to small functions.
– Many pointer calls (except to nested functions), including many C++ virtual function calls.
– Homogeneous floating-point struct parameters and homogeneous vector struct parameters.
– Homogeneous floating-point struct and vector struct results.
– Small non-FP non-vector results.

Discussion and Questions?
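
Returning to the two-instruction addressing used by the medium code model, the arithmetic can be sketched in C. This is an illustrative assumption about how a 32-bit displacement splits across the pair, not compiler output. Because the 16-bit immediate of the second instruction is sign-extended, the upper half supplied by addis has to be adjusted whenever the lower half is negative. The names disp_split, split_disp and join_disp are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two 16-bit immediates of an
 * addis + load/store/addi pair. */
struct disp_split {
    int16_t hi;  /* immediate for addis (the CPU shifts it left 16 bits) */
    int16_t lo;  /* sign-extended immediate of the second instruction */
};

/* Split a displacement from the base register into the two immediates. */
struct disp_split split_disp(int32_t disp)
{
    struct disp_split s;
    /* Low half, reinterpreted as a signed 16-bit value. */
    int32_t lo = (int32_t)(disp & 0xFFFF);
    if (lo >= 0x8000)
        lo -= 0x10000;
    /* High half compensates for a negative low half;
     * disp - lo is always an exact multiple of 65536. */
    int32_t hi = (int32_t)(((int64_t)disp - lo) / 65536);
    /* Displacements within 0x8000 of INT32_MAX do not fit the pair. */
    assert(hi >= INT16_MIN && hi <= INT16_MAX);
    s.hi = (int16_t)hi;
    s.lo = (int16_t)lo;
    return s;
}

/* Recombine the two immediates the way the instruction pair would. */
int64_t join_disp(struct disp_split s)
{
    return (int64_t)s.hi * 65536 + s.lo;
}
```

For example, a displacement of 0x12348000 splits into a high immediate of 0x1235 and a low immediate of -32768, and join_disp recovers the original value.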