PeachPy: A Python Framework for Developing High-Performance Assembly Kernels

Marat Dukhan
School of Computational Science and Engineering
College of Computing, Georgia Institute of Technology

PyHPC 2013


Outline

1. Motivation
2. PeachPy Foundations
3. PeachPy Basics
4. Assembly Programming Automation
5. Metaprogramming
6. Instruction Streams
7. Experimental Validation
8. Conclusion


The Problem

Essentially the same dot-product kernel has to be rewritten by hand for every combination of data type, instruction set, and calling convention:

Double precision, SSE, Microsoft x86-64 calling convention (arrays in rcx and rdx, length in r8):

dot_product:
    XORPD xmm0, xmm0
    XORPD xmm1, xmm1
    XORPD xmm2, xmm2
    SUB r8, 6
    JB .restore
    align 16
.batch:
    MOVUPD xmm3, [rcx]
    MOVUPD xmm7, [rdx]
    MULPD xmm3, xmm7
    ADDPD xmm0, xmm3
    MOVUPD xmm4, [rcx + 16]
    MOVUPD xmm7, [rdx + 16]
    MULPD xmm4, xmm7
    ADDPD xmm1, xmm4
    MOVUPD xmm5, [rcx + 32]
    MOVUPD xmm7, [rdx + 32]
    MULPD xmm5, xmm7
    ADDPD xmm2, xmm5
    ADD rcx, 48
    ADD rdx, 48
    SUB r8, 6
    JAE .batch
.restore:
    ADD r8, 6
    JZ .reduce
.remainder:
    MOVSD xmm3, [rcx]
    MOVSD xmm4, [rdx]
    MULSD xmm3, xmm4
    ADDSD xmm0, xmm3
    ADD rcx, 8
    ADD rdx, 8
    SUB r8, 1
    JNZ .remainder
.reduce:
    ADDPD xmm0, xmm1
    ADDPD xmm0, xmm2
    HADDPD xmm0, xmm0
    RET

Single precision, SSE, Microsoft x86-64 calling convention:

dot_product:
    XORPS xmm0, xmm0
    XORPS xmm1, xmm1
    XORPS xmm2, xmm2
    SUB r8, 12
    JB .restore
    align 16
.batch:
    MOVUPS xmm3, [rcx]
    MOVUPS xmm7, [rdx]
    MULPS xmm3, xmm7
    ADDPS xmm0, xmm3
    MOVUPS xmm4, [rcx + 16]
    MOVUPS xmm7, [rdx + 16]
    MULPS xmm4, xmm7
    ADDPS xmm1, xmm4
    MOVUPS xmm5, [rcx + 32]
    MOVUPS xmm7, [rdx + 32]
    MULPS xmm5, xmm7
    ADDPS xmm2, xmm5
    ADD rcx, 48
    ADD rdx, 48
    SUB r8, 12
    JAE .batch
.restore:
    ADD r8, 12
    JZ .reduce
.remainder:
    MOVSS xmm3, [rcx]
    MOVSS xmm4, [rdx]
    MULSS xmm3, xmm4
    ADDSS xmm0, xmm3
    ADD rcx, 4
    ADD rdx, 4
    SUB r8, 1
    JNZ .remainder
.reduce:
    ADDPS xmm0, xmm1
    ADDPS xmm0, xmm2
    HADDPS xmm0, xmm0
    HADDPS xmm0, xmm0
    RET

Double precision, SSE, System V x86-64 calling convention (arrays in rdi and rsi, length in rdx):

dot_product:
    XORPD xmm0, xmm0
    XORPD xmm1, xmm1
    XORPD xmm2, xmm2
    SUB rdx, 6
    JB .restore
    align 16
.batch:
    MOVUPD xmm3, [rdi]
    MOVUPD xmm7, [rsi]
    MULPD xmm3, xmm7
    ADDPD xmm0, xmm3
    MOVUPD xmm4, [rdi + 16]
    MOVUPD xmm7, [rsi + 16]
    MULPD xmm4, xmm7
    ADDPD xmm1, xmm4
    MOVUPD xmm5, [rdi + 32]
    MOVUPD xmm7, [rsi + 32]
    MULPD xmm5, xmm7
    ADDPD xmm2, xmm5
    ADD rdi, 48
    ADD rsi, 48
    SUB rdx, 6
    JAE .batch
.restore:
    ADD rdx, 6
    JZ .reduce
.remainder:
    MOVSD xmm3, [rdi]
    MOVSD xmm4, [rsi]
    MULSD xmm3, xmm4
    ADDSD xmm0, xmm3
    ADD rdi, 8
    ADD rsi, 8
    SUB rdx, 1
    JNZ .remainder
.reduce:
    ADDPD xmm0, xmm1
    ADDPD xmm0, xmm2
    HADDPD xmm0, xmm0
    RET

Single precision, SSE, System V x86-64 calling convention:

dot_product:
    XORPS xmm0, xmm0
    XORPS xmm1, xmm1
    XORPS xmm2, xmm2
    SUB rdx, 12
    JB .restore
    align 16
.batch:
    MOVUPS xmm3, [rdi]
    MOVUPS xmm7, [rsi]
    MULPS xmm3, xmm7
    ADDPS xmm0, xmm3
    MOVUPS xmm4, [rdi + 16]
    MOVUPS xmm7, [rsi + 16]
    MULPS xmm4, xmm7
    ADDPS xmm1, xmm4
    MOVUPS xmm5, [rdi + 32]
    MOVUPS xmm7, [rsi + 32]
    MULPS xmm5, xmm7
    ADDPS xmm2, xmm5
    ADD rdi, 48
    ADD rsi, 48
    SUB rdx, 12
    JAE .batch
.restore:
    ADD rdx, 12
    JZ .reduce
.remainder:
    MOVSS xmm3, [rdi]
    MOVSS xmm4, [rsi]
    MULSS xmm3, xmm4
    ADDSS xmm0, xmm3
    ADD rdi, 4
    ADD rsi, 4
    SUB rdx, 1
    JNZ .remainder
.reduce:
    ADDPS xmm0, xmm1
    ADDPS xmm0, xmm2
    HADDPS xmm0, xmm0
    HADDPS xmm0, xmm0
    RET

Double precision, AVX, Microsoft x86-64 calling convention:

dot_product:
    VXORPD ymm0, ymm0
    VXORPD ymm1, ymm1
    VXORPD ymm2, ymm2
    SUB r8, 12
    JB .restore
    align 16
.batch:
    VMOVUPD ymm3, [rcx]
    VMOVUPD ymm7, [rdx]
    VMULPD ymm3, ymm7
    VADDPD ymm0, ymm3
    VMOVUPD ymm4, [rcx + 32]
    VMOVUPD ymm7, [rdx + 32]
    VMULPD ymm4, ymm7
    VADDPD ymm1, ymm4
    VMOVUPD ymm5, [rcx + 64]
    VMOVUPD ymm7, [rdx + 64]
    VMULPD ymm5, ymm7
    VADDPD ymm2, ymm5
    ADD rcx, 96
    ADD rdx, 96
    SUB r8, 12
    JAE .batch
.restore:
    ADD r8, 12
    JZ .reduce
.remainder:
    VMOVSD xmm3, [rcx]
    VMOVSD xmm4, [rdx]
    VMULSD xmm3, xmm4
    VADDSD ymm0, ymm3
    ADD rcx, 8
    ADD rdx, 8
    SUB r8, 1
    JNZ .remainder
.reduce:
    VADDPD ymm0, ymm1
    VADDPD ymm0, ymm2
    VEXTRACTF128 xmm1, ymm0, 1
    VADDPD xmm0, xmm1
    VHADDPD xmm0, xmm0
    RET

Single precision, AVX, Microsoft x86-64 calling convention:

dot_product:
    VXORPS ymm0, ymm0
    VXORPS ymm1, ymm1
    VXORPS ymm2, ymm2
    SUB r8, 24
    JB .restore
    align 16
.batch:
    VMOVUPS ymm3, [rcx]
    VMOVUPS ymm7, [rdx]
    VMULPS ymm3, ymm7
    VADDPS ymm0, ymm3
    VMOVUPS ymm4, [rcx + 32]
    VMOVUPS ymm7, [rdx + 32]
    VMULPS ymm4, ymm7
    VADDPS ymm1, ymm4
    VMOVUPS ymm5, [rcx + 64]
    VMOVUPS ymm7, [rdx + 64]
    VMULPS ymm5, ymm7
    VADDPS ymm2, ymm5
    ADD rcx, 96
    ADD rdx, 96
    SUB r8, 24
    JAE .batch
.restore:
    ADD r8, 24
    JZ .reduce
.remainder:
    VMOVSS xmm3, [rcx]
    VMOVSS xmm4, [rdx]
    VMULSS xmm3, xmm4
    VADDSS ymm0, ymm3
    ADD rcx, 4
    ADD rdx, 4
    SUB r8, 1
    JNZ .remainder
.reduce:
    VADDPS ymm0, ymm1
    VADDPS ymm0, ymm2
    VEXTRACTF128 xmm1, ymm0, 1
    VADDPS xmm0, xmm1
    VHADDPS xmm0, xmm0
    VHADDPS xmm0, xmm0
    RET

Double precision, AVX, System V x86-64 calling convention:

dot_product:
    VXORPD ymm0, ymm0
    VXORPD ymm1, ymm1
    VXORPD ymm2, ymm2
    SUB rdx, 12
    JB .restore
    align 16
.batch:
    VMOVUPD ymm3, [rdi]
    VMOVUPD ymm7, [rsi]
    VMULPD ymm3, ymm7
    VADDPD ymm0, ymm3
    VMOVUPD ymm4, [rdi + 32]
    VMOVUPD ymm7, [rsi + 32]
    VMULPD ymm4, ymm7
    VADDPD ymm1, ymm4
    VMOVUPD ymm5, [rdi + 64]
    VMOVUPD ymm7, [rsi + 64]
    VMULPD ymm5, ymm7
    VADDPD ymm2, ymm5
    ADD rdi, 96
    ADD rsi, 96
    SUB rdx, 12
    JAE .batch
.restore:
    ADD rdx, 12
    JZ .reduce
.remainder:
    VMOVSD xmm3, [rdi]
    VMOVSD xmm4, [rsi]
    VMULSD xmm3, xmm4
    VADDSD ymm0, ymm3
    ADD rdi, 8
    ADD rsi, 8
    SUB rdx, 1
    JNZ .remainder
.reduce:
    VADDPD ymm0, ymm1
    VADDPD ymm0, ymm2
    VEXTRACTF128 xmm1, ymm0, 1
    VADDPD xmm0, xmm1
    VHADDPD xmm0, xmm0
    RET

Single precision, AVX, System V x86-64 calling convention:

dot_product:
    VXORPS ymm0, ymm0
    VXORPS ymm1, ymm1
    VXORPS ymm2, ymm2
    SUB rdx, 24
    JB .restore
    align 16
.batch:
    VMOVUPS ymm3, [rdi]
    VMOVUPS ymm7, [rsi]
    VMULPS ymm3, ymm7
    VADDPS ymm0, ymm3
    VMOVUPS ymm4, [rdi + 32]
    VMOVUPS ymm7, [rsi + 32]
    VMULPS ymm4, ymm7
    VADDPS ymm1, ymm4
    VMOVUPS ymm5, [rdi + 64]
    VMOVUPS ymm7, [rsi + 64]
    VMULPS ymm5, ymm7
    VADDPS ymm2, ymm5
    ADD rdi, 96
    ADD rsi, 96
    SUB rdx, 24
    JAE .batch
.restore:
    ADD rdx, 24
    JZ .reduce
.remainder:
    VMOVSS xmm3, [rdi]
    VMOVSS xmm4, [rsi]
    VMULSS xmm3, xmm4
    VADDSS ymm0, ymm3
    ADD rdi, 4
    ADD rsi, 4
    SUB rdx, 1
    JNZ .remainder
.reduce:
    VADDPS ymm0, ymm1
    VADDPS ymm0, ymm2
    VEXTRACTF128 xmm1, ymm0, 1
    VADDPS xmm0, xmm1
    VHADDPS xmm0, xmm0
    VHADDPS xmm0, xmm0
    RET
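All eight variants compute the same reduction. For reference (plain Python, not taken from the slides), the entire computation is just:

def dot_product(x, y):
    # Same result as every assembly variant above, for Python sequences of numbers
    return sum(a * b for a, b in zip(x, y))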
The Research Problem

This research addresses the problem of generating multiple similar assembly kernels:

- Kernels which perform similar operations
  (e.g. vector addition and subtraction)
- Kernels which do the same operation on different data types
  (e.g. single- and double-precision dot product)
- Kernels which target different microarchitectures or ISAs
  (e.g. dot product for AVX, FMA4, FMA3)
- Kernels which use different ABIs
  (e.g. x86-64 on Windows and Linux)


Assembly Compilation Process

(diagram of the assembly compilation process)

Let's replace the macro processor with something
- more flexible
- more standardized
- more popular

Python!


Introducing PeachPy

PeachPy is...
- ...an automation and metaprogramming tool for assembly programming
- ...an assembly-like DSL: the PeachPy user is exposed to the same low-level details as an assembly programmer
- ...a Python framework: any PeachPy code is valid Python code

PeachPy is not...
- ...a compiler: PeachPy does not offer high-level programming abstractions
- ...an assembler: PeachPy does not generate machine code


PeachPy Philosophy

PeachPy is for writing high-performance code
- No support for invoke, OOP, and other "high-level assembly" features
- No kernel-mode instructions
- No system instructions

PeachPy is for writing assembly code
- It is not a replacement for a high-level compiler

All optimizations possible in assembly should be possible in PeachPy.
Everything that can be automated in assembly programming should be automated.


Minimal PeachPy Function

from peachpy.x64 import *

abi = peachpy.c.ABI('x64-sysv')
assembler = Assembler(abi)

x_argument = peachpy.c.Parameter("x", peachpy.c.Type("uint32_t"))
arguments = (x_argument,)
function_name = "f"
microarchitecture = "SandyBridge"

with Function(assembler, function_name, arguments, microarchitecture):
    RETURN()

print assembler


Modules

PeachPy functionality is concentrated in three Python modules:
- peachpy.c: C compatibility classes (C types and ABIs)
- peachpy.x64: x86-64 assembly classes
- peachpy.arm: ARM assembly classes

The assembly modules are intended to be imported into the program workspace:

# This will make the syntax of PeachPy code
# very similar to native assembly
from peachpy.x64 import *


Assembler and Function

Assembler
- Container for functions
- Contains only functions with the specified ABI
- Normally saved to disk as an assembly file

Function
- Created using the with syntax: with Function(...):
- Creates an active instruction stream

Microarchitecture
- String parameter of the Function constructor
- Restricts the set of supported instructions


Instructions

- All instructions are named in uppercase
- All instructions are Python objects
- When an instruction is called (as a Python object), it is generated and added to the active PeachPy function (as an assembly instruction)
- PeachPy verifies the correctness of instruction operands
- Most computational x86-64 and many ARM instructions are supported

Traditional assembly          PeachPy
.loop:                        LABEL( "loop" )
ADDPD xmm0, [rsi]             ADDPD( xmm0, [rsi] )
ADD rsi, 16                   ADD( rsi, 16 )
SUB rcx, 2                    SUB( rcx, 2 )
JAE .loop                     JAE( "loop" )
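Because instructions are plain Python callables, ordinary Python control flow can emit repetitive instruction sequences inside a Function block. A minimal sketch (the helper function and its arguments are illustrative, not part of PeachPy; only the instruction-call syntax is taken from the slides):

from peachpy.x64 import *

def emit_unrolled_accumulate(accumulators, pointer, step=16):
    # The Python loop runs at code-generation time; only the ADDPD/ADD
    # instructions it calls are added to the active PeachPy function.
    for i, acc in enumerate(accumulators):
        ADDPD( acc, [pointer + i * step] )
    # Advance the pointer past the data consumed by the unrolled block.
    ADD( pointer, step * len(accumulators) )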
Registers

PeachPy maps the different kinds of architectural registers onto Python classes.

x86 register classes:
- GeneralPurposeRegister (base class)
- GeneralPurposeRegister8
- GeneralPurposeRegister16
- GeneralPurposeRegister32
- GeneralPurposeRegister64
- MMXRegister
- SSERegister
- AVXRegister

ARM register classes:
- GeneralPurposeRegister
- SRegister
- DRegister
- QRegister

Architectural registers are represented in PeachPy as Python objects. All register names are in lowercase.

Traditional x86-64 assembly        PeachPy
MOVZX rax, al                      MOVZX( rax, al )
PADD mm0, mm1                      PADD( mm0, mm1 )
ADDPS xmm0, xmm1                   ADDPS( xmm0, xmm1 )
VMULPD ymm0, ymm1, ymm2            VMULPD( ymm0, ymm1, ymm2 )

Traditional ARM assembly           PeachPy
ADD r0, r0, r1                     ADD( r0, r0, r1 )
VLD1.32 {d0[]}, [r2]               VLD1.F32( (d0[:],), [r2] )
VFMA.F32 q2, q1, q1                VFMA.F32( q2, q1, q1 )


Register Allocation

Traditional x86-64 assembly:

VMOVAPD ymm0, [rsi]
VMOVAPD ymm1, ymm0
VFMADD132PD ymm1, ymm13, ymm12
VFMADD231PD ymm0, ymm1, ymm14
VFMADD231PD ymm0, ymm1, ymm15

PeachPy (virtual registers; PeachPy performs the register allocation):

ymm_x = AVXRegister()
VMOVAPD( ymm_x, [xPointer] )
ymm_t = AVXRegister()
VMOVAPD( ymm_t, ymm_x )
VFMADD132PD( ymm_t, ymm_t, ymm_log2e, ymm_magic_bias )
VFMADD231PD( ymm_x, ymm_t, ymm_minus_ln2_hi, ymm_x )
VFMADD231PD( ymm_x, ymm_t, ymm_minus_ln2_lo, ymm_x )


In-place Memory Constant Declarations

Traditional x86-64 assembly:

Right here:
section .rdata rdata
c0 dq 3.141592, 3.141592

In a galaxy far, far away:
section .text code
MULPD xmm0, [c0]

PeachPy:

MULPD( xmm_x, Constant.float64x2(3.141592) )


Hexadecimal Floating-Point Constants

Hexadecimal floating-point constants provide an accurate and portable way to specify floating-point constants without rounding errors.
- Required by the C99 standard
- Supported by gcc, clang, icc, xlc, and NASM
- But not supported by the GNU Assembler

PeachPy lets programmers use hexadecimal floating-point constants on all supported platforms.

C99:
const double ln2 = 0x1.71547652B82FEp+0;

ARM assembly (GNU Assembler):
ln2: .quad 0x3FF71547652B82FE

x86-64 assembly (NASM):
ln2 dq 0x1.71547652B82FEp+0

PeachPy (x86-64 and ARM):
ln2 = Constant.float64('0x1.71547652B82FEp+0')
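As a cross-check (plain Python, independent of PeachPy), the standard library confirms that the hexadecimal literal and the .quad bit pattern above encode the same IEEE-754 double:

import struct

value = float.fromhex('0x1.71547652B82FEp+0')           # the C99/NASM spelling
bits = struct.unpack('<Q', struct.pack('<d', value))[0]
assert bits == 0x3FF71547652B82FE                       # the GNU Assembler spelling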
Calling Conventions: The Problem

Consider assembly implementations of the C function

uint64_t add(uint64_t x, uint64_t y) {
    return x + y;
}

Assembly for the Microsoft x86-64 calling convention:

add:
    LEA rax, [rcx + rdx * 1]
    RET

Assembly for the System V x86-64 calling convention:

add:
    LEA rax, [rdi + rsi * 1]
    RET


Calling Conventions: The PeachPy Approach

from peachpy.x64 import *

asm = Assembler(peachpy.c.ABI("x64-ms"))   # or "x64-sysv"
x_arg = peachpy.c.Parameter("x", peachpy.c.Type("uint64_t"))
y_arg = peachpy.c.Parameter("y", peachpy.c.Type("uint64_t"))

with Function(asm, "add", (x_arg, y_arg), "Bobcat"):
    x = GeneralPurposeRegister64()
    LOAD.PARAMETER( x, x_arg )   # Does the magic!
    y = GeneralPurposeRegister64()
    LOAD.PARAMETER( y, y_arg )   # Does the magic!
    LEA( rax, [x + y * 1] )
    RETURN()


ISA-Based Runtime Dispatching

- PeachPy knows which instruction set each instruction belongs to
- PeachPy also collects this ISA information for each function
- This enables fine-grained runtime dispatching (sketched below)
  - More efficient than recompiling the whole function for each ISA with a high-level compiler
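One way such dispatching can look on the caller side — an illustrative sketch only, not PeachPy's API; the feature names and kernel stand-ins are hypothetical:

def choose_kernel(candidates, cpu_features):
    # candidates: (required ISA extensions, kernel) pairs, best variant first
    for required, kernel in candidates:
        if required <= cpu_features:   # every required extension is available
            return kernel
    raise RuntimeError("no suitable kernel for this CPU")

# Stand-ins for three generated dot_product variants:
candidates = [
    (frozenset(["AVX", "FMA3"]), "dot_product_fma3"),
    (frozenset(["AVX"]),         "dot_product_avx"),
    (frozenset(),                "dot_product_sse2"),
]
print( choose_kernel(candidates, frozenset(["AVX"])) )   # -> dot_product_avx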
Parametrized Unroll

with Function(asm, "dot_product", args, "SandyBridge"):
    xPointer, yPointer, zPointer, length = LOAD.PARAMETERS()
    reg_size = 32
    reg_elements = 8
    unroll_regs = 8
    acc = [AVXRegister() for _ in range(unroll_regs)]
    temp = [AVXRegister() for _ in range(unroll_regs)]
    ...
    LABEL( "process_batch" )
    for i in range(unroll_regs):
        VMOVAPS( temp[i], [xPointer + i * reg_size] )
        VMULPS( temp[i], [yPointer + i * reg_size] )
        VADDPS( acc[i], temp[i] )
    ADD( xPointer, reg_size * unroll_regs )
    ADD( yPointer, reg_size * unroll_regs )
    SUB( length, reg_elements * unroll_regs )
    JAE( "process_batch" )
    ...


Parametrization by Element Type

reg_size = 32
reg_elements = reg_size / element_size
unroll_regs = 8
SIMD_LOAD = {4: VMOVAPS, 8: VMOVAPD}[element_size]
SIMD_MUL  = {4: VMULPS,  8: VMULPD}[element_size]
SIMD_ADD  = {4: VADDPS,  8: VADDPD}[element_size]
...
LABEL( "process_batch" )
for i in range(unroll_regs):
    SIMD_LOAD( temp[i], [xPointer + i * reg_size] )
    SIMD_MUL( temp[i], [yPointer + i * reg_size] )
    SIMD_ADD( acc[i], temp[i] )
ADD( xPointer, reg_size * unroll_regs )
ADD( yPointer, reg_size * unroll_regs )
SUB( length, reg_elements * unroll_regs )
JAE( "process_batch" )
...


Supporting Multiple Instruction Sets

if Target.has_fma():
    VMLAPS = VFMADDPS if Target.has_fma4() else VFMADD231PS
else:
    def VMLAPS(x, a, b, c):
        t = AVXRegister()
        VMULPS( t, a, b )
        VADDPS( x, t, c )
...
LABEL( "processBatch" )
for i in range(unroll_regs):
    VMOVAPS( temp[i], [xPointer + i * reg_size] )
    VMLAPS( acc[i], temp[i], [yPointer + i * reg_size], acc[i] )
ADD( xPointer, reg_size * unroll_regs )
ADD( yPointer, reg_size * unroll_regs )
SUB( length, reg_elements * unroll_regs )
JAE( "processBatch" )
...


Instruction Streams: Introduction

Usually we want assembly instructions to appear in the same order in which we write them. However:

- To increase IPC it is useful to interleave two pieces of code that use different types of instructions, even though it may be convenient to write those pieces separately.
  - The ARM Cortex-A9 can decode one SIMD instruction and one scalar instruction per cycle; by interleaving SIMD and scalar processing we can achieve higher performance.
  - On x86 we may use scalar instructions to detect special cases while the SIMD units are busy doing computations.
- For software pipelining we may want to skew the sequences of similar instructions relative to each other.
  - But we don't want to skew our source code.


Instruction Stream Objects

The Python with statement can be used to redirect generated instructions into an InstructionStream object:

scalar_stream = InstructionStream()
with scalar_stream:
    x = GeneralPurposeRegister64()
    MOV( x, [xPointer] )
    CMP.JA( x, threshold, "above_threshold" )

vector_stream = InstructionStream()
with vector_stream:
    ...

Instructions from an instruction stream can later be re-issued into the currently active instruction stream:

while scalar_stream or vector_stream:
    scalar_stream.issue()
    vector_stream.issue()


Software Pipelining

Instruction streams are useful for implementing software pipelining:

instruction_columns = [InstructionStream(), InstructionStream(), InstructionStream()]
for i in range(unroll_regs):
    with instruction_columns[0]:
        VMOVDQU( ymm_x[i], [xPointer + i * reg_size] )
    with instruction_columns[1]:
        VPADDD( ymm_x[i], ymm_y )
    with instruction_columns[2]:
        VMOVDQU( [zPointer + i * reg_size], ymm_x[i] )
with instruction_columns[0]:
    ADD( xPointer, reg_size * unroll_regs )
with instruction_columns[2]:
    ADD( zPointer, reg_size * unroll_regs )
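Continuing the listing above, the columns can then be issued with a skew into the main instruction stream. A minimal sketch of one possible interleaving (the skew of two and one steps is illustrative, and it assumes that issue() on an exhausted stream is a no-op, as the while loop in the Instruction Stream Objects example suggests):

load_col, compute_col, store_col = instruction_columns

# Prologue: let the loads run two steps ahead and the additions one step ahead.
load_col.issue()
load_col.issue()
compute_col.issue()

# Steady state and drain: one instruction from each column per step.
while load_col or compute_col or store_col:
    load_col.issue()
    compute_col.issue()
    store_col.issue()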
Is It Worth the Effort?

Although PeachPy simplifies developing assembly kernels, PeachPy is still assembly, with all its drawbacks. For many HPC scientists, C code with compiler intrinsics is a viable alternative to writing assembly:
- C code with intrinsics is more portable than assembly
- Many of the problems targeted by PeachPy become irrelevant (e.g. calling conventions)
- The compiler can take into account more processor details than a human

We did a simple experiment to check whether PeachPy (and assembly in general) can deliver better performance than optimizing compilers.


Experimental Setup

For the experiment we used branchless versions of the vector logarithm and exponential functions from the Yeppp! library. These are high-performance implementations originally developed and tuned using PeachPy. We converted the assembly instructions one-to-one into C++ intrinsics and compiled the result with modern C++ compilers.

The C++ code is a nearly ideal input for a compiler:
- The code is already vectorized with intrinsics.
- Each function processes 40 elements and has only one branch.
- The only work left to the compiler is register allocation and instruction scheduling.
- The initial instruction scheduling is close to optimal.
- There is a lot of room for improving instruction scheduling: the original version contains 581 instructions for the log function and 400 instructions for exp.

The produced codes were benchmarked on an Intel Core i7-4770K processor with the recent Haswell microarchitecture.


Benchmarking Results

(figure: benchmarking results)


Plans and Goals

The current priorities for PeachPy are:
- Support for the PowerPC (including Blue Gene/Q) and Xeon Phi architectures
- Distribution of PeachPy via the Python Package Index
- Generation of machine code directly from PeachPy
- Additional features for the x86-64 and ARM architectures (e.g. table lookups)
- ARM64 and x86-32 ports

In the long term we hope that PeachPy will replace conventional assembly in the HPC workflow.


Public Availability

The PeachPy repository is hosted at bitbucket.org/MDukhan/peachpy

The primary user of PeachPy is the Yeppp! library (www.yeppp.info).
- The codegen directory in the Yeppp! source tree contains a large number of PeachPy codes.


Funding

This research was supported in part by
- the National Science Foundation (NSF) under NSF CAREER award number 0953100, and
- a grant from the Defense Advanced Research Projects Agency (DARPA) Computer Science Study Group program.

Disclaimer: Any opinions, conclusions or recommendations expressed in this presentation are those of the authors and do not necessarily reflect those of NSF or DARPA.