CS 201 Advanced Topics: SIMD, x86-64, ARM
Vector instructions (MMX/SSE/AVX)

Background: IA32 Floating Point
What does this have to do with SIMD?
- Floating Point Unit (x87 FPU)
  - Hardware to add, multiply, and divide IEEE floating point numbers
  - 8 80-bit registers organized as a stack (st0-st7)
  - Operands pushed onto the stack; operators can pop results off into memory
  - Sits alongside the instruction decoder/sequencer and the integer unit, with its own path to memory
- History
  - 8086: floating point handled by a separate 8087 FPU chip (the first hardware implementation of what became IEEE floating point)
  - 486: merged the FPU and integer unit onto one chip

FPU Data Register Stack
- FPU register format (extended precision):
  bit 79: s (sign)   bits 78-64: exp   bits 63-0: frac
- 8 registers, logically forming a shallow stack
  - Top called %st(0); the stack grows down through %st(1), %st(2), %st(3), ...
  - When you push too many values, the bottom values disappear

Simplified FPU operation
- "load" instruction: pushes a number onto the stack
- "storep" instruction: pops the top element from the stack and stores it in memory
- Unary operation: "neg" = pop top element, negate it, push result onto stack
- Binary operations: "addp", "multp" = pop top two elements, perform operation, push result onto stack
- Stack operation is similar to Reverse Polish Notation:
  a b + = push a, push b, add (pop a & b, add, push result)

Example calculation: x = (a-b)/(-b+c)

  load c        # push c
  load b        # push b
  neg           # top of stack is now -b
  addp          # top is now -b+c
  load b        # push b
  load a        # push a
  subp          # top is now a-b
  divp          # top is now (a-b)/(-b+c)
  storep x      # pop result into x

FPU instructions
- Large number of floating point instructions and formats
  - ~50 basic instruction types
  - load (fld*), store (fst*), add (fadd), multiply (fmul), sin (fsin), cos (fcos), partial tangent (fptan), etc.
- Sample instructions:

  Instruction   Effect                          Description
  fldz          push 0.0                        Load zero
  flds Addr     push M[Addr]                    Load single precision real
  fmuls Addr    %st(0) <- %st(0)*M[Addr]        Multiply
  faddp         %st(1) <- %st(0)+%st(1); pop    Add and pop (after the pop, %st(0) has the result)

FPU instruction mnemonics
- Precision
  - "s" single precision
  - "l" double precision
- Operand order
  - Default: Op1 <op> Op2
  - "r" reverses the operand order (i.e. Op2 <op> Op1)
- Stack operation
  - "p" pops a single value from the stack upon completion

Floating Point Code Example
- Compute the inner product of two vectors
  - Single precision arithmetic
  - Common computation

  float ipf(float x[], float y[], int n)
  {
      int i;
      float result = 0.0;
      for (i = 0; i < n; i++) {
          result += x[i] * y[i];
      }
      return result;
  }

  ipf:
      pushl %ebp                  # setup
      movl  %esp,%ebp
      pushl %ebx
      movl  8(%ebp),%ebx          # %ebx = &x
      movl  12(%ebp),%ecx         # %ecx = &y
      movl  16(%ebp),%edx         # %edx = n
      fldz                        # push +0.0
      xorl  %eax,%eax             # i = 0
      cmpl  %edx,%eax             # if i >= n, done
      jge   .L3
  .L5:
      flds  (%ebx,%eax,4)         # push x[i]
      fmuls (%ecx,%eax,4)         # st(0) *= y[i]
      faddp                       # st(1) += st(0); pop
      incl  %eax                  # i++
      cmpl  %edx,%eax             # if i < n, repeat
      jl    .L5
  .L3:
      movl  -4(%ebp),%ebx         # finish
      movl  %ebp, %esp
      popl  %ebp
      ret                         # st(0) = result

Inner Product Stack Trace
- Initialization
  1. fldz                  %st(0) = 0.0
- Iteration 0
  2. flds (%ebx,%eax,4)    %st(1) = 0.0,        %st(0) = x[0]
  3. fmuls (%ecx,%eax,4)   %st(1) = 0.0,        %st(0) = x[0]*y[0]
  4. faddp                 %st(0) = 0.0 + x[0]*y[0]
- Iteration 1
  5. flds (%ebx,%eax,4)    %st(1) = x[0]*y[0],  %st(0) = x[1]
  6. fmuls (%ecx,%eax,4)   %st(1) = x[0]*y[0],  %st(0) = x[1]*y[1]
  7. faddp                 %st(0) = x[0]*y[0] + x[1]*y[1]
- Serial, sequential operation

Motivation for SIMD
- Multimedia, graphics, scientific, and security applications
  - Require a single operation across large amounts of data
    - Frame differencing for video encoding
    - Image fade-in/fade-out
    - Sprite overlay in games
    - Matrix computations
    - Encryption/decryption
- Algorithm characteristics
  - Access data in a regular pattern
  - Operate on short data types (8-bit, 16-bit, 32-bit)
  - Have an operating paradigm in which data streams through fixed processing stages (data-flow operation)
- A natural fit for SIMD instructions

Single Instruction, Multiple Data
- Also known as vector instructions
- Before SIMD: one instruction per data location
- With SIMD: one instruction over multiple sequential data locations
  - Execution units must support "wide" parallel execution
- Examples in many processors
  - Intel x86: MMX, SSE, AVX
  - AMD: 3DNow!
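The "one instruction over multiple sequential data locations" idea can be sketched in C with SSE intrinsics. This is a minimal sketch, not code from the course materials: the function name add4 and the use of unaligned loads are our choices here.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two float arrays four elements at a time; one addps replaces
   four scalar additions. This sketch assumes n is a multiple of 4. */
void add4(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(b + i);            /* load 4 floats from b */
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));   /* 4 adds in one instruction */
    }
}
```

On x86-64, SSE is part of the baseline architecture, so this compiles with plain gcc; on 32-bit targets pass -msse.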
Example
- Scalar form (one multiply-add per component):
  R = R + X[i+0] * 1.08327
  G = G + X[i+1] * 1.89234
  B = B + X[i+2] * 1.29835
- Vector form (one SIMD multiply-add):
  [R G B] = [R G B] + X[i:i+2] * [1.08327 1.89234 1.29835]

Example
- Scalar loop:
  for (i = 0; i < 64; i += 1)
      A[i+0] = A[i+0] + B[i+0];
- Unrolled 4-way:
  for (i = 0; i < 64; i += 4) {
      A[i+0] = A[i+0] + B[i+0];
      A[i+1] = A[i+1] + B[i+1];
      A[i+2] = A[i+2] + B[i+2];
      A[i+3] = A[i+3] + B[i+3];
  }
- Vector form:
  for (i = 0; i < 64; i += 4)
      A[i:i+3] = A[i:i+3] + B[i:i+3]

SIMD in x86
- MMX (MultiMedia eXtensions): Pentium, Pentium II
- SSE (Streaming SIMD Extensions) (1999): Pentium III
- SSE2 (2000), SSE3 (2004): Pentium 4
- SSSE3 (2006), SSE4 (2007): Intel Core
- AVX (2011): Intel Sandy Bridge, Ivy Bridge

General idea
- SIMD (single-instruction, multiple data) vector instructions
  - New data types, registers, operations
  - Parallel operation on small (length 2-8) vectors of integers or floats
  - Example: a "4-way" vector add performs four additions with one instruction

MMX (MultiMedia eXtensions)
- MMX re-uses the FPU registers for SIMD execution of integer ops
  - Aliases the FPU registers st0-st7 as MM0-MM7
  - Treats them as 8 64-bit, randomly accessible data registers
- Registers are partitioned based on the data type of the vector; a vectored add has three possible partitions:
  - 8 byte additions (PADDB)
  - 4 short or word additions (PADDW)
  - 2 int or dword additions (PADDD)
  - A single operation is applied in parallel on the individual parts
- Why not new registers? Intel wanted to avoid adding CPU state:
  - The change does not impact context switching
  - The OS does not need to know about MMX
- Drawback: the FPU and MMX cannot be used at the same time

SSE (Streaming SIMD Extensions)
- Larger, independent registers
  - MMX doesn't allow use of the FPU and SIMD simultaneously
  - 8 128-bit data registers separate from the FPU
  - New hardware registers (XMM0-XMM7)
  - New status register for flags (MXCSR)
- Vectored floating point supported
  - MMX was only for vectored integer operations
  - SSE adds support for vectored floating point operations: 4 single precision floats
- Streaming support
  - Prefetching and cacheability control in loading/storing operands
- Additional integer operations for permutations
  - Shuffling, interleaving

SSE2
- Adds more data types and instructions
- Vectored double-precision floating point operations: 2 double precision floats
- Full support for vectored integer types over the 128-bit XMM registers
  - 16 single-byte vectors
  - 8 word vectors
  - 4 double-word vectors
  - 2 quad-word vectors

SSE3
- Horizontal vector operations
  - Operations within a vector (e.g. min, max)
  - Speed up DSP and 3D ops
  - Complex arithmetic
- Nearly all x86-64 chips have SSE3

SSE4
- Video encoding accelerators
  - Sum of absolute differences (frame differencing)
  - Horizontal minimum search (motion estimation)
  - Conditional copying
- Graphics building blocks
  - Dot product
  - 32-bit vector integer operations on 128-bit registers
  - Dword multiplies
  - Vector rounding

Feature summary
- Integer vectors in 64-bit registers (MMX)
- Single-precision vectors (SSE)
- Double-precision vectors (SSE2)
- Integer vectors in 128-bit registers (SSE2)
- Horizontal arithmetic within a register (SSE3/SSSE3)
- Video encoding accelerators (H.264) (SSE4)
- Graphics building blocks (SSE4)

Intel Architectures (focus: floating point)

  Processors              Architecture     Features
  8086, 286               x86-16
  386, 486, Pentium       x86-32
  Pentium MMX             x86-32           MMX
  Pentium III             x86-32           SSE (4-way single precision fp)
  Pentium 4               x86-32           SSE2 (2-way double precision fp)
  Pentium 4E              x86-32           SSE3
  Pentium 4F              x86-64 / em64t   SSE3
  Core 2 Duo              x86-64 / em64t   SSE4

SSE3 Registers
- All caller saved; %xmm0 holds the floating point return value
- 128 bits each

  %xmm0   Argument #1    %xmm8
  %xmm1   Argument #2    %xmm9
  %xmm2   Argument #3    %xmm10
  %xmm3   Argument #4    %xmm11
  %xmm4   Argument #5    %xmm12
  %xmm5   Argument #6    %xmm13
  %xmm6   Argument #7    %xmm14
  %xmm7   Argument #8    %xmm15

SSE3 Registers
- Different data types and associated instructions (each register is 128 bits)
  - Integer vectors: 16-way byte, 8-way short, 4-way int
  - Floating point vectors: 4-way single (float), 2-way double
  - Floating point scalars: single, double (held in the low-order bits, at the LSB end)

SSE3 Instruction Names
- packed (vector) vs. single slot (scalar), single vs. double precision:

  addps   packed, single precision
  addss   scalar, single precision
  addpd   packed, double precision
  addsd   scalar, double precision

SSE3 Instructions: Examples
- Single precision 4-way vector add:
  addps %xmm0, %xmm1    # each slot of %xmm1 += the corresponding slot of %xmm0
- Single precision scalar add:
  addss %xmm0, %xmm1    # low slot of %xmm1 += low slot of %xmm0

SSE3 Basic Instructions
- Moves

  Single   Double   Effect
  movss    movsd    D <- S

  Usual operand forms: reg -> reg, reg -> mem, mem -> reg
  Packed versions load a vector from memory
- Arithmetic

  Single   Double   Effect
  addss    addsd    D <- D + S
  subss    subsd    D <- D - S
  mulss    mulsd    D <- D x S

x86-64 FP Code Example
- Compute the inner product of two vectors
  - Single precision arithmetic
  - Uses SSE3 instructions

  float ipf(float x[], float y[], int n)
  {
      int i;
      float result = 0.0;
      for (i = 0; i < n; i++)
          result += x[i] * y[i];
      return result;
  }

  ipf:
      xorps  %xmm1, %xmm1           # result = 0.0
      xorl   %ecx, %ecx             # i = 0
      jmp    .L8                    # goto middle
  .L10:                             # loop:
      movslq %ecx,%rax              # icpy = i
      incl   %ecx                   # i++
      movss  (%rsi,%rax,4), %xmm0   # t = y[icpy]
      mulss  (%rdi,%rax,4), %xmm0   # t *= x[icpy]
      addss  %xmm0, %xmm1           # result += t
  .L8:                              # middle:
      cmpl   %edx, %ecx             # i:n
      jl     .L10                   # if < goto loop
      movaps %xmm1, %xmm0           # return result
      ret

SSE3 Conversion Instructions
- Same operand forms as moves

  Instruction   Description
  cvtss2sd      single → double
  cvtsd2ss      double → single
  cvtsi2ss      int → single
  cvtsi2sd      int → double
  cvtsi2ssq     quad int → single

Detecting if it is supported

  mov  eax, 1
  cpuid                   ; cpuid supported since the Pentium
  test edx, 00800000h     ; 00800000h = bit 23: MMX
  jnz  HasMMX             ; 02000000h = bit 25: SSE
                          ; 04000000h = bit 26: SSE2

Detecting if it is supported
  #include <stdio.h>
  #include <string.h>

  #define cpuid(func,ax,bx,cx,dx)\
      __asm__ __volatile__ ("cpuid":\
          "=a" (ax), "=b" (bx), "=c" (cx), "=d" (dx) : "a" (func));

  int main(int argc, char* argv[])
  {
      int a, b, c, d, i;
      char x[13];
      int* q;

      for (i = 0; i < 13; i++)
          x[i] = 0;
      q = (int *) x;

      /* 12 char string returned in 3 registers */
      cpuid(0, a, q[0], q[2], q[1]);
      printf("str: %s\n", x);

      /* Bits returned in all 4 registers */
      cpuid(1, a, b, c, d);
      printf("a: %08x, b: %08x, c: %08x, d: %08x\n", a, b, c, d);
      printf(" bh * 8 = cache line size\n");
      printf(" bit 0 of c = SSE3 supported\n");
      printf(" bit 25 of c = AES supported\n");
      printf(" bit 0 of d = On-board FPU\n");
      printf(" bit 4 of d = Time-stamp counter\n");
      printf(" bit 26 of d = SSE2 supported\n");
      printf(" bit 25 of d = SSE supported\n");
      printf(" bit 23 of d = MMX supported\n");
  }

  http://thefengs.com/wuchang/courses/cs201/class/11/cpuid.c

Detecting if it is supported

  mashimaro <~> 12:43PM % cat /proc/cpuinfo
  processor       : 0
  vendor_id       : GenuineIntel
  cpu family      : 6
  model           : 15
  model name      : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
  stepping        : 11
  cpu MHz         : 2393.974
  cache size      : 4096 KB
  physical id     : 0
  siblings        : 4
  core id         : 0
  cpu cores       : 4
  apicid          : 0
  initial apicid  : 0
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 10
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                    cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
                    pbe syscall nx lm constant_tsc arch_perfmon pebs bts
                    rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr
                    lahf_lm
  bogomips        : 4791.08
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 36 bits physical, 48 bits virtual
  power management:

AVX2
- Intel codename Haswell (2013), Broadwell (2014)
- Expansion of most integer AVX instructions to 256 bits
- "Gather" support to load data from non-contiguous memory
- 3-operand FMA (fused multiply-add) operations at full precision (a + b*c)
  - Dot products, matrix multiplications, polynomial evaluations via Horner's rule (cf. the DEC VAX POLY instruction, 1977)
  - Speeds up software-based division and square root operations (so dedicated hardware for these operations can be removed)

Programming SIMD
- Store data contiguously (i.e. in an array)
- Define the total size of the vector in bytes
  - 8 bytes (64 bits) for MMX
  - 16 bytes (128 bits) for SSE2 and beyond
- Define the type of the vector elements; for 128-bit registers:
  - 2 double, 4 float, 4 int, 8 short, 16 char
- SIMD instructions are based on each vector type

Example: SIMD via macros/libraries
- Rely on compiler macros or library calls for SSE acceleration
  - Macros embed in-line assembly into the program
  - Calls go into library functions compiled with SSE
- Adding two 128-bit vectors containing 4 floats:

  // Microsoft-specific compiler intrinsic function
  __m128 _mm_add_ps(__m128 a, __m128 b);

  __m128 a, b, c;
  c = _mm_add_ps(a, b);   // intrinsic function

  a:  1  2  3  4
  b:  2  4  6  8
      +  +  +  +
  c:  3  6  9 12

  http://msdn.microsoft.com/en-us/library/y0dh78ez.aspx

Example: SIMD in C
- Adding two vectors (SSE)
- Must pass the compiler hints about your vector
  - Size of the vector in bytes (i.e. vector_size(16))
  - Type of the vector elements (i.e. float)

  // vector of four single floats
  typedef float v4sf __attribute__ ((vector_size(16)));

  union f4vector {
      v4sf v;
      float f[4];
  };

  void add_vector()
  {
      union f4vector a, b, c;
      a.f[0] = 1; a.f[1] = 2; a.f[2] = 3; a.f[3] = 4;
      b.f[0] = 5; b.f[1] = 6; b.f[2] = 7; b.f[3] = 8;
      c.v = a.v + b.v;
  }

- Compile with gcc -msse2
  http://thefengs.com/wuchang/courses/cs201/class/11/add_nosse.c
  http://thefengs.com/wuchang/courses/cs201/class/11/add_sse.c

Examples: SSE in C
- Measuring the performance improvement using rdtsc
  http://thefengs.com/wuchang/courses/cs201/class/11

Vector Instructions
- Starting with version 4.1.1, gcc can autovectorize to some extent
  - -O3 or -ftree-vectorize
  - No speed-up guaranteed; very limited
  - icc is much better as of now
- For highest performance, vectorize yourself using intrinsics
  - Intrinsics = C interface to vector instructions

AES
- AES-NI announced 2008
  - Added to Intel Westmere processors and beyond (2010)
  - Separate from MMX/SSE/AVX
- AESENC/AESDEC perform one round of an AES encryption/decryption flow
  - One single-byte substitution step, one row-wise permutation step, one column-wise mixing step, and addition of the round key (order depends on whether one is encrypting or decrypting)
  - Speedup from 28 cycles per byte to 3.5 cycles per byte
  - 10 rounds per block for 128-bit keys, 12 rounds per block for 192-bit keys, 14 rounds per block for 256-bit keys
- Software support from security vendors is widespread
  http://software.intel.com/file/24917

x86-64

x86-64 History
- 64-bit version of the x86 architecture
  - Developed by AMD in 2000
  - First processor released in 2003
  - Adopted by Intel in 2004
- Features
  - 64-bit registers and instructions
  - Additional integer registers
  - Adoption and extension of Intel's SSE
  - No-execute bit
  - Conditional move instruction (avoiding branches)
- http://www.x86-64.org/

64-bit registers
- From IA-32: %ah/%al (8 bits), %ax (16 bits), %eax (32 bits)
- Now: %rax (64 bits)

  63        31        15    7     0
  [            %rax               ]
            [      %eax           ]
                      [    %ax    ]
                      [%ah ][ %al ]

More integer registers
- r8-r15, denoted:
  - %rXb: 8 bits
  - %rXw: 16 bits
  - %rXd: 32 bits
  - %rX: 64 bits
  where X is from 8 to 15
- Within gdb: 'info registers'

x86-64 Integer Registers

  %rax (%eax)    %r8  (%r8d)
  %rbx (%ebx)    %r9  (%r9d)
  %rcx (%ecx)    %r10 (%r10d)
  %rdx (%edx)    %r11 (%r11d)
  %rsi (%esi)    %r12 (%r12d)
  %rdi (%edi)    %r13 (%r13d)
  %rsp (%esp)    %r14 (%r14d)
  %rbp (%ebp)    %r15 (%r15d)

- Twice the number of registers
- Accessible as 8, 16, 32, or 64 bits

More vector registers
- XMM0-XMM7: the 128-bit SSE registers available prior to x86-64
- XMM8-XMM15: 8 additional 128-bit registers

64-bit instructions
- All 32-bit instructions have quad-word equivalents; use the suffix 'q':
  movq $0x4,%rax
  addq %rcx,%rax
- Exception: stack operations (pop, push, call, ret, enter, leave)
  - Implicitly 64-bit; 32-bit versions are not valid
  - 32-bit values are translated to 64-bit versions with zeros

Modified calling convention
- Previously (IA32)
  - Function parameters pushed onto the stack
  - Frame pointer management and update
  - A lot of memory operations and overhead!
- x86-64
  - Uses registers to pass function parameters
    - %rdi, %rsi, %rdx, %rcx, %r8, %r9 used for argument build
    - %xmm0-%xmm7 for floating point arguments
    - Stack used if there are more than 6 parameters
    - Kernel interface also uses registers for parameters: %rdi, %rsi, %rdx, %r10, %r8, %r9
  - Avoids frame management when possible
    - Simple functions do not incur frame management overhead
    - All references to the stack frame are via the stack pointer, eliminating the need to update %ebp/%rbp
  - Callee saved registers: %rbp, %rbx, %r12 through %r15

x86-64 Integer Registers

  %rax   Return value     %r8    Argument #5
  %rbx   Callee saved     %r9    Argument #6
  %rcx   Argument #4      %r10   Caller saved
  %rdx   Argument #3      %r11   Used for linking
  %rsi   Argument #2      %r12   Callee saved
  %rdi   Argument #1      %r13   Callee saved
  %rsp   Stack pointer    %r14   Callee saved
  %rbp   Callee saved     %r15   Callee saved

x86-64 Long Swap

  void swap(long *xp, long *yp)
  {
      long t0 = *xp;
      long t1 = *yp;
      *xp = t1;
      *yp = t0;
  }

  swap:
      movq (%rdi), %rdx
      movq (%rsi), %rax
      movq %rax, (%rdi)
      movq %rdx, (%rsi)
      ret

- Operands passed in registers: first (xp) in %rdi, second (yp) in %rsi
- 64-bit pointers
- No stack operations required (except ret)
- Avoids the stack entirely: can hold all local information in registers

x86-64 Locals in the Red Zone

  /* Swap, using local array */
  void swap_a(long *xp, long *yp)
  {
      volatile long loc[2];
      loc[0] = *xp;
      loc[1] = *yp;
      *xp = loc[1];
      *yp = loc[0];
  }

  swap_a:
      movq (%rdi), %rax
      movq %rax, -24(%rsp)
      movq (%rsi), %rax
      movq %rax, -16(%rsp)
      movq -16(%rsp), %rax
      movq %rax, (%rdi)
      movq -24(%rsp), %rax
      movq %rax, (%rsi)
      ret

  Stack layout:
          rtn ptr   <- %rsp
    -8    unused
    -16   loc[1]
    -24   loc[0]

- Avoiding a stack pointer change: the compiler manages the stack frame without changing %rsp, allocating a window beyond the stack pointer (the red zone)

Interesting Features of Stack Frame
- Allocate the entire frame at once
  - All stack accesses can be relative to %rsp
  - Done by decrementing the stack pointer
  - Allocation can be delayed, since it is safe to temporarily use the red zone
- Simple deallocation
  - Increment the stack pointer
  - No base/frame pointer needed

x86-64 function calls via jump

  long scount = 0;

  /* Swap a[i] & a[i+1] */
  void swap_ele(long a[], int i)
  {
      swap(&a[i], &a[i+1]);
  }

  swap_ele:
      movslq %esi,%rsi            # Sign extend i
      leaq   (%rdi,%rsi,8), %rdi  # &a[i]
      leaq   8(%rdi), %rsi        # &a[i+1]
      jmp    swap                 # swap()

- When swap executes ret, it returns directly from swap_ele
- Possible since swap is a "tail call" (no instructions come after it)

x86-64 Procedure Summary
- Heavy use of registers
  - Parameter passing
  - More temporaries, since there are more registers
- Minimal use of the stack
  - Sometimes none
  - Allocate/deallocate entire block
- Many tricky optimizations
  - What kind of stack frame to use
  - Calling with jump
  - Various allocation techniques

Turning on/off 64-bit

  $ gcc -m64 code.c -o code
  $ gcc -m32 code.c -o code

ARM

ARM history
- Acorn RISC Machine (Acorn Computers, UK)
  - Design initiated 1983, first silicon 1985
  - 32-bit reduced instruction set machine inspired by Berkeley RISC (Patterson, 1980-1984)
- Licensing model allows for custom designs (in contrast to x86)
  - ARM does not produce its own chips
  - Companies customize the base CPU for their products
  - P.A. Semi (a fabless SoC startup acquired by Apple for the A4 design that powers the iPhone/iPad)
  - ARM estimated to make $0.11 on each chip (royalties + license)
- Runs 98% of all mobile phones (2005)
  - Per-watt performance currently better than x86
  - Fewer "legacy" instructions to implement

ARM architecture
- RISC features
  - Fewer instructions
    - Complex instructions handled via multiple simpler ones
    - Results in a smaller execution unit
  - Only loads/stores to and from memory
  - Uniform-size instructions
    - Less decoding logic
    - 16-bit in Thumb mode to increase code density

ARM architecture
- ALU features
  - Conditional execution built into many instructions
    - Fewer branches
    - Less power lost to stalled pipelines
    - No need for branch prediction logic
  - Operand bit-shifts supported in certain instructions
    - Built-in barrel shifter in the ALU
    - Bit shifting plus an ALU operation in one instruction
  - Support for 3-operand instructions: <R> = <Op1> OP <Op2>

ARM architecture
- Control state features
  - Shadow registers (pre v7): allow efficient interrupt processing (no need to save registers onto the stack)
  - Link register: stores the return address for leaf functions (no stack operation needed)

ARM architecture
- Advanced features
  - SIMD (NEON) to compete with x86 at the high end: mp3, AES, SHA support
  - Hardware virtualization: hypervisor mode
  - Jazelle DBX (Direct Bytecode eXecution): native execution of Java bytecode
  - Security
    - No-execute page protection (return2libc attacks still possible)
    - TrustZone: support for trusted execution via hardware-based access control and context management (e.g. isolating DRM processing)

ARM vs. x86
- Key architectural differences
  - CISC vs. RISC
    - Legacy instructions impact per-watt performance
    - Atom (a stripped-down 80386 core) was once a candidate for the iPad, until an Apple VP threatened to quit over the choice
  - State pushed onto the stack vs. swapped from shadow registers
  - Conditional execution via branches (with later use of conditional moves) vs. built into instructions
  - Bit shifting via separate, explicit instructions
  - Memory locations usable as ALU operands
  - Mostly 2-operand instructions (<D> = <D> OP <S>)

ARM vs. x86
- Key differences
  - Intel is the only producer of x86 chips and designs
    - No SoC customization (everyone gets the same hardware)
    - Must wait for Intel to give you what you want
    - ARM allows Apple to differentiate itself
  - Intel and ARM
    - XScale: Intel's version of ARM, sold to Marvell in 2006
    - Speculation: leakage current will eventually dominate power consumption (versus switching current); Intel's process advantage could make RISC/CISC moot, becoming bigger than the custom-design + RISC advantage (avoiding the cost of a license); latest attempt: Medfield (2012)

Extra

Example: SIMD in assembly
- Add a constant to a vector (MMX)

  // Microsoft Macro Assembler (MASM) format
  char d[]   = {5, 5, 5, 5, 5, 5, 5, 5};    // 8 bytes
  char clr[] = {65, 66, 68, ..., 87, 88};   // 24 bytes

  __asm {
          movq  mm1, d          // load constant into mm1 reg
          mov   cx, 3           // initialize loop counter
          mov   esi, 0          // set index to 0
  L1:     movq  mm0, clr[esi]   // load 8 bytes into mm0 reg
          paddb mm0, mm1        // perform vector addition
          movq  clr[esi], mm0   // store 8 bytes of result
          add   esi, 8          // update index
          loop  L1              // loop macro (on cx)
          emms                  // clear MMX register state
  }
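The PADDB loop above can also be written portably with compiler intrinsics instead of in-line assembly. This sketch uses the SSE2 successor _mm_add_epi8, which adds 16 bytes at a time where PADDB handled 8; the function name add_const_bytes and its buffer handling are our choices, not from the course materials.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add the constant k to every byte of buf, 16 bytes per instruction.
   This sketch assumes n is a multiple of 16. */
void add_const_bytes(unsigned char *buf, int n, char k)
{
    __m128i vk = _mm_set1_epi8(k);  /* 16 copies of k, like the d[] array */
    int i;
    for (i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((__m128i *)(buf + i));  /* load 16 bytes */
        v = _mm_add_epi8(v, vk);                            /* 16 byte adds at once */
        _mm_storeu_si128((__m128i *)(buf + i), v);          /* store result */
    }
}
```

Unlike the MMX version, no emms is needed at the end: the XMM registers are separate from the x87 FPU stack.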