# ppt - The Fengs

```CS 201
SIMD, x86-64, ARM
Vector instructions (MMX/SSE/AVX)
Background: IA32 Floating Point
What does this have to do with SIMD?
Floating Point Unit (X87 FPU)
  Hardware to add, multiply, and divide IEEE floating point numbers
  8 80-bit registers organized as a stack (st0-st7)
  Operands pushed onto stack and operators can pop results off into memory
History
  8086: first x86 processor with IEEE floating point, provided by a separate 8087 FPU (floating point unit)
  486: merged FPU and Integer Unit onto one chip
[Figure: instruction decoder and sequencer driving the Integer Unit and the FPU, both connected to Memory]
FPU Data Register Stack
FPU register format (extended precision)
  bit 79: s (sign) | bits 78-64: exp | bits 63-0: frac
FPU registers
  8 registers
  Logically forms shallow stack
  Top called %st(0)
  When too many values are pushed, bottom values disappear
[Figure: register stack %st(3), %st(2), %st(1), %st(0) = “Top”; stack grows down]
Simplified FPU operation
push operation
  Pushes number onto stack
“storep” instruction
  Pops top element from stack and stores it in memory
unary operation
  “neg” = pop top element, negate it, push result onto stack
binary operations
  “addp”, “multp” = pop top two elements, perform operation, push result onto stack
Stack operation similar to Reverse Polish Notation
  a b + = push a, push b, add (pop a & b, add, push result)
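Example calculation
x = (a-b)/(-b+c)
  (stack shown left to right, top of stack rightmost)
  push c        # stack: c
  push b        # stack: c, b
  neg           # stack: c, -b
  addp          # stack: -b+c
  push b        # stack: -b+c, b
  push a        # stack: -b+c, b, a
  subp          # stack: -b+c, a-b
  divp          # stack: (a-b)/(-b+c)
  storep x      # stack empty; x holds the result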
FPU instructions
Large number of floating point instructions and formats
  ~50 basic instruction types
  sin (fsin), cos (fcos), tan (fptan) etc…
Sample instructions:
  Instruction   Effect                          Description
  fldz          push 0.0                        Load zero
  flds Addr     push M[Addr]                    Load single precision value
  fmuls Addr    %st(0) <- %st(0)*M[Addr]        Multiply
  faddp         %st(1) <- %st(0)+%st(1); pop    Add and pop
After pop, %st(0) has result
FPU instruction mnemonics
Precision
  “s” single precision
  “l” double precision
Operand order
  Default Op1 <op> Op2
  “r” reverse operand order (i.e. Op2 <op> Op1)
Stack operation
  “p” pop a single value from stack upon completion
Floating Point Code Example
Compute Inner Product of Two Vectors
  Single precision arithmetic
  Common computation
float ipf (float x[],
           float y[],
           int n)
{
    int i;
    float result = 0.0;
    for (i = 0; i < n; i++) {
        result += x[i] * y[i];
    }
    return result;
}
    pushl %ebp                 # setup
    movl %esp,%ebp
    pushl %ebx
    movl 8(%ebp),%ebx          # %ebx=&x
    movl 12(%ebp),%ecx         # %ecx=&y
    movl 16(%ebp),%edx         # %edx=n
    fldz                       # push +0.0
    xorl %eax,%eax             # i=0
    cmpl %edx,%eax
    jge .L3                    # if i>=n done
.L5:
    flds (%ebx,%eax,4)         # push x[i]
    fmuls (%ecx,%eax,4)        # st(0)*=y[i]
    faddp                      # st(1)+=st(0); pop
    incl %eax                  # i++
    cmpl %edx,%eax
    jl .L5                     # if i<n repeat
.L3:
    movl -4(%ebp),%ebx         # finish
    movl %ebp, %esp
    popl %ebp
    ret                        # st(0) = result
Inner Product Stack Trace
Initialization
  1. fldz                   %st(0) = 0.0
Iteration 0
  2. flds (%ebx,%eax,4)     %st(1) = 0.0          %st(0) = x[0]
  3. fmuls (%ecx,%eax,4)    %st(1) = 0.0          %st(0) = x[0]*y[0]
  4. faddp                  %st(0) = 0.0+x[0]*y[0]
Iteration 1
  5. flds (%ebx,%eax,4)     %st(1) = x[0]*y[0]    %st(0) = x[1]
  6. fmuls (%ecx,%eax,4)    %st(1) = x[0]*y[0]    %st(0) = x[1]*y[1]
  7. faddp                  %st(0) = x[0]*y[0]+x[1]*y[1]
Serial, sequential operation
Motivation for SIMD
Multimedia, graphics, scientific, and security applications
  Require a single operation across large amounts of data
    Frame differencing for video encoding
    Sprite overlay in game
    Matrix computations
    Encryption/decryption
  Algorithm characteristics
    Access data in a regular pattern
    Operate on short data types (8-bit, 16-bit, 32-bit)
    Have an operating paradigm that has data streaming through fixed processing stages
      » Data-flow operation
Natural fit for SIMD instructions
Single Instruction, Multiple Data
Also known as vector instructions
  Before SIMD
    One instruction per data location
  With SIMD
    One instruction over multiple sequential data locations
    Execution units must support “wide” parallel execution
Examples in many processors
  Intel x86
    MMX, SSE, AVX
  AMD
    3DNow!
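Example
Scalar: one operation per color channel
  R = R + XR * 1.08327
  G = G + XG * 1.89234
  B = B + XB * 1.29835
Vector: one operation over all three channels
  [R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]
Scalar: one add per array element
  R = R + X[i+0]
  G = G + X[i+1]
  B = B + X[i+2]
Vector: one add over three sequential elements
  [R G B] = [R G B] + X[i:i+2]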
Example
Scalar loop
  for (i=0; i<64; i+=1)
      A[i+0] = A[i+0] + B[i+0]
Unrolled by 4
  for (i=0; i<64; i+=4){
      A[i+0] = A[i+0] + B[i+0]
      A[i+1] = A[i+1] + B[i+1]
      A[i+2] = A[i+2] + B[i+2]
      A[i+3] = A[i+3] + B[i+3]
  }
Vectorized (4 elements per instruction)
  for (i=0; i<64; i+=4)
      A[i:i+3] = A[i:i+3] + B[i:i+3]
SIMD in x86
MMX (MultiMedia eXtensions)
  Pentium, Pentium II
SSE (Streaming SIMD Extensions) (1999)
  Pentium 3
SSE2 (2000), SSE3 (2004)
  Pentium 4
SSSE3 (2006), SSE4 (2007)
  Intel Core
AVX (2011)
  Intel Sandy Bridge, Ivy Bridge
General idea
SIMD (single-instruction, multiple data) vector instructions
  New data types, registers, operations
  Parallel operation on small (length 2-8) vectors of integers or floats
  Example: “4-way” element-wise add (+) and multiply (x) over two 4-element vectors
MMX (MultiMedia eXtensions)
MMX re-uses FPU registers for SIMD execution of integer ops
  Alias the FPU registers st0-st7 as MM0-MM7
  Treat as 8 64-bit data registers randomly accessible
  Partition registers based on data type of vector
  How many different partitions are there for a vectored add?
    [Figure: 64-bit register split into lanes, with one + per lane]
  Single operation applied in parallel on individual parts
Why not new registers?
  Wanted to avoid adding CPU state
    Change does not impact context switching
    OS does not need to know about MMX
  Drawback: can't use FPU and MMX at the same time
SSE (Streaming SIMD Extensions)
Larger, independent registers
  MMX doesn't allow use of FPU and SIMD simultaneously
  8 128-bit data registers separate from FPU
    New hardware registers (XMM0-XMM7)
    New status register for flags (MXCSR)
Vectored floating point supported
  MMX only for vectored integer operations
  SSE adds support for vectored floating point operations
    4 single precision floats
Streaming support
  Reordering of vector operands
    Shuffling, interleaving
SSE2
Adds more data types and instructions
Vectored double-precision floating point operations
  2 double precision floats
Full support for vectored integer types over 128-bit XMM registers
  Vectors of 16 single bytes
  Vectors of 8 words
  Vectors of 4 double words
SSE3
Horizontal vector operations
  Operations across the elements of one vector (e.g. horizontal add, subtract)
  Speed up DSP and 3D ops
Complex arithmetic (SSE3)
All x86-64 chips have SSE2; all but the earliest also have SSE3
SSE4
Video encoding accelerators
  Sum of absolute differences (frame differencing)
  Horizontal Minimum Search (motion estimation)
  Conditional copying
Graphics building blocks
  Dot product
32-bit vector integer operations on 128-bit registers
  Dword multiplies
  Vector rounding
Feature summary
Integer vectors (64-bit registers) (MMX)
Single-precision vectors (SSE)
Double-precision vectors (SSE2)
Integer vectors (128-bit registers) (SSE2)
Horizontal arithmetic within register (SSE3/SSSE3)
Video encoding accelerators (H.264) (SSE4)
Graphics building blocks (SSE4)
Intel Architectures (Focus Floating Point)
  Processors     Architectures     Features
  8086           x86-16
  286
  386            x86-32
  486
  Pentium
  Pentium MMX                      MMX
  Pentium III                      SSE   4-way single precision fp
  Pentium 4                        SSE2  2-way double precision fp
  Pentium 4E                       SSE3
  Pentium 4F     x86-64 / em64t
  Core 2 Duo                       SSE4
  (rows ordered by time, oldest first)
SSE3 Registers
All caller saved
%xmm0 for floating point return value
128-bit registers:
  %xmm0   Argument #1      %xmm8
  %xmm1   Argument #2      %xmm9
  %xmm2   Argument #3      %xmm10
  %xmm3   Argument #4      %xmm11
  %xmm4   Argument #5      %xmm12
  %xmm5   Argument #6      %xmm13
  %xmm6   Argument #7      %xmm14
  %xmm7   Argument #8      %xmm15
SSE3 Registers
Different data types and associated instructions (128 bit)
Integer vectors:
  16-way byte
  8-way short
  4-way int
Floating point vectors:
  4-way single (float)
  2-way double
Floating point scalars (held in the least significant bits, LSB):
  single
  double
SSE3 Instruction Names
                      packed (vector)    single slot (scalar)
  single precision    addps              addss
  double precision    addpd              addsd
SSE3 Instructions: Examples
[Figure: two diagrams of adding %xmm0 and %xmm1]
SSE3 Basic Instructions
Moves
  Single   Double   Effect
  movss    movsd    D ← S
  Usual operand form: reg → reg, reg → mem, mem → reg
  Packed versions to load vector from memory
Arithmetic
  Single   Double   Effect
  addss    addsd    D ← D + S
  subss    subsd    D ← D – S
  mulss    mulsd    D ← D x S
x86-64 FP Code Example
Compute inner product of two vectors
  Single precision arithmetic
  Uses SSE3 instructions
float ipf (float x[],
           float y[],
           int n) {
    int i;
    float result = 0.0;
    for (i = 0; i < n; i++)
        result += x[i]*y[i];
    return result;
}
ipf:
    xorps %xmm1, %xmm1             # result = 0.0
    xorl %ecx, %ecx                # i = 0
    jmp .L8                        # goto middle
.L10:                              # loop:
    movslq %ecx,%rax               # icpy = i
    incl %ecx                      # i++
    movss (%rsi,%rax,4), %xmm0     # t = y[icpy]
    mulss (%rdi,%rax,4), %xmm0     # t *= x[icpy]
    addss %xmm0, %xmm1             # result += t
.L8:                               # middle:
    cmpl %edx, %ecx                # i:n
    jl .L10                        # if < goto loop
    movaps %xmm1, %xmm0            # return result
    ret
SSE3 Conversion Instructions
Conversions
  Same operand forms as moves
  Instruction   Description
  cvtss2sd      single → double
  cvtsd2ss      double → single
  cvtsi2ss      int → single
  cvtsi2sd      int → double
  cvtsi2ssq     quad int → single
Detecting if it is supported
    mov eax, 1
    cpuid                    ; supported since Pentium
    test edx, 00800000h      ; 00800000h (bit 23) MMX
                             ; 02000000h (bit 25) SSE
                             ; 04000000h (bit 26) SSE2
    jnz HasMMX
Detecting if it is supported
#include <stdio.h>
#include <string.h>
#define cpuid(func,ax,bx,cx,dx)\
    __asm__ __volatile__ ("cpuid":\
    "=a" (ax), "=b" (bx), "=c" (cx), "=d" (dx) : "a" (func));
int main(int argc, char* argv[]) {
    int a, b, c, d, i;
    char x[13];
    int* q;
    for (i=0; i < 13; i++) x[i]=0;
    q=(int *) x;
    /* 12 char string returned in 3 registers (order: ebx, edx, ecx) */
    cpuid(0,a,q[0],q[2],q[1]);
    printf("str: %s\n", x);
    /* Bits returned in all 4 registers */
    cpuid(1,a,b,c,d);
    printf("a: %08x, b: %08x, c: %08x, d: %08x\n",a,b,c,d);
    printf(" bh * 8 = cache line size\n");
    printf(" bit 0 of c = SSE3 supported\n");
    printf(" bit 25 of c = AES supported\n");
    printf(" bit 0 of d = On-board FPU\n");
    printf(" bit 4 of d = Time-stamp counter\n");
    printf(" bit 26 of d = SSE2 supported\n");
    printf(" bit 25 of d = SSE supported\n");
    printf(" bit 23 of d = MMX supported\n");
    return 0;
}
http://thefengs.com/wuchang/courses/cs201/class/11/cpuid.c
Detecting if it is supported
mashimaro <~> 12:43PM % cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Q6600 @ 2.40GHz
stepping        : 11
cpu MHz         : 2393.974
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
                  mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl
                  vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 4791.08
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
AVX2
Intel codename Haswell (2013), Broadwell (2014)
  Expansion of most integer vector instructions to 256 bits
  “Gather support” to load data from non-contiguous memory
  3-operand FMA operations (fused multiply-add operations) at full precision (a+b*c)
    Dot products, matrix multiplications, polynomial evaluations via Horner's rule (see DEC VAX POLY instruction 1977)
    Speeds up software-based division and square root operations (so dedicated hardware for these operations can be removed)
Programming SIMD
Store data contiguously (i.e. in an array)
Define total size of vector in bytes
  8 bytes (64 bits) for MMX
  16 bytes (128 bits) for SSE2 and beyond
Define type of vector elements
  For 128 bit registers
    2 double
    4 float
    4 int
    8 short
    16 char
SIMD instructions based on each vector type
Example: SIMD via macros/libraries
Rely on compiler macros or library calls for SSE acceleration
  Macros embed in-line assembly into program
  Call into library functions compiled with SSE
Adding two 128-bit vectors containing 4 float
  // Compiler intrinsic function (documented by Microsoft; declared in <xmmintrin.h>)
  __m128 _mm_add_ps(__m128 a , __m128 b );
  __m128 a, b, c;
  // intrinsic function
  c = _mm_add_ps(a, b);
  a:   1  2  3  4
  b:   2  4  6  8
       +  +  +  +
  c:   3  6  9 12
http://msdn.microsoft.com/en-us/library/y0dh78ez.aspx
Example: SIMD in C
A gcc vector type declaration specifies
  Size of each vector in bytes (i.e. vector_size(16))
  Type of vector element (i.e. float)
// vector of four single floats
typedef float v4sf __attribute__ ((vector_size(16)));
union f4vector {
    v4sf v;
    float f[4];
};
int main(void) {
    union f4vector a, b, c;
    a.f[0] = 1; a.f[1] = 2; a.f[2] = 3; a.f[3] = 4;
    b.f[0] = 5; b.f[1] = 6; b.f[2] = 7; b.f[3] = 8;
    c.v = a.v + b.v;   /* one 4-way vector add */
    return 0;
}
gcc -msse2
Examples: SSE in C
Measuring performance improvement using rdtsc
http://thefengs.com/wuchang/courses/cs201/class/11
Vector Instructions
Starting with version 4.1.1, gcc can autovectorize to some extent
  -O3 or -ftree-vectorize
  No speed-up guaranteed
  Very limited
  icc as of now much better
For highest performance vectorize yourself using intrinsics
  Intrinsics = C interface to vector instructions
AES
AES-NI announced 2008
  Added to Intel Westmere processors and beyond (2010)
  Separate from MMX/SSE/AVX
  AESENC/AESDEC performs one round of an AES encryption/decryption flow
    One single byte substitution step, one row-wise permutation step, one column-wise mixing step, addition of the round key (order depends on whether one is encrypting or decrypting)
    Speed up from 28 cycles per byte to 3.5 cycles per byte
    10 rounds per block for 128-bit keys, 12 rounds per block for 192-bit keys, 14 rounds per block for 256-bit keys
    Software support from security vendors widespread
http://software.intel.com/file/24917
x86-64
History
  64-bit version of x86 architecture
  Developed by AMD in 2000
  First processor released in 2003
Features
  64-bit registers and instructions
  Adoption and extension of Intel’s SSE
  No-execute bit
  Conditional move instruction (avoiding branches)
http://www.x86-64.org/
64-bit registers
From IA-32
  %ah/%al: 8 bits
  %ax: 16 bits
  %eax: 32 bits
Now
  %rax: 64 bits
[Figure: %rax spans bits 63-0, %eax bits 31-0, %ax bits 15-0, %ah bits 15-8, %al bits 7-0]
More integer registers
r8 – r15
  Denoted
    %rXb - 8 bits
    %rXw - 16 bits
    %rXd - 32 bits
    %rX - 64 bits
  where X is from 8 to 15
Within gdb
  ‘info registers’
x86-64 Integer Registers
  %rax   %eax      %r8    %r8d
  %rbx   %ebx      %r9    %r9d
  %rcx   %ecx      %r10   %r10d
  %rdx   %edx      %r11   %r11d
  %rsi   %esi      %r12   %r12d
  %rdi   %edi      %r13   %r13d
  %rsp   %esp      %r14   %r14d
  %rbp   %ebp      %r15   %r15d
Twice the number of registers
Accessible as 8, 16, 32, 64 bits
More vector registers
XMM0-XMM7
  128-bit SSE registers prior to x86-64
XMM8-XMM15
  Additional 128-bit registers added by x86-64
64-bit instructions
All 32-bit instructions have quad-word equivalents
Use suffix 'q' to denote
movq
\$0x4,%rax
%rcx,%rax
Exception for stack operations
  pop, push, call, ret, enter, leave
  Implicitly 64 bit
  32 bit versions not valid
  Values translated to 64 bit versions with zeros.
Modified calling convention
Previously
  Function parameters pushed onto the stack
  Frame pointer management and update
  A lot of memory operations and overhead!
x86-64
  Use registers to pass function parameters
    %rdi, %rsi, %rdx, %rcx, %r8, %r9 used for argument build
    %xmm0 - %xmm7 for floating point arguments
    Avoid frame management when possible
    Simple functions do not incur frame management overhead
    Use stack if more than 6 parameters
  Kernel interface also uses registers for parameters
    %rdi, %rsi, %rdx, %r10, %r8, %r9
  Callee saved registers
    %rbp, %rbx, from %r12 to %r15
  All references to stack frame via stack pointer
    Eliminates need to update %ebp/%rbp
x86-64 Integer Registers
  %rax   Return value      %r8    Argument #5
  %rbx   Callee saved      %r9    Argument #6
  %rcx   Argument #4       %r10   Caller saved
  %rdx   Argument #3       %r11   Caller saved
  %rsi   Argument #2       %r12   Callee saved
  %rdi   Argument #1       %r13   Callee saved
  %rsp   Stack pointer     %r14   Callee saved
  %rbp   Callee saved      %r15   Callee saved
x86-64 Long Swap
void swap(long *xp, long *yp)
{
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}
swap:
    movq (%rdi), %rdx
    movq (%rsi), %rax
    movq %rax, (%rdi)
    movq %rdx, (%rsi)
    ret
Operands passed in registers
  First (xp) in %rdi, second (yp) in %rsi
  64-bit pointers
No stack operations required (except ret)
Avoiding stack
  Can hold all local information in registers
x86-64 Locals in the Red Zone
/* Swap, using local array */
void swap_a(long *xp, long *yp)
{
    volatile long loc[2];
    loc[0] = *xp;
    loc[1] = *yp;
    *xp = loc[1];
    *yp = loc[0];
}
swap_a:
    movq (%rdi), %rax
    movq %rax, -24(%rsp)
    movq (%rsi), %rax
    movq %rax, -16(%rsp)
    movq -16(%rsp), %rax
    movq %rax, (%rdi)
    movq -24(%rsp), %rax
    movq %rax, (%rsi)
    ret
Avoiding Stack Pointer Change
  Have compiler manage the stack frame in a function without changing %rsp
  Allocate window beyond stack pointer
Stack layout (offsets from %rsp):
     0  rtn Ptr   <- %rsp
    −8  unused
   −16  loc[1]
   −24  loc[0]
Interesting Features of Stack Frame
Allocate entire frame at once
  All stack accesses can be relative to %rsp
  Do by decrementing stack pointer
  Can delay allocation, since safe to temporarily use red zone
Simple deallocation
  Increment stack pointer
  No base/frame pointer needed
x86-64 function calls via Jump
long scount = 0;
/* Swap a[i] & a[i+1] */
void swap_ele(long a[], int i)
{
    swap(&a[i], &a[i+1]);
}
swap_ele:
    movslq %esi,%rsi              # Sign extend i
    leaq (%rdi,%rsi,8), %rdi      # &a[i]
    leaq 8(%rdi), %rsi            # &a[i+1]
    jmp swap                      # swap()
When swap executes ret, it will return from swap_ele
Possible since swap is a “tail call” (no instructions afterwards)
x86-64 Procedure Summary
Heavy use of registers
  Parameter passing
  More temporaries since more registers
Minimal use of stack
  Sometimes none
  Allocate/deallocate entire block
Many tricky optimizations
  What kind of stack frame to use
  Calling with jump
  Various allocation techniques
Turning on/off 64-bit
  \$ gcc -m64 code.c -o code
  \$ gcc -m32 code.c -o code
ARM
ARM history
Acorn RISC Machine (Acorn Computers, UK)
  32-bit reduced instruction set machine inspired by Berkeley RISC (Patterson, 1980-1984)
  Licensing model allows for custom designs (contrast to x86)
    • Does not produce their own chips
    • Companies customize base CPU for their products
    • PA Semiconductor (fabless, SoC startup acquired by Apple for its A4 design that powers iPhone/iPad)
    • ARM estimated to make \$0.11 on each chip (royalties +
  Runs 98% of all mobile phones (2005)
    • Per-watt performance currently better than x86
    • Fewer “legacy” instructions to implement
ARM architecture
RISC features
  Fewer instructions
    • Complex instructions handled via multiple simpler ones
    • Results in a smaller execution unit
  Only load/store instructions access memory
  Uniform-size instructions
    • Less decoding logic
    • 16-bit in Thumb mode to increase code density
ARM architecture
ALU features
  Conditional execution built into many instructions
    • Fewer branches
    • Less power lost to stalled pipelines
    • No need for branch prediction logic
  Operand bit-shifts supported in certain instructions
    • Built-in barrel shifter in ALU
    • Bit shifting plus ALU operation in one
  Support for 3 operand instructions
    • <R> = <Op1> OP <Op2>
ARM architecture
Control state features
  Shadow registers
    • Allows efficient interrupt processing (no need to save registers onto stack)
  Link register
    • Stores return address for leaf functions (no stack operation needed)
ARM architecture
  SIMD (NEON) to compete with x86 at high end
    • mp3, AES, SHA support
  Hardware virtualization
    • Hypervisor mode
  Jazelle DBX (Direct Bytecode eXecution)
    • Native execution of Java
  Security
    • No-execute page protection
      – Return2libc attacks still possible
    • TrustZone
      – Support for trusted execution via hardware-based access control and context management
      – e.g. isolate DRM processing
ARM vs. x86
Key architectural differences
  CISC vs. RISC
    • Legacy instructions impact per-watt performance
    • Atom (stripped-down 80386 core)
      – Once a candidate for the iPad until Apple VP threatened to quit over the choice
  State pushed onto stack vs. swapped from shadow registers
  Conditional execution via branches
    • Later use of conditional moves
  Bit shifting separate, explicit instructions
  Memory locations usable as ALU operands
  Mostly 2 operand instructions ( <D> = <D> OP <S> )
ARM vs. x86
Key differences
  x86 chips and designs come almost exclusively from Intel and AMD
    • No SoC customization (everyone gets same hardware)
    • Must wait for Intel to give you what you want
    • ARM allows Apple to differentiate itself
  Intel and ARM
    • XScale: Intel's version of ARM sold to Marvell in 2006
    • Speculation
      – Leakage current will eventually dominate power consumption (versus switching current)
      – Intel advantage on process to make RISC/CISC moot
      – Make process advantage bigger than custom design
      – Latest attempt: Medfield (2012)
Extra
Example: SIMD in assembly
Add a constant to a vector (MMX)
// Microsoft Macro Assembler format (MASM)
char d[]={5, 5, 5, 5, 5, 5, 5, 5};     // 8 bytes
char clr[]={65,66,68,...,87,88};       // 24 bytes
__asm{
    movq mm1, d              // load constant into mm1 reg
    mov ecx, 3               // initialize loop counter
    mov esi, 0               // set index to 0
L1: movq mm0, clr[esi]       // load 8 bytes into mm0 reg
    paddb mm0, mm1           // add the constant to all 8 bytes
    movq clr[esi], mm0       // store 8 bytes of result
    add esi, 8               // update index
    loop L1                  // loop macro (on ecx)
    emms                     // clear MMX register state
}
```
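
The slides show the inner product compiled to scalar code (x87 `fmuls`/`faddp`, then SSE `mulss`/`addss`) and separately introduce the `_mm_add_ps` intrinsic. As a minimal sketch of how the two ideas might be combined, the block below computes the same inner product four floats at a time. The function name `ipf_sse`, the unaligned loads, and the scalar tail loop are illustrative assumptions, not code from the course; it assumes an x86 target with SSE enabled (the default on x86-64, `-msse` on 32-bit gcc).

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ... */

/* Scalar reference: the same loop as the slides' ipf(). */
float ipf(const float x[], const float y[], int n) {
    float result = 0.0f;
    for (int i = 0; i < n; i++)
        result += x[i] * y[i];
    return result;
}

/* Hypothetical 4-way SSE variant (the name ipf_sse is illustrative). */
float ipf_sse(const float x[], const float y[], int n) {
    __m128 acc = _mm_setzero_ps();           /* four partial sums, all 0.0 */
    int i = 0;

    /* Main loop: multiply and accumulate 4 floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);     /* unaligned load of x[i..i+3] */
        __m128 vy = _mm_loadu_ps(&y[i]);     /* unaligned load of y[i..i+3] */
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }

    /* Horizontal step: add the four partial sums together. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float result = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        result += x[i] * y[i];
    return result;
}
```

Because the vector version adds the products in a different order than the scalar loop, the two results can differ in the last bits for general floating point data. Any speed-up depends on the compiler and processor, and would need to be measured (for example with rdtsc, as the slides suggest).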