Intel® Processor Architecture:
SIMD Instructions
Intel® Software College
Objectives
After completion of this module you will be able to understand
• SIMD rationale
• Intel SIMD instructions
Intel® Processor Architecture:SIMD Technology® Overview
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Agenda
SIMD Rationale
Intel SIMD History
SIMD Data Types
Intel SIMD Instruction Sets
Programming with Intel SIMD Instructions
Backup – Instructions Reference Table
SIMD (Single Instruction Multiple Data) Technology
• Increases processor throughput by performing multiple computations in a single instruction
• MMX™ technology, SSE, SSE2 and SSE3 are architectural extensions
Example (SSE2)
• Performs two double-precision operations with one instruction: a1+b1=c1 in parallel with a0+b0=c0
• Useful for matrix operations
[Figure: two 128-bit registers holding (a1, a0) and (b1, b0) are added element-wise to give (c1, c0)]
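The two-doubles-per-instruction example above can be sketched with the SSE2 intrinsic `_mm_add_pd`. This is a minimal illustration, assuming a compiler that ships `<emmintrin.h>`; the function name `add_pd_arrays` and the even-length restriction are choices made for this sketch, not part of the original slides.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* c[i] = a[i] + b[i], processing two doubles per ADDPD. */
void add_pd_arrays(size_t n, const double *a, const double *b, double *c)
{
    assert(n % 2 == 0);  /* sketch only: no scalar loop for a leftover element */
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);          /* holds (a[i+1], a[i]) */
        __m128d vb = _mm_loadu_pd(b + i);          /* holds (b[i+1], b[i]) */
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));  /* both sums at once */
    }
}
```

Calling `add_pd_arrays(4, a, b, c)` issues two packed additions instead of four scalar ones.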
Streaming SIMD Extensions
A brief history
MMX™ technology - Intel® Pentium® with MMX™ and Pentium® II processors
• Introduced 64-bit MMX registers for SIMD integer operations
• Supports SIMD operations on packed byte, word, and double-word integers
• Useful for multimedia and communications software
SSE – Intel® Pentium® III processor
• Introduced 128-bit XMM registers for SIMD single-precision floating-point (FP-SP) operands, plus additional SIMD integer instructions
• Executes FP and SIMD simultaneously
• Introduced data prefetch instructions
• Useful for 3D geometry, 3D rendering, and video encoding/decoding
SSE2 – Intel® Pentium® 4 and Intel® Xeon™ processors
• Added extra 64-bit SIMD integer support
• Has same XMM registers for SIMD integer and floating point double precision (FP-DP)
• Has 144 new instructions for data support (no new registers)
• Adds support for cacheability and memory ordering operations
• Useful for 3D graphics, video encoding/decoding and encryption
SSE3 – Intel® Pentium® 4 Processor
• Accelerates performance of Streaming SIMD Extensions technology, Streaming SIMD Extensions 2 technology, and x87-FP math capabilities
• Useful in some 3D operations (quaternions), complex arithmetic and video codec algorithms
SSSE3 – Intel® Core™2 Processor
• Application performance improvement
• Potential for specific application domains
X86 Register Sets
SSE registers first introduced in the Pentium® III
IA-INT Registers (32 bits wide: eax … edi)
• Fourteen 32-bit registers
• Scalar data & addresses
• Direct access to regs
MMX™ Technology / IA-FP Registers (64/80 bits wide: mm0 … mm7 / st0 … st7)
• Eight 80/64-bit registers
• Hold data only
• Stack access to FP0..FP7
• Direct access to MM0..MM7
• No MMX™ Technology / FP interoperability
SSE Registers (128 bits wide: xmm0 … xmm7)
• Eight 128-bit registers
• Hold data only:
  • 4 x single FP numbers
  • 2 x double FP numbers
  • 128-bit packed integers
• Direct access to the registers
• Use simultaneously with FP / MMX Technology
SIMD Data Types (1)
• 64-Bit Packed Integer Data Types
  • 8 packed bytes (bits 63..0)
  • 4 packed words (bits 63..0)
  • 2 packed doublewords (bits 63..32 and 31..0)
SIMD Data Types (2)
• 128-bit Floating-Point and Integer Data Types
  • 16 packed byte integers (lowest byte in bits 7..0)
  • 8 packed word integers (lowest word in bits 15..0)
  • 4 packed doublewords, single-precision floating point or integer (lowest element in bits 31..0)
  • 2 packed quadwords, double-precision floating point or integer (lowest element in bits 63..0)
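The same 128 bits behave differently depending on which packed type an instruction assumes. The sketch below adds identical inputs once as sixteen byte lanes and once as eight word lanes, so a carry is either dropped or propagated inside the lane. The function names are hypothetical, and the byte positions in the usage note assume the little-endian layout of x86.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Treat the 128 bits as sixteen 8-bit lanes (PADDB). */
void add_as_bytes(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Treat the same 128 bits as eight 16-bit lanes (PADDW). */
void add_as_words(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
}
```

With byte 0 equal to 0xFF in one input and 0x01 in the other, the byte-lane add wraps to 0x00 and loses the carry, while the word-lane add carries into byte 1.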
SIMD Data Types (3)
• Packed BCD data type
  • Packed BCD integers: two BCD digits per byte (bits 7..4 and bits 3..0)
  • 80-bit packed BCD decimal integers: byte X in bits 79..72, digits D17 D16 … D0 in bits 71..0
SSE Instruction Set Extensions
Introduced by Pentium® III in 1999; now frequently called SSE-1
Only new data type supported: 4x32Bit (Single Precision) floating point data
Some 70 instructions
• Arithmetic, compare, convert operations on SSE SP FP data
• PACKED, UNPACKED
• Data load/store
• Prefetch
• Extension of MMX
• Streaming Store (store without using cache in between)
• …
2001 PTE Engineering Enabling Conference
SSE Sample: Branch Removal
R = (A < B) ? C : D   // remember: everything packed
A                      0.0     0.0    -3.0     3.0
B                      0.0     1.0    -5.0     5.0
mask = cmplt(A, B)     00000   11111   00000   11111
mask and C             00000   c2      00000   c0
mask nand D            d3      00000   d1      00000
R = or of both         d3      c2      d1      c0
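The compare/and/nand/or sequence above maps directly onto SSE intrinsics. The following is a minimal sketch for one group of four floats; the function name `select_lt_ps` and the array-based interface are assumptions made for illustration.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* r[i] = (a[i] < b[i]) ? c[i] : d[i] for four floats, without a branch. */
void select_lt_ps(const float a[4], const float b[4],
                  const float c[4], const float d[4], float r[4])
{
    assert(a && b && c && d && r);
    __m128 mask   = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* all ones where a < b */
    __m128 from_c = _mm_and_ps(mask, _mm_loadu_ps(c));              /* keep c where mask is set */
    __m128 from_d = _mm_andnot_ps(mask, _mm_loadu_ps(d));           /* keep d where mask is clear */
    _mm_storeu_ps(r, _mm_or_ps(from_c, from_d));                    /* merge the two halves */
}
```

Feeding it the A and B values from the slide (element 0 last in the slide's row, first in the C array) yields c0, d1, c2, d3 in elements 0 to 3.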
SSE-2 Instruction Set Extensions
Introduced by Intel® Pentium® 4 processor in 2000
Some 140 new instructions
Added double precision floating point data (2x64Bit) and all related instructions including conversion
Again some extensions to MMX
Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations
SIMD Single vs. SIMD Double
4 x Single Precision (SSE-1): SIMD SP FP operand = 4 elements
• Elements X3 X2 X1 X0 occupy bits 127..0
• Element = SP FP number: sign S (bit 31), exponent (bits 30..23), significand (bits 22..0)
2 x Double Precision (SSE-2): SIMD DP FP operand = 2 elements
• Elements X1 X0 occupy bits 127..0
• Element = DP FP number: sign S (bit 63), exponent (bits 62..52), significand (bits 51..0)
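The element layouts above can be checked from plain C by copying a value's bits into an integer and masking out the three fields. This is an illustrative sketch only; the struct and function names are hypothetical, and it assumes `float` and `double` are the IEEE 754 single and double formats, as they are on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { unsigned sign; unsigned exponent; uint32_t significand; } sp_fields;
typedef struct { unsigned sign; unsigned exponent; uint64_t significand; } dp_fields;

/* Single precision: S = bit 31, exponent = bits 30..23, significand = bits 22..0. */
sp_fields sp_decode(float x)
{
    uint32_t bits;
    sp_fields f;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);          /* reinterpret without aliasing issues */
    f.sign        = (unsigned)(bits >> 31);
    f.exponent    = (unsigned)((bits >> 23) & 0xFFu);
    f.significand = bits & 0x7FFFFFu;
    return f;
}

/* Double precision: S = bit 63, exponent = bits 62..52, significand = bits 51..0. */
dp_fields dp_decode(double x)
{
    uint64_t bits;
    dp_fields f;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    f.sign        = (unsigned)(bits >> 63);
    f.exponent    = (unsigned)((bits >> 52) & 0x7FFu);
    f.significand = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}
```

The exponent fields are stored with a bias (127 for single, 1023 for double), so 1.0 decodes to an exponent field of 127 or 1023 and a zero significand.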
Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion
SIMD Double → SIMD Int: conversion to two lower ints, two higher ints cleared
  input (2 doubles):   x1 | x0
  result (4 ints):     00000 | 00000 | (int)x1 | (int)x0
__m128d x;
__m128i ix;
ix = _mm_cvtpd_epi32(x);
SIMD Int → SIMD Double: conversion from two lower ints
  input (4 ints):      ???? | ???? | ix1 | ix0
  result (2 doubles):  (double)ix1 | (double)ix0
x = _mm_cvtepi32_pd(ix);
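As a runnable companion to the two intrinsics above, the sketch below wraps them in small functions and adds the truncating variant `_mm_cvttpd_epi32` for comparison. The wrapper names are hypothetical. The rounding behaviour described in the comments assumes the default MXCSR rounding mode (round to nearest even) has not been changed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* CVTPD2DQ: two doubles -> two low ints (current rounding mode), two high ints cleared. */
void cvt_pd_to_epi32(const double in[2], int out[4])
{
    assert(in && out);
    _mm_storeu_si128((__m128i *)out, _mm_cvtpd_epi32(_mm_loadu_pd(in)));
}

/* CVTTPD2DQ: same layout, but always truncates toward zero. */
void cvtt_pd_to_epi32(const double in[2], int out[4])
{
    assert(in && out);
    _mm_storeu_si128((__m128i *)out, _mm_cvttpd_epi32(_mm_loadu_pd(in)));
}

/* CVTDQ2PD: the two low ints -> two doubles; the two high ints are ignored. */
void cvt_epi32_to_pd(const int in[4], double out[2])
{
    assert(in && out);
    _mm_storeu_pd(out, _mm_cvtepi32_pd(_mm_loadu_si128((const __m128i *)in)));
}
```

For the inputs 2.5 and -1.7, the rounding conversion gives 2 and -2, while the truncating one gives 2 and -1.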
SSE3: No new Data Types but new Instructions
• FP to integer conversions: FISTTP
• Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
• Video encoding: LDDQU
• SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
• Thread synchronization: MONITOR, MWAIT
* Also benefits Complex and Vectorization
Streaming SIMD Extensions 3
13 new instructions
Three have limited use for application performance improvement
• FISTTP - x87 to integer conversion (requires –longdouble switch)
• MONITOR/MWAIT - thread synchronization
  • Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages
The other ten have some potential for specific application domains
SSE-3 Sample Complex Arithmetic: ADDSUBPS
ADDSUBPS OperandA, OperandB
• OperandA (xmm register; 4 data elements)
  • a3, a2, a1, a0
• OperandB (xmm register or memory address; 4 data elements)
  • b3, b2, b1, b0
• Result (stored in OperandA)
  • a3+b3, a2-b2, a1+b1, a0-b0
__m128 _mm_addsub_ps(__m128 a, __m128 b)
[Figure: elements 3 and 1 are added, elements 2 and 0 are subtracted]
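To make the alternating add/subtract pattern concrete, the sketch below reproduces the ADDSUBPS result using only SSE-1 intrinsics, so it builds without SSE3 compiler flags. It is an emulation for illustration, not the instruction itself: with SSE3 enabled, `_mm_addsub_ps` from `<pmmintrin.h>` computes the same result in one instruction. The name `addsub_ps_emulated` is hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics only */

/* r = (a3+b3, a2-b2, a1+b1, a0-b0), the lane pattern of ADDSUBPS. */
void addsub_ps_emulated(const float a[4], const float b[4], float r[4])
{
    assert(a && b && r);
    /* -0.0f has only the sign bit set, so XOR with it negates elements 0 and 2. */
    const __m128 flip_even = _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f);
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_xor_ps(_mm_loadu_ps(b), flip_even);
    _mm_storeu_ps(r, _mm_add_ps(va, vb));
}
```

This lane pattern is what makes the instruction convenient for complex multiplication, where the real part needs a subtraction and the imaginary part an addition.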
Supplemental SSE-3 (SSSE-3)
Extension introduced by Intel® Core™ Architecture
• Horizontal Addition/Subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
• Packed Absolute Values: PABSB, PABSW, PABSD
• Multiply and Add Packed Signed/Unsigned Bytes: PMADDUBSW
• Packed Multiply High with Round and Scale: PMULHRSW
• Packed Shuffle Bytes: PSHUFB
• Packed SIGN: PSIGNB/W/D
• Packed Align Right: PALIGNR
SSSE-3 New Instructions
Useful for media and imaging
16 new packed integer instructions
Six new instruction categories            64-bit SIMD latency   128-bit SIMD latency
Absolute value and Integer “Sign”         1 clock               1 clock
Byte multiply and add                     3 clocks              3 clocks
Word multiply high with round and shift   3 clocks              3 clocks
Byte Permute                              1 clock               3 clocks
Byte Concatenate with Shift               1 clock               2 clocks
Integer Horizontal Adds and Subtracts     4 clocks              5 clocks
All instructions come in both 128-bit and 64-bit (MMX) flavors
Sample SSSE-3 Inst.: Byte Permute
PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128
• A complete byte-granularity permutation
• The source operand is used as the control field (variable control)
• The destination operand gets permuted
• Each byte of the source field selects the origin of the corresponding destination byte
• Also includes force-byte-to-zero flag (bit 7)
Example (64-bit form):
src            0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
dest (before)  0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
dest (after)   0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
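The permutation rule can be written as a scalar reference model of the 64-bit form. It is plain C rather than an intrinsic so that it builds without SSSE3 flags; the name `pshufb64_model` is hypothetical. On SSSE3 hardware the 128-bit form is available as `_mm_shuffle_epi8` in `<tmmintrin.h>`, where the index uses bits 3..0 instead of bits 2..0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of PSHUFB mm, mm/m64: dest is permuted in place, src is the control. */
void pshufb64_model(uint8_t dest[8], const uint8_t src[8])
{
    uint8_t old[8];
    assert(dest && src);
    memcpy(old, dest, sizeof old);           /* selections read the original bytes */
    for (int i = 0; i < 8; ++i) {
        if (src[i] & 0x80)
            dest[i] = 0;                     /* bit 7 set: force byte to zero */
        else
            dest[i] = old[src[i] & 0x07];    /* bits 2..0 select the origin byte */
    }
}
```

Assuming the slide lists byte 7 first and byte 0 last, storing its rows in arrays with byte 0 at index 0 reproduces the "dest (after)" row.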
Ways to SSE/SIMD programming
Coding using SSE/SSE2/3/4 assembler instructions
• Very tedious (manual scheduling) and therefore discouraged: don't do it!
• E.g.: how do you exploit the 16 instead of 8 SSE registers now available on Intel® 64 without maintaining two versions?
Intel® compiler’s C/C++ SIMD intrinsics
• No need to take care of register allocation, scheduling etc
Intel® compiler’s C++ Vector Class Library
• Use this if you are heavy into C++ classes
Vectorizer of Intel® C++ and Fortran Compilers
• Recommended for most cases – easy and efficient
Use ready-to-go vectorized code from a library like
Intel® Math Kernel Library (MKL)
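To contrast two of the approaches listed above, here is the same computation written as a plain loop, which a vectorizing compiler may convert to SSE code on its own, and with SIMD intrinsics. This is a sketch only: the function names are hypothetical, the intrinsic version assumes the length is a multiple of four, and whether the plain loop is actually vectorized depends on the compiler and its switches.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Plain loop: left to the compiler's vectorizer. */
void saxpy_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Same computation with intrinsics: four floats per iteration. */
void saxpy_sse(size_t n, float a, const float *x, float *y)
{
    assert(n % 4 == 0);                      /* sketch only: no remainder loop */
    const __m128 va = _mm_set1_ps(a);        /* broadcast a to all four elements */
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
}
```

Register allocation and instruction scheduling stay with the compiler in both versions, which is the main advantage over hand-written assembler.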
Compiler Based Vectorization
Processor-specific switches (Linux*) and the processors they generate code and optimize for:
• -xK, -axK: Pentium® III compatible and Athlon XP processors, including code generation for MMX and SSE
• -xW, -axW: Pentium® 4 compatible, Athlon 64 and Opteron processors in 32- and 64-bit mode, including code generation for MMX, SSE and SSE2
• -xN, -axN: Pentium® 4 processors in 32-bit mode, including code generation for MMX, SSE and SSE2 (deprecated switch: use -xW instead)
• -xB, -axB: Pentium® M processors, including code generation for MMX, SSE and SSE2
• -xP, -axP: Intel® processors with SSE3 capability, including Pentium® 4 (both 32- and 64-bit mode), with code generation for MMX, SSE, SSE2 and SSE3
• -xT, -axT: Intel® processors with MNI capability, i.e. Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), with code generation for MMX, SSE, SSE2, SSE3 and MNI
Instruction Set Extensions
New Instructions Added to Intel® Processors
[Chart: number of new instructions per extension, by introduction date and process (nm)]
• MMX™: Jan-97, 350 nm, 56 instructions
• Streaming SIMD Extensions (SSE): Feb-99, 250 nm, 70 instructions
• Streaming SIMD Extensions 2 (SSE2): Dec-00, 180 nm, 144 instructions
• Streaming SIMD Extensions 3 (SSE3): Feb-04, 90 nm, 13 instructions
• Supplemental SSE3 (SSSE3): Jul-06, 65 nm, 32 instructions
• Future Intel instruction set extensions (SSE-4): 2008+, 45 nm, ~50 instructions
Beginning in 2008: ~50 new instructions in 13 groups
• All function in 32-bit and 64-bit modes
• Improvements in commercial data integrity (iSCSI), video processing, string and text processing, 2D & 3D imaging, vectorizing compiler performance
Transfer Instructions (1)
Instruction Category: MMX Transfer Instructions
MOVD Move doubleword
MOVQ Move quadword
Instruction Category: SSE SIMD Single-Precision Data Transfer Instructions
MOVAPS Move four aligned packed single-precision floating-point values between XMM registers or between an XMM register and memory
MOVUPS Move four unaligned packed single-precision floating-point values between XMM registers or between an XMM register and memory
MOVHPS Move two packed single-precision floating-point values to and from the high quadword of an XMM register and memory
MOVHLPS Move two packed single-precision floating-point values from the high quadword of an XMM register to the low quadword of another XMM register
MOVLPS Move two packed single-precision floating-point values to and from the low quadword of an XMM register and memory
MOVLHPS Move two packed single-precision floating-point values from the low quadword of an XMM register to the high quadword of another XMM register
MOVMSKPS Extract sign mask from four packed single-precision floating-point values
MOVSS Move scalar single-precision floating-point value between XMM registers or between an XMM register and memory
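Three of the transfer instructions above have direct intrinsic counterparts. The sketch below shows the aligned move (MOVAPS), the unaligned move (MOVUPS) and the sign-mask extraction (MOVMSKPS). The function names are hypothetical, and the alignment check is an addition for this sketch that makes the MOVAPS requirement explicit.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* MOVAPS: both addresses must be 16-byte aligned. */
void copy4_aligned(float *dst, const float *src)
{
    assert(((uintptr_t)src % 16) == 0 && ((uintptr_t)dst % 16) == 0);
    _mm_store_ps(dst, _mm_load_ps(src));
}

/* MOVUPS: any address is allowed. */
void copy4_unaligned(float *dst, const float *src)
{
    _mm_storeu_ps(dst, _mm_loadu_ps(src));
}

/* MOVMSKPS: gather the four sign bits into bits 3..0 of the result. */
int sign_mask4(const float *src)
{
    return _mm_movemask_ps(_mm_loadu_ps(src));
}
```

For the four values -1, 2, -3, 4 (element 0 first), `sign_mask4` returns 5, because elements 0 and 2 are negative.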
Transfer Instructions (2)
Instruction Category: SSE 64-Bit SIMD Integer Instructions
PMOVMSKB Move byte mask
Instruction Category: SSE2 128-Bit SIMD Integer Instructions
MOVDQA Move aligned double quadword
MOVDQU Move unaligned double quadword
MOVQ2DQ Move quadword integer from MMX to XMM registers
MOVDQ2Q Move quadword integer from XMM to MMX registers
Instruction Category: SSE2 Double-Precision Floating-Point Data Movement Instructions
MOVAPD Move two aligned packed double-precision floating-point values between XMM registers or between an XMM register and memory
MOVUPD Move two unaligned packed double-precision floating-point values between XMM registers or between an XMM register and memory
MOVHPD Move high packed double-precision floating-point value to and from the high quadword of an XMM register and memory
MOVLPD Move low packed double-precision floating-point value to and from the low quadword of an XMM register and memory
MOVMSKPD Extract sign mask from two packed double-precision floating-point values
MOVSD Move scalar double-precision floating-point value between XMM registers or between an XMM register and memory
Transfer Instructions (3)
Instruction Category: SSE3 SIMD Floating-Point LOAD/MOVE/DUPLICATE Instructions
MOVSLDUP Loads/moves 128 bits; duplicating the first and third 32-bit data elements
MOVSHDUP Loads/moves 128 bits; duplicating the second and fourth 32-bit data elements
MOVDDUP Loads/moves 64 bits (bits[63:0] if the source is a register) and returns the same 64 bits in both the lower and upper halves of the 128-bit result register; duplicates the 64 bits from the source
Instruction Category: SSE3 Specialized 128-bit Unaligned Data Load Instruction
LDDQU Special 128-bit unaligned load designed to avoid cache line splits
Instruction Category: 64-Bit Mode Instructions
LODSQ Load qword at address (R)SI into RAX
MOVSQ Move qword from address (R)SI to (R)DI
MOVZX (64-bits) Move doubleword to quadword, zero-extension
STOSQ Store RAX at address RDI
Packed Arithmetic Instructions (1)
Instruction Category: MMX Packed Arithmetic Instructions
PADDB Add packed byte integers
PADDW Add packed word integers
PADDD Add packed double word integers
PADDSB Add packed signed byte integers with signed saturation
PADDSW Add packed signed word integers with signed saturation
PADDUSB Add packed unsigned byte integers with unsigned saturation
PADDUSW Add packed unsigned word integers with unsigned saturation
PSUBB Subtract packed byte integers
PSUBW Subtract packed word integers
PSUBD Subtract packed double word integers
PSUBSB Subtract packed signed byte integers with signed saturation
PSUBSW Subtract packed signed word integers with signed saturation
PSUBUSB Subtract packed unsigned byte integers with unsigned saturation
PSUBUSW Subtract packed unsigned word integers with unsigned saturation
PMULHW Multiply packed signed word integers and store high result
PMULLW Multiply packed signed word integers and store low result
PMADDWD Multiply and add packed word integers
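The difference between the wrapping and saturating additions in the table above shows up as soon as a result exceeds the lane's range. The sketch below uses the 128-bit SSE2 forms of PADDB and PADDUSB (`_mm_add_epi8` and `_mm_adds_epu8`) rather than the 64-bit MMX forms, which is an assumption made so the example builds on 64-bit targets; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* PADDB: each byte wraps around modulo 256. */
void add_wrap_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* PADDUSB: each byte saturates at 255 instead of wrapping. */
void add_sat_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

Adding 200 and 100 gives 44 with the wrapping form and 255 with the saturating form, which is why saturation is preferred for pixel and audio data.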
Packed Arithmetic Instructions (2)
Instruction Category: SSE Packed Arithmetic Instructions
ADDPS Add packed single-precision floating-point values
ADDSS Add scalar single-precision floating-point values
SUBPS Subtract packed single-precision floating-point values
SUBSS Subtract scalar single-precision floating-point values
MULPS Multiply packed single-precision floating-point values
MULSS Multiply scalar single-precision floating-point values
DIVPS Divide packed single-precision floating-point values
DIVSS Divide scalar single-precision floating-point values
RCPPS Compute reciprocals of packed single-precision floating-point values
RCPSS Compute reciprocal of scalar single-precision floating-point values
SQRTPS Compute square roots of packed single-precision floating-point values
SQRTSS Compute square root of scalar single-precision floating-point values
RSQRTPS Compute reciprocals of square roots of packed single-precision floating-point values
RSQRTSS Compute reciprocal of square root of scalar single-precision floating-point
values
MAXPS Return maximum packed single-precision floating-point values
MAXSS Return maximum scalar single-precision floating-point values
MINPS Return minimum packed single-precision floating-point values
MINSS Return minimum scalar single-precision floating-point values
Packed Arithmetic Instructions (3)
Instruction Category: SSE 64-Bit SIMD Integer Instructions
PMULHUW Multiply packed unsigned integers and store high result
Instruction Category: SSE2 Packed Arithmetic Instructions
ADDPD Add packed double-precision floating-point values
ADDSD Add scalar double-precision floating-point values
SUBPD Subtract packed double-precision floating-point values
SUBSD Subtract scalar double-precision floating-point values
MULPD Multiply packed double-precision floating-point values
MULSD Multiply scalar double-precision floating-point values
DIVPD Divide packed double-precision floating-point values
DIVSD Divide scalar double-precision floating-point values
SQRTPD Compute packed square roots of packed double-precision floating-point values
SQRTSD Compute scalar square root of scalar double-precision floating-point
values
MAXPD Return maximum packed double-precision floating-point values
MAXSD Return maximum scalar double-precision floating-point values
MINPD Return minimum packed double-precision floating-point values
MINSD Return minimum scalar double-precision floating-point values
Packed Arithmetic Instructions (4)
Instruction Category: SSE2 128-Bit SIMD Integer Instructions
PMULUDQ Multiply packed unsigned doubleword integers
PADDQ Add packed quadword integers
PSUBQ Subtract packed quadword integers
Instruction Category: SSE3 SIMD Floating-Point Packed ADD/SUB Instructions
ADDSUBPS Performs single-precision addition on the second and fourth pairs of 32-bit data
elements within the operands; single-precision subtraction on the first and third pairs
ADDSUBPD Performs double-precision addition on the second pair of quad words, and
double-precision subtraction on the first pair
Packed Arithmetic Instructions (5)
Instruction Category: SSE3 SIMD Floating-Point Horizontal ADD/SUB Instructions
HADDPS Performs a single-precision addition on contiguous data elements. The first data
element of the result is obtained by adding the first and second elements of the first
operand; the second element by adding the third and fourth elements of the first operand;
the third by adding the first and second elements of the second operand; and the fourth by
adding the third and fourth elements of the second operand.
HSUBPS Performs a single-precision subtraction on contiguous data elements. The first
data element of the result is obtained by subtracting the second element of the first
operand from the first element of the first operand; the second element by subtracting the
fourth element of the first operand from the third element of the first operand; the third by
subtracting the second element of the second operand from the first element of the second
operand; and the fourth by subtracting the fourth element of the second operand from the
third element of the second operand.
HADDPD Performs a double-precision addition on contiguous data elements. The first data
element of the result is obtained by adding the first and second elements of the first
operand; the second element by adding the first and second elements of the second
operand.
HSUBPD Performs a double-precision subtraction on contiguous data elements. The
first data element of the result is obtained by subtracting the second element of the first
operand from the first element of the first operand; the second element by subtracting the
second element of the second operand from the first element of the second operand.
Instruction Category: SSSE3 Packed Arithmetic Instructions
PHADDW/D/SW Pairwise integer horizontal addition + pack
PHSUBW/D/SW Pairwise integer horizontal subtract + pack
PMADDUBSW Multiply signed & unsigned bytes. Accumulate result to signed words (Multiply Accumulate)
PMULHRSW Signed 16-bit multiply with round and scale, return high bits
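The HADDPS and HSUBPS descriptions in this table are easier to follow as a scalar reference model. The sketch below is plain C rather than intrinsics so that it builds without SSE3 flags; the function names are hypothetical. With SSE3 enabled, `_mm_hadd_ps` and `_mm_hsub_ps` from `<pmmintrin.h>` perform the same operations.

```c
#include <assert.h>

/* HADDPS model: r = (a0+a1, a2+a3, b0+b1, b2+b3), element 0 first. */
void haddps_model(const float a[4], const float b[4], float r[4])
{
    assert(a && b && r);
    float t0 = a[0] + a[1], t1 = a[2] + a[3];   /* pairs from the first operand */
    float t2 = b[0] + b[1], t3 = b[2] + b[3];   /* pairs from the second operand */
    r[0] = t0; r[1] = t1; r[2] = t2; r[3] = t3; /* temporaries let r alias a or b */
}

/* HSUBPS model: r = (a0-a1, a2-a3, b0-b1, b2-b3), element 0 first. */
void hsubps_model(const float a[4], const float b[4], float r[4])
{
    assert(a && b && r);
    float t0 = a[0] - a[1], t1 = a[2] - a[3];
    float t2 = b[0] - b[1], t3 = b[2] - b[3];
    r[0] = t0; r[1] = t1; r[2] = t2; r[3] = t3;
}
```

The lower half of the result always comes from the first operand and the upper half from the second, which is why two horizontal adds in sequence reduce four values to one sum.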
Conversion Instructions (1)
Instruction Category: MMX Conversion Instructions
PACKSSWB Pack words into bytes with signed saturation
PACKSSDW Pack double words into words with signed saturation
PACKUSWB Pack words into bytes with unsigned saturation.
PUNPCKHBW Unpack high-order bytes
PUNPCKHWD Unpack high-order words
PUNPCKHDQ Unpack high-order double words
PUNPCKLBW Unpack low-order bytes
PUNPCKLWD Unpack low-order words
PUNPCKLDQ Unpack low-order double words
Instruction Category: SSE Conversion Instructions
CVTPI2PS Convert packed double word integers to packed single-precision floating-point values
CVTSI2SS Convert double word integer to scalar single-precision floating-point value
CVTPS2PI Convert packed single-precision floating-point values to packed double word
integers
CVTTPS2PI Convert with truncation packed single-precision floating-point values to
packed double word integers
CVTSS2SI Convert a scalar single-precision floating-point value to a double word integer
CVTTSS2SI Convert with truncation a scalar single-precision floating-point value to a
scalar double word integer
Conversion Instructions (2)
Instruction Category: SSE2 Conversion Instructions
CVTPD2PI Convert packed double-precision floating-point values to packed doubleword integers
CVTTPD2PI Convert with truncation packed double-precision floating-point values to packed doubleword integers
CVTPI2PD Convert packed doubleword integers to packed double-precision floating-point values
CVTPD2DQ Convert packed double-precision floating-point values to packed doubleword integers
CVTTPD2DQ Convert with truncation packed double-precision floating-point values to packed doubleword integers
CVTDQ2PD Convert packed doubleword integers to packed double-precision floating-point values
CVTPS2PD Convert packed single-precision floating-point values to packed double-precision floating-point values
CVTPD2PS Convert packed double-precision floating-point values to packed single-precision floating-point values
CVTSS2SD Convert scalar single-precision floating-point values to scalar double-precision floating-point values
CVTSD2SS Convert scalar double-precision floating-point values to scalar single-precision floating-point values
CVTSD2SI Convert scalar double-precision floating-point values to a doubleword integer
CVTTSD2SI Convert with truncation scalar double-precision floating-point values to scalar doubleword integers
CVTSI2SD Convert doubleword integer to scalar double-precision floating-point value
Conversion Instructions (3)
Instruction Category: SSE2 Conversion Instructions
Instruction  Description
CVTDQ2PS     Convert packed doubleword integers to packed single-precision floating-point values
CVTPS2DQ     Convert packed single-precision floating-point values to packed doubleword integers
CVTTPS2DQ    Convert with truncation packed single-precision floating-point values to packed doubleword integers
Instruction Category: SSE3 x87-FP Integer Conversion Instruction
Instruction  Description
FISTTP       Behaves like the FISTP instruction but uses truncation, irrespective of the rounding mode specified in the floating-point control word (FCW)
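The packed single-precision conversions above can be sketched as follows. This is an illustrative example only, assuming an x86 compiler with <emmintrin.h> and the default rounding mode; the helper names and the `truncate` flag are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Converts four floats to four 32-bit integers.
   truncate != 0 selects CVTTPS2DQ, otherwise CVTPS2DQ is used. */
static void floats_to_ints(const float in[4], int out[4], int truncate) {
    __m128 v = _mm_loadu_ps(in);
    __m128i r = truncate ? _mm_cvttps_epi32(v)   /* truncate toward zero */
                         : _mm_cvtps_epi32(v);   /* round per MXCSR */
    _mm_storeu_si128((__m128i *)out, r);
}

/* CVTDQ2PS: converts four 32-bit integers to four floats. */
static void ints_to_floats(const int in[4], float out[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_ps(out, _mm_cvtepi32_ps(v));
}
```

Because the default mode rounds ties to even, 1.5f and 2.5f both convert to 2 with CVTPS2DQ, while CVTTPS2DQ gives 1 and 2.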
Comparison Instructions
Instruction Category: MMX Comparison Instructions
Instruction  Description
PCMPEQB      Compare packed bytes for equality
PCMPEQW      Compare packed words for equality
PCMPEQD      Compare packed doublewords for equality
PCMPGTB      Compare packed signed byte integers for greater than
PCMPGTW      Compare packed signed word integers for greater than
PCMPGTD      Compare packed signed doubleword integers for greater than
Instruction Category: SSE Comparison Instructions
Instruction  Description
CMPPS        Compare packed single-precision floating-point values
CMPSS        Compare scalar single-precision floating-point values
COMISS       Perform ordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register
UCOMISS      Perform unordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register
Instruction Category: SSE2 Compare Instructions
Instruction  Description
CMPPD        Compare packed double-precision floating-point values
CMPSD        Compare scalar double-precision floating-point values
COMISD       Perform ordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register
UCOMISD      Perform unordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register
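The packed compares produce a per-element mask (all ones where the predicate holds, all zeros elsewhere), while the COMISx forms return a single result through EFLAGS. A minimal sketch, assuming an x86 compiler with <emmintrin.h>: the 128-bit SSE2 form of PCMPEQB is used rather than the 64-bit MMX form, the movemask intrinsics (PMOVMSKB, MOVMSKPS) are added here only to make the masks easy to inspect, and all function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* PCMPEQB followed by PMOVMSKB: bit i of the result is set when a[i] == b[i]. */
static int equal_bytes_mask(const unsigned char a[16], const unsigned char b[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
}

/* CMPPS with the less-than predicate followed by MOVMSKPS:
   bit i of the result is set when a[i] < b[i]. */
static int less_than_mask(const float a[4], const float b[4]) {
    return _mm_movemask_ps(_mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* COMISD: scalar compare that reports its result through EFLAGS;
   the intrinsic returns 1 when a < b. */
static int scalar_less(double a, double b) {
    return _mm_comilt_sd(_mm_set_sd(a), _mm_set_sd(b));
}
```

The mask form is what makes branch-free selection possible: the compare result can be fed directly into the logical instructions to pick elements from one operand or the other.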
Logical Instructions
Instruction Category: MMX Logical Instructions
Instruction  Description
PAND         Bitwise logical AND
PANDN        Bitwise logical AND NOT
POR          Bitwise logical OR
PXOR         Bitwise logical exclusive OR
Instruction Category: SSE Logical Instructions
Instruction  Description
ANDPS        Perform bitwise logical AND of packed single-precision floating-point values
ANDNPS       Perform bitwise logical AND NOT of packed single-precision floating-point values
ORPS         Perform bitwise logical OR of packed single-precision floating-point values
XORPS        Perform bitwise logical XOR of packed single-precision floating-point values
Instruction Category: SSE2 Logical Instructions
Instruction  Description
ANDPD        Perform bitwise logical AND of packed double-precision floating-point values
ANDNPD       Perform bitwise logical AND NOT of packed double-precision floating-point values
ORPD         Perform bitwise logical OR of packed double-precision floating-point values
XORPD        Perform bitwise logical XOR of packed double-precision floating-point values
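A common use of the floating-point logical instructions is manipulating the IEEE sign bit directly. The sketch below is one illustrative use, not something taken from the slides; it assumes an x86 compiler with <xmmintrin.h>, and the function names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* ANDNPS: computes (~mask) & v. With a mask holding only the sign bit,
   this clears the sign of every element, giving the absolute value. */
static void abs4(const float in[4], float out[4]) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  /* bit 31 set in each lane */
    _mm_storeu_ps(out, _mm_andnot_ps(sign_mask, _mm_loadu_ps(in)));
}

/* XORPS: flipping the sign bit negates every element. */
static void negate4(const float in[4], float out[4]) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    _mm_storeu_ps(out, _mm_xor_ps(_mm_loadu_ps(in), sign_mask));
}
```

Note the operand order of the AND NOT form: the first operand is the one that gets inverted.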
Shift and Rotate Instructions
Instruction Category: MMX Shift and Rotate Instructions
Instruction  Description
PSLLW        Shift packed words left logical
PSLLD        Shift packed doublewords left logical
PSLLQ        Shift packed quadwords left logical
PSRLW        Shift packed words right logical
PSRLD        Shift packed doublewords right logical
PSRLQ        Shift packed quadwords right logical
PSRAW        Shift packed words right arithmetic
PSRAD        Shift packed doublewords right arithmetic
Instruction Category: SSE2 128-Bit SIMD Integer Instructions
Instruction  Description
PSLLDQ       Shift double quadword left logical
PSRLDQ       Shift double quadword right logical
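Two distinctions in this table are easy to miss: logical versus arithmetic right shifts, and per-element shifts versus the whole-register PSLLDQ/PSRLDQ, whose shift count is in bytes rather than bits. A minimal sketch using the 128-bit SSE2 forms, assuming an x86 compiler with <emmintrin.h>; the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* PSRLW shifts in zeros; PSRAW replicates the sign bit.
   Both operate independently on each of the eight 16-bit words. */
static void shift_right_words(const int16_t in[8], int16_t logical[8],
                              int16_t arithmetic[8], int count) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i c = _mm_cvtsi32_si128(count);  /* shift count in the low quadword */
    _mm_storeu_si128((__m128i *)logical, _mm_srl_epi16(v, c));
    _mm_storeu_si128((__m128i *)arithmetic, _mm_sra_epi16(v, c));
}

/* PSLLDQ: shifts the whole 128-bit register left by 4 bytes.
   The count must be a compile-time constant. */
static void shift_left_4_bytes(const uint8_t in[16], uint8_t out[16]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_slli_si128(v, 4));
}
```

For the word -16 shifted right by 2, the logical form yields 16380 (0x3FFC) and the arithmetic form yields -4.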
Shuffle and Unpack Instructions
Instruction Category: SSE 64-Bit SIMD Integer Instructions
Instruction  Description
PSHUFW       Shuffle packed integer words in MMX register
Instruction Category: SSE2 128-Bit SIMD Integer Instructions
Instruction  Description
PSHUFLW      Shuffle packed low words
PSHUFHW      Shuffle packed high words
PSHUFD       Shuffle packed doublewords
PUNPCKHQDQ   Unpack high quadwords
PUNPCKLQDQ   Unpack low quadwords
Instruction Category: SSE Shuffle and Unpack Instructions
Instruction  Description
SHUFPS       Shuffles values in packed single-precision floating-point operands
UNPCKHPS     Unpacks and interleaves the two high-order values from two single-precision floating-point operands
UNPCKLPS     Unpacks and interleaves the two low-order values from two single-precision floating-point operands
Instruction Category: SSE2 Shuffle and Unpack Instructions
Instruction  Description
SHUFPD       Shuffles values in packed double-precision floating-point operands
UNPCKHPD     Unpacks and interleaves the high values from two packed double-precision floating-point operands
UNPCKLPD     Unpacks and interleaves the low values from two packed double-precision floating-point operands
Instruction Category: SSSE3 Packed Shuffle Bytes
Instruction  Description
PSHUFB       A complete byte-granularity permutation, including force-to-zero flag
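The shuffle and unpack operations can be illustrated with PSHUFD and UNPCKLPS/UNPCKHPS. This is a sketch only, assuming an x86 compiler with <emmintrin.h>; the function names are hypothetical, and the SSSE3 PSHUFB is left out so the example needs nothing beyond SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* PSHUFD: the 8-bit immediate holds four 2-bit source indices.
   _MM_SHUFFLE(0, 1, 2, 3) selects in[3], in[2], in[1], in[0]
   for output elements 0..3, reversing the vector. */
static void reverse4(const int in[4], int out[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out,
                     _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* UNPCKLPS: interleaves the two low elements: a0, b0, a1, b1. */
static void interleave_low(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_unpacklo_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* UNPCKHPS: interleaves the two high elements: a2, b2, a3, b3. */
static void interleave_high(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_unpackhi_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Interleaving of this kind is typically used to convert between separate arrays of components and arrays of interleaved records.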
Functional Instructions
Instruction Category: SSE 64-Bit SIMD Integer Instructions
Instruction  Description
PAVGB        Compute average of packed unsigned byte integers
PAVGW        Compute average of packed unsigned word integers
PEXTRW       Extract word
PINSRW       Insert word
PMAXUB       Maximum of packed unsigned byte integers
PMAXSW       Maximum of packed signed word integers
PMINUB       Minimum of packed unsigned byte integers
PMINSW       Minimum of packed signed word integers
PSADBW       Compute sum of absolute differences
Instruction Category: SSSE3 Instructions
Instruction  Description
PSIGNB/W/D   Per element, if the source operand is negative, multiply the destination operand by -1; if the source element is zero, set the destination element to zero
PABSB/W/D    Per element, overwrite destination with absolute value of source
PALIGNR      Extract any contiguous 16 (8 in the 64-bit case) bytes from the pair [dst, src] and store them to the dst register
PMULHRSW     Multiply packed signed 16-bit integers with rounding and scaling, and return the high 16 bits of each result
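PAVGB and PSADBW are typical building blocks for image and video code. The sketch below uses their 128-bit SSE2 forms rather than the 64-bit MMX-register forms listed above, assuming an x86 compiler with <emmintrin.h>; the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* PAVGB: rounded average, (a + b + 1) >> 1, for each unsigned byte. */
static void average_bytes(const uint8_t a[16], const uint8_t b[16],
                          uint8_t out[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
}

/* PSADBW: sums |a[i] - b[i]| over each 8-byte half. The 128-bit form
   leaves one partial sum in the low 16 bits of each 64-bit lane,
   so the two partial sums are added here. */
static unsigned sum_abs_diff(const uint8_t a[16], const uint8_t b[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i sad = _mm_sad_epu8(va, vb);
    return (unsigned)_mm_cvtsi128_si32(sad)        /* low half  */
         + (unsigned)_mm_extract_epi16(sad, 4);    /* high half */
}
```

The sum of absolute differences is the usual block-matching cost in motion estimation, which is why a single instruction for it is useful.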
System Related Instructions
Instruction Category: MMX State Management Instructions
Instruction  Description
EMMS         Empty MMX state
Instruction Category: SSE MXCSR State Management Instructions
Instruction  Description
LDMXCSR      Load MXCSR register
STMXCSR      Save MXCSR register state
Instruction Category: SSE3 Agent Synchronization Instructions
Instruction  Description
MONITOR      Sets up an address range used to monitor write-back stores
MWAIT        Enables a logical processor to enter into an optimized state while waiting for a write-back store to the address range set up by the MONITOR instruction
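One practical use of STMXCSR and LDMXCSR is to change the SSE rounding mode temporarily. The following is a sketch under assumptions: an x86 compiler with <emmintrin.h>, the `_MM_ROUND_*` constants from <xmmintrin.h>, and a hypothetical helper name. The `volatile` qualifiers are a precaution to keep the compiler from moving the conversion across the MXCSR writes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_ROUND_* constants */

/* Converts x to an integer with the MXCSR rounding mode temporarily set
   to `mode` (one of the _MM_ROUND_* constants), then restores MXCSR. */
static int convert_with_mode(double x, unsigned int mode) {
    unsigned int saved = _mm_getcsr();            /* STMXCSR */
    volatile double input = x;
    volatile int result;
    _mm_setcsr((saved & ~(unsigned int)_MM_ROUND_MASK) | mode);  /* LDMXCSR */
    result = _mm_cvtsd_si32(_mm_set_sd(input));   /* CVTSD2SI honors MXCSR */
    _mm_setcsr(saved);                            /* restore previous state */
    return result;
}
```

For example, convert_with_mode(2.2, _MM_ROUND_UP) gives 3, while convert_with_mode(2.7, _MM_ROUND_TOWARD_ZERO) gives 2. Saving and restoring the full register keeps the exception masks and flags as the caller left them.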
Cacheability Control, Prefetch, and
Instruction Ordering Instructions
Instruction Category: SSE Cacheability Control, Prefetch, and Instruction Ordering Instructions
Instruction  Description
MASKMOVQ     Non-temporal store of selected bytes from an MMX register into memory
MOVNTQ       Non-temporal store of quadword from an MMX register into memory
MOVNTPS      Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
PREFETCHh    Load 32 or more bytes from memory to a selected level of the processor’s cache hierarchy
SFENCE       Serializes store operations
Instruction Category: SSE2 Cacheability Control and Ordering Instructions
Instruction  Description
CLFLUSH      Flushes and invalidates a memory operand and its associated cache line from all levels of the processor’s cache hierarchy
LFENCE       Serializes load operations
MFENCE       Serializes load and store operations
PAUSE        Improves the performance of “spin-wait loops”
MASKMOVDQU   Non-temporal store of selected bytes from an XMM register into memory
MOVNTPD      Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
MOVNTDQ      Non-temporal store of double quadword from an XMM register into memory
MOVNTI       Non-temporal store of a doubleword from a general-purpose register into memory
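Non-temporal stores are usually paired with a store fence, and PAUSE belongs inside spin-wait loops. A minimal sketch, assuming an x86 compiler with <emmintrin.h>; the function names are hypothetical, and the alignment and size requirements stated in the comments are preconditions the caller must meet.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Fills a buffer with MOVNTDQ, which writes around the cache, then issues
   SFENCE so the stores are ordered before any later stores.
   Assumes dst is 16-byte aligned and n_ints is a multiple of 4. */
static void stream_fill(int32_t *dst, size_t n_ints, int32_t value) {
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < n_ints; i += 4) {
        _mm_stream_si128((__m128i *)(dst + i), v);  /* MOVNTDQ */
    }
    _mm_sfence();                                   /* SFENCE */
}

/* Spin-wait loop with PAUSE, which hints to the processor that the
   thread is only polling. */
static void spin_until_set(volatile int *flag) {
    while (!*flag) {
        _mm_pause();                                /* PAUSE */
    }
}
```

Streaming stores tend to pay off only for large buffers that will not be read again soon; for small, immediately reused data, ordinary stores are normally the better choice.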