Intel® Processor Architecture: SIMD Instructions
Intel® Software College

Objectives

After completing this module you will understand:
• SIMD rationale
• Intel SIMD instructions

Intel® Processor Architecture:SIMD Technology® Overview
Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda

• SIMD Rationale
• Intel SIMD History
• SIMD Data Types
• Intel SIMD Instruction Sets
• Programming with Intel SIMD Instructions
• Backup – Instructions Reference Table

SIMD (Single Instruction Multiple Data) Technology

• Increases processor throughput by performing multiple computations in a single instruction
• MMX™ technology, SSE, SSE2 and SSE3 are architectural extensions

Example (SSE2): performs two double-precision operations in a single instruction
• a1 + b1 = c1 in parallel with a0 + b0 = c0
• Useful for matrix operations

128-bit registers:
       a1 | a0
  +    b1 | b0
  =    c1 | c0
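The SSE2 example above can be sketched with compiler intrinsics. This is a minimal illustration rather than code from the deck; the function name add_two_doubles is hypothetical, and it assumes a compiler that provides <emmintrin.h> and targets a processor with SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: adds two pairs of doubles with one packed add (ADDPD).
   c[0] = a[0] + b[0] and c[1] = a[1] + b[1]. */
void add_two_doubles(const double a[2], const double b[2], double c[2])
{
    __m128d va = _mm_loadu_pd(a);     /* va = { a0, a1 } */
    __m128d vb = _mm_loadu_pd(b);     /* vb = { b0, b1 } */
    __m128d vc = _mm_add_pd(va, vb);  /* both additions in one instruction */
    _mm_storeu_pd(c, vc);
}
```

The unaligned load and store are used so that the caller does not have to guarantee 16-byte alignment.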
Streaming SIMD Extensions: A Brief History

MMX™ technology – Intel® Pentium® with MMX™ and Pentium® II processors
• Introduced 64-bit MMX registers for SIMD integer operations
• Supports SIMD operations on packed byte, word, and doubleword integers
• Useful for multimedia and communications software

SSE – Intel® Pentium® III processor
• Introduced 128-bit XMM registers for SIMD single-precision floating-point (FP-SP) operands, and extended 64-bit SIMD integer support
• Executes FP and SIMD simultaneously
• Introduced data prefetch instructions
• Useful for 3D geometry, 3D rendering, and video encoding/decoding

SSE2 – Intel® Pentium® 4 and Intel® Xeon™ processors
• Added extra 64-bit SIMD integer support
• Uses the same XMM registers for SIMD integer and double-precision floating-point (FP-DP) operands
• Has 144 new instructions for data support (no new registers)
• Adds support for cacheability and memory ordering operations
• Useful for 3D graphics, video encoding/decoding and encryption

SSE3 – Intel® Pentium® 4 processor
• Accelerates performance of Streaming SIMD Extensions technology, Streaming SIMD Extensions 2 technology, and x87-FP math capabilities
• Useful in some 3D operations (quaternions), complex arithmetic and video codec algorithms

SSSE3 – Intel® Core™2 processor
• Application performance improvement
• Potential for specific application domains
X86 Register Sets

SSE registers were first introduced in the Pentium® III processor.

IA-INT registers (32 bits wide: eax … edi)
• Fourteen 32-bit registers
• Scalar data and addresses
• Direct access to the registers

MMX™ technology / IA-FP registers (80 / 64 bits wide: st0 … st7, mm0 … mm7)
• Eight 80/64-bit registers
• Hold data only
• Stack access to FP0..FP7
• Direct access to MM0..MM7
• No MMX™ technology / FP interoperability

SSE registers (128 bits wide: xmm0 … xmm7)
• Eight 128-bit registers
• Hold data only:
  4 x single-precision FP numbers
  2 x double-precision FP numbers
  128-bit packed integers
• Direct access to the registers
• Can be used simultaneously with FP / MMX technology

SIMD Data Types (1)

64-bit packed integer data types (bits 63..0):
• 8 packed bytes
• 4 packed words
• 2 packed doublewords

SIMD Data Types (2)

128-bit floating-point and integer data types (bits 127..0):
• 16 packed bytes: integer
• 8 packed words: integer
• 4 packed doublewords: single-precision floating point or integer
• 2 packed quadwords: double-precision floating point or integer
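The packed data types differ only in how the same 128 bits are divided into lanes. The sketch below, which is an illustration and not part of the original deck, adds identical operands once as 16 bytes and once as 8 words. The function name add_as_bytes_and_words is hypothetical, and SSE2 support is assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds the same two 128-bit operands with two lane widths. */
void add_as_bytes_and_words(uint8_t as_bytes[16], uint8_t as_words[16])
{
    __m128i a = _mm_set1_epi8(-1);    /* every byte is 0xFF */
    __m128i b = _mm_set1_epi8(0x01);  /* every byte is 0x01 */

    /* PADDB: 0xFF + 0x01 wraps to 0x00 inside each byte lane. */
    _mm_storeu_si128((__m128i *)as_bytes, _mm_add_epi8(a, b));

    /* PADDW: 0xFFFF + 0x0101 = 0x10100, truncated to 0x0100 in each word lane. */
    _mm_storeu_si128((__m128i *)as_words, _mm_add_epi16(a, b));
}
```

With byte lanes the carry is discarded at every byte; with word lanes it propagates from the low byte into the high byte of each word, so the stored bytes alternate between 0x00 and 0x01.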
SIMD Data Types (3)

• Packed BCD data type

Packed BCD integers: each byte holds two BCD digits (bits 7..4 and bits 3..0).

80-bit packed BCD decimal integers: bits 79..72 hold the sign, with the remaining bits of that byte unused (X); bits 71..0 hold 18 BCD digits, D17 … D0.

SSE Instruction Set Extensions

Introduced with the Pentium® III processor in 1999; now frequently called SSE-1
Only new data type supported: 4 x 32-bit (single-precision) floating-point data
Some 70 instructions:
• Arithmetic, compare, and convert operations on SSE SP FP data
• Packed and unpacked operations
• Data load/store
• Prefetch
• Extension of MMX
• Streaming store (store without using the cache in between)
• …

2001 PTE Engineering Enabling Conference

SSE Sample: Branch Removal

R = (A < B) ? C : D    // remember: everything packed

A                  0.0     0.0    -3.0     3.0
B                  0.0     1.0    -5.0     5.0
cmplt (A < B)    00000   11111   00000   11111

C                   c3      c2      c1      c0
D                   d3      d2      d1      d0
and  (mask, C)   00000      c2   00000      c0
nand (mask, D)      d3   00000      d1   00000
or                  d3      c2      d1      c0
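The branch-removal sample can be written with SSE intrinsics roughly as follows. This is a sketch under the assumption that <xmmintrin.h> is available; the function name select_lt is hypothetical and does not appear in the deck.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: r[i] = (a[i] < b[i]) ? c[i] : d[i] for four floats,
   computed without a branch. */
void select_lt(const float a[4], const float b[4],
               const float c[4], const float d[4], float r[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    __m128 vd = _mm_loadu_ps(d);

    __m128 mask   = _mm_cmplt_ps(va, vb);     /* all ones where a < b, else zero */
    __m128 from_c = _mm_and_ps(mask, vc);     /* c where the mask is set */
    __m128 from_d = _mm_andnot_ps(mask, vd);  /* d where the mask is clear */
    _mm_storeu_ps(r, _mm_or_ps(from_c, from_d));
}
```

Array index 0 corresponds to the rightmost element (a0, b0, c0, d0) on the slide, so the slide's inputs produce d3, c2, d1, c0 when read from index 3 down to index 0.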
SSE-2 Instruction Set Extensions

Introduced with the Intel® Pentium® 4 processor in 2000
Some 140 new instructions
• Added double-precision floating-point data (2 x 64-bit) and all related instructions, including conversion
• Again some extensions to MMX
• Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations

SIMD Single vs. SIMD Double

SIMD SP FP (4 x single precision: SSE-1)
• Operand = 4 elements: X3 X2 X1 X0 (bits 127..0)
• Element = SP FP number: sign (bit 31), exponent (bits 30..23), significand (bits 22..0)

SIMD DP FP (2 x double precision: SSE-2)
• Operand = 2 elements: X1 X0 (bits 127..0)
• Element = DP FP number: sign (bit 63), exponent (bits 62..52), significand (bits 51..0)

Sample for SSE-2: SIMD Double to SIMD Int Conversion

SIMD Double to SIMD Int: conversion to the two lower ints; the two higher ints are cleared

    x:     |        x1        |        x0        |
    ix:    |  00000  |  00000  | (int)x1 | (int)x0 |

    __m128d x; __m128i ix;
    ix = _mm_cvtpd_epi32(x);

SIMD Int to SIMD Double: conversion from the two lower ints

    ix:    |  ????  |  ????  |   ix1   |   ix0   |
    x:     |   (double)ix1    |   (double)ix0    |

    x = _mm_cvtepi32_pd(ix);
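A runnable variant of the conversion sample is sketched below. It also shows the truncating form next to the rounding form, since the two differ for non-integral inputs. The function names are hypothetical, and the expected values assume the default rounding mode (round to nearest) is in effect.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: converts two doubles to ints, once with the current
   rounding mode (CVTPD2DQ) and once with truncation (CVTTPD2DQ).
   Elements 0..1 of each output hold the results, elements 2..3 are cleared. */
void doubles_to_ints(const double x[2], int rounded[4], int truncated[4])
{
    __m128d vx = _mm_loadu_pd(x);
    _mm_storeu_si128((__m128i *)rounded,   _mm_cvtpd_epi32(vx));
    _mm_storeu_si128((__m128i *)truncated, _mm_cvttpd_epi32(vx));
}

/* Hypothetical helper: converts the two lower ints back to doubles (CVTDQ2PD).
   The two upper ints are ignored. */
void ints_to_doubles(const int ix[4], double x[2])
{
    __m128i vi = _mm_loadu_si128((const __m128i *)ix);
    _mm_storeu_pd(x, _mm_cvtepi32_pd(vi));
}
```

For inputs 2.7 and -1.7 the rounding form is expected to give 3 and -2, while the truncating form gives 2 and -1.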
SSE3: No New Data Types but New Instructions

FP to integer conversions:      FISTTP
Complex arithmetic:             ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
Video encoding:                 LDDQU
SIMD FP using AOS format*:      HADDPD, HSUBPD, HADDPS, HSUBPS
Thread synchronization:         MONITOR, MWAIT

* Also benefits complex arithmetic and vectorization

Streaming SIMD Extensions 3

13 new instructions

Three have limited use for application performance improvement:
• FISTTP - x87 to integer conversion (requires –longdouble switch)
• MONITOR/MWAIT - thread synchronization
  Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages

The other ten have some potential for specific application domains

SSE-3 Sample: Complex Arithmetic with ADDSUBPS

ADDSUBPS OperandA, OperandB
• OperandA (xmm register; 4 data elements): a3, a2, a1, a0
• OperandB (xmm register or memory address; 4 data elements): b3, b2, b1, b0
• Result (stored in OperandA): a3+b3, a2-b2, a1+b1, a0-b0

__m128 _mm_addsub_ps(__m128 a, __m128 b)

    OperandA      a3       a2       a1       a0
    OperandB      b3       b2       b1       b0
    Operation     Add      Sub      Add      Sub
    Result      a3+b3    a2-b2    a1+b1    a0-b0
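As an illustration of why ADDSUBPS helps complex arithmetic, the sketch below multiplies two pairs of complex numbers using ADDSUBPS together with MOVSLDUP and MOVSHDUP. It is one possible formulation, not code from the deck. The function name complex_mul2 and the USE_SSE3 macro are hypothetical, and the per-function target attribute is an assumption that holds for GCC and Clang; other compilers need their own SSE3 option.

```c
#include <pmmintrin.h>  /* SSE3 intrinsics */

/* Assumption: GCC or Clang, where a function-level target attribute enables
   SSE3 code generation without a command-line switch. */
#if defined(__GNUC__)
#define USE_SSE3 __attribute__((target("sse3")))
#else
#define USE_SSE3
#endif

/* Hypothetical helper: multiplies two pairs of complex numbers stored as
   { re0, im0, re1, im1 }. (a + bi)(c + di) = (ac - bd) + (bc + ad)i. */
USE_SSE3
void complex_mul2(const float x[4], const float y[4], float out[4])
{
    __m128 vx   = _mm_loadu_ps(x);      /* { a0, b0, a1, b1 } */
    __m128 vy   = _mm_loadu_ps(y);      /* { c0, d0, c1, d1 } */
    __m128 re   = _mm_moveldup_ps(vy);  /* { c0, c0, c1, c1 } */
    __m128 im   = _mm_movehdup_ps(vy);  /* { d0, d0, d1, d1 } */
    __m128 swap = _mm_shuffle_ps(vx, vx, _MM_SHUFFLE(2, 3, 0, 1)); /* { b0, a0, b1, a1 } */
    __m128 t1   = _mm_mul_ps(vx, re);   /* { a0c0, b0c0, a1c1, b1c1 } */
    __m128 t2   = _mm_mul_ps(swap, im); /* { b0d0, a0d0, b1d1, a1d1 } */

    /* ADDSUBPS subtracts in even elements and adds in odd elements. */
    _mm_storeu_ps(out, _mm_addsub_ps(t1, t2));
}
```

For example, (1 + 2i)(3 + 4i) yields -5 + 10i, with the real part coming from the subtracting element and the imaginary part from the adding element.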
Supplemental SSE-3 (SSSE-3)

Extension introduced by the Intel® Core™ architecture

Horizontal addition/subtraction:                  PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
Packed absolute values:                           PABSB, PABSW, PABSD
Multiply and add packed signed/unsigned bytes:    PMADDUBSW
Packed multiply high with round and scale:        PMULHRSW
Packed shuffle bytes:                             PSHUFB
Packed sign:                                      PSIGNB/W/D
Packed align right:                               PALIGNR

SSSE-3 New Instructions

• Useful for media and imaging
• 16 new packed integer instructions
• Six new instruction categories

Category                                    64-bit SIMD latency   128-bit SIMD latency
Absolute value and integer "sign"           1 clock               1 clock
Byte multiply and add                       3 clocks              3 clocks
Word multiply high with round and shift     3 clocks              3 clocks
Byte permute                                1 clock               3 clocks
Byte concatenate with shift                 1 clock               2 clocks
Integer horizontal adds and subtracts       4 clocks              5 clocks

All instructions come in both 128-bit and 64-bit (MMX) flavors
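Two of the SSSE3 categories, packed absolute value and integer horizontal add, can be tried with the 128-bit intrinsics as sketched here. This is an illustration only. The function names and the USE_SSSE3 macro are hypothetical, and the target attribute assumes GCC or Clang.

```c
#include <tmmintrin.h>  /* SSSE3 intrinsics */
#include <stdint.h>

/* Assumption: GCC or Clang, where a function-level target attribute enables
   SSSE3 code generation without a command-line switch. */
#if defined(__GNUC__)
#define USE_SSSE3 __attribute__((target("ssse3")))
#else
#define USE_SSSE3
#endif

/* Hypothetical helper: PABSW, absolute value of eight signed 16-bit integers. */
USE_SSSE3
void abs_words(const int16_t in[8], int16_t out[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_abs_epi16(v));
}

/* Hypothetical helper: PHADDW, sums adjacent pairs of words. The low four
   results come from a, the high four results come from b. */
USE_SSSE3
void hadd_words(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_hadd_epi16(va, vb));
}
```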
Sample SSSE-3 Instruction: Byte Permute

PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128

• A complete byte-granularity permutation
• The source operand is used as the control field (variable control)
• The destination operand gets permuted
• Each byte of the source field selects the origin of the corresponding destination byte
• Also includes a force-byte-to-zero flag (bit 7)

Byte               7     6     5     4     3     2     1     0
src (control)    0x07  0x07  0xFF  0x80  0x01  0x00  0x00  0x00
dest (before)    0x04  0x01  0x07  0x03  0x02  0x02  0xFF  0x01
dest (after)     0x04  0x04  0x00  0x00  0xFF  0x01  0x01  0x01

Ways to SSE/SIMD Programming

Coding using SSE/SSE2/3/4 assembler instructions
• Very tedious (manual scheduling) – discouraged: don't do it!
• E.g.: how do you exploit the benefit of now having 16 instead of 8 SSE registers on Intel® 64 without maintaining two versions?

Intel® compiler's C/C++ SIMD intrinsics
• No need to take care of register allocation, scheduling, etc.

Intel® compiler's C++ Vector Class Library
• Use this if you are heavily into C++ classes

Vectorizer of the Intel® C++ and Fortran Compilers
• Recommended for most cases – easy and efficient

Use ready-to-go vectorized code from a library such as the Intel® Math Kernel Library (MKL)
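To make the difference between the vectorizer route and the intrinsics route concrete, the sketch below writes the same array addition twice. It is an illustration under the assumption of an SSE-capable compiler; the function names add_scalar and add_intrinsics are hypothetical. Whether a given compiler actually vectorizes the plain loop depends on its options and is not shown here.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Plain C loop: a candidate for the compiler's vectorizer. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Same computation with SSE intrinsics: four floats per iteration,
   followed by a scalar loop for the remaining elements. */
void add_intrinsics(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The intrinsics version leaves register allocation and scheduling to the compiler, but the programmer still has to handle the loop tail and any alignment requirements by hand.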
Compiler Based Vectorization: Processor-Specific Switches

Switches (Linux*)    Generate code and optimize for
-xK, -axK            Pentium® III compatible and Athlon XP processors, including code generation for MMX and SSE
-xW, -axW            Pentium® 4 compatible, Athlon 64, and Opteron processors in 32-bit and 64-bit mode, including code generation for MMX, SSE and SSE2
-xN, -axN            Pentium® 4 processors in 32-bit mode, including code generation for MMX, SSE and SSE2 (deprecated switch: use -xW instead)
-xB, -axB            Pentium® M processors, including code generation for MMX, SSE and SSE2
-xP, -axP            Intel® processors with SSE3 capability, including Pentium® 4 (both 32-bit and 64-bit mode), with code generation for MMX, SSE, SSE2 and SSE3
-xT, -axT            Intel® processors with MNI (SSSE3) capability: Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE3 and MNI
Instruction Set Extensions

[Figure: bar chart "New Instructions Added to Intel® Processors". Vertical axis: number of new instructions, 0 to 160. Horizontal axis: MMX™, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions 3 (SSE3), Supplemental SSE3 (SSSE3), future SSE-4, future Intel instruction set extensions. Dates shown: Jan-97, Feb-99, Dec-00, Feb-04, Jul-06, 2008+. Process (nm) shown: 350, 250, 180, 90, 65, 45, 45. Bar labels shown: 56, 70, 144, 13, 32, 32, ~50.]

45 nm, beginning in 2008:
• ~50 new instructions in 13 groups
• All function in 32-bit and 64-bit modes
• Improvements in: Commercial Data Integrity i-SCSI, Video Processing, String and Text Processing, 2D & 3D Imaging, Vectorizing Compiler Performance

Transfer Instructions (1)

MMX Transfer Instructions
MOVD        Move doubleword
MOVQ        Move quadword

SSE SIMD Single-Precision Data Transfer Instructions
MOVAPS      Move four aligned packed single-precision floating-point values between XMM registers or between an XMM register and memory
MOVUPS      Move four unaligned packed single-precision floating-point values between XMM registers or between an XMM register and memory
MOVHPS      Move two packed single-precision floating-point values to or from the high quadword of an XMM register and memory
MOVHLPS     Move two packed single-precision floating-point values from the high quadword of an XMM register to the low quadword of another XMM register
MOVLPS      Move two packed single-precision floating-point values to or from the low quadword of an XMM register and memory
MOVLHPS     Move two packed single-precision floating-point values from the low quadword of an XMM register to the high quadword of another XMM register
MOVMSKPS    Extract sign mask from four packed single-precision floating-point values
MOVSS       Move scalar single-precision floating-point value between XMM registers or between an XMM register and memory

Transfer Instructions (2)

SSE 64-Bit SIMD Integer Instructions
PMOVMSKB    Move byte mask

SSE2 128-Bit SIMD Integer Instructions
MOVDQA      Move aligned double quadword
MOVDQU      Move unaligned double quadword
MOVQ2DQ     Move quadword integer from MMX to XMM registers
MOVDQ2Q     Move quadword integer from XMM to MMX registers

SSE2 Double-Precision Floating-Point Data Movement Instructions
MOVAPD      Move two aligned packed double-precision floating-point values between XMM registers or between an XMM register and memory
MOVUPD      Move two unaligned packed double-precision floating-point values between XMM registers or between an XMM register and memory
MOVHPD      Move high packed double-precision floating-point value to or from the high quadword of an XMM register and memory
MOVLPD      Move low packed double-precision floating-point value to or from the low quadword of an XMM register and memory
MOVMSKPD    Extract sign mask from two packed double-precision floating-point values
MOVSD       Move scalar double-precision floating-point value between XMM registers or between an XMM register and memory
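The aligned and unaligned moves and the sign-mask extraction map onto intrinsics as sketched below. This is an illustration, not part of the reference table; the function names are hypothetical. The aligned load requires a 16-byte-aligned address, which the caller has to guarantee.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: MOVUPS load from any address, then MOVMSKPS.
   Bit i of the result is the sign bit of element i. */
int sign_mask_unaligned(const float *p)
{
    return _mm_movemask_ps(_mm_loadu_ps(p));
}

/* Hypothetical helper: MOVAPS load, then MOVMSKPS.
   p must be 16-byte aligned; a misaligned address raises a fault. */
int sign_mask_aligned(const float *p)
{
    return _mm_movemask_ps(_mm_load_ps(p));
}
```

A buffer declared with a 16-byte alignment specifier (for example `_Alignas(16)` in C11) can be passed to the aligned variant, while a pointer into the middle of that buffer should go through the unaligned variant.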
Transfer Instructions (3)

SSE3 SIMD Floating-Point LOAD/MOVE/DUPLICATE Instructions
MOVSLDUP    Loads/moves 128 bits; duplicating the first and third 32-bit data elements
MOVSHDUP    Loads/moves 128 bits; duplicating the second and fourth 32-bit data elements
MOVDDUP     Loads/moves 64 bits (bits[63:0] if the source is a register) and returns the same 64 bits in both the lower and upper halves of the 128-bit result register; duplicates the 64 bits from the source

SSE3 Specialized 128-bit Unaligned Data Load Instruction
LDDQU       Special 128-bit unaligned load designed to avoid cache line splits

64-Bit Mode Instructions
LODSQ             Load qword at address (R)SI into RAX
MOVSQ             Move qword from address (R)SI to (R)DI
MOVZX (64-bits)   Move doubleword to quadword, zero-extension
STOSQ             Store RAX at address RDI
Packed Arithmetic Instructions (1)

MMX Packed Arithmetic Instructions
PADDB       Add packed byte integers
PADDW       Add packed word integers
PADDD       Add packed doubleword integers
PADDSB      Add packed signed byte integers with signed saturation
PADDSW      Add packed signed word integers with signed saturation
PADDUSB     Add packed unsigned byte integers with unsigned saturation
PADDUSW     Add packed unsigned word integers with unsigned saturation
PSUBB       Subtract packed byte integers
PSUBW       Subtract packed word integers
PSUBD       Subtract packed doubleword integers
PSUBSB      Subtract packed signed byte integers with signed saturation
PSUBSW      Subtract packed signed word integers with signed saturation
PSUBUSB     Subtract packed unsigned byte integers with unsigned saturation
PSUBUSW     Subtract packed unsigned word integers with unsigned saturation
PMULHW      Multiply packed signed word integers and store high result
PMULLW      Multiply packed signed word integers and store low result
PMADDWD     Multiply and add packed word integers
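The difference between wraparound and saturating addition can be seen in a short sketch. It uses the 128-bit SSE2 forms of PADDB, PADDUSB and PADDSB rather than the 64-bit MMX forms listed above, which is a substitution made here for convenience. The function name add_bytes is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds 16 byte pairs three ways. */
void add_bytes(const uint8_t a[16], const uint8_t b[16],
               uint8_t wrap[16], uint8_t usat[16], int8_t ssat[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* PADDB: result is taken modulo 256. */
    _mm_storeu_si128((__m128i *)wrap, _mm_add_epi8(va, vb));

    /* PADDUSB: bytes treated as unsigned, result clamped to 255. */
    _mm_storeu_si128((__m128i *)usat, _mm_adds_epu8(va, vb));

    /* PADDSB: bytes treated as signed, result clamped to -128..127. */
    _mm_storeu_si128((__m128i *)ssat, _mm_adds_epi8(va, vb));
}
```

For 200 + 100 the wraparound form gives 44 and the unsigned saturating form gives 255; for 100 + 100 the signed saturating form gives 127.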
Packed Arithmetic Instructions (2)

SSE Packed Arithmetic Instructions
ADDPS       Add packed single-precision floating-point values
ADDSS       Add scalar single-precision floating-point values
SUBPS       Subtract packed single-precision floating-point values
SUBSS       Subtract scalar single-precision floating-point values
MULPS       Multiply packed single-precision floating-point values
MULSS       Multiply scalar single-precision floating-point values
DIVPS       Divide packed single-precision floating-point values
DIVSS       Divide scalar single-precision floating-point values
RCPPS       Compute reciprocals of packed single-precision floating-point values
RCPSS       Compute reciprocal of scalar single-precision floating-point values
SQRTPS      Compute square roots of packed single-precision floating-point values
SQRTSS      Compute square root of scalar single-precision floating-point values
RSQRTPS     Compute reciprocals of square roots of packed single-precision floating-point values
RSQRTSS     Compute reciprocal of square root of scalar single-precision floating-point values
MAXPS       Return maximum packed single-precision floating-point values
MAXSS       Return maximum scalar single-precision floating-point values
MINPS       Return minimum packed single-precision floating-point values
MINSS       Return minimum scalar single-precision floating-point values
Packed Arithmetic Instructions (3)

SSE 64-Bit SIMD Integer Instructions
PMULHUW     Multiply packed unsigned integers and store high result

SSE2 Packed Arithmetic Instructions
ADDPD       Add packed double-precision floating-point values
ADDSD       Add scalar double-precision floating-point values
SUBPD       Subtract packed double-precision floating-point values
SUBSD       Subtract scalar double-precision floating-point values
MULPD       Multiply packed double-precision floating-point values
MULSD       Multiply scalar double-precision floating-point values
DIVPD       Divide packed double-precision floating-point values
DIVSD       Divide scalar double-precision floating-point values
SQRTPD      Compute packed square roots of packed double-precision floating-point values
SQRTSD      Compute scalar square root of scalar double-precision floating-point values
MAXPD       Return maximum packed double-precision floating-point values
MAXSD       Return maximum scalar double-precision floating-point values
MINPD       Return minimum packed double-precision floating-point values
MINSD       Return minimum scalar double-precision floating-point values
Packed Arithmetic Instructions (4)

SSE2 128-Bit SIMD Integer Instructions
PMULUDQ     Multiply packed unsigned doubleword integers
PADDQ       Add packed quadword integers
PSUBQ       Subtract packed quadword integers

SSE3 SIMD Floating-Point Packed ADD/SUB Instructions
ADDSUBPS    Performs single-precision addition on the second and fourth pairs of 32-bit data elements within the operands; single-precision subtraction on the first and third pairs
ADDSUBPD    Performs double-precision addition on the second pair of quadwords, and double-precision subtraction on the first pair

Packed Arithmetic Instructions (5)

SSE3 SIMD Floating-Point Horizontal ADD/SUB Instructions
HADDPS      Performs a single-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the third and fourth elements of the first operand; the third by adding the first and second elements of the second operand; and the fourth by adding the third and fourth elements of the second operand.
HSUBPS      Performs a single-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the fourth element of the first operand from the third element of the first operand; the third by subtracting the second element of the second operand from the first element of the second operand; and the fourth by subtracting the fourth element of the second operand from the third element of the second operand.
HADDPD      Performs a double-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the first and second elements of the second operand.
HSUBPD      Performs a double-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the second element of the second operand from the first element of the second operand.

SSSE3 Packed Arithmetic Instructions
PHADDW/D/SW   Pairwise integer horizontal addition + pack
PHSUBW/D/SW   Pairwise integer horizontal subtraction + pack
PMADDUBSW     Multiply signed and unsigned bytes; accumulate result to signed words (multiply-accumulate)
PMULHRSW      Signed 16-bit multiply, return high bits
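The HADDPS description is easier to follow with a concrete use: applying it twice to a register reduces four floats to their sum. The sketch below is an illustration only. The function name sum4 and the USE_SSE3 macro are hypothetical, and the target attribute assumes GCC or Clang.

```c
#include <pmmintrin.h>  /* SSE3 intrinsics */

/* Assumption: GCC or Clang, where a function-level target attribute enables
   SSE3 code generation without a command-line switch. */
#if defined(__GNUC__)
#define USE_SSE3 __attribute__((target("sse3")))
#else
#define USE_SSE3
#endif

/* Hypothetical helper: returns v[0] + v[1] + v[2] + v[3] using HADDPS. */
USE_SSE3
float sum4(const float v[4])
{
    __m128 x = _mm_loadu_ps(v);  /* { v0, v1, v2, v3 } */
    x = _mm_hadd_ps(x, x);       /* { v0+v1, v2+v3, v0+v1, v2+v3 } */
    x = _mm_hadd_ps(x, x);       /* every element holds the full sum */
    return _mm_cvtss_f32(x);     /* extract element 0 */
}
```

Passing the same register as both operands makes the first and second operands of HADDPS identical, so the pairwise sums of the first step are themselves added in the second step.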
Conversion Instructions (1)

MMX Conversion Instructions
PACKSSWB    Pack words into bytes with signed saturation
PACKSSDW    Pack doublewords into words with signed saturation
PACKUSWB    Pack words into bytes with unsigned saturation
PUNPCKHBW   Unpack high-order bytes
PUNPCKHWD   Unpack high-order words
PUNPCKHDQ   Unpack high-order doublewords
PUNPCKLBW   Unpack low-order bytes
PUNPCKLWD   Unpack low-order words
PUNPCKLDQ   Unpack low-order doublewords

SSE Conversion Instructions
CVTPI2PS    Convert packed doubleword integers to packed single-precision floating-point values
CVTSI2SS    Convert doubleword integer to scalar single-precision floating-point value
CVTPS2PI    Convert packed single-precision floating-point values to packed doubleword integers
CVTTPS2PI   Convert with truncation packed single-precision floating-point values to packed doubleword integers
CVTSS2SI    Convert a scalar single-precision floating-point value to a doubleword integer
CVTTSS2SI   Convert with truncation a scalar single-precision floating-point value to a scalar doubleword integer

Conversion Instructions (2)

SSE2 Conversion Instructions
CVTPD2PI    Convert packed double-precision floating-point values to packed doubleword integers
CVTTPD2PI   Convert with truncation packed double-precision floating-point values to packed doubleword integers
CVTPI2PD    Convert packed doubleword integers to packed double-precision floating-point values
CVTPD2DQ    Convert packed double-precision floating-point values to packed doubleword integers
CVTTPD2DQ   Convert with truncation packed double-precision floating-point values to packed doubleword integers
CVTDQ2PD    Convert packed doubleword integers to packed double-precision floating-point values
CVTPS2PD    Convert packed single-precision floating-point values to packed double-precision floating-point values
CVTPD2PS    Convert packed double-precision floating-point values to packed single-precision floating-point values
CVTSS2SD    Convert scalar single-precision floating-point values to scalar double-precision floating-point values
CVTSD2SS    Convert scalar double-precision floating-point values to scalar single-precision floating-point values
CVTSD2SI    Convert scalar double-precision floating-point values to a doubleword integer
CVTTSD2SI   Convert with truncation scalar double-precision floating-point values to scalar doubleword integers
CVTSI2SD    Convert doubleword integer to scalar double-precision floating-point value
Conversion Instructions (3)

SSE2 Conversion Instructions
CVTDQ2PS   Convert packed doubleword integers to packed single-precision floating-point values
CVTPS2DQ   Convert packed single-precision floating-point values to packed doubleword integers
CVTTPS2DQ  Convert with truncation packed single-precision floating-point values to packed doubleword integers

SSE3 x87-FP Integer Conversion Instruction
FISTTP     Behaves like the FISTP instruction but uses truncation, irrespective of the rounding mode specified in the floating-point control word (FCW)

Comparison Instructions

MMX Comparison Instructions
PCMPEQB    Compare packed bytes for equality
PCMPEQW    Compare packed words for equality
PCMPEQD    Compare packed doublewords for equality
PCMPGTB    Compare packed signed byte integers for greater than
PCMPGTW    Compare packed signed word integers for greater than
PCMPGTD    Compare packed signed doubleword integers for greater than

SSE Comparison Instructions
CMPPS      Compare packed single-precision floating-point values
CMPSS      Compare scalar single-precision floating-point values
COMISS     Perform ordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register
UCOMISS    Perform unordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register

SSE2 Compare Instructions
CMPPD      Compare packed double-precision floating-point values
CMPSD      Compare scalar double-precision floating-point values
COMISD     Perform ordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register
UCOMISD    Perform unordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register

Logical Instructions

MMX Logical Instructions
PAND       Bitwise logical AND
PANDN      Bitwise logical AND NOT
POR        Bitwise logical OR
PXOR       Bitwise logical exclusive OR

SSE Logical Instructions
ANDPS      Perform bitwise logical AND of packed single-precision floating-point values
ANDNPS     Perform bitwise logical AND NOT of packed single-precision floating-point values
ORPS       Perform bitwise logical OR of packed single-precision floating-point values
XORPS      Perform bitwise logical XOR of packed single-precision floating-point values

SSE2 Logical Instructions
ANDPD      Perform bitwise logical AND of packed double-precision floating-point values
ANDNPD     Perform bitwise logical AND NOT of packed double-precision floating-point values
ORPD       Perform bitwise logical OR of packed double-precision floating-point values
XORPD      Perform bitwise logical XOR of packed double-precision floating-point values
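The packed compare instructions produce a mask (all ones or all zeros per element) rather than setting flags, and the logical instructions are what turn that mask into a branch-free selection. The sketch below illustrates this pattern in C with SSE2 intrinsics; the function names are hypothetical, and the instruction named in each comment is the one the intrinsic usually compiles to.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Per element, pick the larger of a and b without branching. */
void select_larger(const double a[2], const double b[2], double out[2]) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    __m128d mask = _mm_cmpgt_pd(va, vb);        /* CMPPD: all ones where a > b */
    __m128d from_a = _mm_and_pd(mask, va);      /* ANDPD: keep a where mask set */
    __m128d from_b = _mm_andnot_pd(mask, vb);   /* ANDNPD: (~mask) & b */
    _mm_storeu_pd(out, _mm_or_pd(from_a, from_b)); /* ORPD: merge the halves */
}

/* Scalar ordered compare (COMISD): returns 1 if a < b, otherwise 0. */
int scalar_less_than(double a, double b) {
    return _mm_comilt_sd(_mm_set_sd(a), _mm_set_sd(b));
}

/* XORPD with a sign-bit mask flips the sign of both elements. */
void negate_pair(const double in[2], double out[2]) {
    __m128d sign_bits = _mm_set1_pd(-0.0);      /* only the sign bit is set */
    _mm_storeu_pd(out, _mm_xor_pd(_mm_loadu_pd(in), sign_bits));
}
```

For inputs `{1.0, 5.0}` and `{3.0, 2.0}`, `select_larger` writes `{3.0, 5.0}`. The same AND / AND NOT / OR idiom applies to the integer masks produced by PCMPEQx and PCMPGTx.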
Shift and Rotate Instructions

MMX Shift and Rotate Instructions
PSLLW      Shift packed words left logical
PSLLD      Shift packed doublewords left logical
PSLLQ      Shift packed quadword left logical
PSRLW      Shift packed words right logical
PSRLD      Shift packed doublewords right logical
PSRLQ      Shift packed quadword right logical
PSRAW      Shift packed words right arithmetic
PSRAD      Shift packed doublewords right arithmetic

SSE2 128-Bit SIMD Integer Instructions
PSLLDQ     Shift double quadword left logical
PSRLDQ     Shift double quadword right logical

Shuffle and Unpack Instructions

SSE 64-Bit SIMD Integer Instructions
PSHUFW     Shuffle packed integer word in MMX register

SSE2 128-Bit SIMD Integer Instructions
PSHUFLW    Shuffle packed low words
PSHUFHW    Shuffle packed high words
PSHUFD     Shuffle packed doublewords
PUNPCKHQDQ Unpack high quadwords
PUNPCKLQDQ Unpack low quadwords

SSE Shuffle and Unpack Instructions
SHUFPS     Shuffles values in packed single-precision floating-point operands
UNPCKHPS   Unpacks and interleaves the two high-order values from two single-precision floating-point operands
UNPCKLPS   Unpacks and interleaves the two low-order values from two single-precision floating-point operands

SSE2 Shuffle and Unpack Instructions
SHUFPD     Shuffles values in packed double-precision floating-point operands
UNPCKHPD   Unpacks and interleaves the high values from two packed double-precision floating-point operands
UNPCKLPD   Unpacks and interleaves the low values from two packed double-precision floating-point operands

SSSE3 Packed Shuffle Bytes
PSHUFB     A complete byte-granularity permutation, including a force-to-zero flag

Functional Instructions

SSE 64-Bit SIMD Integer Instructions
PAVGB      Compute average of packed unsigned byte integers
PAVGW      Compute average of packed unsigned word integers
PEXTRW     Extract word
PINSRW     Insert word
PMAXUB     Maximum of packed unsigned byte integers
PMAXSW     Maximum of packed signed word integers
PMINUB     Minimum of packed unsigned byte integers
PMINSW     Minimum of packed signed word integers
PSADBW     Compute sum of absolute differences

SSSE3 Instructions
PSIGNB/W/D Per element, if the source operand is negative, multiply the destination operand by -1 (if the source is zero, the destination element is set to zero)
PABSB/W/D  Per element, overwrite destination with absolute value of source
PALIGNR    Extract any contiguous 16 (8 in the 64-bit case) bytes from the pair [dst, src] and store them to the dst register
PMULHRSW   Signed 16-bit multiply with rounding and scaling, return high bits
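A few of the shift, shuffle, and functional instructions can be sketched together in C. This example is an illustration under stated assumptions: it uses the 128-bit XMM forms that SSE2 provides for PAVGB, PSADBW, and PEXTRW rather than the 64-bit MMX-register forms listed in the SSE table, the function names are hypothetical, and the instruction named in each comment is the usual compiler output for that intrinsic.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* PAVGB: per-byte unsigned average, computed as (a + b + 1) >> 1. */
void average_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
}

/* PSADBW: sum of absolute byte differences. The 128-bit form yields one
   partial sum per 64-bit half; PEXTRW reads the upper one (word 4). */
unsigned sum_abs_diff_u8(const uint8_t a[16], const uint8_t b[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i sad = _mm_sad_epu8(va, vb);
    return (unsigned)_mm_cvtsi128_si32(sad) + (unsigned)_mm_extract_epi16(sad, 4);
}

/* PSHUFD: reverse the order of the four doublewords. */
void reverse_dwords(const int32_t in[4], int32_t out[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* PSLLDQ: shift the whole 128-bit value left by 4 bytes, filling with zeros.
   In memory order, byte i of the input moves to byte i + 4. */
void shift_left_4_bytes(const uint8_t in[16], uint8_t out[16]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_slli_si128(v, 4));
}
```

Note that PSLLDQ and PSRLDQ shift by whole bytes, while the other shift instructions in the table shift each packed element by a bit count.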
System Related Instructions

MMX State Management Instructions
EMMS       Empty MMX state

SSE MXCSR State Management Instructions
LDMXCSR    Load MXCSR register
STMXCSR    Save MXCSR register state

SSE3 Agent Synchronization Instructions
MONITOR    Sets up an address range used to monitor write-back stores
MWAIT      Enables a logical processor to enter into an optimized state while waiting for a write-back store to the address range set up by the MONITOR instruction

Cacheability Control, Prefetch, and Instruction Ordering Instructions

SSE Cacheability Control, Prefetch, and Instruction Ordering Instructions
MASKMOVQ   Non-temporal store of selected bytes from an MMX register into memory
MOVNTQ     Non-temporal store of quadword from an MMX register into memory
MOVNTPS    Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
PREFETCHh  Load 32 or more bytes from memory to a selected level of the processor’s cache hierarchy
SFENCE     Serializes store operations

SSE2 Cacheability Control and Ordering Instructions
CLFLUSH    Flushes and invalidates a memory operand and its associated cache line from all levels of the processor’s cache hierarchy
LFENCE     Serializes load operations
MFENCE     Serializes load and store operations
PAUSE      Improves the performance of “spin-wait loops”
MASKMOVDQU Non-temporal store of selected bytes from an XMM register into memory
MOVNTPD    Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
MOVNTDQ    Non-temporal store of double quadword from an XMM register into memory
MOVNTI     Non-temporal store of a doubleword from a general-purpose register into memory
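Non-temporal stores and prefetch hints are typically used together with a fence, as in the following minimal C sketch. The function names, the alignment and size preconditions, and the prefetch distance of 64 elements are all assumptions made for illustration; a suitable prefetch distance depends on the processor and the workload and would need to be measured.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE header) */
#include <stddef.h>
#include <stdint.h>

/* Fill a buffer with non-temporal stores (MOVNTDQ), which write to memory
   without keeping the data in the caches.
   Assumes dst is 16-byte aligned and count is a multiple of 4. */
void stream_fill(int32_t *dst, size_t count, int32_t value) {
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < count; i += 4) {
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
    /* SFENCE: make the non-temporal stores globally visible before
       any later stores. */
    _mm_sfence();
}

/* Sum an array while hinting upcoming data into the cache (PREFETCHT0).
   The prefetch is only a hint and never changes the result. */
int64_t sum_with_prefetch(const int32_t *src, size_t count) {
    int64_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        if ((i & 15) == 0 && i + 64 < count) {
            _mm_prefetch((const char *)(src + i + 64), _MM_HINT_T0);
        }
        total += src[i];
    }
    return total;
}
```

A caller would allocate the destination with 16-byte alignment, for example with `_mm_malloc(bytes, 16)`, before passing it to `stream_fill`.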