4. Instruction tables

Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs

By Agner Fog. Technical University of Denmark.
Copyright © 1996 – 2016. Last updated 2016-01-09.

Introduction

This is the fourth in a series of five manuals:

1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.
2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers.
4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs.
5. Calling conventions for different C++ compilers and operating systems.

The latest versions of these manuals are always available from www.agner.org/optimize. Copyright conditions are listed below.

The present manual contains tables of instruction latencies, throughputs and micro-operation breakdown and other tables for x86 family microprocessors from Intel, AMD and VIA.

The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower than the values published elsewhere. The discrepancies can be explained by the following factors:

● My figures are experimental values while figures published by microprocessor vendors may be based on theory or simulations.
● My figures are obtained with a particular test method under particular conditions. It is possible that different values can be obtained under other conditions.
● Some latencies are difficult or impossible to measure accurately, especially for memory access and type conversions that cannot be chained.
● Latencies for moving data from one execution unit to another are listed explicitly in some of my tables while they are included in the general latencies in some tables published by Intel.

Most values are the same in all microprocessor modes (real, virtual, protected, 16-bit, 32-bit, 64-bit). Values for far calls and interrupts may be different in different modes. Call gates have not been tested.

Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

If any text in the pdf version of this manual is unreadable, then please refer to the spreadsheet version.

Copyright notice

This series of five manuals is copyrighted by Agner Fog. Public distribution and mirroring is not allowed. Non-public distribution to a limited audience for educational purposes is allowed. The code examples in these manuals can be used without restrictions. A GNU Free Documentation License shall automatically come into force when I die. See www.gnu.org/copyleft/fdl.html

Definition of terms

Instruction
The instruction name is the assembly code for the instruction. Multiple instructions or multiple variants of the same instruction may be joined into the same line. Instructions with and without a 'v' prefix to the name have the same values unless otherwise noted.

Operands
Operands can be different types of registers, memory, or immediate constants.
Abbreviations used in the tables are: i = immediate constant, r = any general purpose register, r32 = 32-bit register, etc., mm = 64 bit mmx register, x or xmm = 128 bit xmm register, y = 256 bit ymm register, z = 512 bit zmm register, v = any vector register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Latency
The latency of an instruction is the delay that the instruction generates in a dependency chain. The measurement unit is clock cycles. Where the clock frequency is varied dynamically, the figures refer to the core clock frequency. The numbers listed are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity may increase the latencies by possibly more than 100 clock cycles on many processors, except in move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results may give a similar delay. A missing value in the table means that the value has not been measured or that it cannot be measured in a meaningful way.

Some processors have a pipelined execution unit that is smaller than the largest register size so that different parts of the operand are calculated at different times. Assume, for example, that we have a long dependency chain of 128-bit vector instructions running in a fully pipelined 64-bit execution unit with a latency of 4. The lower 64 bits of each operation will be calculated at times 0, 4, 8, 12, 16, etc., and the upper 64 bits of each operation will be calculated at times 1, 5, 9, 13, 17, etc., as shown in the figure below. If we look at one 128-bit instruction in isolation, the latency will be 5. But if we look at a long chain of 128-bit instructions, the total latency will be 4 clock cycles per instruction plus one extra clock cycle at the end.
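The arithmetic of this example can be checked with a small model. The following Python sketch is only an illustration and is not part of the test programs used for the tables. The function name chain_total_latency and its parameters are hypothetical. It assumes that each part of an operand enters the pipeline one clock cycle after the previous part, and that the first part of the next instruction in the chain starts as soon as the first part of the current instruction is ready.

```python
def chain_total_latency(n_instructions, unit_latency=4, n_parts=2):
    """Total latency of a dependency chain in a toy model of a split unit.

    Each instruction is processed in n_parts pieces (e.g. the two 64-bit
    halves of a 128-bit operand). Assumptions of this sketch: part p of an
    instruction enters the pipeline p clock cycles after part 0, and part 0
    of the next instruction starts as soon as part 0 of the current one is
    ready, i.e. unit_latency clock cycles later.
    """
    first_part_start = 0
    finish = 0
    for _ in range(n_instructions):
        last_part_start = first_part_start + (n_parts - 1)
        finish = last_part_start + unit_latency  # whole result is ready
        first_part_start += unit_latency         # next instruction in chain
    return finish


print(chain_total_latency(1))    # 5
print(chain_total_latency(100))  # 401
```

With the values from the text, the model gives 5 clock cycles for a single instruction and 4 * N + 1 clock cycles for a chain of N instructions.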
The latency in this case is listed as 4 in the tables because this is the value it adds to a dependency chain.

Reciprocal throughput
The throughput is the maximum number of instructions of the same kind that can be executed per clock cycle when the operands of each instruction are independent of the preceding instructions. The values listed are the reciprocals of the throughputs, i.e. the average number of clock cycles per instruction when the instructions are not part of a limiting dependency chain. For example, a reciprocal throughput of 2 for FMUL means that a new FMUL instruction can start executing 2 clock cycles after a previous FMUL. A reciprocal throughput of 0.33 for ADD means that the execution units can handle 3 integer additions per clock cycle.

The reason for listing the reciprocal values is that this makes comparisons between latency and throughput easier. The reciprocal throughput is also called issue latency.

The values listed are for a single thread or a single core. A missing value in the table means that the value has not been measured.

μops
Uop or μop is an abbreviation for micro-operation. Processors with out-of-order cores are capable of splitting complex instructions into μops. For example, a read-modify instruction may be split into a read-μop and a modify-μop. The number of μops that an instruction generates is important when certain bottlenecks in the pipeline limit the number of μops per clock cycle.

Execution unit
The execution core of a microprocessor has several execution units. Each execution unit can handle a particular category of μops, for example floating point additions. The information about which execution unit a particular μop goes to can be useful for two purposes. Firstly, two μops cannot execute simultaneously if they need the same execution unit.
And secondly, some processors have a latency of an extra clock cycle when the result of a μop executing in one execution unit is needed as input for a μop in another execution unit.

Execution port
The execution units are clustered around a few execution ports on most Intel processors. Each μop passes through an execution port to get to the right execution unit. An execution port can be a bottleneck because it can handle only one μop at a time. Two μops cannot execute simultaneously if they need the same execution port, even if they are going to different execution units.

Instruction set
This indicates which instruction set an instruction belongs to. The instruction is only available in processors that support this instruction set. The different instruction sets are listed at the end of this manual. Availability in processors prior to 80386 does not apply for 32-bit and 64-bit operands. Availability in the MMX instruction set does not apply to 128-bit packed integer instructions, which require SSE2. Availability in the SSE instruction set does not apply to double precision floating point instructions, which require SSE2. 32-bit instructions are available in 80386 and later. 64-bit instructions in general purpose registers are available only under 64-bit operating systems. Instructions that use XMM registers (SSE and later) are only available under operating systems that support this register set. Instructions that use YMM registers (AVX and later) are only available under operating systems that support this register set.

How the values were measured

The values in the tables are measured with the use of my own test programs, which are available from www.agner.org/optimize/testp.zip

The time unit for all measurements is CPU clock cycles. An attempt is made to obtain the highest clock frequency if the clock frequency varies with the workload. Many Intel processors have a performance counter named "core clock cycles".
This counter gives measurements that are independent of the varying clock frequency. Where no "core clock cycles" counter is available, the "time stamp counter" is used (RDTSC instruction). In cases where this gives inconsistent results (e.g. in AMD Bobcat) it is necessary to make the processor boost the clock frequency by executing a large number of instructions (> 1 million) or to turn off the power-saving feature in the BIOS setup.

Instruction throughputs are measured with a long sequence of instructions of the same kind, where subsequent instructions use different registers in order to avoid dependence of each instruction on the previous one. The input registers are cleared in the cases where it is impossible to use different registers. The test code is carefully constructed in each case to make sure that no bottleneck other than the one being measured limits the throughput.

Instruction latencies are measured in a long dependency chain of identical instructions where the output of each instruction is needed as input for the next instruction.

The sequence of instructions should be long, but not so long that it doesn't fit into the level-1 code cache. A typical length is 100 instructions of the same type. This sequence is repeated in a loop if a larger number of instructions is desired.

It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables.
But in fact, the only value that makes sense for performance optimization is the sum of the write time and the read time.

A similar problem occurs where the input and the output of an instruction use different types of registers. For example, the MOVD instruction can transfer data between general purpose registers and XMM vector registers. The value that can be measured is the combined latency of data transfer from one type of registers to another type and back again (A → B → A). The division of this latency between the A → B latency and the B → A latency is sometimes obvious, sometimes based on guesswork, µop counts, indirect evidence, or triangular sequences such as A → B → Memory → A. In many cases, however, the division of the total latency between A → B latency and B → A latency is arbitrary. However, what cannot be measured cannot matter for performance optimization. What counts is the sum of the A → B latency and the B → A latency, not the individual terms.

The µop counts are usually measured with the use of the performance monitor counters (PMCs) that are built into modern microprocessors. The PMCs for VIA processors are undocumented, and the interpretation of these PMCs is based on experimentation.

The execution ports and execution units that are used by each instruction or µop are detected in different ways depending on the particular microprocessor. Some microprocessors have PMCs that can give this information directly. In other cases it is necessary to obtain this information indirectly by testing whether a particular instruction or µop can execute simultaneously with another instruction/µop that is known to go to a particular execution port or execution unit. On some processors, there is a delay for transmitting data from one execution unit (or cluster of execution units) to another. This delay can be used for detecting whether two different instructions/µops are using the same or different execution units.
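The measurement principle described in this section can be illustrated with a toy model. The sketch below does not run any machine instructions and is not taken from the test programs. It assumes a single hypothetical execution unit with a given latency and reciprocal throughput, and the function name simulate_sequence is made up for this illustration. It shows why a dependency chain reveals the latency while a sequence of independent instructions reveals the reciprocal throughput.

```python
def simulate_sequence(count, latency, recip_throughput, dependent):
    """Return the clock cycle at which the last of `count` identical
    instructions finishes in a toy model of one pipelined execution unit.

    dependent=True models a dependency chain: each instruction needs the
    result of the previous one. dependent=False models independent
    instructions that are limited only by the reciprocal throughput.
    """
    start = 0   # cycle at which the current instruction starts executing
    finish = 0  # cycle at which its result is ready
    for i in range(count):
        if i > 0:
            # The unit accepts a new instruction recip_throughput cycles
            # after the previous one started.
            start = start + recip_throughput
            if dependent:
                # The input is not available before the previous result.
                start = max(start, finish)
        finish = start + latency
    return finish


# Hypothetical instruction with latency 4 and reciprocal throughput 1,
# measured with a sequence of 100 instructions as in the text.
N = 100
chain_cycles = simulate_sequence(N, latency=4, recip_throughput=1, dependent=True)
indep_cycles = simulate_sequence(N, latency=4, recip_throughput=1, dependent=False)

print("clock cycles per instruction in the chain:", chain_cycles / N)    # 4.0
print("clock cycles per independent instruction:", indep_cycles / N)     # 1.03
```

In this model the independent sequence gives slightly more than 1 because the last instruction still needs its full latency to finish. Repeating the sequence in a loop makes this constant overhead negligible.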
Instruction sets

Explanation of instruction sets for x86 processors

x86: This is the name of the common instruction set, supported by all processors in this lineage.
80186: This is the first extension to the x86 instruction set. New integer instructions: PUSH i, PUSHA, POPA, IMUL r,r,i, BOUND, ENTER, LEAVE, shifts and rotates by immediate ≠ 1.
80286: System instructions for 16-bit protected mode.
80386: The eight general purpose registers are extended from 16 to 32 bits. 32-bit addressing. 32-bit protected mode. Scaled index addressing. MOVZX, MOVSX, IMUL r,r, SHLD, SHRD, BT, BTR, BTS, BTC, BSF, BSR, SETcc.
80486: BSWAP. Later versions have CPUID.
x87: This is the floating point instruction set. Supported when an 8087 or later coprocessor is present. Some 486 processors and all processors since Pentium/K5 have built-in support for floating point instructions without the need for a coprocessor.
80287: FSTSW AX
80387: FPREM1, FSIN, FCOS, FSINCOS.
Pentium: RDTSC, RDPMC.
PPro: Conditional move (CMOV, FCMOV) and fast floating point compare (FCOMI) instructions introduced in Pentium Pro. These instructions are not supported in Pentium MMX, but are supported in all processors with SSE and later.
MMX: Integer vector instructions with packed 8, 16 and 32-bit integers in the 64-bit MMX registers MM0 - MM7, which are aliased upon the floating point stack registers ST(0) - ST(7).
SSE: Single precision floating point scalar and vector instructions in the new 128-bit XMM registers XMM0 - XMM7. PREFETCH, SFENCE, FXSAVE, FXRSTOR, MOVNTQ, MOVNTPS. The use of XMM registers requires operating system support.
SSE2: Double precision floating point scalar and vector instructions in the 128-bit XMM registers XMM0 - XMM7. 64-bit integer arithmetics in the MMX registers. Integer vector instructions with packed 8, 16, 32 and 64-bit integers in the XMM registers. MOVNTI, MOVNTPD, PAUSE, LFENCE, MFENCE.
SSE3: FISTTP, LDDQU, MOVDDUP, MOVSHDUP, MOVSLDUP, ADDSUBPS, ADDSUBPD, HADDPS, HADDPD, HSUBPS, HSUBPD.
SSSE3: (Supplementary SSE3): PSHUFB, PHADDW, PHADDSW, PHADDD, PMADDUBSW, PHSUBW, PHSUBSW, PHSUBD, PSIGNB, PSIGNW, PSIGND, PMULHRSW, PABSB, PABSW, PABSD, PALIGNR.
64 bit: This instruction set is called x86-64, x64, AMD64 or EM64T. It defines a new 64-bit mode with 64-bit addressing and the following extensions: The general purpose registers are extended to 64 bits, and the number of general purpose registers is extended from eight to sixteen. The number of XMM registers is also extended from eight to sixteen, but the number of MMX and ST registers is still eight. Data can be addressed relative to the instruction pointer. There is no way to get access to these extensions in 32-bit mode. Most instructions that involve segmentation are not available in 64 bit mode. Direct far jumps and calls are not allowed, but indirect far jumps, indirect far calls and far returns are allowed. These are used in system code for switching mode. Segment registers DS, ES, and SS cannot be used. The FS and GS segments and segment prefixes are available in 64 bit mode and are used for addressing thread environment blocks and processor environment blocks.
Instructions not available in 64 bit mode: The following instructions are not available in 64-bit mode: PUSHA, POPA, BOUND, INTO, BCD instructions: AAA, AAS, DAA, DAS, AAD, AAM, undocumented instructions (SALC, ICEBP, 82H alias for 80H opcode), SYSENTER, SYSEXIT, ARPL. On some early Intel processors, LAHF and SAHF are not available in 64 bit mode. Increment and decrement register instructions cannot be coded in the short one-byte opcode form because these codes have been reassigned as REX prefixes. Most instructions that involve segmentation are not available in 64 bit mode. Direct far jumps and calls are not allowed, but indirect far jumps, indirect far calls and far returns are allowed.
These are used in system code for switching mode. PUSH CS, PUSH DS, PUSH ES, PUSH SS, POP DS, POP ES, POP SS, LDS and LES instructions are not allowed. CS, DS, ES and SS prefixes are allowed but ignored. The FS and GS segments and segment prefixes are available in 64 bit mode and are used for addressing thread environment blocks and processor environment blocks.
Monitor: The instructions MONITOR and MWAIT are available in some Intel and AMD multiprocessor CPUs with SSE3.
SSE4.1: MPSADBW, PHMINPOSUW, PMULDQ, PMULLD, DPPS, DPPD, BLEND.., PMIN.., PMAX.., ROUND.., INSERT.., EXTRACT.., PMOVSX.., PMOVZX.., PTEST, PCMPEQQ, PACKUSDW, MOVNTDQA.
SSE4.2: CRC32, PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, PCMPGTQ, POPCNT.
AES: AESDEC, AESDECLAST, AESENC, AESENCLAST, AESIMC, AESKEYGENASSIST.
CLMUL: PCLMULQDQ.
AVX: The 128-bit XMM registers are extended to 256-bit YMM registers with room for further extension in the future. The use of YMM registers requires operating system support. Floating point vector instructions are available in 256-bit versions. Almost all previous XMM instructions now have two versions: with and without zero-extension into the full YMM register. The zero-extension versions have three operands in most cases. Furthermore, the following instructions are added in AVX: VBROADCASTSS, VBROADCASTSD, VEXTRACTF128, VINSERTF128, VLDMXCSR, VMASKMOVPS, VMASKMOVPD, VPERMILPD, VPERMIL2PD, VPERMILPS, VPERMIL2PS, VPERM2F128, VSTMXCSR, VZEROALL, VZEROUPPER.
AVX2: Integer vector instructions are available in 256-bit versions. Furthermore, the following instructions are added in AVX2: ANDN, BEXTR, BLSI, BLSMSK, BLSR, BZHI, INVPCID, LZCNT, MULX, PEXT, PDEP, RORX, SARX, SHLX, SHRX, TZCNT, VBROADCASTI128, VBROADCASTSS, VBROADCASTSD, VEXTRACTI128, VGATHERDPD, VGATHERQPD, VGATHERDPS, VGATHERQPS, VPGATHERDD, VPGATHERQD, VPGATHERDQ, VPGATHERQQ, VINSERTI128, VPERM2I128, VPERMD, VPERMPD, VPERMPS, VPERMQ, VPMASKMOVD, VPMASKMOVQ, VPSLLVD, VPSLLVQ, VPSRAVD, VPSRLVD, VPSRLVQ.
FMA3:
(FMA): Fused multiply and add instructions: VFMADDxxxPD, VFMADDxxxPS, VFMADDxxxSD, VFMADDxxxSS, VFMADDSUBxxxPD, VFMADDSUBxxxPS, VFMSUBADDxxxPD, VFMSUBADDxxxPS, VFMSUBxxxPD, VFMSUBxxxPS, VFMSUBxxxSD, VFMSUBxxxSS, VFNMADDxxxPD, VFNMADDxxxPS, VFNMADDxxxSD, VFNMADDxxxSS, VFNMSUBxxxPD, VFNMSUBxxxPS, VFNMSUBxxxSD, VFNMSUBxxxSS.
FMA4: Same as Intel FMA, but with 4 different operands according to a preliminary Intel specification which is now supported only by AMD. Intel's FMA specification has later been changed to FMA3, which is now also supported by AMD.
MOVBE: MOVBE
POPCNT: POPCNT
PCLMUL: PCLMULQDQ
XSAVE:
XSAVEOPT:
RDRAND: RDRAND
RDSEED: RDSEED
BMI1: ANDN, BEXTR, BLSI, BLSMSK, BLSR, LZCNT, TZCNT
BMI2: BZHI, MULX, PDEP, PEXT, RORX, SARX, SHRX, SHLX
ADX: ADCX, ADOX, CLAC
AVX512F: The 256-bit YMM registers are extended to 512-bit ZMM registers. The number of vector registers is extended to 32 in 64-bit mode, while there are still only 8 vector registers in 32-bit mode. 8 new vector mask registers k0 – k7. Masked vector instructions. Many new instructions. Single- and double precision floating point vectors are always supported. Other instructions are supported if the various optional AVX512 variants, listed below, are supported as well.
AVX512BW: Vectors of 8-bit and 16-bit integers in ZMM registers.
AVX512DQ: Vectors of 32-bit and 64-bit integers in ZMM registers.
AVX512VL: The vector operations defined for 512-bit vectors in the various AVX512 subsets, including masked operations, can be applied to 128-bit and 256-bit vectors as well.
AVX512CD: Conflict detection instructions
AVX512ER: Approximate exponential function, reciprocal and reciprocal square root
AVX512PF: Gather and scatter prefetch
SHA: Secure hash algorithm
MPX: Memory protection extensions
SMAP: CLAC, STAC
CVT16: VCVTPH2PS, VCVTPS2PH.
3DNow: (AMD only. Obsolete). Single precision floating point vector instructions in the 64-bit MMX registers. Only available on AMD processors.
The 3DNow instructions are: FEMMS, PAVGUSB, PF2ID, PFACC, PFADD, PFCMPEQ/GT/GE, PFMAX, PFMIN, PFRCP/IT1/IT2, PFRSQRT/IT1, PFSUB, PFSUBR, PI2FD, PMULHRW, PREFETCH/W.
3DNowE: (AMD only. Obsolete). PF2IW, PFNACC, PFPNACC, PI2FW, PSWAPD.
PREFETCHW: This instruction has survived from 3DNow and now has its own feature name.
PREFETCHWT1: PREFETCHWT1
SSE4A: (AMD only). EXTRQ, INSERTQ, LZCNT, MOVNTSD, MOVNTSS, POPCNT. (POPCNT shared with Intel SSE4.2).
XOP: (AMD only). VFRCZPD, VFRCZPS, VFRCZSD, VFRCZSS, VPCMOV, VPCOMB, VPCOMD, VPCOMQ, VPCOMW, VPCOMUB, VPCOMUD, VPCOMUQ, VPCOMUW, VPHADDBD, VPHADDBQ, VPHADDBW, VPHADDDQ, VPHADDUBD, VPHADDUBQ, VPHADDUBW, VPHADDUDQ, VPHADDUWD, VPHADDUWQ, VPHADDWD, VPHADDWQ, VPHSUBBW, VPHSUBDQ, VPHSUBWD, VPMACSDD, VPMACSDQH, VPMACSDQL, VPMACSSDD, VPMACSSDQH, VPMACSSDQL, VPMACSSWD, VPMACSSWW, VPMACSWD, VPMACSWW, VPMADCSSWD, VPMADCSWD, VPPERM, VPROTB, VPROTD, VPROTQ, VPROTW, VPSHAB, VPSHAD, VPSHAQ, VPSHAW, VPSHLB, VPSHLD, VPSHLQ, VPSHLW.

Microprocessor versions tested

The tables in this manual are based on testing of the following microprocessors

Processor name: AMD K7 Athlon; AMD K8 Opteron; AMD K10 Opteron; AMD Bulldozer; AMD Piledriver; AMD Steamroller; AMD Bobcat; AMD Kabini; Intel Pentium; Intel Pentium MMX; Intel Pentium II; Intel Pentium III; Intel Pentium 4; Intel Pentium 4 EM64T; Intel Pentium M; Intel Core Duo; Intel Core 2 (65 nm); Intel Core 2 (45 nm); Intel Core i7; Intel 2nd gen. Core; Intel 3rd gen. Core; Intel 4th gen. Core; Intel 5th gen. Core; Intel 6th gen. Core; Intel Atom 330; Intel Bay Trail; VIA Nano L2200; VIA Nano L3050
Microarchitecture code name: Bulldozer, Zambezi; Piledriver; Steamroller, Kaveri; Bobcat; Jaguar; P5; P5; P6; P6; Netburst; Netburst, Prescott; Dothan; Yonah; Merom; Wolfdale; Nehalem; Sandy Bridge; Ivy Bridge; Haswell; Broadwell; Skylake; Diamondville; Silvermont; Isaiah
Family number (hex): 6; F; 10; 15; 15; 15; 14; 16; 5; 5; 6; 6; F; F; 6; 6; 6; 6; 6; 6; 6; 6; 6; 6; 6; 6; 6; 6
Model number (hex): 6; 5; 2; 1; 2; 30; 1; 0; 2; 4; 6; 7; 2; 4; D; E; F; 17; 1A; 2A; 3A; 3C; 56; 5E; 1C; 37; F; F
Comment: Step. 2, rev. A5; Stepping A; 2350, step. 1; FX-6100, step 2; FX-8350, step 0. And others; A10-7850K, step 1; E350, step. 0; A4-5000, step 1; Stepping 4; Stepping 4, rev. B0; Xeon. Stepping 1; Stepping 6, rev. B1; Not fully tested; T5500, Step. 6, rev. B2; E8400, Step. 6; i7-920, Step. 5, rev. D0; i5-2500, Step 7; i7-3770K, Step 9; i7-4770K, step. 3; D1540, step 2; Step. 3; Step. 2; Step. 3; Step. 2; Step. 8 (prerelease sample)

AMD K7

List of instruction timings and macro-operation breakdown

Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).
Reciprocal throughput: This is also called issue latency.
This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.

Integer instructions

Instruction  Operands   Ops  Latency  Reciprocal throughput  Execution unit  Notes
Move instructions
MOV          r,r        1    1        1/3                    ALU
MOV          r,i        1    1        1/3                    ALU
MOV          r8,m8      1    4        1/2                    ALU, AGU        Any addr. mode. Add 1 clk if code segment base ≠ 0
MOV          r16,m16    1    4        1/2                    ALU, AGU        do.
MOV          r32,m32    1    3        1/2                    AGU             do.
MOV          m8,r8H     1    8        1/2                    AGU             AH, BH, CH, DH
MOV          m8,r8L     1    2        1/2                    AGU             Any other 8-bit register
MOV          m16/32,r   1    2        1/2                    AGU             Any addressing mode
MOV          m,i        1    2        1/2                    AGU
MOV          r,sr       1    2        1

MOV MOVZX, MOVSX MOVZX, MOVSX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POP POPF(D) POPA(D) LEA LEA LAHF SAHF SALC LDS, LES, ...
BSWAP Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL IMUL sr,r/m r,r r,m r,r r,m r,r 6 1 1 1 1 3 9-13 1 4 1 r,m 3 2 1 1 2 2 1 9 2 3 6 9 2 9 2 1 4 2 1 10 1 16 5 1 1 7 1 1 7 1 r8/m8 1 1 1 1 1 1 1 1 1 1 9 12 16 4 31 3 1 7 5 6 7 5 13 3 r16/m16 r32/m32 r16,r16/m16 3 3 2 3 4 3 r i m sr r m DS/ES/FS/GS SS r16,[m] r32,[m] r,m r r,r/i r,m m,r r,r/i r,m m,r/i r,r/i r,m r m 2 3 2 3 2 1 1 Page 11 8 1/3 1/2 1/3 1/2 1 16 1 1 1 1 1 4 1 1 10 18 1 4 1 1/3 2 2 1 9 1/3 1/3 1/2 2.5 1/3 1/2 2.5 1/3 1/2 1/3 3 5 6 7 ALU ALU, AGU ALU ALU, AGU ALU Timing depends ALU, AGU on hw ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU AGU Any addr. size AGU Any addr. size ALU ALU ALU ALU 2 ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU0 ALU ALU0 2 3 2 ALU0_1 ALU0_1 ALU0 latency ax=3, dx=4 AMD K7 IMUL IMUL IMUL IMUL IMUL DIV DIV DIV IDIV IDIV IDIV IDIV IDIV IDIV CBW, CWDE CWD, CDQ Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,ROR RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC, BTR, BTS BSF BSR r32,r32/m32 r16,(r16),i r32,(r32),i r16,m16,i r32,m32,i r8/m8 r16/m16 r32/m32 r8 r16 r32 m8 m16 m32 2 2 2 3 3 32 47 79 41 56 88 42 57 89 1 1 r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r r,r r,r 1 1 1 1 1 1 1 1 1 1 9 7 9 7 1 1 10 9 9 8 6 7 8 1 1 5 2 5 4 8 19 23 4 4 5 24 24 40 17 25 41 17 25 41 1 1 1 1 7 1 1 1 7 1 1 1 4 3 3 3 7 7 5 8 6 7 4 4 7 1 2 7 7 6 7 9 Page 12 2.5 1 2 2 2 23 23 40 17 25 41 17 25 41 1/3 1/3 ALU0 ALU0 ALU0 ALU0 ALU0 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU 1/3 1/2 2.5 1/3 1/2 1/3 2.5 1/3 1/3 1/3 4 3 3 3 3 4 4 4 4 3 2 3 3 1/3 
1/2 2 1 2 2 3 7 9 ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU ALU AMD K7 BSF BSR SETcc SETcc CLC, STC CMC CLD STD r,m r,m r m Control transfer instructions JMP short/near JMP JMP JMP 20 23 1 1 1 1 2 3 8 10 1 1 1 8 10 1/3 1/2 1/3 1/3 1 2 ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU 2 ALU low values = real mode far r m(near) 16-20 1 1 23-32 m(far) short/near short short near 17-21 1 2 7 3 25-33 CALL CALL CALL far r m(near) 16-22 4 5 23-32 3 3 CALL RETN RETN m(far) 16-22 2 2 24-33 3 3 15-23 24-35 i 15-24 32 33 24-35 81 42 m 6 2 INTO 2 2 String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS 4 5 4 3 7 4 5 5 7 6 JMP Jcc J(E)CXZ LOOP CALL i RETF RETF IRET INT BOUND i 2 2 3-4 2 2 2 2 1 3 1-4 2 2 6 3-4 Page 13 1/3 - 2 1/3 - 2 3-4 2 ALU ALU, AGU ALU ALU ALU ALU low values = real mode rcp. t.= 2 if jump rcp. 
t.= 2 if jump low values = real mode 3 3 ALU ALU, AGU low values = real mode 3 3 2 2 2 1 3 1-4 2 2 6 3-4 ALU ALU low values = real mode low values = real mode real mode real mode values are for no jump values are for no jump values per count values per count values per count values per count values per count AMD K7 Other NOP (90) Long NOP (0F 1F) ENTER 1 1 i,0 LEAVE CLI STI CPUID RDTSC RDPMC 3 8-9 16-17 19-28 5 9 0 0 12 1/3 1/3 12 ALU ALU 12 3 ops, 5 clk if 16 bit 3 5 27 44-74 11 11 Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 Operands r m32/64 m80 m80 r m32/64 m80 m80 r m m FCMOVcc FFREE FINCSTP, FDECSTP st0,r r FNSTSW FSTSW FNSTSW FNSTCW Ops 1 1 7 30 1 1 10 260 1 1 1 1 Latency Reciprocal throughput 2 4 16 41 2 3 7 0 9 7 1/2 1/2 4 39 1/2 1 5 188 0.4 1 1 1 Execution unit Notes FA/M FANY FA/M FMISC FMISC FMISC, FA/M FMISC 42 Low latency immediately after FMISC, FA/M FCOMI FANY FANY Low latency immediately after FMISC, ALU FCOM FTST FMISC, ALU do. FMISC, ALU do. FMISC, ALU faster if FMISC, ALU unchanged 4 4 4 4 1 1-2 1 2 FADD FADD,FMISC FMUL FMUL,FMISC 11-25 8-22 FMUL 9 1 1 6 AX AX m16 m16 2 3 2 3 6-12 6-12 FLDCW m16 14 Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL r/m m r/m m 1 2 1 2 FDIV(R)(P) r/m 1 0 Page 14 5 1/3 1/3 12 12 8 1 Low values are for round divisors AMD K7 FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 m 2 1 1 1 1 2 1 2 5 1 1 12-26 2 2 2 3 Math FSQRT FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 1 44 51 76 46 72 5 7 8 49 63 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR 1 1 7 25 76 65 44 85 r/m r m 9-23 1 1 1 1 1 1 2 3 8 8 FMUL,FMISC FMUL FADD FADD FADD 35 90-100 90-100 100-150 100-200 160-170 8 11 27 126 147 12 FMUL 0 0 1/3 1/3 24 92 147 120 59 87 FANY ALU FMISC FMISC 2 10 7-10 8-11 do. 
FADD, FMISC FADD FMISC, ALU FMUL FMUL Integer MMX instructions Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVNTQ PACKSSWB/DW PACKUSWB Operands Ops Latency Reciprocal throughput r32, mm mm, r32 mm,m32 m32, r mm,mm mm,m64 m64,mm m,mm 2 2 1 1 1 1 1 1 7 9 mm,r/m 1 2 2 Page 15 Execution unit 2 2 1/2 1 1/2 1/2 1 2 FMICS, ALU FANY, ALU FANY FMISC FA/M FANY FMISC FMISC 2 FA/M Notes AMD K7 PUNPCKH/LBW/WD PSHUFW MASKMOVQ PMOVMSKB PEXTRW PINSRW mm,r/m mm,mm,i mm,mm r32,mm r32,mm,i mm,r32,i 1 1 32 3 2 2 mm,r/m mm,r/m 2 2 5 12 2 1/2 24 3 2 2 FA/M FA/M FADD FMISC, ALU FA/M 1 1 2 2 1/2 1/2 FA/M FA/M mm,r/m mm,r/m mm,r/m mm,r/m mm,r/m 1 1 1 1 1 3 3 2 2 3 1 1 1/2 1/2 1 FMUL FMUL FA/M FA/M FADD mm,r/m 1 2 1/2 FA/M mm,i/mm/m 1 2 1/2 FA/M 1/3 FANY Arithmetic instructions PADDB/W/D PADDSB/W PADDUSB/W PSUBB/W/D PSUBSB/W PSUBUSB/W PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMADDWD PAVGB/W PMIN/MAX SW/UB PSADBW Logic PAND PANDN POR PXOR PSLL/RLW/D/Q PSRAW/D Other EMMS 1 Floating point XMM instructions Instruction Move instructions MOVAPS MOVAPS MOVAPS MOVUPS MOVUPS MOVUPS MOVSS MOVSS MOVSS MOVHLPS, MOVLHPS MOVHPS, MOVLPS MOVHPS, MOVLPS MOVNTPS MOVMSKPS SHUFPS Operands Ops r,r r,m m,r r,r r,m m,r r,r r,m m,r r,r r,m m,r m,r r32,r r,r/m,i 2 2 2 2 5 5 1 2 1 1 1 1 2 3 3 Latency Reciprocal throughput 2 2 2 4 3 2 3 Page 16 1 2 2 1 2 2 1 1 1 1/2 1/2 1 4 2 3 Execution unit FA/M FMISC FMISC FA/M FA/M FANY FMISC FMISC FA/M FMISC FMISC FMISC FADD FMUL Notes AMD K7 UNPCK H/L PS Conversion CVTPI2PS CVT(T)PS2PI CVTSI2SS CVT(T)SS2SI Arithmetic ADDSS SUBSS ADDPS SUBPS MULSS MULPS r,r/m 2 3 xmm,mm mm,xmm xmm,r32 r32,xmm 1 1 4 2 4 6 r,r/m r,r/m r,r/m r,r/m 1 2 1 2 4 4 4 4 3 FMUL 10 3 FMISC FMISC FMISC FMISC 1 2 1 2 FADD FADD FMUL FMUL DIVSS DIVPS RCPSS RCPPS MAXSS MINSS MAXPS MINPS CMPccSS CMPccPS COMISS UCOMISS r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 2 1 2 1 2 1 2 1 11-16 18-30 3 3 2 2 2 2 2 8-13 18-30 1 2 1 2 1 2 1 FMUL FMUL FMUL FMUL FADD FADD FADD FADD FADD Logic 
Instruction                       Operands  Ops  Latency  Reciprocal throughput  Execution unit
ANDPS/D ANDNPS/D ORPS/D XORPS/D   r,r/m     2    2        2                      FMUL

Math
SQRTSS                            r,r/m     1    19       16                     FMUL
SQRTPS                            r,r/m     2    36       36                     FMUL
RSQRTSS                           r,r/m     1    3        1                      FMUL
RSQRTPS                           r,r/m     2    3        2                      FMUL

Other
LDMXCSR                           m         8             9
STMXCSR                           m         3             10

Notes for the DIVSS and DIVPS rows above: Low values are for round divisors, e.g. powers of 2.

3DNow instructions (obsolete)

Instruction        Operands  Ops  Latency  Reciprocal throughput  Execution unit  Notes

Move and convert instructions
PREFETCH(W)        m         1             1/2                    AGU
PF2ID              mm,mm     1    5        1                      FMISC
PI2FD              mm,mm     1    5        1                      FMISC
PF2IW              mm,mm     1    5        1                      FMISC           3DNow E
PI2FW              mm,mm     1    5        1                      FMISC           3DNow E
PSWAPD             mm,mm     1    2        1/2                    FA/M            3DNow E

Integer instructions
PAVGUSB            mm,mm     1    2        1/2                    FA/M
PMULHRW            mm,mm     1    3        1                      FMUL

Floating point instructions
PFADD/SUB/SUBR     mm,mm     1    4        1                      FADD
PFCMPEQ/GE/GT      mm,mm     1    2        1                      FADD
PFMAX/MIN          mm,mm     1    2        1                      FADD
PFMUL              mm,mm     1    4        1                      FMUL
PFACC              mm,mm     1    4        1                      FADD
PFNACC, PFPNACC    mm,mm     1    4        1                      FADD            3DNow E
PFRCP              mm,mm     1    3        1                      FMUL
PFRCPIT1/2         mm,mm     1    4        1                      FMUL
PFRSQRT            mm,mm     1    3        1                      FMUL
PFRSQIT1           mm,mm     1    4        1                      FMUL

Other
FEMMS              mm,mm     1             1/3                    FANY

AMD K8

List of instruction timings and macro-operation breakdown

Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays.
The latency listed does not include the memory operand where the operand is listed as register or memory (r/m). Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline. Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALU's. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units. Integer instructions Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI Operands Ops Latency Reciprocal Execution throughput unit r,r r,i r8,m8 r16,m16 r32,m32 r64,m64 m8,r8H 1 1 1 1 1 1 1 1 1 4 4 3 3 8 1/3 1/3 1/2 1/2 1/2 1/2 1/2 ALU ALU ALU, AGU ALU, AGU AGU AGU AGU m8,r8L m16/32/64,r m,i m64,i32 r,sr sr,r/m m,r 1 1 1 1 1 6 1 3 3 3 3 2 9-13 1/2 1/2 1/2 1/2 1/2-1 8 2-3 AGU AGU AGU AGU Page 19 AGU Notes Any addressing mode. Add 1 clock if code segment base ≠ 0 AH, BH, CH, DH Any other 8-bit register Any addressing mode K8 MOVZX, MOVSX MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG r,r r,m r64,r32 r64,m32 r,r r,m r,r 1 1 1 1 1 1 3 1 4 1 XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LAHF SAHF SALC LDS, LES, ... 
BSWAP PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE IN OUT r,m 3 2 1 1 2 2 5 9 2 3 4-6 7-9 25 9 2 1 1 4 1 1 10 1 1 1 6 1 7 270 300 16 5 1 1 1 1 2 4 1 1 8 28 10 4 3 2 2 3 1 1 1 1 1 1 1 1 1 1 1 1 9 12 16 4 1 1 7 1 1 7 1 r i m sr r m DS/ES/FS/GS Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD SS r16,[m] r32,[m] r64,[m] r,m r m m r,i/DX i/DX,r r,r/i r,m m,r r,r/i r,m m,r/i r,r/i r,m r m 1 2 1 1 7 5 6 7 5 Page 20 1/3 1/2 1/3 1/2 1/3 1/2 1 ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU 16 ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU AGU AGU AGU ALU ALU ALU 1 1 1 1 2 4 1 1 8 28 10 4 1 1/3 1/3 2 1/3 1/3 9 1/3 1/2 1/2 8 5 16 1/3 1/2 2.5 1/3 1/2 2.5 1/3 1/2 1/3 3 5 6 7 ALU AGU AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU0 Timing depends on hw Any address size Any address size Any address size K8 AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,R OR RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD 31 1 3 2 2 1 1 1 2 1 1 3 3 3 31 46 78 143 40 55 87 152 41 56 88 153 1 1 13 3 3-4 3 4-5 3 3 4 4 3 4 r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r16/m16 r32/m32 r64/m64 r8 r16 r32 r64 m8 m16 m32 m64 15 23 39 71 17 25 41 73 17 25 41 73 1 1 1 2 1 2 1 1 2 1 1 2 2 2 2 15 23 39 71 17 25 41 73 17 25 41 73 1/3 1/3 ALU ALU0 ALU0_1 ALU0_1 ALU0_1 ALU0 ALU0 ALU0_1 ALU0 ALU0 ALU0 ALU0 ALU0 ALU0_1 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU 
ALU ALU ALU ALU 1 1 1 1 1 1 1 1 1 1 9 7 9 7 1 1 7 1 1 1 7 1 1 1 3 3 4 3 1/3 1/2 2.5 1/3 1/2 1/3 2.5 1/3 1/3 1/3 3 3 4 3 ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU 1 1 10 9 9 8 6 7 7 7 9 8 7 8 3 3 3 4 4 4 4 3 3 3 ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU Page 21 latency ax=3, dx=4 latency rax=4, rdx=5 K8 SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC BTR, BTS BSF BSF BSR BSF BSF BSF BSR SETcc SETcc CLC, STC CMC CLD STD m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r m,r r16/32,r r64,r r,r r16,m r32,m r64,m r,m r m Control transfer instructions JMP short/near JMP JMP JMP 8 1 1 5 2 5 4 8 8 21 22 28 20 22 25 28 1 1 1 1 1 2 6 1 2 7 7 5 8 8 9 10 8 9 10 10 1 1 1 far r m(near) 16-20 1 1 23-32 m(far) short/near short short near 17-21 1 2 7 3 25-33 CALL CALL CALL far r m(near) 16-22 4 5 23-32 3 3 CALL RETN RETN m(far) 16-22 2 2 24-33 3 3 15-23 24-35 15-24 32 33 6 2 24-35 81 42 JMP Jcc J(E/R)CXZ LOOP CALL i RETF RETF IRET INT BOUND INTO i i m 3 1/3 1/2 2 1 2 2 5 3 8 9 10 8 9 10 10 1/3 1/2 1/3 1/3 1/3 1/3 ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU 2 ALU low values = real mode 2 2 3-4 2 1/3 - 2 1/3 - 2 3-4 2 Page 22 ALU ALU ALU ALU low values = real mode recip. thrp.= 2 if jump recip. 
thrp.= 2 if jump low values = real mode 3 3 ALU ALU, AGU low values = real mode 3 3 2 2 String instructions ALU ALU, AGU ALU ALU low values = real mode low values = real mode real mode real mode values are for no jump values are for no jump K8 LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS 4 2 5 2 4 2 1.5 - 2 0.5 - 1 7 3 3 1-2 5 2 5 2 2 3 6 2 2 2 2 0.5 - 1 3 1-2 2 2 3 2 Other NOP (90) Long NOP (0F 1F) ENTER LEAVE CLI STI CPUID RDTSC RDPMC 1 0 1 0 i,0 12 2 8-9 16-17 22-50 47-164 6 10 9 12 1/3 1/3 12 3 5 27 values are per count values are per count values are per count values are per count values are per count ALU ALU 12 3 ops, 5 clk if 16 bit 7 7 Floating point x87 instructions Instruction Operands Ops r m32/64 m80 m80 r m32/64 m80 m80 r m m 1 1 7 30 1 1 10 260 1 1 1 1 2 4 16 41 2 3 7 173 0 9 7 1/2 1/2 4 39 1/2 1 5 160 0.4 1 1 1 FCMOVcc FFREE FINCSTP, FDECSTP st0,r r 9 1 1 4-15 4 2 1/3 FNSTSW FSTSW FNSTSW FNSTCW FLDCW AX AX m16 m16 m16 2 3 2 3 18 6-12 6-12 12 12 8 1 50 Arithmetic instructions FADD(P),FSUB(R)(P) r/m 1 4 1 Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 Latency Reciprocal Execution throughput unit 0 Page 23 Notes FA/M FANY FA/M FMISC FMISC FMISC, FA/M FMISC Low latency immediFMISC, FA/M ately after FCOMI FANY FANY Low latency immediately after FCOM FMISC, ALU FTST FMISC, ALU do. FMISC, ALU do. FMISC, ALU FMISC, ALU faster if unchanged FADD K8 FIADD,FISUB(R) FMUL(P) FIMUL m r/m m 2 1 2 4 4 4 1-2 1 2 FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 r/m m 1 2 1 1 1 1 2 1 2 5 1 1 11-25 12-26 2 2 2 3 8-22 9-23 1 1 1 1 1 1 1 3 8 8 Math FSQRT FLDPI, etc. 
FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 1 1 66 73 98 67 97 5 7 53 72 75 27 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR 1 1 8 26 77 70 61 101 r/m r m 2 10 7-10 8-11 140-190 150-190 170-200 150-180 217 8 12 126 179 175 0 0 12 1 FADD,FMISC FMUL FMUL,FMISC Low values are for round divisors FMUL FMUL,FMISC do. FMUL FADD FADD FADD FADD, FMISC FADD FMISC, ALU FMUL FMUL FMUL FMISC 7 1/3 1/3 27 100 171 136 56 95 FANY ALU FMISC FMISC Integer MMX and XMM instructions Instruction Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD Operands r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32, r Ops 2 2 1 3 3 2 1 Latency Reciprocal Execution throughput unit 4 9 2 3 Page 24 2 2 1/2 2 2 1 1 FMICS, ALU FANY, ALU FANY FMISC, ALU FANY FMISC Notes K8 MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LBW/WD/ DQ PUNPCKHQDQ PUNPCKLQDQ PSHUFD PSHUFW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW PINSRW 2 2 3 1 2 1 2 1 2 2 2 4 5 1 2 1 2 4 9 9 2 2 mm,r/m 1 2 2 FA/M xmm,r/m 3 3 2 FA/M mm,r/m 1 2 2 FA/M xmm,r/m xmm,r/m xmm,r/m xmm,xmm,i mm,mm,i xmm,xmm,i mm,mm xmm,xmm r32,mm/xmm r32,mm/x,i mm,r32,i xmm,r32,i 2 2 1 3 1 2 32 64 1 2 2 3 2 2 2 3 2 2 FA/M FA/M FA/M FA/M FA/M FA/M 2 5 12 12 2 1 1/2 1.5 1/2 1 13 26 1 2 2 3 FADD FMISC, ALU FA/M FA/M mm,r/m 1 2 1/2 FA/M xmm,r/m mm,r/m xmm,r/m 2 1 2 2 2 2 1 1/2 1 FA/M FA/M FA/M Arithmetic instructions PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PADDB/W/D/Q PADDSB/W ADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PCMPEQ/GT B/W/D PCMPEQ/GT B/W/D 2 2 2 Page 25 2 2 2 1/2 1 1/2 1 1 1 2 2 2 2 1/2 1 2 3 Moves 64 bits.Name FMISC, ALU of instruction differs FANY, ALU do. FANY, ALU do. 
FA/M FA/M, FMISC FANY FANY, FMISC FMISC FA/M FMISC FMISC r64,mm/xmm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,mm/x xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm FA/M FA/M, FMISC FMISC FMISC K8 PMULLW PMULHW PMULHUW PMULUDQ PMULLW PMULHW PMULHUW PMULUDQ PMADDWD PMADDWD PAVGB/W PAVGB/W PMIN/MAX SW/UB PMIN/MAX SW/UB PSADBW PSADBW Logic PAND PANDN POR PXOR PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ mm,r/m 1 3 1 FMUL xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m 2 1 2 1 2 1 2 1 2 3 3 3 2 2 2 2 3 3 2 1 2 1/2 1 1/2 1 1 2 FMUL FMUL FMUL FA/M FA/M FA/M FA/M FADD FADD mm,r/m 1 2 1/2 FA/M xmm,r/m 2 2 1 FA/M mm,i/mm/m 1 2 1/2 FA/M x,i/x/m xmm,i 2 2 2 2 1 1 FA/M FA/M 1/3 FANY Other EMMS 1 Floating point XMM instructions Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVDDUP MOVSH/LDUP MOVNTPS/D MOVMSKPS/D Operands Ops Latency Reciprocal Execution throughput unit r,r r,m m,r r,r r,m m,r r,r r,m m,r 2 2 2 2 4 5 1 2 1 2 2 4 3 1 2 2 1 2 2 1 1 1 FA/M FANY FMISC FMISC r,r 1 2 1/2 FA/M r,m 1 1 FMISC m,r r,r r,r m,r r32,r 1 2 2 2 1 1 1 2 3 1 FMISC 2 2 2 8 Page 26 Notes FA/M FMISC FMISC FA/M SSE3 SSE3 FMISC FADD K8 SHUFPS/D UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D HADDPS/D HSUBPS/D MULSS/D MULPS/D DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D r,r/m,i r,r/m 3 2 3 3 2 3 FMUL FMUL r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 2 4 3 1 2 2 2 4 1 2 1 3 3 2 2 2 4 8 8 2 5 5 5 8 4 5 6 8 14 12 10 9 2 3 8 1 2 2 2 3 1 2 1 2 2 2 2 2 FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC FMISC 
FMISC FMISC FMISC FMISC r,r/m r,r/m 1 2 4 4 1 2 FADD FADD r,r/m r,r/m r,r/m 2 1 2 4 4 4 2 1 2 FADD FMUL FMUL r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 2 1 2 1 2 1 2 1 2 11-16 18-30 11-20 16-34 3 3 2 2 2 2 8-13 18-30 8-17 16-34 1 2 1 2 1 2 FMUL FMUL FMUL FMUL FMUL FMUL FADD FADD FADD FADD r,r/m 1 2 1 FADD Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D r,r/m 2 2 2 FMUL Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS r,r/m r,r/m r,r/m r,r/m r,r/m 1 2 1 2 1 19 36 27 48 3 16 36 24 48 1 FMUL FMUL FMUL FMUL FMUL Page 27 SSE3 Low values are for round divisors, e.g. powers of 2. do. do. do. K8 RSQRTPS r,r/m 2 Other LDMXCSR STMXCSR m m 8 3 3 2 9 10 Page 28 FMUL K10 AMD K10 List of instruction timings and macro-operation breakdown Explanation of column headings: Instruction: Operands: Ops: Latency: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc. Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode. This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m). Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from the execution of an instruction begins to a subsequent independent instruction of the same kind can begin to execute. 
A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline. Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALU's. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units. Integer instructions Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX Operands r,r r,i r8,m8 r16,m16 r32,m32 r64,m64 m8,r8H m8,r8L m16/32/64,r m,i m64,i32 r,sr sr,r/m m,r r,r Ops 1 1 1 1 1 1 1 1 1 1 1 1 6 1 1 Latency Reciprocal Execution throughput unit 1 1 4 4 3 3 8 3 3 3 3 3-4 8-26 1 Page 29 1/3 1/3 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 1/2 8 1 1/3 ALU ALU ALU, AGU ALU, AGU AGU AGU AGU AGU AGU AGU AGU Notes Any addr. mode. Add 1 clock if code segment base ≠ 0 AH, BH, CH, DH Any other 8-bit reg. Any addressing mode from AMD manual AGU ALU K10 MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LAHF SAHF SALC LDS, LES, ... 
BSWAP PREFETCHNTA PREFETCHT0/1/2 PREFETCH(W) SFENCE LFENCE MFENCE IN OUT 1 1 1 1 1 2 2 2 r 1 i 1 m 2 sr 2 9 9 r 1 m 3 6 DS/ES/FS/GS SS 10 28 9 r16,[m] 2 r32/64,[m] 1 r32/64,[m] 1 4 1 1 r,m 10 r 1 m 1 m 1 m 1 6 1 4 r,i/DX ~270 i/DX,r ~300 Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD AAM r,m r64,r32 r64,m32 r,r r,m r,r r,m r,r/i r,m m,r r,r/i r,m m,r/i r,r/i r,m r m 1 1 1 1 1 1 1 1 1 1 9 12 16 4 30 4 1 4 1 4 1 21 5 6 3 10 26 16 6 3 1 2 3 1 1 1 1 4 1 4 1 1 7 5 6 7 5 13 Page 30 1/2 1/3 1/2 1/3 1/2 1 19 5 1/2 1/2 1 1 3 6 1/2 1 8 16 11 6 1 1/3 1/3 2 1/3 1 10 1/3 1/2 1/2 1/2 8 1 33 ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU AGU ALU ALU ALU 1/3 1/2 1 1/3 1/2 1 1/3 1/2 1/3 2 5 6 7 5 13 ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU0 ALU ALU AGU AGU AGU Timing depends on hw Any address size ≤ 2 source operands W. scale or 3 opr. 
3DNow K10 MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV IDIV DIV DIV DIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL,RO RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r8 m8 r16/m16 r32/m32 r64/m64 r16/m16 r32/m32 r64/m64 1 3 2 2 1 1 1 2 1 1 3 3 3 1 1 r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i 1 1 1 1 1 1 1 1 1 1 9 7 9 7 1 1 10 9 9 8 6 7 8 1 1 5 2 3 3 3 4 3 3 4 4 3 4 17 19 22 15-30 15-46 15-78 24-39 24-55 24-87 1 1 1 4 1 1 7 1 1 1 3 3 4 3 7 7 7 7 8 7 3 3 7.5 1 7 2 Page 31 1 2 1 2 1 1 2 1 1 2 2 2 2 17 19 22 15-30 15-46 15-78 24-39 24-55 24-87 1/3 1/3 ALU0 ALU0_1 ALU0_1 ALU0_1 ALU0 ALU0 ALU0_1 ALU0 ALU0 ALU0 ALU0 ALU0 ALU0_1 ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU 1/3 1/2 1 1/3 1/2 1/3 1 1/3 1/3 1 3 3 4 3 1 1 5 6 6 5 2 3 6 1/3 1/2 2 1/3 ALU ALU, AGU ALU, AGU ALU ALU, AGU ALU ALU, AGU ALU ALU ALU ALU ALU ALU ALU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU ALU, AGU ALU, AGU ALU latency ax=3, dx=4 latency rax=4, rdx=5 Depends on number of significant bits in absolute value of dividend. See AMD software optimization guide. 
K10 BTC BTR, BTS BTC BTR, BTS BSF BSR BSF BSR POPCNT LZCNT SETcc SETcc CLC, STC CMC CLD STD m,i m,i m,r m,r r,r r,r r,m r,m r,r/m r,r/m r m Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E/R)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i IRET INT i BOUND m INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS 5 4 8 8 6 7 7 8 1 1 1 1 1 1 1 2 1 16-20 1 1 17-21 1 2 7 3 16-22 4 5 16-22 2 2 15-23 15-24 32 33 6 2 4 5 4 2 7 3 5 5 7 3 9 9 8 8 4 4 7 7 2 2 1 1 1.5 1.5 10 7 3 3 3 3 1 1 1/3 1/2 1/3 1/3 1/3 2/3 ALU, AGU ALU, AGU ALU, AGU ALU, AGU ALU ALU ALU, AGU ALU, AGU ALU ALU ALU ALU, AGU ALU ALU ALU ALU 2 ALU 23-32 low values = real mode 2 2 ALU ALU, AGU 1/3 - 2 2/3 - 2 3 2 ALU ALU ALU ALU 3 3 ALU ALU, AGU 3 3 ALU ALU 25-33 2 23-32 3 3 24-33 3 3 24-35 24-35 81 42 SSE4.A / SSE4.2 SSE4.A, AMD only low values = real mode low values = real mode low values = real mode low values = real mode low values = real mode 2 2 2 2 2 1 3 1 2 2 3 1 Other Page 32 recip. thrp.= 2 if jump recip. 
thrp.= 2 if jump 2 2 2 1 3 1 2 2 3 1 real mode real mode values are for no jump values are for no jump values are per count values are per count values are per count values are per count values are per count K10 NOP (90) Long NOP (0F 1F) ENTER LEAVE CLI STI CPUID RDTSC RDPMC 1 0 1 0 i,0 12 2 8-9 16-17 22-50 47-164 30 13 1/3 1/3 ALU ALU 12 3 5 27 3 ops, 5 clk if 16 bit 67 5 Floating point x87 instructions Instruction Operands Ops r m32/64 m80 m80 r m32/64 m80 m80 r m m 1 1 7 20 1 1 10 218 1 1 1 1 FCMOVcc FFREE FINCSTP, FDECSTP st0,r r 9 1 1 FNSTSW FSTSW FNSTSW FNSTCW FLDCW AX AX m16 m16 m16 2 3 2 3 12 r/m m r/m m r/m m 1 2 1 2 1 2 1 1 1 1 2 1 2 6 Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 Latency Reciprocal Execution throughput unit 2 4 13 94 2 2 8 167 0 6 4 0 1/2 1/2 4 30 1/2 1 7 163 1/3 1 1 1 Notes FA/M FANY FA/M FMISC FMISC FMISC FMISC 1/3 1/3 Low latency immediFMISC, FA/M ately after FCOMI FANY FANY 16 14 9 2 14 FMISC, ALU after FCOM FTST FMISC, ALU do. FMISC, ALU do. FMISC, ALU FMISC, ALU faster if unchanged Low latency immediately Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT r/m r m 4 4 ? 31 2 Page 33 1 4 1 4 24 24 2 1 1 1 1 1 1 37 FADD FADD,FMISC FMUL FMUL,FMISC FMUL FMUL,FMISC FMUL FADD FADD FADD FADD, FMISC FADD FMISC, ALU K10 FPREM FPREM1 1 1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 1 1 45 51 76 45 9 5 11 8 8 12 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR 1 1 8 26 77 70 61 85 m m m m 35 ~51? ~90? ~125? ~119 151? 9 9 65 13 114 0 0 162 133 63 89 7 7 FMUL FMUL 35 1 FMUL FMISC 45? 29 41 30? 30? 44? 
1/3 1/3 28 103 149 149 58 79 FANY ALU FMISC FMISC Integer MMX and XMM instructions Instruction Operands Ops Latency Reciprocal Execution throughput unit Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32,mm/x 1 2 1 1 2 1 1 3 6 4 3 6 2 2 1 3 1/2 1 3 1/2 1 MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU r64,(x)mm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,(x)mm xmm,xmm xmm,m m,xmm xmm,m 1 2 2 1 1 1 1 1 1 1 2 1 3 6 6 2 2.5 4 2 2 2.5 2 2 2 1 3 3 1/2 1/3 1/2 1/2 1 1/3 1/2 1 1/2 Page 34 Notes FADD FANY FADD FMISC Moves 64 bits.Name of instruction differs do. FMUL, ALU do. FA/M FANY FANY ? FMISC FANY ? FMUL,FMISC FADD K10 MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LBW/WD/ DQ PUNPCKHQDQ PUNPCKLQDQ PSHUFD PSHUFW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PINSRW INSERTQ INSERTQ EXTRQ EXTRQ m,xmm mm,xmm xmm,mm m,mm m,xmm 3 1 1 1 2 3 2 2 2 1/3 1/3 1 1 FANY FANY FMISC FMUL,FMISC mm,r/m 1 2 1/2 FA/M xmm,r/m 1 3 1/2 FA/M mm,r/m 1 2 1/2 FA/M xmm,r/m xmm,r/m xmm,r/m xmm,xmm,i mm,mm,i xmm,xmm,i mm,mm xmm,xmm r32,mm/xmm r32,(x)mm,i (x)mm,r32,i xmm,xmm xmm,xmm,i,i xmm,xmm xmm,xmm,i,i 1 1 1 1 1 1 32 64 1 2 2 3 3 1 1 3 3 3 3 2 2 FA/M FA/M FA/M FA/M FA/M FA/M 3 6 9 6 6 2 2 1/2 1/2 1/2 1/2 1/2 1/2 13 24 1 1 3 2 2 1/2 1/2 mm/xmm,r/m 1 1 2 2 1/2 1/2 FA/M FA/M mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m mm/xmm,r/m 1 1 1 1 1 3 3 2 2 3 1 1 1/2 1/2 1 FMUL FMUL FA/M FA/M FADD mm/xmm,r/m 1 2 1/2 FA/M mm,i/mm/m 1 2 1/2 FA/M x,i/(x)mm xmm,i 1 1 3 3 1/2 1/2 FA/M FA/M Arithmetic instructions PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W mm/xmm,r/m PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMULUDQ PMADDWD PAVGB/W PMIN/MAX SW/UB PSADBW Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ Page 35 FADD FA/M FA/M FA/M FA/M FA/M SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only 
K10 Other EMMS 1 1/3 FANY Floating point XMM instructions Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVNTSS/D MOVMSKPS/D SHUFPS/D UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D MULSS/D Operands Ops Latency Reciprocal Execution throughput unit r,r r,m m,r r,r r,m m,r r,r r,m m,r 1 1 2 1 1 3 1 1 1 2.5 2 2 2.5 2 3 2 2 2 1/2 1/2 1 1/2 1/2 2 1/2 1/2 1 FANY ? FMUL,FMISC FANY ? FMISC FA/M ? FMISC r,r 1 3 1/2 FA/M r,m 1 4 1/2 FA/M m,r m,r m,r r32,r r,r/m,i r,r/m 1 2 1 1 1 1 3 3 3 1 3 1 1 1/2 1/2 FMISC FMUL,FMISC FMISC FADD FA/M FA/M r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 1 2 3 3 1 1 1 2 2 1 1 2 3 3 2 2 2 7 8 7 4 4 4 7 7 4 4 7 14 14 8 8 1 1 2 2 1 1 1 1 1 1 1 1 3 3 1 1 FMISC FADD,FMISC FADD,FMISC r,r/m r,r/m r,r/m 1 1 1 4 4 4 1 1 1 FADD FADD FMUL Page 36 FMISC FMISC FMISC FMISC FMISC Notes SSE4.A, AMD only K10 MULPS/D DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 1 1 1 1 1 1 1 1 1 r,r/m 1 Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D r,r/m 1 Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m Other LDMXCSR STMXCSR m m 4 16 18 20 20 3 2 2 2 2 1 13 15 17 17 1 1 1 1 1 FMUL FMUL FMUL FMUL FMUL FMUL FADD FADD FADD FADD 1 FADD 2 1/2 FA/M 1 1 1 1 1 1 19 21 27 27 3 3 16 18 24 24 1 1 FMUL FMUL FMUL FMUL FMUL FMUL 12 3 12 12 10 11 Obsolete 3DNow instructions Instruction Operands Ops Latency Reciprocal Execution throughput unit Move and convert instructions PF2ID mm,mm PI2FD mm,mm PF2IW mm,mm PI2FW mm,mm PSWAPD mm,mm 1 1 1 1 1 5 5 5 5 2 1 1 1 1 1/2 FMISC FMISC 
FMISC FMISC FA/M

Instruction        Operands  Ops  Latency  Reciprocal throughput  Execution unit  Notes

Integer instructions
PAVGUSB            mm,mm     1    2        1/2                    FA/M
PMULHRW            mm,mm     1    3        1                      FMUL

Floating point instructions
PFADD/SUB/SUBR     mm,mm     1    4        1                      FADD
PFCMPEQ/GE/GT      mm,mm     1    2        1                      FADD
PFMAX/MIN          mm,mm     1    2        1                      FADD
PFMUL              mm,mm     1    4        1                      FMUL
PFACC              mm,mm     1    4        1                      FADD
PFNACC, PFPNACC    mm,mm     1    4        1                      FADD            3DNow extension
PFRCP              mm,mm     1    3        1                      FMUL
PFRCPIT1/2         mm,mm     1    4        1                      FMUL
PFRSQRT            mm,mm     1    3        1                      FMUL
PFRSQIT1           mm,mm     1    4        1                      FMUL

Other
FEMMS              mm,mm     1             1/3                    FANY

Notes for the move and convert rows above: PF2IW, PI2FW and PSWAPD are 3DNow extension instructions.

Thank you to Xucheng Tang for doing the measurements on the K10.

AMD Bulldozer

List of instruction timings and macro-operation breakdown

Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe or unit is used for the macro-operations:
  Integer pipes:
    EX0: integer ALU, division
    EX1: integer ALU, multiplication, jump
    EX01: can use either EX0 or EX1
    AG01: address generation unit 0 or 1
  Floating point and vector pipes:
    P0: floating point add, mul, div, convert, shuffle, shift
    P1: floating point add, mul, div, shuffle, shift
    P2: move, integer add, boolean
    P3: move, integer add, boolean, store
    P01: can use either P0 or P1
    P23: can use either P2 or P3
  Two macro-operations can execute simultaneously if they go to different execution pipes.

Domain: Tells which execution unit domain is used:
    ivec: integer vector execution unit.
    fp: floating point execution unit.
    fma: floating point multiply/add subunit.
    inherit: the output operand inherits the domain of the input operand.
    ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.
  There is an additional latency of 1 clock cycle when the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no additional latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.
  An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
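The domain rules and the difference between latency and reciprocal throughput described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (crossing_penalty, fma_latency, min_cycles) are hypothetical and not part of the manual, "store" is used as a stand-in for a memory store, and domain combinations that the text does not mention are assumed to carry no extra delay.

```python
from fractions import Fraction

# Domain names follow the column-heading text above; "store" stands for a
# memory store. All function names here are hypothetical illustrations.
FP_DOMAINS = {"fp", "fma"}


def crossing_penalty(producer: str, consumer: str) -> int:
    """Extra clock cycles when a result crosses execution-unit domains.

    +1 when ivec output feeds fp or fma, and +1 when fp or fma output feeds
    ivec or a store. Combinations not mentioned in the text return 0, which
    is an assumption of this sketch.
    """
    if producer == "ivec" and consumer in FP_DOMAINS:
        return 1
    if producer in FP_DOMAINS and consumer in {"ivec", "store"}:
        return 1
    return 0  # same domain, or between fp and fma


def fma_latency(consumer: str) -> int:
    """Latency of an fma instruction as seen by the consuming instruction."""
    if consumer == "fma":
        return 5  # fma result forwarded to another fma instruction
    # 6 to an fp instruction, 6+1 to an ivec or store instruction
    return 6 + crossing_penalty("fma", consumer)


def min_cycles(count: int, latency: int, recip_throughput: Fraction,
               dependent: bool) -> Fraction:
    """Lower bound on clock cycles for `count` instructions of one kind.

    A dependency chain is bound by latency; independent instructions are
    bound by reciprocal throughput. Other pipeline bottlenecks are ignored.
    """
    per_instruction = Fraction(latency) if dependent else recip_throughput
    return count * per_instruction


if __name__ == "__main__":
    for consumer in ("fma", "fp", "ivec", "store"):
        print(consumer, fma_latency(consumer))
    # Nine independent instructions with reciprocal throughput 1/3
    print(min_cycles(9, 1, Fraction(1, 3), dependent=False))
```

Using Fraction keeps values such as 1/3 exact, so a table entry can be multiplied by an instruction count without rounding error. The result is a lower bound only, since the text notes that throughput may be limited by other bottlenecks in the pipeline.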
Page 39 Bulldozer Integer instructions Instruction Move instructions MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX MOVSX MOVZX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LAHF SAHF SALC BSWAP PREFETCHNTA PREFETCHT0/1/2 PREFETCH/W SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB Operands Ops r,r r,i r,m m,r m,i m,r r,r r,m r,m r64,r32 r64,m32 r,r r,m r,r 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 4 4 r,m 2 2 1 1 2 8 9 1 2 34 14 2 2 ~50 6 1 1 4 2 1 1 1 1 1 6 1 6 2 1 3 2 1 1 1 1 1 1 r i m r m r16,[m] r32,[m] r32/64,[m] r32/64,[m] r m m m r,r r,i Latency Reciprocal throughput 5 1 5 4 1 5 1 1 0.5 0.5 0.5 1 1 2 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 EX01 EX01 AG01 EX01 AG01 ~50 2 1 1 1.5 4 9 1 1 19 8 EX01 2-3 2-3 Page 40 Execution pipes Notes all addr. modes all addr. modes EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 0.5 0.5 2 1 1 0.5 0.5 0.5 0.5 89 0.25 89 EX01 EX01 0.5 0.5 EX01 EX01 Timing depends on hw any addr. size 16 bit addr. 
size scale factor > 1 or 3 operands all other cases EX01 AMD 3DNow Bulldozer ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CDQ, CQO CWD Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST r,m m,r m,i r,r r,i r,m m,r m,i r,r r,i r,m r m r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r16/m16 r32/m32 r64/m64 r8/m8 r16/m16 r32/m32 r64/m64 r,r r,i r,m m,r m,i r,r 1 1 1 1 1 1 1 1 1 1 1 1 1 10 16 20 4 9 1 2 1 1 1 1 1 2 1 1 2 2 2 14 18 16 16 33 36 36 36 1 1 2 1 1 1 1 1 1 7-8 7-8 1 1 1 9 9 1 1 1 7-8 6 9 10 6 20 4 4 4 6 4 4 6 5 4 6 20 15-27 16-43 16-75 23 23-33 22-48 22-79 1 1 1 1 1 7-8 7-8 1 Page 41 0.5 1 1 1 1 1 0.5 0.5 0.5 0.5 1 20 2 2 2 4 2 2 4 2 2 4 2 2 4 20 15-28 16-43 16-75 20 20-27 20-43 20-75 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 0.5 1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX01 EX01 EX01 0.5 0.5 0.5 1 1 0.5 EX01 EX01 EX01 EX01 EX01 EX01 Bulldozer TEST TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL RCL RCL RCR RCR RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC, BTR, BTS BTC, BTR, BTS BSF BSF BSR BSR LZCNT POPCNT SETcc SETcc CLC, STC CMC CLD STD POPCNT POPCNT LZCNT EXTRQ EXTRQ INSERTQ INSERTQ r,i m,r m,i r m r,i/CL r,i/CL r,1 r,i r,cl r,1 r,i r,cl r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,r r,r r,m r,r r,m r,r r,r/m r m r16/32,r16/32 r64,r64 r,r x,i,i x,x x,x,i,i x,x Control transfer instructions JMP short/near JMP r JMP m Jcc short/near fused CMP+Jcc short/near J(E/R)CXZ short LOOP short LOOPE LOOPNE short 1 1 1 1 1 1 1 1 16 17 1 15 16 6 7 8 1 1 7 2 4 10 6 8 7 9 1 1 1 1 1 1 2 2 1 1 2 1 1 1 1 1 1 7 1 1 1 8 9 1 8 8 3 4 1 2 3 4 
4 2 4 1 0.5 0.5 0.5 0.5 1 0.5 0.5 3 3.5 3.5 0.5 0.5 3.5 1 2 5 3 4 4 5 2 2 0.5 1 0.5 1 4 4 2 3 3 3 3 1 1 1 1 1 1 1 1 Page 42 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX0 EX1 EX01 EX01 EX01 EX01 SSE4.A SSE4.2 3 4 2 4 2 1 1 1 1 P1 P1 P1 P1 SSE4A SSE4A SSE4A SSE4A SSE4A SSE4A SSE4A 2 2 2 1-2 1-2 1-2 1-2 1-2 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 2 if jumping 2 if jumping 2 if jumping 2 if jumping 2 if jumping Bulldozer CALL CALL CALL RET RET BOUND INTO near r m i m String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Synchronization LOCK ADD XADD LOCK XADD CMPXCHG LOCK CMPXCHG CMPXCHG LOCK CMPXCHG CMPXCHG8B LOCK CMPXCHG8B CMPXCHG16B LOCK CMPXCHG16B Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC CRC32 CRC32 CRC32 XGETBV m,r m,r m,r m8,r8 m8,r8 m,r16/32/64 m,r16/32/64 m64 m64 m128 m128 a,0 a,b r32,r8 r32,r16 r32,r32 2 2 3 1 4 11 4 2 2 2 2 2-3 5 24 3 6n 3 2n 3 per 16B 5 2n 4 per 16B 3 7n 6 9n 3 3n 3 2n 3 per 16B 3 2n 3 per 16B 3 4n 3 4n 1 4 4 5 5 6 6 18 18 22 22 1 1 40 13 11+5b 2 37-63 36 22 3 5 5 4 EX1 EX1 EX1 EX1 EX1 for no jump for no jump small n best case small n best case ~55 10 ~51 15 ~51 14 ~52 15 ~53 52 ~94 3 5 6 Page 43 0.25 0.25 43 22 16+4b 4 112-280 42 300 2 5 6 31 none none Bulldozer Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FLDCW FNSTCW Arithmetic instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc. 
FSIN FCOS Operands Ops r m32/64 m80 m80 r m32/64 m80 m80 r m m 1 1 8 60 1 2 13 239 1 1 2 1 8 1 1 4 3 1 3 st0,r r AX m16 m16 m16 r/m m r/m m r m m r/m r m 1 2 1 2 1 2 2 1 1 1 2 2 1 1 1 1 1 1 1 10-162 160-170 Latency Reciprocal throughput 2 8 14 61 2 8 9 240 0 12 8 3 0 ~13 ~13 5-6 5-6 10-42 2 2 ~20 4 19-62 19-65 0.5 1 4 40 0.5 1 20 244 0.5 1 1 0.5 3 0.25 0.25 22 19 3 2 1 2 1 2 5-18 0.5 0.5 0.5 1 1 0.5 0.5 1 10-53 65-210 ~160 Page 44 0.5 65-210 ~160 Execution pipes Domain, notes P01 fp fp fp fp fp fp fp fp inherit fp fp fp fp P0 P1 P2 P3 P01 P0 P1 F3 P01 F3 P0 F3 P01 P0 P1 F3 none none P0 P2 P3 P0 P2 P3 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P0 P1 F3 P01 P01 P01 P0 P0 P0 P01 P01 P0 P1 P3 P0 P1 P3 inherit fma fma fma fma fp fp fp fp fp fp fp fp fp fp fp fp fp Bulldozer FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 12-166 11-190 10-355 8 12 10 10-175 10-175 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR 1 1 18 31 103 76 m864 m864 95-160 95-245 60-440 52 10 64-71 300 312 95-160 95-245 60-440 5 0.25 0.25 57 170 300 312 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 none none P0 P0 P0 P1 P2 P3 P0 P3 Integer MMX and XMM instructions Instruction Operands Move instructions MOVD r32/64, mm/x MOVD mm/x, r32/64 MOVD mm/x,m32 MOVD m32,mm/x MOVQ mm/x,mm/x MOVQ mm/x,m64 MOVQ m64,mm/x MOVDQA xmm,xmm MOVDQA xmm,m MOVDQA m,xmm VMOVDQA ymm,ymm VMOVDQA ymm,m256 VMOVDQA m256,ymm MOVDQU xmm,xmm MOVDQU xmm,m MOVDQU m,xmm LDDQU xmm,m VMOVDQU ymm,m256 VMOVDQU m256,ymm MOVDQ2Q mm,xmm MOVQ2DQ xmm,mm MOVNTQ m,mm MOVNTDQ m,xmm MOVNTDQA xmm,m PACKSSWB/DW (x)mm,r/m PACKUSWB (x)mm,r/m PUNPCKH/LBW/WD/D Q (x)mm,r/m Ops Latency Reciprocal throughput 1 2 1 1 1 1 1 1 1 1 2 2 4 1 1 1 1 2 8 1 1 1 1 1 1 1 8 10 6 5 2 6 5 0 6 5 2 6 5 0 6 5 6 6 6 2 2 6 6 6 2 2 1 1 0.5 1 0.5 0.5 1 0.25 0.5 1 0.5 1 3 0.25 0.5 1 0.5 1-2 10 0.5 0.5 2 2 0.5 1 1 1 2 1 Page 45 Execution pipes Notes P23 P3 none inherit domain P3 P23 P3 none P3 P2 P3 P23 P23 P3 P3 P1 P1 P1 inherit domain Bulldozer 
PUNPCKHQDQ xmm,r/m PUNPCKLQDQ xmm,r/m PSHUFB (x)mm,r/m PSHUFD xmm,xmm,i PSHUFW mm,mm,i PSHUFL/HW xmm,xmm,i PALIGNR (x)mm,r/m,i PBLENDW xmm,r/m MASKMOVQ mm,mm MASKMOVDQU xmm,xmm PMOVMSKB r32,mm/x PEXTRB/W/D/Q r,x/mm,i PINSRB/W/D/Q x/mm,r,i PMOVSXBW/BD/BQ/ WD/WQ/DQ xmm,xmm PMOVZXBW/BD/BQ/W D/WQ/DQ xmm,xmm VPCMOV x,x,x,x/m VPCMOV y,y,y,y/m VPPERM x,x,x,x/m 1 1 1 1 1 1 1 1 31 64 2 2 2 2 2 3 2 2 2 2 2 38 48 10 10 12 1 1 1 1 1 1 1 0.5 37 61 1 1 2 P1 P1 P1 P1 P1 P1 P1 P23 P3 P1 P3 P1 P3 P1 P3 P1 1 2 1 P1 SSE4.1 1 1 2 1 2 2 2 2 1 1 2 1 P1 P1 P1 P1 SSE4.1 AMD XOP AMD XOP AMD XOP (x)mm,r/m 1 2 0.5 P23 (x)mm,r/m x,x x,m (x)mm,r/m (x)mm,r/m (x)mm,r/m 1 3 4 1 1 1 2 5 5 2 2 2 0.5 2 2 0.5 0.5 0.5 P23 P1 P23 P1 P23 P23 P23 P23 (x)mm,r/m xmm,r/m xmm,r/m (x)mm,r/m (x)mm,r/m (x)mm,r/m (x)mm,r/m 1 1 1 1 1 1 1 4 5 4 4 4 4 2 1 2 1 1 1 1 0.5 P0 P0 P0 P0 P0 P0 P23 (x)mm,r/m xmm,r/m (x)mm,r/m (x)mm,r/m (x)mm,r/m x,x,i 1 2 1 1 2 8 2 4 2 2 4 8 0.5 1 0.5 0.5 1 4 P23 P1 P23 P23 P23 P23 P1 P23 VPCOMB/W/D/Q x,x,x/m,i 1 2 0.5 P23 VPCOMUB/W/D/Q x,x,x/m,i 1 2 0.5 P23 Arithmetic instructions PADDB/W/D/Q/SB/SW /USB/USW PSUBB/W/D/Q/SB/SW/ USB/USW PHADD/SUB(S)W/D PHADD/SUB(S)W/D PCMPEQ/GT B/W/D PCMPEQQ PCMPGTQ PMULLW PMULHW PMULHUW PMULUDQ PMULLD PMULDQ PMULHRSW PMADDWD PMADDUBSW PAVGB/W PMIN/MAX SB/SW/ SD UB/UW/UD PHMINPOSUW PABSB/W/D PSIGNB/W/D PSADBW MPSADBW Page 46 SSE4.1 AVX SSSE3 SSSE3 SSE4.1 SSE4.2 SSE4.1 SSE4.1 SSSE3 SSE4.1 SSSE3 SSSE3 SSE4.1 AMD XOP latency 0 if i=6,7 AMD XOP latency 0 if i=6,7 Bulldozer VPHADDBW/BD/BQ/ WD/WQ/DQ VPHADDUBW/BD/BQ/ WD/WQ/DQ VPHSUBBW/WD/DQ VPMACSWW/WD VPMACSDD VPMACSDQH/L VPMACSSWW/WD VPMACSSDD VPMACSSDQH/L VPMADCSWD VPMADCSSWD Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ PTEST VPROTB/W/D/Q VPROTB/W/D/Q VPSHAB/W/D/Q VPSHLB/W/D/Q String instructions PCMPESTRI PCMPESTRM PCMPISTRI PCMPISTRM Encryption PCLMULQDQ AESDEC AESDECLAST AESENC AESENCLAST AESIMC AESKEYGENASSIST x,x/m 1 2 0.5 P23 AMD XOP x,x/m x,x/m x,x,x/m,x x,x,x/m,x 
x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x 1 1 1 1 1 1 1 1 1 1 2 2 4 5 4 4 5 4 4 4 0.5 0.5 1 2 1 1 2 1 1 1 P23 P23 P0 P0 P0 P0 P0 P0 P0 P0 AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP (x)mm,r/m 1 2 0.5 P23 (x)mm,r/m 1 3 1 P1 (x)mm,i xmm,i xmm,r/m x,x,x/m x,x,i x,x,x/m x,x,x/m 1 1 2 1 1 1 1 2 2 3 2 3 3 1 1 1 1 1 1 1 P1 P1 P1 P3 P1 P1 P1 P1 SSE4.1 AMD XOP AMD XOP AMD XOP AMD XOP x,x,i x,x,i x,x,i x,x,i 27 27 7 7 17 10 14 7 10 10 3 4 P1 P2 P3 P1 P2 P3 P1 P2 P3 P1 P2 P3 SSE4.2 SSE4.2 SSE4.2 SSE4.2 x,x/m,i x,x x,x x,x x,x x,x x,x,i 5 2 2 2 2 1 1 12 5 5 5 5 5 5 7 2 2 2 2 1 1 P1 P01 P01 P01 P01 P0 P0 pclmul aes aes aes aes aes aes Execution pipes Domain, notes Other EMMS 1 0.25 Floating point XMM and YMM instructions Instruction Operands Ops Latency Reciprocal throughput Move instructions Page 47 Bulldozer MOVAPS/D MOVUPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D MOVNTSS/SD SHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD VBLENDPS/PD BLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS x,x y,y 1 2 0 2 0.25 0.5 x,m128 1 6 0.5 y,m256 2 6 1-2 m128,x m256,y m256,y x,x x,m32/64 m32/64,x 1 4 8 1 1 1 5 5 6 2 6 5 1 3 10 0.5 0.5 1 x,m64 m64,x m64,x x,x r32,x r32,y m128,x m256,y m,x x,x/m,i y,y,y/m,i x,x,x/m y,y,y/m x,x/m,i y,y/m,i y,y,y,i y,y,m,i x,x/m,i y,y,y/m,i x,x/m,xmm0 y,y,y/m,y x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 x,x x,m128 y,y y,m256 x,x/m y,y,y/m r32,x,i 1 2 1 1 2 7 8 7 2 10 1 1 1 1 1 P1 P3 P3 P1 P1 P3 1 6 2 P3 4 1 2 1 2 1 2 3 4 0.5 1 1 2 1 0.5 2 1 0.5 0.5 0.5 0.5 1 0.5 2 1 1 2 1 P3 P1 P1 P1 P1 P1 P1 P23 P23 P23 P23 P1 P1 P1 SSE4A ivec ivec ivec ivec 
ivec ivec ivec ivec ivec ivec ivec ivec ivec P1 ivec P23 P23 P23 P1 ivec P1 ivec P1 P1 P1 P3 ivec ivec 1 1 2 1 2 1 2 8 10 1 2 1 2 1 1 2 2 1 2 2 2 1 1 2 2 1 2 2 2 2 3 3 2 2 4 2 2 2 2 2 2 6 6 6 6 2 2 2 2 10 Page 48 none P23 inherit domain ivec P3 P3 P2 P3 P01 fp ivec Bulldozer EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 m128,x,x m256,y,y 2 1 2 1 1 2 2 1 2 18 34 14 2 7 2 2 9 9 9 22 25 1 1 1 1 1 1 1 0.5 1 7 13 P1 P3 P23 P23 P1 P1 P23 P23 P01 P01 P0 P1 P2 P3 P0 P1 P2 P3 Conversion CVTPD2PS VCVTPD2PS CVTPS2PD VCVTPS2PD CVTSD2SS CVTSS2SD CVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI x,x x,y x,x y,x x,x x,x x,x y,y x,x y,y x,x y,x x,x x,y x,mm mm,x x,mm mm,x x,r32 r32,x x,r32/64 r32/64,x 2 4 2 4 1 1 1 2 1 2 2 4 2 4 1 1 2 2 2 2 2 2 7 7 7 7 4 4 4 4 4 4 7 8 7 7 4 4 7 7 14 13 14 13 1 2 1 2 1 1 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 P01 P01 P01 P01 P0 P0 P0 P0 P0 P0 P01 P01 P01 P01 P0 P0 P0 P1 P0 P1 P0 P0 P0 P0 fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp x,x/m x,x/m 1 1 5-6 5-6 0.5 0.5 P01 P01 fma fma VADDPS/D VSUBPS/D ADDSUBPS/D VADDSUBPS/D y,y,y/m x,x/m y,y,y/m 2 1 2 5-6 5-6 5-6 1 0.5 1 P01 P01 P01 fma fma fma HADDPS/D HSUBPS/D x,x 3 10 2 P01 P1 ivec/fma HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D VHADDPS/D VHSUBPS/D x,m128 4 2 P01 P1 ivec/fma y,y,y 8 4 P01 P1 ivec/fma y,y,m 10 4 P01 P1 ivec/fma Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D 10 Page 49 ivec ivec Bulldozer MULSS MULSD MULPS MULPD VMULPS VMULPD DIVSS DIVPS VDIVPS DIVSD DIVPD VDIVPD RCPSS/PS VRCPPS CMPSS/D CMPPS/D VCMPPS/D COMISS/D UCOMISS/D MAXSS/SD/PS/PD MINSS/SD/PS/PD x,x/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y/m 1 1 2 1 2 1 2 1 2 5-6 5-6 5-6 9-24 9-24 9-27 9-27 5 5 0.5 0.5 1 4.5-9.5 9-19 4.5-11 9-22 1 2 
P01 P01 P01 P01 P01 P01 P01 P01 P01 fma fma fma fp fp fp fp fp fp x,x/m y,y,y/m 1 2 2 2 0.5 1 P01 P01 fp fp x,x/m 2 1 P01 P3 fp x,x/m 1 2 0.5 P01 fp VMAXPS/D VMINPS/D y,y,y/m 2 ROUNDSS/SD/PS/PD x,x/m,i 1 VROUNDSS/SD/PS/ PD y,y/m,i 2 DPPS x,x,i 16 DPPS x,m128,i 18 VDPPS y,y,y,i 25 VDPPS y,m256,i 29 DPPD x,x,i 15 DPPD x,m128,i 17 VFMADDSS/SD x,x,x,x/m 1 VFMADDPS/PD x,x,x,x/m 1 VFMADDPS/PD y,y,y,y/m 2 All other FMA4 instructions: same as above 2 4 1 1 P01 P0 fp fp 4 25 5-6 5-6 5-6 2 6 7 13 13 5 6 0.5 0.5 1 P0 P01 P23 P01 P23 P01 P3 P01 P3 P01 P23 P01 P23 P01 P01 P01 fp fma fma fma fma fma fma AMD FMA4 AMD FMA4 AMD FMA4 AMD FMA4 Math SQRTSS/PS VSQRTPS SQRTSD/PD VSQRTPD RSQRTSS/PS VRSQRTPS VFRCZSS/SD/PS/PD VFRCZSS/SD/PS/PD 27 15 x,x/m y,y/m x,x/m y,y/m x,x/m y,y/m x,x x,m 1 2 1 2 1 2 2 3 14-15 14-15 24-26 24-26 5 5 10 10 4.5-12 9-24 4.5-16.5 9-33 1 2 2 2 P01 P01 P01 P01 P01 P01 P01 P01 fp fp fp fp fp fp AMD XOP AMD XOP AND/ANDN/OR/XORPS/ PD x,x/m 1 2 0.5 P23 ivec VAND/ANDN/OR/XOR PS/PD y,y,y/m 2 2 1 P23 ivec Logic Other VZEROUPPER VZEROUPPER 9 16 4 5 32 bit mode 64 bit mode VZEROALL VZEROALL LDMXCSR STMXCSR FXSAVE FXRSTOR XSAVE XRSTOR m32 m32 m4096 m4096 m m 17 32 1 2 67 116 122 177 10 19 136 176 196 250 6 10 4 19 136 176 196 250 P2 P3 P2 P3 P0 P3 P0 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3 32 bit mode 64 bit mode

AMD Piledriver
List of instruction timings and macro-operation breakdown

Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe or unit is used for the macro-operations:
Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1
Floating point and vector pipes:
P0: floating point add, mul, div, convert, shuffle, shift
P1: floating point add, mul, div, shuffle, shift
P2: move, integer add, boolean
P3: move, integer add, boolean, store
P01: can use either P0 or P1
P23: can use either P2 or P3
Two macro-operations can execute simultaneously if they go to different execution pipes.

Domain: Tells which execution unit domain is used:
ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.
There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.
An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.

Integer instructions
Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVZX, MOVSX MOVSX MOVZX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XCHG XCHG XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LAHF SAHF SALC BSWAP PREFETCHNTA PREFETCHT0/1/2 PREFETCH/W SFENCE Operands Ops r8,r8 r16,r16 r32,r32 r64,r64 r,i r,m m,r m,i m,r r16,r8 r32,r r64,r r,m r,m r64,r32 r64,m32 r,r r,m r8,r8 r16,r16 r32,r32 r64,r64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 4 4 r,m 2 2 1 1 2 8 9 1 2 34 14 2 2 ~40 6 1 1 4 2 1 1 1 1 1 7 2 1 3 2 1 1 r i m r m r16,[m] r32,[m] r32/64,[m] r32/64,[m] r m m m Latency Reciprocal throughput 4 1 1 1 5 4 1 5 1 1 1 1 1 0.5 0.5 0.3 0.3 0.5 0.5 1 1 2 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 EX01 EX01 EX01 or AG01 EX01 or AG01 EX01 AG01 EX01 AG01 ~40 2 1 1 1 4 9 1 1 18 8 EX01 2-3 2-3 Execution pipes all addr. modes all addr. modes EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 0.5 0.5 2 1 1 0.5 0.5 0.5 0.5 81 Notes EX01 EX01 Timing depends on hw any addr. size 16 bit addr.
size scale factor > 1 or 3 operands all other cases EX01 PREFETCHW Piledriver LFENCE MFENCE 1 7 Arithmetic instructions ADD, SUB r,r ADD, SUB r,i ADD, SUB r,m ADD, SUB m,r ADD, SUB m,i ADC, SBB r,r ADC, SBB r,i ADC, SBB r,m ADC, SBB m,r ADC, SBB m,i CMP r,r CMP r,i CMP r,m CMP m,i INC, DEC, NEG r INC, DEC, NEG m AAA, AAS DAA DAS AAD AAM MUL, IMUL r8/m8 MUL, IMUL r16/m16 MUL, IMUL r32/m32 MUL, IMUL r64/m64 IMUL r16,r16/m16 IMUL r32,r32/m32 IMUL r64,r64/m64 IMUL r16,(r16),i IMUL r32,(r32),i IMUL r64,(r64),i IMUL r16,m16,i IMUL r32,m32,i IMUL r64,m64,i DIV r8/m8 DIV r16/m16 DIV r32/m32 DIV r64/m64 IDIV r8/m8 IDIV r16/m16 IDIV r32/m32 IDIV r64/m64 CBW, CWDE, CDQE CDQ, CQO CWD 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10 16 20 4 10 1 2 1 1 1 1 1 2 1 1 2 2 2 9 7 2 2 9 7 2 2 1 1 2 17-22 13-26 12-40 13-71 17-21 13-26 13-40 13-71 1 1 1 Logic instructions AND, OR, XOR AND, OR, XOR 1 1 1 1 r,r r,i 0.25 81 1 1 7-8 7-8 1 1 1 9 9 1 1 1 7-8 6 9 10 6 15 4 4 4 6 4 4 6 5 4 6 Page 54 0.5 0.5 0.5 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 1 15 2 2 2 4 2 2 4 2 2 4 2 2 4 13-22 13-25 12-40 13-71 13-18 13-25 13-40 13-71 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 0.5 1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX01 EX01 EX01 0.5 0.5 EX01 EX01 Piledriver AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST TEST TEST NOT NOT ANDN SHL, SHR, SAR ROL, ROR RCL RCL RCL RCR RCR RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC, BTR, BTS BTC, BTR, BTS BSF BSF BSR BSR SETcc SETcc CLC, STC CMC CLD STD POPCNT POPCNT LZCNT TZCNT BEXTR BEXTR BLSI BLSMSK BLSR BLCFILL BLCI BLCIC BLCMSK BLCS BLSFILL BLSI r,m m,r m,i r,r r,i m,r m,i r m r,r,r r,i/CL r,i/CL r,1 r,i r,cl r,1 r,i r,cl r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,r r,r r,m r,r r,m r m r16/32,r16/32 r64,r64 r,r r,r r,r,r r,r,i r,r r,r r,r r,r r,r r,r r,r r,r r,r r,r 1 1 1 1 1 1 1 1 1 1 1 1 1 16 17 1 15 16 6 7 8 1 1 7 2 4 10 6 8 7 9 1 1 1 1 2 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 
7-8 7-8 1 1 1 7-8 1 1 1 1 7 7 1 7 6 3 3 1 2 20 21 3 4 4 1 0.5 1 1 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 3 3 3.5 0.5 0.5 3.5 1 3 4 4 5 0.5 1 0.5 1 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Page 55 3 4 2 4 2 2 0.67 0.67 1 1 1 1 1 1 1 1 1 1 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX0 BMI1 SSE4.2 SSE4.2 LZCNT BMI1 BMI1 AMD TBM BMI1 BMI1 BMI1 AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM Piledriver BLSIC T1MSKC TZMSK r,r r,r r,r Control transfer instructions JMP short/near JMP r JMP m Jcc short/near fused CMP+Jcc short/near J(E/R)CXZ short LOOP short LOOPE LOOPNE short CALL near CALL r CALL m RET RET i BOUND m INTO String instructions LODS REP LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Synchronization LOCK ADD XADD LOCK XADD CMPXCHG LOCK CMPXCHG CMPXCHG LOCK CMPXCHG CMPXCHG8B LOCK CMPXCHG8B CMPXCHG16B LOCK CMPXCHG16B Other NOP (90) Long NOP (0F 1F) PAUSE m8/m16 m32/m64 m,r m,r m,r m,r8/16 m,r8/16 m,r32/64 m,r32/64 m64 m64 m128 m128 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 2 3 1 4 11 4 2 2 2 1-2 1-2 1-2 1-2 1-2 2 2 2 2 2 5 2 3 6n 6n 3 1n 3 per 16B 5 1-3n 4.5 pr 16B 3 7n 6 9n 3 3n 2.5n 3 1n 3 per 16B 3 1n 3 per 16B 3 3-4n 3 4n 1 4 4 5 5 6 6 18 18 22 22 AMD TBM AMD TBM AMD TBM EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 for no jump for no jump small n best case small n best case ~40 20 ~39 23 ~40 20 ~40 25 ~42 66 ~80 1 1 40 0.25 0.25 40 Page 56 2 if jumping 2 if jumping 2 if jumping 2 if jumping 2 if jumping none none Piledriver ENTER ENTER LEAVE CPUID XGETBV RDTSC RDPMC CRC32 CRC32 CRC32 a,0 a,b r32,r8 r32,r16 r32,r32 13 20+3b 2 38-64 4 36 21 3 5 5 3 5 6 21 16+4b 4 105-271 30 42 310 2 5 6 Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(T)(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FLDCW FNSTCW Arithmetic 
instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT Operands Ops r m32/64 m80 m80 r m32/64 m80 m80 r m m 1 1 8 60 1 2 13 239 1 1 2 1 8 1 1 3 2 1 2 2 7 20 64 2 7 22 220 0 11 7 1 2 1 2 1 1 2 1 1 1 2 2 1 1 1 5-6 st0,r r AX m16 m16 m16 r/m m r/m m r m m r/m r m Latency Reciprocal throughput 3 0 5-6 9-40 2 2 ~20 4 Page 57 0.5 1 4 35 0.5 1 20 0.5 1 1 0.5 3 0.25 0.25 19 17 3 2 1 2 1 2 4-16 0.5 0.5 0.5 1 1 0.5 0.5 1 Execution pipes Domain, notes P01 fp fp fp fp fp fp fp fp inherit fp fp fp fp P0 P1 P2 P3 P01 P0 P1 F3 P01 F3 P0 F3 P01 P0 P1 F3 none none P0 P2 P3 P0 P2 P3 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P0 P1 F3 P01 P01 P01 P0 inherit fma fma fma fma fp fp fp fp fp fp fp fp fp fp fp Piledriver FPREM FPREM1 1 1 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR 17-60 17-60 1 14-50 1 10-162 60-210 160-170 ~154 12-166 86-141 11-190 166-231 10-355 60-352 8 44 12 7 10 60-73 10-176 10-176 m864 m864 1 1 18 31 103 76 300 236 P0 P0 5-20 0.5 60-146 ~154 86-141 86-204 60-352 5 5 P01 P01 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 P0 P1 P3 0.25 0.25 54 134 300 236 none none P0 P0 P0 P1 P2 P3 P0 P3 fp fp Integer MMX and XMM instructions Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA VMOVDQA VMOVDQA VMOVDQA MOVDQU MOVDQU MOVDQU LDDQU VMOVDQU VMOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ Operands Ops r32/64, mm/x mm/x, r32/64 mm/x,m32 m32,mm/x mm/x,mm/x mm/x,m64 m64,mm/x xmm,xmm xmm,m m,xmm ymm,ymm ymm,m256 m256,ymm xmm,xmm xmm,m m,xmm xmm,m ymm,m256 m256,ymm mm,xmm xmm,mm m,mm 1 2 1 1 1 1 1 1 1 1 2 2 4 1 1 1 1 2 8 1 1 1 Latency Reciprocal throughput 8 10 6 5 2 6 5 0 6 5 2 6 11 0 6 5 6 6 14 2 2 5 Page 58 1 1 0.5 1 0.5 0.5 1 0.25 0.5 1 0.5 1 17 0.25 0.5 1 0.5 1 20 0.5 0.5 2 Execution pipes Notes P3 P3 P23 P3 none inherit 
domain P3 P23 P3 none P3 P2 P3 P23 P23 P3 inherit domain Piledriver MOVNTDQ m,xmm MOVNTDQA xmm,m (x)mm,r/m PACKSSWB/DW (x)mm,r/m PACKUSWB PUNPCKH/LBW/WD/D Q (x)mm,r/m PUNPCKHQDQ xmm,r/m PUNPCKLQDQ xmm,r/m PSHUFB (x)mm,r/m PSHUFD xmm,xmm,i PSHUFW mm,mm,i PSHUFL/HW xmm,xmm,i PALIGNR (x)mm,r/m,i PBLENDW xmm,r/m MASKMOVQ mm,mm MASKMOVDQU xmm,xmm PMOVMSKB r32,mm/x PEXTRB/W/D/Q r,x/mm,i PINSRB/W/D/Q x/mm,r,i EXTRQ x,i,i EXTRQ x,x INSERTQ x,x,i,i INSERTQ x,x PMOVSXBW/BD/BQ/ WD/WQ/DQ x,x PMOVZXBW/BD/BQ/W D/WQ/DQ x,x VPCMOV x,x,x,x/m VPCMOV y,y,y,y/m VPPERM x,x,x,x/m Arithmetic instructions PADDB/W/D/Q/SB/SW /USB/USW PSUBB/W/D/Q/SB/SW/ USB/USW PHADD/SUB(S)W/D PHADD/SUB(S)W/D PCMPEQ/GT B/W/D PCMPEQQ PCMPGTQ PMULLW PMULHW PMULHUW PMULUDQ PMULLD PMULDQ PMULHRSW PMADDWD PMADDUBSW PAVGB/W PMIN/MAX SB/SW/ SD UB/UW/UD PHMINPOSUW 1 1 1 1 5 6 2 2 2 0.5 1 1 P3 1 1 1 1 1 1 1 1 1 31 64 2 2 2 1 1 1 1 2 2 2 3 2 2 2 2 2 36 59 10 10 12 3 1 1 1 1 1 1 1 1 1 1 1 0.5 59 92 1 1 2 1 1 1 1 P1 P1 P1 P1 P1 P1 P1 P1 P23 P3 P1 P3 P1 P3 P1 P3 P1 P1 P1 P1 P1 AMD SSE4A AMD SSE4A AMD SSE4A AMD SSE4A 1 2 1 P1 SSE4.1 1 1 2 1 2 2 2 2 1 1 2 1 P1 P1 P1 P1 SSE4.1 AMD XOP AMD XOP AMD XOP (x)mm,r/m 1 2 0.5 P23 (x)mm,r/m x,x x,m (x)mm,r/m (x)mm,r/m (x)mm,r/m 1 3 4 1 1 1 2 5 5 2 2 2 0.5 2 2 0.5 0.5 0.5 P23 P1 P23 P1 P23 P23 P23 P23 (x)mm,r/m x,r/m x,r/m (x)mm,r/m (x)mm,r/m (x)mm,r/m (x)mm,r/m 1 1 1 1 1 1 1 4 5 4 4 4 4 2 1 2 1 1 1 1 0.5 P0 P0 P0 P0 P0 P0 P23 (x)mm,r/m x,r/m 1 2 2 4 0.5 1 P23 P1 P23 Page 59 P1 P1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSE4.1 SSE4.2 SSE4.1 SSE4.1 SSSE3 SSE4.1 Piledriver PABSB/W/D PSIGNB/W/D PSADBW MPSADBW (x)mm,r/m (x)mm,r/m (x)mm,r/m x,x,i 1 1 2 8 2 2 4 8 0.5 0.5 1 4 P23 P23 P23 P1 P23 VPCOMB/W/D/Q x,x,x/m,i 1 2 0.5 P23 VPCOMUB/W/D/Q VPHADDBW/BD/BQ/ WD/WQ/DQ VPHADDUBW/BD/BQ/ WD/WQ/DQ VPHSUBBW/WD/DQ VPMACSWW/WD VPMACSDD VPMACSDQH/L VPMACSSWW/WD VPMACSSDD VPMACSSDQH/L VPMADCSWD VPMADCSSWD x,x,x/m,i 1 2 0.5 P23 SSE4.1 AMD XOP latency 0 if i=6,7 AMD XOP latency 0 if i=6,7 x,x/m 1 2 0.5 P23 AMD XOP 
x,x/m x,x/m x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x 1 1 1 1 1 1 1 1 1 1 2 2 4 5 4 4 5 4 4 4 0.5 0.5 1 2 1 1 2 1 1 1 P23 P23 P0 P0 P0 P0 P0 P0 P0 P0 AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP (x)mm,r/m 1 2 0.5 P23 (x)mm,r/m 1 3 1 P1 (x)mm,i x,i x,r/m x,x,x/m x,x,i x,x,x/m x,x,x/m 1 1 2 1 1 1 1 2 2 3 2 3 3 1 1 1 1 1 1 1 P1 P1 P1 P3 P1 P1 P1 P1 SSE4.1 AMD XOP AMD XOP AMD XOP AMD XOP x,x,i x,x,i x,x,i x,x,i 27 27 7 7 16 10 13 7 10 10 3 4 P1 P2 P3 P1 P2 P3 P1 P2 P3 P1 P2 P3 SSE4.2 SSE4.2 SSE4.2 SSE4.2 x,x/m,i x,x,x,i x,x,m,i x,x x,x x,x x,x x,x x,x,i 5 6 7 2 2 2 2 1 1 12 12 12 5 5 5 5 5 5 7 7 7 2 2 2 2 1 1 P1 P1 P1 P01 P01 P01 P01 P0 P0 pclmul pclmul pclmul aes aes aes aes aes aes Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ PTEST VPROTB/W/D/Q VPROTB/W/D/Q VPSHAB/W/D/Q VPSHLB/W/D/Q String instructions PCMPESTRI PCMPESTRM PCMPISTRI PCMPISTRM Encryption PCLMULQDQ VPCLMULQDQ PCLMULQDQ AESDEC AESDECLAST AESENC AESENCLAST AESIMC AESKEYGENASSIST Page 60 SSSE3 SSSE3 Piledriver Other EMMS 1 0.25 Floating point XMM and YMM instructions Instruction Move instructions MOVAPS/D MOVUPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D MOVNTSS/SD SHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD VBLENDPS/PD BLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 Operands Ops Latency Reciprocal throughput x,x y,y 1 2 0 2 0.25 0.5 x,m128 1 6 0.5 y,m256 2 6 1 m128,x m256,y m256,y x,x x,m32/64 m32/64,x x,m64 x,m64 m64,x m64,x x,x r32,x r32,y m128,x m256,y m,x x,x/m,i y,y,y/m,i x,x,x/m y,y,y/m x,x/m,i y,y/m,i y,y,y,i y,y,m,i x,x/m,i y,y,y/m,i x,x/m,xmm0 y,y,y/m,y x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 1 
4 8 1 1 1 1 1 2 1 1 2 2 1 4 1 1 2 1 2 1 2 8 10 1 2 1 2 1 1 2 2 1 2 2 2 5 11 15 2 6 5 8 7 7 6 2 10 1 17 20 0.5 0.5 1 1 0.5 1 1 1 1 1 2 18 4 1 2 1 2 1 2 3 4 0.5 1 1 2 1 0.5 2 1 0.5 0.5 0.5 0.5 5 2 2 3 3 2 2 4 2 2 2 2 2 2 6 6 6 6 Page 61 Execution pipes Domain, notes none P23 inherit domain ivec P3 P3 P2 P3 P01 fp P1 P01 P1 P3 P3 P1 P1 P3 ivec P3 P3 P1 P1 P1 P1 P1 P1 P23 P23 P23 P23 P1 P1 P1 AMD SSE4A ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec ivec P1 ivec P23 P23 P23 Piledriver MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D x,x x,m128 y,y y,m256 x,x/m y,y,y/m r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 m128,x,x m256,y,y 1 1 2 2 1 2 2 2 1 2 1 1 2 2 1 2 18 34 Conversion CVTPD2PS VCVTPD2PS CVTPS2PD VCVTPS2PD CVTSD2SS CVTSS2SD CVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI VCVTPS2PH VCVTPS2PH VCVTPH2PS VCVTPH2PS x,x x,y x,x y,x x,x x,x x,x y,y x,x y,y x,x y,x x,x x,y x,mm mm,x x,mm mm,x x,r32 r32,x x,r32/64 r32/64,x x/m,x,i x/m,y,i x,x/m y,x/m Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D 2 6 2 6 2 7 2 13 7 13 ~100 ~190 1 0.5 2 1 1 2 1 1 0.5 1 1 2 1 1 0.5 1 ~90 ~180 P1 ivec P1 ivec P1 P1 P1 P3 P1 P3 P23 P23 P1 P1 P23 P23 P01 P01 P0 P1 P2 P3 P0 P1 P2 P3 ivec ivec 2 4 2 4 1 1 1 2 1 2 2 4 2 4 2 1 2 2 2 2 2 2 2 4 2 4 8 7 8 8 4 4 4 4 4 4 8 8 8 7 8 4 7 7 13 12 13 12 8 8 8 8 1 2 1 2 1 1 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 2 2 2 2 P01 P01 P01 P01 P0 P0 P0 P0 P0 P0 P01 P01 P01 P01 P0 P23 P0 P0 P1 P0 P1 P0 P0 P3 P0 P0 P3 P0 P1 P0 P1 P0 P1 P0 P1 ivec/fp ivec/fp ivec/fp ivec/fp fp fp fp fp fp fp ivec/fp ivec/fp fp/ivec fp/ivec ivec/fp fp ivec/fp fp/ivec fp fp fp fp F16C F16C F16C F16C x,x/m x,x/m 1 1 5-6 5-6 0.5 0.5 
P01 P01 fma fma y,y,y/m x,x/m 2 1 5-6 5-6 1 0.5 P01 P01 fma fma 2 2 2 Page 62 ivec ivec Piledriver VADDSUBPS/D y,y,y/m 2 5-6 1 P01 fma HADDPS/D HSUBPS/D x,x 3 10 2 P01 P1 ivec/fma HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D MULSS MULSD MULPS MULPD VMULPS VMULPD DIVSS DIVPS VDIVPS DIVSD DIVPD VDIVPD RCPSS/PS VRCPPS CMPSS/D CMPPS/D VCMPPS/D COMISS/D UCOMISS/D MAXSS/SD/PS/PD MINSS/SD/PS/PD x,m 4 2 P01 P1 ivec/fma y,y,y/m x,x/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y/m 8 1 1 2 1 2 1 2 1 2 10 5-6 5-6 5-6 9-24 9-24 9-27 9-27 5 5 4 0.5 0.5 1 5-10 9-20 5-10 9-18 1 2 P01 P1 P01 P01 P01 P01 P01 P01 P01 P01 P01 ivec/fma fma fma fma fp fp fp fp fp fp x,x/m y,y,y/m 1 2 2 2 0.5 1 P01 P01 fp fp x,x/m 2 1 P01 P3 fp x,x/m 1 2 0.5 P01 fp VMAXPS/D VMINPS/D y,y,y/m 2 ROUNDSS/SD/PS/PD x,x/m,i 1 VROUNDSS/SD/PS/ PD y,y/m,i 2 DPPS x,x,i 16 DPPS x,m,i 18 VDPPS y,y,y,i 25 VDPPS y,m,i 29 DPPD x,x,i 15 DPPD x,m,i 17 VFMADD132SS/SD x,x,x/m 1 VFMADD132PS/PD x,x,x/m 1 VFMADD132PS/PD y,y,y/m 2 All other FMA3 instructions: same as above VFMADDSS/SD x,x,x,x/m 1 VFMADDPS/PD x,x,x,x/m 1 VFMADDPS/PD y,y,y,y/m 2 All other FMA4 instructions: same as above 2 4 1 1 P01 P0 fp fp 4 25 5-6 5-6 5-6 2 6 7 13 13 5 6 1 1 1 P0 P01 P23 P01 P23 P01 P3 P01 P3 P01 P23 P01 P23 P01 P01 P01 5-6 5-6 5-6 0.5 0.5 1 P01 P01 P01 fp SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 FMA3 FMA3 FMA3 FMA3 AMD FMA4 AMD FMA4 AMD FMA4 AMD FMA4 13-15 14-15 24-26 24-26 5 5 10 10 5-12 9-24 5-15 9-29 1 2 2 2 P01 P01 P01 P01 P01 P01 P01 P01 Math SQRTSS/PS VSQRTPS SQRTSD/PD VSQRTPD RSQRTSS/PS VRSQRTPS VFRCZSS/SD/PS/PD VFRCZSS/SD/PS/PD x,x/m y,y/m x,x/m y,y/m x,x/m y,y/m x,x x,m 1 2 1 2 1 2 2 3 27 15 Page 63 fp fp fp fp fp fp AMD XOP AMD XOP Piledriver Logic AND/ANDN/OR/XORPS/ PD x,x/m 1 2 0.5 P23 ivec VAND/ANDN/OR/XOR PS/PD y,y,y/m 2 2 1 P23 ivec 136 176 196 250 4 5 6 10 34 17 136 176 196 250 P2 P3 P2 P3 P2 P3 P2 P3 P0 P3 P0 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3 32 bit mode 64 bit mode 32 bit mode 64 bit mode m32 m32 m4096 
m4096 m m 9 16 17 32 7 2 67 116 122 177 Other VZEROUPPER VZEROUPPER VZEROALL VZEROALL LDMXCSR STMXCSR FXSAVE FXRSTOR XSAVE XRSTOR

AMD Steamroller
List of instruction timings and macro-operation breakdown

Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe or unit is used for the macro-operations:
Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1
Floating point and vector pipes:
P0: floating point add, mul, div. Integer add, mul, bool
P1: floating point add, mul, div. Shuffle, shift, pack
P2: Integer add. Bool, store
P01: can use either P0 or P1
P02: can use either P0 or P2
Two macro-operations can execute simultaneously if they go to different execution pipes.

Domain: Tells which execution unit domain is used:
ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.
There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.
An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
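The latency and throughput rules in the column explanations can be expressed as a small calculation. The Python sketch below is an illustration only, not a measurement tool: the function names (crossing_penalty, fma_latency, chain_cycles, throughput_cycles) are hypothetical and not part of this manual, and the numbers passed in the usage lines are example inputs rather than values for any particular instruction. It assumes that a dependency chain is limited only by latency and that independent instructions are limited only by reciprocal throughput, ignoring the other pipeline bottlenecks mentioned above.

```python
# Illustrative sketch of the domain and latency rules described above.
# All names here are hypothetical helpers, not part of the manual.

FP_LIKE = {"fp", "fma"}


def crossing_penalty(producer: str, consumer: str) -> int:
    """Extra clock cycles when a result moves between execution domains.

    One cycle is added when an ivec result feeds an fp or fma instruction,
    or when an fp or fma result feeds an ivec or store instruction.
    There is no extra latency between the fp and fma units.
    """
    if producer == "ivec" and consumer in FP_LIKE:
        return 1
    if producer in FP_LIKE and consumer in {"ivec", "store"}:
        return 1
    return 0


def fma_latency(consumer: str) -> int:
    """Latency of an fma instruction, depending on where its output goes."""
    if consumer == "fma":
        return 5
    if consumer == "fp":
        return 6
    if consumer in {"ivec", "store"}:
        return 6 + 1  # fp-level latency plus the domain crossing cycle
    raise ValueError(f"unknown consumer domain: {consumer!r}")


def chain_cycles(count: int, latency: float) -> float:
    """Lower bound for a chain where each instruction needs the previous result."""
    return count * latency


def throughput_cycles(count: int, reciprocal_throughput: float) -> float:
    """Lower bound for independent instructions of the same kind."""
    return count * reciprocal_throughput


if __name__ == "__main__":
    # Example inputs (assumed for illustration): 100 fma-domain instructions
    # with reciprocal throughput 0.5, first as a chain, then independent.
    print(chain_cycles(100, fma_latency("fma")))
    print(throughput_cycles(100, 0.5))
    print(crossing_penalty("ivec", "fma"), crossing_penalty("fp", "fma"))
```

The two estimates bracket what a loop can achieve: code with long dependency chains tends toward the latency figure, while code with many independent operations tends toward the reciprocal throughput figure.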
Integer instructions Page 65 Steamroller Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX MOVSX MOVZX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG XCHG XCHG XCHG XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POPF(D/Q) POPA(D) POP LEA LEA LEA LEA LAHF SAHF SALC BSWAP PREFETCHNTA PREFETCHT0/1/2 PREFETCH/W SFENCE LFENCE MFENCE Operands Ops r8,r8 r16,r16 r32,r32 r64,r64 r,i r,m m,r m,i m,r r,r r,m r,m r64,r32 r64,m32 r,r r,m r8,r8 r16,r16 r32,r32 r64,r64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 3 4 r,m 2 2 1 1 2 8 9 1 2 34 14 1 2 1 ~38 6 1 1 4 2 1 1 1 1 1 7 1 7 2 1 3 2 1 1 r i m r m sp r16,[m] r32,[m] r32/64,[m] r32/64,[m] r m m m Latency Reciprocal throughput 4 1 5 4 1 5 1 1 1 1 1 0.5 0.5 0.25 0.25 0.5 0.5 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 EX01 EX01 EX01 or AG01 EX01 or AG01 EX01 AG01 EX01 AG01 ~38 2 1 1 1 4 9 1 1 19 8 EX01 2 2-3 2 Arithmetic instructions Page 66 Execution pipes all addr. modes all addr. modes EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 0.5 0.5 2 1 1 0.5 0.5 0.5 0.5 ~80 0.25 ~80 Notes EX01 EX01 Timing depends on hw any addr. size 16 bit addr. 
size scale factor > 1 or 3 operands all other cases EX01 PREFETCHW Steamroller ADD, SUB ADD, SUB ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP CMP CMP INC, DEC, NEG INC, DEC, NEG AAA, AAS DAA DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CDQ, CQO CWD Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST r,r r,i r,m m,r m,i r,r r,i r,m m,r m,i r,r r,i r,m m,i r m r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r16,m16,i r32,m32,i r64,m64,i r8/m8 r16/m16 r32/m32 r64/m64 r8/m8 r16/m16 r32/m32 r64/m64 r,r r,i r,m m,r m,i r,r 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10 16 20 4 10 1 2 1 1 1 1 1 2 1 1 2 2 2 9 7 2 2 9 7 2 2 1 1 2 1 1 1 1 1 1 1 1 7 7 1 1 1 9 9 1 1 1 7 6 8 10 6 15 4 4 4 6 4 4 6 5 4 6 17-22 15-25 13-39 13-70 17-22 14-25 13-39 13-70 1 1 1 1 1 7 7 1 Page 67 0.5 0.5 0.5 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 1 15 2 2 2 4 2 2 4 2 2 4 2 2 4 13-17 15-25 13-39 13-70 13-17 14-24 13-39 13-70 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 0.5 1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX0 EX01 EX01 EX01 0.5 0.5 0.5 1 1 0.5 EX01 EX01 EX01 EX01 EX01 EX01 Steamroller TEST TEST TEST NOT NOT ANDN SHL, SHR, SAR ROL, ROR RCL RCL RCL RCR RCR RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC, BTR, BTS BTC, BTR, BTS BSF BSF BSR BSR SETcc SETcc CLC, STC CMC CLD STD POPCNT POPCNT LZCNT TZCNT BEXTR BEXTR BLSI BLSMSK BLSR BLCFILL BLCI BLCIC BLCMSK BLCS BLSFILL BLSI BLSIC T1MSKC TZMSK r,i m,r m,i r m r,r,r r,i/CL r,i/CL r,1 r,i r,cl r,1 r,i r,cl r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,r r,r r,m r,r r,m r m r16/32,r16/32 r64,r64 r,r r,r r,r,r r,r,i r,r r,r r,r r,r r,r r,r r,r r,r r,r r,r r,r r,r r,r 1 1 1 1 1 1 1 1 1 16 17 1 15 16 6 7-8 8 1 1 7 2 4 10 6 8 7 9 1 1 1 1 2 2 1 1 1 2 2 2 2 2 2 
2 2 2 2 2 2 2 2 2 2 1 1 7 1 1 1 1 7 7 1 7 7 3 4 1 2 3 4 4 1 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 3 4 4 0.5 0.5 3.5 1 2 5 3 4 4 5 0.5 1 0.5 1 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Page 68 3 4 2 4 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX01 EX0 BMI1 SSE4.2 SSE4.2 LZCNT BMI1 BMI1 AMD TBM BMI1 BMI1 BMI1 AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM AMD TBM Steamroller Control transfer instructions JMP short/near JMP r JMP m Jcc short/near fused CMP+Jcc short/near J(E/R)CXZ short LOOP short LOOPE LOOPNE short CALL near CALL r CALL m RET RET i BOUND m INTO String instructions LODS REP LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Synchronization LOCK ADD XADD LOCK XADD CMPXCHG CMPXCHG CMPXCHG LOCK CMPXCHG LOCK CMPXCHG LOCK CMPXCHG CMPXCHG8B LOCK CMPXCHG8B CMPXCHG16B LOCK CMPXCHG16B Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER m8/m16 m32/m64 1 1 1 1 1 1 1 1 2 2 3 1 4 11 4 2 2 2 1-2 1-2 1-2 1-2 1-2 2 2 2 2 2 5 2 3 6n 6n 3 1n 3 per 16B 5 ~1n 4-5 pr 16B 3 7n 6 9n 3 3n 2.5n 3 ~1n 2 per 16B 3 ~1n ~2 per 16B 3 3-4n 3 4n m,r m,r m,r m,r8 m,r16 m,r32/64 m8,r8 m16,r16 m,r32/64 m64 m64 m128 m128 1 4 4 5 6 6 5 6 6 18 18 24 24 a,0 a,b 1 1 8 13 11+5b EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 EX1 for no jump for no jump small n best case small n best case ~39 9-12 ~39 15 15 13 ~40 ~40 ~40 ~14 ~42 ~47 ~80 0.25 0.25 4 21 20-30 Page 69 2 if jumping 2 if jumping 2 if jumping 2 if jumping 2 if jumping none none Steamroller LEAVE CPUID XGETBV RDTSC RDTSCP RDPMC CRC32 CRC32 CRC32 r32,r8 r32,r16 r32,r32 2 38-64 4 44 44 22 3 5 7 3 5 6 3 100-300 30 78 105 360 2 5 6 rdtscp Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(T)(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FLDCW FNSTCW Arithmetic 
instructions FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Operands Ops r m32/64 m80 m80 r m32/64 m80 m80 r m m 1 1 8 60 1 2 13 239 1 1 2 1 8 1 1 3 2 1 2 2 7 11 52 2 7 14 222 0 11 7 1 2 1 2 1 1 2 1 1 1 2 2 1 1 1 1 5 st0,r r AX m16 m16 m16 r/m m r/m m r m m r/m r m Latency Reciprocal throughput 3 0 11 5 9-37 2 2 26 4 17-60 Page 70 0.5 1 4 34 0.5 1 19 222 0.5 1 1 0.5 3 0.25 0.25 19 17 3 2 1 2 1 2 4-16 4 0.5 0.5 0.5 1 1 0.5 0.5 1 12-53 Execution pipes Domain, notes P01 fp fp fp fp fp fp fp fp inherit fp fp fp fp P0 P1 P2 P01 P0 P1 P2 P01 P01 P0 P2 P01 P0 P1 P2 none none P0 P2 P0 P2 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P01 P2 P01 P01 P01 P0 P0 inherit fma fma fma fma fp fp fp fp fp fp fp fp fp fp fp fp Steamroller Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR 1 1 10-164 18-166 12-168 11-192 10-365 10 12 10-18 9-183 206 m864 m864 1 1 18 31 98 73 10-50 60-210 76-158 90-245 60-440 49 8 60-74 60-280 ~390 256 166 5-20 0.5 60-165 90-165 90-210 60-365 5 5 0.25 0.25 63 131 256 166 P01 P01 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 none none P0 P0 P0 P1 P2 P0 P2 Integer MMX and XMM instructions Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA VMOVDQA VMOVDQA VMOVDQA MOVDQU MOVDQU MOVDQU LDDQU VMOVDQU VMOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA Operands Ops r32/64, mm/x mm/x, r32/64 mm/x,m32 m32,mm/x mm/x,mm/x mm/x,m64 m64,mm/x xmm,xmm xmm,m m,xmm ymm,ymm ymm,m256 m256,ymm xmm,xmm xmm,m m,xmm xmm,m ymm,m256 m256,ymm mm,xmm xmm,mm m,mm m,xmm xmm,m 1 2 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 2 1 1 1 1 1 Latency Reciprocal throughput 4 5 2 3 2 2 3 0 2 3 2 3 4 0 2 3 2 3 4 1 1 3 3 2 Page 71 1 1 0.5 1 0.5 0.5 1 0.25 0.5 1 0.5 1 1 0.25 0.5 1 0.5 1 1 0.5 0.5 1 1 0.5 Execution pipes Notes P2 P02 
none inherit domain P2 P02 P2 none P2 P02 P02 P2 P2 inherit domain Steamroller (x)mm,r/m PACKSSWB/DW (x)mm,r/m PACKUSWB PUNPCKH/LBW/WD/D Q (x)mm,r/m PUNPCKHQDQ xmm,r/m PUNPCKLQDQ xmm,r/m PSHUFB (x)mm,r/m PSHUFD xmm,xmm,i PSHUFW mm,mm,i PSHUFL/HW xmm,xmm,i PALIGNR (x)mm,r/m,i PBLENDW xmm,r/m MASKMOVQ mm,mm MASKMOVDQU xmm,xmm PMOVMSKB r32,mm/x PEXTRB/W/D/Q r,x/mm,i PINSRB/W/D/Q x/mm,r,i EXTRQ x,i,i EXTRQ x,x INSERTQ x,x,i,i INSERTQ x,x PMOVSXBW/BD/BQ/ WD/WQ/DQ x,x PMOVZXBW/BD/BQ/W D/WQ/DQ x,x VPCMOV x,x,x,x/m VPCMOV y,y,y,y/m VPPERM x,x,x,x/m Arithmetic instructions PADDB/W/D/Q/SB/SW /USB/USW PSUBB/W/D/Q/SB/SW/ USB/USW PHADD/SUB(S)W/D PCMPEQ/GT B/W/D PCMPEQQ PCMPGTQ PMULLW PMULHW PMULHUW PMULUDQ PMULLD PMULDQ PMULHRSW PMADDWD PMADDUBSW PAVGB/W PMIN/MAX SB/SW/ SD UB/UW/UD PHMINPOSUW PABSB/W/D PSIGNB/W/D PSADBW 1 1 2 2 1 1 P1 P1 1 1 1 1 1 1 1 1 1 31 65 2 2 2 1 1 1 1 2 2 2 3 2 2 2 2 2 32 45 5 5 6 3 1 1 1 1 1 1 1 1 1 1 1 0.5 16 31 1 1 1 1 1 1 1 P1 P1 P1 P1 P1 P1 P1 P1 P02 P2 P0 P1 P2 P1 P2 P1 P2 P1 P1 P1 P1 P1 AMD SSE4A AMD SSE4A AMD SSE4A AMD SSE4A 1 2 1 P1 SSE4.1 1 1 2 1 2 2 2 2 1 1 2 1 P1 P1 P1 P1 SSE4.1 AMD XOP AMD XOP AMD XOP (x)mm,r/m 1 2 0.5 P02 (x)mm,r/m x,x (x)mm,r/m (x)mm,r/m (x)mm,r/m 1 3 1 1 1 2 5 2 2 2 0.5 2 0.5 0.5 0.5 P02 P02 2P1 P02 P02 P02 (x)mm,r/m x,r/m x,r/m (x)mm,r/m (x)mm,r/m (x)mm,r/m (x)mm,r/m 1 1 1 1 1 1 1 4 5 4 4 4 4 2 1 2 1 1 1 1 0.5 P0 P0 P0 P0 P0 P0 P02 (x)mm,r/m x,r/m (x)mm,r/m (x)mm,r/m (x)mm,r/m 1 2 1 1 2 2 4 2 2 4 0.5 1 0.5 0.5 1 P02 P1 P02 P02 P02 P02 Page 72 SSE4.1 SSE4.1 SSSE3 SSE4.1 SSE4.2 SSE4.1 SSE4.1 SSSE3 SSE4.1 SSSE3 SSSE3 Steamroller MPSADBW x,x,i 8 8 4 P1 P02 VPCOMB/W/D/Q x,x,x/m,i 1 2 0.5 P02 VPCOMUB/W/D/Q VPHADDBW/BD/BQ/ WD/WQ/DQ VPHADDUBW/BD/BQ/ WD/WQ/DQ VPHSUBBW/WD/DQ VPMACSWW/WD VPMACSDD VPMACSDQH/L VPMACSSWW/WD VPMACSSDD VPMACSSDQH/L VPMADCSWD VPMADCSSWD x,x,x/m,i 1 2 0.5 P02 SSE4.1 AMD XOP latency 0 if i=6,7 AMD XOP latency 0 if i=6,7 x,x/m 1 2 0.5 P02 AMD XOP x,x/m x,x/m x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x x,x,x/m,x 
x,x,x/m,x x,x,x/m,x x,x,x/m,x 1 1 1 1 1 1 1 1 1 1 2 2 4 5 4 4 5 4 4 4 0.5 0.5 1 2 1 1 2 1 1 1 P02 P02 P0 P0 P0 P0 P0 P0 P0 P0 AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP AMD XOP (x)mm,r/m 1 2 0.5 P02 (x)mm,r/m 1 3 1 P1 (x)mm,i x,i x,r/m x,x,x/m x,x,i x,x,x/m x,x,x/m 1 1 2 1 1 1 1 2 2 14 3 2 3 3 1 1 1 1 1 1 1 P1 P1 P1 P2 P1 P1 P1 P1 SSE4.1 AMD XOP AMD XOP AMD XOP AMD XOP x,x,i x,x,i x,x,i x,x,i 30 30 9 8 11 10 5 6 11 10 5 6 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 x,x/m,i x,x,x,i x,x,m,i x,x x,x x,x x,x x,x x,x,i 7 7 8 2 2 2 2 1 1 11 11 7 7 7 1 1 1 1 1 1 P1 P1 P1 P01 P01 P01 P01 P0 P0 pclmul pclmul pclmul aes aes aes aes aes aes Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ PTEST VPROTB/W/D/Q VPROTB/W/D/Q VPSHAB/W/D/Q VPSHLB/W/D/Q String instructions PCMPESTRI PCMPESTRM PCMPISTRI PCMPISTRM Encryption PCLMULQDQ VPCLMULQDQ PCLMULQDQ AESDEC AESDECLAST AESENC AESENCLAST AESIMC AESKEYGENASSIST Other EMMS 5 5 5 5 5 5 1 0.25 Page 73 Steamroller Floating point XMM and YMM instructions Instruction Move instructions MOVAPS/D MOVUPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D MOVNTSS/SD SHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD VBLENDPS/PD BLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP Operands Ops Latency Reciprocal throughput x,x y,y 1 2 0 2 0.25 0.5 x,m128 1 2 0.5 y,m256 2 2 1 m128,x m256,y m256,y x,x x,m32/64 m32/64,x x,m64 x,m64 m64,x m64,x x,x r32,x r32,y m128,x m256,y m,x x,x/m,i y,y,y/m,i x,x,x/m y,y,y/m x,x/m,i y,y/m,i y,y,y,i y,y,m,i x,x/m,i y,y,y/m,i x,x/m,xmm0 y,y,y/m,y x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 x,x x,m128 1 2 2 1 1 1 1 1 2 1 1 2 2 1 2 1 1 2 1 2 1 2 
8 12 1 2 1 2 1 1 2 2 1 2 2 2 1 1 3 3 3 2 2 3 3 3 4 3 2 5 15 3 3 1 2 2 0.5 0.5 1 1 0.5 1 1 1 1 1 1 2-3 3 1 2 1 2 1 2 3.5 4 0.5 1 0.5 1 1 0.5 2 1 0.5 0.5 0.5 0.5 1 0.5 2 2 3 3 2 2 4 2 2 2 2 2 2 8 8 8 8 2 Page 74 Execution pipes Domain, notes none P02 inherit domain ivec P2 P2 P2 P01 fp P2 P1 P01 P1 P2 P2 P1 P1 P2 P1 P2 P2 P2 P2 P2 P2 P1 P1 P1 P1 P0 P2 P0 P2 P01 P01 P01 P01 P1 ivec AMD SSE4A ivec ivec ivec ivec ivec ivec ivec ivec fp fp ivec P1 ivec P02 P02 P02 P1 ivec Steamroller VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D y,y y,m256 x,x/m y,y,y/m r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 m128,x,x m256,y,y 2 2 1 2 2 2 1 2 1 1 2 2 1 2 20 41 Conversion CVTPD2PS VCVTPD2PS CVTPS2PD VCVTPS2PD CVTSD2SS CVTSS2SD CVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI VCVTPS2PH VCVTPS2PH VCVTPH2PS VCVTPH2PS x,x x,y x,x y,x x,x x,x x,x y,y x,x y,y x,x y,x x,x x,y x,mm mm,x x,mm mm,x x,r32 r32,x x,r32/64 r32/64,x x/m,x,i x/m,y,i x,x/m y,x/m Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D VADDSUBPS/D 2 P1 ivec P1 P1 P1 P2 P1 P2 P02 P0 P2 P1 P1 P02 P02 P01 P01 P0 P1 P2 P0 P1 P2 ivec ivec 10 2 10 2 9 2 10 9 9 ~35 ~35 2 1 1 2 1 1 0.5 1 1 2 1 1 0.5 1 8 16 2 4 2 4 1 1 1 2 1 2 2 4 2 4 2 1 2 2 2 2 2 2 2 4 2 4 6 6 6 6 4 4 4 4 4 4 7 7 7 7 6 5 7 7 13 12 12 12 7 7 7 7 1 2 1 2 1 1 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 2 2 2 2 P01 P01 P01 P01 P0 P0 P0 P0 P0 P0 P01 P01 P01 P01 P0 P2 P0 P0 P1 P0 P1 P0 P0 P2 P0 P0 P2 P0 P1 P0 P1 P0 P1 P0 P1 ivec/fp ivec/fp ivec/fp ivec/fp fp fp fp fp fp fp ivec/fp ivec/fp fp/ivec fp/ivec ivec/fp fp ivec/fp fp/ivec fp fp fp fp F16C F16C F16C F16C x,x/m x,x/m 1 1 5-6 5-6 1 1 P01 P01 fma fma y,y,y/m x,x/m y,y,y/m 2 1 2 5-6 5-6 5-6 2 1 1 P01 P01 
P01 fma fma fma 2 2 Page 75 ivec ivec Steamroller HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D MULSS MULSD MULPS MULPD VMULPS VMULPD DIVSS DIVPS VDIVPS DIVSD DIVPD VDIVPD RCPSS/PS VRCPPS CMPSS/D CMPPS/D VCMPPS/D COMISS/D UCOMISS/D MAXSS/SD/PS/PD MINSS/SD/PS/PD x,x 4 10 2 P0 P1 ivec/fma y,y,y/m x,x/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y,y/m x,x/m y,y/m 8 1 1 2 1 2 1 2 1 2 10 5-6 5-6 5-6 9-17 9-17 9-32 9-32 5 5 4 0.5 0.5 1 4-6 9-12 4-13 9-27 1 2 P01 P1 P01 P01 P01 P01 P01 P01 P01 P01 P01 ivec/fma fma fma fma fp fp fp fp fp fp x,x/m y,y,y/m 1 2 2 2 0.5 1 P01 P01 fp fp x,x/m 2 1 P01 P2 fp x,x/m 1 2 0.5 P01 fp VMAXPS/D VMINPS/D y,y,y/m 2 ROUNDSS/SD/PS/PD x,x/m,i 1 VROUNDSS/SD/PS/ PD y,y/m,i 2 DPPS x,x,i 9 DPPS x,m,i 10 VDPPS y,y,y,i 13 VDPPS y,m,i 15 DPPD x,x,i 7 DPPD x,m,i 8 VFMADD132SS/SD x,x,x/m 1 VFMADD132PS/PD x,x,x/m 1 VFMADD132PS/PD y,y,y/m 2 All other FMA3 instructions: same as above VFMADDSS/SD x,x,x,x/m 1 VFMADDPS/PD x,x,x,x/m 1 VFMADDPS/PD y,y,y,y/m 2 All other FMA4 instructions: same as above 2 4 1 1 P01 P0 fp fp 4 25 5-6 5-6 5-6 2 4 5 8 8 3 4 0.5 0.5 1 P0 P0 P1 P0 P1 P0 P1 P0 P1 P0 P1 P0 P1 P01 P01 P01 5-6 5-6 5-6 0.5 0.5 1 P01 P01 P01 fp SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 FMA3 FMA3 FMA3 FMA3 AMD FMA4 AMD FMA4 AMD FMA4 AMD FMA4 Math SQRTSS/PS VSQRTPS SQRTSD/PD VSQRTPD RSQRTSS/PS VRSQRTPS VFRCZSS/SD/PS/PD VFRCZSS/SD/PS/PD 25 14 x,x/m y,y/m x,x/m y,y/m x,x/m y,y/m x,x x,m 1 2 1 2 1 2 2 4 12-13 12-13 26-29 27-28 5 5 10 4-9 9-18 4-18 9-37 1 2 2 2 P01 P01 P01 P01 P01 P01 P01 P01 fp fp fp fp fp fp AMD XOP AMD XOP x,x/m 1 2 0.5 P02 ivec Logic AND/ANDN/OR/XORPS/ PD Page 76 Steamroller VAND/ANDN/OR/XOR PS/PD Other VZEROUPPER VZEROUPPER VZEROALL VZEROALL LDMXCSR STMXCSR FXSAVE FXRSTOR XSAVE XRSTOR y,y,y/m 2 m32 m32 m4096 m4096 m m 9 16 17 32 9 2 59-67 104-112 121-137 191-209 2 1 4 5 6 10 36 17 78 160 147-166 291-297 Page 77 P02 P02 P02 P0 P2 P0 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 P0 P1 P2 ivec 32 bit mode 64 bit mode 32 bit mode 64 bit mode Bobcat AMD Bobcat List of 
instruction timings and macro-operation breakdown

Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of micro-operations issued from instruction decoder to schedulers. Instructions with more than 2 micro-operations are micro-coded.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m). The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value.

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes.
Two micro-operations can execute simultaneously if they go to different execution pipes. Integer instructions Instruction Move instructions MOV MOV MOV MOV MOV MOV MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVSXD MOVSXD CMOVcc CMOVcc XCHG XCHG Operands r,r r,i r,m m,r m8,r8H m,i m,r r,r r,m r64,r32 r64,m32 r,r r,m r,r r,m Ops Latency Reciprocal throughput 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 4 4 7 6 1 5 1 5 1 1 20 Page 78 0.5 0.5 1 1 1 1 1 0.5 1 0.5 1 0.5 1 1 Execution pipe I0/1 I0/1 AGU AGU AGU AGU AGU I0/1 Notes Any addr. mode Any addr. mode AH, BH, CH, DH I0/1 I0/1 Timing dep. on hw Bobcat XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LAHF SAHF SALC BSWAP PREFETCHNTA PREFETCHT0/1/2 PREFETCH SFENCE LFENCE MFENCE r i m r m r16,[m] r32/64,[m] r32/64,[m] r64,[m] r m m m Arithmetic instructions ADD, SUB r,r/i ADD, SUB r,m ADD, SUB m,r ADC, SBB r,r/i ADC, SBB r,m ADC, SBB m,r/i CMP r,r/i CMP r,m INC, DEC, NEG r INC, DEC, NEG m AAA AAS DAA DAS AAD AAM MUL, IMUL r8/m8 MUL, IMUL r16/m16 MUL, IMUL r32/m32 MUL, IMUL r64/m64 IMUL r16,r16/m16 IMUL r32,r32/m32 IMUL r64,r64/m64 IMUL r16,(r16),i IMUL r32,(r32),i IMUL r64,(r64),i DIV r8/m8 2 1 1 3 9 9 1 4 29 9 2 1 1 1 4 1 1 1 1 1 1 4 1 4 1 1 1 1 1 1 1 1 1 1 9 9 12 16 4 33 1 3 2 2 1 1 1 2 1 1 1 5 3 1 2-4 4 1 1 1 1 1 6-7 1 1 6 5 10 7 8 5 23 3 3-5 3-4 6-7 3 3 6 4 3 7 27 Page 79 1 1 2 6 9 1 4 22 8 2 0.5 1 0.5 2 0.5 I0 I0/1 I0 I0/1 I0/1 0.5 1 1 1 ~45 1 ~45 I0/1 AGU AGU AGU AGU AGU AGU 0.5 1 1 1 1 I0/1 0.5 1 0.5 I0/1 23 1 2 1 1 4 3 1 4 27 Any address size no scale, no offset w. 
scale or offset RIP relative AMD only I0/1 I0/1 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 latency ax=3, dx=5 latency eax=3, edx=4 latency rax=6, rdx=7 Bobcat DIV DIV DIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL, ROR RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT BTC, BTR, BTS BTC BTR, BTS BTC BTR, BTS BSF, BSR BSF, BSR POPCNT LZCNT SETcc SETcc CLC, STC CMC CLD STD r16/m16 r32/m32 r64/m64 r8/m8 r16/m16 r32/m32 r64/m64 1 1 1 1 1 1 1 1 1 33 49 81 29 37 55 81 1 1 33 49 81 29 37 55 81 I0 I0 I0 I0 I0 I0 I0 I0/1 I0/1 r,r r,m m,r r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL 1 1 1 1 1 1 1 1 1 1 9 7 9 9 1 0.5 1 1 0.5 1 0.5 1 0.5 0.5 1 5 4 5 4 I0/1 m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r r,r/i m,i m,i m,r m,r r,r r,m r,r/m r,r/m r m 1 1 10 9 9 8 6 7 8 1 1 5 2 5 4-5 8 8 11 11 9 8 1 1 1 1 1 2 7 7 1 1 1 1 1 5 4 6 5 18 3 4 18 2 16 15 6 12 5 1 1 Page 80 I0/1 I0/1 I0/1 I0/1 I0/1 1 1 ~15 ~14 15 15 3 4 15 0.5 1 3 1 15 15 13 15 6 6 5 0.5 1 0.5 0.5 1 2 SSE4.A/SSE4.2 SSE4.A, AMD only I0/1 I0/1 I0 I0,I1 Bobcat Control transfer instructions JMP short/near JMP r JMP m(near) Jcc short/near J(E/R)CXZ short LOOP short CALL near CALL r CALL m(near) RET RET i BOUND m INTO 1 1 1 1 2 8 2 2 5 1 4 8 4 2 2 2 1/2 - 2 1-2 4 2 2 2 ~3 ~4 4 2 String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS 4 5 4 2 7 2 5 6 7 6 ~3 ~3 2 Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC recip. t. = 2 if jump recip. t. 
= 2 if jump values for no jump values for no jump values are per count best case 6-7 B/clk 5 best case 5 B/clk 3 3 4 3 1 0 1 0 6 i,0 12 a,b 10+6b 2 30-52 70-830 26 14 0.5 0.5 6 values are per count values are per count I0/1 I0/1 36 34+6b 3 32 bit mode 87 8 Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH Operands r m32/64 m80 m80 r m32/64 m80 m80 r Ops Latency Reciprocal throughput 1 1 7 21 1 1 16 217 1 2 6 14 30 2 6 19 177 0 Page 81 0.5 1 5 35 0.5 1 9 180 1 Execution pipe FP0/1 FP0/1 FP0/1 FP1 FP1 Notes Bobcat FILD FIST(T)(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FMUL(P) FIMUL FDIV(R)(P) FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 m m st0,r r AX m16 m16 m16 r m m r m m r m m r m r m Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE 1 1 1 12 1 1 2 2 3 12 1 1 2 1 1 2 1 1 2 1 1 1 1 1 2 1 2 5 1 1 9 6 7 1 ~20 ~20 3 3 5 5 19 2 2 m 1 1 3 3 3 19 19 19 2 1 1 1 2 1 1 2 11 11-16 11-19 1 31 1 4-44 27-105 11-51 51-94 11-75 48-110 ~45 ~113 9-75 49-163 5 8 7 9 30-56 ~60 8 29 12 44 1 1 9 26 85 1 1 1 7 1 1 10 10 2 10 0 0 Page 82 1 27-105 51-94 48-110 ~113 49-163 0.5 0.5 30 78 163 FP1 FP1 FP0/1 FP1 FP1 FP1 FP1 FP0 FP1 FP0 FP0 FP0,FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0 FP0 FP0 FP0 FP0, FP1 FP0 FP1 FP0, FP1 FP1 FP1 FP1 FP0 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 ALU FP0, FP1 FP0, FP1 FP0, FP1 Bobcat FRSTOR FXSAVE FXRSTOR m m m 80 71 111 123 105 118 FP0, FP1 FP0, FP1 FP0, FP1 Integer MMX and XMM instructions Instruction Operands Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32,(x)mm 1 1 1 1 3 2 1 7 7 5 6 6 5 6 1 3 1 1 3 1 2 FP0 FP0/1 
FP0/1 FP0 FP1 FP1 FP1 r64,(x)mm mm,r64 xmm,r64 mm,mm xmm,xmm mm,m64 xmm,m64 m64,(x)mm xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm 1 2 3 1 2 1 2 1 2 2 2 2 2 1 2 1 2 7 7 7 1 1 5 5 6 1 6 6 6-9 6-9 1 1 13 13 1 3 3 0.5 1 1 1 2 1 2 3 2-5.5 3-6 0.5 1 1.5 3 FP0 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP1 FP1 FP0/1 AGU FP1 AGU FP1 FP0/1 FP0/1 FP1 FP1 mm,r/m 1 1 0.5 FP0/1 xmm,r/m 3 2 2 FP0/1 mm,r/m 1 1 0.5 xmm,r/m xmm,r/m xmm,r/m mm,mm xmm,xmm xmm,xmm,i mm,mm,i xmm,xmm,i xmm,xmm,i mm,mm xmm,xmm r32,(x)mm 2 2 1 1 6 3 1 2 20 32 64 1 1 1 1 2 3 2 1 2 19 146-1400 279-3000 8 1 1 0.5 1 3 2 0.5 2 12 130-1170 260-2300 2 MOVD (MOVQ) MOVD (MOVQ) MOVD (MOVQ) MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU, LDDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/D Q PUNPCKH/LBW/WD/D Q PUNPCKHQDQ PUNPCKLQDQ PSHUFB PSHUFB PSHUFD PSHUFW PSHUFL/HW PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB Ops Latency Reciprocal throughput Page 83 Execution pipe FP0, FP1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0, FP1 FP0, FP1 FP0 Notes Moves 64 bits. Name differs do. do. Suppl. SSE3 Suppl. SSE3 Suppl. 
SSE3 Bobcat PEXTRW PINSRW PINSRW INSERTQ INSERTQ EXTRQ EXTRQ Arithmetic instructions PADDB/W/D/Q PADDSB/W PADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PADDB/W/D/Q PADDSB/W ADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W PHADD/SUBW/SW/D PHADD/SUBW/SW/D PCMPEQ/GT B/W/D PCMPEQ/GT B/W/D PMULLW PMULHW PMULHUW PMULUDQ PMULLW PMULHW PMULHUW PMULUDQ PMULHRSW PMULHRSW PMADDWD PMADDWD PMADDUBSW PMADDUBSW PAVGB/W PAVGB/W PMIN/MAX SW/UB PMIN/MAX SW/UB PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW Logic PAND PANDN POR PXOR PAND PANDN POR PXOR r32,(x)mm,i mm,r32,i xmm,r32,i xmm,xmm xmm,xmm,i,i xmm,xmm xmm,xmm,i,i 2 2 3 3 3 1 1 12 10 10 3-4 3-4 1 2 2 6 3 3 1 2 FP0, FP1 FP0/1 FP0/1 FP0, FP1 FP0, FP1 FP0/1 FP0/1 mm,r/m 1 1 0.5 FP0/1 xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m 2 1 2 1 2 1 1 4 1 1 1 0.5 1 0.5 1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 mm,r/m 1 2 1 FP0 xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m mm,r/m xmm,r/m 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 1 2 1 2 1 2 0.5 1 0.5 1 0.5 1 0.5 1 2 2 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0 FP0, FP1 mm,r/m 1 1 0.5 FP0/1 xmm,r/m 2 1 1 FP0/1 Page 84 SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only SSE4.A, AMD only Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. 
SSE3 Bobcat PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ mm,i/mm/m 1 1 1 FP0/1 xmm,i/xmm/m 2 2 1 1 1 1 FP0/1 FP0/1 0.5 FP0/1 xmm,i Other EMMS 1 Floating point XMM instructions Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVNTSS/D MOVDDUP MOVDDUP MOVSHDUP, MOVSLDUP MOVSHDUP, MOVSLDUP MOVMSKPS/D SHUFPS/D UNPCK H/L PS/D Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI Operands Ops Latency Reciprocal throughput Execution pipe r,r r,m m,r r,r r,m m,r r,r r,m m,r 2 2 2 2 2 2 1 2 1 1 6 6 1 6-9 6-9 1 6 5 1 2 3 1 2-6 3-6 0.5 2 2 FP0/1 AGU FP1 FP0/1 AGU FP1 FP0/1 FP1 FP1 r,r 1 1 0.5 FP0/1 r,m 1 6 2 AGU m,r m,r m,r r,r r,m64 1 2 1 2 2 5 12 12 2 7 3 3 2 1 2 FP1 FP1 FP1 FP0/1 FP0/1 r,r 2 1 1 FP0/1 r,m r32,r r,r/m,i r,r/m 2 1 3 2 12 ~6 2 1 3 2 2 1 AGU FP0 FP0/1 FP0/1 r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm 2 4 3 1 2 2 2 4 1 2 1 5 5 5 4 4 5 4 6 4 5 4 2 3 3 1 4 2 4 3 2 2 1 FP1 FP0, FP1 FP0, FP1 FP1 FP1 FP1 FP1 FP0, FP1 FP1 FP1 FP1 Page 85 Notes SSE4.A, AMD only SSE3 SSE3 Bobcat CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SS2SI CVT(T)SD2SI mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 3 3 2 2 2 6 12 11 12 11 2 3 3 1 1 FP0, FP1 FP0, FP1 FP1 FP0, FP1 FP0, FP1 r,r/m r,r/m r,r/m 1 2 2 3 3 3 1 2 2 FP0 FP0 FP0 r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 2 1 1 2 2 1 2 1 2 1 2 1 2 1 2 3 2 4 2 4 13 38 17 34 3 3 2 2 2 2 2 1 2 2 4 13 38 17 34 1 2 1 2 1 2 FP0 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0 FP0 FP0 FP0 r,r/m 1 1 FP0 Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D r,r/m 2 1 1 FP0/1 Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 2 1 2 1 2 14 48 24 48 3 3 14 48 24 48 1 2 FP1 FP1 FP1 FP1 FP1 FP1 Other LDMXCSR STMXCSR m m 12 3 10 11 FP0, FP1 FP0, FP1 Arithmetic ADDSS/D SUBSS/D ADDPS/D 
SUBPS/D ADDSUBPS/D HADDPS/D HSUBPS/D MULSS MULSD MULPS MULPD DIVSS DIVPS DIVSD DIVPD RCPSS RCPPS MAXSS/D MINSS/D MAXPS/D MINPS/D CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D SSE3 SSE3

AMD Jaguar

List of instruction timings and macro-operation breakdown

Explanation of column headings:

Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.

Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.

Ops: Number of micro-operations issued from instruction decoder to schedulers. Instructions with more than 2 micro-operations are micro-coded.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m). The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value.

Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.

Execution pipe: Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0.
I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to different execution pipes. Integer instructions Instruction Move instructions MOV MOV Operands Ops Latency Reciprocal throughput Execution pipe r,r r,i 1 1 1 0.5 0.5 I0/1 I0/1 MOV r8/16,m 1 4 1 AGU MOV m,r8/16 1 4 1 AGU MOV r32/64,m 1 3 1 AGU MOV MOV MOVNTI MOVZX, MOVSX MOVZX, MOVSX MOVSXD m,r32/64 m,i m,r r,r r,m r64,r32 1 1 1 1 1 1 0 1 1 1 0.5 1 0.5 AGU AGU AGU I0/1 6 1 4 1 Page 87 Notes Any addressing mode Any addressing mode Any addressing mode Any addressing mode Jaguar MOVSXD CMOVcc CMOVcc XCHG XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LAHF SAHF SALC BSWAP MOVBE MOVBE PREFETCHNTA PREFETCHT0/1/2 PREFETCHW LFENCE MFENCE SFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC, NEG INC, DEC, NEG AAA AAS DAA DAS AAD AAM r64,m32 r,r r,m r8,r8 r,r 1 1 1 3 2 3 1 r,m 3 2 1 1 2 2 9 9 1 3 1 29 9 2 1 1 1 4 1 1 1 1 1 1 1 1 1 4 4 16 5 1 1 1 1 1 1 1 1 1 1 9 9 12 16 4 8 1 r i m SP r m SP r16,[m] r32/64,[m] r32/64,[m] r64,[m] r r,m m,r m m m r,r/i r,m m,r r,r/i r,m m,r/i r,r/i r,m r m 2 1 3 1 2 3 1 1 1 6 1 8 1 1 6 5 8 6 8 5 14 Page 88 1 0.5 1 2 1 I0/1 I0/1 I0/1 Timing depends on hw 3 1 1 1 1 6 8 1 2 2 18 8 2 0.5 1 0.5 2 0.5 1 0.5 1 1 ~100 ~100 ~100 0.5 ~45 ~45 I0 I0/1 I0 I0/1 I0/1 I0/1 MOVBE MOVBE AGU AGU AGU AGU AGU AGU 0.5 1 1 1 1 I0/1 0.5 1 0.5 1 I0/1 13 Any address size 1-2 comp., no scale 3 comp. 
or scale RIP relative I0/1 I0/1 Jaguar MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW, CWDE, CDQE CWD, CDQ, CQO Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR AND, OR, XOR ANDN ANDN TEST TEST TEST NOT NOT SHL, SHR, SAR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL,SHR,SAR,ROL, ROR RCL, RCR RCL RCR RCL RCR SHLD, SHRD SHLD, SHRD SHLD, SHRD BT BT BT r8/m8 r16/m16 r32/m32 r64/m64 r16,r16/m16 r32,r32/m32 r64,r64/m64 r16,(r16),i r32,(r32),i r64,(r64),i r8/m8 r16/m16 r32/m32 r64/m64 r8/m8 r16/m16 r32/m32 r64/m64 1 3 2 2 1 1 1 2 1 1 1 2 2 2 1 2 2 2 1 1 3 3 3 6 3 3 6 4 3 6 11-14 12-19 12-27 12-43 11-14 12-19 12-27 12-43 1 1 1 3 2 5 1 1 4 1 1 4 11-14 12-19 12-27 12-43 11-14 12-19 12-27 12-43 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0 I0/1 I0/1 r,i r,r r,m m,r r,r,r r,r,m r,i r,r r,m r m r,i/CL r,i/CL r,1 r,i r,i r,CL r,CL 1 1 1 1 1 2 1 1 1 1 1 1 1 1 9 7 9 7 1 1 0.5 0.5 1 1 0.5 1 0.5 0.5 1 0.5 1 0.5 0.5 1 5 4 5 4 I0/1 I0/1 m,i /CL m,1 m,i m,i m,CL m,CL r,r,i r,r,cl m,r,i/CL r,r/i m,i m,r 1 1 10 9 9 8 6 7 8 1 1 5 6 6 1 1 1 1 6 1 1 1 5 4 5 4 3 4 Page 89 1 1 11 11 11 11 3 4 11 0.5 1 3 BMI1 BMI1 I0/1 I0/1 I0/1 I0/1 I0/1 I0/1 Jaguar BTC, BTR, BTS BTC BTR, BTS BTC, BTR, BTS BSF BSR BSF, BSR POPCNT LZCNT TZCNT BLSI BLSR BLSI BLSR BLSMSK BLSMSK BEXTR BEXTR SETcc SETcc CLC, STC CMC CLD STD r,r/i m,i m,i m,r r,r r,r r,m r,r/m r,r r,r r,r r,m r,r r,m r,r,r r,m,r r m 2 5 4 8 7 8 8 1 1 2 2 3 2 3 1 2 1 1 1 1 1 2 2 4 4 1 1 2 2 2 1 1 1 11 11 11 4 4 4 0.5 0.5 1 1 2 1 2 0.5 1 0.5 1 0.5 1 1 2 Control transfer instructions JMP short/near JMP r JMP m(near) Jcc short/near J(E/R)CXZ short LOOP short LOOPE LOOPNE short CALL near CALL r CALL m(near) RET RET i 1 1 1 1 2 8 10 2 2 5 1 4 2 2 2 0.5 - 2 1-2 5 6 2 2 2 3 3 BOUND 8 4 4 2 4 ~5n 4 ~2n 2/16B 7 ~2n 2/16B 5 ~6n 7 2 ~3n 2 ~n 1/16B 4 ~1.5n 1/16B 3 ~3n 4 INTO String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS m Page 90 
SSE4A/SSE4.2 SSE4A/LZCNT BMI1 BMI1 BMI1 BMI1 BMI1 BMI1 BMI1 I0/1 I0/1 I0 I0,I1 2 if jumping 2 if jumping values are for no jump values are for no jump for small n best case for small n best case Jaguar REP CMPS Synchronization LOCK ADD XADD LOCK XADD CMPXCHG LOCK CMPXCHG CMPXCHG LOCK CMPXCHG CMPXCHG8B LOCK CMPXCHG8B CMPXCHG16B LOCK CMPXCHG16B Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID XGETBV RDTSC RDTSCP RDPMC CRC32 CRC32 ~6n m,r m,r m,r m,r8 m,r8 m,r16/32/64 m,r16/32/64 m64 m64 m128 m128 r,r r,m 1 4 4 5 5 6 6 18 18 28 28 ~3n 19 11 16 11 16 11 17 11 19 32 38 1 1 37 i,0 12 a,b 10+6b 2 30-59 70-230 5 34 34 30 3 3 4 0.5 0.5 46 I0/1 I0/1 18 17+3b 3 32 bit mode 5 41 42 27 2 2 rdtscp Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(T)(P) FLDZ, FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FNSTSW FNSTCW Operands r m32/64 m80 m80 r m32/64 m80 m80 r m m st0,r r AX m16 m16 Ops Latency Reciprocal throughput 1 1 7 21 1 1 10 217 1 1 1 1 12 1 1 2 2 3 2 4 9 24 2 3 9 167 0 8 4 7 1 Page 91 0.5 1 5 29 0.5 1 7 168 1 1 1 1 7 1 1 11 11 2 Execution pipe FP0/1 FP0/1 FP0/1 FP1 FP1 FP1 FP1 FP1 FP0/1 FP1 FP1 FP1 FP1 FP0 Notes Jaguar FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD(P),FSUB(R)(P) FIADD,FISUB(R) FMUL(P) FMUL(P) FIMUL FDIV(R)(P) FDIV(R)(P) FIDIV(R) FABS, FCHS FCOM(P), FUCOM(P) FCOM(P), FUCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 m16 12 r m m r m m r m m 1 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1 2 5 1 1 r m r m Math FSQRT FLDPI, etc. 
FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR 1 1 4-44 11-51 11-76 11-45 9-75 5 7 8 8-51 61 m m 1 1 9 27 88 80 3 5 22 2 8 11-54 11-56 35 30-139 38-93 55-122 55-177 44-167 27 9 32-37 30-120 ~160 0 138-150 136 9 FP1 1 1 2 3 3 FP0 FP0 FP0,FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0 FP0 FP0 FP0 FP0, FP1 FP0 1FP1 FP0, FP1 FP1 FP1 22 22 22 2 1 1 1 2 1 1 2 4 35 1 30-151 30-120 ~160 FP1 FP0 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 0.5 0.5 32 78 138-150 136 FP0/1 ALU FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 55-180 55-177 44-167 6 Integer MMX and XMM instructions Instruction Move instructions MOVD MOVD Operands r32, mm mm, r32 Ops Latency Reciprocal throughput 1 2 4 6 Page 92 1 1 Execution pipe FP0 FP0/1 Notes Jaguar MOVD MOVD MOVD MOVD MOVD mm,m32 r32, x x, r32 x,m32 m32,(x)mm 1 1 2 1 1 4 4 6 4 3 1 1 1 1 1 AGU FP0 FP1 AGU FP1 MOVD / MOVQ r64,(x)mm MOVQ mm,r64 MOVQ x,r64 MOVQ mm,mm MOVQ x,x MOVQ (x)mm,m64 MOVQ m64,(x)mm MOVDQA x,x VMOVDQA y,y MOVDQA x,m VMOVDQA y,m MOVDQA m,x VMOVDQA m,y MOVDQU, LDDQU x.m MOVDQU m,x MOVDQ2Q mm,x MOVQ2DQ x,mm MOVNTQ m,mm MOVNTDQ m,x PACKSSWB/DW PACKUSWB mm,r/m PACKSSWB/DW PACKUSWB x,r/m PUNPCKH/LBW/WD/D Q mm,r/m PUNPCKH/LBW/WD/D Q x,r/m PUNPCKH/LQDQ x,r/m PSHUFB mm,mm PSHUFB x,x PSHUFD x,x,i PSHUFW mm,mm,i PSHUFL/HW x,x,i PALIGNR x,x,i PBLENDW x,r/m MASKMOVQ mm,mm MASKMOVDQU x,x PMOVMSKB r32,(x)mm PEXTRW r32,(x)mm,i PINSRW mm,r32,i PINSRB/W/D/Q x,r,i PINSRB/W/D/Q x,m,i PEXTRB/W/D/Q r,x,i PEXTRB/W/D/Q m,x,i INSERTQ x,x INSERTQ x,x,i,i EXTRQ x,x 1 2 2 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 4 6 6 1 1 4 3 1 1 4 4 3 3 4 3 1 1 429 429 1 1 1 0.5 0.5 1 1 0.5 1 1 2 1 2 1 1 0.5 0.5 2 2 FP0 FP0/1 FP0/1 FP0/1 FP0/1 AGU FP1 FP0/1 FP0/1 AGU AGU FP1 FP1 AGU FP1 FP0/1 FP0/1 FP1 FP1 1 1 0.5 FP0/1 1 2 0.5 FP0/1 1 1 0.5 FP0/1 1 1 1 3 1 1 1 1 1 32 64 1 1 2 2 1 1 1 3 3 1 2 2 1 4 2 1 1 2 1 432 43-2210 3 4 8 7 0.5 0.5 0.5 2 0.5 0.5 0.5 0.5 0.5 17 34 1 1 1 1 1 1 1 2 2 0.5 
FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0, FP1 FP0, FP1 FP0 FP0 FP0/1 FP0/1 FP0/1 FP0 FP1 FP0, FP1 FP0, FP1 FP0/1 3 2 2 1 Page 93 Moves 64 bits.Name of instruction differs do. do. AVX AVX AVX Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 SSE4.1 SSE4.1 SSE4.1 SSE4A, AMD only SSE4A, AMD only SSE4A, AMD only Jaguar EXTRQ PMOVSXBW/BD/BQ/ WD/WQ/DQ PMOVZXBW/BD/BQ/ WD/WQ/DQ x,x,i,i 1 1 0.5 FP0/1 SSE4A, AMD only x,x 1 2 0.5 FP0/1 SSE4.1 x,x 1 2 0.5 FP0/1 SSE4.1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 1 3 1 1 1 1 1 1 1 1 1 3 2 4 2 2 2 2 1 1 1 1 2 4 1 2 1 1 1 1 0.5 0.5 0.5 0.5 0.5 1 FP0 FP0 FP1 FP0 FP0 FP0 FP0 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 (x)mm,r/m 1 1 0.5 FP0/1 mm,i/mm/m 1 1 0.5 FP0/1 x,x 1 2 0.5 FP0/1 x,i x,i x,x/m 1 1 1 1 2 3 0.5 0.5 1 FP0/1 FP0/1 FP0 SSE4.1 x,x,i x,m,i x,x,i x,m,i x,x,i x,m,i 9 10 9 10 3 4 5 5 5 9 9 2 2 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 Arithmetic instructions PADDB/W/D/Q PADDSB/W ADDUSB/W PSUBB/W/D/Q PSUBSB/W PSUBUSB/W (x)mm,r/m PHADD/SUBW/SW/D mm,r/m PHADD/SUBW/SW/D x,r/m PCMPEQ/GT B/W/D mm,r/m PCMPEQ/GT B/W/D x,r/m PCMPEQQ (x)mm,r/m PCMPGTQ (x)mm,r/m PMULLW PMULHW PMULHUW PMULUDQ (x)mm,r/m x,r/m PMULLD x,r/m PMULDQ PMULHRSW (x)mm,r/m PMADDWD (x)mm,r/m PMADDUBSW (x)mm,r/m PAVGB/W (x)mm,r/m PMIN/MAX SW/UB (x)mm,r/m PABSB/W/D (x)mm,r/m PSIGNB/W/D (x)mm,r/m PSADBW (x)mm,r/m MPSADBW x,x,i Logic PAND PANDN POR PXOR PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLL/RL W/D/Q PSRAW/D PSLLDQ, PSRLDQ PTEST String instructions PCMPESTRI PCMPESTRI PCMPESTRM PCMPESTRM PCMPISTRI PCMPISTRI 9 2 Page 94 Suppl. SSE3 Suppl. SSE3 SSE4.1 SSE4.2 SSE4.1 SSE4.1 Suppl. SSE3 Suppl. SSE3 Suppl. SSE3 Suppl. 
SSE3 SSE4.1 Jaguar PCMPISTRM PCMPISTRM Encryption PCLMULQDQ AESDEC AESDECLAST AESENC AESENCLAST AESIMC AESKEYGENASSIST x,x,i x,m,i 3 4 8 8 2 FP0/1 FP0/1 SSE4.2 SSE4.2 x,x/m,i x,x x,x x,x x,x x,x x,x,i 1 2 2 2 2 1 1 3 5 5 5 5 2 2 1 1 1 1 1 1 1 FP0 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 PCLMUL AES AES AES AES AES AES 0.5 FP0/1 Other EMMS 1 Floating point XMM instructions Instruction Move instructions MOVAPS/D VMOVAPS/D MOVAPS/D VMOVAPS/D MOVAPS/D VMOVAPS/D MOVUPS/D VMOVUPS/D MOVUPS/D VMOVUPS/D MOVUPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHLPS, MOVLHPS MOVHPS/D, MOVLPS/D MOVHPS/D, MOVLPS/D MOVNTPS/D MOVNTSS/D MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP MOVMSKPS/D VMOVMSKPS/D Operands Ops Latency Reciprocal throughput Execution pipe x,x y,y x,m y,m m,x m,y x,x y,y x,m y,m m,x m,y x,x x,m m,x 1 2 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 4 4 3 3 1 1 4 4 3 3 1 4 3 0.5 1 1 2 1 2 0.5 1 1 2 1 2 0.5 1 1 FP0/1 FP0/1 AGU AGU FP1 FP1 FP0/1 FP0/1 AGU AGU FP1 FP1 FP0/1 AGU FP1 x,x 1 2 2 FP0/1 x,m 1 5 1 FP0/1 m,x m,x m,x x,x x,m64 y,y y,m x,x x,m y,y y,m r32,x r32,y 1 1 1 1 1 2 2 1 1 2 2 1 1 4 429 1 1 1 0.5 1 1 2 0.5 1 1 2 1 1 FP1 FP1 FP1 FP0/1 AGU FP0/1 AGU FP0/1 AGU FP0/1 AGU FP0 FP0 2 2 1 1 3 3 Page 95 Notes SSE4A, AMD only SSE3 SSE3 AVX AVX AVX AVX AVX Jaguar SHUFPS/D VSHUFPS/D UNPCK H/L PS/D VUNPCK H/L PS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D x,x/m,i y,y,y,i x,x/m y,y,y r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 m128,x,x m256,y,y 1 2 1 2 1 1 1 1 1 1 2 2 1 2 19 36 2 2 2 2 3 3 1 12 Conversion CVTPS2PD VCVTPS2PD CVTPD2PS VCVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS/PD VCVTDQ2PS/PD CVT(T)PS2DQ VCVT(T)PS2DQ CVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SS2SI CVT(T)SD2SI VCVTPS2PH VCVTPS2PH VCVTPH2PS VCVTPH2PS x,x/m y,x/m x,x/m x,y x,x/m x,x/m x,x/m y,y x,x/m y,y x,x/m y,y xmm,mm 
xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm x/m,x,i x/m,y,i x,x/m y,x/m x,x/m x,x/m y,y/m x,x/m y,y/m x,x/m y,y/m x,x/m y,y/m Arithmetic ADDSS/D SUBSS/D ADDPS/D SUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D VADDSUBPS/D HADD/SUBPS/D VHADD/SUBPS/D MULSS/PS VMULPS 6 1 13 15 15 21 32 0.5 1 0.5 1 1 1 0.5 1 1 1 1 2 1 2 16 22 FP0/1 FP0/1 FP0/1 FP0/1 FP0 FP1 FP0/1 FP1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP0/1 FP1 FP1 AVX AVX >300 clk if mask=0 >300 clk if mask=0 AVX AVX 1 2 1 3 2 2 1 2 1 2 1 3 1 1 1 1 2 2 2 2 1 3 1 2 3 4 4 6 5 4 4 4 4 4 4 7 4 4 4 4 9 9 8 8 4 6 4 5 1 2 1 2 8 7 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 2 FP1 FP1 FP1 FP0, FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0, FP1 FP1 FP1 F16C F16C F16C F16C 1 1 2 1 2 1 2 1 2 3 3 3 3 3 4 4 2 2 1 1 2 1 2 1 2 1 2 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP1 FP1 Page 96 AVX AVX AVX AVX SSE3 SSE3 Jaguar MULSD/PD VMULPD DIVSS DIVPS VDIVPS DIVSD DIVPD VDIVPD RCPSS RCPPS VRCPPS MAXSS/D MINSS/D MAXPS/D MINPS/D VMAXPS/D VMINPS/D CMPccSS/D CMPccPS/D VCMPccPS/D (U)COMISS/D ROUNDSS/SD/PS/PD VROUNDSS/D/PS/D DPPS DPPS VDPPS VDPPS DPPD DPPD x,x/m y,y/m x,x/m x,x/m y,y/m x,x/m x,x/m y,y/m x,x/m x,x/m y,y/m x,x/m x,x/m y,y/m x,x/m x,x/m y,y/m x,x/m x,x/m,i y,y/m,i x,x,i x,m,i y,y,y,i y,m,i x,x,i x,m,i 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 5 6 10 12 3 4 4 4 14 19 38 19 19 38 2 2 2 2 2 2 2 2 2 Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D VANDPS/D, etc. 
x,x/m y,y/m 1 2 Math SQRTSS SQRTPS VSQRTPS SQRTSD SQRTPD VSQRTPD RSQRTSS/PS VRSQRTPS x,x/m x,x/m y,y/m x,x/m x,x/m y,y/m x,x/m y,y/m m m Other LDMXCSR STMXCSR VZEROUPPER VZEROUPPER VZEROALL VZEROALL FXSAVE FXSAVE FXRSTOR FXRSTOR 2 2 14 19 38 19 19 38 1 1 2 1 1 2 1 1 2 1 1 2 4 4 7 7 3 3 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP1 FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 FP0, FP1 1 1 0.5 1 FP0/1 FP0/1 1 2 2 1 2 2 1 2 16 21 42 27 27 54 2 2 16 21 42 27 27 54 1 2 FP1 FP1 FP1 FP1 FP1 FP1 FP1 FP1 12 3 21 37 41 73 66 58 115 123 9 13 8 12 30 46 58 90 66 58 189 197 FP0, FP1 FP0, FP1 4 4 11 12 9 66 58 189 198 Page 97 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 32 bit mode 64 bit mode 32 bit mode 64 bit mode 32 bit mode 64 bit mode 32 bit mode 64 bit mode Jaguar XSAVE XSAVE XRSTOR XRSTOR 130 114 219 251 145 129 342 375 Page 98 145 129 342 375 32 bit mode 64 bit mode 32 bit mode 64 bit mode Intel Pentium Intel Pentium and Pentium MMX List of instruction timings Explanation of column headings: Operands Clock cycles Pairability r = register, accum = al, ax or eax, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand, etc. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in either pipe, np = not pairable. 
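The pairability codes defined above can be read as a simple rule: two consecutive instructions can issue together when the first may use the u-pipe and the second may use the v-pipe. The following is a minimal sketch of that rule only. The function names are hypothetical, and the sketch assumes that every instruction takes one clock and that there are no register dependencies or other restrictions; the full pairing rules are described in manual 3.

```python
def can_pair(first, second):
    """Return True if two consecutive instructions may pair.

    first and second are pairability codes from the table:
    'u', 'v', 'uv' or 'np'. Only the pipe assignment is checked;
    register dependencies and other rules are ignored (assumption).
    """
    first_ok = first in ("u", "uv")    # first instruction must fit the u-pipe
    second_ok = second in ("v", "uv")  # second instruction must fit the v-pipe
    return first_ok and second_ok


def count_issue_slots(codes):
    """Greedy count of issue slots for a sequence of pairability codes.

    Assumes single-clock instructions, so one slot is one clock.
    """
    slots = 0
    i = 0
    while i < len(codes):
        if i + 1 < len(codes) and can_pair(codes[i], codes[i + 1]):
            i += 2  # both instructions issue in the same slot
        else:
            i += 1  # instruction issues alone
        slots += 1
    return slots


if __name__ == "__main__":
    print(can_pair("u", "v"))
    print(count_issue_slots(["uv", "uv", "np", "u", "v"]))
```

Reordering instructions so that a u-only instruction is followed by a v or uv instruction reduces the slot count in this model.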
Integer instructions (Pentium and Pentium MMX) Instruction NOP MOV MOV MOV MOV XCHG XCHG XCHG XLAT PUSH POP PUSH POP PUSH POP PUSHF POPF PUSHA POPA PUSHAD POPAD LAHF SAHF MOVSX MOVZX LEA LDS LES LFS LGS LSS ADD SUB AND OR XOR ADD SUB AND OR XOR ADD SUB AND OR XOR ADC SBB ADC SBB ADC SBB CMP CMP TEST TEST TEST TEST INC DEC INC DEC NEG NOT Operands r/m, r/m/i r/m, sr sr , r/m m , accum (E)AX, r r,r r,m r/i r m m sr sr r , r/m r,m m r , r/i r,m m , r/i r , r/i r,m m , r/i r , r/i m , r/i r,r m,r r,i m,i r m r/m Clock cycles Pairability 1 uv 1 uv 1 np >= 2 b) np 1 uv h) 2 np 3 np >15 np 4 np 1 uv 1 uv 2 np 3 np 1 b) np >= 3 b) np 3-5 np 4-6 np 5-9 i) np 5 np 2 np 3 a) np 1 uv 4 c) np 1 uv 2 uv 3 uv 1 u 2 u 3 u 1 uv 2 uv 1 uv 2 uv 1 f) 2 np 1 uv 3 uv 1/3 np Page 99 Intel Pentium MUL IMUL MUL IMUL DIV DIV DIV IDIV IDIV IDIV CBW CWDE CWD CDQ SHR SHL SAR SAL SHR SHL SAR SAL SHR SHL SAR SAL ROR ROL RCR RCL ROR ROL ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc JMP CALL JMP CALL conditional jump CALL JMP RETN RETN RETF RETF J(E)CXZ LOOP BOUND CLC STC CMC CLD STD CLI STI LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS BSWAP CPUID r8/r16/m8/m16 all other versions r8/m8 r16/m16 r32/m32 r8/m8 r16/m16 r32/m32 r,i m,i r/m, CL r/m, 1 r/m, i(><1) r/m, CL r/m, i(><1) r/m, CL r, i/CL m, i/CL r, r/i m, i m, i r, r/i m, i m, r r , r/m r/m short/near far short/near r/m i i short short r,m r 11 9 d) 17 25 41 22 30 46 3 2 1 3 4/5 1/3 1/3 4/5 8/10 7/9 4 a) 5 a) 4 a) 4 a) 9 a) 7 a) 8 a) 14 a) 7-73 a) 1/2 a) 1 e) >= 3 e) 1/4/5/6 e) 2/5 e 2/5 e 3/6 e) 4/7 e) 5/8 e) 4-11 e) 5-10 e) 8 2 6-9 2 7+3*n g) 3 10+n g) 4 12+n g) 4 9+4*n g) 5 8+4*n g) 1 a) 13-16 a) Page 100 np np np np np np np np np np u u np u np np np np np np np np np np np np np np v np v np np np np np np np np np np np np np np np np np np np np np np Intel Pentium RDTSC Notes: a b c d e f g h i j 6-13 a) j) np This instruction has a 0FH prefix 
which takes one clock cycle extra to decode on a P1 unless preceded by a multi-cycle instruction.
b) Versions with FS and GS have a 0FH prefix. See note a.
c) Versions with SS, FS, and GS have a 0FH prefix. See note a.
d) Versions with two operands and no immediate have a 0FH prefix. See note a.
e) High values are for mispredicted jumps/branches.
f) Only pairable if register is AL, AX or EAX.
g) Add one clock cycle for decoding the repeat prefix unless preceded by a multi-cycle instruction (such as CLD).
h) Pairs as if it were writing to the accumulator.
i) 9 if SP divisible by 4 (imperfect pairing).
j) On P1: 6 in privileged or real mode; 11 in non-privileged; error in virtual mode. On PMMX: 8 and 13 clocks respectively.

Floating point instructions (Pentium and Pentium MMX)

Instruction FLD FLD FBLD FST(P) FST(P) FST(P) FBSTP FILD FIST(P) FLDZ FLD1 FLDPI FLDL2E etc. FNSTSW FLDCW FNSTCW FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FCHS FABS

Explanation of column headings:
Operands: r = register, m = memory, m32 = 32-bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, denormal operands, and exceptions may increase the clock counts considerably.
Pairability: + = pairable with FXCH, np = not pairable with FXCH.
i-ov: Overlap with integer instructions. i-ov = 4 means that the last four clock cycles can overlap with subsequent integer instructions.
fp-ov: Overlap with floating point instructions. fp-ov = 2 means that the last two clock cycles can overlap with subsequent floating point instructions.
(WAIT is considered a floating point instruction here) Operand r/m32/m64 m80 m80 r m32/m64 m80 m80 m m AX/m16 m16 m16 r/m r/m r/m r/m Clock cycles Pairability 1 0 3 np 48-58 np 1 np 2 m) np 3 m) np 148-154 np 3 np 6 np 2 np 5 s) np 6 q) np 8 np 2 np 3 0 3 0 3 0 19/33/39 p) 0 1 0 Page 101 i-ov 0 0 0 0 0 0 0 2 0 0 2 0 0 0 2 2 2 38 o) 0 fp-ov 0 0 0 0 0 0 0 2 0 0 2 0 0 0 2 2 2 n) 2 0 Intel Pentium FCOM(P)(P) FUCOM FIADD FISUB(R) FIMUL FIDIV(R) FICOM FTST FXAM FPREM FPREM1 FRNDINT FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN FNOP FXCH FINCSTP FDECSTP FFREE FNCLEX FNINIT FNSAVE FRSTOR WAIT Notes: m n o p q r s r/m m m m m r r m m 1 6 6 22/36/42 p) 4 1 17-21 16-64 20-70 9-20 20-32 12-66 70 65-100 r) 89-112 r) 53-59 r) 103 r) 105 r) 120-147 r) 112-134 r) 1 1 2 2 6-9 12-22 124-300 70-95 1 0 np np np np np np np np np np np np np np np np np np np np np np np np np np np np 0 2 2 38 o) 0 0 4 2 2 0 5 0 69 o) 2 2 2 2 2 36 o) 2 0 0 0 0 0 0 0 0 0 0 2 2 2 0 0 0 2 2 0 0 0 2 2 2 2 2 2 0 2 0 0 0 0 0 0 0 0 0 The value to store is needed one clock cycle in advance. 1 if the overlapping instruction is also an FMUL. Cannot overlap integer multiplication instructions. FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more. The precision is defined by bit 8-9 of the floating point control word. The first 4 clock cycles can overlap with preceding integer instructions. Clock counts are typical. Trivial cases may be faster, extreme cases may be slower. May be up to 3 clocks more when output needed for FST, FCHS, or FABS. MMX instructions (Pentium MMX) A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX multiply instructions which take 3. MMX multiply instructions can be pipelined to yield a throughput of one multiplication per clock cycle. 
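The difference between a latency of 3 clocks and a pipelined throughput of one multiplication per clock can be illustrated with a small model. This is a sketch under stated assumptions, not a measurement: the function names are hypothetical, the model ignores pairing, decoding and memory effects, and it simply treats the multiplier as a pipeline that accepts a new independent operation every clock.

```python
def independent_cycles(n, latency, reciprocal_throughput):
    """Clocks until the last of n independent, pipelined operations completes.

    The first result arrives after `latency` clocks; each further
    operation adds `reciprocal_throughput` clocks (simplified model).
    """
    if n <= 0:
        return 0
    return latency + (n - 1) * reciprocal_throughput


def dependent_cycles(n, latency):
    """Clocks for n operations where each needs the previous result."""
    if n <= 0:
        return 0
    return n * latency


if __name__ == "__main__":
    # MMX multiply on the PMMX: 3 clocks, one new multiplication per clock.
    print(independent_cycles(8, latency=3, reciprocal_throughput=1))
    print(dependent_cycles(8, latency=3))
```

In this model eight independent multiplications finish in 10 clocks, while eight multiplications in a dependency chain need 24 clocks.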
The EMMS instruction takes only one clock cycle, but the first floating point instruction after an EMMS takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes approximately 38 clocks extra. There is no penalty for an MMX instruction after EMMS on the PMMX.

There is no penalty for using a memory operand in an MMX instruction because the MMX arithmetic unit is one step later in the pipeline than the load unit. But the penalty comes when you store data from an MMX register to memory or to a 32-bit register: The data have to be ready one clock cycle in advance. This is analogous to the floating point store instructions.

All MMX instructions except EMMS are pairable in either pipe. Pairing rules for MMX instructions are described in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".

Intel Pentium II and Pentium III
List of instruction timings and μop breakdown

Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of μops that the instruction generates for each execution port.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers.
Denormal numbers, NAN's and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind. Integer instructions (Pentium Pro, Pentium II and Pentium III) Instruction Operands p0 MOV MOV MOV MOV MOV MOV MOV MOVSX MOVZX MOVSX MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH POP POP PUSH POP PUSH POP r,r/i r,m m,r/i r,sr m,sr sr,r sr,m r,r r,m r,r r,m r,r r,m r/i r (E)SP m m sr sr p1 μops p01 p2 p3 1 1 1 1 1 1 8 7 Latency p4 1 1 5 8 1 1 1 1 1 1 1 3 4 1 1 1 2 1 5 2 8 Page 104 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 high b) Reciprocal throughput Pentium II and III PUSHF(D) POPF(D) PUSHA(D) POPA(D) LAHF SAHF LEA LDS LES LFS LGS LSS ADD SUB AND OR XOR ADD SUB AND OR XOR ADD SUB AND OR XOR ADC SBB ADC SBB ADC SBB CMP TEST CMP TEST INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CWD CDQ SHR SHL SAR ROR ROL SHR SHL SAR ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc 3 10 r,m 11 6 2 2 1 1 1 8 8 1 8 1 1 c) m r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m 8 1 1 1 2 2 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 r,(r),(i) (r),m r8 r16 r32 m8 m16 m32 1 1 1 1 2 3 3 2 2 2 1 2 2 4 15 4 4 19 23 39 19 23 39 1 1 1 1 1 1 1 1 1 1 1 1 r,i/CL 1 m,i/CL r,1 r8,i/CL r16/32,i/CL m,1 m8,i/CL m16/32,i/CL r,r,i/CL m,r,i/CL r,r/i m,r/i r,r/i m,r/i r,r r,m r 1 1 4 3 1 4 4 2 2 1 4 3 2 3 2 1 1 1 1 1 1 6 1 6 1 1 1 Page 105 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 12 21 37 12 21 37 Pentium II and III SETcc JMP JMP JMP JMP JMP conditional jump CALL CALL CALL CALL CALL RETN RETN RETF RETF J(E)CXZ LOOP LOOP(N)E ENTER ENTER LEAVE BOUND CLC STC CMC CLD STD CLI STI INTO LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS 
REP(N)E SCAS CMPS REP(N)E CMPS BSWAP NOP (90) Long NOP (0F 1F) CPUID RDTSC IN OUT PREFETCHNTA d) PREFETCHT0/1/2 d) SFENCE d) Notes m short/near far r m(near) m(far) short/near near far r m(near) m(far) 1 1 1 1 2 21 1 1 1 21 1 1 1 1 1 2 4 1 1 2 3 28 1 28 i 23 23 i short short short i,0 a,b ca. r,m 7 2 2 1 2 1 8 8 12 18 +4b 2 6 1 4 1 2 1 1 3 3 1 2 1 1 2 2 2 1 2 1 1 2 2 2 2 2 1 1 1 2 2 1 b-1 1 2b 1 1 1 1 1 2 9 17 5 2 10+6n 1 1 a) 3 a) 2 4 2 ca. 5n 1 ca. 6n 12+7n 12+9n r 1 1 1 1 0.5 1 23-48 31 18 18 m m >300 >300 1 1 1 Page 106 1 6 Pentium II and III a) b) c) d) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". Has an implicit LOCK prefix. 3 if constant without base or index register P3 only. Floating point x87 instructions (Pentium Pro, II and III) Instruction Operands p0 FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT r m32/64 m80 m80 r m32/m64 m80 m80 r m m r AX m16 m16 m16 r m r m r m r m r m m m m m p1 μops p01 p2 p3 Latency p4 Reciprocal throughput 1 1 2 2 2 38 1 1 2 2 2 165 3 2 1 2 2 3 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 6 6 6 6 1 1 23 33 30 1 1 2 2 1 1 1 1 0 5 5 ⅓ f) 2 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10 3 3-4 5 5-6 38 h) 38 h) 2 1 1 1 1 1 1 2 Page 107 1 1 2 g) 2 g) 37 37 Pentium II and III FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN FNOP FINCSTP FDECSTP FFREE FFREEP FNCLEX FNINIT FNSAVE FRSTOR WAIT Notes: e) f) g) h) i) r r 56 15 1 17-97 18-110 17-48 36-54 31-53 21-102 25-86 1 1 1 2 27-103 29-130 66 103 98-107 13-143 44-143 69 e) e) e) e) e) e) e) e,i) 3 13 141 72 2 Not pipelined FXCH generates 1 μop that is resolved by register renaming without going to any port. 
g) FMUL uses the same circuitry as integer multiplication. Therefore, the combined throughput of mixed floating point and integer multiplications is 1 FMUL + 1 IMUL per 3 clock cycles.
h) FDIV latency depends on precision specified in control word: 64 bits precision gives latency 38, 53 bits precision gives latency 32, 24 bits precision gives latency 18. Division by a power of 2 takes 9 clocks. The reciprocal throughput is latency-1 clock cycles, i.e. the throughput is 1/(latency-1) divisions per clock cycle.
i) Faster for lower precision.

Integer MMX instructions (Pentium II and Pentium III)
Instruction Operands p0 MOVD MOVQ MOVD MOVQ MOVD MOVQ PADD PSUB PCMP PADD PSUB PCMP PMUL PMADD PMUL PMADD PAND(N) POR PXOR PAND(N) POR PXOR PSRA PSRL PSLL PSRA PSRL PSLL PACK PUNPCK PACK PUNPCK EMMS MASKMOVQ d) r,r mm,m32/64 m32/64,mm mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 mm,mm/i mm,m64 mm,mm mm,m64 p1 μops p01 p2 p3 1 1 1 1 1 1 1 1 Latency p4 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 11 mm,mm 1 1 1 6 k) 2-8 Reciprocal throughput 0.5 1 1 0.5 1 1 1 0.5 1 1 1 1 1 2 - 30 PMOVMSKB d) MOVNTQ d) PSHUFW d) PSHUFW d) PEXTRW d) PINSRW d) PINSRW d) PAVGB PAVGW d) PAVGB PAVGW d) PMIN/MAXUB/SW d) PMIN/MAXUB/SW d) PMULHUW d) PMULHUW d) PSADBW d) PSADBW d) r32,mm m64,mm mm,mm,i mm,m64,i r32,mm,i mm,r32,i mm,m16,i mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 mm,mm mm,m64 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 2 1 2 3 4 5 6 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 - 30 1 1 1 1 1 0.5 1 0.5 1 1 1 2 2
Notes:
d) P3 only.
k) The delay can be hidden by inserting other instructions between EMMS and any subsequent floating point instruction.
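The FDIV latencies quoted in note h) of the x87 table above depend on the precision setting in the floating point control word. The sketch below looks up the quoted latency from a control word value. The function name and table names are hypothetical. The position of the precision-control field (bits 8-9) and its encodings (00 = 24 bits, 10 = 53 bits, 11 = 64 bits, 01 reserved) are taken from the x87 architecture rather than from the note itself, and the sketch does not model the faster cases mentioned in note i).

```python
# Latencies quoted in note h) for FDIV on Pentium Pro, II and III,
# keyed by significand precision in bits.
FDIV_LATENCY_BY_PRECISION = {24: 18, 53: 32, 64: 38}

# x87 control word precision-control (PC) field, bits 8-9.
# Encoding 0b01 is reserved and deliberately absent.
PC_FIELD_TO_PRECISION = {0b00: 24, 0b10: 53, 0b11: 64}


def fdiv_latency_p6(control_word, divisor_is_power_of_two=False):
    """Return the FDIV latency in clocks quoted in note h) (hypothetical helper)."""
    if divisor_is_power_of_two:
        return 9  # "Division by a power of 2 takes 9 clocks."
    pc = (control_word >> 8) & 0b11  # extract bits 8-9
    if pc not in PC_FIELD_TO_PRECISION:
        raise ValueError("reserved precision-control encoding")
    return FDIV_LATENCY_BY_PRECISION[PC_FIELD_TO_PRECISION[pc]]


if __name__ == "__main__":
    # 0x037F selects 64-bit precision, 0x027F selects 53-bit precision.
    for cw in (0x037F, 0x027F, 0x007F):
        print(hex(cw), fdiv_latency_p6(cw))
```

Code that needs many divisions but not full precision could, under these figures, save clocks by lowering the precision setting, at the cost of changing the results of all other x87 operations as well.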
Floating point XMM instructions (Pentium III) Instruction Operands p0 MOVAPS MOVAPS MOVAPS MOVUPS MOVUPS MOVSS MOVSS MOVSS MOVHPS MOVLPS MOVHPS MOVLPS MOVLHPS MOVHLPS MOVMSKPS MOVNTPS CVTPI2PS CVTPI2PS CVT(T)PS2PI CVTPS2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVTSS2SI ADDPS SUBPS ADDPS SUBPS ADDSS SUBSS ADDSS SUBSS xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32 m32,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 p1 1 μops p01 p2 p3 2 2 2 4 4 1 1 1 1 1 1 1 Latency p4 2 4 1 1 1 2 2 2 2 1 2 2 1 1 2 2 1 1 Page 109 1 2 1 2 1 2 2 1 1 2 3 2 3 1 1 1 1 1 1 1 2 3 4 3 4 4 5 3 4 3 3 3 3 Reciprocal throughput 1 2 2 4 4 1 1 1 1 1 1 1 2 - 15 1 2 1 1 2 2 1 2 2 2 1 1 Pentium II and III MULPS MULPS MULSS MULSS DIVPS DIVPS DIVSS DIVSS AND(N)PS ORPS XORPS AND(N)PS ORPS XORPS MAXPS MINPS MAXPS MINPS MAXSS MINSS MAXSS MINSS CMPccPS CMPccPS CMPccSS CMPccSS COMISS UCOMISS COMISS UCOMISS SQRTPS SQRTPS SQRTSS SQRTSS RSQRTPS RSQRTPS RSQRTSS RSQRTSS RCPPS RCPPS RCPSS RCPSS SHUFPS SHUFPS UNPCKHPS UNPCKLPS UNPCKHPS UNPCKLPS LDMXCSR STMXCSR FXSAVE FXRSTOR xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm,i xmm,m128,i xmm,xmm xmm,m128 m32 m32 m4096 m4096 2 2 1 1 2 2 1 1 2 1 2 1 2 2 2 2 1 1 2 2 1 1 1 1 2 2 1 2 1 1 2 2 2 2 2 2 1 1 2 2 1 1 2 1 2 1 2 1 2 2 2 2 1 2 2 11 6 116 89 Page 110 2 4 4 4 4 48 48 18 18 2 2 3 3 3 3 3 3 3 3 1 1 56 57 30 31 2 3 1 2 2 3 1 2 2 2 3 3 15 7 62 68 2 2 1 1 34 34 17 17 2 2 2 2 1 1 2 2 1 1 1 1 56 56 28 28 2 2 1 1 2 2 1 1 2 2 2 2 15 9 Pentium M Intel Pentium M, Core Solo and Core Duo List of instruction timings and μop breakdown Explanation of column headings: Operands: μops fused domain: μops unfused domain: p0: p1: p01: p2: p3: p4: Latency: Reciprocal 
throughput:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
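The per-port μop counts defined above can be summed over a block of code to get a rough lower bound on its execution time. The following is a minimal sketch of that idea. The function name is hypothetical, and the model assumes that each port accepts one μop per clock and that p01 μops are distributed ideally between ports 0 and 1. It ignores latencies, dependency chains, and decode and retirement limits, so real code may well be slower than this bound.

```python
from math import ceil


def port_bound_cycles(p0=0, p1=0, p01=0, p2=0, p3=0, p4=0):
    """Lower bound on clocks from execution-port pressure alone.

    Arguments are μop counts summed over the code block, one per
    column of the instruction tables (unfused domain).
    """
    alu_total = p0 + p1 + p01
    # Ports 0 and 1 share the p01 μops; neither port can finish before
    # its own fixed μops, nor before half of the combined work is done.
    alu_bound = max(p0, p1, ceil(alu_total / 2))
    return max(alu_bound, p2, p3, p4)


if __name__ == "__main__":
    # Hypothetical loop body: one p0 μop, one p1 μop, four p01 μops,
    # two loads (p2) and one store (p3 + p4).
    print(port_bound_cycles(p0=1, p1=1, p01=4, p2=2, p3=1, p4=1))
```

If the bound is set by p2 rather than by ports 0 and 1, the block is limited by loads in this model, and reducing arithmetic μops alone would not help.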
Integer instructions Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSX MOVZX CMOVcc CMOVcc XCHG Operands r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r32 r,r r,m r,r r,m r,r μops fused domain 1 1 1 2 1 2 8 8 2 1 1 2 2 3 μops unfused domain p0 p1 p01 p2 p3 Latency Reciprocal p4 throughput 1 0.5 1 1 1 1 1 1 8 7 1 1 1 1 1 1 5 8 1 1 1 1 2 2 0.5 1 1.5 2 1.5 1 1 1 1 Page 111 1 1 3 1 Pentium M XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POP POPF(D) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 SFENCE/LFENCE/MFENCE IN OUT Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL IMUL IMUL MUL IMUL MUL IMUL IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV r,m r i m sr r (E)SP m sr r,m r m m m 7 2 1 2 2 2 16 18 1 3 2 10 17 10 1 2 1 2 11 1 1 2 4 1 1 1 1 1 11 2 3 2 9 6 2 1 10 1 1 1 1 1 1 1 1 8 1 1 high b) 1 1 1 1 1 8 1 1 1 1 1 8 1 1 2 1 1 1 1 8 6 8 1 1 2 1 7 1 1 1 1 8 3 1 1 1 r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r m,i r m r8 r16/r32 r,r r,r,i m8 m16/m32 r,m r,m,i r8 r16 r32 m8 m16 1 1 3 2 2 7 1 1 2 1 3 1 3 4 1 3 1 1 1 3 1 2 5 4 4 6 5 1 1 6 1 18 18 16 7 1 1 1 >300 >300 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 0.5 1 1 2 1 0.5 1 1 0.5 2 15 4 5 4 4 4 5 4 4 15-16 c) 15-24 c) 15-39 c) 15-16 c) 15-24 c) 1 1 1 1 1 1 1 1 12 12-20 c) 12-20 c) 12 12-20 c) 1 1 1 1 1 3 1 1 1 3 1 1 4 3 3 4 3 Page 112 1 2 2 1 1 1 1 1 1 1 1 1 1 1 Pentium M DIV IDIV CBW CWDE CWD CDQ Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST TEST SHR SHL SAR ROR ROL SHR SHL SAR ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD m32 r,r/i r,m m,r/i r,r/i m,r m,i r,i/CL m,i/CL r,1 r8,i/CL r8,i/CL r16/32,i/CL m,1 m8,i/CL m8,i/CL m16/32,i/CL r,r,i/CL m,r,i/CL r,r/i m,r m,i r,r/i m,r m,i r,r 
r,m r m Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) conditional jump short/near J(E)CXZ short LOOP short LOOP(N)E short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN 5 1 1 1 1 3 1 1 2 1 3 2 9 8 6 7 12 11 10 2 4 1 8 2 1 10 3 2 2 1 2 1 4 1 22 1 2 25 1 2 11 11 4 32 4 4 35 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 5 4 3 2 6 5 5 2 1 1 1 1 1 2 0.5 1 1 0.5 1 1 1 1 1 1 1 1 1 1 1 1 4 4 3 2 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 1 1 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 23 1 1 1 1 1 1 8 8 1 27 1 1 1 2 1 2 29 2 2 2 1 1 9 6 1 7 1 21 2 11 10 9 1 4 Page 113 12-20 c) 1 1 1 1 2 2 15-39 c) 1 1 1 2 1 1 2 1 1 2 1 2 1 1 2 1 28 1 2 31 1 1 6 6 2 27 9 2 30 2 Pentium M RETN RETF RETF BOUND INTO i i r,m String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS Other NOP (90) Long NOP (0F 1F) PAUSE CLI STI ENTER ENTER LEAVE CPUID RDTSC Notes: a) b) c) 3 27 27 15 5 1 24 24 7 2 6n 3 5n 6 6n 3 7n 6 9n 1 6 5 1 3 3 2 2 30 30 8 4 2 4 0.5 1 0.7 0.7 0.5 1.3 0.6 0.7 0.5 10+6n 1 ca. 5n a) 1 3 ca. 6n a) 1 2 12+7n 4 2 12+9n 1 1 2 1 1 1 1 1 1 2 0.5 1 9 17 i,0 a,b 12 ca. 3 38-59 13 10 18 +4b 2 1 1 b-1 2b 1 38-59 13 ca. 130 42 Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". Has an implicit LOCK prefix. High values are typical, low values are for round divisors. Core Solo/Duo is more efficient than Pentium M in cases with round values that allow an earlyout algorithm. Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH Operands r m32/64 m80 m80 r m32/m64 m80 m80 r μops fused domain 1 1 4 40 1 1 6 169 1 μops unfused domain p0 p1 p01 p2 p3 Latency Reciprocal p4 throughput 1 2 38 1 2 165 Page 114 1 1 1 2 2 1 2 2 1 2 2 1 0 3 167 0.33 f) Pentium M FILD FIST(P) FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. 
FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE FFREEP FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT m m m r AX m16 m16 m16 r r r m r m r m r m r m m m m 4 4 4 1 2 2 3 2 3 3 1 1 2 142 72 1 1 1 1 1 1 1 1 1 1 2 1 6 6 6 6 1 1 26 15 28 15 1 80-100 90-110 ~ 20 ~ 40 ~ 55 ~ 100 ~ 85 1 2 3 2 2 1 2 2 3 1 1 1 1 1 2 142 72 1 1 1 1 1 1 1 1 2 7 3 1 19 3 1 2 131 91 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 3 3 5 5 9-38 c) 9-38 c) 1 1 1 1 1 1 3 5 9-38 c) 26 15 1 1 2 2 8-37 c) 8-37 c) 1 1 1 1 1 1 3 3 8-37 c) 4 1 1 37 19 28 15 1 80-100 90-110 ~20 ~40 ~55 ~100 ~85 1 43 9 9 h) 8 80-110 100-130 ~45 ~60 ~65 ~140 ~140 1 Page 115 2 2 2 1 1 1 3 5 5 3 1 1 5 5 5 1 1 1 Pentium M FNCLEX FNINIT Notes: c) f) g) 3 14 3 14 13 27 High values are typical, low values are for low precision or round divisors. FXCH generates 1 μop that is resolved by register renaming without going to any port. SSE3 instruction only available on Core Solo and Core Duo. 
Integer MMX and XMM instructions Instruction Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKHQDQ PUNPCKHQDQ Operands μops fused domain μops unfused domain p0 p1 p01 p2 r32,mm mm,r32 mm,m32 m32,mm r32,xmm xmm,r32 xmm,m32 m32, xmm mm,mm mm,m64 m64,mm xmm,xmm xmm,m64 m64, xmm xmm, xmm xmm, m128 m128, xmm xmm, m128 m128, xmm xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm 1 1 1 1 1 2 2 1 1 1 1 2 2 1 2 2 2 4 8 4 1 2 1 4 mm,mm 1 1 mm,m64 1 1 xmm,xmm 3 2 1 xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 xmm,xmm xmm, m128 4 1 1 2 3 2 3 1 1 1 2 1 1 p3 Latency Reciprocal p4 throughput 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 2 2 5-6 1 1 2-3 2-3 1 1 1 1 2 Page 116 2 2 1 2 1 2 1 1 1 1 2 2 2 1 2 1 1 2 2 1 1 1 2 2 1 1 1 1 2 0.5 0.5 1 1 1 1 1 1 0.5 1 1 1 1 1 1 2 2 2-10 4-20 2 1 1 2 3 Pentium M PUNPCKLQDQ PUNPCKLQDQ PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW MASKMOVQ MASKMOVDQU PMOVMSKB PMOVMSKB PEXTRW PEXTRW PINSRW PINSRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PADDQ PSUBQ PADDQ PSUBQ PADDQ PSUBQ PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULL/HW PMULHUW PMULL/HW PMULHUW PMULUDQ PMULUDQ PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDWD PMADDWD PAVGB/W PAVGB/W PAVGB/W PAVGB/W PMIN/MAXUB/SW PMIN/MAXUB/SW PMIN/MAXUB/SW PMIN/MAXUB/SW PSADBW PSADBW PSADBW xmm,xmm xmm, m128 mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i mm,mm xmm,xmm r32,mm r32,xmm r32,mm,i r32,xmm,i mm,r32,i xmm,r32,i 1 1 1 2 3 4 2 3 3 8 1 1 2 4 1 2 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 
xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm mm,m64 xmm,xmm 1 1 2 4 2 2 4 6 1 1 2 2 1 1 2 4 1 1 2 4 1 1 2 4 1 1 2 4 1 1 2 4 2 2 4 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1 1 1 2 1 2 j) 1 2 1 1 2 2 2 2 4 4 1 1 2 2 1 1 2 2 1 1 2 2 Page 117 1 1 2 3 1 1 1 1 1 2 1 2 1 0.5 1 1 2 1 1 2 2 0.5 1 1 2 1 1 2 2 1 1 2 2 1 1 2 2 0.5 1 1 2 0.5 1 1 2 1 1 2 1 2 2 1 2 2 1 1 1 2 2 1 2 1 1 2 2 1 1 2 2 1 1 2 2 2 2 4 1 2 1 1 1 2 3 3 3 3 4 4 4 4 3 3 3 3 1 1 1 2 1 1 1 2 1 1 1 1 1 2 2 1 1 4 4 4 Pentium M PSADBW xmm,m128 6 4 Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PAND(N) POR PXOR PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ mm,mm mm,m64 xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i 1 1 2 4 1 1 2 3 3 4 1 1 2 2 Other EMMS Notes: g) j) k) 1 1 2 2 2 4 2 1 3 0.5 1 1 2 1 1 2 2 2 3 6 k) 6 1 1 2 1 1 2 2 1 1 1 3 11 2 11 SSE3 instruction only available on Core Solo and Core Duo. Also uses some execution units under port 1. You may hide the delay by inserting other instructions between EMMS and any subsequent floating point instruction. 
Floating point XMM instructions Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS/D SHUFPS/D MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD UNPCKH/LPD Operands xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 μops fused domain 2 2 2 4 8 1 2 1 1 1 1 1 2 3 4 2 2 4 4 4 2 3 μops unfused domain p0 p1 p01 p2 p3 Latency Reciprocal p4 throughput 2 2 2 2 2 2 1 1 1 1 2 2 4 4 1 1 1 1 1 1 1 j) 2 1 1 1 1 2 3 2 3 1 1 1 1 1 1 2 2 2 1 2 2 Page 118 2 2 1 1 3-4 2 1 1 1 1 1 2 2 2 4 1 1 1 1 1 1 1 3 2 2 1 2 5 5 1 1 Pentium M Conversion CVTPS2PD CVTPS2PD CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) HADDPS HSUBPS g) HADDPD HSUBPD g) MULSS MULSD MULSS MULSD MULPS MULPD MULPS MULPD DIVSS DIVSD DIVSS DIVSD xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm r32,m64 4 4 4 6 2 3 2 3 2 4 2 4 4 5 4 6 1 2 1 2 4 5 3 5 2 2 3 2 3 2 3 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,m32 xmm,m64 xmm,xmm xmm,xmm xmm,m128 xmm,m128 xmm,xmm xmm,xmm xmm,m32 xmm,m64 1 2 2 4 2 6? 3 1 1 2 2 2 2 4 4 1 1 2 2 2 1 3 3 2 2 1 1 3 1 4 2 2 2 2 2 2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 Page 119 2 3 2 4 1 4 2 3 3 1 5 1 1 1 4 2 4 4 1 4 1 1 1 1 1 2 2 2 ? 
3 1 1 1 1 2 2 2 2 1 1 1 1 3 1 3 3 1 1 1 2 2 2 2 4 4 4 4 2 2 4 4 1 1 2 1 1 2 2 1 1 3 3 3 3 3 7 4 4 5 4 5 4 5 4 5 9-18 c) 9-32 c) 9-18 c) 9-32 c) 3 3 3 3 2 2 2 2 2 2 2 2 2 2 3 3 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 4 2 1 2 1 2 2 4 2 4 8-17 c) 8-31 c) 8-17 c) 8-31 c) Pentium M DIVPS DIVPD DIVPS DIVPD CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D RCPSS RCPSS RCPPS RCPPS xmm,xmm xmm,xmm xmm,m128 xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 2 2 4 4 1 2 2 4 1 2 1 2 2 4 1 2 2 4 2 2 2 2 Math SQRTSS SQRTSS SQRTSD SQRTSD SQRTPS SQRTPD SQRTPS SQRTPD RSQRTSS RSQRTSS RSQRTPS RSQRTPS xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,xmm xmm,m128 xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m128 2 3 1 2 2 2 4 4 1 2 2 4 2 2 1 1 2 2 2 2 Logic AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D xmm,xmm xmm,m128 2 4 m32 m32 m4096 m4096 9 6 118 87 Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: c) g) j) 2 2 1 1 2 2 1 1 1 3 2 1 1 1 2 2 1 1 2 2 3 3 3 3 3 1 2 1 3 2 6-30 1 5-58 1 8-56 16-114 2 2 1 1 3 2 3 1 3 2 2 2 9 6 32 43 16-34 c) 16-62 c) 16-34 c) 16-62 c) 3 1 4-28 4-28 4-57 4-57 16-55 16-114 16-55 16-114 1 1 2 2 2 1 1 44 20 12 63 72 43 43 High values are typical, low values are for round divisors. SSE3 instruction only available on Core Solo and Core Duo. Also uses some execution units under port 1. Page 120 16-34 c) 16-62 c) 16-34 c) 16-62 c) 1 1 2 2 1 1 1 1 2 2 1 1 2 2 Merom Intel Core 2 (Merom, 65nm) List of instruction timings and μop breakdown Explanation of column headings: Operands: μops fused domain: μops unfused domain: p015: p0: p1: p5: p2: p3: p4: Unit: Latency: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc. 
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).
Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. flt→int means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
Latency: This is the delay that the instruction generates in a dependency chain.
The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays substantially, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

Integer instructions
Columns: Instruction, Operands, μops fused domain, μops unfused domain (p015, p0, p1, p5, p2, p3, p4), Unit, Latency, Reciprocal throughput
Move instructions
MOV r,r/i 1 1 x x x int 1 0.33
int int int int int 1 0.33 1 1 int int int int int int int int int int 2 2 high b) 4 3 2 2 1 1 1 1 1 7 8 1 1.5 17 20 1 4 1 4 240 1 6 2 2 7 1 1 1 6 7 0.33 1 1 1 17 1 1 8 9 9 117 0.33 1 1 2 2 0.33 1 0.33 1 Merom AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r64 m8 m16 m32 m64 m64 r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 r8,i/cl r8,i/cl r16/32/64,i/cl m,1 m8,i/cl m8,i/cl 1 3 4 1 3 3 3 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 3 5 4 32 56 4 6 5 32 56 1 1 1 3 4 1 3 3 3 1 1 1 1 1 1 1 3 3 2 1 1 1 1 1 1 3 5 4 32 56 3 5 4 31 55 1 1 1 1 2 1 1 1 3 1 3 2 9 8 6 4 12 11 1 1 1 1 1 1 2 1 2 2 9 8 6 3 9 8 x x x x 1 x 1 x x x 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int x x x x 1 1 1 1 x x 2 1 x x x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x Page 123 x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int int 1 1 17 3 5 5 7 3 3 5 3 3 5 3 5 5 7 3 3 5 18 18-26 18-42 29-61 39-72 18 18-26 18-42 29-61 39-72 1 1 1 6 1 1 6 1 6 2 12 11 11 7 14 13 1 1.5 1.5 4 1 1 2 1 1 2 1 1.5 1.5 4 1 1 2 2 1 2 12 12-20 c) 12-36 c) 18-37 c) 28-40 c) 12 12-20 c) 12-36 c) 18-37 c) 28-40 c) 0.33 1 1 0.33 1 0.5 1 1 1 2 Merom RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD m16/32/64,i/cl 
r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near Fused compare/test and branch e,i) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND i) r,m INTO i) String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS REP(N)E CMPS 10 2 3 1 10 2 1 11 3 2 2 1 2 1 7 6 7 2 2 1 9 1 1 8 1 2 2 1 1 1 7 6 1 30 1 1 31 1 1 2 11 11 3 43 3 4 44 1 3 32 32 15 5 1 30 1 1 29 1 1 2 11 11 2 43 2 3 42 1 x 30 30 13 5 x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x 3 2 4+7n - 14+6n 4 2 8+5n - 20+1.2n 8 5 1 1 1 7+7n - 13+n 4 3 7+8n - 17+7n 7 5 7+10n - 7+9n Page 124 x x x x 1 2 1 1 1 x x x 1 1 1 2 1 1 2 2 2 1 1 1 1 1 1 1 1 5 1 2 1 int int int int int int int int int int int int int int int int 13 2 7 1 int int int int int int int int int int int int int int int int int int int int int 0 int int int int int int int int int int int 1 5 6 2 1 1 0 0 0 0 1 1 5 1 1 2 1 1 0.33 4 14 1-2 76 1-2 1-2 68 1 1 1-2 5 5 2 75 2 2 75 2 2 78 78 8 3 1 1+5n - 21+3n 1 7+2n - 0.55n 1+3n - 0.63n 1 3+8n - 23+6n 3 2+7n - 22+5n Merom Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) i,0 a,b 1 1 3 12 1 1 3 10 3 46-100 29 23 2 x x x x x x x x x 1 1 1 int int int int int int int int int 0.33 1 8 8 180-215 64 54 Applies to all addressing modes Has an implicit LOCK prefix. Low values are for small results, high values for high results. See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion. Not available in 64 bit mode. Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST FISTP FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. 
FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Operands r m32/64 m80 m80 r m32/m64 m80 m80 r m m m m r AX m16 m16 m16 r m m μops μops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 4 40 1 1 7 170 1 1 2 3 3 1 2 2 2 1 2 2 3 1 2 142 78 1 1 2 38 1 1 3 166 0 f) 1 1 1 1 1 2 2 2 1 1 1 1 1 2 x x 1 2 2 2 1 x x 1 x x 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 1 1 1 1 1 1 1 1 2 Arithmetic instructions Page 125 float float float float float float float float float float float float float float float float float float float float float float float float float Laten- Recicy procal throughput 1 3 4 45 1 3 4 164 0 6 6 6 6 2 1 184 169 1 1 3 20 1 1 5 166 1 1 1 1 1 1 2 2 2 1 2 10 8 1 2 192 177 Merom FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT r m r m r m r m r m m m m Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT Notes: d) f) g) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 1 21-27 21-27 7-15 7-15 27 82 1 ~96 ~100 ~19 ~53 ~98 ~70 27 82 1 ~96 ~100 ~19 ~53 ~98 ~70 1 2 4 15 1 2 4 15 1 1 1 1 1 1 1 1 1 1 1 2 2 1 float float float float float float float float float float float float float float float float float float float float 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 float float float float float float float float float 1 3 1 1 5 2 2 6-38 d) 5-37 d) 5-37 d) 1 1 1 1 1 1 1 2 2 5-37 d) 2 1 1 16-56 22-29 41 170 6-69 ~96 ~115 ~45 ~96 ~136 ~119 float float float float 1 1 15 63 Round divisors or low precision give low values. Resolved by register renaming. Generates no μops in the unfused domain. SSE3 instruction set. 
Integer MMX and XMM instructions Instruction Operands Move instructions MOVD k) MOVD k) MOVD k) MOVD k) r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 μops μops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 x x x int 1 1 x x 1 Page 126 1 int int Laten- Recicy procal throughput 2 3 2 2 0.33 1 0.5 1 Merom MOVQ (x)mm, (x)mm MOVQ (x)mm,m64 MOVQ m64, (x)mm MOVDQA xmm, xmm MOVDQA xmm, m128 MOVDQA m128, xmm MOVDQU m128, xmm MOVDQU xmm, m128 LDDQU g) xmm, m128 MOVDQ2Q mm, xmm MOVQ2DQ xmm,mm MOVNTQ m64,mm MOVNTDQ m128,xmm mm,mm PACKSSWB/DW PACKUSWB mm,m64 xmm,xmm PACKSSWB/DW PACKUSWB xmm,m128 mm,mm PUNPCKH/LBW/WD/DQ mm,m64 PUNPCKH/LBW/WD/DQ xmm,xmm PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ xmm,m128 PUNPCKH/LQDQ xmm,xmm PUNPCKH/LQDQ xmm, m128 PSHUFB h) mm,mm PSHUFB h) mm,m64 PSHUFB h) xmm,xmm PSHUFB h) xmm,m128 PSHUFW mm,mm,i PSHUFW mm,m64,i PSHUFD xmm,xmm,i PSHUFD xmm,m128,i PSHUFL/HW xmm,xmm,i PSHUFL/HW xmm, m128,i PALIGNR h) mm,mm,i PALIGNR h) mm,m64,i PALIGNR h) xmm,xmm,i PALIGNR h) xmm,m128,i MASKMOVQ mm,mm MASKMOVDQU xmm,xmm PMOVMSKB r32,(x)mm PEXTRW r32,mm,i PEXTRW r32,xmm,i PINSRW mm,r32,i PINSRW mm,m16,i PINSRW xmm,r32,i PINSRW xmm,m16,i 1 1 1 1 1 1 9 4 4 1 1 1 1 1 1 3 4 1 1 3 4 1 2 1 2 4 5 1 2 2 3 1 2 2 2 2 2 4 10 1 2 3 1 2 3 4 1 Arithmetic instructions PADD/SUB(U)(S)B/W/D (x)mm, (x)mm PADD/SUB(U)(S)B/W/D (x)mm,m PADDQ PSUBQ (x)mm, (x)mm PADDQ PSUBQ (x)mm,m 1 1 2 2 x x x int int 1 1 1 x x 1 x int int 1 4 2 2 1 1 x x x x x x x x x x x x x 1 2 2 1 2 int int int int 1 1 1 1 3 3 1 1 3 3 1 1 1 1 4 4 1 1 2 2 1 1 2 2 2 2 1 1 1 2 3 1 1 3 3 1 1 1 2 2 x x x x 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x x Page 127 x x x x x x x x 1 1 1 1 1 1 x x x x 1 1 x x x x x x 1 1 1 1 1 1 1 1 1 1 2 1 2 3 1 2 3 3-8 2-8 2-8 1 1 1 1 int int flt→int int int int flt→int int int int int int int int int int flt→int int int int int int int int int int int int int int int int int 1 int int int int 1 3 1 3 1 1 3 1 3 1 2 2 2 3 5 2 6 2 0.33 1 1 0.33 1 1 4 2 2 0.33 0.33 2 2 1 1 2 2 1 1 2 2 
1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 2-5 6-10 1 1 1 1 1 1.5 1.5 0.5 1 1 1 Merom PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADDD PHSUBD h) PHADDD PHSUBD h) PHADDD PHSUBD h) PHADDD PHSUBD h) PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW h) PMULHRSW h) PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW h) PMADDUBSW h) PAVGB/W PAVGB/W PMIN/MAXUB/SW PMIN/MAXUB/SW PABSB PABSW PABSD h) PSIGNB PSIGNW PSIGND h) PSADBW PSADBW Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS Notes: g) h) k) mm,mm 5 5 int mm,m64 6 5 xmm,xmm 7 7 xmm,m128 mm,mm mm,m64 xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m 8 3 4 5 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 3 3 5 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (x)mm,(x)mm (x)mm,m mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i 1 1 1 1 1 2 3 2 1 1 1 1 1 2 2 2 x x 1 1 1 x x x x x 11 11 x x 1 int int 1 1 1 x x x x 1 1 1 1 1 1 1 1 1 1 x x x x x x x x 1 1 1 1 1 1 x x x x x x x x 1 1 1 1 1 1 1 x x 1 1 x x x 1 x 5 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int float 4 4 6 3 5 1 3 3 3 3 3 1 1 1 1 3 1 1 1 2 2 4 4 2 2 3 3 0.5 1 1 1 1 1 1 1 1 1 1 1 0.5 1 0.5 1 0.5 1 0.5 1 1 1 0.33 1 1 1 1 1 1 1 6 SSE3 instruction set. Supplementary SSE3 instruction set. MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits. 
Page 128 Merom Floating point XMM instructions Instruction Operands Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPS SHUFPD SHUFPD MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD UNPCKH/LPD xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 1 1 1 4 9 1 1 1 2 2 1 1 1 1 3 4 1 2 1 2 1 2 3 4 1 2 1 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 2 2 2 2 2 2 2 2 1 1 1 1 2 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ μops μops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main x x x int int 1 1 2 4 1 1 x x x x 1 x x 2 1 1 int 2 2 int int 1 1 1 1 1 1 1 1 1 1 int 1 1 3 3 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 2 2 1 2 1 1 Page 129 1 1 float float 1 3 3 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 Laten- Recicy procal throughput 1 2 3 2-4 3-4 1 2 3 3 5 3 1 1 1 flt→int flt→int float float int int int int flt→int int float float 3 float float float float float float float float float float float float float float float float 4 1 1 1 3 1 4 2 2 3 3 4 4 0.33 1 1 2 4 0.33 1 1 1 1 1 1 1 2-3 2 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 Merom CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULSS MULSD 
MULSD MULPS MULPS MULPD MULPD DIVSS DIVSS DIVSD DIVSD DIVPS DIVPS DIVPD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm r32,m64 1 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2 1 1 1 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m32/64 xmm,xmm 1 1 1 1 1 1 6 7 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 6 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Page 130 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 float float float float float float float float float float float float float float float float 3 float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float float 3 3 4 4 4 3 4 3 3 3 9 5 4 5 4 5 6-18 d) 6-32 d) 6-18 d) 6-32 d) 3 3 3 3 3 3 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 3 3 2 2 1 1 1 1 1 1 1 1 5-17 d) 5-17 d) 5-31 d) 5-31 d) 5-17 d) 5-17 d) 5-31 d) 5-31 d) 2 2 1 1 1 1 1 1 1 Merom MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D xmm,m32/64 xmm,xmm xmm,m128 1 1 1 1 1 1 xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 Logic AND/ANDN/OR/XORPS/D xmm,xmm AND/ANDN/OR/XORPS/D xmm,m128 1 1 1 1 x x 14 6 141 119 13 4 Math SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS Other LDMXCSR STMXCSR FXSAVE FXRSTOR Notes: d) g) m32 m32 m4096 m4096 1 1 1 1 float float float 1 6-29 1 float float float float float float int int 1 1 1 1 1 1 x 
x x x 3 6-58 3 1 1 1 145 164 Round divisors give low values. SSE3 instruction set. 1 1 1 6-29 6-29 6-58 6-58 2 2 0.33 1 42 19 145 164

Wolfdale
Intel Core 2 (Wolfdale, 45nm)
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).
Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. flt→int means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit.
Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays substantially, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
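
The column definitions above can be turned into a few lines of arithmetic. The following Python sketch is only an illustration of how the figures might be read: the function names are invented here, the row values are hypothetical rather than quoted from the tables, and the unit-crossing helper handles only the plain int/float case, not flt→int entries or blank unit fields.

```python
# Illustrative helpers only. All names are hypothetical and the example
# values below are made up; they are not quoted from the tables.

def has_uop_fusion(fused_domain, p015=0, p2=0, p3=0, p4=0):
    """Return True when p015 + p2 + p3 + p4 exceeds the fused-domain μop
    count, which the column explanation gives as the sign of μop fusion."""
    return p015 + p2 + p3 + p4 > fused_domain


def dependent_chain_cycles(count, latency):
    """Minimum core clock cycles for `count` instructions in a dependency
    chain, where each instruction needs the result of the previous one."""
    return count * latency


def independent_cycles(count, reciprocal_throughput):
    """Average core clock cycles for `count` independent instructions of
    the same kind in the same thread."""
    return count * reciprocal_throughput


def unit_crossing_delay(producer_unit, consumer_unit):
    """Extra clock cycles when a register written in one unit cluster is
    read in the other: 1 for int <-> float, otherwise 0 (simplified)."""
    return 1 if {producer_unit, consumer_unit} == {"int", "float"} else 0


# Hypothetical row: 2 fused μops, with p015=1, p2=1, p3=1, p4=1.
print(has_uop_fusion(2, p015=1, p2=1, p3=1, p4=1))  # True
# Hypothetical instruction with latency 3 and reciprocal throughput 1.
print(dependent_chain_cycles(100, 3))                # 300
print(independent_cycles(100, 1))                    # 100
print(unit_crossing_delay("int", "float"))           # 1
```

In this reading, latency bounds code where each instruction waits for the previous result, while reciprocal throughput bounds code made of independent instructions. Both are minimum estimates, since cache misses, misalignment and exceptions are not modeled.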
Integer instructions Instruction Operands μops μops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main Move instructions Page 132 Laten- Recicy procal throughput Wolfdale MOV MOV a) MOV a) MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) i) POP POP POP POP POPF(D/Q) POPA(D) i) LAHF SAHF SALC i) LEA a) BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE CLFLUSH IN OUT Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r 1 1 1 1 1 2 8 8 2 1 r,r r16/32,m r64,m r,r r,m r,r r,m 1 x x x 1 2 2 3 x 1 x x x x x x x x x x x x m8 1 1 2 2 2 3 7 2 1 1 2 2 17 18 1 4 2 10 24 10 1 2 1 2 11 1 1 2 2 2 4 r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i 1 1 2 2 2 4 1 1 r i m sr r (E/R)SP m sr r,m r m m m x x x 1 4 3 x x x x x 1 1 4 5 x 9 23 2 1 2 1 2 11 x x x 1 1 x x x x 1 1 1 1 0.33 1 1 1 2 x 3 1 1 1 1 x 1 1 0.33 1 1 1 1 1 16 16 2 1 1 1 1 15 9 1 1 1 2 3 3 1 1 1 1 1 8 1 1 1 1 1 1 1 1 1 1 1 1 1 8 2 high b) 4 3 2 1 1 1 20 x x 1 4 1 4 1 1 1 1 2 2 3 1 1 x x x x x x x x 1 1 1 1 1 x x x x x x x x Page 133 x x x x x x x x 1 1 1 1 120 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 8 1 1.5 17 1 1 1 2 2 6 2 2 7 1 1 7 0.33 1 1 1 17 1 1 8 6 9 90 0.33 1 1 2 2 0.33 1 Wolfdale INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV DIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL r m r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r64 m8 m16 m32 m64 m64 r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 
r8,i/cl r8,i/cl r,i/cl m,1 1 3 1 3 5 1 3 3 3 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 4 7 7 32-38 56-62 4 7 7 32 56 1 1 1 1 2 1 1 1 3 1 3 2 9 8 6 4 1 1 1 3 5 1 3 3 3 1 1 1 1 1 1 1 3 3 2 1 1 1 1 1 1 4 7 7 x x x x x x x x x 1 x x 1 x x x 1 1 x x 1 17 3 5 5 7 3 3 5 3 3 5 3 5 5 7 3 3 5 x x x 1 1 1 1 x x 2 1 x x x x 1 1 1 1 1 3 7 6 31 55 1 1 1 1 1 1 1 1 2 1 2 2 9 8 6 3 x x x x x x x x x x x x x x 56-62 1 x x 1 1 2 1 x x x 2 3 2 9 10 13 x x x 1 2 2 3 2 x x x x x x x x x x x x x x 32-38 1 x x x x x x x x x x Page 134 x x x x x x x x x x x x x x 1 6 1 1 1 1 1 1 1 1 1 1 0.33 1 1 1 1 1.5 1.5 4 1 1 2 1 1 2 1 1.5 1.5 4 1 1 2 2 1 2 9-18 c) 14-22 c) 14-23 c) 18-57 c) 34-88 c) 9-18 14-22 c) 14-23 c) 34-88 c) 39-72 c) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 1 1 1 6 1 6 2 12 11 11 7 0.33 1 1 0.33 1 0.5 1 1 1 2 Wolfdale RCR RCL RCR RCL SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD m8,i/cl m8,i/cl m,i/cl r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near Fused compare/test and branch e,i) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND i) r,m INTO i) String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP(N)E SCAS CMPS 12 11 10 2 3 1 9 3 1 10 3 2 2 1 2 1 6 6 9 8 7 2 2 1 8 2 1 7 1 2 2 1 1 1 6 6 1 30 1 1 31 1 1 2 11 11 3 43 3 4 44 1 3 32 32 15 5 1 30 1 1 29 1 1 2 11 11 2 43 2 3 42 1 1 30 30 13 5 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x x 1 1 1 1 1 1 3 2 4+7n-14+6n 4 2 8+5n-20+1.2n 8 5 1 1 1 7+7n-13+n 4 3 7+8n-17+7n 7 5 Page 135 1 5 6 2 1 1 1 1 1 0 0 0 1 2 1 1 1 x x x 1 1 1 1 4 1 1 1 1 1 1 14 13 13 2 7 1 0 0 1 2 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 0.33 3 14 1-2 76 1-2 1-2 68 1 1 1-2 5 5 2 75 2 2 75 2 2 78 78 8 3 1 1 
1+5n-21+3n 1 1 1 7+2n-0.55n 5 1+3n-0.63n 1 1 3+8n-23+6n 2 3 Wolfdale REP(N)E CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) 7+10n-7+9n i,0 a,b 1 1 3 12 1 1 3 10 3 53-117 13 23 2 2+7n-22+5n x x x x x x x x x 1 0.33 1 8 8 1 1 53-211 32 54 Applies to all addressing modes Has an implicit LOCK prefix. Low values are for small results, high values for high results. The reciprocal throughput is only slightly less than the latency. See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion. Not available in 64 bit mode. Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST FISTP FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE Operands r m32/64 m80 m80 r m32/m64 m80 m80 r m m m m r AX m16 m16 m16 r m μops μops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 1 4 40 1 1 7 171 1 1 2 3 3 1 2 2 2 1 2 2 3 1 2 141 1 1 2 38 1 1 2 x 1 x x 3 167 0 f) 1 1 1 1 1 2 2 2 1 1 1 1 1 2 95 x x x x x x 1 1 1 1 1 1 1 2 2 1 2 2 1 2 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 x x x x Page 136 x x 7 23 23 float float float float float float float float float float float float float float float float float float float float float float float float Laten- Recicy procal throughput 1 3 4 45 1 3 4 164 0 6 6 6 6 2 1 1 1 3 20 1 1 5 166 1 1 1 1 1 1 2 2 2 1 2 10 8 1 2 142 Wolfdale FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT m 78 51 r m r m r m 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1 26-29 28-35 17-19 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1 r m r m m m m Math FSCALE FXTRACT FSQRT FSIN FCOS Other FNOP WAIT FNCLEX FNINIT Notes: d) f) g) 1 2 4 15 1 2 4 15 x x 27 1 1 1 1 1 1 1 1 1 1 1 float 1 1 5 2 
2 6-21 d) 5-20 d) 6-21 d) 5-20 d) 1 1 1 1 1 1 43 ~170 6-20 32-85 70-100 38-107 45 50-100 40-130 55-130 x x x x x 1 x x x x x x x x x x float float float float float x x x x x x x x x x x x x x x float float float float float x x x x x x float float float float 1 1 1 1 x x 1 1 1 1 1 177 float float float float float float float float float float float float float float float float float float float float float 1 1 1 1 2 1 1 2 1 1 x x x x x x 28 28 53-84 1 1 18-85 76-100 18105 19 19 57-65 19-100 23-87 FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN x 3 3 5 6-21 1 2 2 5-20 d) 2 1 1 13-40 18-41 10-22 1 1 15 63 Round divisors or low precision give low values. Resolved by register renaming. Generates no μops in the unfused domain. SSE3 instruction set. Integer MMX and XMM instructions Instruction Operands μops μops unfused domain fused domain Page 137 Unit Laten- Recicy procal throughput μops Wolfdale fused dop015 p0 p1 p5 p2 p3 p4 main Move instructions MOVD k) MOVD k) MOVD k) MOVD k) MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA j) PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW j) PACKUSDW j) PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PMOVSX/ZXBW j) PMOVSX/ZXBW j) PMOVSX/ZXBD j) PMOVSX/ZXBD j) PMOVSX/ZXBQ j) PMOVSX/ZXBQ j) PMOVSX/ZXWD j) PMOVSX/ZXWD j) PMOVSX/ZXWQ j) PMOVSX/ZXWQ j) PMOVSX/ZXDQ j) PMOVSX/ZXDQ j) PSHUFB h) PSHUFB h) PSHUFB h) PSHUFB h) r,(x)mm m,(x)mm (x)mm,r (x)mm,m v,v (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm xmm, m128 1 1 1 1 1 1 1 1 1 1 9 4 4 1 1 1 1 1 1 x x x 1 x mm,mm 1 1 1 mm,m64 1 1 1 xmm,xmm 1 1 1 xmm,m128 xmm,xmm xmm,m mm,mm mm,m64 xmm,xmm xmm,m128 xmm,xmm xmm, m128 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m16 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m64 mm,mm mm,m64 xmm,xmm xmm,m128 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 
1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int 1 1 x int int int int 1 1 x x x 1 1 1 x x 1 x int int 1 4 2 2 1 1 x x x x x x x x x x x x x Reciprocal throughput 1 2 2 1 2 1 2 int int int int 1 1 int 1 1 Page 138 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int int 1 2 0.33 1 0.5 1 0.33 1 1 0.33 1 1 4 2 2 0.33 0.33 2 2 1 1 1 1 1 2 3 2 2 1 2 3 1 2 3 3-8 2-8 2-8 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Wolfdale PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR h) PALIGNR h) PALIGNR h) PALIGNR h) PBLENDVB j) PBLENDVB j) PBLENDW j) PBLENDW j) MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB j) PEXTRB j) PEXTRW PEXTRW j) PEXTRD j) PEXTRD j) PEXTRQ j,m) PEXTRQ j,m) PINSRB j) PINSRB j) PINSRW PINSRW PINSRD j) PINSRD j) PINSRQ j,m) PINSRQ j,m) Arithmetic instructions PADD/SUB(U)(S)B/W/D PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PADDQ PSUBQ PHADD(S)W PHSUB(S)W h) PHADD(S)W PHSUB(S)W h) PHADDD PHSUBD h) PHADDD PHSUBD h) PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQQ j) PCMPEQQ j) PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW h) PMULHRSW h) mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i x, m128,i mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i x,x,xmm0 x,m,xmm0 xmm,xmm,i xmm,m,i mm,mm xmm,xmm r32,(x)mm r32,xmm,i m8,xmm,i r32,(x)mm,i m16,(x)mm,i r32,xmm,i m32,xmm,i r64,xmm,i m64,xmm,i xmm,r32,i xmm,m8,i (x)mm,r32,i (x)mm,m16,i xmm,r32,i xmm,m32,i xmm,r64,i xmm,m64,i 1 2 1 2 1 2 2 3 1 1 2 2 1 1 4 10 1 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 2 3 1 1 2 2 1 1 1 4 1 2 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 2 2 1 1 v,v (x)mm,m v,v (x)mm,m 1 1 2 2 1 1 2 2 x x x x x x x x v,v 3 3 1 2 (x)mm,m64 v,v (x)mm,m64 v,v (x)mm,m xmm,xmm xmm,m128 v,v (x)mm,m v,v (x)mm,m 4 3 4 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 x x 2 2 2 x x 1 1 1 1 1 x x x ? x x 3 x x x ? 
x x 1 1 1 1 Page 139 x x x 1 x 1 x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 int int int int 1 int 3 int int int int int int int int int int int 1 1 2 1 2 1 2 3 3 3 3 3 1 2 1 1 2 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2-5 6-10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 1 1 1 2 2 2 2 0.5 1 1 1 1 1 1 1 Wolfdale PMULLD j) xmm,xmm PMULLD j) xmm,m128 PMULDQ j) xmm,xmm PMULDQ j) xmm,m128 PMULUDQ v,v PMULUDQ (x)mm,m PMADDWD v,v PMADDWD (x)mm,m PMADDUBSW h) v,v PMADDUBSW h) (x)mm,m PAVGB/W v,v PAVGB/W (x)mm,m PMIN/MAXSB j) xmm,xmm PMIN/MAXSB j) xmm,m128 PMIN/MAXUB v,v PMIN/MAXUB (x)mm,m PMIN/MAXSW v,v PMIN/MAXSW (x)mm,m PMIN/MAXUW j) xmm,xmm PMIN/MAXUW j) xmm,m PMIN/MAXSD j) xmm,xmm PMIN/MAXSD j) xmm,m128 PMIN/MAXUD j) xmm,xmm PMIN/MAXUD j) xmm,m128 PHMINPOSUW j) xmm,xmm PHMINPOSUW j) xmm,m128 PABSB PABSW PABSD h) v,v PABSB PABSW PABSD h) (x)mm,m PSIGNB PSIGNW PSIGND h) v,v PSIGNB PSIGNW PSIGND h) (x)mm,m PSADBW v,v PSADBW (x)mm,m MPSADBW j) xmm,xmm,i MPSADBW j) xmm,m,i Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PTEST j) PTEST j) PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS Notes: v,v (x)mm,m xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i 4 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 1 4 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 1 1 1 2 2 1 1 1 1 1 1 1 1 x x 1 1 x x x x 1 2 2 1 1 1 1 1 x x 1 1 x x x x 1 1 1 1 1 1 1 1 1 1 x 4 4 x 1 x x 1 1 1 x x 1 1 1 3 4 1 1 1 3 3 x 1 1 2 2 1 1 1 2 3 1 1 1 2 2 1 1 1 2 2 1 11 11 x 1 1 1 1 2 2 x x 1 1 1 1 1 x x x x x x x x x x x x x 1 1 1 1 1 Page 140 x 1 5 5 3 3 3 3 1 1 1 1 1 1 1 4 1 int int 1 x x x int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int float 2 4 1 1 1 1 1 
1 1 1 0.5 1 1 1 0.5 1 0.5 1 1 1 1 1 1 1 4 4 0.5 1 1 3 5 1 1 1 1 2 1 0.5 1 1 1 2 2 0.33 1 1 1 1 1 1 1 1 1 6 Wolfdale g) h) j) k) m) SSE3 instruction set. Supplementary SSE3 instruction set. SSE4.1 instruction set MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits Only available in 64 bit mode Floating point XMM instructions Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPS SHUFPD SHUFPD BLENDPS/PD j) BLENDPS/PD j) BLENDVPS/PD j) BLENDVPS/PD j) MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS UNPCKH/LPS UNPCKH/LPD UNPCKH/LPD EXTRACTPS j) EXTRACTPS j) INSERTPS j) INSERTPS j) Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS Operands xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm x,m32/64 m32/64,x xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i μops μops unfused domain Unit fused dop015 p0 p1 p5 p2 p3 p4 main 1 x,m,xmm0 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 r32,xmm,i m32,xmm,i xmm,xmm,i xmm,m32,i 1 1 1 4 9 1 1 1 2 2 1 1 1 1 1 2 1 2 1 1 2 2 1 2 1 2 1 1 1 2 2 2 1 2 xmm,xmm xmm,m128 xmm,xmm xmm,m64 2 2 2 2 2 2 2 2 x,x,xmm0 x x x int int 1 2 4 1 1 x x x x 1 x x 2 1 1 1 2 2 int int int 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int float float 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x 1 1 1 1 Page 141 x 1 1 1 1 1 1 1 1 2 3 2-4 3-4 1 2 3 3 5 3 1 1 1 1 1 1 2 2 Laten- Recicy procal throughput 1 int int float float int int int int int int int int int int float float int int int int 1 float float float float 4 1 1 2 1 1 1 1 4 1 4 0.33 1 1 2 4 0.33 1 1 1 1 1 1 1 2-3 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Wolfdale CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS 
CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm r32,m64 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 1 1 1 Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULSS MULSD MULSD MULPS MULPS MULPD MULPD DIVSS DIVSS DIVSD DIVSD DIVPS xmm,xmm x,m32/64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m32 xmm,xmm xmm,m64 xmm,xmm 1 1 1 1 1 1 3 4 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x Page 142 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 x x 1 1 1 1 1 1 1 1 float float float float float float float float float float float float float float float float float float float float float float float float float float float float 2 float float float float float float float float float float float float float float float float float float float float float float float 3 2 3 3 4 4 3 3 4 4 4 3 4 3 2 2 2 2 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1 1 1 3 1 1 3 1 1 7 3 3 6 1.5 1.5 4 1 1 5 1 1 4 1 1 5 1 1 6-13 d) 5-12 d) 5-12 d) 6-21 d) 5-20 d) 5-20 d) 6-13 d) 5-12 d) Wolfdale DIVPS DIVPD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccSS/D CMPccPS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D ROUNDSS/D j) ROUNDSS/D j) ROUNDPS/D j) ROUNDPS/D j) DPPS j) DPPS j) DPPD j) DPPD j) Math 
SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm x,m32/64 xmm,xmm xmm,m128 xmm,xmm x,m32/64 xmm,xmm x,m32/64 xmm,xmm xmm,m128 xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 1 1 1 xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 xmm,xmm xmm,m128 1 1 1 1 x x x x x x m32 m32 m4096 m4096 13 10 151 121 12 8 67 74 x x x x x x x x x 1 x 1 1 x 8 38 38 x 47 2 2 x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 x x 1 1 1 1 1 1 1 1 1 x x 1 5-12 d) 6-21 d) 5-20 d) 5-20 d) 3 2 2 3 1 1 3 1 1 3 1 1 3 1 1 3 1 1 3 1 1 3 1 1 11 3 3 9 3 3 6-13 1 float float float float float float 1 1 int int 1 1 1 1 float float float float float float float float float float float float float float float float float float float float float float float 6-20 3 5-12 5-12 5-19 5-19 2 2 Logic AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D Other LDMXCSR STMXCSR FXSAVE FXRSTOR 0.33 1 38 20 145 150

Notes:
d) Round divisors give low values.
g) SSE3 instruction set.

Intel Nehalem

List of instruction timings and μop breakdown

Explanation of column headings:

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port.
For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to ports 0, 1 and 5.

p0: The number of μops going to port 0 (execution units).

p1: The number of μops going to port 1 (execution units).

p5: The number of μops going to port 5 (execution units).

p2: The number of μops going to port 2 (memory read).

p3: The number of μops going to port 3 (memory write address).

p4: The number of μops going to port 4 (memory write data).

Domain: Tells which execution unit domain is used: "int" = integer unit (general purpose registers), "ivec" = integer vector unit (SIMD), "fp" = floating point unit (XMM and x87 floating point). An additional "bypass delay" is generated if a register written by a μop in one domain is read by a μop in another domain. The bypass delay is 1 clock cycle between the "int" and "ivec" units, and 2 clock cycles between the "int" and "fp" units and between the "ivec" and "fp" units.

The bypass delay is indicated under latency only where it is unavoidable because either the source operand or the destination operand is in an unnatural domain, such as a general purpose register (e.g. eax) in the "ivec" domain. For example, the PEXTRW instruction executes in the "int" domain. The source operand is an xmm register and the destination operand is a general purpose register. The latency for this instruction is indicated as 2+1, where 2 is the latency of the instruction itself and 1 is the bypass delay, assuming that the xmm operand is most likely to come from the "ivec" domain. If the xmm operand comes from the "fp" domain then the bypass delay will be 2 rather than 1.

The flags register can also have a bypass delay. For example, the COMISS instruction (floating point compare) executes in the "fp" domain and returns the result in the integer flags. Almost all instructions that read these flags execute in the "int" domain. Here the latency is indicated as 1+2, where 1 is the latency of the instruction itself and 2 is the bypass delay from the "fp" domain to the "int" domain.

The bypass delays from the memory read unit to any other unit and from any unit to the memory write unit are included in the latency figures in the table. Where the domain is not listed, the bypass delays are either unlikely to occur or unavoidable and therefore included in the latency figure.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity increase the delays considerably, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NaN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
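The bypass delay and μop fusion rules described for Nehalem can be written down as a small calculation. The following Python sketch is only an illustration of those rules; the function names and the μop counts used in the fusion check are hypothetical and are not read from the tables.

```python
# Bypass delays between Nehalem execution domains, in core clock cycles,
# taken from the "Domain" explanation for Nehalem.
BYPASS_DELAY = {
    frozenset(("int", "ivec")): 1,
    frozenset(("int", "fp")): 2,
    frozenset(("ivec", "fp")): 2,
}


def bypass_delay(producer_domain, consumer_domain):
    """Extra latency when a register written in one domain is read in another."""
    if producer_domain == consumer_domain:
        return 0
    return BYPASS_DELAY[frozenset((producer_domain, consumer_domain))]


def has_uop_fusion(fused_uops, p015, p2=0, p3=0, p4=0):
    """True if the unfused uop count exceeds the fused-domain count."""
    return p015 + p2 + p3 + p4 > fused_uops


# PEXTRW is listed as 2+1: 2 cycles for the instruction itself plus the
# bypass delay when the xmm operand comes from the "ivec" domain.
pextrw_latency = 2 + bypass_delay("ivec", "int")

# The same instruction when the xmm operand was produced in the "fp" domain.
pextrw_latency_fp = 2 + bypass_delay("fp", "int")

# COMISS is listed as 1+2: the flags are written in "fp" and read in "int".
comiss_latency = 1 + bypass_delay("fp", "int")

# Hypothetical counts for illustration: 1 fused uop that splits into
# one p015 uop and one p2 (memory read) uop.
fused_example = has_uop_fusion(fused_uops=1, p015=1, p2=1)
```

With these definitions, `pextrw_latency` is 3, `pextrw_latency_fp` is 4 and `comiss_latency` is 3, matching the 2+1 and 1+2 notation used in the latency column.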
Integer instructions Instruction Move instructions MOV MOV a) MOV a) MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) i) POP POP POP POP POPF(D/Q) POPA(D) i) LAHF SAHF SALC i) LEA a) BSWAP BSWAP LDS LES LFS LGS LSS PREFETCHNTA Operands μops μops unfused domain Dofused main dop015 p0 p1 p5 p2 p3 p4 main r,r/i r,m m,r m,i r,sr m,sr sr,r sr,m m,r 1 1 1 1 1 2 6 6 2 1 r,r 1 1 r,m r,r r,m r,r r,m 1 2 2 3 7 2 1 1 2 2 3 18 1 3 2 7 8 10 1 2 1 1 1 9 1 r i m sr r (E/R)SP m sr r,m r32 r64 m m x x x 1 3 2 x x x x x x x 1 1 3 4 1 1 1 1 1 1 1 1 x 1 2 2 3 x 1 x x x x x x x x x 1 1 1 1 1 2 2 x x x 1 x x 2 x 1 x 2 7 2 1 2 1 1 1 3 x x x x x x x 1 1 1 x x x x Page 145 x 1 1 1 5 1 8 6 1 1 1 1 1 1 1 8 1 1 1 1 1 1 1 8 1 Laten- Recicy procal throughput int int int int int int int int int ~270 0.33 1 1 1 1 1 13 14 1 int 1 0.33 2 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int 1 2 3 3 2 20 b) 5 3 2 1 4 1 1 3 2 1 1 1 1 1 1 8 1 5 1 15 14 8 0.33 1 1 1 1 15 1 Nehalem PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS i) AAD i) AAM i) MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV c) DIV c) DIV c) DIV c) IDIV c) IDIV c) IDIV c) IDIV c) CBW CWDE CDQE CWD CDQ CQO POPCNT ℓ) POPCNT ℓ) CRC32 ℓ) CRC32 ℓ) m r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r16,m16 r32,m32 r64,m64 r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r8 r16 r32 r64 r,r r,m r,r r,m 1 2 3 2 1 1 1 2 2 2 4 1 1 1 3 1 3 5 1 3 3 3 1 1 1 1 1 1 1 3 3 3 1 1 1 1 1 1 4 6 6 ~40 4 8 7 ~60 1 1 1 1 1 1 1 x x x 1 1 1 2 2 3 1 1 1 1 1 3 5 1 3 3 3 1 1 1 1 1 1 1 3 3 2 1 1 1 1 1 1 4 6 6 x 4 8 7 x 1 1 1 1 1 1 x x x x x x x x x x x x x 
x x x x x x x 1 x x 1 x x x 1 1 x x x x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x 1 1 1 1 x x 2 1 x x x x 1 1 1 1 1 1 1 x x x 1 x x x x x Page 146 2 4 3 x 2 5 3 x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x 1 x x x x x 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 9 23 5 1 6 2 2 7 1 1 1 6 3 15 20 3 5 5 3 3 3 3 3 3 3 3 5 5 3 3 3 3 11-21 17-22 17-28 28-90 10-22 18-23 17-28 37-100 1 1 3 3 0.33 1 1 2 2 0.33 1 0.33 1 1 2 7 1 2 2 2 1 1 1 1 1 2 1 2 2 2 1 1 1 1 1 1 7-11 7-12 7-17 19-69 7-11 7-12 7-17 26-86 1 1 1 1 1 1 Nehalem Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHLD SHLD SHRD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 r8,i/cl r8,i/cl r16/32/64,i/cl m,1 m8,i/cl m8,i/cl m16/32/64,i/cl r,r,i/cl m,r,i/cl r,r,i/cl m,r,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m Control transfer instructions JMP short/near JMP i) far JMP r JMP m(near) JMP m(far) Conditional jump short/near Fused compare/test and branch e) J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL i) far CALL r CALL m(near) CALL m(far) 1 1 2 1 1 1 3 1 3 2 9 8 6 4 12 11 10 2 3 2 3 1 9 2 1 10 3 1 2 1 2 1 2 2 1 1 1 1 1 1 2 1 2 2 9 8 6 3 9 8 7 2 2 2 2 1 8 2 1 7 3 1 1 1 1 1 2 2 1 31 1 1 31 1 1 2 6 11 2 46 3 4 47 1 31 1 1 31 1 1 2 6 11 2 46 2 3 47 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x 1 1 1 1 1 x x x ? x x x ? 1 11 1 1 1 x x 1 1 1 1 1 1 1 9 ? ? Page 147 ? ? 
1 1 1 1 int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int int 1 6 1 1 6 1 6 2 13 11 12-13 7 16 14 15 3 8 4 9 1 1 6 6 3 3 1 1 0 0 0 0 0 0.33 1 1 0.33 1 0.5 1 1 1 2 12-13 1 1 1 5 1 1 1 1 1 1 0.33 4 5 2 67 2 2 73 2 2 2 4 7 2 74 2 2 79 Nehalem RETN RETN RETF RETF BOUND i) INTO i) String instructions LODS REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC Notes: a) b) c) e) i) ℓ) i i r,m small n large n small n large n a,0 a,b 1 3 39 40 15 4 1 2 39 40 13 4 1 1 int int int int int int 2 2 1 11+4n 3 1 60+n 2.5/16 bytes 5 2 13+6n 2/16 bytes 3 2 37+6n 5 3 65+8n x 1 1 5 11 34+7b 3 25-100 22 28 1 1 5 9 1 1 x x 1 x x x x x x 1 x x x 1 x x x 2 x x x x x x x x x x x x 3 1 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int int int int int int int 2 2 120 124 7 5 1 40+12n 1 12+n 1 clk / 16 bytes 4 12+n 1 clk / 16 bytes 1 40+2n 4 42+2n 0.33 1 9 8 79+5b ~200 5 ~200 24 40-60 Applies to all addressing modes Has an implicit LOCK prefix. Low values are for small results, high values for high results. See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion. Not available in 64 bit mode. SSE4.2 instruction set. Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) Operands r m32/64 m80 m80 r m32/m64 μops μops unfused domain Dofused main dop015 p0 p1 p5 p2 p3 p4 main 1 1 4 41 1 1 1 1 2 38 1 1 1 x 1 1 x x 1 2 3 1 Page 148 1 float float float float float float Laten- Recicy procal throughput 1 3 4 45 1 4 1 1 2 20 1 1 Nehalem FSTP FBSTP FXCH FILD FIST(P) FISTTP g) FLDZ FLD1 FLDPI FLDL2E etc. 
FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN m80 m80 r m m m r AX m16 m16 m16 r m m r m r m r m r m r m m m m 7 208 1 1 3 3 1 2 2 2 2 3 2 2 1 2 143 79 3 204 0 f) 1 1 1 1 2 2 2 2 2 1 1 1 2 89 52 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1 25 35 17 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1 25 35 17 24 17 1 ~100 ~100 ~100 19 ~55 ~100 ~82 24 17 1 ~100 ~100 ~100 19 ~55 ~100 ~82 x x x x x x 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 x x x x x x x x 8 23 23 x 27 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x 1 x x x x x x x Page 149 1 1 1 1 1 1 2 1 1 2 1 1 x x x x x x x x x x x x x x x x x x x x x x x x 1 1 1 1 1 float float float float float float float float float float float float float float float float float float 5 242 0 6 7 7 2+2 7 5 1 178 156 5 245 1 1 1 1 1 2 2 2 1 2 31 1 1 4 178 156 float 3 1 float 1 float 5 1 float 1 float 7-27 d) 7-27 d) float 7-27 d) 7-27 d) float 1 1 float 1 1 float 1 float 1 float 1 float 1 float 3 2 float 5 2 float 7-27 d) 7-27 d) float 1 float 1 float 1 float 14 float 19 float 22 float 12 float 13 float ~27 float 40-100 float 40-100 float ~110 float 58 float ~80 float ~115 float ~120 Nehalem Other FNOP WAIT FNCLEX FNINIT Notes: d) f) g) 1 1 1 2 2 x 3 3 ~190 ~190 x x x x float float float float x x x 1 1 17 77 Round divisors or low precision give low values. Resolved by register renaming. Generates no μops in the unfused domain. SSE3 instruction set. 
Integer MMX and XMM instructions Instruction Move instructions MOVD k) MOVD k) MOVD k) MOVD k) MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU g) MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA j) PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW j) PACKUSDW j) PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PMOVSX/ZXBW j) PMOVSX/ZXBW j) PMOVSX/ZXBD j) PMOVSX/ZXBD j) Operands μops μops unfused domain Dofused main dop015 p0 p1 p5 p2 p3 p4 main r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm xmm, m128 m128, xmm xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm xmm, m128 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x mm,mm 1 1 1 mm,m64 1 1 1 xmm,xmm 1 1 x x xmm,m128 xmm,xmm xmm,m (x)mm, (x)mm (x)mm,m xmm,xmm xmm, m128 xmm,xmm xmm,m64 xmm,xmm xmm,m32 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x x x x x x x x x x x int 1 1 x x 1 x ivec 1 1 x x x ivec 1 1 1 x x 1 x ivec 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x ivec ivec 1 1 1 1 1 Page 150 ivec Laten- Recicy procal throughput 1+1 3 1+1 2 1 2 3 1 2 3 2 3 2 1 1 ~270 ~270 2 0.33 1 0.33 1 0.33 1 1 0.33 1 1 1 1 1 0.33 0.33 2 2 1 1 1 1 2 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 1 1 1 1 1 1 0.5 2 2 2 0.5 2 0.5 1 1 2 1 2 Nehalem PMOVSX/ZXBQ j) PMOVSX/ZXBQ j) PMOVSX/ZXWD j) PMOVSX/ZXWD j) PMOVSX/ZXWQ j) PMOVSX/ZXWQ j) PMOVSX/ZXDQ j) PMOVSX/ZXDQ j) PSHUFB h) PSHUFB h) PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR h) PALIGNR h) PBLENDVB j) PBLENDVB j) PBLENDW j) PBLENDW j) MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB j) PEXTRB j) PEXTRW PEXTRW j) PEXTRD j) PEXTRD j) PEXTRQ j,m) PEXTRQ j,m) PINSRB j) PINSRB j) PINSRW PINSRW PINSRD j) PINSRD j) PINSRQ j,m) PINSRQ j,m) Arithmetic instructions PADD/SUB(U) (S)B/W/D/Q PADD/SUB(U) (S)B/W/D/Q PHADD/SUB(S)W/D h) PHADD/SUB(S)W/D h) PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQQ j) PCMPEQQ j) xmm,xmm xmm,m16 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m64 (x)mm, (x)mm 
(x)mm,m mm,mm,i mm,m64,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm, m128,i (x)mm,(x)mm,i (x)mm,m,i x,x,xmm0 xmm,m,xmm0 xmm,xmm,i xmm,m,i mm,mm xmm,xmm r32,(x)mm r32,xmm,i m8,xmm,i r32,(x)mm,i m16,(x)mm,i r32,xmm,i m32,xmm,i r64,xmm,i m64,xmm,i xmm,r32,i xmm,m8,i (x)mm,r32,i (x)mm,m16,i xmm,r32,i xmm,m32,i xmm,r64,i xmm,m64,i 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 2 3 1 2 4 10 1 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 4 1 2 2 2 2 2 1 2 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x x x x x x x 1 1 x x 1 x 1 x x x x x x x x x x x x x x x x (x)mm, (x)mm 1 1 x x (x)mm,m (x)mm, (x)mm (x)mm,m64 (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 1 3 4 1 1 1 1 1 3 3 1 1 1 1 x x x x x x x x x x x x x x Page 151 x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x x x x x x x x x x ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 2 ivec 1 ivec ivec float ivec 2+2 2+1 ivec 2+1 ivec 2+1 ivec 2+1 ivec 1+1 ivec 1+1 ivec 1+1 ivec 1+1 ivec 1 ivec 3 ivec 1 ivec 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 0.5 1 0.5 1 0.5 1 0.5 1 1 1 1 1 0.5 1 2 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 2 1.5 3 0.5 2 0.5 2 Nehalem PCMPGTQ ℓ) PCMPGTQ ℓ) PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW h) PMULHRSW h) PMULLD j) PMULLD j) PMULDQ j) PMULDQ j) PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW h) PMADDUBSW h) PAVGB/W PAVGB/W PMIN/MAXSB j) PMIN/MAXSB j) PMIN/MAXUB PMIN/MAXUB PMIN/MAXSW PMIN/MAXSW PMIN/MAXUW j) PMIN/MAXUW j) PMIN/MAXU/SD j) PMIN/MAXU/SD j) PHMINPOSUW j) PHMINPOSUW j) PABSB PABSW PABSD h) PABSB PABSW PABSD h) PSIGNB PSIGNW PSIGND h) PSIGNB PSIGNW PSIGND h) PSADBW PSADBW MPSADBW j) MPSADBW j) PCLMULQDQ n) AESDEC, AESDECLAST, AESENC, AESENCLAST n) AESIMC n) AESKEYGENASSIST n) Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 xmm,xmm xmm,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m128 
(x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm xmm,m xmm,xmm xmm,m128 xmm,xmm xmm,m128 1 1 1 1 1 1 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 (x)mm,(x)mm 1 1 x x (x)mm,m 1 1 x x (x)mm,(x)mm 1 1 x x (x)mm,m (x)mm,(x)mm (x)mm,m xmm,xmm,i xmm,m,i xmm,xmm,i 1 1 1 3 4 1 1 1 3 3 x x x x x x x x x x x x x ivec 3 ivec 3 ivec 6 ivec 3 ivec 3 ivec 3 ivec 3 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 1 ivec 3 ivec 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x 1 ivec 1 ivec 3 ivec 5 1 1 1 1 1 x x Page 152 x x x x ivec 0.5 12 ~5 ~5 ~5 ~2 ~2 ~2 1 0.33 1 1 1 0.5 2 1 3 1 2 8 1 x x 1 1 1 1 1 1 1 1 0.5 1 1 2 0.5 2 0.5 2 1 2 1 2 1 3 1 xmm,xmm xmm,xmm xmm,xmm,i (x)mm,(x)mm (x)mm,m 1 1 1 1 1 1 2 1 1 1 x x 3 1 x x x x x x x x x x x x 1 1 x x ivec Nehalem PTEST j) PTEST j) PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ xmm,xmm xmm,m128 mm,mm/i mm,m64 xmm,i xmm,xmm xmm,m128 xmm,i 2 2 1 1 1 2 3 1 2 2 1 1 1 2 2 1 x x String instructions PCMPESTRI ℓ) PCMPESTRI ℓ) PCMPESTRM ℓ) PCMPESTRM ℓ) PCMPISTRI ℓ) PCMPISTRI ℓ) PCMPISTRM ℓ) PCMPISTRM ℓ) xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i 8 9 9 10 3 4 4 6 8 8 9 10 3 4 4 5 x x x x x x x x x x x x x x x x x x x x x x x x 11 11 x x x Other EMMS Notes: g) h) j) k) ℓ) m) n) x x x x x 1 1 1 1 1 x x ivec 3 ivec 1 ivec ivec 1 2 ivec 1 1 1 1 2 1 2 1 1 ivec ivec ivec ivec ivec ivec ivec ivec 14 14 7 7 8 8 7 7 5 6 6 6 2 2 2 5 1 1 x x x 1 1 1 1 1 float 6 SSE3 instruction set. Supplementary SSE3 instruction set. 
SSE4.1 instruction set MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits SSE4.2 instruction set Only available in 64 bit mode Only available on newer models Floating point XMM instructions Instruction Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVH/LPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS/D SHUFPS/D BLENDPS/PD j) Operands xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,m128,i xmm,xmm,i μops μops unfused domain Dofused main dop015 p0 p1 p5 p2 p3 p4 main 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 1 1 float 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 float float 1 1 1 1 1 Page 153 1 1 1 1 1 float float float Laten- Recicy procal throughput 1 2 3 2 3 1 2 3 3 5 1 1+2 ~270 1 1 1 1 1 1-4 1-3 1 1 1 2 1 1 1 2 1 1 1 Nehalem BLENDPS/PD j) BLENDVPS/PD j) BLENDVPS/PD j) MOVDDUP g) MOVDDUP g) MOVSH/LDUP g) MOVSH/LDUP g) UNPCKH/LPS/D UNPCKH/LPS/D EXTRACTPS j) EXTRACTPS j) INSERTPS j) INSERTPS j) Conversion CVTPD2PS CVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ CVTDQ2PD CVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D xmm,m128,i x,x,xmm0 xmm,m,xmm0 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,xmm xmm,m128 r32,xmm,i m32,xmm,i xmm,xmm,i xmm,m32,i 2 2 3 1 1 1 1 1 1 1 2 1 3 1 2 2 1 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m64 xmm,xmm xmm,m32 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m64 xmm,xmm xmm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,mm xmm,m64 mm,xmm mm,m128 xmm,r32 xmm,m32 r32,xmm r32,m32 xmm,r32 xmm,m32 r32,xmm r32,m64 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 
2 2 2 2 1 1 1 1 2 1 1 1 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 1 1 1 1 1 1 1 1 1 2 2 1 1 float float float float 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 float 2 1 2 1 1 ? 1 1 1 1 x x 1 Page 154 1 1 1 ? 1 1 1 1 1 ? 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x float float float 1 1 1+2 1 float float 1 float float float float float float float float float float float float float float float float float float float float 4 4 2 1 3+2 3+2 4+2 4+2 3+2 3+2 ivec/float 6 float/ivec 6 float float float float float float float float 3+2 float float float float 3 1 1 1 1 1 1 1 1 3+2 4+2 3+2 3 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1 1 1 1 1 Nehalem ADDSUBPS/D g) ADDSUBPS/D g) HADDPS HSUBPS g) HADDPS HSUBPS g) HADDPD HSUBPD g) HADDPD HSUBPD g) MULSS MULPS MULSS MULPS MULSD MULPD MULSD MULPD DIVSS DIVPS DIVSS DIVPS DIVSD DIVPD DIVSD DIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccPS/D xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m128 xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m 1 1 3 4 3 4 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 xmm,xmm 1 1 1 xmm,m xmm,xmm xmm,m32/64 xmm,xmm xmm,m32/64 xmm,xmm xmm,m128 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 xmm,xmm,i 1 1 1 xmm,m128,i xmm,xmm,i xmm,m128,i xmm,xmm,i xmm,m128,i 2 4 6 3 4 1 4 5 3 3 xmm,xmm xmm,m xmm,xmm xmm,m xmm,xmm xmm,m 1 2 1 2 1 1 1 1 1 1 1 1 xmm,xmm xmm,m128 1 1 1 1 m32 m32 m4096 m4096 6 2 141 112 6 1 141 90 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 float float float float float float float float float float float float float float float float float 3 5 3 4 5 7-14 7-22 3 3 1 1 2 2 2 2 1 1 1 1 7-14 7-14 7-22 7-22 2 2 1 CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D ROUNDSS/D ROUNDPS/D j) ROUNDSS/D ROUNDPS/D j) DPPS j) DPPS j) DPPD j) DPPD j) Math SQRTSS/PS SQRTSS/PS SQRTSD/PD SQRTSD/PD RSQRTSS/PS RSQRTSS/PS 1 x x x 1 2 x x x 
1 1 1 1 float float float float float float float float 1 1 x x x 1 1 1 1 1 3 3 1 1 1 1 1 1 1 3 1 11 1 2 9 1 3 7-18 1 float float float float float float 7-18 7-18 7-32 7-32 2 2 1 1 float float 1 1 1 1 1 float float float float float 1+2 7-32 3 Logic AND/ANDN/OR/XORPS/D AND/ANDN/OR/XORPS/D Other LDMXCSR STMXCSR FXSAVE FXRSTOR 1 1 x x x x x x x 1 1 1 1 x 5 38 38 x 42 90 1 1 5 1 90 100

Notes:
g) SSE3 instruction set.

Intel Sandy Bridge
List of instruction timings and μop breakdown

Explanation of column headings:

Operands: i = immediate data, r = register, mm = 64-bit mmx register, x = 128-bit xmm register, (x)mm = mmx or xmm register, y = 256-bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under μops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write μop in the second clock cycle, but a read port can receive an address-calculation μop in the second clock cycle. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to port 0, 1 and 5.

p0: The number of μops going to port 0 (execution units).

p1: The number of μops going to port 1 (execution units).

p5: The number of μops going to port 5 (execution units).

p23: The number of μops going to port 2 or 3 (memory read or address calculation).

p4: The number of μops going to port 4 (memory write data).

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

The latencies and throughputs listed below for addition and multiplication using full size YMM registers are obtained only after a warm-up period of a thousand instructions or more. For shorter sequences of code, the latencies may be one or two clock cycles longer and the reciprocal throughputs may be double the listed values. There is no warm-up effect when vectors are 128 bits wide or less.
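The μop fusion rule in the column explanation can be checked mechanically for any table row. The following is a minimal sketch in Python, assuming a hypothetical `UopRow` container whose fields mirror the column headings; the example rows use illustrative values chosen for the sketch, not values read from the tables, and entries marked 1+ are simply treated as 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UopRow:
    """μop counts for one table row (hypothetical container, not from the manual)."""
    fused: int     # "μops fused domain" column
    p015: int = 0  # μops going to ports 0, 1 and 5
    p23: int = 0   # μops going to port 2 or 3 (memory read or address calculation)
    p4: int = 0    # μops going to port 4 (memory write data)


def unfused_total(row: UopRow) -> int:
    """Sum of the unfused-domain columns p015 + p23 + p4."""
    return row.p015 + row.p23 + row.p4


def has_uop_fusion(row: UopRow) -> bool:
    """Rule from the column explanation: fusion if the sum exceeds the fused count."""
    return unfused_total(row) > row.fused


# Illustrative rows (assumed values, not copied from the tables).
register_op = UopRow(fused=1, p015=1)         # one execution μop, nothing to fuse
load_and_op = UopRow(fused=1, p015=1, p23=1)  # read μop fused with an execution μop
store_op = UopRow(fused=1, p23=1, p4=1)       # address μop fused with a store-data μop

for name, row in [("register_op", register_op),
                  ("load_and_op", load_and_op),
                  ("store_op", store_op)]:
    print(name, unfused_total(row), has_uop_fusion(row))
```

A row whose unfused sum equals its fused count has no fusion; any excess corresponds to μops that travel as one through the decode, rename, allocate and retirement stages but occupy separate execution ports.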
Integer instructions Instruction Move instructions MOV Operands r,r/i μops μops unfused domain Latency ReciComfused p015 p0 p1 p5 p23 p4 procal ments dothroughmain put 1 1 x Page 157 x x 1 Sandy Bridge MOV r,m 1 1 MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG m,r m,i m,r r,r 1 1 2 1 1 1 1 r,m 1 r,r r,m r,r r,m 2 2 3 8 2 2 3 x 3 1 1 2 3 16 1 1 2 9 18 1 3 1 1 2 XLAT PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA LEA BSWAP BSWAP PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT r i m r (E/R)SP m r,m r,m r32 r64 m m r,r/i r,m m,r/i r,same r,r/i r,m m,r/i r,r/i m,r/i r m 1 x x 1 1 1 x 2 0.5 3 1 1 1 ~350 1 1 2 0 x x x x x x x x x x x x 0 8 10 1 3 1 1 1 2 1 1 2 3 2 1 2 1 1 2 1 2 2 4 1 1 1 3 1 1 1 0 2 2 3 1 1 1 1 x x x x 2 1 1 1 2 1 8 1 1 2 1 8 1 1 1 1 1 8 2 25 7 3 2 1 x 1 1 1 1 3 1 2 1 2 1 1 1 1 1 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x Page 158 0.5 1 2 1 1 1 1 1 1 implicit lock 1 1 1 1 1 8 0.5 0.5 1 18 9 1 1 0.5 1 1 1 0.5 0.5 4 33 6 1 1 2 1 1 2 1 1 2 all addressing modes 1 6 0 2 2 7 1 1 1 6 0.5 1 0.25 1 1 1.5 0.5 2 not 64 bit not 64 bit not 64 bit simple complex or rip relative Sandy Bridge AAA AAS 2 2 4 not 64 bit DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV 3 3 8 1 4 3 2 1 2 1 1 1 3 2 1 1 2 1 1 10 11 10 x 10 10 9 x 4 2 20 3 4 4 3 3 4 3 3 3 not 64 bit not 64 bit not 64 bit r,r r,m r,r r,m 3 3 8 1 4 3 2 1 2 1 1 1 4 3 2 1 2 1 1 10 11 10 34-56 10 10 9 59138 1 1 1 2 1 1 1 1 1 1 r,r/i r,m m,r/i r,same r,r/i m,r/i r,i m,i r,cl m,cl r,i 1 1 2 1 1 1 1 3 3 5 1 1 1 1 0 1 1 1 1 3 3 1 CBW CWDE CDQE CWD CDQ CQO POPCNT POPCNT CRC32 CRC32 Logic instructions AND OR XOR AND OR XOR AND OR XOR XOR TEST TEST SHR SHL SAR SHR SHL SAR SHR SHL SAR SHR SHL SAR ROR ROL r8 r16 
r32 r64 r,r r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r,m r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r8 r16 r32 r64 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 20-24 21-25 20-28 30-94 21-24 21-25 20-27 40-103 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 1 x x x x x x x x x x x x x x x x x Page 159 11 1 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 11-14 11-14 11-18 22-76 11-14 11-14 11-18 25-84 0.5 1 0.5 1 1 0.5 1 1 1 1 1 1 2 1 6 0 1 1 1 2 1 2 1 2 1 0.5 1 0.25 0.5 0.5 2 2 4 1 SSE4.2 SSE4.2 SSE4.2 SSE4.2 Sandy Bridge ROR ROL ROR ROL ROR ROL RCR RCR RCR RCR RCR RCR RCL RCL RCL RCL RCL SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD m,i r,cl m,cl r8,1 r16/32/64,1 r,i m,i r,cl m,cl r,1 r,i m,i r,cl m,cl r,r,i m,r,i r,r,cl m,r,cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m Control transfer instructions JMP short/near JMP r JMP m Conditional jump short/near Fused arithmetic and branch J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL RET RET BOUND INTO String instructions LODS short short short near r m i r,m 4 3 5 high 3 8 11 8 11 3 8 11 8 11 1 3 4 5 1 10 2 1 11 3 1 1 1 2 1 1 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 7 11 3 2 3 2 3 15 4 2 7 11 2 1 2 2 2 13 4 3 2 2 2 1 high 2 5 3 8 7 8 7 3 8 7 8 7 1 4 3 1 8 1 1 7 1 1 1 1 1 0 1 3 1 x x x x 5 2 6 x x x x 2 1 2 1 6 2 1 x 1 1 x 2 x 1 3 1 x x x 1 x x x 1 1 1 x 2 2 4 high 2 5 6 5 6 2 6 6 6 6 0.5 2 2 4 0.5 5 0.5 0.5 5 2 1 1 0.5 1 0.25 1 4 x Page 160 x 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 0 0 0 0 2 2 2 1-2 0 1-2 2-4 5 5 2 2 2 2 2 7 6 1 fast if not jumping not 64 bit not 64 bit Sandy Bridge REP LODS STOS REP STOS 5n+12 3 2n REP STOS 1.5/16B 1/16B MOVS REP MOVS 5 2n 1.5 n REP MOVS 3/16B 1/16B SCAS REP SCAS CMPS REP CMPS 3 6n+47 5 8n+80 Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDTSCP RDPMC 1 1 1 n worst case best case 4 1 1 a,0 a,b ~2n 1 worst case best case 1 2n+45 4 2n+80 0 0 0.25 0.25 7 7 12 10 49+6b 3 3 31-75 21 23 35 2 decode only 1 per clk 11 8 1 84+3b 1 
7 100-250 28 36 42 Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc Operands r m32/64 m80 m80 r m32/m64 m80 m80 r m m m r μops μops unfused domain Latency ReciComfused p015 p0 p1 p5 p23 p4 procal ments dothroughmain put 1 1 4 43 1 1 7 246 1 1 3 3 1 2 2 3 1 1 2 40 1 1 1 1 1 1 0 6 7 7 1 1 2 3 0 1 1 1 1 2 2 3 1 2 1 3 4 45 1 4 5 1 2 3 1 1 1 1 1 Page 161 1 1 1 1 2 3 1 1 2 21 1 1 5 252 0.5 1 2 2 2 2 2 2 SSE3 Sandy Bridge FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN AX m16 m16 m16 2 2 3 2 1 1 143 90 2 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 3 2 2 2 2 1 2 28 41-87 17 1 2 1 1 1 1 1 1 1 1 2 3 2 2 2 2 1 2 28 r m m r m r m r m r m r m m m m 2 1 1 1 1 1 1 1 1 1 1 8 5 1 3 1 1 1 1 1 1 1 1 1 1 5 1 10-24 1 1 1 1 1 1 1 2 1 1 1 1 3 1 1 4 1 1 1 1 17 21 26-50 22 27 27 17 17 1 1 64-100 x 20-110 x 20-110 x 53-118 x 12 10 10-24 47-100 47-115 43-123 61-69 1 102 102 28-91 x Other FNOP WAIT FNCLEX FNINIT 1 2 5 26 1 2 5 26 1 1 1 1 1 166 165 1 1 1 1 10-24 10-24 1 1 1 1 1 1 1 1 2 1 2 21 26-50 130 93-146 1 Integer MMX and XMM instructions Page 162 1 1 22 81 Sandy Bridge Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW PACKUSDW PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PMOVSX/ZXBW PMOVSX/ZXBW PMOVSX/ZXBD PMOVSX/ZXBD PMOVSX/ZXBQ PMOVSX/ZXBQ PMOVSX/ZXWD PMOVSX/ZXWD PMOVSX/ZXWQ PMOVSX/ZXWQ PMOVSX/ZXDQ PMOVSX/ZXDQ PSHUFB PSHUFB PSHUFW PSHUFW Operands μops 
μops unfused domain Latency ReciComfused p015 p0 p1 p5 p23 p4 procal ments dothroughmain put r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm,(x)mm (x)mm,m64 m64, (x)mm x,x x, m128 m128, x x, m128 m128, x x, m128 mm, x x,mm m64,mm m128,x x, m128 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 x x x mm,mm 1 1 1 mm,m64 1 1 1 x,x 1 1 x x x,m128 x,x x,m (x)mm,(x)mm (x)mm,m x,x x, m128 x,x x,m64 x,x x,m32 x,x x,m16 x,x x,m64 x,x x,m32 x,x x,m64 (x)mm,(x)mm (x)mm,m mm,mm,i mm,m64,i 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x x 1 x x x 1 1 1 1 1 x x x 1 1 1 1 1 1 1 1 2 1 1 1 1 Page 163 1 1 1 1 1 1 3 1 3 1 3 3 1 3 3 3 3 3 1 1 ~300 ~300 1 0.5 0.5 1 0.5 1 0.5 1 0.5 1 SSE3 1 0.5 1 1 1 0.5 SSE4.1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 Sandy Bridge PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR PALIGNR PBLENDVB PBLENDVB PBLENDW PBLENDW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB PEXTRB PEXTRW PEXTRW PEXTRD PEXTRD PEXTRQ PEXTRQ PINSRB PINSRB PINSRW PINSRW PINSRD PINSRD PINSRQ PINSRQ x,x,i x,m128,i x,x,i x, m128,i (x)mm,(x)mm,i (x)mm,m,i x,x,xmm0 x,m,xmm0 x,x,i x,m,i mm,mm x,x r32,(x)mm r32,x,i m8,x,i r32,(x)mm,i m16,(x)mm,i r32,x,i m32,x,i r64,x,i m64,x,i x,r32,i x,m8,i (x)mm,r32,i (x)mm,m16,i x,r32,i x,m32,i x,r64,i x,m64,i 1 2 1 2 1 2 2 3 1 2 4 10 1 2 2 2 2 2 3 2 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 1 1 4 1 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1 (x)mm, (x)mm (x)mm,m (x)mm, (x)mm (x)mm,m64 (x)mm,(x)mm (x)mm,m x,x x,m128 x,x x,m128 x,same x,same (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m 1 1 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 3 3 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 x x x x x x 1 1 x x x x x x x x 1 1 
x x 1 1 1 1 1 1 1 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 1 1 1 1 2 1 1 1 2 4 1 x 2 2 1 1 2 1 1 2 1 1 1 1 2 2 1 2 1 2 1 2 1 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 1 6 1 1 1 1 2 1 1 1 1 1 0.5 1 0.5 1 0.5 1 0.5 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1, 64b SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1, 64 b Arithmetic instructions PADD/SUB(U,S)B/W/D/Q PADD/SUB(U,S)B/W/D/Q PHADD/SUB(S)W/D PHADD/SUB(S)W/D PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQQ PCMPEQQ PCMPGTQ PCMPGTQ PSUBxx, PCMPGTx PCMPEQx PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULLD PMULLD PMULDQ PMULDQ PMULUDQ PMULUDQ 1 1 1 1 1 1 1 1 1 1 1 1 Page 164 1 1 2 1 1 1 1 1 5 1 0 0 5 1 5 1 5 1 5 1 5 1 0.5 0.5 1.5 1.5 0.5 0.5 0.5 0.5 1 1 0.25 0.5 1 1 1 1 1 1 1 1 1 1 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.2 SSE4.2 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 Sandy Bridge PMADDWD PMADDWD PMADDUBSW PMADDUBSW PAVGB/W PAVGB/W PMIN/MAXSB PMIN/MAXSB PMIN/MAXUB PMIN/MAXUB PMIN/MAXSW PMIN/MAXSW PMIN/MAXUW PMIN/MAXUW PMIN/MAXU/SD PMIN/MAXU/SD PHMINPOSUW PHMINPOSUW PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW MPSADBW MPSADBW (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m x,x x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x,i x,m,i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PXOR PTEST PTEST PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ (x)mm,(x)mm (x)mm,m x,same x,x x,m128 mm,mm/i mm,m64 x,i x,x x,m128 x,i 1 1 1 1 1 1 1 1 2 3 1 1 1 0 2 2 1 1 1 2 2 1 x x x x x x 1 x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i 8 8 8 8 3 4 3 4 8 7 8 7 3 3 3 3 x,x,i 18 18 String instructions PCMPESTRI PCMPESTRI PCMPESTRM PCMPESTRM PCMPISTRI PCMPISTRI PCMPISTRM 
PCMPISTRM Encryption instructions PCLMULQDQ 5 1 5 1 x x x x x x x x x x x x x x x x x x x x x x x x 1 1 Page 165 1 1 1 1 1 1 1 1 1 1 5 1 x x x x 1 1 1 1 1 1 1 1 1 x x x x 1 1 1 1 5 1 x x x x 6 0 1 1 1 x x x SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSSE3 SSSE3 SSE4.1 SSE4.1 1 1 x x x 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 0.5 0.5 1 1 1 1 1 2 1 1 4 1 11-12 1 3 1 11 1 14 0.5 0.25 1 1 1 2 1 1 1 1 SSE4.1 SSE4.1 4 4 4 4 3 3 3 3 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 8 CLMUL Sandy Bridge AESDEC, AESDECLAST, AESENC, AESENCLAST AESIMC AESKEYGENASSIST x,x x,x x,x,i Other EMMS 2 2 11 2 2 11 31 31 8 8 4 2 8 AES AES AES 18 Floating point XMM and YMM instructions Instruction Move instructions MOVAPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVH/LPS/D MOVLHPS MOVHLPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D SHUFPS/D VSHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD BLENDPS/PD VBLENDPS/PD VBLENDPS/PD BLENDVPS/PD BLENDVPS/PD VBLENDVPS/PD Operands μops μops unfused domain Latency ReciComfused p015 p0 p1 p5 p23 p4 procal ments dothroughmain put x,x y,y x,m128 1 1 1 y,m256 m128,x 1 1 m256,y x,x x,m32/64 m32/64,x x,m64 m64,x x,x r32,x r32,y m128,x m256,y x,x,i x,m128,i y,y,y,i y, y,m256,i x,x,x/i y,y,y/i x,x,m y,y,m x,m,i y,m,i y,y,y,i y,y,m,i x,x,i x,m128,i y,y,i y,m256,i x,x,xmm0 x,m,xmm0 y,y,y,y 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 3 2 1 1 1 1 1 1 1+ 1 1 1 1+ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 x x x x x x x Page 166 1 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x 1 4 1 1 3 1 1 0.5 4 3 1 1 AVX 3 1 3 3 3 3 1 2 2 ~300 ~300 1 1 1 0.5 1 1 1 1 1 1 1 25 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 1 1 1 1 1 AVX 1 1 1+ 1 1 1 1+ 1 1+ 2 1+ 1 1 1 1+ 2 1 2 AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX 
AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX Sandy Bridge VBLENDVPS/PD MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D UNPCKH/LPS/D VUNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D Conversion CVTPD2PS CVTPD2PS VCVTPD2PS VCVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD VCVTPS2PD VCVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS VCVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ VCVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD CVTDQ2PD y,y,m,y x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 x,x x,m128 y,y y,m256 x,x x,m128 y,y,y y,y,m256 r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 m128,x,x m256,y,y 3 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 3 1 2 1 2 1 2 3 3 4 4 2 1 x,x x,m128 x,y x,m256 x,x x,m64 x,x x,m64 y,x y,m128 x,x x,m32 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m64 2 2 2 2 2 2 2 2 2 3 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 1 1 1 1 1 1 2 2 x x 1 1+ 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1+ 1 1 1 1 1 3 1 4 1 1 1+ 1 1 1 1+ 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Page 167 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1+ 1 1 1 1+ 4 1 4 1+ 3 1 3 1 1 1 1 4 1 3 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1+ 3 1 3 1+ 1 1 4 1 1 1 0.5 1 1 1 1 1 1 1 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 AVX SSE3 SSE3 AVX AVX AVX AVX AVX AVX SSE3 SSE3 AVX AVX SSE3 SSE3 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX Sandy Bridge VCVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ VCVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D VADDPS/D 
VSUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D ADDSUBPS/D VADDSUBPS/D VADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D VHADDPS/D VHSUBPS/D MULSS MULPS MULSS MULPS VMULPS VMULPS MULSD MULPD MULSD MULPD VMULPD VMULPD DIVSS DIVPS DIVSS DIVPS VDIVPS VDIVPS DIVSD DIVPD y,x y,m128 x,x x,m128 x,y x,m256 x,mm x,m64 mm,x mm,m128 x,mm x,m64 mm,x mm,m128 x,r32 x,m32 r32,x r32,m32 x,r32 x,m32 r32,x r32,m64 2 3 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 1 2 2 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x,x x,m32/64 x,x x,m128 y,y,y y,y,m256 x,x x,m128 y,y,y y,y,m256 x,x x,m128 1 1 1 1 1 1 1 1 1 1 3 4 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 2 2 y,y,y 3 3 1 2 y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x 4 1 1 1 1 1 1 1 1 1 1 3 4 1 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 2 5 1 4 1 5 1+ 4 1 1 4 1 1 1 4 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 Page 168 1 1 1 1 1 1 1 1 1 4 1 4 1 4 1 4 1 3 1 3 1 3 1+ 3 1 3 1+ 5 1 5 1+ 5 1 5 1+ 5 1 5 1+ 10-14 1 1 1 21-29 1+ 10-22 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1.5 1.5 1 1 1.5 1.5 1 1 AVX AVX AVX AVX 1 1 1 1 1 1 1 1 1 1 2 2 AVX AVX SSE3 SSE3 AVX AVX SSE3 SSE3 2 AVX 2 1 1 1 1 1 1 1 1 10-14 10-14 20-28 20-28 10-22 AVX AVX AVX AVX AVX AVX AVX Sandy Bridge DIVSD DIVPD VDIVPD VDIVPD RCPSS/PS RCPSS/PS VRCPPS VRCPPS CMPccSS/D CMPccPS/D x,m y,y,y y,y,m256 x,x x,m128 y,y y,m256 1 3 4 1 1 3 4 1 3 3 1 1 3 3 1 2 2 1 1 2 1 x,x 1 1 1 x,m128 y,y,y y,y,m256 x,x x,m32/64 x,x x,m32/64 x,x x,m128 y,y,y y,y,m256 x,x,i x,m128,i y,y,i y,m256,i x,x,i x,m128,i y,y,y,i y,m256,i x,x,i x,m128,i 2 1 2 2 2 1 1 1 1 1 1 1 2 1 2 4 6 4 6 3 4 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 4 5 4 5 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 1 1 3 4 1 2 3 4 1 1 3 4 1 1 3 3 1 1 3 3 1 1 3 3 x,x x,m128 1 1 1 1 1 1 y,y,y 1 1 1 y,y,m256 1 1 1 1 1 21-45 1+ 5 1 1 7 1+ 3 10-22 20-44 20-44 1 1 2 2 AVX AVX AVX AVX 1 CMPccSS/D CMPccPS/D VCMPccPS/D VCMPccPS/D COMISS/D UCOMISS/D 
COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D VMAXPS/D VMINPS/D VMAXPS/D VMINPS/D ROUNDSS/SD/PS/PD ROUNDSS/SD/PS/PD VROUNDSS/SD/PS/PD VROUNDSS/SD/PS/PD DPPS DPPS VDPPS VDPPS DPPD DPPD Math SQRTSS/PS SQRTSS/PS VSQRTPS VSQRTPS SQRTSD/PD SQRTSD/PD VSQRTPD VSQRTPD RSQRTSS/PS RSQRTSS/PS VRSQRTPS VRSQRTPS 1 1 1 3 1+ 2 1 3 1 3 1 3 1+ 3 1 3 1+ 12 1 12 1+ 1 1 1 9 1 1 1 10-14 1 1+ 1 1 10-21 1 21-43 1+ 1 1 5 1 7 1+ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 4 2 4 2 2 10-14 10-14 21-28 21-28 10-21 10-21 21-43 21-43 1 1 2 2 AVX AVX AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX AVX AVX AVX AVX Logic AND/ANDN/OR/XORPS/PD AND/ANDN/OR/XORPS/PD VAND/ANDN/OR/XORPS/PD VAND/ANDN/OR/XORPS/PD 1 1 1 1 1 AVX 1 AVX 1 1+ (V)XORPS/PD x/y,x/y,same 1 Other VZEROUPPER 4 VZEROALL 12 VZEROALL LDMXCSR STMXCSR VSTMXCSR FXSAVE FXRSTOR XSAVEOPT m32 m32 m32 m4096 m4096 m 0 0 0.25 2 1 11 20 3 3 3 3 3 3 130 116 100-161 1 1 1 1 1 1 1 9 3 1 1 68 72 1 1 60-500 AVX AVX, 32 bit AVX, 64 bit AVX

Intel Ivy Bridge
List of instruction timings and μop breakdown

Explanation of column headings:

Operands: i = immediate data, r = register, mm = 64-bit mmx register, x = 128-bit xmm register, (x)mm = mmx or xmm register, y = 256-bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.

μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under μops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write μop in the second clock cycle, but a read port can receive an address-calculation μop in the second clock cycle. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.

p015: The total number of μops going to port 0, 1 and 5.

p0: The number of μops going to port 0 (execution units).

p1: The number of μops going to port 1 (execution units).

p5: The number of μops going to port 5 (execution units).

p23: The number of μops going to port 2 or 3 (memory read or address calculation).

p4: The number of μops going to port 4 (memory write data).

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

The latencies and throughputs listed below for addition and multiplication using full size YMM registers are obtained only after a warm-up period of a thousand instructions or more. For shorter sequences of code, the latencies may be one or two clock cycles longer and the reciprocal throughputs may be double the listed values. There is no warm-up effect when vectors are 128 bits wide or less.
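The difference between the Latency and Reciprocal throughput columns is easiest to see with a small calculation. The sketch below, in Python, assumes an idealized model in which nothing but the listed figure limits execution: a dependency chain costs the latency per instruction, while independent instructions of the same kind cost the reciprocal throughput per instruction. The function names and the example figures (latency 3, reciprocal throughput 1) are assumptions made for illustration, and `chains_to_hide_latency` is a common rule of thumb rather than something stated in the column explanation.

```python
import math


def dependent_chain_cycles(count: int, latency: float) -> float:
    """Minimum core clock cycles when each instruction needs the previous result."""
    return count * latency


def independent_cycles(count: int, reciprocal_throughput: float) -> float:
    """Minimum core clock cycles when no instruction depends on another."""
    return count * reciprocal_throughput


def chains_to_hide_latency(latency: float, reciprocal_throughput: float) -> int:
    """Independent dependency chains needed to run at the throughput limit."""
    return math.ceil(latency / reciprocal_throughput)


# Illustrative figures (assumed, not read from the tables).
LATENCY = 3
RECIPROCAL_THROUGHPUT = 1

print(dependent_chain_cycles(1000, LATENCY))                   # 3000
print(independent_cycles(1000, RECIPROCAL_THROUGHPUT))         # 1000
print(chains_to_hide_latency(LATENCY, RECIPROCAL_THROUGHPUT))  # 3
```

Both results are lower bounds in the same sense as the table values: cache misses, misalignment, exceptions and competition from the other thread are not modelled.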
Integer instructions Instruction Operands μops μops unfused domain Latency ReciComfused procal ments dothroughp015 p0 p1 p5 p23 p4 main put Page 171 Ivy Bridge Move instructions MOV MOV MOV r,i r8/16,r8/16 r32/64,r32/64 1 1 1 1 1 1 x x x x x x x x x MOV MOV MOV r8/16,m8/16 r32/64,m32/64 r,m 1 1 1 1 x x x MOV MOV MOVNTI MOVSX MOVSXD MOVZX MOVZX m,r m,i m,r r,r r16,r8 r32/64,r8 1 1 2 1 1 1 MOVZX MOVSX MOVZX MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG r32/64,r16 r16,m8 r32/64,m r,r r,m r,r r,m 1 1 1 1 1 1 x x x x x x x x x 1 2 1 1 1 x x x x x x 2 2 3 7 2 2 3 x x x x x x x x x x 2 XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA LEA r16,m r32/64,m 3 1 1 2 2 3 19 1 3 2 9 18 1 3 2 1 LEA r32/64,m 1 1 r32 r64 m m 1 2 1 1 2 3 2 1 2 BSWAP BSWAP PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE r i m (E/R)SP r (E/R)SP m 1 1 1 1 1 x x x x x x x x x 2 x x x 8 10 1 3 2 1 x x x x x x x x x x x x x x 1 x 1 1 1 2 1 1 8 1 1 2 1 8 3 1 1 1 1 1 8 1 1 0.5 0.5 1 3 ~340 1 1 0-1 1 1 1 0.33 0.33 0.25 1 3 2 0.33 0.5 0.5 2 0.67 ~0.8 1 x 1 1 2 25 7 3 1 1 2-4 1 3 1 1 2 1 1 43 43 4 36 6 3 1 1 may be elimin. 64 b abs address may be elimin. 
implicit lock 1 1 1 1 1 1 8 0.5 0.5 1 18 9 1 1 1 0.5 1 1 1 Page 172 2 2 2 2 1 x 0.33 0.33 0.25 1 2 1 2 3 1 1 1 1 1 0-1 not 64 bit not 64 bit not 64 bit 1-2 components 3 components or RIP Ivy Bridge Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO POPCNT POPCNT CRC32 CRC32 r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m r8 r16 r32 r64 r,r r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r,m r16,m16,i r32,m32,i r64,m64,i r8 r16 r32 r64 r8 r16 r32 r64 r,r r,m r,r r,m 1 1 2 2 2 4 1 1 1 3 2 3 3 8 1 4 3 2 1 2 1 1 1 4 3 2 1 2 1 1 11 11 10 35-57 11 11 9 59134 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 3 1 1 1 1 2 3 3 8 1 4 3 2 1 2 1 1 1 3 2 1 1 2 1 1 11 11 10 x 11 11 9 x x x x x x x x x x x x 1 1 1 2 1 1 1 1 1 1 x x x x x x x x x x x x x x x x 1 x x x x x x x x x x x 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 2 2 7-8 1 1 1 6 4 4 2 20 3 4 4 3 3 4 3 3 3 19-22 20-24 19-27 29-94 20-23 20-24 19-26 28-103 Logic instructions Page 173 x x x x 1 1 1 1 x x x x x x 1 1 1 1 1 1 3 1 3 1 0.33 0.5 1 1 1 2 0.33 0.5 0.33 1 8 1 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 9 10 11 22-76 8 8 8-11 26-88 not 64 bit not 64 bit not 64 bit not 64 bit 0.33 0.5 1 1 1 SSE4.2 SSE4.2 SSE4.2 SSE4.2 Ivy Bridge AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL ROR ROL ROR ROL ROR ROL RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD r,r/i r,m m,r/i r,r/i m,r/i r,i m,i r,cl m,cl r,1 r,i m,i r,cl m,cl r,1 r,i m,i r,cl m,cl r,r,i m,r,i r,r,cl m,r,cl r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m Control transfer instructions JMP short/near JMP r JMP m Conditional jump short/near Fused arithmetic and 
branch J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL RET RET short short short near r m i 1 1 2 1 1 1 3 2 5 2 1 4 2 5 3 8 11 8 11 1 3 4 5 1 10 2 1 11 3 1 1 1 2 1 1 3 1 1 1 1 1 1 1 2 3 2 1 3 2 3 3 8 8 8 8 1 3 4 4 1 9 1 1 8 2 1 1 1 1 0 1 3 1 1 1 1 x x x x x x x x x x x 1 x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x x x 1 1 x x x x 1 1 2 1 2 1 1 1 1 2 1 2 1 1 2 5 2 1 2 1 2 1 2 1 5 1 2 1 1 1 1 2 1 1 1 3 1 x x 1 1 1 0.33 0.5 1 0.33 0.5 0.5 2 1 4 1 0.5 2 1 4 2 5 6 5 6 0.5 2 2 4 0.5 5 0.5 0.5 5 2 1 1 0.5 1 0.25 0.33 4 x x 1 1 1 1 1 1 1 1 1 0 0 0 0 2 2 2 1-2 1 1 1 0 1-2 2 7 11 2 2 3 2 3 2 7 11 1 1 1 1 2 x x Page 174 x x 6 1 1 1 x x x x x x x x x x x x x x x x x 1 1 2 x 1 x 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1-2 4-5 6 2 2 2 2 2 short form fast if no jump fast if no jump Ivy Bridge BOUND INTO r,m 15 4 13 4 x x x String instructions LODS REP LODS STOS REP STOS 3 ~5n 3 many 2 x x 1 x x REP STOS many MOVS REP MOVS 5 2n REP MOVS 4/16B SCAS REP SCAS CMPS REP CMPS 3 ~6n 5 ~8n 2 4 8 7 5 9 14 18 22 24 3 5 5 3 6 11 15 19 21 Synchronization instructions XADD LOCK XADD LOCK ADD CMPXCHG LOCK CMPXCHG CMPXCHG8B LOCK CMPXCHG8B CMPXCHG16B LOCK CMPXCHG16B Other NOP / Long NOP PAUSE ENTER ENTER LEAVE XGETBV CPUID RDTSC RDPMC RDRAND m,r m,r m,r m,r m,r m,r m,r m,r m,r a,0 a,b r 2 7 6 x 1 1 x 1 not 64 bit not 64 bit ~2n 1 1 n worst case best case 1/16B 2 x x x 2 1 4 n worst case best case 1/16B x x x 1 1 ~2n 3 x x x 2 4 ~2n 1 0 7 7 12 9 45+7b 3 2 8 37-82 21 35 13 12 x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 2 1 2 2 2 2 2 2 x x x 2 x x x 1 1 1 1 1 1 1 1 1 1 7 22 22 7 22 7 22 16 27 0.25 10 8 1 84+3b 6 9 XGETBV 100-340 x x x 1 27 39 104-117 RDRAND Floating point x87 instructions Instruction Operands μops μops unfused domain Latency ReciComfused procal ments dothroughp015 p0 p1 p5 p23 p4 main put Move instructions Page 175 Ivy Bridge FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 FLDPI FLDL2E etc. 
FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN r m32/64 m80 m80 r m32/m64 m80 m80 r m m m r AX m16 m16 m16 r m m r m r m r m r m r m m m m 1 1 4 43 1 1 7 243 1 1 3 3 1 2 2 3 2 2 3 2 1 1 143 90 1 1 2 40 1 1 1 1 2 1 2 1 2 1 1 1 1 2 3 2 2 2 2 1 2 28 41 17 1 1 1 1 1 1 1 1 1 1 2 3 2 2 2 2 1 2 28 25 17 1 21-78 25 17 1 1 1 0 6 7 7 1 2 3 1 1 2 3 0 1 1 1 1 2 2 3 2 1 2 1 1 1 1 2 1 3 5 45 1 4 5 1 1 1 1 1 1 1 1 x 1 1 1 1 1 2 x 2 1 1 1 1 1 2 4 1 1 1 1 1 3 5 1 10-24 1 Page 176 1 1 3 1 1 1 4 5 1 1 1 1 17 x x 1 x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 2 21 1 1 5 252 0.5 1 1 2 x x x x x x 2 2 1 1 3 1 1 1 167 162 1 1 1 1 8-18 8-18 1 1 1 1 1 1 2 2 21-26 27-50 22 2 1 2 12 19 11 49 10 10-23 47-106 49 10 8-17 47-106 SSE3 Ivy Bridge FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN 23-100 20-110 16-23 42 42 56 56 102 102 28-72 Other FNOP WAIT FNCLEX FNINIT 1 2 5 26 1 2 5 26 x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x 48-115 50-123 ~68 90-106 82 130 94-150 48-115 50-123 ~68 1 1 22 80 Integer MMX and XMM instructions Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQU MOVDQA MOVDQU MOVDQA MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW PACKUSDW PUNPCKH/LBW/WD/DQ PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PUNPCKH/LQDQ PMOVSX/ZXBW Operands μops μops unfused domain Latency ReciComfused procal ments dothroughp015 p0 p1 p5 p23 p4 main put r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm,(x)mm (x)mm,m64 m64, (x)mm x,x x, m128 m128, x x, m128 mm, x x,mm m64,mm m128,x x, m128 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 mm,mm 1 1 1 mm,m64 1 1 1 x,x 1 1 x x x,m128 x,x x,m 
(x)mm,(x)mm (x)mm,m x,x x, m128 x,x 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 x x x x x x x x x x x x x x x x 1 1 1 1 1 1 x x x 1 1 1 1 2 1 x x 1 1 1 x x 1 1 1 1 1 Page 177 1 x 1 1 1 3 1 3 1 3 3 0-1 3 3 3 1 1 ~360 ~360 3 1 1 1 0.5 0.33 0.5 1 0.25 0.5 1 0.5 1 0.33 1 1 0.5 1 1 1 1 eliminat. SSE3 SSE4.1 1 1 0.5 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 1 1 1 1 SSE4.1 SSE4.1 SSE4.1 Ivy Bridge PMOVSX/ZXBW PMOVSX/ZXBD PMOVSX/ZXBD PMOVSX/ZXBQ PMOVSX/ZXBQ PMOVSX/ZXWD PMOVSX/ZXWD PMOVSX/ZXWQ PMOVSX/ZXWQ PMOVSX/ZXDQ PMOVSX/ZXDQ PSHUFB PSHUFB PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR PALIGNR PBLENDVB PBLENDVB PBLENDW PBLENDW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRB PEXTRB PEXTRW PEXTRW PEXTRD PEXTRD PEXTRQ PEXTRQ PINSRB PINSRB PINSRW PINSRW PINSRD PINSRD PINSRQ PINSRQ x,m64 x,x x,m32 x,x x,m16 x,x x,m64 x,x x,m32 x,x x,m64 (x)mm,(x)mm (x)mm,m mm,mm,i mm,m64,i xmm,x,i x,m128,i x,x,i x, m128,i (x)mm,(x)mm,i (x)mm,m,i x,x,xmm0 x,m,xmm0 x,x,i x,m,i mm,mm x,x r32,(x)mm r32,x,i m8,x,i r32,(x)mm,i m16,(x)mm,i r32,x,i m32,x,i r64,x,i m64,x,i x,r32,i x,m8,i (x)mm,r32,i (x)mm,m16,i x,r32,i x,m32,i x,r64,i x,m64,i 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 2 2 3 1 2 4 10 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 4 1 2 1 1 1 2 1 2 1 2 1 2 1 1 1 1 1 (x)mm, (x)mm (x)mm,m (x)mm, (x)mm (x)mm,m64 (x)mm,(x)mm (x)mm,m x,x 1 1 3 4 1 1 1 1 1 3 3 1 1 1 1 x 1 1 1 1 1 x x x x x x x x x x x x x x x x x x x x x 1 1 x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x 1 x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 4 1 2 2 2 1 1 1 1 2 2 1 1 1 1 2 2 1 2 1 2 1 2 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 1 6 1 1 1 1 1 1 1 1 1 1 0.5 1 0.5 1 0.5 1 0.5 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 
SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 Arithmetic instructions PADD/SUB(U,S)B/W/D/Q PADD/SUB(U,S)B/W/D/Q PHADD/SUB(S)W/D PHADD/SUB(S)W/D PCMPEQ/GTB/W/D PCMPEQ/GTB/W/D PCMPEQQ Page 178 1 1 3 1 1 1 1 0.5 0.5 1.5 1.5 0.5 0.5 0.5 SSSE3 SSSE3 SSE4.1 Ivy Bridge PCMPEQQ PCMPGTQ PCMPGTQ PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULLD PMULLD PMULDQ PMULDQ PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW PAVGB/W PAVGB/W PMIN/MAXSB PMIN/MAXSB PMIN/MAXUB PMIN/MAXUB PMIN/MAXSW PMIN/MAXSW PMIN/MAXUW PMIN/MAXUW PMIN/MAXU/SD PMIN/MAXU/SD PHMINPOSUW PHMINPOSUW PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW MPSADBW MPSADBW x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x x,m x,x x,m128 x,x x,m128 (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m (x)mm,(x)mm (x)mm,m x,x,i x,m,i 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 Logic instructions PAND(N) POR PXOR PAND(N) POR PXOR PTEST PTEST PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ (x)mm,(x)mm (x)mm,m x,x x,m128 mm,mm/i mm,m64 xmm,i x,x x,m128 x,i 1 1 2 3 1 1 1 2 3 1 1 1 2 2 1 1 1 2 2 1 x x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 x x x x x x x x x x x x x x x x x x x x x x x x 1 1 x x 1 1 1 1 1 1 1 String instructions Page 179 1 1 1 1 1 1 1 1 1 1 1 1 5 1 x x x x 1 1 1 1 1 x x x x 1 1 1 1 5 1 1 1 1 1 x x x x x x x x 6 1 1 1 1 1 1 1 x x x x x x 1 2 1 1 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 1 0.5 0.5 0.5 0.5 1 1 1 1 0.33 0.5 1 1 1 1 1 1 1 0.5 SSE4.1 SSE4.2 SSE4.2 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 Ivy 
Bridge PCMPESTRI PCMPESTRI PCMPESTRM PCMPESTRM PCMPISTRI PCMPISTRI PCMPISTRM PCMPISTRM Encryption instructions PCLMULQDQ PCLMULQDQ AESDEC, AESDECLAST, AESENC, AESENCLAST x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i 8 8 8 8 3 4 3 4 8 7 8 7 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 4 3 4 3 x,x,i x,m,i 18 18 18 17 x x x x x x x,x 2 2 x x 1 x,m x,x x,m x,x,i x,m,i 3 2 3 11 11 2 2 2 11 10 x x 1 2 2 x x 31 31 4 3 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 14 8 8 CLMUL CLMUL 4 1 AES 1 2 2 8 7 AES AES AES AES AES 1 12 1 4 4 4 4 3 1 3 11 1 1 AESDEC, AESDECLAST, AESENC, AESENCLAST AESIMC AESIMC AESKEYGENASSIST AESKEYGENASSIST Other EMMS x x x x 1 14 1 10 1 18 Floating point XMM and YMM instructions Instruction Move instructions MOVAPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVH/LPS/D MOVLHPS MOVHLPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D Operands μops μops unfused domain Latency ReciComfused procal ments dothroughp015 p0 p1 p5 p23 p4 main put x,x y,y x,m128 1 1 1 y,m256 m128,x 1 1 m256,y x,x x,m32/64 m32/64,x x,m64 m64,x x,x r32,x r32,y m128,x m256,y x,x,i 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1+ 1 1 1 1+ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Page 180 1 1 1+ 0-1 0-1 3 ≤1 ≤1 0.5 elimin. elimin. 
4 3 1 1 AVX 4 1 3 3 4 3 1 2 2 ~380 ~380 1 2 1 0.5 1 1 1 1 1 1 1 2 1 AVX AVX Ivy Bridge SHUFPS/D VSHUFPS/D VSHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 BLENDPS/PD BLENDPS/PD VBLENDPS/PD VBLENDPS/PD BLENDVPS/PD BLENDVPS/PD VBLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VMOVDDUP VMOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP VMOVSH/LDUP VMOVSH/LDUP UNPCKH/LPS/D UNPCKH/LPS/D VUNPCKH/LPS/D VUNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D Conversion CVTPD2PS CVTPD2PS VCVTPD2PS VCVTPD2PS x,m128,i y,y,y,i y, y,m256,i x,x,x/i y,y,y/i x,x,m y,y,m x,m,i y,m,i y,y,y,i y,y,m,i x,x,i x,m128,i y,y,i y,m256,i x,x,xmm0 x,m,xmm0 y,y,y,y y,y,m,y x,x x,m64 y,y y,m256 x,m32 y,m32 y,m64 y,m128 x,x x,m128 y,y y,m256 x,x x,m128 y,y,y y,y,m256 r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i x,x,m128 y,y,m256 m128,x,x m256,y,y 2 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 3 2 3 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 3 1 2 1 2 1 2 3 3 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 x,x x,m128 x,y x,m256 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 x x x x x x x x 1 x x x x x x x x 1 1 1+ 1 1 1 1+ 1 1+ 2 1+ 1 1 1 1+ 2 1 2 1+ 1 3 1 3 4 5 5 5 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1+ 1 1 1 1 1 1 1 1 1 1 1 2 2 1 0 1 1 1 1 2 2 2 2 1 1 1 1 x x 1 1+ x x 1 1 1 x x x x x x Page 181 x x 1 1 1 1 1 1 1 1 1 1 1 1+ 2 1 1 1 1 2 4 1 1 1 1 1+ 1 1 1 1+ 2 4 4 5 4 1 4 1+ 1 1 1 1 1 1 1 1 1 1 1 0.5 0.5 0.5 1 1 1 1 1 1 0.5 1 1 0.5 1 1 1 1 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE3 SSE3 AVX AVX AVX AVX AVX AVX SSE3 SSE3 AVX AVX SSE3 SSE3 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX AVX AVX AVX AVX 1 1 1 1 AVX AVX Ivy Bridge CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD VCVTPS2PD VCVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS VCVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ 
CVT(T) PS2DQ VCVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD CVTDQ2PD VCVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ VCVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI VCVTPS2PH VCVTPS2PH VCVTPH2PS VCVTPH2PS x,x x,m64 x,x x,m64 y,x y,m128 x,x x,m32 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m64 y,x y,m128 x,x x,m128 x,y x,m256 x,mm x,m64 mm,x mm,m128 x,mm x,m64 mm,x mm,m128 x,r32 x,m32 r32,x r32,m32 x,r32 x,m32 r32,x r32,m64 x,v,i m,v,i v,x v,m 2 2 2 2 2 3 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 3 3 2 2 2 2 2 1 2 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 2 2 1 2 2 2 1 2 2 3 2 2 1 Arithmetic ADDSS/D SUBSS/D ADDSS/D SUBSS/D ADDPS/D SUBPS/D ADDPS/D SUBPS/D VADDPS/D VSUBPS/D VADDPS/D VSUBPS/D ADDSUBPS/D x,x x,m32/64 x,x x,m128 y,y,y y,y,m256 x,x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Page 182 1 1 1 4 1 1 1 1 1 1 4 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1+ 3 1 3 1+ 1 1 1 1 1 1 1 1 4 1 5 1 4 1 5 1+ 1 1 1 1 1 1 1 AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX 4 1 1 4 1 1 1 1 1 1 4 1 4 1 4 1 4 1 1 4 1 4 1 1 10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 1 3 1 3 1 3 1+ 3 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1 1 1 1 1 F16C F16C F16C F16C 1 1 1 1 1 1 1 AVX AVX SSE3 Ivy Bridge ADDSUBPS/D VADDSUBPS/D VADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D VHADDPS/D VHSUBPS/D VHADDPS/D VHSUBPS/D MULSS MULPS MULSS MULPS VMULPS VMULPS MULSD MULPD MULSD MULPD VMULPD VMULPD DIVSS DIVPS DIVSS DIVPS VDIVPS VDIVPS DIVSD DIVPD DIVSD DIVPD VDIVPD VDIVPD RCPSS/PS RCPSS/PS VRCPPS VRCPPS CMPccSS/D CMPccPS/D x,m128 y,y,y y,y,m256 x,x x,m128 1 1 1 3 4 1 1 1 3 3 1 1 1 1 1 2 2 y,y,y 3 3 1 2 y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m y,y,y y,y,m256 x,x x,m128 y,y y,m256 4 1 1 1 1 1 1 1 1 1 1 3 4 1 1 3 4 1 1 3 4 3 1 1 1 1 1 1 1 1 1 
1 3 3 1 1 3 3 1 1 3 3 1 2 x,x 1 1 1 x,m128 y,y,y y,y,m256 x,x x,m32/64 x,x x,m32/64 x,x x,m128 y,y,y y,y,m256 x,x,i x,m128,i y,y,i y,m256,i x,x,i x,m128,i y,y,y,i y,m256,i x,x,i x,m128,i 2 1 2 2 2 1 1 1 1 1 1 1 2 1 2 4 6 4 6 3 4 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 4 5 4 5 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 2 2 1 3 1+ 5 1 5 1+ 5 1 5 1+ 5 1 5 1+ 10-13 1 1 1 19-21 1+ 10-20 1 1 1 20-35 1+ 5 1 1 1 7 1+ 3 1 1 1 2 2 SSE3 AVX AVX SSE3 SSE3 2 AVX 2 1 1 1 1 1 1 1 1 7 7 14 14 8-14 8-14 16-28 16-28 1 1 2 2 AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX 1 CMPccSS/D CMPccPS/D VCMPccPS/D VCMPccPS/D COMISS/D UCOMISS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXSS/D MINSS/D MAXPS/D MINPS/D MAXPS/D MINPS/D VMAXPS/D VMINPS/D VMAXPS/D VMINPS/D ROUNDSS/SD/PS/PD ROUNDSS/SD/PS/PD VROUNDSS/SD/PS/PD VROUNDSS/SD/PS/PD DPPS DPPS VDPPS VDPPS DPPD DPPD 1 1 1 1 1 1 1 1 Page 183 1 3 1+ 1 3 1 3 1 3 1+ 3 1 3 1+ 1 2 1 2 1 1 12 1 12 1+ 9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 4 2 4 1 1 AVX AVX AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 Ivy Bridge Math SQRTSS/PS SQRTSS/PS VSQRTPS VSQRTPS SQRTSD/PD SQRTSD/PD VSQRTPD VSQRTPD RSQRTSS/PS RSQRTSS/PS VRSQRTPS VRSQRTPS x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 1 1 3 4 1 1 3 4 1 1 3 4 1 1 3 3 1 1 3 3 1 1 3 3 x,x x,m128 1 1 y,y,y 1 1 2 2 1 1 2 2 1 1 2 2 11 1 1 1 19 1+ 16 1 1 1 28 1+ 5 1 1 1 1+ 7 1 1 1 1 1 1 1 1 y,y,m256 1 1 1 m32 m32 m4096 m4096 m 4 12 20 3 3 130 116 100-161 0 2 2 2 2 7 7 14 14 8-14 8-14 16-28 16-28 1 1 2 2 AVX AVX AVX AVX AVX AVX Logic AND/ANDN/OR/XORPS/PD AND/ANDN/OR/XORPS/PD VAND/ANDN/OR/XORPS/ PD VAND/ANDN/OR/XORPS/ PD Other VZEROUPPER VZEROALL VZEROALL LDMXCSR STMXCSR FXSAVE FXRSTOR XSAVEOPT 1 1 Page 184 1 1 1 1 1 1 1 AVX 1 AVX 1 11 9 3 1 66 68 AVX 32 bit 64 bit 1+ 1 1 1 6 7 60-500 Haswell Intel Haswell List of instruction timings and μop breakdown Explanation of column headings: Instruction: Name of instruction. 
Multiple names mean that these instructions have the same data. Instructions with or without the V name prefix behave the same unless otherwise noted.

Operands: i = immediate data, r = register, mm = 64-bit mmx register, x = 128-bit xmm register, (x)mm = mmx or xmm register, y = 256-bit ymm register, v = any vector register (mmx, xmm, ymm). same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename and allocate stages in the pipeline. Fused μops count as one.

μops unfused domain: The total number of μops for all execution ports. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if this number is higher than the number under fused domain. Some operations are not counted here if they do not go to any execution port or if the counters are inaccurate.

μops each port: The number of μops for each execution port. p0 means a μop to execution port 0. p01 means a μop that can go to either port 0 or port 1. p0 p1 means two μops going to port 0 and 1, respectively.
Port 0: Integer, f.p. and vector ALU, mul, div, branch
Port 1: Integer, f.p. and vector ALU
Port 2: Load
Port 3: Load
Port 4: Store
Port 5: Integer and vector ALU
Port 6: Integer ALU, branch
Port 7: Store address

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
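The notation used in the "μops each port" column is regular enough to be read mechanically. The following Python sketch is only an illustration and not part of the measured data. The function names parse_ports and port_bound are hypothetical. The bound it computes is deliberately crude: it looks at each μop group in isolation and ignores both competition between groups for shared ports and any front-end limits.

```python
import re
from fractions import Fraction

# One token of the port notation: an optional repeat count, the letter p,
# and one digit per port the μop may be issued to (e.g. "2p0156").
_TOKEN = re.compile(r"(\d*)p(\d+)")


def parse_ports(spec):
    """Parse a 'μops each port' entry such as '2p0156 2p237 p4'.

    Returns a list of (count, ports) pairs, where ports is a frozenset
    of port numbers. Entries beginning with 'none' give an empty list.
    """
    spec = spec.strip()
    if spec.startswith("none"):
        return []
    groups = []
    for token in spec.split():
        m = _TOKEN.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognized token: {token!r}")
        count = int(m.group(1)) if m.group(1) else 1
        ports = frozenset(int(d) for d in m.group(2))
        groups.append((count, ports))
    return groups


def port_bound(spec):
    """Crude lower bound on reciprocal throughput from port pressure alone.

    A group of k μops that can use any of n ports needs at least k/n cycles.
    The bound is the largest such value over all groups (an assumption of
    this sketch, not a rule stated in the tables).
    """
    return max(
        (Fraction(count, len(ports)) for count, ports in parse_ports(spec)),
        default=Fraction(0),
    )
```

For example, port_bound("p0156") gives 1/4 and port_bound("p237 p4") gives 1. Measured reciprocal throughputs in the tables can be higher than this bound, because other bottlenecks may apply.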
Integer instructions Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV Operands Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments r,i r8/16,r8/16 r32/64,r32/64 r8l,m r8h,m r16,m r32/64,m 1 1 1 1 1 1 1 1 1 1 2 1 2 1 p0156 p0156 p0156 p23 p0156 p23 p23 p0156 p23 m,r 1 2 p237 p4 Page 185 2 0.25 0.25 0.25 0.5 0.5 0.5 0.5 3 1 1 0-1 may be elim. all addressing modes Haswell MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA m,i m,r r,r 1 2 1 2 2 1 p237 p4 p23 p4 p0156 r16,m8 r,m 1 1 2 1 p23 p0156 p23 r,r r,m r,r r,m 2 3 3 8 3 2 2 3 3 4 19 1 3 3 9 18 1 3 2 2p0156 2p0156 p23 3p0156 r16,m 2 3 3 8 3 1 1 2 2 3 11 1 3 2 9 18 1 3 2 LEA r32/64,m 1 LEA r32/64,m LEA BSWAP BSWAP MOVBE MOVBE MOVBE MOVBE MOVBE MOVBE PREFETCHNTA/ 0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ~400 1 1 1 0.25 0.5 0.5 2 2 21 7 3 0.5 1 1 implicit lock p23 p23 2p0156 2p237 p4 2 p06 3p0156 p1 p0156 1 1 4 2 1 1 1 1 1 8 0.5 4 1 18 9 1 1 1 1 p15 1 0.5 1 1 p1 3 1 r32/64,m 1 1 p1 r32 r64 r16,m16 r32,m32 r64,m64 m16,r16 m32,r32 m64,r64 1 2 3 2 3 2 2 3 1 2 3 2 3 3 3 4 p15 p06 p15 2p0156 p23 p15 p23 2p0156 p23 p06 p237 p4 p15 p237 p4 p06 p15 p237 p4 m 1 1 p23 0.5 2 3 2 2 2 none counted p23 p4 p23 p4 4 33 5 1 1 1 2 p0156 p0156 p23 r i m stack pointer r stack pointer m r,r/i r,m p237 p4 p237 p4 p4 2p237 p0156 p237 p4 p1 p4 p237 p06 Page 186 1 1 2 1 all other combinations 0.5 1 0.5 0.5 0.5 1 1 1 0.25 0.5 not 64 bit not 64 bit not 64 bit 16 or 32 bit address size 1 or 2 components in address 3 components in address rip relative address MOVBE MOVBE MOVBE MOVBE MOVBE MOVBE Haswell ADD SUB m,r/i 2 4 2p0156 2p237 p4 6 1 ADC SBB ADC SBB ADC SBB r,r/i r,m m,r/i 2 2 4 2 3 6 2p0156 2p0156 p23 2 3p0156 2p237 p4 7 1 1 2 CMP CMP INC DEC NEG NOT INC DEC NOT NEG AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL 
MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MULX MULX MULX MULX DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO POPCNT POPCNT r,r/i m,r/i r 1 1 1 1 2 1 p0156 p0156 p23 p0156 1 1 1 0.25 0.5 0.25 m m 3 2 2 2 3 3 8 1 4 3 2 1 4 3 2 1 1 2 1 1 2 1 1 3 3 2 2 9 11 10 36 9 10 9 59 1 1 1 2 1 1 1 1 4 4 2 2 3 3 8 1 4 3 2 2 5 4 3 1 2 2 1 1 3 2 2 3 4 2 3 9 11 10 36 9 10 9 59 1 1 1 2 1 1 1 2 p0156 2p237 p4 p0156 2p237 p4 p1 p0156 p1 p56 p1 2p0156 p1 2p0156 p0 p1 p5 p6 p1 p1 p0156 p1 p0156 p1 p6 p1 p23 p1 3p0156 p23 p1 2p0156 p23 p1 p6 p23 p1 p1 p23 p1 p0156 p1 p1 p1 p0156 p23 p1 p23 p1 p23 p1 2p056 p1 2p056 p23 p1 p6 p1 p6 p23 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0156 p0156 p0156 p0156 p06 p06 p1 p1 p23 6 6 4 6 4 4 21 3 4 4 3 1 1 r8 r16 r32 r64 m8 m16 m32 m64 r,r r,m r16,r16,i r32,r32,i r64,r64,i r16,m16,i r32,m32,i r64,m64,i r32,r32,r32 r32,r32,m32 r64,r64,r64 r64,r64,m64 r8 r16 r32 r64 r8 r16 r32 r64 r,r r,m Page 187 3 4 3 3 4 4 22-25 23-26 22-29 32-96 23-26 23-26 22-29 39-103 1 1 1 1 1 1 3 8 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9-11 21-74 8 8 8-11 24-81 1 1 not 64 bit not 64 bit not 64 bit not 64 bit not 64 bit AVX2 AVX2 AVX2 AVX2 SSE4.2 SSE4.2 Haswell CRC32 CRC32 r,r r,m 1 1 1 2 p1 p1 p23 3 1 1 Logic instructions AND OR XOR AND OR XOR AND OR XOR r,r/i r,m m,r/i 1 1 2 1 2 4 p0156 p0156 p23 1 2p0156 2p237 p4 6 0.25 0.5 1 r,r/i m,r/i r,i m,i r,cl m,cl r,1 r,i m,i r,cl m,cl r,1 m,1 r,i m,i r,cl m,cl r,r,i m,r,i r,r,cl r,r,cl m,r,cl r,r,r r,m,r r,r,i r,m,i r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 1 1 1 3 3 5 2 1 4 3 5 3 4 8 11 8 11 1 3 4 4 5 1 2 1 2 1 10 2 1 10 3 1 1 1 2 1 1 1 3 1 1 1 1 1 2 1 4 3 6 2 1 5 3 6 3 6 8 11 8 11 1 5 4 4 7 1 2 1 2 1 10 2 1 11 4 1 2 1 3 0 1 1 3 1 2 1 2 p0156 p0156 p23 p06 2p06 p237 p4 3p06 3p06 2p23 p4 2p06 p06 2p06 2p237 p4 3p06 1 2p06 p0156 2 p0156 6 p0156 6 p1 3 p0156 p0156 3 4 p06 p06 p23 p06 p06 p23 p06 1 TEST TEST SHR SHL SAR SHR SHL SAR SHR SHL SAR 
SHR SHL SAR ROR ROL ROR ROL ROR ROL ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHRD SHLD SHRD SHLD SHLD SHRD SHRD SHLD SHLX SHRX SARX SHLX SHRX SARX RORX RORX BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD LZCNT LZCNT TZCNT TZCNT r,r r,m r,r r,m p06 p23 p06 2p06 p23 p4 p1 p1 p23 p06 p06 p237 p4 none p0156 p0156 p15 p6 p1 p1 p23 p1 p1 p23 Page 188 1 2 1 1 2 1 1 1 3 1 0.25 0.5 0.5 2 2 4 1 0.5 2 2 4 2 3 6 6 6 6 1 2 2 2 4 0.5 0.5 0.5 0.5 0.5 5 0.5 0.5 5 2 1 1 0.5 1 0.25 0.25 SSE4.2 SSE4.2 short form BMI2 BMI2 BMI2 BMI2 1 3 3 4 1 1 1 1 LZCNT LZCNT BMI1 BMI1 Haswell ANDN ANDN BLSI BLSMSK BLSR BLSI BLSMSK BLSR BEXTR BEXTR BZHI BZHI PDEP PDEP PEXT PEXT r,r,r r,r,m r,r 1 1 1 1 2 1 p15 p15 p23 p15 r,m 1 2 p15 p23 r,r,r r,m,r r,r,r r,m,r r,r,r r,r,m r,r,r r,r,m 2 3 1 1 1 1 1 1 2 3 1 2 1 2 1 2 2p0156 2p0156 p23 p15 p15 p23 p1 p1 p23 p1 p1 p23 Control transfer instructions JMP short/near JMP r JMP m Conditional jump short/near 1 1 1 1 1 1 2 1 p6 p6 p23 p6 p6 1-2 2 2 1-2 Conditional jump 1 1 p06 0.5-1 1 1 p6 1-2 1 1 p06 0.5-1 2 7 11 2 2 3 1 3 15 4 2 7 11 3 3 4 2 4 15 4 p0156 p6 0.5-2 5 6 2 2 3 1 2 8 5 3 2 5n+12 3 <2n 2.6/32B 3 2 2p0156 p23 p0156 p23 3 p23 p0156 p4 MOVS REP MOVS REP MOVS 5 ~2n 4/32B 5 SCAS REP SCAS 3 ≥6n 3 Fused arithmetic and branch Fused arithmetic and branch J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL RET RET BOUND INTO String instructions LODSB/W LODSD/Q REP LODS STOS REP STOS REP STOS short/near short short short near r m i r,m p237 p4 p6 p237 p4 p6 2p237 p4 p6 p237 p6 p23 2p6 p015 2p23 p4 2p0156 p23 2p0156 Page 189 1 1 1 2 1 3 3 0.5 0.5 0.5 BMI1 BMI1 BMI1 0.5 BMI1 0.5 1 0.5 0.5 1 1 1 1 BMI1 BMI1 BMI2 BMI2 BMI2 BMI2 BMI2 BMI2 1 1 ~2n 1 ~0.5n 1/32B 4 ~1.5 n 1/32B 1 ≥2n predicted taken predicted not taken predicted taken predicted not taken not 64 bit not 64 bit worst case best case aligned by 32 worst case best case aligned by 32 Haswell CMPS REP CMPS 5 ≥8n 5 Synchronization instructions XADD m,r 
LOCK XADD m,r LOCK ADD m,r CMPXCHG m,r LOCK CMPXCHG m,r CMPXCHG8B m,r LOCK CMPXCHG8B m,r CMPXCHG16B m,r LOCK CMPXCHG16B m,r 4 9 8 5 10 15 19 22 24 5 9 8 6 10 15 19 22 24 1 1 0 0 Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE XGETBV RDTSC RDPMC RDRAND a,0 a,b r 2p23 3p0156 5 5 12 12 ~14+7b ~45+7b 3 3 8 8 15 15 34 34 17 17 4 ≥2n 7 19 19 8 19 9 19 15 25 none none 0.25 0.25 p05 3p6 9 8 ~87+2b 2p0156 p23 6 9 24 37 ~320 p23 16p0156 XGETBV RDRAND Floating point x87 instructions Instruction Operands Move instructions FLD r FLD m32/64 FLD m80 FBLD m80 FST(P) r FST(P) m32/m64 FSTP m80 FBSTP m80 FXCH r FILD m FIST(P) m FISTTP m FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc r FNSTSW AX FNSTSW m16 FLDCW m16 Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments 1 1 4 43 1 1 7 238 2 1 3 3 1 2 2 3 2 2 3 1 1 4 43 1 2 7 226 0 2 3 3 1 2 2 3 2 3 3 p01 p23 2p01 2p23 p01 p4 p237 3p0156 2p23 2p4 none p01 p23 p1 p23 p4 p1 p23 p4 p01 2p01 2p01 2p0 p5 p0 p0156 p0 p4 p237 p01 p23 p6 Page 190 1 3 4 47 1 4 1 0 6 7 7 2 6 7 0.5 0.5 2 22 0.5 1 5 265 0.5 1 1 2 1 2 2 2 1 1 2 SSE3 Haswell FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP WAIT FNCLEX FNINIT m16 3 1 1 147 90 p237 p4 p6 p01 p01 0 r m m 2 1 1 147 90 r 1 1 p1 3 m r m r m 1 1 1 1 1 1 1 1 1 2 3 2 2 2 2 1 2 28 41 17 2 1 2 1 2 1 1 1 2 2 3 3 3 3 3 1 2 28 41 17 p1 p23 p0 p0 p23 p0 p0 p23 p0 p0 p1 p1 p23 2p01 3p01 2p1 p23 p0 p1 p23 p0 p1 p23 2p1 p23 p1 2p1 r m r m m m m 25-75 17 1 71-100 110 70-120 58-89 55-417 55-228 110-121 78-160 1 2 5 26 10-24 1 1 19 27 11 17 1 1 2 5 26 5 p0 p01 p01 p0156 Integer MMX and XMM instructions Page 191 49-125 15 10-23 47-106 112 52-123 63-68 58-680 58-360 130 
96-156 1 0.5 0.5 150 164 1 1 1 1 8-18 8-18 1 1 1 1 1 1.5 2 2 2 1 2 13 17 23 11 8-17 0.5 1 22 83 Haswell Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA/U MOVDQA/U MOVDQA/U VMOVDQA/U VMOVDQA/U VMOVDQA/U LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ VMOVNTDQ MOVNTDQA VMOVNTDQA PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW PACKUSDW PUNPCKH/L BW/WD/DQ PUNPCKH/L BW/WD/DQ PUNPCKH/L QDQ PUNPCKH/L QDQ Operands Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 r64,(x)mm (x)mm,r64 (x)mm,(x)mm (x)mm,m64 m64, (x)mm x,x x, m128 m128, x 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 p0 p237 p4 p5 p23 p0 p5 p015 p23 p237 p4 p015 p23 p237 p4 1 3 1 3 1 1 1 3 3 0-1 3 3 1 1 1 0.5 1 1 0.33 0.5 1 0.33 0.5 1 y,y y,m256 m256,y x, m128 mm, x x,mm m64,mm m128,x m256,y x, m128 y,m256 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 2 2 2 1 1 p015 p23 p237 p4 p23 p01 p5 p015 p237 p4 p237 p4 p237 p4 p23 p23 0-1 3 4 3 1 1 ~400 ~400 ~400 3 3 0.33 0.5 1 0.5 1 0.33 1 1 1 0.5 0.5 mm,mm 3 3 p5 2 2 mm,m64 3 3 p23 2p5 x,x / y,y,y 1 1 p5 1 1 x,m / y,y,m x,x / y,y,y x,m / y,y,m 1 1 1 2 1 2 p23 p5 p5 p23 p5 1 1 1 1 v,v / v,v,v 1 1 p5 1 1 v,m / v,v,m 1 2 p23 p5 x,x / y,y,y 1 1 p5 x,m / y,y,m 2 2 p23 p5 PMOVSX/ZX BW BD BQ DW DQ x,x 1 1 p5 PMOVSX/ZX BW BD BQ DW DQ x,m 1 2 p23 p5 VPMOVSX/ZX BW BD BQ DW DQ y,x 1 1 p5 Page 192 may be elim. AVX may be elim. 
AVX AVX SSE3 AVX2 SSE4.1 AVX2 2 SSE4.1 SSE4.1 1 1 1 1 1 3 1 SSE4.1 1 SSE4.1 1 AVX2 Haswell VPMOVSX/ZX BW BD BQ DW DQ PINSRB PINSRB PINSRW PINSRW PINSRD/Q PINSRD/Q VINSERTI128 VINSERTI128 y,m v,v / v,v,v v,m / v,v,m mm,mm,i mm,m64,i v,v,i v,m,i v,v,i v,m,i v,v,i / v,v,v,i v,m,i / v,v,m,i x,x,xmm0 x,m,xmm0 v,v,v,v v,v,m,v x,x,i / v,v,v,i x,m,i / v,v,m,i v,v,v,i v,v,m,i y,y,y y,y,m y,y,i y,m,i y,y,y,i y,y,m,i mm,mm x,x v,v,m m,v,v r,v r32,x,i m8,x,i x,y,i m,y,i x,r32,i x,m8,i (x)mm,r32,i (x)mm,m16,i x,r32,i x,m32,i y,y,x,i y,y,m,i 2 1 2 1 2 1 2 1 2 1 2 2 3 2 3 1 2 1 2 1 1 1 2 1 2 4 10 3 4 1 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 1 2 1 2 2 3 2 3 1 2 1 2 1 2 1 2 1 2 4 10 3 4 1 2 3 1 2 2 2 2 2 2 2 1 2 p5 p23 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 2p5 2p5 p23 2p5 2p5 p23 p5 p23 p5 p015 p015 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p0 p4 2p23 4p04 2p56 4p23 p23 2p5 p0 p1 p4 p23 p0 p0 p5 p23 p4 p5 p5 p23 p4 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p015 p23 VPBROADCAST B/W/D/Q x,x 1 1 VPBROADCAST B/W x,m8/16 3 VPBROADCAST D/Q x,m32/64 VPBROADCAST B/W/D/Q VPBROADCAST B/W PSHUFB PSHUFB PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR PALIGNR PBLENDVB PBLENDVB VPBLENDVB VPBLENDVB PBLENDW PBLENDW VPBLENDD VPBLENDD VPERMD VPERMD VPERMQ VPERMQ VPERM2I128 VPERM2I128 MASKMOVQ MASKMOVDQU VPMASKMOVD/Q VPMASKMOVD/Q PMOVMSKB PEXTRB/W/D/Q PEXTRB/W/D/Q VEXTRACTI128 VEXTRACTI128 3 4 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 0.33 0.5 1 1 1 1 1 1 1 6 2 1 1 1 1 1 1 2 1 2 1 2 1 1 0.5 p5 1 1 AVX2 3 p01 p23 p5 5 1 AVX2 1 1 p23 4 0.5 AVX2 y,x 1 1 p5 3 1 AVX2 y,m8/16 3 3 p01 p23 p5 7 1 AVX2 Page 193 1 1 1 1 1 2 2 1 1 3 3 3 13-413 14-438 4 13-14 3 2 3 4 2 2 2 AVX2 SSSE3 SSSE3 SSSE3 SSSE3 SSE4.1 SSE4.1 AVX2 AVX2 SSE4.1 SSE4.1 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 SSE4.1 SSE4.1 AVX2 AVX2 SSE4.1 SSE4.1 SSE4.1 SSE4.1 AVX2 AVX2 Haswell VPBROADCAST D/Q y,m32/64 y,m128 x,[r+s*x],x y,[r+s*y],y x,[r+s*x],x x,[r+s*y],x x,[r+s*x],x y,[r+s*x],y x,[r+s*x],x y,[r+s*y],y 1 1 20 34 15 22 12 20 14 22 1 1 20 
34 15 22 12 20 14 22 p23 p23 5 3 0.5 0.5 9 12 8 7 7 9 7 9 PADD/SUB(S,US) B/W/D/Q v,v / v,v,v 1 1 p15 1 0.5 PADD/SUB(S,US) B/W/D/Q v,m / v,v,m 1 2 p15 p23 v,v / v,v,v 3 3 p1 2p5 v,m / v,v,m 4 4 p1 2p5 p23 v,v / v,v,v 1 1 p15 v,m / v,v,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m 1 1 1 1 1 2 1 2 1 2 p15 p23 p15 p15 p23 p0 p0 p23 v,v / v,v,v 1 1 p0 v,m / v,v,m v,v / v,v,v v,m / v,v,m x,x / y,y,y x,m / y,y,m x,x / y,y,y x,m / y,y,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m 1 1 1 2 3 1 1 1 1 1 1 1 1 1 1 2 1 2 2 3 1 2 1 2 1 2 1 2 1 2 p0 p23 p0 p0 p23 2p0 2p0 p23 p0 p0 p23 p0 p0 p23 p0 p0 p23 p0 p0 p23 p15 p15 p23 x,x / y,y,y 1 1 p15 x,m / y,y,m 1 2 p15 p23 VBROADCASTI128 VPGATHERDD VPGATHERDD VPGATHERQD VPGATHERQD VPGATHERDQ VPGATHERDQ VPGATHERQQ VPGATHERQQ AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 Arithmetic instructions PHADD(S)W/D PHSUB(S)W/D PHADD(S)W/D PHSUB(S)W/D PCMPEQB/W/D PCMPGTB/W/D PCMPEQB/W/D PCMPGTB/W/D PCMPEQQ PCMPEQQ PCMPGTQ PCMPGTQ PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULLD PMULLD PMULDQ PMULDQ PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW PAVGB/W PAVGB/W PMIN/PMAX SB/SW/SD UB/UW/UD PMIN/PMAX SB/SW/SD UB/UW/UD Page 194 0.5 3 1 1 5 5 5 10 5 5 5 5 1 1 2 SSSE3 2 SSSE3 0.5 0.5 0.5 0.5 1 1 SSE4.1 SSE4.1 SSE4.2 SSE4.2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 0.5 0.5 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 0.5 SSE4.1 0.5 SSE4.1 Haswell PHMINPOSUW PHMINPOSUW PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW MPSADBW MPSADBW Logic instructions PAND PANDN POR PXOR PAND PANDN POR PXOR PTEST PTEST PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q VPSLLVD/Q VPSRAVD VPSRLVD/Q VPSLLVD/Q VPSRAVD VPSRLVD/Q PSLLDQ PSRLDQ String instructions PCMPESTRI PCMPESTRI PCMPESTRM PCMPESTRM PCMPISTRI PCMPISTRI PCMPISTRM PCMPISTRM x,x x,m128 v,v v,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / 
v,v,m x,x,i / v,v,v,i x,m,i / v,v,m,i 1 1 1 1 1 1 1 1 3 4 1 2 1 2 1 2 1 2 3 4 p0 p0 p23 p15 p15 p23 p15 p15 p23 p0 p0 p23 p0 2p5 p0 2p5 p23 5 v,v / v,v,v 1 1 p015 1 0.33 v,m / v,v,m v,v v,m 1 2 2 2 2 3 p015 p23 p0 p5 p0 p5 p23 2 0.5 1 1 mm,mm 1 1 p0 1 1 mm,m64 1 2 p0 p23 x,x / v,v,x 2 2 p0 p5 x,m / v,v,m 2 2 p0 p23 v,i / v,v,i 1 1 p0 1 1 v,v,v 3 3 2p0 p5 2 2 AVX2 v,v,m 4 4 2p0 p5 p23 2 AVX2 x,i / v,v,i 1 1 p5 1 1 x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i 8 8 9 9 3 4 3 4 8 8 9 9 3 4 3 4 6p05 2p16 11 4 4 5 5 3 3 3 3 1 1 5 6 Page 195 SSE4.1 SSE4.1 SSSE3 SSSE3 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 1 2 1 1 3p0 2p16 2p5 p23 3p0 2p16 4p5 6p05 2p16 p23 3p0 3p0 p23 3p0 3p0 p23 1 1 0.5 0.5 0.5 0.5 1 1 2 2 10 11 10 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 Haswell Encryption instructions PCLMULQDQ x,x,i PCLMULQDQ x,m,i AESDEC, AESDECLAST, AESENC, AESENCLAST x,x AESDEC, AESDECLAST, AESENC, AESENCLAST x,m AESIMC x,x AESIMC x,m AESKEYGENAS SIST x,x,i AESKEYGENAS SIST x,m,i Other EMMS 3 4 3 4 2p0 p5 2p0 p5 p23 7 2 2 CLMUL CLMUL 1 1 p5 7 1 AES 2 2 3 2 2 3 p5 p23 2p5 2p5 p23 14 1.5 2 2 AES AES AES 10 10 2p0 8p5 10 9 AES 10 10 2p0 p23 7p5 8 AES 31 31 13 Floating point XMM and YMM instructions Instruction Move instructions MOVAPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVHPS/D MOVLPS/D MOVLPS/D MOVHLPS MOVLHPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D SHUFPS/D Operands Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments x,x y,y 1 1 1 1 p5 p5 0-1 0-1 1 1 x,m128 1 1 p23 3 0.5 y,m256 1 1 p23 3 0.5 m128,x 1 2 p237 p4 3 1 m256,y x,x x,m32/64 m32/64,x x,m64 m64,x x,m64 m64,x x,x x,x r32,x r32,y m128,x m256,y x,x,i / v,v,v,i x,m,i / v,v,m,i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 1 1 1 1 2 2 1 2 p237 p4 p5 p23 p237 p4 p23 p5 p4 p237 p23 p5 p4 p237 p5 p5 p0 p0 p4 p237 p4 p237 p5 p5 p23 4 1 3 3 4 3 4 3 1 1 3 2 ~400 ~400 
1 1 1 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 Page 196 may be elim. may be elim. AVX AVX AVX Haswell VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 VPERMPS VPERMPS VPERMPD VPERMPD BLENDPS/PD BLENDPS/PD BLENDVPS/PD BLENDVPS/PD VBLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP UNPCKH/LPS/D UNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VPGATHERDPS VPGATHERDPS VPGATHERQPS VPGATHERQPS VPGATHERDPD VPGATHERDPD VPGATHERQPD VPGATHERQPD Conversion CVTPD2PS CVTPD2PS v,v,i v,m,i v,v,v v,v,m y,y,y,i y,y,m,i y,y,y y,y,m y,y,i y,m,i x,x,i / v,v,v,i x,m,i / v,v,m,i x,x,xmm0 x,m,xmm0 v,v,v,v v,v,m,v v,v v,m x,m32 y,m32 x,x y,x y,m64 y,x y,m128 v,v v,m x,x / v,v,v x,m / v,v,m r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i v,v,m m128,x,x m256,y,y x,[r+s*x],x y,[r+s*y],y x,[r+s*x],x x,[r+s*y],x x,[r+s*x],x y,[r+s*x],y x,[r+s*x],x y,[r+s*y],y 1 2 1 2 1 2 1 1 1 2 1 2 2 3 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 1 2 1 2 1 2 3 4 4 20 34 15 22 12 20 14 22 1 2 1 2 1 2 1 2 1 2 1 2 2 3 2 3 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 1 2 1 2 1 2 3 4 4 20 34 15 22 12 20 14 22 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p015 p015 p23 2p5 2p5 p23 2p5 2p5 p23 p5 p23 p23 p23 p5 p5 p23 p5 p23 p5 p23 p5 p5 p23 p0 p5 p0 p5 p23 p5 p23 p4 p5 p23 p5 p5 p015 p23 2p5 p23 p0 p1 p4 p23 p0 p1 p4 p23 x,x x,m128 2 2 2 3 p1 p5 p1 p5 p23 Page 197 1 1 3 3 3 1 2 2 1 3 4 5 1 3 5 3 3 1 3 1 4 3 4 1 4 3 4 4 13 14 4 1 1 1 1 1 1 1 1 1 1 0.33 0.5 2 2 2 2 1 0.5 0.5 0.5 1 1 0.5 1 0.5 1 0.5 1 1 1 1 1 1 1 1 1 2 2 1 2 9 12 8 7 7 9 7 9 1 1 AVX AVX AVX AVX AVX AVX AVX2 AVX2 AVX2 AVX2 SSE4.1 SSE4.1 SSE4.1 SSE4.1 AVX AVX SSE3 SSE3 AVX AVX AVX2 AVX2 AVX AVX2 AVX SSE3 SSE3 SSE3 SSE3 SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX AVX AVX AVX AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 Haswell VCVTPD2PS VCVTPD2PS CVTSD2SS 
CVTSD2SS CVTPS2PD CVTPS2PD VCVTPS2PD VCVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS VCVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ VCVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD CVTDQ2PD VCVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ VCVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI VCVTPS2PH VCVTPS2PH VCVTPH2PS VCVTPH2PS Arithmetic ADDSS/D PS/D SUBSS/D PS/D ADDSS/D PS/D SUBSS/D PS/D ADDSUBPS/D x,y x,m256 x,x x,m64 x,x x,m64 y,x y,m128 x,x x,m32 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m64 y,x y,m128 x,x x,m128 x,y x,m256 x,mm x,m64 mm,x mm,m128 x,mm x,m64 mm,x mm,m128 x,r32 x,m32 r32,x r32,m32 x,r32/64 x,m32 r32/64,x r32,m64 x,v,i m,v,i v,x v,m 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 4 2 2 2 3 2 3 2 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 2 2 2 3 2 3 1 2 2 2 2 2 2 3 2 2 2 3 2 2 2 3 2 4 2 2 p1 p5 p1 p5 p23 p1 p5 p1 p5 p23 p0 p5 p0 p23 p0 p5 p0 p23 p0 p5 p0 p23 p1 p1 p23 p1 p1 p23 p1 p1 p23 p1 p1 p23 p1 p5 p1 p23 p1 p5 p1 p23 p1 p5 p1 p5 p23 p1 p5 p1 p5 p23 p1 p1 p23 p1 p5 p1 p23 p1 p5 p1 p23 p1 p5 p1 p5 p23 p1 p5 p1 p23 p0 p1 p0 p1 p23 p1 p5 p1 p23 p0 p1 p0 p1 p23 p1 p5 p1 p4 p5 p23 p1 p5 p1 p23 5 x,x / v,v,v 1 1 p1 3 1 x,m / v,v,m x,x / v,v,v 1 1 2 1 p1 p23 p1 3 1 1 Page 198 4 2 5 2 3 3 3 3 4 6 4 6 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 3 1 1 1 1 1 1 3 3 1 1 3 3 1 1 1 1 1 1 AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX F16C F16C F16C F16C SSE3 Haswell ADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D MULSS/D PS/D MULSS/D PS/D DIVSS DIVPS DIVSS DIVPS DIVSD DIVPD DIVSD DIVPD VDIVPS VDIVPS VDIVPD VDIVPD RCPSS/PS RCPSS/PS VRCPPS VRCPPS CMPccSS/D CMPccPS/D CMPccSS/D CMPccPS/D (U)COMISS/D (U)COMISS/D MAXSS/D PS/D MINSS/D PS/D MAXSS/D PS/D MINSS/D PS/D x,m / v,v,m 1 2 p1 p23 x,x / v,v,v 3 3 p1 2p5 x,m / v,v,m x,x / v,v,v x,m / v,v,m x,x x,m x,x x,m y,y,y y,y,m256 
y,y,y y,y,m256 x,x x,m128 y,y y,m256 4 1 1 1 1 1 1 3 4 3 4 1 1 3 4 4 1 2 1 2 1 2 3 4 3 4 1 2 3 4 p1 2p5 p23 p01 p01 p23 p0 p0 p23 p0 p0 p23 2p0 p15 2p0 p15 p23 2p0 p15 2p0 p15 p23 p0 p0 p23 2p0 p15 2p0 p15 p23 x,x / v,v,v 1 1 p1 x,m / v,v,m x,x x,m32/64 2 1 2 2 1 2 p1 p23 p1 p1 p23 x,x / v,v,v 1 1 p1 x,m / v,v,m 1 2 p1 p23 ROUNDSS/D PS/D v,v,i 2 2 2p1 6 ROUNDSS/D PS/D v,m,i x,x,i / v,v,v,i x,m,i / v,v,m,i x,x,i x,m128,i 3 4 6 3 4 3 4 6 3 4 2p1 p23 2p0 p1 p5 14 2p0 p1 p5 p23 p6 p0 p1 p5 p0 p1 p5 p23 9 v,v,v 1 1 p01 5 v,v,m 1 2 p01 p23 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m128 1 1 3 4 1 1 3 4 1 1 1 2 3 4 1 2 3 4 1 2 p0 p0 p23 2p0 p15 2p0 p15 p23 p0 p0 p23 2p0 p15 2p0 p15 p23 p0 p0 p23 DPPS DPPS DPPD DPPD VFMADD... (all FMA instr.) VFMADD... (all FMA instr.) Math SQRTSS/PS SQRTSS/PS VSQRTPS VSQRTPS SQRTSD/PD SQRTSD/PD VSQRTPD VSQRTPD RSQRTSS/PS RSQRTSS/PS Page 199 5 5 10-13 10-20 18-21 19-35 5 7 3 1 SSE3 2 SSE3 2 0.5 0.5 7 7 8-14 8-14 14 14 16-28 16-28 1 1 2 2 SSE3 AVX AVX AVX AVX AVX AVX 1 1 1 1 3 1 1 11 19 16 28-29 5 2 SSE4.1 2 2 4 1 1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 0.5 FMA 0.5 FMA 7 7 14 14 8-14 8-14 16-28 16-28 1 1 AVX AVX AVX AVX Haswell VRSQRTPS VRSQRTPS y,y y,m256 3 4 3 4 2p0 p15 2p0 p15 p23 7 2 2 AND/ANDN/OR/XO RPS/PD x,x / v,v,v 1 1 p5 1 1 AND/ANDN/OR/XO RPS/PD x,m / v,v,m 1 2 p5 p23 1 Other VZEROUPPER 4 4 none 1 VZEROALL 12 12 none 10 20 3 3 3 130 116 224 173 20 3 4 none p0 p6 p23 p0 p4 p6 p237 AVX AVX Logic VZEROALL LDMXCSR STMXCSR VSTMXCSR FXSAVE FXRSTOR XSAVE XRSTOR XSAVEOPT m32 m32 m32 m4096 m4096 m Page 200 6 7 8 3 1 1 68 72 84 111 AVX AVX, 32 bit AVX, 64 bit AVX Broadwell Intel Broadwell List of instruction timings and μop breakdown Explanation of column headings: Instruction: Name of instruction. Multiple names mean that these instructions have the same data. Instructions with or without V name prefix behave the same unless otherwise noted. 
Operands: i = immediate data, r = register, mm = 64-bit mmx register, x = 128-bit xmm register, (x)mm = mmx or xmm register, y = 256-bit ymm register, v = any vector register (mmx, xmm, ymm). same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.

μops fused domain: The number of μops at the decode, rename and allocate stages in the pipeline. Fused μops count as one.

μops unfused domain: The total number of μops for all execution ports. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if this number is higher than the number under fused domain. Some operations are not counted here if they do not go to any execution port or if the counters are inaccurate.

μops each port: The number of μops for each execution port. p0 means a μop to execution port 0. p01 means a μop that can go to either port 0 or port 1. p0 p1 means two μops going to port 0 and 1, respectively.
Port 0: Integer, f.p. and vector ALU, mul, div, branch
Port 1: Integer, f.p. and vector ALU
Port 2: Load
Port 3: Load
Port 4: Store
Port 5: Integer and vector ALU
Port 6: Integer ALU, branch
Port 7: Store address

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
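The difference between the Latency and Reciprocal throughput columns matters most when estimating how long a sequence of similar instructions takes. The Python sketch below is a simplified model added for illustration. The function name estimated_cycles and the numbers in the usage note are hypothetical. It assumes that the instructions are split evenly over a number of independent dependency chains and that nothing else competes for the same execution resources.

```python
def estimated_cycles(n, latency, recip_throughput, chains=1):
    """Rough cycle estimate for n instructions of the same kind.

    latency          -- value from the Latency column (core clock cycles)
    recip_throughput -- value from the Reciprocal throughput column
    chains           -- number of independent dependency chains the
                        n instructions are divided into (assumed even split)

    Simplified model (an assumption of this sketch): the run time is
    limited either by the dependency chains or by the throughput,
    whichever is slower.
    """
    if n < 0 or chains < 1:
        raise ValueError("n must be >= 0 and chains must be >= 1")
    latency_bound = n * latency / chains      # each chain runs serially
    throughput_bound = n * recip_throughput   # all instructions independent
    return max(latency_bound, throughput_bound)
```

With illustrative values of latency 3 and reciprocal throughput 1, a single chain of 1200 instructions is estimated at 3600 cycles, while three or more independent chains reach the throughput limit of 1200 cycles. Adding further chains beyond that point gives no gain in this model.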
Integer instructions Instruction Move instructions MOV MOV MOV MOV MOV MOV MOV MOV Operands Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments r,i r8/16,r8/16 r32/64,r32/64 r8l,m r8h,m r16,m r32/64,m 1 1 1 1 1 1 1 1 1 1 2 1 2 1 p0156 p0156 p0156 p23 p0156 p23 p23 p0156 p23 m,r 1 2 p237 p4 Page 201 2 0.25 0.25 0.25 0.5 0.5 0.5 0.5 3 1 1 0-1 may be elim. all addressing modes Broadwell MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA m,i m,r r,r 1 2 1 2 2 1 p237 p4 p23 p4 p0156 r16,m8 r,m 1 1 2 1 p23 p0156 p23 r,r r,m r,r r,m 1 2 3 8 3 2 2 3 3 4 19 1 3 3 9 18 1 3 2 p06 p06 p23 3p0156 r16,m 1 2 3 8 3 1 1 2 2 3 11 1 3 2 9 18 1 3 2 LEA r32/64,m 1 LEA r32/64,m LEA BSWAP BSWAP MOVBE MOVBE MOVBE MOVBE MOVBE MOVBE PREFETCHNTA/ 0/1/2 PREFETCHW LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB 1 1 0.25 0.5 0.5 1 2 21 7 3 implicit lock p06 3p0156 p1 p05 1 1 2-4 1 p15 1 0.5 1 1 p1 3 1 r32/64,m 1 1 p1 r32 r64 r16,m16 r32,m32 r64,m64 m16,r16 m32,r32 m64,r64 1 2 3 2 3 2 2 3 1 2 3 2 3 3 3 4 p15 p06 p15 2p0156 p23 p15 p23 2p0156 p23 p06 p237 p4 p15 p237 p4 p06 p15 p237 p4 m 1 1 p23 0.5 m 1 2 3 2 1 3 2 p23 none counted p23 p4 p23 p4 1 4 33 6 1 1 p0156 r stack pointer m r,r/i p23 p23 2p0156 2p237 p4 Page 202 2 1 1 2 1 all other combinations 0.5 0.5 1 2 1 1 1 1 1 8 0.5 4 1 18 8 1 1 1 r i m stack pointer p23 2p0156 p237 p4 p237 p4 p4 2p237 p0156 p237 p4 p1 p4 p237 p06 ~400 1 0.5 1 0.5-1 0.5 0.5 1 1 1 0.25 not 64 bit not 64 bit not 64 bit 16 or 32 bit address size 1 or 2 components in address 3 components in address rip relative address MOVBE MOVBE MOVBE MOVBE MOVBE MOVBE PREFETCHW Broadwell ADD SUB ADD SUB r,m m,r/i 1 2 2 4 ADC SBB ADC SBB ADC SBB r,r/i r,m m,r/i 1 2 4 1 2 6 CMP CMP INC DEC NEG NOT INC DEC NOT NEG AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL 
IMUL IMUL IMUL IMUL IMUL IMUL IMUL MULX MULX MULX MULX DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO r,r/i m,r/i r 1 1 1 m m 3 2 2 2 3 3 8 1 4 3 2 1 4 3 2 1 1 2 1 1 2 1 1 3 3 2 2 9 11 10 36 9 10 9 59 1 1 1 2 1 1 r8 r16 r32 r64 m8 m16 m32 m64 r,r r,m r16,r16,i r32,r32,i r64,r64,i r16,m16,i r32,m32,i r64,m64,i r32,r32,r32 r32,r32,m32 r64,r64,r64 r64,r64,m64 r8 r16 r32 r64 r8 r16 r32 r64 p0156 p23 0.5 1 2p0156 2p237 p4 6 p06 p06 p23 1 3p0156 2p237 p4 7 1 1 2 1 2 1 p0156 p0156 p23 p0156 1 1 1 0.25 0.5 0.25 4 4 2 2 3 3 8 1 4 3 2 2 5 4 3 1 2 2 1 1 3 2 2 3 4 2 3 9 11 10 36 9 10 9 59 1 1 1 2 1 1 p0156 2p237 p4 p0156 2p237 p4 p1 p56 p1 p056 p1 2p056 p1 2p056 p0 p1 p5 p6 p1 p1 p0156 p1 p0156 p1 p6 p1 p23 p1 3p0156 p23 p1 2p0156 p23 p1 p6 p23 p1 p1 p23 p1 p0156 p1 p1 p1 p0156 p23 p1 p23 p1 p23 p1 2p056 p1 2p056 p23 p1 p5 p1 p6 p23 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0156 p0156 p0156 p0156 p06 p06 6 6 4 6 4 6 21 3 4 4 3 1 1 Page 203 3 4 3 3 4 4 22-25 23-26 22-29 32-95 23-26 23-26 22-29 39-103 1 1 1 1 1 1 7 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9 21-73 6 6 6 24-81 not 64 bit not 64 bit not 64 bit not 64 bit not 64 bit AVX2 AVX2 AVX2 AVX2 Broadwell POPCNT POPCNT CRC32 CRC32 r,r r,m r,r r,m 1 1 1 1 1 2 1 2 p1 p1 p23 p1 p1 p23 3 Logic instructions AND OR XOR AND OR XOR AND OR XOR r,r/i r,m m,r/i 1 1 2 1 2 4 p0156 p0156 p23 1 2p0156 2p237 p4 6 r,r/i m,r/i r,i m,i r,cl m,cl r,1 r,i m,i r,cl m,cl r,1 m,1 r,i m,i r,cl m,cl r,r,i m,r,i r,r,cl r,r,cl m,r,cl r,r,r r,m,r r,r,i r,m,i r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 1 1 1 3 3 5 2 1 4 3 5 3 4 8 11 8 11 1 3 4 4 5 1 2 1 2 1 10 2 1 10 2 1 1 1 2 1 1 1 3 1 1 1 2 1 4 3 6 2 1 5 3 6 3 6 8 11 8 11 1 5 4 4 7 1 2 1 2 1 10 2 1 10 2 1 2 1 3 0 1 1 3 1 2 p0156 p0156 p23 p06 2p06 p237 p4 3p06 3p06 2p23 p4 2p06 p06 2p06 2p237 p4 3p06 3p06 p23 p4 2p06 p0156 1 1 1 p0156 6 p0156 6 p1 3 p0156 p0156 3 4 p06 p06 p23 p06 p06 p23 p06 1 TEST TEST SHR SHL SAR SHR SHL SAR SHR SHL 
SAR SHR SHL SAR ROR ROL ROR ROL ROR ROL ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHRD SHLD SHRD SHLD SHLD SHRD SHRD SHLD SHLX SHRX SARX SHLX SHRX SARX RORX RORX BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD LZCNT LZCNT r,r r,m p06 p23 p06 p06 p23 p1 p1 p23 p06 p06 p237 p4 none p0156 p0156 p15 p6 p1 p1 p23 Page 204 3 2 1 1 2 2 1 1 1 3 1 1 3 1 1 1 1 SSE4.2 SSE4.2 SSE4.2 SSE4.2 0.25 0.5 1 0.25 0.5 0.5 2 2 4 1 0.5 2 2 4 2 3 6 6 6 6 1 2 2 2 4 0.5 0.5 0.5 0.5 0.5 5 0.5 0.5 5 0.5 1 1 0.5 1 0.25 0.25 1 4 1 1 short form BMI2 BMI2 BMI2 BMI2 LZCNT LZCNT Broadwell TZCNT TZCNT ANDN ANDN BLSI BLSMSK BLSR BLSI BLSMSK BLSR BEXTR BEXTR BZHI BZHI PDEP PDEP PEXT PEXT r,r r,m r,r,r r,r,m r,r 1 1 1 1 1 1 2 1 2 1 p1 p1 p23 p15 p15 p23 p15 r,m 1 2 p15 p23 r,r,r r,m,r r,r,r r,m,r r,r,r r,r,m r,r,r r,r,m 2 3 1 1 1 1 1 1 2 3 1 2 1 2 1 2 2p0156 2p0156 p23 p15 p15 p23 p1 p1 p23 p1 p1 p23 Control transfer instructions JMP short/near JMP r JMP m Conditional jump short/near 1 1 1 1 1 1 2 1 p6 p6 p23 p6 p6 1-2 2 2 1-2 Conditional jump 1 1 p06 0.5-1 1 1 p6 1-2 1 1 p06 0.5-1 2 7 11 2 2 3 1 3 15 4 2 7 11 3 3 4 2 4 15 4 p0156 p6 0.5-2 5 6 2 2 3 1 2 8 5 3 2 5n+12 3 <2n 2.6/32B 3 2 2p0156 p23 p0156 p23 3 p23 p0156 p4 5 ~2n 4/32B 5 Fused arithmetic and branch Fused arithmetic and branch J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL RET RET BOUND INTO String instructions LODSB/W LODSD/Q REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS short/near short short short near r m i r,m p237 p4 p6 p237 p4 p6 2p237 p4 p6 p237 p6 p23 2p6 p015 2p23 p4 2p0156 Page 205 3 1 1 1 2 1 3 3 1 1 0.5 0.5 0.5 BMI1 BMI1 BMI1 BMI1 BMI1 0.5 BMI1 0.5 1 0.5 0.5 1 1 1 1 BMI1 BMI1 BMI2 BMI2 BMI2 BMI2 BMI2 BMI2 1 1 ~2n 1 ~0.5n 1/32B 4 < 1n 1/32B predicted taken predicted not taken predicted taken predicted not taken not 64 bit not 64 bit worst case best case aligned by 32 worst case best case aligned by 32 Broadwell SCAS REP SCAS CMPS REP CMPS 3 ≥6n 5 ≥8n 3 p23 
2p0156 5 2p23 3p0156 Synchronization instructions XADD m,r LOCK XADD m,r LOCK ADD m,r CMPXCHG m,r LOCK CMPXCHG m,r CMPXCHG8B m,r LOCK CMPXCHG8B m,r CMPXCHG16B m,r LOCK CMPXCHG16B m,r 4 9 8 5 10 15 19 22 24 5 9 8 6 10 15 19 22 24 1 1 0 0 Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE XGETBV RDTSC RDTSCP RDPMC RDRAND RDSEED a,0 a,b r r 5 5 12 12 ~14+7b ~45+7b 3 3 8 8 15 15 21 21 34 34 16 16 16 16 1 ≥2n 4 ≥2n 6 21 21 7 21 8 21 15 27 none none 0.25 0.25 p05 3p6 9 8 ~87+2b 2p0156 p23 5 5 24 30 37 ~230 ~230 p23 15p0156 p23 15p0156 XGETBV RDTSCP RDRAND RDSEED Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 Operands r m32/64 m80 m80 r m32/m64 m80 m80 r m m m Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments 1 1 4 43 1 1 7 238 2 1 3 3 1 2 1 1 4 43 1 2 7 226 0 2 3 3 1 2 p01 p23 2p01 2p23 p01 p4 p237 3p0156 2p23 2p4 none p01 p23 p1 p23 p4 p1 p23 p4 p01 2p01 Page 206 1 3 4 47 1 4 5 269 0 6 7 7 0.5 0.5 2 22 0.5 1 5 267 0.5 1 1 2 1 2 SSE3 Broadwell FLDPI FLDL2E etc. 
FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP r m m 2 3 2 2 3 2 1 1 152 95 2 3 2 3 3 3 1 1 152 95 2p01 2p0 p5 p0 p0156 p0 p4 p237 p01 p23 p6 p237 p4 p6 p01 p01 r 1 1 p1 m r m r m 1 1 1 1 1 1 1 1 1 2 2 1 2 1 2 1 1 1 2 2 p1 p23 p0 p0 p23 p0 p0 p23 p0 p0 p1 p1 p23 2p01 3 2 2 2 2 1 2 28 28 17 3 3 3 3 3 1 2 28 28 17 3p01 2p1 p23 p0 p1 p23 p0 p1 p23 2p1 p23 p1 2p1 27 17 1 75-100 70-100 70-110 16-86 55-96 56 71-102 27-71 27 17 1 p0 1 1 p01 r AX m16 m16 m16 r m r m m m m Page 207 173 175 2 2 1 1 2 1 0.5 0.5 173 175 3 1 2 6 6 7 6 0 5 10-15 1 1 3 7 3 6 20-24 23-48 11 125 12 10-23 48-106 49-112 52-124 63-68 92 74 132 97-147 1 1 1 4-5 4-5 1 1 1 1 1 1.5 2 2 2 1 2 13 13 23 130 11 4-9 0.5 Broadwell WAIT FNCLEX FNINIT 2 5 26 2 5 26 p01 p0156 1 22 84 Integer MMX and XMM instructions Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA/U MOVDQA/U MOVDQA/U VMOVDQA/U VMOVDQA/U VMOVDQA/U LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ VMOVNTDQ MOVNTDQA VMOVNTDQA PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW PACKUSDW PUNPCKH/L BW/WD/DQ PUNPCKH/L BW/WD/DQ PUNPCKH/L QDQ PUNPCKH/L QDQ Operands r32/64,(x)mm Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments r64,(x)mm (x)mm,r64 (x)mm,(x)mm (x)mm,m64 m64, (x)mm x,x x, m128 m128, x 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 p0 p237 p4 p5 p23 p0 p5 p015 p23 p237 p4 p015 p23 p237 p4 1 3 1 3 1 1 1 3 3 0-1 3 3 1 1 1 0.5 1 1 0.33 0.5 1 0.25 0.5 1 y,y y,m256 m256,y x, m128 mm, x x,mm m64,mm m128,x m256,y x, m128 y,m256 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 2 2 2 1 1 p015 p23 p237 p4 p23 p01 p5 
p015 p237 p4 p237 p4 p237 p4 p23 p23 0-1 3 4 3 1 1 ~400 ~400 ~400 3 3 0.25 0.5 1 0.5 1 0.33 1 1 1 0.5 0.5 mm,mm 3 3 p5 2 2 mm,m64 3 3 p23 2p5 x,x / y,y,y 1 1 p5 x,m / y,y,m x,x / y,y,y x,m / y,y,m 1 1 1 2 1 2 v,v / v,v,v 1 v,m / v,v,m m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 AVX may be elim. AVX AVX SSE3 AVX2 SSE4.1 AVX2 2 1 1 p23 p5 p5 p23 p5 1 1 1 1 1 p5 1 1 1 2 p23 p5 x,x / y,y,y 1 1 p5 x,m / y,y,m 2 2 p23 p5 Page 208 may be elim. 1 1 1 1 SSE4.1 SSE4.1 Broadwell PMOVSX/ZX BW BD BQ DW DQ x,x 1 1 p5 PMOVSX/ZX BW BD BQ DW DQ x,m 1 2 p23 p5 VPMOVSX/ZX BW BD BQ DW DQ y,x 1 1 p5 PINSRB PINSRB PINSRW PINSRW PINSRD/Q PINSRD/Q VINSERTI128 VINSERTI128 y,m v,v / v,v,v v,m / v,v,m mm,mm,i mm,m64,i v,v,i v,m,i v,v,i v,m,i v,v,i / v,v,v,i v,m,i / v,v,m,i x,x,xmm0 x,m,xmm0 v,v,v,v v,v,m,v x,x,i / v,v,v,i x,m,i / v,v,m,i v,v,v,i v,v,m,i y,y,y y,y,m y,y,i y,m,i y,y,y,i y,y,m,i mm,mm x,x v,v,m m,v,v r,v r32,x,i m8,x,i x,y,i m,y,i x,r32,i x,m8,i (x)mm,r32,i (x)mm,m16,i x,r32,i x,m32,i y,y,x,i y,y,m,i 2 1 2 1 2 1 2 1 2 1 2 2 3 2 3 1 2 1 2 1 1 1 2 1 2 4 10 3 4 1 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 1 2 1 2 2 3 2 3 1 2 1 2 1 2 1 2 1 2 4 10 3 4 1 2 3 1 2 2 2 2 2 2 2 1 2 p5 p23 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 2p5 2p5 p23 2p5 2p5 p23 p5 p23 p5 p015 p015 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p0 p4 2p23 4p04 2p56 4p23 p23 2p5 p0 p1 p4 p23 p0 p0 p5 p23 p4 p5 p5 p23 p4 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p015 p23 VPBROADCAST B/W/D/Q x,x 1 1 VPBROADCAST B/W x,m8/16 3 3 VPMOVSX/ZX BW BD BQ DW DQ PSHUFB PSHUFB PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR PALIGNR PBLENDVB PBLENDVB VPBLENDVB VPBLENDVB PBLENDW PBLENDW VPBLENDD VPBLENDD VPERMD VPERMD VPERMQ VPERMQ VPERM2I128 VPERM2I128 MASKMOVQ MASKMOVDQU VPMASKMOVD/Q VPMASKMOVD/Q PMOVMSKB PEXTRB/W/D/Q PEXTRB/W/D/Q VEXTRACTI128 VEXTRACTI128 1 SSE4.1 1 SSE4.1 1 AVX2 AVX2 SSSE3 SSSE3 3 4 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 0.33 0.5 1 1 1 1 1 1 1 6 2 1 1 1 1 1 1 2 1 2 1 2 1 1 0.5 p5 1 1 AVX2 p01 p23 p5 5 1 AVX2 Page 209 1 3 1 1 1 1 
1 2 2 1 1 3 3 3 18-500 18-500 4 15 3 2 3 4 2 2 2 SSSE3 SSSE3 SSE4.1 SSE4.1 AVX2 AVX2 SSE4.1 SSE4.1 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 SSE4.1 SSE4.1 AVX2 AVX2 SSE4.1 SSE4.1 SSE4.1 SSE4.1 AVX2 AVX2 Broadwell VPBROADCAST D/Q x,m32/64 1 1 p23 4 0.5 AVX2 VPBROADCAST B/W/D/Q y,x 1 1 p5 3 1 AVX2 VPBROADCAST B/W y,m8/16 3 3 p01 p23 p5 7 1 AVX2 y,m32/64 y,m128 x,[r+s*x],x y,[r+s*y],y x,[r+s*x],x x,[r+s*y],x x,[r+s*x],x y,[r+s*x],y x,[r+s*x],x y,[r+s*y],y 1 1 10 14 9 10 7 9 7 9 1 1 10 14 9 10 7 9 7 9 p23 p23 5 3 0.5 0.5 6 7 6 6 5 6 5 6 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 PADD/SUB(S,US) B/W/D/Q v,v / v,v,v 1 1 p15 1 0.5 PADD/SUB(S,US) B/W/D/Q v,m / v,v,m 1 2 p15 p23 v,v / v,v,v 3 3 p1 2p5 v,m / v,v,m 4 4 p1 2p5 p23 v,v / v,v,v 1 1 p15 v,m / v,v,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m 1 1 1 1 1 2 1 2 1 2 p15 p23 p15 p15 p23 p0 p0 p23 v,v / v,v,v 1 1 p0 v,m / v,v,m v,v / v,v,v v,m / v,v,m x,x / y,y,y x,m / y,y,m x,x / y,y,y x,m / y,y,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m 1 1 1 2 3 1 1 1 1 1 1 1 1 1 1 2 1 2 2 3 1 2 1 2 1 2 1 2 1 2 p0 p23 p0 p0 p23 2p0 2p0 p23 p0 p0 p23 p0 p0 p23 p0 p0 p23 p0 p0 p23 p15 p15 p23 VPBROADCAST D/Q VBROADCASTI128 VPGATHERDD VPGATHERDD VPGATHERQD VPGATHERQD VPGATHERDQ VPGATHERDQ VPGATHERQQ VPGATHERQQ Arithmetic instructions PHADD(S)W/D PHSUB(S)W/D PHADD(S)W/D PHSUB(S)W/D PCMPEQB/W/D PCMPGTB/W/D PCMPEQB/W/D PCMPGTB/W/D PCMPEQQ PCMPEQQ PCMPGTQ PCMPGTQ PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULLD PMULLD PMULDQ PMULDQ PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW PAVGB/W PAVGB/W Page 210 0.5 3 1 1 5 5 5 10 5 5 5 5 1 2 SSSE3 2 SSSE3 0.5 0.5 0.5 0.5 1 1 SSE4.1 SSE4.1 SSE4.2 SSE4.2 1 1 1 1 2 2 1 1 1 1 1 1 1 1 0.5 0.5 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 Broadwell PMIN/PMAX SB/SW/SD UB/UW/UD PMIN/PMAX SB/SW/SD UB/UW/UD PHMINPOSUW PHMINPOSUW PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW MPSADBW MPSADBW Logic 
instructions PAND PANDN POR PXOR PAND PANDN POR PXOR PTEST PTEST PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q VPSLLVD/Q VPSRAVD VPSRLVD/Q VPSLLVD/Q VPSRAVD VPSRLVD/Q PSLLDQ PSRLDQ String instructions PCMPESTRI PCMPESTRI PCMPESTRM x,x / y,y,y 1 1 p15 x,m / y,y,m x,x x,m128 v,v v,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m x,x,i / v,v,v,i x,m,i / v,v,m,i 1 1 1 1 1 1 1 1 1 3 4 2 1 2 1 2 1 2 1 2 3 4 p15 p23 p0 p0 p23 p15 p15 p23 p15 p15 p23 p0 p0 p23 p0 2p5 p0 2p5 p23 v,v / v,v,v 1 1 p015 1 0.33 v,m / v,v,m v,v v,m 1 2 2 2 2 3 p015 p23 p0 p5 p0 p5 p23 2 0.5 1 1 mm,mm 1 1 p0 1 1 mm,m64 1 2 p0 p23 x,x / v,v,x 2 2 p0 p5 x,m / v,v,m 2 2 p0 p23 v,i / v,v,i 1 1 p0 1 1 v,v,v 3 3 2p0 p5 2 2 AVX2 v,v,m 4 4 2p0 p5 p23 2 AVX2 x,i / v,v,i 1 1 p5 1 1 x,x,i x,m128,i x,x,i 8 8 9 8 8 9 6p05 2p16 4 3p0 2p16 2p5 p23 4 4 11 3p0 2p16 4p5 Page 211 1 5 1 1 5 6 0.5 SSE4.1 0.5 1 1 0.5 0.5 0.5 0.5 1 1 2 2 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 1 2 1 1 11 SSE4.2 SSE4.2 SSE4.2 Broadwell PCMPESTRM PCMPISTRI PCMPISTRI PCMPISTRM PCMPISTRM x,m128,i x,x,i x,m128,i x,x,i x,m128,i Encryption instructions PCLMULQDQ x,x,i PCLMULQDQ x,m,i AESDEC, AESDECLAST, AESENC, AESENCLAST x,x AESDEC, AESDECLAST, AESENC, AESENCLAST x,m AESIMC x,x AESIMC x,m AESKEYGENAS SIST x,x,i AESKEYGENAS SIST x,m,i Other EMMS 9 3 4 3 4 9 3 4 3 4 6p05 2p16 p23 3p0 3p0 p23 3p0 3p0 p23 5 3 3 11 3 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 1 2 1 2 p0 p0 p23 5 1 1 CLMUL CLMUL 1 1 p5 7 1 AES 2 2 3 2 2 3 p5 p23 2p5 2p5 p23 14 1.5 2 2 AES AES AES 10 10 2p0 8p5 10 9 AES 10 10 2p0 p23 7p5 8 AES 31 31 3 11 12 Floating point XMM and YMM instructions Instruction Move instructions MOVAPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVHPS/D MOVLPS/D MOVLPS/D MOVHLPS MOVLHPS Operands Reciproµops µops cal fused unfused through domain 
domain µops each port Latency put Comments x,x y,y 1 1 1 1 p5 p5 0-1 0-1 1 1 x,m128 1 1 p23 3 0.5 y,m256 1 1 p23 3 0.5 m128,x 1 2 p237 p4 3 1 m256,y x,x x,m32/64 m32/64,x x,m64 m64,x x,m64 m64,x x,x x,x 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 2 1 1 p237 p4 p5 p23 p237 p4 p23 p5 p4 p237 p23 p5 p4 p237 p5 p5 4 1 3 3 4 3 4 3 1 1 1 1 0.5 1 1 1 1 1 1 1 Page 212 may be elim. may be elim. AVX AVX Broadwell MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D SHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 VPERMPS VPERMPS VPERMPD VPERMPD BLENDPS/PD BLENDPS/PD BLENDVPS/PD BLENDVPS/PD VBLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTSD VBROADCASTF128 MOVSH/LDUP MOVSH/LDUP UNPCKH/LPS/D UNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VPGATHERDPS VPGATHERDPS VPGATHERQPS VPGATHERQPS VPGATHERDPD VPGATHERDPD VPGATHERQPD r32,x r32,y m128,x m256,y x,x,i / v,v,v,i x,m,i / v,v,m,i v,v,i v,m,i v,v,v v,v,m y,y,y,i y,y,m,i y,y,y y,y,m y,y,i y,m,i x,x,i / v,v,v,i x,m,i / v,v,m,i x,x,xmm0 x,m,xmm0 v,v,v,v v,v,m,v v,v v,m x,m32 y,m32 x,x y,x y,m64 y,x y,m128 v,v v,m x,x / v,v,v x,m / v,v,m r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i v,v,m m128,x,x m256,y,y x,[r+s*x],x y,[r+s*y],y x,[r+s*x],x x,[r+s*y],x x,[r+s*x],x y,[r+s*x],y x,[r+s*x],x 1 1 1 1 1 2 1 2 1 2 1 2 1 1 1 2 1 2 2 3 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 2 1 2 3 4 4 10 14 9 10 7 9 7 1 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 3 2 3 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 1 2 1 2 1 2 3 4 4 10 14 9 10 7 9 7 p0 p0 p4 p237 p4 p237 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p015 p015 p23 2p5 2p5 p23 2p5 2p5 p23 p5 p23 p23 p23 p5 p5 p23 p5 p23 p5 p23 p5 p5 p23 p0 p5 p0 p5 p23 p5 p23 p4 p5 p23 p5 p5 p015 p23 2p5 p23 p0 p1 p4 p23 p0 p1 p4 p23 Page 213 3 3 ~400 ~400 1 1 1 3 3 3 1 2 2 1 3 4 5 1 3 5 3 4 1 3 1 4 3 4 1 4 3 4 4 15 16 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.33 0.5 2 2 2 2 1 0.5 0.5 0.5 1 1 0.5 1 0.5 1 0.5 1 1 1 1 1 1 1 1 1 2 2 1 1 6 7 6 6 5 6 5 AVX AVX AVX AVX AVX AVX AVX AVX2 AVX2 AVX2 AVX2 SSE4.1 SSE4.1 SSE4.1 SSE4.1 AVX AVX SSE3 SSE3 AVX AVX AVX2 AVX2 AVX AVX2 AVX SSE3 SSE3 SSE3 SSE3 SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX AVX AVX AVX AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 Broadwell VPGATHERQPD Conversion CVTPD2PS CVTPD2PS VCVTPD2PS VCVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD VCVTPS2PD VCVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS VCVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ VCVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD CVTDQ2PD VCVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ VCVT(T)PD2DQ VCVT(T)PD2DQ CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI VCVTPS2PH VCVTPS2PH VCVTPH2PS VCVTPH2PS y,[r+s*y],y 9 9 x,x x,m128 x,y x,m256 x,x x,m64 x,x x,m64 y,x y,m128 x,x x,m32 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 x,x x,m64 y,x y,m128 x,x x,m128 x,y x,m256 x,mm x,m64 mm,x mm,m128 x,mm x,m64 mm,x mm,m128 x,r32 x,r64 x,m32 r32,x r32,m32 x,r32/64 x,m32 r32/64,x r32,m64 x,v,i m,v,i v,x v,m 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 3 1 2 2 2 2 2 2 2 3 2 2 2 3 2 3 2 3 2 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 2 2 2 3 2 3 1 2 2 2 2 2 2 3 2 3 2 2 3 2 2 2 3 2 3 2 2 6 p1 p5 p1 p5 p23 p1 p5 p1 p5 p23 p1 p5 p1 p5 p23 p0 p5 p0 p23 p0 p5 p0 p23 p0 p5 p0 p23 p1 p1 p23 p1 p1 p23 p1 p1 p23 p1 p1 p23 p1 p5 p1 p23 p1 p5 p1 p23 p1 p5 p1 p5 p23 p1 p5 p1 p5 p23 p1 p1 p23 p1 p5 p1 p23 p1 p5 p1 p23 p1 p5 p1 p5 p23 p1 p5 p1 2p5 p1 p23 p0 p1 p0 p1 p23 p1 p5 p1 p23 p0 p1 p0 p1 p23 p1 p5 p1 p4 p23 p1 p5 p1 p23 Page 214 4 5 4 2 5 2 3 3 3 3 4 6 4 6 4 4 4 4 4 5 4 4 4 4-6 4-6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 3 1 1 1 1 1 1 3 4 3 1 1 3 3 1 1 1 1 1 1 AVX2 AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX F16C F16C F16C F16C Broadwell Arithmetic ADDSS/D PS/D 
SUBSS/D PS/D ADDSS/D PS/D SUBSS/D PS/D ADDSUBPS/D ADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D MULSS/D PS/D MULSS/D PS/D DIVSS DIVPS DIVSS DIVPS DIVSD DIVPD DIVSD DIVPD VDIVPS VDIVPS VDIVPD VDIVPD RCPSS/PS RCPSS/PS VRCPPS VRCPPS CMPccSS/D CMPccPS/D CMPccSS/D CMPccPS/D (U)COMISS/D (U)COMISS/D MAXSS/D PS/D MINSS/D PS/D MAXSS/D PS/D MINSS/D PS/D 3 1 p1 p23 p1 p1 p23 3 1 1 1 SSE3 SSE3 3 p1 2p5 5 2 SSE3 4 1 1 1 1 1 1 1 1 3 4 3 4 1 1 3 4 4 1 2 1 1 2 1 1 2 3 4 3 4 1 2 3 4 p1 2p5 p23 p01 p01 p23 p0 p0 p0 p23 p0 p0 p0 p23 2p0 p15 2p0 p15 p23 2p0 p15 2p0 p15 p23 p0 p0 p23 2p0 p15 2p0 p15 p23 2 0.5 0.5 2.5 5 3-5 4-5 8 4-5 10 10 16 16 1 1 2 2 SSE3 x,x / v,v,v 1 1 p1 x,m / v,v,m x,x x,m32/64 2 1 2 2 1 2 p1 p23 p1 p1 p23 x,x / v,v,v 1 1 p1 x,m / v,v,m 1 2 p1 p23 ROUNDSS/D PS/D v,v,i 2 2 2p1 ROUNDSS/D PS/D v,m,i x,x,i / v,v,v,i x,m,i / v,v,m,i x,x,i x,m128,i 3 4 6 3 4 3 4 6 3 4 2p1 p23 2p0 p1 p5 2p0 p1 p5 p23 p6 p0 p1 p5 p0 p1 p5 p23 7 v,v,v 1 1 p01 5 v,v,m 1 2 p01 p23 x,x 1 1 p0 DPPS DPPS DPPD DPPD VFMADD... (all FMA instr.) VFMADD... (all FMA instr.) 
Math SQRTSS x,x / v,v,v 1 1 p1 x,m / v,v,m x,x / v,v,v x,m / v,v,m 1 1 1 2 1 2 x,x / v,v,v 3 x,m / v,v,m x,x / v,v,v x,m / v,v,m x,x x,x x,m x,x x,x x,m y,y,y y,y,m256 y,y,y y,y,m256 x,x x,m128 y,y y,m256 Page 215 3 11 11 10-14 10-14 17 19-23 5 7 3 AVX AVX AVX AVX AVX AVX 1 1 1 1 3 1 1 6 12 11 2 SSE4.1 2 2 4 1 1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 0.5 FMA 0.5 FMA 4 Broadwell SQRTPS SQRTSS/PS VSQRTPS VSQRTPS SQRTSD SQRTPD SQRTSD/PD VSQRTPD VSQRTPD RSQRTSS/PS RSQRTSS/PS VRSQRTPS VRSQRTPS x,x x,m128 y,y y,m256 x,x x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 1 1 3 4 1 1 1 3 4 1 1 3 4 1 2 3 4 1 1 2 3 4 1 2 3 4 p0 p0 p23 2p0 p15 2p0 p15 p23 p0 p0 p0 p23 2p0 p15 2p0 p15 p23 p0 p0 p23 2p0 p15 2p0 p15 p23 11 AND/ANDN/OR/XO RPS/PD x,x / v,v,v 1 1 p5 AND/ANDN/OR/XO RPS/PD x,m / v,v,m 1 2 p5 p23 1 Other VZEROUPPER 4 4 none 1 VZEROALL 12 12 none 10 20 3 3 111 141 107 115 174 224 172 173 114 20 3 4 none p0 p6 p23 p0 p4 p6 p237 19 15-16 15-16 27-29 5 7 7 4-7 14 14 4-8 8-14 4-14 16-28 16-28 1 1 2 2 AVX AVX AVX AVX AVX AVX Logic VZEROALL LDMXCSR STMXCSR FXSAVE FXSAVE FXRSTOR FXRSTOR XSAVE XSAVE XRSTOR XRSTOR XSAVEOPT m32 m32 m4096 m4096 m4096 m4096 m Page 216 1 6 7 66 66 80 80 70 84 111 112 51 1 8 3 1 66 66 80 80 70 84 111 112 51 AVX AVX, 32 bit AVX, 64 bit 32 bit mode 64 bit mode 32 bit mode 64 bit mode 32 bit mode 64 bit mode 32 bit mode 64 bit mode Skylake Intel Skylake List of instruction timings and μop breakdown Explanation of column headings: Instruction: Name of instruction. Multiple names mean that these instructions have the same data. Instructions with or without V name prefix behave the same unless otherwise noted. Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, v = any vector register (mmx, xmm, ymm). same = same register for both operands. m = memory operand, m32 = 32bit memory operand, etc. 
μops fused domain: The number of μops at the decode, rename and allocate stages in the pipeline. Fused μops count as one.
μops unfused domain: The total number of μops for all execution ports. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if this number is higher than the number under fused domain. Some operations are not counted here if they do not go to any execution port or if the counters are inaccurate.
μops each port: The number of μops for each execution port. p0 means a μop to execution port 0. p01 means a μop that can go to either port 0 or port 1. p0 p1 means two μops going to port 0 and 1, respectively.
    Port 0: Integer, f.p. and vector ALU, mul, div, branch
    Port 1: Integer, f.p. and vector ALU
    Port 2: Load
    Port 3: Load
    Port 4: Store
    Port 5: Integer and vector ALU
    Port 6: Integer ALU, branch
    Port 7: Store address
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
Integer instructions
Columns: Instruction, Operands, μops fused domain, μops unfused domain, μops each port, Latency, Reciprocal throughput, Comments
Move instructions
MOV MOV MOV MOV MOV MOV MOV MOV r,i r8/16,r8/16 r32/64,r32/64 r8l,m r8h,m r16,m r32/64,m 1 1 1 1 1 1 1 1 1 1 2 1 2 1 p0156 p0156 p0156 p23 p0156 p23 p23 p0156 p23 m,r 1 2 p237 p4 2 0.25 0.25 0.25 0.5 0.5 0.5 0.5 2 1 1 0-1 may be elim.
all addressing modes Skylake MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA m,i m,r r,r 1 2 1 2 2 1 p237 p4 p23 p4 p0156 r16,m8 r,m 1 1 2 1 p23 p0156 p23 r,r r,m r,r r,m 1 2 3 8 3 2 2 3 3 4 19 1 3 3 9 18 1 3 2 p06 p06 p23 3p0156 r16,m 1 2 3 8 3 1 1 2 2 3 11 1 3 2 9 18 1 3 2 LEA r32/64,m 1 LEA r32/64,m LEA BSWAP BSWAP MOVBE MOVBE MOVBE MOVBE MOVBE MOVBE PREFETCHNTA/ 0/1/2 PREFETCHW LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB 1 1 0.25 0.5 0.5 1 2 23 7 3 implicit lock p23 p23 2p0156 2p237 p4 2 p06 3p0156 p1 p05 1 1 2-4 1 p15 1 0.5 1 1 p1 3 1 r32/64,m 1 1 p1 r32 r64 r16,m16 r32,m32 r64,m64 m16,r16 m32,r32 m64,r64 1 2 3 2 3 2 2 3 1 2 3 2 3 3 3 4 p15 p06 p15 2p0156 p23 p15 p23 2p0156 p23 p06 p237 p4 p15 p237 p4 p06 p15 p237 p4 m 1 1 p23 0.5 m 1 2 4 2 1 4 2 p23 none counted p23 p4 p23 p4 1 4 33 6 1 1 p0156 r stack pointer m r,r/i Page 218 1 1 2 1 all other combinations 0.5 0.5 1 2 1 1 1 1 1 8 0.5 3 1 20 8 1 1 1 r i m stack pointer p23 2p0156 p237 p4 p237 p4 p4 2p237 p0156 p237 p4 p1 p4 p237 p06 ~400 1 0.5 1 0.5-1 0.5 0.75 1 1 1 0.25 not 64 bit not 64 bit not 64 bit 16 or 32 bit address size 1 or 2 components in address 3 components in address rip relative address MOVBE MOVBE MOVBE MOVBE MOVBE MOVBE PREFETCHW Skylake ADD SUB ADD SUB r,m m,r/i 1 2 2 4 ADC SBB ADC SBB ADC SBB r,r/i r,m m,r/i 1 2 4 1 2 6 CMP CMP INC DEC NEG NOT INC DEC NOT NEG AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL MULX MULX MULX MULX DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO r,r/i m,r/i r 1 1 1 m m 3 2 2 2 3 3 11 1 4 3 2 1 4 3 2 1 1 2 1 1 2 1 1 3 3 2 2 10 10 10 36 11 10 10 57 1 1 1 2 1 1 r8 r16 r32 r64 m8 m16 m32 m64 r,r r,m r16,r16,i r32,r32,i r64,r64,i r16,m16,i r32,m32,i r64,m64,i r32,r32,r32 r32,r32,m32 r64,r64,r64 r64,r64,m64 r8 r16 r32 r64 r8 r16 
r32 r64 p0156 p23 0.5 1 2p0156 2p237 p4 5 p06 p06 p23 1 3p0156 2p237 p4 5 1 1 2 1 2 1 p0156 p0156 p23 p0156 1 1 1 0.25 0.5 0.25 4 4 2 2 3 3 11 1 4 3 2 2 5 4 3 1 2 2 1 1 3 2 2 3 4 2 3 10 10 10 36 11 10 10 57 1 1 1 2 1 1 p0156 2p237 p4 p0156 2p237 p4 p1 p56 p1 p056 p1 2p056 p1 2p056 p0 p1 p5 p6 p1 p1 p0156 p1 p0156 p1 p6 p1 p23 p1 3p0156 p23 p1 2p0156 p23 p1 p6 p23 p1 p1 p23 p1 p0156 p1 p1 p1 p0156 p23 p1 p23 p1 p23 p1 2p056 p1 2p056 p23 p1 p5 p1 p6 p23 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0 p1 p5 p6 p0156 p0156 p0156 p0156 p06 p06 5-6 5-6 4 4 4 4 23 3 4 4 3 1 1 Page 219 3 4 3 3 4 4 23 23 26 35-88 24 23 26 42-95 1 1 1 1 1 1 7 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 6 6 6 21-83 6 6 6 24-90 not 64 bit not 64 bit not 64 bit not 64 bit not 64 bit AVX2 AVX2 AVX2 AVX2 Skylake POPCNT POPCNT CRC32 CRC32 r,r r,m r,r r,m 1 1 1 1 1 2 1 2 p1 p1 p23 p1 p1 p23 3 Logic instructions AND OR XOR AND OR XOR AND OR XOR r,r/i r,m m,r/i 1 1 2 1 2 4 p0156 p0156 p23 1 2p0156 2p237 p4 5 r,r/i m,r/i r,i m,i r,cl m,cl r,1 r,i m,i r,cl m,cl r,1 m,1 r,i m,i r,cl m,cl r,r,i m,r,i r,r,cl r,r,cl m,r,cl r,r,r r,m,r r,r,i r,m,i r,r/i m,r m,i r,r/i m,r m,i r,r r,m r m 1 1 1 3 3 5 2 1 4 3 5 3 4 8 11 8 11 1 3 4 4 5 1 2 1 2 1 10 2 1 10 3 1 1 1 2 1 1 1 3 1 1 1 2 1 4 3 6 2 1 5 3 6 3 6 8 11 8 11 1 5 4 4 7 1 2 1 2 1 10 2 1 11 4 1 2 1 3 0 1 1 3 1 2 p0156 p0156 p23 p06 2p06 p237 p4 3p06 3p06 2p23 p4 2p06 p06 2p06 2p237 p4 3p06 3p06 p23 p4 2p06 p0156 1 1 1 p0156 6 p0156 6 p1 3 p0156 p0156 3 4 p06 p06 p23 p06 p06 p23 p06 1 TEST TEST SHR SHL SAR SHR SHL SAR SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL ROR ROL ROR ROL ROR ROL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL RCR RCL SHRD SHLD SHRD SHLD SHLD SHRD SHRD SHLD SHLX SHRX SARX SHLX SHRX SARX RORX RORX BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR BSF BSR SETcc SETcc CLC STC CMC CLD STD LZCNT LZCNT r,r r,m p06 p23 p06 p06 p4 p23 p1 p1 p23 p06 p06 p237 p4 none p0156 p0156 p15 p6 p1 p1 p23 Page 220 3 2 1 1 2 2 1 1 1 3 
1 1 3 1 1 1 1 SSE4.2 SSE4.2 SSE4.2 SSE4.2 0.25 0.5 1 0.25 0.5 0.5 2 2 4 1 0.5 2 2 4 2 3 6 6 6 6 1 2 2 2 4 0.5 0.5 0.5 0.5 0.5 5 0.5 0.5 5 1 1 1 0.5 1 0.25 0.25 1 4 1 1 short form BMI2 BMI2 BMI2 BMI2 LZCNT LZCNT Skylake TZCNT TZCNT ANDN ANDN BLSI BLSMSK BLSR BLSI BLSMSK BLSR BEXTR BEXTR BZHI BZHI PDEP PDEP PEXT PEXT r,r r,m r,r,r r,r,m r,r 1 1 1 1 1 1 2 1 2 1 p1 p1 p23 p15 p15 p23 p15 r,m 1 2 p15 p23 r,r,r r,m,r r,r,r r,m,r r,r,r r,r,m r,r,r r,r,m 2 3 1 1 1 1 1 1 2 3 1 2 1 2 1 2 2p0156 2p0156 p23 p15 p15 p23 p1 p1 p23 p1 p1 p23 Control transfer instructions JMP short/near JMP r JMP m Conditional jump short/near 1 1 1 1 1 1 2 1 p6 p6 p23 p6 p6 1-2 2 2 1-2 Conditional jump 1 1 p06 0.5-1 1 1 p6 1-2 1 1 p06 0.5-1 2 7 11 2 2 3 1 2 7 11 3 3 4 2 2 15 5 p0156 p6 p237 p4 p6 p237 p4 p6 2p237 p4 p6 p237 p6 0.5-2 5 6 3 2 3 1 2 8 6 3 2 5n+12 3 <2n 2.6/32B 3 2 2p0156 p23 p0156 p23 3 p23 p0156 p4 5 ~2n 4/32B 5 Fused arithmetic and branch Fused arithmetic and branch J(E/R)CXZ LOOP LOOP(N)E CALL CALL CALL RET RET BOUND INTO String instructions LODSB/W LODSD/Q REP LODS STOS REP STOS REP STOS MOVS REP MOVS REP MOVS short/near short short short near r m i r,m 15 5 2p23 p4 2p0156 Page 221 3 1 1 1 2 1 3 3 1 1 0.5 0.5 0.5 BMI1 BMI1 BMI1 BMI1 BMI1 0.5 BMI1 0.5 1 0.5 0.5 1 1 1 1 BMI1 BMI1 BMI2 BMI2 BMI2 BMI2 BMI2 BMI2 1 1 ~2n 1 ~0.5n 1/32B 4 < 1n 1/32B predicted taken predicted not taken predicted taken predicted not taken not 64 bit not 64 bit worst case best case aligned by 32 worst case best case aligned by 32 Skylake SCAS REP SCAS CMPS REP CMPS 3 ≥6n 5 ≥8n 3 p23 2p0156 5 2p23 3p0156 Synchronization instructions XADD m,r LOCK XADD m,r LOCK ADD m,r CMPXCHG m,r LOCK CMPXCHG m,r CMPXCHG8B m,r LOCK CMPXCHG8B m,r CMPXCHG16B m,r LOCK CMPXCHG16B m,r 4 9 8 5 10 16 20 23 25 5 9 8 6 10 16 20 23 25 1 1 0 0 Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE XGETBV RDTSC RDTSCP RDPMC RDRAND RDSEED a,0 a,b r r 1 ≥2n 4 ≥2n 5 18 18 6 18 11 19 16 26 none none 4 4 12 12 ~14+7b ~45+7b 3 3 15 15 20 20 
22 22 35 35 16 16 16 16 0.25 0.25 p6 8 ~87+2b 2p0156 p23 5 9 25 32 40 ~460 ~460 p23 15p0156 p23 15p0156 XGETBV RDTSCP RDRAND RDSEED Floating point x87 instructions Instruction Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 Operands r m32/64 m80 m80 r m32/m64 m80 m80 r m m m Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments 1 1 4 43 1 1 7 244 2 1 3 3 1 2 1 1 4 43 1 2 7 226 0 2 3 3 1 2 p05 p23 2p01 2p23 p05 p4 p237 3p0156 2p23 2p4 none p05 p23 p5 p23 p4 p1 p23 p4 p05 2p05 Page 222 1 3 4 46 1 3 4 264 0 5 7 7 0.5 0.5 2 22 0.5 1 5 266 0.5 1 1 2 1 2 SSE3 Skylake FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FADD(P) FSUB(R)(P) FMUL(P) FMUL(P) FDIV(R)(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN Other FNOP r m m 2 4 2 2 3 2 1 1 133 89 2 4 2 3 3 3 1 1 133 89 2p05 p0 p1 p56 p0 p0156 p0 p4 p237 p01 p23 p6 p237 p4 p6 p05 p05 r 1 1 p5 m r m r m 2 1 2 1 1 1 1 1 1 2 3 1 3 1 2 1 1 1 2 2 p5 p23 p0 p0 p23 p0 p0 p23 p0 p0 p5 p5 p23 p0 p5 3 3 2 2 2 1 2 31 31 17 3 4 3 3 3 1 2 31 31 17 p5 2p5 p23 p0 p5 p23 p0 p5 p23 2p5 p23 p5 2p5 27 17 1 53-105 53-105 55-120 16-90 40-100 56 40-112 30-160 27 17 1 p0 1 1 p05 r AX m16 m16 m16 r m r m m m m Page 223 176 175 2 2 2 1 2 1 0.5 0.5 176 175 3 1 3 6 6 7 6 0 5 14-16 1 1 3 1 1 1 4-5 4-5 1 1 1 1 1 1 2 1 3 6 26-30 30-57 21 130 11 14-21 50-120 50-130 55-150 65-80 103 77 140-160 100-160 2 1 2 17 17 11 130 11 4-7 0.5 Skylake WAIT FNCLEX FNINIT 2 5 18 2 5 18 p05 p156 2 22 78 Integer MMX and XMM instructions Instruction Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVQ MOVQ MOVDQA/U MOVDQA/U MOVDQA/U VMOVDQA/U VMOVDQA/U VMOVDQA/U LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ VMOVNTDQ 
MOVNTDQA VMOVNTDQA PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PACKUSDW PACKUSDW PUNPCKH/L BW/WD/DQ PUNPCKH/L BW/WD/DQ PUNPCKH/L QDQ PUNPCKH/L QDQ Operands r32/64,(x)mm Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments r64,(x)mm (x)mm,r64 mm,mm x,x (x)mm,m64 m64, (x)mm x,x x, m128 m128, x y,y y,m256 m256,y x, m128 mm, x x,mm m64,mm m128,x m256,y x, m128 y,m256 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 1 2 1 1 1 1 1 2 1 1 2 1 1 2 1 2 2 2 2 2 2 2 p0 p237 p4 p5 p23 p0 p5 p05 p015 p23 p237 p4 p015 p23 p237 p4 p015 p23 p237 p4 p23 p0 p5 p0 p15 p237 p4 p237 p4 p237 p4 p23 p015 p23 p015 2 3 2 2 2 1 1 1 2 3 0-1 2 3 0-1 3 3 3 2 2 ~418 ~450 ~400 3 3 1 1 1 0.5 1 1 0.5 0.33 0.5 1 0.25 0.5 1 0.25 0.5 1 0.5 1 1 1 1 1 0.5 0.5 mm,mm 3 3 p5 2 2 mm,m64 3 3 p23 2p5 x,x / y,y,y 1 1 p5 x,m / y,y,m x,x / y,y,y x,m / y,y,m 1 1 1 2 1 2 v,v / v,v,v 1 v,m / v,v,m m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 may eliminate AVX AVX SSE3 AVX2 SSE4.1 AVX2 2 1 1 p23 p5 p5 p23 p5 1 1 1 1 1 p5 1 1 1 2 p23 p5 x,x / y,y,y 1 1 p5 x,m / y,y,m 1 2 p23 p5 Page 224 may eliminate 1 1 1 1 SSE4.1 SSE4.1 Skylake PMOVSX/ZX BW BD BQ DW DQ x,x 1 1 p5 PMOVSX/ZX BW BD BQ DW DQ x,m 1 2 p23 p5 VPMOVSX/ZX BW BD BQ DW DQ y,x 1 1 p5 PINSRB PINSRB PINSRW PINSRW PINSRD/Q PINSRD/Q VINSERTI128 VINSERTI128 y,m v,v / v,v,v v,m / v,v,m mm,mm,i mm,m64,i v,v,i v,m,i v,v,i v,m,i v,v,i / v,v,v,i v,m,i / v,v,m,i x,x,xmm0 x,m,xmm0 v,v,v,v v,v,m,v x,x,i / v,v,v,i x,m,i / v,v,m,i v,v,v,i v,v,m,i y,y,y y,y,m y,y,i y,m,i y,y,y,i y,y,m,i mm,mm x,x v,v,m m,v,v r,v r32,x,i m8,x,i x,y,i m,y,i x,r32,i x,m8,i (x)mm,r32,i (x)mm,m16,i x,r32,i x,m32,i y,y,x,i y,y,m,i 2 1 2 1 2 1 1-2 1 2 1 2 1 2 2 3 1 2 1 2 1 1 1 2 1 2 4 10 2 3 1 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 2 3 1 2 1 2 1 2 1 2 1 2 4 10 2 3 1 2 3 1 2 2 2 2 2 2 2 1 2 p5 p23 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p015 p015 p23 2p015 2p015 p23 p5 p23 p5 p015 p015 p23 p5 p5 p23 p5 p5 
p23 p5 p5 p23 p0 p4 2p23 4p04 2p56 4p23 p23 p015 p0 p4 p23 p0 p0 p5 p23 p4 p5 p5 p23 p4 2p5 p23 p5 p5 p23 p5 2p5 p23 p5 p5 p015 p23 VPBROADCAST B/W/D/Q x,x 1 1 VPBROADCAST B/W x,m8/16 2 2 VPMOVSX/ZX BW BD BQ DW DQ PSHUFB PSHUFB PSHUFW PSHUFW PSHUFD PSHUFD PSHUFL/HW PSHUFL/HW PALIGNR PALIGNR PBLENDVB PBLENDVB VPBLENDVB VPBLENDVB PBLENDW PBLENDW VPBLENDD VPBLENDD VPERMD VPERMD VPERMQ VPERMQ VPERM2I128 VPERM2I128 MASKMOVQ MASKMOVDQU VPMASKMOVD/Q VPMASKMOVD/Q PMOVMSKB PEXTRB/W/D/Q PEXTRB/W/D/Q VEXTRACTI128 VEXTRACTI128 1 SSE4.1 1 SSE4.1 1 AVX2 AVX2 SSSE3 SSSE3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 0.33 0.5 1 1 1 1 1 1 2 6 0.5 1 1 1 1 1 1 2 1 2 1 2 1 1 0.5 p5 1 1 AVX2 p23 p5 7 1 AVX2 Page 225 1 3 1 1 1 1 1 1 2 1 1 3 3 3 ~450 18-500 4 14 2-3 3 3 4 3 3 3 SSSE3 SSSE3 SSE4.1 SSE4.1 AVX2 AVX2 SSE4.1 SSE4.1 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 SSE4.1 SSE4.1 AVX2 AVX2 SSE4.1 SSE4.1 SSE4.1 SSE4.1 AVX2 AVX2 Skylake VPBROADCAST D/Q x,m32/64 1 1 p23 4 0.5 AVX2 VPBROADCAST B/W/D/Q y,x 1 1 p5 3 1 AVX2 VPBROADCAST B/W y,m8/16 2 2 p23 p5 7 1 AVX2 y,m32/64 y,m128 x,[r+s*x],x y,[r+s*y],y x,[r+s*x],x x,[r+s*y],x x,[r+s*x],x y,[r+s*x],y x,[r+s*x],x y,[r+s*y],y 1 1 4 4 5 4 5 4 5 4 1 1 4 4 5 4 5 4 5 4 p23 p23 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 3 3 0.5 0.5 4 5 2 4 2 4 2 4 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 PADD/SUB(S,US) B/W/D/Q v,v / v,v,v 1 1 p015 1 0.33 PADD/SUB(S,US) B/W/D/Q v,m / v,v,m 1 2 p015 p23 v,v / v,v,v 3 3 p01 2p5 v,m / v,v,m 4 4 p01 2p5 p23 mm,mm 1 1 p0 1 1 x,x / y,y,y 1 1 p01 1 0.5 x,m / y,y,m v,v / v,v,v v,m / v,v,m v,v / v,v,v v,m / v,v,m 1 1 1 1 1 2 1 2 1 2 p01 p23 p01 p01 p23 p5 p5 p23 mm,mm 1 1 p0 5 1 x,x / y,y,y 1 1 p01 5 0.5 x,m / y,y,m mm,mm x,x / y,y,y x,m / y,y,m x,x / y,y,y x,m / y,y,m x,x / y,y,y x,m / y,y,m mm,mm x,x / y,y,y x,m / y,y,m 1 1 1 1 2 3 1 1 1 1 1 2 1 1 2 2 3 1 2 1 1 2 p01 p23 p0 p01 p01 p23 2p01 2p01 p23 p01 p01 p23 p0 p01 p01 p23 VPBROADCAST D/Q 
VBROADCASTI128 VPGATHERDD VPGATHERDD VPGATHERQD VPGATHERQD VPGATHERDQ VPGATHERDQ VPGATHERQQ VPGATHERQQ Arithmetic instructions PHADD(S)W/D PHSUB(S)W/D PHADD(S)W/D PHSUB(S)W/D PCMPEQB/W/D PCMPGTB/W/D PCMPEQB/W/D PCMPGTB/W/D PCMPEQB/W/D PCMPGTB/W/D PCMPEQQ PCMPEQQ PCMPGTQ PCMPGTQ PMULL/HW PMULHUW PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULHRSW PMULLD PMULLD PMULDQ PMULDQ PMULUDQ PMULUDQ PMULUDQ Page 226 0.5 3 1 3 5 5 10 5 5 5 2 SSSE3 2 SSSE3 0.5 0.5 0.5 1 1 0.5 1 0.5 0.5 1 1 0.5 0.5 1 0.5 0.5 SSE4.1 SSE4.1 SSE4.2 SSE4.2 SSSE3 SSSE3 SSSE3 SSE4.1 SSE4.1 SSE4.1 SSE4.1 Skylake PMADDWD PMADDWD PMADDWD PMADDUBSW PMADDUBSW PMADDUBSW PAVGB/W PAVGB/W PAVGB/W PMIN/PMAX SB/SW/SD UB/UW/UD PMIN/PMAX SB/SW/SD UB/UW/UD PMIN/PMAX SB/SW/SD UB/UW/UD PHMINPOSUW PHMINPOSUW PABSB/W/D PABSB/W/D PABSB/W/D PSIGNB/W/D PSIGNB/W/D PSIGNB/W/D PSADBW PSADBW MPSADBW MPSADBW Logic instructions PAND PANDN POR PXOR PAND PANDN POR PXOR PAND PANDN POR PXOR PTEST PTEST PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q mm,mm x,x / y,y,y x,m / y,y,m mm,mm x,x / y,y,y x,m / y,y,m mm,mm x,x / y,y,y x,m / y,y,m 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 p0 p01 p01 p23 p0 p01 p01 p23 p0 p01 p01 p23 5 5 mm,mm 1 1 p0 x,x / y,y,y 1 1 p01 x,m / y,y,m x,x x,m128 mm,mm x,x / y,y x,m / y,m mm,mm x,x / y,y,y x,m / y,y,m v,v / v,v,v v,m / v,v,m x,x,i / v,v,v,i x,m,i / v,v,m,i 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 2 1 1 2 1 1 2 1 2 2 3 p01 p23 p0 p0 p23 p0 p01 p01 p23 p0 p01 p01 p23 p5 p5 p23 2p5 2p5 p23 mm,mm 1 1 p05 1 0.5 x,x / y,y,y 1 1 p015 1 0.33 v,m / v,v,m v,v v,m 1 2 2 2 2 3 p015 p23 p0 p5 p0 p5 p23 3 0.5 1 1 mm,mm 1 1 p0 1 1 mm,m64 2 2 p0 p23 x,x / v,v,x 2 2 p01 p5 x,m / v,v,m 2 2 p01 p23 Page 227 1 0.5 0.5 1 0.5 0.5 1 0.5 0.5 SSSE3 SSSE3 SSSE3 1 1 SSE4.1 1 0.5 SSE4.1 0.5 1 1 1 0.5 0.5 1 0.5 0.5 1 1 2 2 SSE4.1 SSE4.1 SSE4.1 SSSE3 SSSE3 SSSE3 SSSE3 SSSE3 SSSE3 5 5 1 1 4 1 1 1 1 3 4 1 1 1 0.5 SSE4.1 SSE4.1 SSE4.1 SSE4.1 Skylake PSLLW/D/Q 
PSRLW/D/Q PSRAW/D/Q PSLLW/D/Q PSRLW/D/Q PSRAW/D/Q VPSLLVD/Q VPSRAVD VPSRLVD/Q VPSLLVD/Q VPSRAVD VPSRLVD/Q PSLLDQ PSRLDQ String instructions PCMPESTRI PCMPESTRI PCMPESTRM PCMPESTRM PCMPISTRI PCMPISTRI PCMPISTRM PCMPISTRM mm,i 1 1 p0 1 1 x,i / y,y,i 1 1 p01 1 0.5 v,v,v 1 1 p01 1 0.5 AVX2 v,v,m 1 2 p01 p23 0.5 AVX2 x,i / v,v,i 1 1 p5 1 1 x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i x,x,i x,m128,i 8 8 9 9 3 4 3 4 8 8 9 9 3 4 3 4 6p05 2p16 12 3p0 2p16 4p5 6p05 2p16 p23 3p0 3p0 p23 3p0 3p0 p23 4 4 5 5 3 3 3 3 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 SSE4.2 1 2 1 2 p5 p5 p23 7 1 1 CLMUL CLMUL 1 1 p0 4 1 AES 2 2 3 2 2 3 p0 p23 2p0 2p0 p23 8 1.5 2 2 AES AES AES 13 13 p0 p5 12 12 AES 13 13 12 AES 10 10 Encryption instructions PCLMULQDQ x,x,i PCLMULQDQ x,m,i AESDEC, AESDECLAST, AESENC, AESENCLAST x,x AESDEC, AESDECLAST, AESENC, AESENCLAST x,m AESIMC x,x AESIMC x,m AESKEYGENAS SIST x,x,i AESKEYGENAS SIST x,m,i Other EMMS 3p0 2p16 2p5 p23 p05 9 12 9 6 Floating point XMM and YMM instructions Instruction Operands Reciproµops µops cal fused unfused through domain domain µops each port Latency put Comments Page 228 Skylake Move instructions MOVAPS/D VMOVAPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVAPS/D MOVUPS/D VMOVAPS/D VMOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVHPS/D MOVLPS/D MOVLPS/D MOVHLPS MOVLHPS MOVMSKPS/D VMOVMSKPS/D MOVNTPS/D VMOVNTPS/D SHUFPS/D SHUFPS/D VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERMILPS/PD VPERM2F128 VPERM2F128 VPERMPS VPERMPS VPERMPD VPERMPD BLENDPS/PD BLENDPS/PD BLENDVPS/PD BLENDVPS/PD VBLENDVPS/PD VBLENDVPS/PD MOVDDUP MOVDDUP VBROADCASTSS VBROADCASTSS VBROADCASTSS VBROADCASTSS VBROADCASTSD VBROADCASTSD VBROADCASTF128 MOVSH/LDUP x,x y,y 1 1 1 1 p015 p015 0-1 0-1 0.25 0.25 x,m128 1 1 p23 2 0.5 y,m256 1 1 p23 3 0.5 m128,x 1 2 p237 p4 3 1 m256,y x,x x,m32/64 m32/64,x x,m64 m64,x x,m64 m64,x x,x x,x r32,x r32,y m128,x m256,y x,x,i / v,v,v,i x,m,i / v,v,m,i v,v,i v,m,i v,v,v v,v,m y,y,y,i y,y,m,i y,y,y y,y,m y,y,i y,m,i x,x,i / v,v,v,i x,m,i / 
v,v,m,i x,x,xmm0 x,m,xmm0 v,v,v,v v,v,m,v v,v v,m x,m32 y,m32 x,x y,x y,m64 y,x y,m128 v,v 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 1 1 2 1 2 1 2 2 3 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 2 1 1 1 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 3 1 1 1 1 1 1 1 1 1 1 p237 p4 p5 p23 p237 p4 p23 p5 p4 p237 p23 p5 p4 p237 p5 p5 p0 p0 p4 p237 p4 p237 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p5 p5 p23 p015 p015 p23 p015 p015 p23 2p015 2p015 p23 p5 p23 p23 p23 p5 p5 p23 p5 p23 p5 3 1 3 3 4 3 4 3 1 1 2 3 ~400 ~400 1 1 1 0.5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.33 0.5 1 1 1 1 1 0.5 0.5 0.5 1 1 0.5 1 0.5 1 Page 229 1 1 3 3 3 1 1 2 1 3 2 3 1 3 3 3 3 1 may eliminate may eliminate AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX2 AVX2 AVX2 AVX2 SSE4.1 SSE4.1 SSE4.1 SSE4.1 AVX AVX SSE3 SSE3 AVX AVX AVX2 AVX2 AVX AVX2 AVX SSE3 Skylake MOVSH/LDUP UNPCKH/LPS/D UNPCKH/LPS/D EXTRACTPS EXTRACTPS VEXTRACTF128 VEXTRACTF128 INSERTPS INSERTPS VINSERTF128 VINSERTF128 VMASKMOVPS/D VMASKMOVPS/D VMASKMOVPS/D VPGATHERDPS VPGATHERDPS VPGATHERQPS VPGATHERQPS VPGATHERDPD VPGATHERDPD VPGATHERQPD VPGATHERQPD Conversion CVTPD2PS CVTPD2PS VCVTPD2PS VCVTPD2PS CVTSD2SS CVTSD2SS CVTPS2PD CVTPS2PD VCVTPS2PD VCVTPS2PD CVTSS2SD CVTSS2SD CVTDQ2PS CVTDQ2PS VCVTDQ2PS VCVTDQ2PS CVT(T) PS2DQ CVT(T) PS2DQ VCVT(T) PS2DQ VCVT(T) PS2DQ CVTDQ2PD CVTDQ2PD VCVTDQ2PD VCVTDQ2PD CVT(T)PD2DQ CVT(T)PD2DQ VCVT(T)PD2DQ VCVT(T)PD2DQ v,m x,x / v,v,v x,m / v,v,m r32,x,i m32,x,i x,y,i m128,y,i x,x,i x,m32,i y,y,x,i y,y,m128,i v,v,m m128,x,x m256,y,y x,[r+s*x],x y,[r+s*y],y x,[r+s*x],x x,[r+s*y],x x,[r+s*x],x y,[r+s*x],y x,[r+s*x],x y,[r+s*y],y 1 1 1 2 2 1 2 1 2 1 2 2 4 4 4 4 5 4 5 4 5 4 1 1 2 2 3 1 2 1 2 1 2 2 4 4 4 4 5 4 5 4 5 4 p23 p5 p5 p23 p0 p5 p4 p5 p23 p5 p23 p4 p5 p23 p5 p5 p015 p23 p015 p23 p0 p4 p23 p0 p4 p23 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 p0 p1 p23 p5 x,x x,m128 x,y x,m256 x,x x,m64 x,x x,m64 y,x y,m128 x,x x,m32 x,x x,m128 y,y y,m256 x,x x,m128 y,y y,m256 
x,x x,m64 y,x y,m128 x,x x,m128 x,y x,m256 2 2 2 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 2 2 1 2 3 2 2 2 3 2 3 2 3 2 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 2 2 2 3 2 3 p01 p5 p01 p5 p23 p01 p5 p01 p5 p23 p01 p5 p01 p5 p23 p01 p5 p01 p5 p23 p01 p5 p01 p5 p23 p01 p5 p01 p5 p23 p01 p01 p23 p01 p01 p23 p01 p01 p23 p01 p01 p23 p01 p5 p01 p23 p01 p5 p01 p23 p01 p5 p01 p23 p5 p01 p5 p01 p23 p5 Page 230 3 1 5 3 6 1 4 3 5 3 13 13 12 13 5 7 5 5 7 5 4 4 4 4 5 7 5 7 0.5 1 1 1 1 1 1 1 1 1 0.5 0.5 1 1 4 5 2 4 2 4 2 4 1 1 1 1 1 1 1 0.5 1 0.5 2 2 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 1 0.5 1 1 1 1 SSE3 SSE3 SSE3 SSE4.1 SSE4.1 AVX AVX SSE4.1 SSE4.1 AVX AVX AVX AVX AVX AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX2 AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX AVX Skylake CVTPI2PS CVTPI2PS CVT(T)PS2PI CVT(T)PS2PI CVTPI2PD CVTPI2PD CVT(T) PD2PI CVT(T) PD2PI CVTSI2SS CVTSI2SS CVTSI2SS CVT(T)SS2SI CVT(T)SS2SI CVT(T)SS2SI CVTSI2SD CVTSI2SD CVT(T)SD2SI CVT(T)SD2SI VCVTPS2PH VCVTPS2PH VCVTPH2PS VCVTPH2PS Arithmetic ADDSS/D PS/D SUBSS/D PS/D ADDSS/D PS/D SUBSS/D PS/D ADDSUBPS/D ADDSUBPS/D HADDPS/D HSUBPS/D HADDPS/D HSUBPS/D MULSS/D PS/D MULSS/D PS/D DIVSS DIVPS DIVSS DIVPS DIVSD DIVPD DIVSD DIVPD VDIVPS VDIVPS VDIVPD VDIVPD RCPSS/PS RCPSS/PS CMPccSS/D CMPccPS/D CMPccSS/D CMPccPS/D (U)COMISS/D x,mm x,m64 mm,x mm,m128 x,mm x,m64 mm,x mm,m128 x,r32 x,r64 x,m32 r32,x r64,x r32,m32 x,r32/64 x,m32 r32/64,x r32,m64 x,v,i m,v,i v,x v,m 2 1 2 2 2 1 2 2 2 3 1 2 3 3 2 1 2 3 2 3 2 1 2 2 2 2 2 2 2 3 2 3 2 2 3 3 2 2 2 3 2 3 2 2 p0 p1 p01 p23 p0 p5 p0 p23 p01 p5 p01 p23 p01 p5 p01 p23 p5 p01 p5 p01 2p5 p1 p23 2p01 2p01 p5 2p01 p23 p01 p5 p01 p23 p0 p1 2p01 p23 p01 p5 p01 p4 p23 p01 p5 p01 p23 x,x / v,v,v 1 1 p01 4 0.5 x,m / v,v,m x,x / v,v,v x,m / v,v,m 1 1 1 2 1 2 p01 p23 p01 p01 p23 4 0.5 0.5 0.5 SSE3 SSE3 x,x / v,v,v 3 3 p01 2p5 6 2 SSE3 x,m / v,v,m x,x / v,v,v x,m / v,v,m x,x x,x x,m x,x x,x x,m y,y,y y,y,m256 y,y,y y,y,m256 v,v v,m 4 1 1 1 1 1 1 1 1 1 1 1 4 1 1 4 1 2 1 1 2 1 1 2 1 2 1 4 1 2 p1 2p5 p23 p01 p01 p23 p0 p0 
p0 p23 p0 p0 p0 p23 p0 p0 p23 p0 p0 p23 p0 p0 p23 2 0.5 0.5 3 3 3-5 4 4 4 5 5 8 8 1 1 SSE3 x,x / v,v,v 1 1 p01 x,m / v,v,m x,x 2 1 2 1 p01 p23 p0 Page 231 6 7 5 5 6 7 6 7 6 6 5-7 5-7 4 11 11 13-14 13-14 11 13-14 4 4 2 3 1 1 1 0.5 1 1 2 2 3 1 1 1 2 2 1 1 1 1 1 1 0.5 0.5 1 F16C F16C F16C F16C AVX AVX AVX AVX Skylake (U)COMISS/D MAXSS/D PS/D MINSS/D PS/D MAXSS/D PS/D MINSS/D PS/D x,m32/64 2 2 p0 p23 x,x / v,v,v 1 1 p01 x,m / v,v,m 1 2 p01 p23 ROUNDSS/D PS/D v,v,i 2 2 2p01 8 ROUNDSS/D PS/D v,m,i x,x,i / v,v,v,i x,m,i / v,v,m,i x,x,i x,m128,i 3 4 6 3 4 3 4 6 3 4 2p01 p23 3p01 p5 13 3p01 p23 p5 p6 2p01 p5 2p01 p23 p5 9 v,v,v 1 1 p01 4 v,v,m 1 2 p01 p23 x,x x,m128 y,y y,m256 x,x x,x x,m128 y,y y,m256 v,v v,m 1 1 1 4 1 1 1 1 4 1 1 1 2 1 4 1 1 2 1 4 1 2 p0 p0 p23 p0 p0 p23 p0 p0 p0 p23 p0 p0 p23 p0 p0 p23 AND/ANDN/OR/XO RPS/PD x,x / v,v,v 1 1 p015 AND/ANDN/OR/XO RPS/PD x,m / v,v,m 1 2 p015 p23 0.5 Other VZEROUPPER 4 4 none 1 VZEROALL 25 25 p0 p1 p5 p6 12 34 4 3 106 136 105 121 247 304 257 257 168 34 4 4 p0 p1 p5 p6 p0 p5 p6 p23 p0 p4 p6 p237 DPPS DPPS DPPD DPPD VFMADD... (all FMA instr.) VFMADD... (all FMA instr.) Math SQRTSS/PS SQRTSS/PS VSQRTPS VSQRTPS SQRTSD SQRTPD SQRTSD/PD VSQRTPD VSQRTPD RSQRTSS/PS RSQRTSS/PS 1 4 0.5 0.5 12 12 15-16 15-16 15-16 4 1 SSE4.1 1 1.5 1.5 1 1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 SSE4.1 0.5 FMA 0.5 FMA 3 3 6 6 4-6 4-6 4-6 9-12 9-12 1 1 AVX AVX AVX AVX Logic VZEROALL LDMXCSR STMXCSR FXSAVE FXSAVE FXRSTOR FXRSTOR XSAVE XSAVE XRSTOR XRSTOR XSAVEOPT m32 m32 m4096 m4096 m4096 m4096 m Page 232 1 5 5 78 64 76 77 107 107 122 122 74 0.33 12 3 2 78 64 76 77 107 107 122 122 74 AVX AVX, 32 bit AVX, 64 bit 32 bit mode 64 bit mode 32 bit mode 64 bit mode 32 bit mode 64 bit mode 32 bit mode 64 bit mode Pentium 4 Intel Pentium 4 List of instruction timings and μop breakdown This list is measured for a Pentium 4, model 2. 
Timings for model 3 may be more like the values for P4E, listed on the next sheet.

Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
μops: Number of μops issued from instruction decoder and stored in trace cache.
Microcode: Number of additional μops issued from microcode ROM.
Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under “How the values were measured”.
Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.
Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.
Port: The port through which each μop goes to an execution unit. Two independent μops can start to execute simultaneously only if they are going through different ports.
Execution unit: Use this information to determine additional latency. When an instruction with more than one μop uses more than one execution unit, only the first and the last execution unit is listed.
Execution subunit: Throughput measures apply only to instructions executing in the same subunit.
Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.

Integer instructions
alu0/1 alu0/1 load load store store 86 86 86 86 86 86 86 86 sse2 386 386 386 386 ppro 86 86 86 86 186 86 86 86 186 86 86 86 86 186 86 86 386 386 386 86 86 86 86 486 alu0/1 load alu0 alu0/1 alu0/1 alu0/1 int,alu int,alu int,alu int alu0/1 int int,alu 86 sse sse Notes Instruction set r,m r r,r/i m m Subunit r,[r+r/i] r,[r+r+i] r,[r*i] r,[r+r*i] r,[r+r*i+i] 0 0.5 0.5-1 0.25 0/1 0 0.5 0.5-1 0.25 0/1 0 2 0 1 2 0 3 0 1 2 0 1 2 0 0 2 0,3 2 6 4 12 0 14 0 ≈33 0 0.5 0.5-1 0.25 0/1 0 2 0 1 2 0 0.5 0.5-1 0.5 0 0 3 0.5-1 1 2,0 0 6 0 3 0 1.5 0.5-1 1 0/1 8 >100 0 3 0 1 2 0 1 2 0 2 4 7 4 10 10 19 0 1 0 1 8 14 5 13 8 52 16 14 0 0.5 0.5-1 0.25 0/1 0 1 0.5-1 0.5 0/1 0 4 0.5-1 1 1 0 4 0.5-1 1 1 0 4 0.5-1 1 1 0 4 0 4 1 0 0.5 0.5-1 0.5 0/1 0 5 0 1 1 7 15 0 7 0 2 64 >1000 2 6 2 6 Execution unit r m sr Port Reciprocal throughput r i m sr Additional latency 1 1 1 2 1 3 4 4 2 1 1 1 2 3 3 4 4 2 2 3 4 4 4 2 4 4 4 4 1 2 3 2 3 1 1 3 4 3 8 4 4 Latency r,r r,i r32,m r8/16,m m,r m,i r,sr sr,r/m m,r32 r,r r,m r,r r,m r,r/m r,r r,m Microcode Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX MOVZX MOVSX MOVSX CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D) PUSHA(D) POP POP POP POPF(D) POPA(D) LEA LEA LEA LEA LEA LAHF SAHF SALC LDS, LES, ...
BSWAP IN, OUT PREFETCHNTA PREFETCHT0/1/2 Operands μops Instruction c b, c a, q c c a, e d Pentium 4 SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC INC, DEC NEG NEG AAA, AAS DAA, DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV IDIV IDIV IDIV CBW CWD, CDQ CWDE Logic instructions AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT 4 4 4 r,r r,m m,r r,r r,i r,m m,r r,r r,m r m r m r8/32 r16 m8/32 m16 r32,r r32,(r),i r16,r r16,r,i r16,m16 r32,m32 r,m,i r8/m8 r16/m16 r32/m32 r8/m8 r16/m16 r32/m32 r,r r,m m,r r,r r,m r 1 2 3 4 3 4 4 1 2 2 4 1 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 1 1 2 3 1 2 1 2 2 2 40 38 100 0 0.5 0.5-1 0.25 0 1 0.5-1 1 0 ≥8 ≥4 4 6 0 6 0 6 0 6 6 8 0 8 7 ≥9 8 0 0.5 0.5-1 0.25 0 1 0.5-1 1 0 0.5 0.5-1 0.5 0 4 ≥4 0 0.5 0.5-1 0.5 0 ≥3 27 90 57 100 10 22 22 56 6 16 0 8 7 17 0 8 7-8 16 0 8 10 16 0 8 0 14 0 4.5 0 14 0 4.5 5 16 0 9 5 15 0 8 7 15 0 10 0 14 0 8 7 14 0 10 20 61 0 24 18 53 0 23 21 50 0 23 24 61 0 24 22 53 0 23 20 50 0 23 0 1 0.5-1 1 0 1 0.5-1 0.5 0 0.5 0.5-1 0.5 0 0 0 0 0 0 0.5 ≥1 ≥8 0.5 ≥1 0.5 0.5-1 0.5 0.5-1 ≥ 1 ≥4 0.5-1 0.5 0.5-1 ≥ 1 0.5-1 0.5 Page 235 sse sse2 sse2 0/1 alu0/1 1 1 1 int,alu int,alu int,alu 0/1 alu0/1 0/1 alu0/1 0 alu0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0/1 0 int int int int int int int int int int int int int int int int int int int alu0 alu0/1 alu0 0 alu0 0 alu0 0 alu0 fpmul fpdiv fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpmul fpdiv fpdiv fpdiv fpdiv fpdiv fpdiv 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 386 386 386 186 386 386 186 86 86 386 86 86 386 86 86 386 86 86 86 86 86 86 c c c c c a a a a a c c c c c Pentium 4 NOT SHL, SHR, SAR SHL, SHR, SAR ROL, ROR ROL, ROR RCL, RCR RCL, RCR RCL, RCR SHL,SHR,SAR,ROL, ROR RCL, RCR RCL, RCR SHLD, SHRD SHLD, SHRD BT BT BT BT BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BSF, BSR BSF, BSR SETcc SETcc 
CLC, STC CMC CLD STD CLI STI m r,i r,CL r,i r,CL r,1 r,i r,CL 4 1 2 1 2 1 4 4 0 0 0 0 0 0 15 15 1 0 1 0 1 0 0 ≥4 1 1 1 1 1 15 14 m,i/CL m,1 m,i/CL r,r,i/CL m,r,i/CL r,i r,r m,i m,r r,i r,r m,i m,r r,r r,m r m 4 4 4 4 4 3 2 4 4 3 2 4 4 2 3 3 4 3 3 4 4 4 4 7-8 10 0 7 10 0 18 18-28 14 14 0 18 14 0 0 4 0 0 4 0 0 4 0 12 12 0 0 6 0 0 6 0 7 18 0 15 14 0 0 4 0 0 4 0 0 5 0 0 5 0 0 10 0 0 10 0 7 52 0 5 48 0 5 35 12 43 1 4 3 3 4 1 4 4 3 4 4 4 4 4 4 4 0 28 0 0 31 0 4 4 0 34 4 4 38 0 0 33 Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF 4 6 4 6 4 16 16 0 118 4 4 11 0 0 0 2 0 8 9 2 2 11 Page 236 1 1 1 1 1 1 1 int int int int int int int mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh 10 10 14 14 14 2 1 2 12 2 4 8 14 2 3 1 3 2 2 52 48 35 43 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 int int int int int int int int int int int int int int int int int mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh mmxsh 1 118 4 4 11 2-4 2-4 2-4 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 alu0 branch alu0 alu0 branch branch alu0 alu0 alu0 alu0 branch branch branch branch alu0 alu0 branch branch alu0 alu0 branch branch 86 186 86 186 86 86 186 86 86 86 86 386 386 386 386 386 386 386 386 386 386 386 386 386 386 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 d d d d d d d d d d d d d d Pentium 4 RETF IRET ENTER ENTER LEAVE BOUND INTO INT i 4 33 11 4 48 24 4 12 26 4 45+24n 4 0 3 4 14 14 4 5 18 4 84 644 i,0 i,n m i String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE CPUID RDTSC Notes: a) b) c) d) e) q) 4 4 4 4 4 4 4 4 4 4 0 0 26 128+16n 3 14 18 3 6 5n ≈ 4n+36 2 6 2n+3 ≈ 3n+10 4 6 ≈163+1.1n 3 ≈ 40+6n ≈4n 5 ≈ 50+8n ≈4n 1 0 0 1 0 0 4 2 4 39-81 4 7 86 86 186 186 186 186 86 86 6 86 86 6 86 86 4 86 86 6 86 86 8 86 86 0.25 0/1 0.25 0/1 alu0/1 alu0/1 200-500 80 86 ppro 
sse2 p5 p5 Add 1 μop if source is a memory operand. Uses an extra μop (port 3) if SIB byte used. A SIB byte is needed if the memory operand has more than one pointer register, or a scaled index, or ESP is used as base pointer. Add 1 μop if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH). Has (false) dependence on the flags in most cases. Not available on PMMX Latency is 12 in 16-bit real or virtual mode, 24 in 32-bit protected mode. Floating point x87 instructions 0 2 mov load Notes Page 237 1 1 Instruction set 0 0 Subunit 6 ≈7 Execution unit 0 0 Port Reciprocal throughput 1 1 Additional latency r m32/64 Latency Microcode Move instructions FLD FLD Operands μops Instruction 87 87 Pentium 4 FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FIST FIST FISTP FLDZ FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD,FSUB(R) FIADD,FISUB(R) FIADD,FISUB(R) FMUL(P) FMUL FIMUL FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FIDIV(R) FABS FCHS FCOM(P), FUCOM(P) FCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 m80 m80 r m32/64 m80 m80 r m16 m32/64 m16 m32/64 m st0,r r AX AX m16 m16 m16 r m m16 m32 r m m16 m32 r m m16 m32 r m r m16 m32 3 3 1 2 3 3 1 3 2 3 2 3 1 2 4 3 1 4 6 4 4 4 4 75 0 0 8 311 0 3 0 0 0 0 0 0 0 0 0 0 0 4 4 7 1 2 3 3 1 2 3 3 1 2 3 3 1 1 1 2 2 3 4 3 1 1 3 6 6 0 0 4 0 0 0 4 0 0 0 4 0 0 0 0 0 0 0 4 0 0 0 15 84 84 6 ≈7 0 0 ≈ 10 ≈ 10 ≈ 10 ≈ 10 ≈ 10 0 2-4 1 0 11 11 0 0 0 (3) 5 5 6 5 7 7 7 7 43 43 43 43 2 2 2 2 2 10 1 1 0 1 1 1 1 1 0 0 0 0 1 1 0 0 0 0 2 2 2 23 212 212 0 0 0 0 Page 238 6 2 90 2 1 0 2-3 0 8 0 400 0 1 0 6 2 1 2 2-4 0 2-3 0 2-4 0 2 0 2 0 4 1 4 0 1 0 3 1 3 1 6 0 6 0 (8) 0,2 1 1 6 2 2 2 6 2 43 43 43 43 1 1 1 1 1 3 6 2 1 1 15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0,1 1 1,2 1 1 0,1 1 1 load load mov store store store mov load load store store store mov mov fp mov mov fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp 87 87 87 87 87 
87 87 87 87 87 87 87 87 87 PPro 87 87 287 287 87 87 87 add add add add mul mul mul mul div div div div misc misc misc misc misc misc misc misc misc misc 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 PPro 87 87 87 87 87 87 387 e f g, h g, h g, h g, h Pentium 4 Math FSQRT FLDPI, etc. FSIN FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR Notes: e) f) g) h) i) 1 2 6 6 7 6 3 3 3 3 3 11 0 43 0 ≈150 ≈180 ≈175 ≈207 ≈178 ≈216 ≈160 ≈230 92 ≈187 24 57 15 20 45 ≈165 60 ≈200 134 ≈242 0 1 2 4 6 4 4 4 4 0 0 4 29 174 96 69 94 0 0 1 0 43 3 ≈170 ≈207 ≈211 ≈200 ≈153 66 20 63 90 ≈220 456 528 132 208 1 1 1 1 1 1 1 1 1 1 1 1 fp fp fp fp fp fp fp fp fp fp fp fp 1 0 1 0 96 1 172 420 0,1 532 96 208 div 87 87 387 387 387 87 87 87 87 87 87 87 mov mov 87 87 87 87 87 87 sse sse g, h i i Not available on PMMX The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput is 143. Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 23, double precision: 38, long double precision (default): 43. Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit. Takes 6 μops more and 40-80 clocks more when XMM registers are disabled. 
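Note f above can be read as a small state machine. The following Python sketch is one possible interpretation of that note, not a measured model. It assumes that the first FLDCW in a sequence pays the slow cost, because there is no earlier control word to compare against. The function name `fldcw_latencies` and the control word values are illustrative only.

```python
FAST_CYCLES = 3    # note f: new value equals the control word before the preceding FLDCW
SLOW_CYCLES = 143  # note f: all other cases


def fldcw_latencies(initial_cw, loads):
    """Estimate the latency of each FLDCW in `loads`, following note f.

    initial_cw: control word in effect before the first FLDCW.
    loads:      sequence of control word values loaded, in program order.
    Assumption: the first FLDCW is slow, since nothing precedes it.
    """
    before_preceding = None  # control word before the preceding FLDCW
    current = initial_cw     # control word currently in effect
    latencies = []
    for new_cw in loads:
        if before_preceding is not None and new_cw == before_preceding:
            latencies.append(FAST_CYCLES)
        else:
            latencies.append(SLOW_CYCLES)
        before_preceding, current = current, new_cw
    return latencies


# Alternating between two values: only the first load is slow.
alternating = fldcw_latencies(0x037F, [0x0F7F, 0x037F, 0x0F7F, 0x037F])

# Cycling through three values: every load is slow.
cycling = fldcw_latencies(0x037F, [0x0F7F, 0x077F, 0x037F, 0x0F7F])

print(sum(alternating), sum(cycling))  # 152 572
```

Under this reading, code that switches the rounding mode for a conversion and then restores the previous value stays on the fast path, while code that rotates through three or more control word values does not.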
Integer MMX and XMM instructions 0 1 2 0 fp mmx load fp alu mmx mmx mmx sse2 Notes Page 239 1 2 1 2 Instruction set 1 0 0 1 Subunit 5 2 ≈8 10 Execution unit 0 0 0 0 Port Reciprocal throughput 2 2 1 2 Additional latency r32, mm mm, r32 mm,m32 r32, xmm Latency Microcode Move instructions MOVD MOVD MOVD MOVD Operands μops Instruction Pentium 4 MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKHBW/WD/DQ/ QDQ PUNPCKLBW/WD/DQ/Q DQ PSHUFD PSHUFL/HW PSHUFW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRW PINSRW PINSRW Arithmetic instructions PADDB/W/D PADD(U)SB/W PSUBB/W/D PSUB(U)SB/W PADDQ, PSUBQ PADDQ, PSUBQ PCMPEQB/W/D PCMPGTB/W/D PMULLW PMULHW PMULHUW PMADDWD PMULUDQ PAVGB/W xmm, r32 xmm,m32 m32, r mm,mm xmm,xmm r,m64 m64,r xmm,xmm xmm,m m,xmm xmm,m m,xmm mm,xmm xmm,mm m,mm m,xmm 2 1 2 1 1 1 2 1 1 2 4 4 3 2 3 2 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 6 ≈8 ≈8 6 2 ≈8 ≈8 6 ≈8 ≈8 2 1 2 1 2 1 2 1 1 2 2 2 2 2 75 18 1 2 0,1 0 1 2 0 0 2 0 2 0 0,1 0,1 0 0 mm,r/m 1 0 2 1 1 1 mmx shift mmx a xmm,r/m 1 0 4 1 2 1 mmx shift mmx a mm,r/m 1 0 2 1 1 1 mmx shift mmx a xmm,r/m 1 0 4 1 2 1 mmx shift sse2 a xmm,r/m xmm,xmm,i xmm,xmm,i mm,mm,i mm,mm xmm,xmm r32,r r32,mm,i r32,xmm,i mm,r32,i xmm,r32,i 1 1 1 1 4 4 2 3 3 2 2 0 0 0 0 4 6 0 0 0 0 0 2 4 2 2 1 1 1 1 shift shift shift shift sse2 sse2 sse2 mmx sse sse2 a 1 1 1 1 1 1 1 1 1 0 0 0,1 1 1 1 1 mmx mmx mmx mmx mov mov 7 8 9 3 4 2 2 2 1 7 10 3 2 2 2 2 mmx-alu0 mmx-int mmx-int int-mmx int-mmx sse sse sse2 sse sse2 r,r/m 1 0 2 1 1,2 1 mmx alu mmx a,j r,r/m mm,r/m xmm,r/m 1 1 1 0 0 0 2 2 4 1 1 1 1,2 1 2 1 1 1 mmx mmx fp alu alu add mmx sse2 sse2 a,j a a r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 1 1 1 1 1 0 0 0 0 0 0 2 6 6 6 6 2 1 1 1 1 1 1 1,2 1,2 1,2 1,2 1,2 1,2 1 1 1 1 1 1 mmx fp fp fp fp mmx alu mul mul mul mul alu mmx mmx sse mmx sse2 sse a,j a,j a,j a,j a,j a,j 8 8 1 0 0 1 0 1 1 Page 240 mmx load mov mmx load mov mov load mov load mov 
mov-mmx mov-mmx shift shift sse2 sse2 mmx mmx sse2 mmx mmx sse2 sse2 sse2 sse2 sse2 k k sse2 sse2 mov mov sse sse2 Pentium 4 PMIN/MAXUB PMIN/MAXSW PSADBW Logic PAND, PANDN POR, PXOR PSLL/RLW/D/Q, PSRAW/D PSLLDQ, PSRLDQ Other EMMS Notes: a) j) k) r,r/m r,r/m r,r/m 1 1 1 0 0 0 2 2 4 1 1 1 1,2 1,2 1,2 1 1 1 mmx mmx mmx alu alu alu sse sse sse a,j a,j a,j r,r/m r,r/m 1 1 0 0 2 2 1 1 1,2 1,2 1 1 mmx mmx alu alu mmx mmx a,j a,j r,i/r/m xmm,i 1 1 0 0 2 4 1 1 1,2 2 1 1 mmx mmx shift shift mmx sse2 a,j a 4 11 12 12 0 mmx Add 1 μop if source is a memory operand. Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands. It may be advantageous to replace this instruction by two 64-bit moves Floating point XMM instructions 2 2 2 1 1 1 0 2 0 0 2 0 1 1 2 0 1 1 0 4 2 sse 0 0 0 0 0 0 2 4 3 2 2 2 0 0 1 1 1 1 sse sse/2 sse sse sse sse 2 2 ≈7 0 1 0 4 2 0 0 mov mmx mmx shift shift mmx mmx shift shift sse sse sse sse sse sse sse sse sse sse sse sse MOVHPS/D, MOVLPS/D MOVNTPS/D MOVMSKPS/D SHUFPS/D UNPCKHPS/D UNPCKLPS/D 6 4 4 2 1 1 1 1 Page 241 fp mmx mmx mmx shift shift shift Notes m,r m,r r32,r r,r/m,i r,r/m r,r/m 1 1 2 1 2 8 2 2 1 2 2 2 0 mov Instruction set 3 0 0 Subunit r,m 6 ≈7 ≈7 6 Execution unit 0 0 0 0 0 6 0 0 0 0 0 0 Port Reciprocal throughput 1 1 2 1 4 4 1 1 1 2 1 1 Additional latency r,r r,m m,r r,r r,m m,r r,r r,r r,m m,r r,r r,r Operands Latency Microcode Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS MOVSD MOVSS, MOVSD MOVSS, MOVSD MOVHLPS MOVLHPS MOVHPS/D, MOVLPS/D μops Instruction k k Pentium 4 Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 4 2 4 4 1 3 1 2 4 4 3 3 3 4 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 10 14 10 4 9 4 9 10 11 7 11 10 15 8 8 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 4 2 6 6 2 4 2 2 4 5 
2 3 3 6 2.5 2.5 1 1 1 1 1 1 1 1 1 1 0,1 0,1 1 1 1 1 mmx fp-mmx mmx mmx fp mmx-fp fp fp-mmx mmx fp-mmx fp-mmx fp-mmx fp-mmx fp-mmx fp fp shift sse2 shift shift r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 1 1 1 1 1 1 2 0 0 0 0 0 0 0 0 4 4 6 23 39 38 69 4 1 1 1 0 0 0 0 1 2 2 2 23 39 38 69 4 1 1 1 1 1 1 1 1 fp fp fp fp fp fp fp mmx r,r/m 1 0 4 1 2 1 r,r/m r,r/m 1 2 0 0 4 6 1 1 2 3 Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D r,r/m 1 0 2 1 Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 1 1 1 2 2 0 0 0 0 0 0 23 39 38 69 4 4 0 0 0 0 1 1 m m 4 4 8 4 98 Arithmetic ADDPS/D ADDSS/D SUBPS/D SUBSS/D MULPS/D MULSS/D DIVSS DIVPS DIVSD DIVPD RCPPS RCPSS MAXPS/D MAXSS/DMINPS/D MINSS/D CMPccPS/D CMPccSS/D COMISS/D UCOMISS/D Other LDMXCSR STMXCSR Notes: Page 242 sse2 a sse2 sse2 sse2 a sse2 a sse a a a a a sse2 sse a add add mul div div div div sse sse sse sse sse sse2 sse2 sse a a a a,h a,h a,h a,h a fp add sse a 1 1 fp fp add add sse sse a a 2 1 mmx alu sse a 23 39 38 69 3 4 1 1 1 1 1 1 fp fp fp fp mmx mmx div div div div sse sse sse2 sse2 sse sse a,h a,h a,h a,h a a 100 6 1 1 sse2 sse2 sse2 sse sse2 sse sse2 sse sse a a a a a a a Pentium 4 a) h) k) Add 1 μop if source is a memory operand. Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit. It may be advantageous to replace this instruction by two 64-bit moves. Page 243 Prescott Intel Pentium 4 w. EM64T (Prescott) List of instruction timings and μop breakdown Explanation of column headings: Instruction: Operands: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc. i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc., mabs = memory operand with 64-bit absolute address. 
μops: Number of μops issued from instruction decoder and stored in trace cache.
Microcode: Number of additional μops issued from microcode ROM.
Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under “How the values were measured”.
Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.
Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.
Port: The port through which each μop goes to an execution unit. Two independent μops can start to execute simultaneously only if they are going through different ports.
Execution unit: Use this information to determine additional latency. When an instruction with more than one μop uses more than one execution unit, only the first and the last execution unit is listed.
Execution subunit: Throughput measures apply only to instructions executing in the same subunit.
Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
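The column definitions above combine in a simple way when estimating a dependency chain. The Python sketch below illustrates that arithmetic under stated assumptions. The `Timing` record, the function names, and all numeric values in the example chain are hypothetical and are not taken from the tables. Only the rules come from the definitions: additional latency is added when the next dependent instruction runs in a different execution unit, ALU0 and ALU1 count as the same unit for this purpose, and a reciprocal throughput of 0.25 corresponds to 4 instructions per clock cycle.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    name: str
    latency: float
    additional_latency: float
    reciprocal_throughput: float
    unit: str


def same_unit(a, b):
    """ALU0 and ALU1 count as one unit: no additional latency between them."""
    alus = {"alu0", "alu1"}
    return a == b or (a in alus and b in alus)


def chain_cycles(chain):
    """Minimum clock cycles for a chain where each entry depends on the previous one."""
    chain = list(chain)
    total = 0.0
    for cur, nxt in zip(chain, chain[1:] + [None]):
        total += cur.latency
        # Additional latency applies only when the dependent successor
        # executes in a different execution unit.
        if nxt is not None and not same_unit(cur.unit, nxt.unit):
            total += cur.additional_latency
    return total


def instructions_per_clock(reciprocal_throughput):
    """Best-case rate for independent instructions in the same subunit."""
    return 1.0 / reciprocal_throughput


# Hypothetical timings, chosen only to exercise each rule once.
chain = [
    Timing("op_a", 1, 0.5, 0.25, "alu0"),
    Timing("op_b", 1, 0.5, 0.25, "alu1"),  # alu0 -> alu1: no additional latency
    Timing("op_c", 2, 1, 1, "mmx"),        # alu1 -> mmx: op_b's additional latency applies
    Timing("op_d", 2, 1, 1, "mmx"),        # mmx -> mmx: same unit
]

print(chain_cycles(chain))           # 6.5
print(instructions_per_clock(0.25))  # 4.0
```

The result is a lower bound, since the listed latencies are minimum values and cache misses, misalignment, and exceptions are ignored.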
Integer instructions 86 86 x64 Notes alu0/1 alu0/1 alu0/1 Instruction set Page 244 0.25 0/1 0.25 0/1 0.5 0/1 Subunit 0 0 0 Execution unit 1 1 Port Reciprocal throughput 0 0 0 Additional latency 1 1 1 Latency r,r r8/16/32,i r64,i32 Microcode Move instructions MOV MOV MOV Operands μops Instruction c Prescott MOV MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVZX MOVZX MOVZX MOVSX MOVSX MOVSX MOVSXD CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LEA LEA LEA LEA LEA LEA LAHF SAHF SALC LDS, LES, ... LODS REP LODS STOS REP STOS MOVS REP MOVSB REP MOVSW r64,i64 r8/16,m r32/64,m m,r m,i m64,i32 r,sr sr,r/m r,mabs mabs,r m,r32 r,r r16,r8 r,m r16,r8 r32/64,r8/16 r,m r64,r32 r,r/m r,r r,m r i m sr r m sr r,[m] r,[r+r/i] r,[r+r+i] r,[r*i] r,[r+r*i] r,[r+r*i+i] r,m 2 0 0 2 0 3 0 1 0 2 0 1 0 2 0 2 0 1 2 1 8 3 0 3 0 2 0 1 0 1 0 2 0 2 0 1 0 2 0 2 0 2 0 1 0 1 0 2 0 3 0 1 0 1 0 3 0 9.5 0 3 0 2 0 2 6 ≈100 4 0 6 2 0 2 2 0 2 3 0 2 1 3 1 3 1 9 2 0 1 0 2 6 1 8 1 8 2 16 1 0 1 0 2.5 0 2 0 3.5 0 3 0 3.5 0 2 0 3.5 0 3 0 3.5 0 1 0 4 0 1 0 5 0 2 0 0 2 10 1 3 8 1 5n ≈ 4n+50 1 2 8 1 2.5n ≈ 3n 1 4 8 9 ≈.3n ≈.3n 1 ≈.5-1.1n≈ .6-1.4n Page 245 1 1 1 2 2 2 8 27 1 2 2 0.25 1 1 1 0.5 1 0.5 3 1 2 2 2 9 9 16 1 10 30 70 15 0.25 0.25 0.5 1 1 1 1 28 8 1 2 2 0 0,3 0,3 alu1 load load store store store 0/1 0/1 2 0 0 2 0 alu0/1 alu0/1 load alu0 alu0 load alu0 0/1 alu0/1 0/1 0/1 0/1 1 0,1 1 1 0/1 1 alu0/1 alu0/1 alu0/1 alu alu0,1 alu int alu0/1 int x64 86 86 86 86 x64 86 86 x64 x64 sse2 386 386 386 386 386 386 x64 PPro 86 86 86 86 186 86 86 86 186 86 86 86 86 186 86 86 86 386 386 386 86 86 86 86 86 86 8 86 86 8 86 86 86 b,c a,q l l c c a,c,o a,c,o a a,e m m p n d,n m m Prescott REP MOVSD REP MOVSQ BSWAP IN, OUT PREFETCHNTA PREFETCHT0/1/2 SFENCE LFENCE MFENCE Arithmetic instructions ADD, SUB ADD, SUB ADD, SUB ADC, SBB ADC, SBB ADC, SBB ADC, SBB CMP CMP INC, DEC INC, DEC NEG NEG AAA, AAS DAA, DAS AAD AAM MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL MUL, IMUL 
MUL, IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV r r,r/i m m r,r r,m m,r r,r/i r,m m,r m,i r,r r,m r m r m r8 r16 r32 r64 m8 m16 m32 m64 r16,r16 r16,r16,i r32,r32 r32,(r32),i r64,r64 r64,(r64),i r16,m16 r32,m32 r64,m64 r,m,i r8/m8 r16/m16 r32/m32 r64/m64 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 3 1 2 2 4 1 3 1 1 2 2 1 4 3 1 2 2 3 2 1 2 1 1 1 1 2 2 2 3 1 1 1 1 ≈1.1n ≈ 1.4 n 86 x64 alu ≈1.1n ≈ 1.4 n 0 52 0 0 2 2 4 1 0 0 0 0 5 6 5 0 0 0 0 0 0 10 16 5 17 0 0 0 5 0 5 0 6 0 0 0 0 0 0 0 0 0 0 20 19 21 31 1 1 5 10 10 20 22 1 1 1 5 1 5 26 29 13 71 10 11 11 11 10 11 11 11 10 11 10 10 10 10 10 10 10 10 74 73 76 63 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Page 246 1 >1000 1 1 50 50 124 86 sse sse sse sse2 sse2 0.25 0/1 1 2 10 1 10 1 10 10 0.25 0/1 1 0.5 0/1 3 0.5 0 3 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 1-2.5 34 34 34 52 486 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 alu0/1 int,alu int,alu alu0/1 alu0/1 alu0 int int int int int int int int int int int int int int int int int int int int int int int int mul fpdiv mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul mul fpdiv fpdiv fpdiv fpdiv 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 x64 86 86 86 x64 386 186 386 386 x64 x64 386 386 x64 186 86 86 386 x64 c c c c c m m m m a a a a Prescott IDIV IDIV IDIV IDIV CBW CWD CDQ CQO CWDE CDQE SCAS REP SCAS CMPS REP CMPS Logic AND, OR, XOR AND, OR, XOR AND, OR, XOR TEST TEST NOT NOT SHL SHR, SAR SHR, SAR SHL SHR, SAR SHR, SAR ROL, ROR ROL, ROR ROL, ROR ROL, ROR RCL, RCR RCL RCR RCL RCR SHL, SHR, SAR ROL. ROR SHL, SHR, SAR ROL. 
ROR RCL, RCR RCL, RCR RCL, RCR SHLD, SHRD SHLD SHRD SHLD, SHRD SHLD r8/m8 r16/m16 r32/m32 r64/m64 r,r r,m m,r r,r r,m r m r,i r8/16/32,i r64,i r,CL r8/16/32,CL r64,CL r8/16/32,i r64,i r8/16/32,CL r64,CL r,1 r,i r,i r,CL r,CL m8/16/32,i m8/16/32,i m8/16/32,cl m8/16/32,cl m8/16/32,1 m8/16/32,i m8/16/32,cl r8/16/32,r,i r64,r64,i r64,r64,i r8/16/32,r,cl r64,r64,cl 1 21 76 0 1 19 79 0 1 19 79 0 1 58 96 0 2 0 2 0 2 0 2 0 1 0 1 0 1 0 7 0 2 0 2 0 1 0 1 0 1 3 0 1 ≈ 54+6n ≈ 4n 1 5 1 ≈ 81+8n ≈ 5n 34 34 34 91 1 1 1 1 1 1 8 1 2 3 1 2 1 3 1 1 1 2 2 2 1 1 2 2 1 2 2 1 1 3 3 2 2 2 3 2 3 4 3 4 4 0.5 1 2 0.5 1 0.5 2 0.5 0.5 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 11 11 11 6 6 6 6 5 13 13 0 5 7 0 5 1 1 5 1 1 1 5 1 1 7 2 2 8 1 7 2 8 7 31 25 31 25 10 10 10 10 27 38 37 8 10 10 9 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Page 247 1 1 1 1 0 0/1 0/1 0/1 0/1 0/1 int int int int alu0 alu0/1 alu0/1 alu0/1 alu0/1 alu0/1 fpdiv fpdiv fpdiv fpdiv 86 86 386 x64 86 86 386 x64 386 x64 86 a a a a 86 10 86 86 1 7 2 8 7 31 25 31 25 27 38 37 7 8 0 alu0 0 alu0 0 alu0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 86 86 86 86 86 86 86 186 186 x64 86 86 x64 186 x64 86 x64 86 186 186 86 86 86 86 86 86 86 86 86 386 x64 x64 386 x64 c c c c c d d d d d d d d d d d d d d Prescott SHRD SHLD, SHRD SHLD, SHRD BT BT BT BT BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BTR, BTS, BTC BSF, BSR SETcc SETcc CLC, STC CMC CLD, STD r64,r64,cl m,r,i m,r,CL r,i r,r m,i m,r r,i r,r m,i m,r r,r/m r m 3 3 2 1 2 3 2 1 2 3 2 2 2 3 2 3 1 8 8 8 0 0 0 7 0 0 6 10 0 0 0 0 0 8 12 20 20 8 9 8 10 8 9 28 14 16 9 9 Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Jcc short/near J(E)CXZ short LOOP short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i IRET BOUND m INT i INTO 1 2 3 3 2 1 4 4 3 3 4 4 2 4 4 1 2 1 2 2 1 0 
25 0 0 28 0 0 0 0 29 0 0 32 0 0 30 30 49 11 67 4 0 Other NOP (90) Long NOP (0F 1F) PAUSE LEAVE CLI STI CPUID RDTSC 1 0 1 0 1 2 4 0 1 5 1 11 1 49-90 1 12 15 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 10 8 9 8 10 8 9 10 14 4 1 2 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 x64 386 386 386 386 386 386 386 386 386 386 386 386 386 86 86 86 53 1 154 15 10 157 2-4 4 4 7 160 7 9 160 7 7 160 160 325 12 470 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.25 0/1 0.25 0/1 50 5 52 64 300-500 100 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 alu1 int int alu0 branch alu0 alu0 branch branch alu0 alu0 alu0 alu0 branch branch branch branch alu0 alu0 branch branch alu0 alu0 branch branch alu0/1 alu0/1 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 86 186 86 86 86 ppro sse2 186 86 86 p5 p5 d d d d d m m m m RDPMC (bit 31 = 1) RDPMC (bit 31 = 0) MONITOR MWAIT 1 4 37 154 100 240 p5 p5 (sse3) (sse3)

Notes:
a) Add 1 μop if source is a memory operand.
b) Uses an extra μop (port 3) if SIB byte used.
c) Add 1 μop if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
d) Has (false) dependence on the flags in most cases.
e) Not available on PMMX.
l) Move accumulator to/from memory with 64 bit absolute address (opcode A0 A3).
m) Not available in 64 bit mode.
n) Not available in 64 bit mode on some processors.
o) MOVSX uses an extra μop if the destination register is smaller than the biggest register size available. Use a 32 bit destination register in 16 bit and 32 bit mode, and a 64 bit destination register in 64 bit mode for optimal performance.
p) LEA with a direct memory operand has 1 μop and a reciprocal throughput of 0.25. This also applies if there is a RIP-relative address in 64-bit mode. A sign-extended 32-bit direct memory operand in 64-bit mode without RIP-relative address takes 2 μops because of the SIB byte. The throughput is 1 in this case. You may use a MOV instead.
q) These values are measured in 32-bit mode. In 16-bit real mode there is 1 microcode μop and a reciprocal throughput of 17.

Floating point x87 instructions
0 0 mov load load load mov store store store mov load load store store mov 87 87 87 87 87 87 87 87 87 87 87 87 sse3 87 Notes 0 0 2 2 2 0 0 0 0 0 2 2 0 0 0 Instruction set 7 7 1 1 8 90 1 2 10 400 1 8 2 2.5 2.5 2 Subunit 0 0 Execution unit 7 Port Reciprocal throughput 0 0 3 74 0 0 6 311 0 2 0 0 0 0 Additional latency 1 1 3 3 1 2 3 3 1 3 2 3 3 1 Latency r m32/64 m80 m80 r m32/64 m80 m80 r m16 m32/64 m m Microcode Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FIST(P) FISTTP FLDZ Operands μops Instruction FLD1 FCMOVcc FFREE FINCSTP, FDECSTP FNSTSW FSTSW FNSTSW FNSTCW FLDCW Arithmetic instructions FADD(P),FSUB(R)(P) FADD,FSUB(R) FIADD,FISUB(R) FIADD,FISUB(R) FMUL(P) FMUL FIMUL FIMUL FDIV(R)(P) FDIV(R) FIDIV(R) FIDIV(R) FABS FCHS FCOM(P), FUCOM(P) FCOM(P) FCOMPP, FUCOMPP FCOMI(P) FICOM(P) FICOM(P) FTST FXAM FRNDINT FPREM FPREM1 Math FSQRT FLDPI, etc.
FSIN, FCOS FSINCOS FPTAN FPATAN FSCALE FXTRACT F2XM1 FYL2X FYL2XP1 st0,r r AX AX m16 m16 m16 r m m16 m32 r m m16 m32 r m m16 m32 r m r m16 m32 2 4 3 1 4 6 2 4 3 0 0 0 0 0 0 3 0 6 1 2 3 3 1 2 3 3 1 2 3 3 1 1 1 2 2 3 3 3 1 1 3 8 9 0 0 3 0 0 0 3 0 0 0 3 3 0 0 0 0 0 0 3 0 0 0 14 86 92 1 2 3 5 8 4 3 4 3 3 3 0 0 ≈100 ≈150 ≈170 97 25 16 190 63 58 5 1 0 0 0 0 6 6 7 6 8 8 8 8 45 45 45 45 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 28 220 220 1 1 1 45 1 ≈200 ≈200 ≈270 ≈250 96 27 ≈270 ≈170 ≈170 2 4 3 1 3 3 8 3 10 0 1 0 0 1 1 0 0 0,2 mov fp mov mov 1 1 6 2 2 2 8 3 45 45 45 45 1 1 1 1 1 3 8 2 1 1 16 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0,1 1 1,2 1 1 0,1 1 1 fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp fp 1 1 1 1 1 1 1 1 1 1 1 fp fp fp fp fp fp fp fp fp fp fp 45 2 ≈200 ≈200 ≈270 ≈250 87 PPro 87 87 287 287 87 87 87 add add add add mul mul mul mul div div div div misc misc misc misc misc misc misc misc misc misc 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 87 PPro 87 87 87 87 87 87 387 div 87 87 387 387 87 87 87 87 87 87 87 fp fp e f g,h g,h g,h g,h g,h
Other FNOP (F)WAIT FNCLEX FNINIT FNSAVE FRSTOR FXSAVE FXRSTOR 1 2 1 1 2 2 2 2 0 0 4 30 181 96 121 118 1 0 0 0 1 1 120 200 500 570 0 0 1 mov mov 0,1 160 244 87 87 87 87 87 87 sse sse i i

Notes:
e) Not available on PMMX.
f) The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput is > 100.
g) Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 32, double precision: 40, long double precision (default): 45.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
i) Takes fewer microcode μops when XMM registers are disabled, but the throughput is the same.
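
The precision-dependent timings stated in the notes above (32, 40 or 45 clock cycles, depending on the precision setting in the F.P. control word) can be expressed as a small lookup. The following Python sketch is an illustration only and not part of the measurements. It decodes the precision-control field (bits 8 and 9) of an x87 control word and returns the cycle count quoted in the note. The function names and the first-order estimate for a chain of dependent divisions (count times latency) are assumptions introduced here for illustration.

```python
# Precision-control (PC) field of the x87 control word occupies bits 8-9.
PC_SINGLE = 0b00    # 24-bit significand
PC_DOUBLE = 0b10    # 53-bit significand
PC_EXTENDED = 0b11  # 64-bit significand (default, e.g. control word 0x037F)

# Cycle counts quoted in the note above for the Prescott division entries.
# The note gives one value per precision for latency and reciprocal throughput.
PRESCOTT_FDIV_CYCLES = {PC_SINGLE: 32, PC_DOUBLE: 40, PC_EXTENDED: 45}


def precision_control(control_word: int) -> int:
    """Extract the two-bit precision-control field from an x87 control word."""
    return (control_word >> 8) & 0b11


def prescott_fdiv_cycles(control_word: int) -> int:
    """Return the quoted cycle count for the given control word (hypothetical helper)."""
    pc = precision_control(control_word)
    if pc not in PRESCOTT_FDIV_CYCLES:
        raise ValueError("precision-control encoding 01 is reserved")
    return PRESCOTT_FDIV_CYCLES[pc]


def chain_cycles(n_divisions: int, control_word: int) -> int:
    """Rough estimate for n dependent divisions: n times the latency.

    This ignores cache misses, denormal operands and other instructions
    in the chain, so it is a lower bound rather than a prediction.
    """
    return n_divisions * prescott_fdiv_cycles(control_word)
```

With the default control word 037FH the sketch selects the long double value, while 027FH selects double precision and 007FH single precision.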
Integer MMX and XMM instructions 7 2 0 1 7 0 10 1 Page 251 alu shift shift sse2 mmx mmx mmx sse2 sse2 sse2 mmx mmx sse2 mmx mmx sse2 sse2 sse2 sse2 sse2 sse3 Notes 1 1 0 fp 1 mmx 2 load 0 fp 1 mmx 2 load 0,1 0 mov 1 mmx 2 load 0 mov 0 mov 2 load 0 mov 2 load 0 mov 2 load 0,1 mov-mmx Instruction set 7 4 1 1 1 1 2 1 2 1 2 1 2 1 1 2 23 8 2.5 2 Subunit 1 1 Execution unit 6 3 Port Reciprocal throughput 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 Additional latency 2 1 1 1 2 1 2 1 1 1 2 1 1 2 4 4 4 3 Latency r32, mm mm, r32 mm,m32 r32, xmm xmm, r32 xmm,m32 m32, r mm,mm xmm,xmm r,m64 m64,r xmm,xmm xmm,m m,xmm xmm,m m,xmm xmm,m mm,xmm Microcode Move instructions MOVD MOVD MOVD MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q Operands μops Instruction k k Prescott MOVQ2DQ MOVNTQ MOVNTDQ MOVDDUP MOVSHDUP MOVSLDUP PACKSSWB/DW PACKUSWB PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKHBW/WD/DQ/ QDQ PUNPCKLBW/WD/DQ/Q DQ PSHUFD PSHUFL/HW PSHUFW MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRW PINSRW Arithmetic instructions PADDB/W/D PADD(U)SB/W PSUBB/W/D PSUB(U)SB/W PADDQ, PSUBQ PADDQ, PSUBQ PCMPEQB/W/D PCMPGTB/W/D PMULLW PMULHW PMULHUW PMADDWD PMULUDQ PAVGB/W PMIN/MAXUB PMIN/MAXSW PSADBW Logic PAND, PANDN POR, PXOR PSLL/RLW/D/Q, PSRAW/D PSLLDQ, PSRLDQ xmm,mm m,mm m,xmm xmm,xmm 2 3 2 1 0 0 0 0 10 1 2 1 2 4 4 2 0,1 mov-mmx 0 mov 0 mov 1 mmx xmm,xmm 1 0 4 1 2 1 mm,r/m 1 0 2 1 2 xmm,r/m 1 0 4 1 mm,r/m 1 0 2 xmm,r/m 1 0 xmm,r/m xmm,xmm,i xmm,xmm,i mm,mm,i mm,mm xmm,xmm r32,r r32,mm,i r32,xmm,i r,r32,i 1 1 1 1 1 1 2 2 2 2 r,r/m sse2 shift sse sse2 sse3 mmx shift sse3 1 mmx shift mmx a 4 1 mmx shift mmx a 1 2 1 mmx shift mmx a 4 1 4 1 mmx shift sse2 a 0 0 0 0 4 6 0 0 0 0 2 4 2 2 1 1 1 1 2 2 2 1 10 12 3 2 3 2 shift shift shift shift sse2 sse2 sse sse sse sse2 a 1 0 2 1 1,2 1 mmx alu mmx a,j r,r/m mm,r/m xmm,r/m 1 1 1 0 0 0 2 2 5 1 1 1 1,2 1 2 1 1 1 mmx mmx fp alu alu add mmx sse2 sse2 a,j a a r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 1 1 1 1 1 1 1 1 
0 0 0 0 0 0 0 0 0 2 7 7 7 7 2 2 2 4 1 1 1 1 1 1 1 1 1 1,2 1,2 1,2 1,2 1,2 1,2 1,2 1,2 1,2 1 1 1 1 1 1 1 1 1 mmx fp fp fp fp mmx mmx mmx mmx alu mul mul mul mul alu alu alu alu mmx mmx sse mmx sse2 sse sse sse sse a,j a,j a,j a,j a,j a,j a,j a,j a,j r,r/m r,r/m 1 1 0 0 2 2 1 1 1,2 1,2 1 1 mmx mmx alu alu mmx mmx a,j a,j r,i/r/m xmm,i 1 1 0 0 2 4 1 1 1,2 2 1 1 mmx mmx shift shift mmx sse2 a,j 7 7 7 4 1 mmx 1 mmx 1 mmx 1 mmx 0 mov 0 mov 0,1 mmx-alu0 1 mmx-int 1 mmx-int 1 int-mmx sse sse sse2 sse
Other EMMS 10 10 12 0 mmx

Notes:
a) Add 1 μop if source is a memory operand.
j) Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
k) It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.

Floating point XMM instructions
0 Conversion CVTPS2PD CVTPD2PS CVTSD2SS CVTSS2SD CVTDQ2PS CVTDQ2PD CVT(T)PS2DQ CVT(T)PD2DQ CVTPI2PS r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m xmm,mm 1 2 3 2 1 3 1 2 4 0 0 0 0 0 0 0 0 0 MOVHPS/D, MOVLPS/D 1 1 0 4 2 1 1 4 2 1 1 5 4 4 2 1 1 1 1 4 10 14 8 5 10 5 11 12 1 1 1 1 1 1 1 1 1 4 2 6 6 2 4 2 2 6 0 2 0 0 2 0 1 1 2 0 1 1 2 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 mov mov mmx mmx shift shift mmx mmx shift shift fp mmx mmx mmx shift shift shift mmx fp-mmx mmx mmx fp mmx-fp fp fp-mmx mmx shift sse2 shift shift sse2 sse2 sse sse sse sse sse sse sse sse sse sse sse sse sse sse sse3 sse3 sse sse sse sse sse sse2 a sse2 sse2 sse2 a sse2 a sse Notes 7 MOVHPS/D, MOVLPS/D 2 4 1 1 2 1 2 8 2 2 1 2 2 2 2 2 2 2 4 3 2 2 2 Instruction set 0 0 Subunit 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Execution unit 1 1 2 1 4 4 1 1 1 2 1 1 2 2 1 1 2 2 1 2 1 7 MOVSH/LDUP MOVDDUP MOVNTPS/D MOVMSKPS/D SHUFPS/D UNPCKHPS/D UNPCKLPS/D r,r r,m m,r r,r r,m m,r r,r r,r r,m m,r r,r r,r r,m m,r r,r r,r m,r r32,r r,r/m,i r,r/m r,r/m Port Reciprocal throughput Additional latency Latency Microcode Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVUPS/D MOVSS MOVSD MOVSS, MOVSD MOVSS, MOVSD MOVHLPS MOVLHPS
Operands μops Instruction k k a a a a a a CVTPI2PD CVT(T)PS2PI CVT(T)PD2PI CVTSI2SS CVTSI2SD CVT(T)SD2SI CVT(T)SS2SI xmm,mm mm,xmm mm,xmm xmm,r32 xmm,r32 r32,xmm r32,xmm 4 3 4 3 4 2 2 0 0 0 0 0 0 0 12 8 12 20 20 12 17 1 0 1 1 1 1 1 5 2 3 4 5 4 4 1 0,1 0,1 1 1 1 1 fp-mmx fp-mmx fp-mmx fp-mmx fp-mmx fp fp sse2 sse sse2 sse sse2 r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 1 1 3 1 1 1 1 1 2 0 0 0 0 0 0 0 0 0 0 5 5 5 13 7 32 41 40 71 6 1 1 1 1 1 1 1 1 1 1 2 2 2 5-6 2 23 41 40 71 4 1 1 1 1 1 1 1 1 1 1 fp fp fp fp fp fp fp fp fp mmx r,r/m 1 0 5 1 2 1 r,r/m r,r/m 1 2 0 0 5 6 1 1 2 3 Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D r,r/m 1 0 2 1 Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS r,r/m r,r/m r,r/m r,r/m r,r/m r,r/m 1 1 1 1 2 2 0 0 0 0 0 0 32 41 40 71 5 6 1 1 1 1 1 1 m m 2 3 11 0 Arithmetic ADDPS/D ADDSS/D SUBPS/D SUBSS/D ADDSUBPS/D HADDPS/D HSUBPS/D MULPS/D MULSS/D DIVSS DIVPS DIVSD DIVPD RCPPS RCPSS MAXPS/D MAXSS/D MINPS/D MINSS/D CMPccPS/D CMPccSS/D COMISS/D UCOMISS/D Other LDMXCSR STMXCSR a a a a a sse2 sse a a add add add add mul div div div div sse sse sse3 sse3 sse sse sse sse2 sse2 sse a a a a a a,h a,h a,h a,h a fp add sse a 1 1 fp fp add add sse sse a a 2 1 mmx alu sse a 32 41 40 71 3 4 1 1 1 1 1 1 fp fp fp fp mmx mmx div div div div sse sse sse2 sse2 sse sse a,h a,h a,h a,h a a 13 3 1 1 sse sse

Notes:
a) Add 1 μop if source is a memory operand.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
k) It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.

Intel Atom
List of instruction timings and μop breakdown

Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of μops from the decoder or ROM.
Unit: Tells which execution unit is used. Instructions that use the same unit cannot execute simultaneously. ALU0 and ALU1 means integer unit 0 or 1, respectively. ALU0/1 means that either unit can be used. ALU0+1 means that both units are used. Mem means memory in/out unit. FP0 means floating point unit 0 (includes multiply, divide and other SIMD instructions). FP1 means floating point unit 1 (adder). MUL means multiplier, shared between FP and integer units. DIV means divider, shared between FP and integer units. np means not pairable: Cannot execute simultaneously with any other instruction.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

Integer instructions
Move instructions MOV MOV MOV MOV MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG Operands μops Unit r,r r,i r,m m,r m,i r,sr m,sr sr,r sr,m m,r r,r/m r,r r,m r,r r,m 1 1 1 1 1 1 2 7 8 1 1 1 1 3 4 ALU0/1 ALU0/1 ALU0, Mem ALU0, Mem ALU0, Mem Latency Reciprocal throughput 1 1 1-3 1 1 ALU0, Mem ALU0 ALU0+1 1 2 6 6 1/2 1/2 1 1 1 1 5 21 26 2.5 1 2 3 6 6 Remarks All addr. modes All addr.
modes Implicit lock Atom XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL r i m sr r (E/R)SP m sr 3 1 1 2 3 14 9 1 1 3 7 19 16 1 1 2 r,m r m m m 1 1 10 1 1 1 1 1 r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m 1 1 1 1 1 1 1 1 1 1 13 13 20 21 4 10 3 4 3 8 2 1 6 2 r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i np np np np ALU0+1 ALU0/1 AGU1 ALU0 6 1 1 1 2 1 7 1-4 1 30 1 1 30 1 1 1/2 1 1 1 1/2 1 1 2 2 2 1/2 1 1/2 Mem Mem ALU0/1 ALU0/1, Mem ALU0/1 ALU0/1 ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul Page 256 6 1 1 5 6 12 11 1 1 6 31 28 12 2 1/2 5 2 2 2 2 1 1 1 16 12 20 25 7 24 7 6 6 14 6 5 13 5 Not in x64 mode Not in x64 mode Not in x64 mode 4 clock latency on input register Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode 7 6 6 14 5 2 11 5 Atom IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCL SHLD SHLD SHLD SHLD SHLD SHLD SHRD SHRD SHRD SHRD SHRD SHRD BT BT BT r32,r32,i r64,r64,i m8 m16 m32 m64 r/m8 r/m16 r/m32 r/m 64 r/m8 r/m16 r/m32 r/m64 r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 r,1 r/m,i/cl r/m,i/cl r16,r16,i r32,r32,i r64,r64,i r16,r16,cl r32,r32,cl r64,r64,cl r16,r16,i r32,r32,i r64,r64,i r16,r16,cl r32,r32,cl r64,r64,cl r,r/i m,r m,i 1 7 3 5 4 8 9 12 12 38 26 29 29 60 2 1 1 2 1 1 ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Mul ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0, Div ALU0 ALU0 ALU0 ALU0 ALU0 ALU0 5 14 6 7 7 14 22 33 49 183 38 45 
61 207 5 1 1 5 1 1 1 ALU0/1 1 1 ALU0/1, Mem 1 ALU0/1, Me 1 1 ALU0/1 1 1 ALU0/1, Mem 1 ALU0 1 1 ALU0 1 1 ALU0 1 1 ALU0 1 5 ALU0 7 2 ALU0 1 12-17 ALU0 12-15 14-20 ALU0 14-18 10 ALU0 10 2 ALU0 5 10 ALU0 11 9 ALU0 9 2 ALU0 5 9 ALU0 10 8 ALU0 8 2 ALU0 5 10 ALU0 9 7 ALU0 8 2 ALU0 5 9 ALU0 9 1 ALU1 1 9 10 2 5 Page 257 2 14 22 33 49 183 38 45 61 207 1/2 1 1 1/2 1 1 1 1 1 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1-2 more if mem 1 Atom BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc SETcc CLC STC CMC CLD STD r,r/i m,r m,i r,r/m r m Control transfer instructions JMP short/near JMP far JMP r JMP m(near) JMP m(far) Conditional jump short/near J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL far CALL r CALL m(near) CALL m(far) RETN RETN i RETF RETF i BOUND r,m INTO String instructions LODS REP LODS STOS REP STOS MOVS REP MOVS SCAS REP SCAS CMPS REP CMPS Other NOP (90) Long NOP (0F 1F) PAUSE ENTER 1 10 3 10 1 2 1 1 5 6 ALU1 ALU1 ALU1 ALU0+1 ALU0/1 2 1 29 1 2 30 1 3 8 8 1 37 1 2 38 1 1 36 36 11 4 ALU1 a,0 1 2 5 1/2 2 7 25 2 66 4 7 78 2 7 8 8 3 65 18 20 64 6 6 80 80 10 6 ALU1 np np 3 5n+11 2 3n+10 4 4n+11 3 5n+16 5 6n+16 1 1 5 14 1 11 6 16 2 6 3n+50 5 2n+4 6 2n - 4n 6 3n+60 7 4n+40 ALU0/1 ALU0/1 Page 258 Not in x64 mode Not in x64 mode Not in x64 mode fastest for high n 1/2 1/2 24 23 Not in x64 mode Atom ENTER LEAVE CPUID RDTSC RDPMC a,b 20+6b 4 40-80 16 24 6 100-170 29 48 Floating point x87 instructions Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 FLDPI FLDL2E etc. 
FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST Operands μops r m32/m64 m80 m80 r m32/m64 m80 m80 r m m m 1 1 4 52 1 3 8 189 1 1 3 3 1 2 2 3 4 4 2 3 1 1 166 83 1 3 9 92 1 7 12 221 1 7 11 11 1 1 1 1 1 1 1 5 3 3 3 3 1 5 5 71 1 1 1 1 r AX m16 m16 m16 m m r/m r/m r/m r/m r m m m m Unit Latency Reciprocal throughput 9 1 321 177 Mul Div Mul Div 1 Page 259 1 1 10 92 1 9 13 221 1 6 9 9 1 8 10 9 10 10 8 9 1 1 321 177 1 2 71 1 1 1 1 10 9 9 73 9 1 Remarks SSE3 Atom FXAM FPREM FPREM1 FRNDINT 1 26 37 19 1 ~110 ~130 48 Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN 30 15 1 9 112 25 63 100 91 56 24 71 ~260 ~260 ~100 ~220 ~300 ~300 Other FNOP WAIT FNCLEX FNINIT 1 2 4 23 Div 5 1 1 5 26 74 Integer MMX and XMM instructions Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ Operands μops r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm 1 1 1 1 1 1 1 1 1 1 3 4 4 1 1 1 1 (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm 1 1 1 Unit Latency Reciprocal throughput Mem Mem 4 5 3 4 1 4 5 1 4 5 6 6 6 1 1 ~400 ~450 2 1 1 1 1/2 1 1 1/2 1 1 6 6 6 1 1 1 3 FP0 FP0 FP0 1 1 1 1 1 1 Mem Mem FP0/1 Mem Mem FP0/1 Mem Mem Mem Mem Mem Page 260 Remarks Atom PSHUFB PSHUFB PSHUFW PSHUFL/HW PSHUFD PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PINSRW PEXTRW Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW PSADBW PSADBW PAVGB/W PMIN/MAXUB PMIN/MAXSW PABSB PABSW PABSD PSIGNB PSIGNW PSIGND Logic instructions 
PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ mm,mm xmm,xmm mm,mm,i xmm,xmm,i xmm,xmm,i xmm, xmm,i mm,mm xmm,xmm r32,(x)mm (x)mm,r32,i r32,(x)mm,i 1 4 1 1 1 1 1 2 1 1 2 (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm, (x)mm (x)mm,(x)mm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm mm,mm xmm,xmm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm 1 2 7 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm (x)xmm,i xmm,i Other EMMS FP0 FP0 FP0 FP0 FP0 Mem Mem 1 6 1 1 1 1 4 3 5 FP0/1 1 6 1 1 1 1 2 7 2 1 5 1/2 5 8 FP0/1 FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0, Mul FP0/1 FP0/1 FP0/1 FP0/1 1 5 8 6 1 4 5 4 5 4 5 4 5 4 5 4 5 1 1 1 1 1 FP0/1 1 1/2 1 2 1 1 FP0/1 FP0 FP0 FP0 1 5 1 1 1/2 5 1 1 9 1/2 1 2 1 2 1 2 1 2 1 2 1 2 1/2 1/2 1/2 1/2 9 Floating point XMM instructions Operands μops Unit Page 261 Latency Reciprocal throughput Remarks Atom Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm xmm,m32/64 m32/64,xmm xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,xmm,i xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T) PS2DQ CVTDQ2PD CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,mm mm,xmm xmm,mm mm,xmm xmm,r32 r32,xmm xmm,r32 r32,xmm 4 3 4 3 3 3 3 3 1 1 3 4 3 3 3 3 Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS ADDPD SUBPD ADDSUBPS ADDSUBPD HADDPS HSUBPS HADDPD HSUBPD MULSS xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 1 3 1 3 5 5 1 FP0/1 Mem Mem Mem Mem FP0/1 Mem Mem Mem Mem Mem FP0 Mem FP0 FP0 FP0 FP0 FP0 FP0 FP1 
FP1 FP1 FP1 FP1 FP1 FP0+1 FP0+1 FP0, Mul 1 4 5 6 6 1 4 5 5 4 4 1 4 ~500 1 1 1 1 1 1 1/2 1 1 6 6 1/2 1 1 1 1 1 1 2 3 1 1 1 1 1 11 10 7 6 6 6 7 6 6 4 7 7 7 10 8 10 11 10 6 6 6 6 6 6 5 1 6 7 6 8 6 8 5 5 5 6 5 6 8 8 4 1 1 1 6 1 6 7 7 1 MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D MAXPS/D MINPS/D xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm 1 1 6 3 3 6 6 1 5 1 3 4 1 3 FP0, Mul FP0, Mul FP0, Mul FP0, Div FP0, Div FP0, Div FP0, Div FP0 FP0 FP0 FP0 FP0 5 5 9 31 60 64 122 4 9 5 6 9 5 6 2 2 9 31 60 64 122 1 8 1 6 9 1 6

Math
Instruction   Operands   μops   Unit       Latency   Reciprocal throughput
SQRTSS        xmm,xmm    3      FP0, Div   31        31
SQRTPS        xmm,xmm    5      FP0, Div   63        63
SQRTSD        xmm,xmm    3      FP0, Div   60        60
SQRTPD        xmm,xmm    5      FP0, Div   121       121
RSQRTSS       xmm,xmm    1      FP0        4         1
RSQRTPS       xmm,xmm    5      FP0        9         8

Logic
Instruction   Operands   μops   Unit       Latency   Reciprocal throughput
ANDPS/D       xmm,xmm    1      FP0/1      1         1/2
ANDNPS/D      xmm,xmm    1      FP0/1      1         1/2
ORPS/D        xmm,xmm    1      FP0/1      1         1/2
XORPS/D       xmm,xmm    1      FP0/1      1         1/2

Other
Instruction   Operands   μops   Unit       Latency   Reciprocal throughput
LDMXCSR       m32        4                 5         6
STMXCSR       m32        4                 14        15
FXSAVE        m4096      121               142       144
FXRSTOR       m4096      116               149       150

Intel Silvermont
List of instruction timings and μop breakdown

Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of μops from the decoder or ROM. A μop that goes to multiple units is counted as one.
Unit: Tells which execution unit is used. Instructions that use the same unit cannot execute simultaneously. IP0 and IP1 means integer port 0 or 1 and their associated pipelines. IP0/1 means that either integer unit can be used. IP0+1 means that both units are used at the same time. Mem means memory execution cluster. FP0 means floating point port 0 (includes multiply, divide, convert and shuffle). FP1 means floating point port 1 (adder).
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.

Delays in the decoders are included in the latency and throughput timings. Values of 4 or more are often caused by bottlenecks in the decoders and microcode ROM rather than the execution units.

Integer instructions
Move instructions MOV MOV MOV MOV MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSXD MOVSX MOVZX MOVSXD CMOVcc CMOVcc XCHG XCHG XLAT PUSH Operands μops Unit r,r r,i r,m m,r m,i m,r r16,r8 r16,m8 r32/64,r/m r,r r,m r,r r,m 1 1 1 1 1 1 2 3 1 1 1 3 3 4 1 IP0/1 IP0/1 Mem Mem Mem Mem IP0 IP0 IP0 IP0/1 1 1 3 4 IP0/1 5 10 5 r IP0+1 Latency Reciprocal throughput 1 2 1/2 1/2 1 1 1 2 4 10 1 1 1 5 10 5 1 Remarks All addr. modes All addr.
modes Implicit lock Silvermont PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA LEA LEA LEA LEA BSWAP MOVBE MOVBE MOVBE PREFETCHNTA PREFETCHT0/1/2 PREFETCHNTW LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL i m r (E/R)SP m r,[r+d] r,[r+r*s] r,[r+r*s+d] r,[rip+d] r16,[m] r r16,m16 r32/64,m32/64 m,r m m m r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r r m r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 1 3 18 10 2 2 6 21 17 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 IP0+1 IP0+1 IP0+1 1 1 1 1 1 1 1 1 1 1 1 13 13 20 21 4 11 3 4 3 3 2 1 1 IP0/1 IP0/1, Mem IP0/1, Mem IP0/1 IP0 IP0/1 IP1 IP0+1 IP0/1 IP0 IP0/1 IP0/1 IP0/1 IP0 IP0 IP0 IP0 IP0 IP0 IP0 Page 265 1 2 6 1 1 2 4 1 1 6 2 6 1 1 1 6 12 12 16 16 5 24 5 5 5 7 4 3 5 1 5 29 10 1 3 6 47 14 1 1 4 1 1 2 0.5 4 1 2 1 1 1 1 1 8 14 7 1/2 1 1 2 2 2 1/2 1 1/2 1/2 1 16 5 5 5 7 4 1 2 Not in x64 mode Not in x64 mode Not in x64 mode latency to flag=2 Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Not in x64 mode Silvermont IMUL IMUL IMUL MUL IMUL MUL IMUL MUL IMUL MUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO POPCNT POPCNT POPCNT CRC32 CRC32 CRC32 Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR SHR SHL SAR ROR ROL ROR ROL RCR RCL RCR RCR RCL RCL SHLD SHLD SHLD SHLD SHLD SHLD SHRD SHRD SHRD r16,r16,i r32,r32,i r64,r64,i m8 m16 m32 m64 r/m8 r/m16 r/m32 r/m 64 r/m8 r/m16 r/m32 r/m64 r16,r16 r32,r32 r64,r64 r32,r8 r32,r16 r32,r32 r,r/i r,m m,r/i r,r/i m,r/i r,i/cl m,i/cl r,i/cl m,i/cl r,1 r,1 r,i/cl m,i/cl r,i/cl m,i/cl r16,r16,i r32,r32,i r64,r64,i r16,r16,cl r32,r32,cl r64,r64,cl r16,r16,i r32,r32,i r64,r64,i 2 1 1 3 5 4 4 9 12 12 23 26 29 29 44 2 1 1 2 1 1 2 1 1 2 1 1 IP0 IP0 IP0 IP0 IP0 IP0 IP0 IP0, FP0 IP0, FP0 IP0, FP0 IP0, FP0 IP0, FP0 IP0, FP0 IP0, FP0 IP0, FP0 IP0 IP0 IP0 IP0 
IP0 IP0 1 IP0/1 1 IP0/1, Mem 1 IP0/1, Mem 1 IP0/1 1 IP0/1, Mem 1 IP0 1 IP0 1 IP0 1 IP0 7 IP0 1 IP0 11 IP0 14 IP0 13 IP0 16 IP0 10 IP0 1 IP0 10 IP0 9 IP0 2 IP0 9 IP0 8 IP0 2 IP0 8-10 IP0 Page 266 4 3 5 14 24 25-29 25-39 34-94 24-35 37-41 29-46 47-107 4 1 1 4 1 1 4 3 3 4 6 3 1 6 1 1 1 1 9 2 12 13 12 14 10 2 10 10 4 10 10 4 10 4 1 2 19 19-23 19-31 25-94 25 30-32 29-38 47-107 4 1 1 4 6 1 1/2 1 1 1/2 1 1 1 1 1 2 2 more if mem 4 more if mem 2 more if mem 2 more if mem 2 more if mem 2 more if mem 3 more if mem 4 more if mem 3 more if mem Silvermont SHRD SHRD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc CLC STC CMC CLD STD 7 2 2 1 7 1 1 8 1 10 1 1 1 4 5 IP0 IP0 IP0 IP0+1 Control transfer instructions JMP short/near JMP r JMP m(near) Conditional jump short/near J(E/R)CXZ short LOOP short LOOP(N)E short CALL near CALL r CALL m RET RET i BOUND r,m INTO 1 1 1 1 2 7 8 1 1 3 1 1 10 4 IP1 String instructions LODS REP LODS STOS 3 ~4n 2 5 ~2n 4 REP STOS MOVS ~0.12B 5 ~0.1B 6 REP MOVS SCAS REP SCAS CMPS REP CMPS ~ 0.2B 3 ~5n 6 ~6n ~0.15B 5 ~3n 6 ~3n 6 4 1 8 6 13 6 10 10 10 11 14 Synchronization instructions XADD LOCK XADD LOCK ADD CMPXCHG LOCK CMPXCHG CMPXCHG8B r16,r16,cl r32,r32,cl r64,r64,cl r,r/i m,r m,i r,r/i m,r m,i r,r/m r/m m,r m,r m,r m,r m,r m,r IP0+1 IP0+1 IP0+1 IP0+1 IP0/1 10 4 4 1 9 1 1 10 2 1 IP0+1 IP0+1 2 more if mem 2 more if mem 2 more if mem 1 1 10 1 10 1 1 1 7 35 2 2 2 1-2 2-15 10-20 IP1 2 9 14 3 3 10 7 Page 267 Not in x64 mode Not in x64 mode per byte, best case per byte, best case Silvermont LOCK CMPXCHG8B CMPXCHG16B LOCK CMPXCHG16B Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDTSCP RDPMC RDRAND m,r m,r m,r 11 19 17 14 24 27 1 1 6 15 19+6b 4 31-80 13 15 19 15 IP0/1 IP0/1 Operands μops Unit r m32/m64 m80 m80 r m32/m64 m80 m80 r m m m 1 1 5 59 1 1 8 204 2 1 6 7 1 1 2 3 2 4 2 4 1 1 166 82 a,0 a,b r 1/2 1/2 24 14 59+5b 5 54-108 29 25 19 ~1472 Floating point x87 instructions Move instructions FLD FLD FLD FBLD FST(P) FST(P) 
FSTP FBSTP FXCH FILD FIST(P) FISTTP FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FABS r AX m16 m16 m16 m m r/m r/m r/m 1 1 1 1 FP0+1 Latency Reciprocal throughput 1 3 9 68 1 4 9 239 1 6 9 6 240 174 0.5 1 8 68 0.5 2 9 239 1 2 9 13 1 7 7 6 9 11 4 5 0.5 0.5 240 174 3 5 39 1 1 2 37 1 6 ~9 ~6 ~5 1 FP1 FP0 FP0 Page 268 Remarks Silvermont FCHS FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT r/m r m m m m 1 1 1 1 3 3 3 3 1 1 27 27 18 1 5 5 5 6 7 32-57 32-57 26 Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN 27 15 1 18 110 9 34 61 101 63 Other FNOP WAIT FNCLEX FNINIT 1 2 4 19 20 13-40 40-170 40-170 39-90 80-140 154 45-200 85-190 1 1 1 1 5 6 39 5 1 1 32-57 32-57 26 66 20 13-40 40-170 0.5 4 24 65 Integer MMX and XMM instructions Move instructions MOVD MOVQ MOVD MOVQ MOVD MOVQ MOVD MOVQ MOVQ MOVDQA MOVDQA MOVDQU MOVDQA MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA Operands μops r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm x, x x, m128 m128, x x, m128 mm, x x,mm m64,mm m128,x x, m128 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Unit Mem Mem FP0/1 FP0/1 Mem Mem Mem Mem Mem Page 269 Latency Reciprocal throughput 4 4 3 3 1 1 3 4 3 1 1 ~370 ~370 3 1 1 1 1 0.5 0.5 1 1 1 1 1 1 1 1 Remarks Silvermont PACKSSWB/DW PACKUSWB PACKUSDW PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PMOVSX/ZX BW BD BQ DW DQ PMOVSX/ZX BW BD BQ DW DQ PSHUFB PSHUFB PSHUFW PSHUFL/HW PSHUFD PALIGNR PBLENDVB PBLENDVB PBLENDW MASKMOVQ MASKMOVDQU PMOVMSKB PINSRW PINSRB/D/Q PINSRB/D/Q PEXTRW PEXTRB/W PEXTRQ PEXTRB/W PEXTRD PEXTRQ Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PCMPEQQ PCMPGTQ PMULL/HW PMULHUW PMULL/HW PMULHUW PMULHRSW PMULHRSW PMULLD PMULDQ PMULUDQ PMULUDQ PMADDWD PMADDWD PMADDUBSW PMADDUBSW (x)mm, (x)mm 
x,x (x)mm, (x)mm (x)mm, (x)mm 1 1 1 1 FP0 FP0 FP0 FP0 x,x 1 x,m mm,mm x,x mm,mm,i x,x,i x,x,i x,x,i x,x,xmm0 x,m,xmm0 x,x/m,i mm,mm x,x r32,(x)mm (x)mm,r32,i x,r32,i x,m8,i r32,(x)mm,i r32,x,i r64,x,i m8/16,x,i m32,x,i m64,x,i 1 1 4 1 1 1 1 2 3 1 1 3 1 1 1 1 2 2 2 5 4 4 (x)mm, (x)mm (x)mm, (x)mm (x)mm, m mm, mm x, x/m (x)mm, (x)mm (x)mm,(x)mm x, x x, x mm,mm x, x mm,mm x, x x, x x, x mm,mm x, x mm,mm x, x mm,mm x, x 1 2 3 5 7-8 3-4 1 2 1 1 1 1 1 7 1 1 1 1 1 1 1 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 Mem Mem 1 1 1 1 1 1 1 1 1 1 1 1 5 1 1 1 1 4 1 1 5 1 1 1 1 4 5 1 1 5 1 1 1 1 4 4 7 6 5 8 1 ~370 ~370 4 3 3 5 5 7 FP0/1 FP0/1 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 Page 270 1 4 6 9 5-6 1 4 5 4 5 4 5 11 5 4 5 4 5 4 5 0.5 4 5 6 9 5-6 0.5 4 2 1 2 1 2 11 2 1 2 1 2 1 2 +1 if mem +1 if mem Silvermont PSADBW PSADBW MPSADBW MPSADBW PAVGB/W PMIN/MAXUB PMIN/MAXSW PMIN/PMAX SB/SW/SD UB/UW/UD PHMINPOSUW PABSB PABSW PABSD PSIGNB PSIGNW PSIGND mm,mm x, x x,x,i x,m,i (x)mm,(x)mm (x)mm,(x)mm (x)mm,(x)mm 1 1 3 4 1 1 1 FP0 FP0 4 5 7 FP0/1 FP0/1 FP0/1 1 1 1 1 2 6 6 0.5 0.5 0.5 x,x x,x (x)mm,(x)mm 1 1 1 FP0 FP0/1 1 5 1 1 2 0.5-1 (x)mm,(x)mm 1 FP0/1 1 0.5-1 (x)mm,(x)mm x,x (x)mm,(x)mm (x)mm,i x,i 1 1 2 1 1 FP0/1 FP0 FP0 FP0 1 1 2 1 1 0.5 1 2 1 1 String instructions PCMPESTRI PCMPESTRM PCMPISTRI PCMPISTRM x,x,i x,x,i x,x,i x,x,i 9 8 6 5 FP0 FP0 FP0 FP0 21 17 17 13 21 17 17 13 +1 if mem +1 if mem +1 if mem +1 if mem Encryption instructions PCLMULQDQ x,x,i 8 FP0 10 10 +1 if mem Logic instructions PAND(N) POR PXOR PTEST PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ Other EMMS 9 10 Floating point XMM instructions Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS BLENDPS/PD BLENDVPS/PD BLENDVPS/PD Operands μops Unit x, x x,m128 m128,x x,m128 m128,x x, x x,m32/64 m32/64,x x,m64 m64,x x,x x,x/m,i x,x,xmm0 x,m,xmm0 1 1 1 1 1 1 1 1 1 1 1 1 2 3 FP0/1 Mem Mem Mem Mem FP0/1 Mem Mem Mem Mem FP0 FP0+1 FP0+1 Page 
271 Latency Reciprocal throughput 1 3 4 3 4 1 3 4 3 4 1 1 4 5 0.5 1 1 1 1 0.5 1 1 1 1 1 1 4 5 Remarks Silvermont INSERTPS INSERTPS EXTRACTPS EXTRACTPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD x,x,i x,m32,i r32,x,i m32,x,i r32,x m128,x x,x,i x,x,i x, x x, x x, x x, x 1 3 2 4 1 1 1 1 1 1 1 1 FP0 Mem FP0 FP0 FP0 FP0 FP0 FP0 5 4 ~370 1 1 1 1 1 1 1 5 4 5 1 1 1 1 1 1 1 1 Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T) PS2DQ CVTDQ2PD CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI x, x x, x x, x x, x x, x x, x x, x x, x x,mm mm,x x,mm mm,x x,r32 r32,x xm,r32 r32,x 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 5 4 5 4 5 5 5 5 4 4 5 5 5 5 5 5 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x x, x 1 1 1 1 1 1 4 4 1 1 1 1 1 1 6 6 1 5 1 1 1 FP1 FP1 FP1 FP1 FP1 FP1 3 3 3 4 3 4 6 6 4 5 5 7 19 34 39 69 4 9 3 1 1 1 2 1 2 6 5 1 2 2 4 17 32 39 69 1 8 1 1 1 Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS ADDPD SUBPD ADDSUBPS ADDSUBPD HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D PS/D COMISS/D UCOMISS/D MAXSS/D MINSS/D 1 4 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP0 FP1 FP1 FP1 Page 272 3 +1 if mem +1 if mem Silvermont MAXPS MINPS MAXPD MINPD DPPS DPPD x, x x, x x,x,i x,x,i x,x,i x,x,i 1 1 1 1 9 5 FP1 FP1 FP0 FP0 FP0 FP0 3 4 4 5 15 12 1 2 2 2 12 8 Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS x, x x, x x, x x, x x, x x, x 1 5 1 5 1 5 FP0 FP0 FP0 FP0 FP0 FP0 20 40 35 70 4 9 18 40 33 70 1 8 Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D x, x x, x x, x x, x 1 1 1 1 FP0/1 FP0/1 FP0/1 FP0/1 1 1 1 1 0.5 0.5 0.5 0.5 Other LDMXCSR STMXCSR FXSAVE FXSAVE FXRSTOR FXRSTOR m32 m32 m4096 m4096 m4096 m4096 5 4 115 123 114 123 10 12 132 143 118 122 8 11 132 143 118 122 ROUNDSS/D ROUNDPS/D Page 273 +1 if mem +1 if mem 32 bit mode 64 
bit mode 32 bit mode 64 bit mode

VIA Nano 2000 series
List of instruction timings and μop breakdown

Explanation of column headings:

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

μops: The number of micro-operations from the decoder or ROM. Note that the VIA Nano 2000 processor has no reliable performance monitor counter for μops. Therefore the number of μops cannot be determined except in simple cases.

Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
    I1: Integer add, Boolean, shift, etc.
    I2: Integer add, Boolean, move, jump.
    I12: Can use either I1 or I2, whichever is vacant first.
    MA: Multiply, divide and square root on all operand types.
    MB: Various Integer and floating point SIMD operations.
    MBfadd: Floating point addition subunit under MB.
    SA: Memory store address.
    ST: Memory store.
    LD: Memory load.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
    Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.

Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
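The difference between the Latency and Reciprocal throughput columns can be illustrated with a rough estimate. The following is only a sketch: the function names and the example figures (latency 4, reciprocal throughput 1/2) are hypothetical and are not taken from any row of the tables. It assumes that no cache misses, port conflicts or other bottlenecks occur.

```python
from fractions import Fraction


def dependent_chain_cycles(count, latency):
    """Estimate clock cycles for `count` instructions that each need
    the result of the previous one (a dependency chain)."""
    return count * Fraction(latency)


def independent_cycles(count, reciprocal_throughput):
    """Estimate clock cycles for `count` independent instructions of
    the same kind; accepts table-style values such as "1/2" or "1.5"."""
    return count * Fraction(reciprocal_throughput)


# Hypothetical instruction: latency 4, reciprocal throughput 1/2.
# These figures are assumptions chosen only for illustration.
n = 100
chained = dependent_chain_cycles(n, 4)
parallel = independent_cycles(n, "1/2")
print(f"{n} dependent instructions:   about {float(chained):.0f} clocks")
print(f"{n} independent instructions: about {float(parallel):.0f} clocks")
```

In a dependency chain the latency is the limiting figure, while for independent instructions the reciprocal throughput is the limiting figure. Real code usually falls somewhere between the two estimates.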
Integer instructions Operands μops Port Latency Reciprocal thruoghput Remarks Move instructions MOV MOV r,r r,i 1 1 I2 I2 1 1 1 1 MOV MOV MOV MOV MOV MOV MOV r,m m,r m,i r,sr m,sr sr,r sr,m 1 1 1 LD SA, ST SA, ST 2 2 1 1.5 1.5 1 2 20 20 20 20 Page 274 Latency 4 on pointer register VIA Nano 2000 MOVNTI MOVSX MOVSXD MOVZX MOVSX MOVSXD MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA m,r r,r r,m r,m r,r r,m r,r r,m m r i m sr 1 2 1 2 3 SA, ST 2 1.5 I2 LD, I2 LD I1, I2 LD, I1 I2 1 3 2 2 5 3 20 6 1 1 1 1 2 3 20 SA, ST SA, ST Ld, SA, ST 8 r (E/R)SP m sr LD 9 1 1 r,m 1 SA 1 1 9 1 r 1 I2 1 1 30 30 1-2 1-2 14 14 14 I12 LD I12 1 LD I12 SA ST 5 1 1/2 1 2 1 1 2 1/2 1 1/2 m m m r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m I1 I1 1-2 1-2 2 17 8 15 1.25 4 5 20 9 12 1 1 6 1 LD LD 1 2 3 1 2 3 1 2 1 3 I1 LD I1 LD I1 SA ST I12 LD I12 I12 LD I12 SA ST 5 1 1 5 37 37 22 Page 275 Implicit lock Not in x64 mode Not in x64 mode Not in x64 mode 3 clock latency on input register Not in x64 mode Not in x64 mode Not in x64 mode VIA Nano 2000 DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL MUL IMUL IMUL IMUL IMUL IMUL IMUL IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL SAR ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR 24 23 30 r8 r16 r32 r64 r16,r16 r32,r32 r64,r64 r16,r16,i r32,r32,i r64,r64,i r8 r16 r32 r64 r8 r16 r32 r64 1 1 r,r/i r,m m,r/i r,r/i m,r/i r,i/cl r,i/cl r,1 r,i/cl r16,r16,i r32,r32,i r64,r64,i r64,r64,i r16,r16,cl r32,r32,cl r64,r64,cl r64,r64,cl r,r/i m,r m,i r,r/i m,r m,i r,r 1 2 3 1 2 1 1 1 1 2 2 MA MA MA MA MA MA MA MA MA MA MA MA MA 
MA MA MA MA MA I1 I1 7-9 7-9 7-9 8-10 4-6 4-6 5-7 4-6 4-6 5-7 26 27-35 25-41 148-183 26 27-35 23-39 187-222 1 1 I12 LD I12 1 LD I12 SA ST 5 1 I12 LD I12 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 I1 Page 276 1 1 1 28+3n 11 7 33 43 11 7 33 43 1 2 10 8 3 1 1 2 1 1 2 26 27-35 25-41 148-183 26 27-35 23-39 187-222 1 1 1/2 1 2 1/2 1 1 1 1 28+3n 11 7 33 43 11 7 33 43 1 8 1 2 10 8 2 Not in x64 mode Not in x64 mode Not in x64 mode Extra latency to other ports do. do. do. do. do. do. do. do. do. do. do. do. do. do. do. do. do. VIA Nano 2000 SETcc SETcc CLC STC CMC CLD STD r m I1 2 1 1 3 3 I1 3 3 I2 3 58 3 I2 3 3 55 1-3-8 3 3 1-3-8 1 if not jumping. 3 if jumping. 8 if >2 jumps in 16 bytes block do. do. Control transfer instructions JMP JMP short/near far JMP JMP JMP Conditional jump r m(near) m(far) short/near 1 J(E/R)CXZ LOOP LOOP(N)E short short short 1-3-8 1-3-8 25 1-3-8 1-3-8 25 CALL CALL near far 3 72 3 72 CALL CALL CALL r m(near) m(far) 3 4 72 3 3 72 3 3 39 39 3 3 39 39 13 7 RETN RETN RETF RETF BOUND INTO i i r,m String instructions LODSB/W/D/Q REP LODSB/W/D/Q STOSB/W/D/Q REP STOSB/W/D/Q 1 3n+22 1-2 Small: 2n+2, Big: 6 bytes per clock MOVSB/W/D/Q REP MOVSB/W/D/Q 2 Small: 2n+45, Big: 6 bytes per clock SCASB/W/D/Q REP SCASB 1 2.2n Page 277 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 8 if >2 jumps in 16 bytes block do. 
Not in x64 mode Not in x64 mode VIA Nano 2000 REP SCASW/D/Q Small: 2n+50 Big: 5 bytes per clock CMPSB/W/D/Q REP CMPSB/W/D/Q Other NOP (90) Long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC 6 2.4n+24 1 1 All I12 a,0 a,b 4 53-173 40 1 1/2 25 23 52+5b 4 Blocks all ports 39 40 Floating point x87 instructions Operands μops Port and Unit Latency Reciprocal thruoghput 1 2 2 MB LD MB LD MB 1 3 3 MB MB SA ST MB SA ST 1 I2 1 4 4 54 1 5 5 125 0 7 5 5 6 5 5 1 1 1 54 1 1-2 1-2 125 1 1 MB Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FILD FIST(T)(P) FIST(T)(P) FIST(T)(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR r m32/m64 m80 m80 r m32/m64 m80 m80 r m16 m32 m64 m16 m32 m64 r AX m16 m16 m16 2 13 1 1 I2 MB m m 0 321 195 Arithmetic instructions Page 278 1 10 2 5 3 13 2 1 1 321 195 Remarks VIA Nano 2000 FADD(P) FSUB(R)(P) FMUL(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT r/m r/m r/m r/m r m m m m 1 1 1 1 1 1 1 1 1 MB MA MA MB MB MB MB MB MB 2 4 15-42 1 1 MB 1 2 15-42 1 1 1 1 1 2 4 42 2 1 41 Lower precision: Lat: 4, Thr: 2 151-171 106-155 29 Math FSCALE FXTRACT FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN 39 36-57 73 51-159 270-360 50-200 ~60 ~170 300-370 ~170 Other FNOP WAIT FNCLEX FNINIT 1 1 MB I12 0 1 1/2 57 85 Integer MMX and XMM instructions Operands μops Port and Unit Latency Reciprocal thruoghput 3 2-3 4 2-3 1 2-3 2-3 1 2-3 1 1-2 1 1 1 1 1-2 1 1 Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA MOVDQA r32/64,(x)mm m32/64,(x)mm (x)mm,r32/64 (x)mm,m32/64 (x)mm, (x)mm (x)mm,m64 m64, (x)mm xmm, xmm xmm, m128 1 1 SA ST 1 1 1 1 1 1 LD MB LD SA ST MB LD Page 279 Remarks VIA Nano 2000 MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ PACKSSWB/DW PACKUSWB PUNPCKH/LBW/WD/ DQ PUNPCKH/LQDQ PSHUFB PSHUFW PSHUFL/HW PSHUFD PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW 
PINSRW m128, xmm m128, xmm xmm, m128 xmm, m128 mm, xmm xmm,mm m64,mm m128,xmm 1 1 1 1 1 1 3 3 SA ST SA ST LD LD MB MB 2-3 2-3 2-3 2-3 1 1 ~300 ~300 1-2 1-2 1 1 1 1 2 2 v,v 1 MB 1 1 v,v v,v v,v mm,mm,i x,x,i x,x,i x,x,i mm,mm xmm,xmm r32,(x)mm r32 ,(x)mm,i (x)mm,r32,i 1 1 1 1 1 1 1 MB MB MB MB MB MB MB 1 1 1 1 1 1 1 3 3 9 1 1 1 1 1 1 1 1-3 1-3 1 1 9 v,v v,v 1 1 MB MB 1 1 1 1 v,v v,v v,v v,v v,v v,v v,v v,v v,v v,v v,v v,v 3 3 1 1 1 1 MB MB MB MA MA MA 1 1 1 MB MB MB MB 3 3 1 3 3 3 4 10 2 1 1 1 3 3 1 1 1 1 2 8 1 1 1 1 Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PMULL/HW PMULHUW PMULHRSW PMULUDQ PMADDWD PMADDUBSW PSADBW PAVGB/W PMIN/MAXUB PMIN/MAXSW PABSB PABSW PABSD v,v 1 MB 1 1 PSIGNB PSIGNW PSIGND v,v 1 MB 1 1 Logic instructions PAND(N) POR PXOR PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ v,v v,v v,i x,i 1 1 1 1 MB MB MB MB 1 1 1 1 1 1 1 1 Page 280 VIA Nano 2000 Other EMMS 1 MB 1 Floating point XMM instructions Operands μops Port and Unit Latency Reciprocal thruoghput 1 1 1 1 1 1 1 1 MB LD SA ST LD SA ST MB LD SA ST 1 MB 1 1 1 1 1 1 MB MB MB MB MB MB 1 2-3 2-3 2-3 2-3 1 2-3 2-3 6 6 6 2 1 3 ~300 1 1 1 1 1 1 1 1 1-2 1 1-2 1 1 1-2 1 1 1-2 1-2 1 1 2.5 1 1 1 1 1 1 Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD xmm,xmm xmm,m128 m128,xmm xmm,m128 m128,xmm xmm,xmm x,m32/64 m32/64,x xmm,m64 xmm,m64 m64,xmm m64,xmm xmm,xmm r32,xmm m128,xmm xmm,xmm,i xmm,xmm,i xmm,xmm xmm,xmm xmm,xmm xmm,xmm Conversion CVTPD2PS CVTSD2SS CVTPS2PD CVTSS2SD CVTDQ2PS CVT(T) PS2DQ CVTDQ2PD CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,mm mm,xmm xmm,mm mm,xmm xmm,r32 r32,xmm xmm,r32 r32,xmm Arithmetic ADDSS SUBSS xmm,xmm 3-4 15 3-4 15 3 2 4 3 4 3 4 3 5 4 5 4 
1 MBfadd 2-3 1 Remarks ADDSD SUBSD ADDPS SUBPS ADDPD SUBPD ADDSUBPS ADDSUBPD HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm MAXSS/D MINSS/D MAXPS/D MINPS/D xmm,xmm xmm,xmm xmm,xmm Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm xmm,xmm Logic ANDPS/D ANDNPS/D ORPS/D XORPS/D xmm,xmm xmm,xmm xmm,xmm xmm,xmm Other LDMXCSR STMXCSR FXSAVE FXRSTOR m32 m32 m4096 m4096 1 1 1 1 1 1 1 MBfadd MBfadd 2-3 2-3 2-3 2-3 2-3 5 5 3 4 3 4 15-22 15-36 42-82 24-70 5 14 2 2 1 1 MBfadd MBfadd 3 2 2 1 1 1 MA MA MA MA 33 126 62 122 5 14 33 126 62 122 5 11 MB MB MB MB 1 1 1 1 1 1 1 1 45 13 208 232 29 13 208 232 1 1 1 1 1 1 MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MA MA MA MA MA MA MA MA 1 1 1 1 1 3 3 1 2 1 2 15-22 15-36 42-82 24-70 5 11 1 1

VIA-specific instructions

Instruction       Conditions            Clock cycles, approximately
XSTORE            Data available        160-400 clock giving 8 bytes
XSTORE            No data available     50-80 clock giving 0 bytes
REP XSTORE        Quality factor = 0    4800 clock per 8 bytes
REP XSTORE        Quality factor > 0    19200 clock per 8 bytes
REP XCRYPTECB     128 bits key          44 clock per 16 bytes
REP XCRYPTECB     192 bits key          46 clock per 16 bytes
REP XCRYPTECB     256 bits key          48 clock per 16 bytes
REP XCRYPTCBC     128 bits key          54 clock per 16 bytes
REP XCRYPTCBC     192 bits key          59 clock per 16 bytes
REP XCRYPTCBC     256 bits key          63 clock per 16 bytes
REP XCRYPTCTR     128 bits key          43 clock per 16 bytes
REP XCRYPTCTR     192 bits key          46 clock per 16 bytes
REP XCRYPTCTR     256 bits key          48 clock per 16 bytes
REP XCRYPTCFB     128 bits key          54 clock per 16 bytes
REP XCRYPTCFB     192 bits key          59 clock per 16 bytes
REP XCRYPTCFB     256 bits key          63 clock per 16 bytes
REP XCRYPTOFB     128 bits key          54 clock per 16 bytes
REP XCRYPTOFB     192 bits key          59 clock per 16 bytes
REP XCRYPTOFB     256 bits key          63 clock per 16 bytes
REP XSHA1                               3 clock per byte
REP XSHA256                             4 clock per byte

VIA Nano 3000 series
List of instruction timings and μop breakdown

Explanation of column headings:

Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.

μops: The number of micro-operations from the decoder or ROM. Note that the VIA Nano 3000 processor has no reliable performance monitor counter for μops. Therefore the number of μops cannot be determined except in simple cases.

Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
    I1: Integer add, Boolean, shift, etc.
    I2: Integer add, Boolean, move, jump.
    I12: Can use either I1 or I2, whichever is vacant first.
    MA: Multiply, divide and square root on all operand types.
    MB: Various Integer and floating point SIMD operations.
    MBfadd: Floating point addition subunit under MB.
    SA: Memory store address.
    ST: Memory store.
    LD: Memory load.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
    Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread. Integer instructions Operands μops Port Latency Reciprocal thruoghput MOV MOV r,r r,i 1 1 I2 I12 1 1 1 1/2 MOV MOV MOV MOV MOV MOV r,m m,r m,i r,sr m,sr sr,r 1 1 1 LD SA, ST SA, ST I12 2 2 1 1.5 1.5 1/2 1.5 20 Remarks Move instructions 20 Page 284 Latency 4 on pointer register Nano 3000 MOV MOVNTI MOVSX MOVZX MOVSXD MOVSX MOVSXD MOVZX CMOVcc CMOVcc XCHG XCHG XLAT PUSH PUSH PUSH PUSH PUSHF(D/Q) PUSHA(D) POP POP POP POP POPF(D/Q) POPA(D) LAHF SAHF SALC LEA BSWAP LDS LES LFS LGS LSS PREFETCHNTA PREFETCHT0/1/2 LFENCE MFENCE SFENCE Arithmetic instructions ADD SUB ADD SUB ADD SUB ADC SBB ADC SBB ADC SBB CMP CMP INC DEC NEG NOT INC DEC NEG NOT AAA AAS DAA sr,m m,r r,r r64,r32 r,m r,m r,r r,m r,r r,m m r i m sr r (E/R)SP m sr 1 1 2 1 1 3 3 1 1 3 9 2 SA, ST I12 LD, I12 LD I12 LD, I12 I12 LD, I1 SA, ST SA, ST LD, SA, ST 20 2 1 1 3 2 1 5 3 18 6 1 1 10 20 1.5 1/2 1 1 1 1/2 1 1.5 18 2 1-2 1-2 2 6 2 15 1.25 4 2 11 1 12 1 1 6 1 1 1 1 28 28 1 1 2 LD 3 3 16 1 1 2 r,m r 1 1 m m m 12 1 1 I1 I1 SA I2 LD LD Implicit lock Not in x64 mode Not in x64 mode Not in x64 mode Extra latency to other ports 15 r,r/i r,m m,r/i r,r/i r,m m,r/i r,r/i m,r/i r m 1 2 3 1 2 3 1 2 1 3 12 12 14 I12 LD I12 1 LD I12 SA ST 5 1 I1 LD I1 LD I1 SA ST I12 LD I12 I12 LD I12 SA ST 5 1 1 5 1/2 1 2 1 1 2 1/2 1 1/2 37 22 22 Page 285 Not in x64 mode Not in x64 mode Not in x64 mode Nano 3000 DAS AAD AAM MUL IMUL MUL IMUL MUL IMUL r8 r16 r32 14 7 13 1 3 3 I2 I2 I2 2 3 3 MUL IMUL IMUL IMUL r64 r16,r16 r32,r32 3 1 1 MA I2 I2 8 2 2 8 1 1 IMUL IMUL IMUL r64,r64 r16,r16,i r32,r32,i 1 1 1 MA I2 I2 5 2 2 2 1 1 IMUL DIV DIV DIV DIV IDIV IDIV IDIV IDIV CBW CWDE CDQE CWD CDQ CQO r64,r64,i r8 r16 r32 r64 r8 r16 r32 r64 1 MA MA MA MA MA MA MA MA MA I2 I2 5 22-24 24-28 22-30 145-162 21-24 24-28 18-26 182-200 1 1 Logic instructions AND OR XOR AND OR XOR AND OR XOR TEST TEST SHR SHL 
SAR ROR ROL RCR RCL RCR RCL SHLD SHRD SHLD SHRD SHLD SHRD BT BT BT BTR BTS BTC BTR BTS BTC BTR BTS BTC BSF BSR SETcc SETcc 1 1 r,r/i r,m m,r/i r,r/i m,r/i r,i/cl r,i/cl r,1 r,i/cl r16,r16,i/cl r32,r32,i/cl r64,r64,i/cl r64,r64,i/cl r,r/i m,r m,i r,r/i m,r m,i r,r r8 m 24 24 31 1 I12 2 LD I12 3 LD I12 SA ST 1 I12 2 LD I12 1 I12 1 I1 1 I1 5+2n I1 2 I1 2 I1 16 I1 23 I1 1 I1 6 I1 2 I1 2 I1 8 I1 5 I1 2 I1 1 I1 2 Page 286 1 5 1 1 1 1 28+3n 2 2 32 42 1 2 10 8 2 1 Not in x64 mode Not in x64 mode Not in x64 mode Extra latency to other ports Extra latency to other ports Extra latency to other ports 2 22-24 24-28 22-30 145-162 21-24 24-28 18-26 182-200 1 1 1/2 1 2 1/2 1 1/2 1 1 28+3n 2 2 32 42 1 8 1 2 10 8 2 1 2 Nano 3000 CLC STC CMC CLD STD 3 3 I1 I1 3 3 3 3 Control transfer instructions JMP JMP short/near far 1 14 I2 3 3 50 JMP JMP JMP r m(near) m(far) 2 2 17 I2 3 3 3 3 42 Conditional jump J(E/R)CXZ LOOP LOOP(N)E short/near short short short 1 2 2 5 CALL CALL near far CALL CALL CALL r m(near) m(far) RETN RETN RETF RETF BOUND INTO i i r,m I2 1-3-8 1-3-8 1-3-8 24 1-3-8 1-3-8 1-3-8 24 2 17 3 3 58 2 3 19 3 4 3 3 54 3 4 20 20 9 3 3 3 3 3 49 49 13 7 String instructions LODSB/W/D/Q REP LODSB/W/D/Q STOSB/W/D/Q 2 3n 1 REP STOSB/W/D/Q MOVSB/W/D/Q 1 3n+27 1-2 Small: n+40, Big: 6-7 bytes/clk 3 2 Small: 2n+20, Big: 6-7 bytes/clk 3 1 2.4n Small: 2n+31, Big: 5 bytes/clk REP MOVSB/W/D/Q SCASB/W/D/Q REP SCASB REP SCASW/D/Q Page 287 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 1 if not jumping. 3 if jumping. 8 if >2 jumps in 16 bytes block 8 if >2 jumps in 16 bytes block Not in x64 mode 8 if >2 jumps in 16 bytes block do. 8 if >2 jumps in 16 bytes block do. 
Not in x64 mode Not in x64 mode Nano 3000 CMPSB/W/D/Q REP CMPSB/W/D/Q Other NOP (90) long NOP (0F 1F) PAUSE ENTER ENTER LEAVE CPUID RDTSC RDPMC 5 a,0 a,b 0-1 0-1 2 10 6 2.2n+30 I12 I12 3 0 0 2 55-146 1/2 1/2 6 21 52+5b 2 Sometimes fused 37 40 Floating point x87 instructions Operands μops Port r m32/m64 m80 m80 r m32/m64 m80 m80 r m16 m32 m64 m16 m32 m64 MB LD MB LD MB m m 1 2 2 36 1 3 3 80 1 3 2 2 3 3 3 1 3 1 1 3 5 3 1 1 122 115 r/m 1 Latency Reciprocal thruoghput Move instructions FLD FLD FLD FBLD FST(P) FST(P) FSTP FBSTP FXCH FILD FILD FILD FIST(T)(P) FIST(T)(P) FIST(T)(P) FLDZ FLD1 FLDPI FLDL2E etc. FCMOVcc FNSTSW FNSTSW FLDCW FNSTCW FINCSTP FDECSTP FFREE(P) FNSAVE FRSTOR Arithmetic instructions FADD(P) FSUB(R)(P) r AX m16 m16 m16 MB MB SA ST MB SA ST I2 1 4 4 54 1 5 5 125 0 7 5 5 6 5 5 MB 319 196 1 10 2 1 2 8 2 1 1 319 196 2 1 MB 2 I2 MB 0 MB Page 288 1 1 1 54 1 1-2 1-2 125 1 Remarks Nano 3000 FMUL(P) FDIV(R)(P) FABS FCHS FCOM(P) FUCOM FCOMPP FUCOMPP FCOMI(P) FUCOMI(P) FIADD FISUB(R) FIMUL FIDIV(R) FICOM(P) FTST FXAM FPREM FPREM1 FRNDINT r/m r/m r/m r m m m m Math FSCALE FXTRACT 1 MA MA MB MB MB MB MB MB 4 14-23 1 1 MB 11 2 38 ~130 ~130 27 22 13 37 57 1 1 1 1 1 3 3 3 3 1 15 FSQRT FSIN FCOS FSINCOS F2XM1 FYL2X FYL2XP1 FPTAN FPATAN 2 2 14-23 1 1 1 1 1 2 4 16 2 1 38 Less at lower precision 73 ~150 270-360 50-200 ~50 ~50 300-370 ~180 Other FNOP WAIT FNCLEX FNINIT 1 1 MB I12 0 1 1/2 59 84 Integer MMX and XMM instructions Operands μops Port r,(x)mm m,(x)mm (x)mm,r (x)mm,m v,v (x)mm,m64 m64, (x)mm x,x 1 1 1 1 1 1 1 1 MB SA ST I2 LD MB LD SA ST MB Latency Reciprocal thruoghput Move instructions MOVD MOVD MOVD MOVD MOVQ MOVQ MOVQ MOVDQA Page 289 3 2 4 2 1 2 2 1 1 1-2 1 1 1 1 1-2 1 Remarks Nano 3000 MOVDQA MOVDQA MOVDQU MOVDQU LDDQU MOVDQ2Q MOVQ2DQ MOVNTQ MOVNTDQ MOVNTDQA PACKSSWB/DW PACKUSWB PACKUSDW PUNPCKH/LBW/WD/DQ PUNPCKH/LQDQ PSHUFB PSHUFW PSHUFL/HW PSHUFD PBLENDVB PBLENDW PALIGNR MASKMOVQ MASKMOVDQU PMOVMSKB PEXTRW PEXTRB/D/Q PINSRW PINSRB/D/Q PMOVSX/ZXBW/BD/ 
BQ/WD/WQ/DQ x, m128 m128, x m128, x x, m128 x, m128 mm, x x,mm m64,mm m128,x x,m128 1 1 1 1 1 1 1 2 2 1 LD SA ST SA ST LD LD MB MB 2 2 2 2 2 1 1 ~360 ~360 2 1 1-2 1-2 1 1 1 1 2 2 1 v,v x,x v,v v,v v,v mm,mm,i x,x,i x,x,i x,x,xmm0 x,x,i x,x,i mm,mm x,x r32,(x)mm r32 ,(x)mm,i r32/64,x,i (x)mm,r32,i x,r32/64,i 1 1 1 1 1 1 1 1 1 1 1 MB MB MB MB MB MB MB MB MB MB MB 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 MB MB MB MB 3 3 3 5 5 1 1 1 1 1 1 1 1 2 1 1 1-2 1-2 1 1 1 1 1 x,x 1 MB 1 1 v,v v,v 1 1 MB MB 1 1 1 1 v,v v,v v,v x,x v,v v,v x,x v,v x,x v,v v,v v,v x,x,i v,v 3 3 1 1 1 1 1 1 1 1 7 1 1 1 MB MB MB MB MA MA MA MA MA MA 3 3 1 1 3 3 3 3 3 4 10 2 2 1 3 3 1 1 1 1 1 1 1 2 8 1 1 1 Arithmetic instructions PADD/SUB(U)(S)B/W/D PADDQ PSUBQ PHADD(S)W PHSUB(S)W PHADDD PHSUBD PCMPEQ/GTB/W/D PCMPEQQ PMULL/HW PMULHUW PMULHRSW PMULLD PMULUDQ PMULDQ PMADDWD PMADDUBSW PSADBW MPSADBW PAVGB/W MB MB MB Page 290 Nano 3000 PMIN/MAXSW PMIN/MAXUB PMIN/MAXSB/D PMIN/MAXUW/D PHMINPOSUW PABSB PABSW PABSD PSIGNB PSIGNW PSIGND Logic instructions PAND(N) POR PXOR PTEST PSLL/RL/RAW/D/Q PSLL/RL/RAW/D/Q PSLL/RLDQ v,v v,v x,x x,x x,x 1 1 1 1 1 MB MB MB MB MB 1 1 1 1 2 1 1 1 1 1 v,v 1 MB 1 1 v,v 1 MB 1 1 v,v v,v v,v (x)xmm,i x,i 1 1 1 1 1 MB MB MB MB MB 1 3 1 1 1 1 1 1 1 1 1 MB Operands μops Port x,x x,m128 m128,x x,m128 m128,x x,x x,m32/64 m32/64,x x,m64 x,m64 m64,x m64,x x,x r32,x m128,x x,x,i x,x,i x,x x,x x,x x,x 1 1 1 1 2 1 1 2 2 2 3 1 1 MB LD SA ST LD SA ST MB LD SA ST x,x x,x x,x 2 1 2 Other EMMS 1 Floating point XMM instructions Move instructions MOVAPS/D MOVAPS/D MOVAPS/D MOVUPS/D MOVUPS/D MOVSS/D MOVSS/D MOVSS/D MOVHPS/D MOVLPS/D MOVHPS/D MOVLPS/D MOVLHPS MOVHLPS MOVMSKPS/D MOVNTPS/D SHUFPS SHUFPD MOVDDUP MOVSH/LDUP UNPCKH/LPS UNPCKH/LPD Conversion CVTPD2PS CVTSD2SS CVTPS2PD 2 1 1 1 1 1 1 MB MB MB MB MB MB Page 291 Latency Reciprocal thruoghput 1 2 2 2 2 1 2-3 2-3 6 6 6 2 1 3 ~360 1 1 1 1 1 1 1 1 1 1 1 1 1 1-2 1 1 1-2 1-2 1 1 1-2 1 1 1 1 1 1 5 2 5 2 1 Remarks Nano 3000 CVTSS2SD CVTDQ2PS CVT(T) PS2DQ 
CVTDQ2PD CVT(T)PD2DQ CVTPI2PS CVT(T)PS2PI CVTPI2PD CVT(T) PD2PI CVTSI2SS CVT(T)SS2SI CVTSI2SD CVT(T)SD2SI x,x x,x x,x x,x x,x x,mm mm,x x,mm mm,x x,r32 r32,x x,r32 r32,x 1 1 1 2 x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x x,x 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 3 1 1 MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MBfadd MA MA MA MA MA MA MA MA MA MA MBfadd MBfadd 2 2 2 2 2 2 5 5 3 4 3 4 13 13-20 24 21-38 5 14 2 2 1 1 1 1 1 1 3 3 1 2 1 2 13 13-20 24 21-38 5 11 1 1 MAXSS/D MINSS/D MAXPS/D MINPS/D x,x x,x x,x 1 1 1 MBfadd MBfadd MBfadd 3 2 2 1 1 1 Math SQRTSS SQRTPS SQRTSD SQRTPD RSQRTSS RSQRTPS x,x x,x x,x x,x x,x x,x 1 1 1 1 1 3 MA MA MA MA 33 64 62 122 5 14 33 64 62 122 5 11 Logic ANDPS/D x,x 1 MB 1 1 Arithmetic ADDSS SUBSS ADDSD SUBSD ADDPS SUBPS ADDPD SUBPD ADDSUBPS ADDSUBPD HADDPS HSUBPS HADDPD HSUBPD MULSS MULSD MULPS MULPD DIVSS DIVSD DIVPS DIVPD RCPSS RCPPS CMPccSS/D CMPccPS/D COMISS/D UCOMISS/D MB 2 1 2 2 2 1 2 1 2 3 2 5 4 5 4 4 4 5 4 5 4 1 1 1 2 2 1 1 2 1 1 ANDNPS/D ORPS/D XORPS/D x,x x,x x,x Other LDMXCSR STMXCSR FXSAVE FXRSTOR m32 m32 m4096 m4096 1 1 1 MB MB MB 1 1 1 1 1 1 31 13 97 201

VIA-specific instructions

Instruction       Conditions            Clock cycles, approximately
XSTORE            Data available        160-400 clock giving 8 bytes
XSTORE            No data available     50-80 clock giving 0 bytes
REP XSTORE        Quality factor = 0    1300 clock per 8 bytes
REP XSTORE        Quality factor > 0    5455 clock per 8 bytes
REP XCRYPTECB     128 bits key          15 clock per 16 bytes
REP XCRYPTECB     192 bits key          17 clock per 16 bytes
REP XCRYPTECB     256 bits key          18 clock per 16 bytes
REP XCRYPTCBC     128 bits key          29 clock per 16 bytes
REP XCRYPTCBC     192 bits key          33 clock per 16 bytes
REP XCRYPTCBC     256 bits key          37 clock per 16 bytes
REP XCRYPTCTR     128 bits key          23 clock per 16 bytes
REP XCRYPTCTR     192 bits key          26 clock per 16 bytes
REP XCRYPTCTR     256 bits key          27 clock per 16 bytes
REP XCRYPTCFB     128 bits key          29 clock per 16 bytes
REP XCRYPTCFB     192 bits key          33 clock per 16 bytes
REP XCRYPTCFB     256 bits key          37 clock per 16 bytes
REP XCRYPTOFB     128 bits key          29 clock per 16 bytes
REP XCRYPTOFB     192 bits key          33 clock per 16 bytes
REP XCRYPTOFB     256 bits key          37 clock per 16 bytes
REP XSHA1                               5 clock per byte
REP XSHA256                             5 clock per byte
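The "clock per 16 bytes" figures for the VIA-specific instructions can be converted to an approximate data rate. The sketch below does this for the REP XCRYPTECB figures listed above for the Nano 3000 (15, 17 and 18 clocks per 16 bytes). The clock frequency of 1.6 GHz and the function name are assumptions made only for illustration. The actual clock frequency of a given processor must be substituted, and the result ignores memory bandwidth limits.

```python
def bytes_per_second(clocks_per_block, block_bytes, clock_hz):
    """Convert a 'clocks per block' figure to an approximate data rate."""
    return clock_hz * block_bytes / clocks_per_block


# Assumed clock frequency, for illustration only (not from the tables).
ASSUMED_CLOCK_HZ = 1.6e9

# REP XCRYPTECB on Nano 3000: clocks per 16 bytes, keyed by key size in bits.
NANO_3000_ECB_CLOCKS = {128: 15, 192: 17, 256: 18}

for key_bits, clocks in NANO_3000_ECB_CLOCKS.items():
    rate = bytes_per_second(clocks, 16, ASSUMED_CLOCK_HZ)
    print(f"ECB, {key_bits} bits key: about {rate / 1e6:.0f} MB/s")
```

The same function can be applied to the "clock per byte" figures for REP XSHA1 and REP XSHA256 by setting the block size to 1.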