4. Instruction tables

Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD
and VIA CPUs
By Agner Fog. Technical University of Denmark.
Copyright © 1996 – 2016. Last updated 2016-01-09.
Introduction
This is the fourth in a series of five manuals:
1. Optimizing software in C++: An optimization guide for Windows, Linux and Mac
platforms.
2. Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly
programmers and compiler makers.
4. Instruction tables: Lists of instruction latencies, throughputs and micro-operation
breakdowns for Intel, AMD and VIA CPUs.
5. Calling conventions for different C++ compilers and operating systems.
The latest versions of these manuals are always available from www.agner.org/optimize.
Copyright conditions are listed below.
The present manual contains tables of instruction latencies, throughputs and micro-operation
breakdown and other tables for x86 family microprocessors from Intel, AMD and VIA.
The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower
than the values published elsewhere. The discrepancies can be explained by the following factors:
● My figures are experimental values while figures published by microprocessor vendors may be
based on theory or simulations.
● My figures are obtained with a particular test method under particular conditions. It is possible that
different values can be obtained under other conditions.
● Some latencies are difficult or impossible to measure accurately, especially for memory access
and type conversions that cannot be chained.
● Latencies for moving data from one execution unit to another are listed explicitly in some of my
tables while they are included in the general latencies in some tables published by Intel.
Most values are the same in all microprocessor modes (real, virtual, protected, 16-bit, 32-bit, 64-bit).
Values for far calls and interrupts may be different in different modes. Call gates have not been
tested.
Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices
then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor
systems. This also applies to the XCHG instruction with a memory operand.
If any text in the pdf version of this manual is unreadable, then please refer to the spreadsheet version.
Copyright notice
This series of five manuals is copyrighted by Agner Fog. Public distribution and mirroring is not
allowed. Non-public distribution to a limited audience for educational purposes is allowed. The code
examples in these manuals can be used without restrictions. A GNU Free Documentation License
shall automatically come into force when I die. See www.gnu.org/copyleft/fdl.html
Definition of terms
Instruction
The instruction name is the assembly code for the instruction. Multiple instructions or
multiple variants of the same instruction may be joined into the same line. Instructions
with and without a 'v' prefix to the name have the same values unless otherwise
noted.
Operands
Operands can be different types of registers, memory, or immediate constants. Abbreviations used in the tables are: i = immediate constant, r = any general purpose
register, r32 = 32-bit register, etc., mm = 64 bit mmx register, x or xmm = 128 bit xmm
register, y = 256 bit ymm register, z = 512 bit zmm register, v = any vector register, sr
= segment register, m = any memory operand including indirect operands, m64
means 64-bit memory operand, etc.
Latency
The latency of an instruction is the delay that the instruction generates in a dependency chain. The measurement unit is clock cycles. Where the clock frequency is varied dynamically, the figures refer to the core clock frequency. The numbers listed are
minimum values. Cache misses, misalignment, and exceptions may increase the
clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity may increase the latencies by possibly
more than 100 clock cycles on many processors, except in move, shuffle and Boolean
instructions. Floating point overflow, underflow, denormal or NaN results may give a
similar delay. A missing value in the table means that the value has not been measured or that it cannot be measured in a meaningful way.
Some processors have a pipelined execution unit that is smaller than the largest register size so that different parts of the operand are calculated at different times. Assume, for example, that we have a long dependency chain of 128-bit vector instructions running in a fully pipelined 64-bit execution unit with a latency of 4. The lower 64
bits of each operation will be calculated at times 0, 4, 8, 12, 16, etc. And the upper 64
bits of each operation will be calculated at times 1, 5, 9, 13, 17, etc. as shown in the
figure below. If we look at one 128-bit instruction in isolation, the latency will be 5. But
if we look at a long chain of 128-bit instructions, the total latency will be 4 clock cycles
per instruction plus one extra clock cycle in the end. The latency in this case is listed
as 4 in the tables because this is the value it adds to a dependency chain.
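The arithmetic of this example can be checked with a small model. The following Python sketch is only an illustration under the assumptions stated in the example (two 64-bit halves entering the pipelined unit one clock cycle apart, unit latency 4, and the listed times read as start times). The function name chain_timing and its parameters are hypothetical and are not part of the test programs used for this manual.

```python
def chain_timing(n_instructions, latency=4, parts=2):
    """Model a dependency chain of vector instructions on a narrower unit.

    Each instruction is split into `parts` pieces that enter the pipelined
    execution unit on consecutive clock cycles. Piece p of instruction i
    can start when piece p of instruction i-1 has finished.
    Returns (start_times, total_cycles).
    """
    starts = [[i * latency + p for p in range(parts)]
              for i in range(n_instructions)]
    # The chain is finished when the last piece of the last instruction is done.
    total_cycles = starts[-1][-1] + latency
    return starts, total_cycles


starts, total = chain_timing(5)
print([s[0] for s in starts])  # start times of the lower halves
print([s[1] for s in starts])  # start times of the upper halves
print(total)                   # total time for the chain of 5 instructions
print(chain_timing(1)[1])      # one instruction in isolation
```

In this model a chain of n instructions takes 4·n + 1 clock cycles, so each instruction adds 4 to the chain, which is the value listed in the tables.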
Reciprocal throughput
The throughput is the maximum number of instructions of the same kind that can be
executed per clock cycle when the operands of each instruction are independent of
the preceding instructions. The values listed are the reciprocals of the throughputs,
i.e. the average number of clock cycles per instruction when the instructions are not
part of a limiting dependency chain. For example, a reciprocal throughput of 2 for
FMUL means that a new FMUL instruction can start executing 2 clock cycles after a
previous FMUL. A reciprocal throughput of 0.33 for ADD means that the execution
units can handle 3 integer additions per clock cycle.
The reason for listing the reciprocal values is that this makes comparisons between latency and throughput easier. The reciprocal throughput is also called issue latency.
The values listed are for a single thread or a single core. A missing value in the table
means that the value has not been measured.
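The relation between reciprocal throughput, throughput and latency can be sketched in a few lines of Python. This is a simplified illustration, assuming that no other bottleneck applies. The function names are hypothetical, and the latency value used for FMUL below is an assumed number chosen only for the example.

```python
from fractions import Fraction


def instructions_per_clock(reciprocal_throughput):
    """Throughput is the reciprocal of the value listed in the tables."""
    return 1 / Fraction(reciprocal_throughput)


def min_clocks(n_instructions, latency, reciprocal_throughput, dependent):
    """Lower bound on the clock cycles needed for n instructions of one kind.

    A dependency chain is limited by the latency; independent
    instructions are limited by the reciprocal throughput.
    """
    per_instruction = latency if dependent else reciprocal_throughput
    return n_instructions * Fraction(per_instruction)


# ADD with a reciprocal throughput of 1/3 (listed as 0.33): 3 per clock cycle.
print(instructions_per_clock(Fraction(1, 3)))

# FMUL with reciprocal throughput 2 and an assumed latency of 5.
print(min_clocks(100, latency=5, reciprocal_throughput=2, dependent=False))
print(min_clocks(100, latency=5, reciprocal_throughput=2, dependent=True))
```

Exact fractions are used so that a value such as 1/3 is not distorted by rounding to 0.33.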
μops
Uop or μop is an abbreviation for micro-operation. Processors with out-of-order cores
are capable of splitting complex instructions into μops. For example, a read-modify instruction may be split into a read-μop and a modify-μop. The number of μops that an
instruction generates is important when certain bottlenecks in the pipeline limit the
number of μops per clock cycle.
Execution unit
The execution core of a microprocessor has several execution units. Each execution
unit can handle a particular category of μops, for example floating point additions. The
information about which execution unit a particular μop goes to can be useful for two
purposes. Firstly, two μops cannot execute simultaneously if they need the same execution unit. And secondly, some processors have a latency of an extra clock cycle
when the result of a μop executing in one execution unit is needed as input for a μop
in another execution unit.
Execution port
The execution units are clustered around a few execution ports on most Intel processors. Each μop passes through an execution port to get to the right execution unit. An
execution port can be a bottleneck because it can handle only one μop at a time. Two
μops cannot execute simultaneously if they need the same execution port, even if
they are going to different execution units.
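The port bottleneck can be estimated with a simple count. The sketch below assumes that each μop is bound to exactly one port and that a port accepts one μop per clock cycle; real processors often allow a μop to go to one of several ports, which this model ignores. The port names and the function name are hypothetical.

```python
from collections import Counter


def min_clocks_for_ports(uop_ports):
    """Lower bound on clock cycles for a group of μops.

    `uop_ports` lists the port that each μop must pass through.
    The busiest port determines the bound because it can handle
    only one μop at a time.
    """
    counts = Counter(uop_ports)
    return max(counts.values()) if counts else 0


# Four μops: two need port "p0", the others go to "p1" and "p5".
print(min_clocks_for_ports(["p0", "p0", "p1", "p5"]))
```

The two μops that share port "p0" cannot execute simultaneously, so this group needs at least two clock cycles even though three ports are in use.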
Instruction set
This indicates which instruction set an instruction belongs to. The instruction is only
available in processors that support this instruction set. The different instruction sets
are listed at the end of this manual. Availability in processors prior to 80386 does not
apply for 32-bit and 64-bit operands. Availability in the MMX instruction set does not
apply to 128-bit packed integer instructions, which require SSE2. Availability in the
SSE instruction set does not apply to double precision floating point instructions,
which require SSE2.
32-bit instructions are available in 80386 and later. 64-bit instructions in general purpose registers are available only under 64-bit operating systems. Instructions that use
XMM registers (SSE and later) are only available under operating systems that support this register set. Instructions that use YMM registers (AVX and later) are only
available under operating systems that support this register set.
How the values were measured
The values in the tables are measured with the use of my own test programs, which are available
from www.agner.org/optimize/testp.zip
The time unit for all measurements is CPU clock cycles. Where the clock frequency varies with the
workload, the tests attempt to run at the highest clock frequency. Many Intel processors have a performance counter named "core clock cycles". This counter gives measurements that are independent
of the varying clock frequency. Where no "core clock cycles" counter is available, the "time stamp
counter" is used (RDTSC instruction). In cases where this gives inconsistent results (e.g. in AMD
Bobcat) it is necessary to make the processor boost the clock frequency by executing a large number of instructions (> 1 million) or to turn off the power-saving feature in the BIOS setup.
Instruction throughputs are measured with a long sequence of instructions of the same kind, where
subsequent instructions use different registers in order to avoid dependence of each instruction on
the previous one. The input registers are cleared in the cases where it is impossible to use different
registers. The test code is carefully constructed in each case to make sure that no other bottleneck
is limiting the throughput than the one that is being measured.
Instruction latencies are measured in a long dependency chain of identical instructions where the
output of each instruction is needed as input for the next instruction.
The sequence of instructions should be long, but not so long that it doesn't fit into the level-1 code
cache. A typical length is 100 instructions of the same type. This sequence is repeated in a loop if a
larger number of instructions is desired.
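The difference between the two kinds of test sequences can be illustrated by generating them as text. The following Python sketch is not the test code used for the measurements. It only shows the principle for a two-operand instruction, and the function names, register choices and default count of 100 are assumptions made for the illustration.

```python
def throughput_sequence(mnemonic, dest_regs, src_reg, count=100):
    """Instructions of the same kind that do not depend on the previous one.

    The destination register rotates through `dest_regs`, so each
    instruction depends only on the one len(dest_regs) places earlier.
    """
    return [f"{mnemonic} {dest_regs[i % len(dest_regs)]}, {src_reg}"
            for i in range(count)]


def latency_sequence(mnemonic, reg, count=100):
    """A dependency chain: the output of each instruction feeds the next."""
    return [f"{mnemonic} {reg}, {reg}" for _ in range(count)]


for line in throughput_sequence("imul", ["eax", "ebx", "ecx", "edx"], "esi")[:5]:
    print(line)
for line in latency_sequence("imul", "eax")[:3]:
    print(line)
```

Dividing the measured clock count by the number of instructions gives the reciprocal throughput for the first sequence and the latency for the second, provided that no other bottleneck is limiting.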
It is not possible to measure the latency of a memory read or write instruction with software methods.
It is only possible to measure the combined latency of a memory write followed by a memory read
from the same address. What is measured here is not actually the cache access time, because in
most cases the microprocessor is smart enough to make a "store forwarding" directly from the write
unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of
this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables.
But in fact, the only value that matters for performance optimization is the sum of the write time
and the read time.
A similar problem occurs where the input and the output of an instruction use different types of registers. For example, the MOVD instruction can transfer data between general purpose registers and
XMM vector registers. The value that can be measured is the combined latency of data transfer from
one type of registers to another type and back again (A → B → A). The division of this latency between the A → B latency and the B → A latency is sometimes obvious, sometimes based on guesswork, µop counts, indirect evidence, or triangular sequences such as A → B → Memory → A. In
many cases, however, the division of the total latency between A → B latency and B → A latency is
arbitrary. However, what cannot be measured cannot matter for performance optimization. What
counts is the sum of the A → B latency and the B → A latency, not the individual terms.
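A small numeric illustration shows why the division is immaterial. The Python sketch below assumes a hypothetical measured round trip latency of 6 clock cycles and compares three arbitrary ways of splitting it; the function name and all numbers are made up for the example.

```python
def round_trip_chain_clocks(n_round_trips, lat_a_to_b, lat_b_to_a):
    """Clock cycles for a chain of A -> B -> A transfers."""
    return n_round_trips * (lat_a_to_b + lat_b_to_a)


# Three different divisions of the same hypothetical sum of 6 clock cycles.
for a_to_b, b_to_a in [(1, 5), (3, 3), (5, 1)]:
    print(a_to_b, b_to_a, round_trip_chain_clocks(100, a_to_b, b_to_a))
```

All three divisions predict the same time for the chain, so a measurement of the chain cannot distinguish between them.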
The µop counts are usually measured with the use of the performance monitor counters (PMCs) that
are built into modern microprocessors. The PMCs for VIA processors are undocumented, and the interpretation of these PMCs is based on experimentation.
The execution ports and execution units that are used by each instruction or µop are detected in different ways depending on the particular microprocessor. Some microprocessors have PMCs that
can give this information directly. In other cases it is necessary to obtain this information indirectly by
testing whether a particular instruction or µop can execute simultaneously with another
instruction/µop that is known to go to a particular execution port or execution unit. On some processors, there is a delay for transmitting data from one execution unit (or cluster of execution units) to
another. This delay can be used for detecting whether two different instructions/µops are using the
same or different execution units.
Instruction sets
Explanation of instruction sets for x86 processors
x86: This is the name of the common instruction set, supported by all processors in this lineage.
80186: This is the first extension to the x86 instruction set. New integer instructions: PUSH i, PUSHA, POPA, IMUL r,r,i, BOUND, ENTER, LEAVE, shifts and rotates by immediate ≠ 1.
80286: System instructions for 16-bit protected mode.
80386: The eight general purpose registers are extended from 16 to 32 bits. 32-bit addressing. 32-bit protected mode. Scaled index addressing. MOVZX, MOVSX, IMUL r,r, SHLD, SHRD, BT, BTR, BTS, BTC, BSF, BSR, SETcc.
80486: BSWAP. Later versions have CPUID.
x87: This is the floating point instruction set. Supported when an 8087 or later coprocessor is present. Some 486 processors and all processors since Pentium/K5 have built-in support for floating point instructions without the need for a coprocessor.
80287: FSTSW AX
80387: FPREM1, FSIN, FCOS, FSINCOS.
Pentium: RDTSC, RDPMC.
PPro: Conditional move (CMOV, FCMOV) and fast floating point compare (FCOMI) instructions introduced in Pentium Pro. These instructions are not supported in Pentium MMX, but are supported in all processors with SSE and later.
MMX: Integer vector instructions with packed 8, 16 and 32-bit integers in the 64-bit MMX registers MM0 - MM7, which are aliased upon the floating point stack registers ST(0) - ST(7).
SSE: Single precision floating point scalar and vector instructions in the new 128-bit XMM registers XMM0 - XMM7. PREFETCH, SFENCE, FXSAVE, FXRSTOR, MOVNTQ, MOVNTPS. The use of XMM registers requires operating system support.
SSE2: Double precision floating point scalar and vector instructions in the 128-bit XMM registers XMM0 - XMM7. 64-bit integer arithmetics in the MMX registers. Integer vector instructions with packed 8, 16, 32 and 64-bit integers in the XMM registers. MOVNTI, MOVNTPD, PAUSE, LFENCE, MFENCE.
SSE3: FISTTP, LDDQU, MOVDDUP, MOVSHDUP, MOVSLDUP, ADDSUBPS, ADDSUBPD, HADDPS, HADDPD, HSUBPS, HSUBPD.
SSSE3: (Supplementary SSE3): PSHUFB, PHADDW, PHADDSW, PHADDD, PMADDUBSW, PHSUBW, PHSUBSW, PHSUBD, PSIGNB, PSIGNW, PSIGND, PMULHRSW, PABSB, PABSW, PABSD, PALIGNR.
64 bit: This instruction set is called x86-64, x64, AMD64 or EM64T. It defines a new 64-bit mode with 64-bit addressing and the following extensions: The general purpose registers are extended to 64 bits, and the number of general purpose registers is extended from eight to sixteen. The number of XMM registers is also extended from eight to sixteen, but the number of MMX and ST registers is still eight. Data can be addressed relative to the instruction pointer. There is no way to get access to these extensions in 32-bit mode.
Most instructions that involve segmentation are not available in 64 bit mode. Direct far jumps and calls are not allowed, but indirect far jumps, indirect far calls and far returns are allowed. These are used in system code for switching mode. Segment registers DS, ES, and SS cannot be used. The FS and GS segments and segment prefixes are available in 64 bit mode and are used for addressing thread environment blocks and processor environment blocks.
Instructions not available in 64 bit mode: The following instructions are not available in 64-bit mode: PUSHA, POPA, BOUND, INTO, BCD instructions: AAA, AAS, DAA, DAS, AAD, AAM, undocumented instructions (SALC, ICEBP, 82H alias for 80H opcode), SYSENTER, SYSEXIT, ARPL. On some early Intel processors, LAHF and SAHF are not available in 64 bit mode. Increment and decrement register instructions cannot be coded in the short one-byte opcode form because these codes have been reassigned as REX prefixes.
Most instructions that involve segmentation are not available in 64 bit mode. Direct far jumps and calls are not allowed, but indirect far jumps, indirect far calls and far returns are allowed. These are used in system code for switching mode. PUSH CS, PUSH DS, PUSH ES, PUSH SS, POP DS, POP ES, POP SS, LDS and LES instructions are not allowed. CS, DS, ES and SS prefixes are allowed but ignored. The FS and GS segments and segment prefixes are available in 64 bit mode and are used for addressing thread environment blocks and processor environment blocks.
Monitor: The instructions MONITOR and MWAIT are available in some Intel and AMD multiprocessor CPUs with SSE3.
SSE4.1: MPSADBW, PHMINPOSUW, PMULDQ, PMULLD, DPPS, DPPD, BLEND.., PMIN.., PMAX.., ROUND.., INSERT.., EXTRACT.., PMOVSX.., PMOVZX.., PTEST, PCMPEQQ, PACKUSDW, MOVNTDQA
SSE4.2: CRC32, PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM, PCMPGTQ, POPCNT.
AES: AESDEC, AESDECLAST, AESENC, AESENCLAST, AESIMC, AESKEYGENASSIST.
CLMUL: PCLMULQDQ.
AVX: The 128-bit XMM registers are extended to 256-bit YMM registers with room for further extension in the future. The use of YMM registers requires operating system support. Floating point vector instructions are available in 256-bit versions. Almost all previous XMM instructions now have two versions: with and without zero-extension into the full YMM register. The zero-extension versions have three operands in most cases. Furthermore, the following instructions are added in AVX: VBROADCASTSS, VBROADCASTSD, VEXTRACTF128, VINSERTF128, VLDMXCSR, VMASKMOVPS, VMASKMOVPD, VPERMILPD, VPERMIL2PD, VPERMILPS, VPERMIL2PS, VPERM2F128, VSTMXCSR, VZEROALL, VZEROUPPER.
AVX2: Integer vector instructions are available in 256-bit versions. Furthermore, the following instructions are added in AVX2: ANDN, BEXTR, BLSI, BLSMSK, BLSR, BZHI, INVPCID, LZCNT, MULX, PEXT, PDEP, RORX, SARX, SHLX, SHRX, TZCNT, VBROADCASTI128, VBROADCASTSS, VBROADCASTSD, VEXTRACTI128, VGATHERDPD, VGATHERQPD, VGATHERDPS, VGATHERQPS, VPGATHERDD, VPGATHERQD, VPGATHERDQ, VPGATHERQQ, VINSERTI128, VPERM2I128, VPERMD, VPERMPD, VPERMPS, VPERMQ, VPMASKMOVD, VPMASKMOVQ, VPSLLVD, VPSLLVQ, VPSRAVD, VPSRLVD, VPSRLVQ.
FMA3: (FMA): Fused multiply and add instructions: VFMADDxxxPD, VFMADDxxxPS, VFMADDxxxSD, VFMADDxxxSS, VFMADDSUBxxxPD, VFMADDSUBxxxPS, VFMSUBADDxxxPD, VFMSUBADDxxxPS, VFMSUBxxxPD, VFMSUBxxxPS, VFMSUBxxxSD, VFMSUBxxxSS, VFNMADDxxxPD, VFNMADDxxxPS, VFNMADDxxxSD, VFNMADDxxxSS, VFNMSUBxxxPD, VFNMSUBxxxPS, VFNMSUBxxxSD, VFNMSUBxxxSS.
FMA4: Same as Intel FMA, but with 4 different operands according to a preliminary Intel specification which is now supported only by AMD. Intel's FMA specification has later been changed to FMA3, which is now also supported by AMD.
MOVBE: MOVBE
POPCNT: POPCNT
PCLMUL: PCLMULQDQ
XSAVE:
XSAVEOPT:
RDRAND: RDRAND
RDSEED: RDSEED
BMI1: ANDN, BEXTR, BLSI, BLSMSK, BLSR, LZCNT, TZCNT
BMI2: BZHI, MULX, PDEP, PEXT, RORX, SARX, SHRX, SHLX
ADX: ADCX, ADOX, CLAC
AVX512F: The 256-bit YMM registers are extended to 512-bit ZMM registers. The number of vector registers is extended to 32 in 64-bit mode, while there are still only 8 vector registers in 32-bit mode. 8 new vector mask registers k0 – k7. Masked vector instructions. Many new instructions. Single- and double precision floating point vectors are always supported. Other instructions are supported if the various optional AVX512 variants, listed below, are supported as well.
AVX512BW: Vectors of 8-bit and 16-bit integers in ZMM registers.
AVX512DQ: Vectors of 32-bit and 64-bit integers in ZMM registers.
AVX512VL: The vector operations defined for 512-bit vectors in the various AVX512 subsets, including masked operations, can be applied to 128-bit and 256-bit vectors as well.
AVX512CD: Conflict detection instructions
AVX512ER: Approximate exponential function, reciprocal and reciprocal square root
AVX512PF: Gather and scatter prefetch
SHA: Secure hash algorithm
MPX: Memory protection extensions
SMAP: CLAC, STAC
CVT16: VCVTPH2PS, VCVTPS2PH.
3DNow: (AMD only. Obsolete). Single precision floating point vector instructions in the 64-bit MMX registers. Only available on AMD processors. The 3DNow instructions are: FEMMS, PAVGUSB, PF2ID, PFACC, PFADD, PFCMPEQ/GT/GE, PFMAX, PFMIN, PFRCP/IT1/IT2, PFRSQRT/IT1, PFSUB, PFSUBR, PI2FD, PMULHRW, PREFETCH/W.
3DNowE: (AMD only. Obsolete). PF2IW, PFNACC, PFPNACC, PI2FW, PSWAPD.
PREFETCHW: This instruction has survived from 3DNow and now has its own feature name
PREFETCHWT1: PREFETCHWT1
SSE4A: (AMD only). EXTRQ, INSERTQ, LZCNT, MOVNTSD, MOVNTSS, POPCNT. (POPCNT shared with Intel SSE4.2).
XOP: (AMD only). VFRCZPD, VFRCZPS, VFRCZSD, VFRCZSS, VPCMOV, VPCOMB, VPCOMD, VPCOMQ, VPCOMW, VPCOMUB, VPCOMUD, VPCOMUQ, VPCOMUW, VPHADDBD, VPHADDBQ, VPHADDBW, VPHADDDQ, VPHADDUBD, VPHADDUBQ, VPHADDUBW, VPHADDUDQ, VPHADDUWD, VPHADDUWQ, VPHADDWD, VPHADDWQ, VPHSUBBW, VPHSUBDQ, VPHSUBWD, VPMACSDD, VPMACSDQH, VPMACSDQL, VPMACSSDD, VPMACSSDQH, VPMACSSDQL, VPMACSSWD, VPMACSSWW, VPMACSWD, VPMACSWW, VPMADCSSWD, VPMADCSWD, VPPERM, VPROTB, VPROTD, VPROTQ, VPROTW, VPSHAB, VPSHAD, VPSHAQ, VPSHAW, VPSHLB, VPSHLD, VPSHLQ, VPSHLW.
Microprocessor versions tested
The tables in this manual are based on testing of the following microprocessors
Processor name          Microarchitecture code name  Family number (hex)  Model number (hex)  Comment
AMD K7 Athlon                                        6                    6                   Step. 2, rev. A5
AMD K8 Opteron                                       F                    5                   Stepping A
AMD K10 Opteron                                      10                   2                   2350, step. 1
AMD Bulldozer           Bulldozer, Zambezi           15                   1                   FX-6100, step 2
AMD Piledriver          Piledriver                   15                   2                   FX-8350, step 0. And others
AMD Steamroller         Steamroller, Kaveri          15                   30                  A10-7850K, step 1
AMD Bobcat              Bobcat                       14                   1                   E350, step. 0
AMD Kabini              Jaguar                       16                   0                   A4-5000, step 1
Intel Pentium           P5                           5                    2
Intel Pentium MMX       P5                           5                    4                   Stepping 4
Intel Pentium II        P6                           6                    6
Intel Pentium III       P6                           6                    7
Intel Pentium 4         Netburst                     F                    2                   Stepping 4, rev. B0
Intel Pentium 4 EM64T   Netburst, Prescott           F                    4                   Xeon. Stepping 1
Intel Pentium M         Dothan                       6                    D                   Stepping 6, rev. B1
Intel Core Duo          Yonah                        6                    E                   Not fully tested
Intel Core 2 (65 nm)    Merom                        6                    F                   T5500, Step. 6, rev. B2
Intel Core 2 (45 nm)    Wolfdale                     6                    17                  E8400, Step. 6
Intel Core i7           Nehalem                      6                    1A                  i7-920, Step. 5, rev. D0
Intel 2nd gen. Core     Sandy Bridge                 6                    2A                  i5-2500, Step 7
Intel 3rd gen. Core     Ivy Bridge                   6                    3A                  i7-3770K, Step 9
Intel 4th gen. Core     Haswell                      6                    3C                  i7-4770K, step. 3
Intel 5th gen. Core     Broadwell                    6                    56                  D1540, step 2
Intel 6th gen. Core     Skylake                      6                    5E                  Step. 3
Intel Atom 330          Diamondville                 6                    1C                  Step. 2
Intel Bay Trail         Silvermont                   6                    37                  Step. 3
VIA Nano L2200                                       6                    F                   Step. 2
VIA Nano L3050          Isaiah                       6                    F                   Step. 8 (prerelease sample)
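The family, model and stepping numbers of a processor are reported by the CPUID instruction (leaf 1, register EAX). The following Python sketch shows how these fields are commonly combined into the displayed family and model numbers. It is a minimal illustration and not part of the test programs; the function name is hypothetical, and the three example signature values are assumed to be typical for a Skylake, a Haswell and a Bulldozer processor.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is added only when the base family is F.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model supplies the upper four bits for family 6 and F.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


for signature in (0x000506E3, 0x000306C3, 0x00600F12):
    family, model, stepping = decode_cpuid_signature(signature)
    print(f"family {family:X}, model {model:X}, stepping {stepping}")
```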
AMD K7
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
Integer instructions
Instruction  Operands   Ops  Latency  Reciprocal throughput  Execution unit  Notes
Move instructions
MOV          r,r        1    1        1/3                    ALU
MOV          r,i        1    1        1/3                    ALU
MOV          r8,m8      1    4        1/2                    ALU, AGU        Any addr. mode. Add 1 clk if code segment base ≠ 0
MOV          r16,m16    1    4        1/2                    ALU, AGU        do.
MOV          r32,m32    1    3        1/2                    AGU             do.
MOV          m8,r8H     1    8        1/2                    AGU             AH, BH, CH, DH
MOV          m8,r8L     1    2        1/2                    AGU             Any other 8-bit register
MOV          m16/32,r   1    2        1/2                    AGU             Any addressing mode
MOV          m,i        1    2        1/2                    AGU
MOV          r,sr       1    2        1
MOV
MOVZX, MOVSX
MOVZX, MOVSX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POP
POPF(D)
POPA(D)
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
sr,r/m
r,r
r,m
r,r
r,m
r,r
6
1
1
1
1
3
9-13
1
4
1
r,m
3
2
1
1
2
2
1
9
2
3
6
9
2
9
2
1
4
2
1
10
1
16
5
1
1
7
1
1
7
1
r8/m8
1
1
1
1
1
1
1
1
1
1
9
12
16
4
31
3
1
7
5
6
7
5
13
3
r16/m16
r32/m32
r16,r16/m16
3
3
2
3
4
3
r
i
m
sr
r
m
DS/ES/FS/GS
SS
r16,[m]
r32,[m]
r,m
r
r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m
2
3
2
3
2
1
1
8
1/3
1/2
1/3
1/2
1
16
1
1
1
1
1
4
1
1
10
18
1
4
1
1/3
2
2
1
9
1/3
1/3
1/2
2.5
1/3
1/2
2.5
1/3
1/2
1/3
3
5
6
7
ALU
ALU, AGU
ALU
ALU, AGU
ALU
Timing depends
ALU, AGU on hw
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
AGU
Any addr. size
AGU
Any addr. size
ALU
ALU
ALU
ALU
2
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0
ALU
ALU0
2
3
2
ALU0_1
ALU0_1
ALU0
latency ax=3,
dx=4
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE
CWD, CDQ
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC, BTR, BTS
BSF
BSR
r32,r32/m32
r16,(r16),i
r32,(r32),i
r16,m16,i
r32,m32,i
r8/m8
r16/m16
r32/m32
r8
r16
r32
m8
m16
m32
2
2
2
3
3
32
47
79
41
56
88
42
57
89
1
1
r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
r,r
r,r
1
1
1
1
1
1
1
1
1
1
9
7
9
7
1
1
10
9
9
8
6
7
8
1
1
5
2
5
4
8
19
23
4
4
5
24
24
40
17
25
41
17
25
41
1
1
1
1
7
1
1
1
7
1
1
1
4
3
3
3
7
7
5
8
6
7
4
4
7
1
2
7
7
6
7
9
2.5
1
2
2
2
23
23
40
17
25
41
17
25
41
1/3
1/3
ALU0
ALU0
ALU0
ALU0
ALU0
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
1/3
1/2
2.5
1/3
1/2
1/3
2.5
1/3
1/3
1/3
4
3
3
3
3
4
4
4
4
3
2
3
3
1/3
1/2
2
1
2
2
3
7
9
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
BSF
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
r,m
r,m
r
m
Control transfer instructions
JMP
short/near
JMP
JMP
JMP
20
23
1
1
1
1
2
3
8
10
1
1
1
8
10
1/3
1/2
1/3
1/3
1
2
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
2
ALU
low values = real
mode
far
r
m(near)
16-20
1
1
23-32
m(far)
short/near
short
short
near
17-21
1
2
7
3
25-33
CALL
CALL
CALL
far
r
m(near)
16-22
4
5
23-32
3
3
CALL
RETN
RETN
m(far)
16-22
2
2
24-33
3
3
15-23
24-35
i
15-24
32
33
24-35
81
42
m
6
2
INTO
2
2
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
4
5
4
3
7
4
5
5
7
6
JMP
Jcc
J(E)CXZ
LOOP
CALL
i
RETF
RETF
IRET
INT
BOUND
i
2
2
3-4
2
2
2
2
1
3
1-4
2
2
6
3-4
1/3 - 2
1/3 - 2
3-4
2
ALU
ALU, AGU
ALU
ALU
ALU
ALU
low values = real
mode
rcp. t.= 2 if jump
rcp. t.= 2 if jump
low values = real
mode
3
3
ALU
ALU, AGU
low values = real
mode
3
3
2
2
2
1
3
1-4
2
2
6
3-4
ALU
ALU
low values = real
mode
low values = real
mode
real mode
real mode
values are for no
jump
values are for no
jump
values per count
values per count
values per count
values per count
values per count
Other
NOP (90)
Long NOP (0F 1F)
ENTER
1
1
i,0
LEAVE
CLI
STI
CPUID
RDTSC
RDPMC
3
8-9
16-17
19-28
5
9
0
0
12
1/3
1/3
12
ALU
ALU
12
3 ops, 5 clk if 16
bit
3
5
27
44-74
11
11
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1
Operands
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
FCMOVcc
FFREE
FINCSTP, FDECSTP
st0,r
r
FNSTSW
FSTSW
FNSTSW
FNSTCW
Ops
1
1
7
30
1
1
10
260
1
1
1
1
Latency Reciprocal
throughput
2
4
16
41
2
3
7
0
9
7
1/2
1/2
4
39
1/2
1
5
188
0.4
1
1
1
Execution
unit
Notes
FA/M
FANY
FA/M
FMISC
FMISC
FMISC, FA/M
FMISC
42
Low latency immediately after
FMISC, FA/M FCOMI
FANY
FANY
Low latency immediately after
FMISC, ALU FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
faster if
FMISC, ALU unchanged
4
4
4
4
1
1-2
1
2
FADD
FADD,FMISC
FMUL
FMUL,FMISC
11-25
8-22
FMUL
9
1
1
6
AX
AX
m16
m16
2
3
2
3
6-12
6-12
FLDCW
m16
14
Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
r/m
m
r/m
m
1
2
1
2
FDIV(R)(P)
r/m
1
0
5
1/3
1/3
12
12
8
1
Low values are
for round divisors
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
m
2
1
1
1
1
2
1
2
5
1
1
12-26
2
2
2
3
Math
FSQRT
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
1
44
51
76
46
72
5
7
8
49
63
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
1
1
7
25
76
65
44
85
r/m
r
m
9-23
1
1
1
1
1
1
2
3
8
8
FMUL,FMISC
FMUL
FADD
FADD
FADD
35
90-100
90-100
100-150
100-200
160-170
8
11
27
126
147
12
FMUL
0
0
1/3
1/3
24
92
147
120
59
87
FANY
ALU
FMISC
FMISC
2
10
7-10
8-11
do.
FADD, FMISC
FADD
FMISC, ALU
FMUL
FMUL
Integer MMX instructions
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVNTQ
PACKSSWB/DW
PACKUSWB
Operands
Ops
Latency Reciprocal
throughput
r32, mm
mm, r32
mm,m32
m32, r
mm,mm
mm,m64
m64,mm
m,mm
2
2
1
1
1
1
1
1
7
9
mm,r/m
1
2
2
Execution
unit
2
2
1/2
1
1/2
1/2
1
2
FMISC, ALU
FANY, ALU
FANY
FMISC
FA/M
FANY
FMISC
FMISC
2
FA/M
Notes
PUNPCKH/LBW/WD
PSHUFW
MASKMOVQ
PMOVMSKB
PEXTRW
PINSRW
mm,r/m
mm,mm,i
mm,mm
r32,mm
r32,mm,i
mm,r32,i
1
1
32
3
2
2
mm,r/m
mm,r/m
2
2
5
12
2
1/2
24
3
2
2
FA/M
FA/M
FADD
FMISC, ALU
FA/M
1
1
2
2
1/2
1/2
FA/M
FA/M
mm,r/m
mm,r/m
mm,r/m
mm,r/m
mm,r/m
1
1
1
1
1
3
3
2
2
3
1
1
1/2
1/2
1
FMUL
FMUL
FA/M
FA/M
FADD
mm,r/m
1
2
1/2
FA/M
mm,i/mm/m
1
2
1/2
FA/M
1/3
FANY
Arithmetic instructions
PADDB/W/D PADDSB/W
PADDUSB/W
PSUBB/W/D PSUBSB/W
PSUBUSB/W
PCMPEQ/GT B/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PAVGB/W
PMIN/MAX SW/UB
PSADBW
Logic
PAND PANDN POR
PXOR
PSLL/RLW/D/Q
PSRAW/D
Other
EMMS
1
Floating point XMM instructions
Instruction
Move instructions
MOVAPS
MOVAPS
MOVAPS
MOVUPS
MOVUPS
MOVUPS
MOVSS
MOVSS
MOVSS
MOVHLPS, MOVLHPS
MOVHPS, MOVLPS
MOVHPS, MOVLPS
MOVNTPS
MOVMSKPS
SHUFPS
Operands
Ops
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
m,r
r32,r
r,r/m,i
2
2
2
2
5
5
1
2
1
1
1
1
2
3
3
Latency
Reciprocal throughput
2
2
2
4
3
2
3
1
2
2
1
2
2
1
1
1
1/2
1/2
1
4
2
3
Execution
unit
FA/M
FMISC
FMISC
FA/M
FA/M
FANY FMISC
FMISC
FA/M
FMISC
FMISC
FMISC
FADD
FMUL
Notes
UNPCK H/L PS
Conversion
CVTPI2PS
CVT(T)PS2PI
CVTSI2SS
CVT(T)SS2SI
Arithmetic
ADDSS SUBSS
ADDPS SUBPS
MULSS
MULPS
r,r/m
2
3
xmm,mm
mm,xmm
xmm,r32
r32,xmm
1
1
4
2
4
6
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
4
4
4
4
3
FMUL
10
3
FMISC
FMISC
FMISC
FMISC
1
2
1
2
FADD
FADD
FMUL
FMUL
DIVSS
DIVPS
RCPSS
RCPPS
MAXSS MINSS
MAXPS MINPS
CMPccSS
CMPccPS
COMISS UCOMISS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
1
2
1
2
1
11-16
18-30
3
3
2
2
2
2
2
8-13
18-30
1
2
1
2
1
2
1
FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD
FADD
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
2
2
2
FMUL
Math
SQRTSS
SQRTPS
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
19
36
3
3
16
36
1
2
FMUL
FMUL
FMUL
FMUL
Other
LDMXCSR
STMXCSR
m
m
8
3
Low values are for round divisors, e.g. powers of 2.
do.
9
10
3DNow instructions (obsolete)
Instruction
Operands
Move and convert instructions
PREFETCH(W)
m
PF2ID
mm,mm
PI2FD
mm,mm
PF2IW
mm,mm
PI2FW
mm,mm
PSWAPD
mm,mm
Ops
1
1
1
1
1
1
Latency
Reciprocal throughput
5
5
5
5
2
1/2
1
1
1
1
1/2
Execution
unit
AGU
FMISC
FMISC
FMISC
FMISC
FA/M
Notes
3DNow E
3DNow E
3DNow E
Integer instructions
PAVGUSB
PMULHRW
mm,mm
mm,mm
1
1
2
3
1/2
1
FA/M
FMUL
Floating point instructions
PFADD/SUB/SUBR
PFCMPEQ/GE/GT
PFMAX/MIN
PFMUL
PFACC
PFNACC, PFPNACC
PFRCP
PFRCPIT1/2
PFRSQRT
PFRSQIT1
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
mm,mm
1
1
1
1
1
1
1
1
1
1
4
2
2
4
4
4
3
4
3
4
1
1
1
1
1
1
1
1
1
1
FADD
FADD
FADD
FMUL
FADD
FADD
FMUL
FMUL
FMUL
FMUL
Other
FEMMS
mm,mm
1
1/3
FANY
3DNow E
K8
AMD K8
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
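The difference between latency and reciprocal throughput can be illustrated with a minimal sketch. The function names and the sample numbers (latency 4, reciprocal throughput 1 or 1/3) are assumptions chosen for illustration, not values quoted from a particular table row, and the results are lower bounds only, since other pipeline bottlenecks may dominate.

```python
from fractions import Fraction

def chain_cycles(n, latency):
    """Lower bound in clock cycles for n instructions that each depend
    on the result of the previous one (a dependency chain)."""
    return n * latency

def independent_cycles(n, recip_throughput):
    """Lower bound in clock cycles for n independent instructions of the
    same kind, limited only by the reciprocal throughput."""
    return n * Fraction(recip_throughput)

def instructions_per_clock(recip_throughput):
    """A reciprocal throughput of 1/3 corresponds to 3 instructions per clock."""
    return 1 / Fraction(recip_throughput)

# Illustrative values only (assumed, not measured here).
print(chain_cycles(100, 4))             # dependent chain: latency-bound
print(independent_cycles(100, 1))       # independent: throughput-bound
print(instructions_per_clock("1/3"))    # three instructions per clock cycle
```

In a dependency chain the latency column is the relevant figure; when the instructions are independent, the reciprocal throughput column gives the better estimate.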
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
Operands
Ops
Latency
Reciprocal throughput
Execution unit
r,r
r,i
r8,m8
r16,m16
r32,m32
r64,m64
m8,r8H
1
1
1
1
1
1
1
1
1
4
4
3
3
8
1/3
1/3
1/2
1/2
1/2
1/2
1/2
ALU
ALU
ALU, AGU
ALU, AGU
AGU
AGU
AGU
m8,r8L
m16/32/64,r
m,i
m64,i32
r,sr
sr,r/m
m,r
1
1
1
1
1
6
1
3
3
3
3
2
9-13
1/2
1/2
1/2
1/2
1/2-1
8
2-3
AGU
AGU
AGU
AGU
AGU
Notes
Any addressing mode. Add 1 clock if code segment base ≠ 0
AH, BH, CH, DH
Any other 8-bit register
Any addressing mode
MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
r,r
r,m
r64,r32
r64,m32
r,r
r,m
r,r
1
1
1
1
1
1
3
1
4
1
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
SFENCE
LFENCE
MFENCE
IN
OUT
r,m
3
2
1
1
2
2
5
9
2
3
4-6
7-9
25
9
2
1
1
4
1
1
10
1
1
1
6
1
7
270
300
16
5
1
1
1
1
2
4
1
1
8
28
10
4
3
2
2
3
1
1
1
1
1
1
1
1
1
1
1
1
9
12
16
4
1
1
7
1
1
7
1
r
i
m
sr
r
m
DS/ES/FS/GS
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
SS
r16,[m]
r32,[m]
r64,[m]
r,m
r
m
m
r,i/DX
i/DX,r
r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m
1
2
1
1
7
5
6
7
5
1/3
1/2
1/3
1/2
1/3
1/2
1
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
16
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
AGU
AGU
AGU
ALU
ALU
ALU
1
1
1
1
2
4
1
1
8
28
10
4
1
1/3
1/3
2
1/3
1/3
9
1/3
1/2
1/2
8
5
16
1/3
1/2
2.5
1/3
1/2
2.5
1/3
1/2
1/3
3
5
6
7
ALU
AGU
AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0
Timing depends on hardware
Any address size
Any address size
Any address size
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL, SHR, SAR, ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
31
1
3
2
2
1
1
1
2
1
1
3
3
3
31
46
78
143
40
55
87
152
41
56
88
153
1
1
13
3
3-4
3
4-5
3
3
4
4
3
4
r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8
r16
r32
r64
m8
m16
m32
m64
15
23
39
71
17
25
41
73
17
25
41
73
1
1
1
2
1
2
1
1
2
1
1
2
2
2
2
15
23
39
71
17
25
41
73
17
25
41
73
1/3
1/3
ALU
ALU0
ALU0_1
ALU0_1
ALU0_1
ALU0
ALU0
ALU0_1
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0_1
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
1
1
1
1
1
1
1
1
1
1
9
7
9
7
1
1
7
1
1
1
7
1
1
1
3
3
4
3
1/3
1/2
2.5
1/3
1/2
1/3
2.5
1/3
1/3
1/3
3
3
4
3
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
1
1
10
9
9
8
6
7
7
7
9
8
7
8
3
3
3
4
4
4
4
3
3
3
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
latency ax=3, dx=4
latency rax=4, rdx=5
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC
BTR, BTS
BSF
BSF
BSR
BSF
BSF
BSF
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
m,r
r16/32,r
r64,r
r,r
r16,m
r32,m
r64,m
r,m
r
m
Control transfer instructions
JMP
short/near
JMP
JMP
JMP
8
1
1
5
2
5
4
8
8
21
22
28
20
22
25
28
1
1
1
1
1
2
6
1
2
7
7
5
8
8
9
10
8
9
10
10
1
1
1
far
r
m(near)
16-20
1
1
23-32
m(far)
short/near
short
short
near
17-21
1
2
7
3
25-33
CALL
CALL
CALL
far
r
m(near)
16-22
4
5
23-32
3
3
CALL
RETN
RETN
m(far)
16-22
2
2
24-33
3
3
15-23
24-35
15-24
32
33
6
2
24-35
81
42
JMP
Jcc
J(E/R)CXZ
LOOP
CALL
i
RETF
RETF
IRET
INT
BOUND
INTO
i
i
m
3
1/3
1/2
2
1
2
2
5
3
8
9
10
8
9
10
10
1/3
1/2
1/3
1/3
1/3
1/3
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
2
ALU
low values = real mode
2
2
3-4
2
1/3 - 2
1/3 - 2
3-4
2
ALU
ALU
ALU
ALU
low values = real mode
recip. thrp.= 2 if jump
recip. thrp.= 2 if jump
low values = real mode
3
3
ALU
ALU, AGU
low values = real mode
3
3
2
2
String instructions
ALU
ALU, AGU
ALU
ALU
low values = real mode
low values = real mode
real mode
real mode
values are for no jump
values are for no jump
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
4
2
5
2
4
2
1.5 - 2 0.5 - 1
7
3
3
1-2
5
2
5
2
2
3
6
2
2
2
2
0.5 - 1
3
1-2
2
2
3
2
Other
NOP (90)
Long NOP (0F 1F)
ENTER
LEAVE
CLI
STI
CPUID
RDTSC
RDPMC
1
0
1
0
i,0
12
2
8-9
16-17
22-50 47-164
6
10
9
12
1/3
1/3
12
3
5
27
values are per count
values are per count
values are per count
values are per count
values are per count
ALU
ALU
12
3 ops, 5 clk if 16 bit
7
7
Floating point x87 instructions
Instruction
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
7
30
1
1
10
260
1
1
1
1
2
4
16
41
2
3
7
173
0
9
7
1/2
1/2
4
39
1/2
1
5
160
0.4
1
1
1
FCMOVcc
FFREE
FINCSTP, FDECSTP
st0,r
r
9
1
1
4-15
4
2
1/3
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
AX
AX
m16
m16
m16
2
3
2
3
18
6-12
6-12
12
12
8
1
50
Arithmetic instructions
FADD(P),FSUB(R)(P)
r/m
1
4
1
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1
Latency
Reciprocal throughput
Execution unit
0
Notes
FA/M
FANY
FA/M
FMISC
FMISC
FMISC, FA/M
FMISC
FMISC, FA/M
Low latency immediately after FCOMI
FANY
FANY
FMISC, ALU
Low latency immediately after FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
FMISC, ALU
faster if unchanged
FADD
FIADD,FISUB(R)
FMUL(P)
FIMUL
m
r/m
m
2
1
2
4
4
4
1-2
1
2
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
r/m
m
1
2
1
1
1
1
2
1
2
5
1
1
11-25
12-26
2
2
2
3
8-22
9-23
1
1
1
1
1
1
1
3
8
8
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
1
1
66
73
98
67
97
5
7
53
72
75
27
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
1
1
8
26
77
70
61
101
r/m
r
m
2
10
7-10
8-11
140-190
150-190
170-200
150-180
217
8
12
126
179
175
0
0
12
1
FADD,FMISC
FMUL
FMUL,FMISC
Low values are for round divisors
FMUL
FMUL,FMISC
do.
FMUL
FADD
FADD
FADD
FADD, FMISC
FADD
FMISC, ALU
FMUL
FMUL
FMUL
FMISC
7
1/3
1/3
27
100
171
136
56
95
FANY
ALU
FMISC
FMISC
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
Operands
r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32, r
Ops
2
2
1
3
3
2
1
Latency
Reciprocal throughput
Execution unit
4
9
2
3
2
2
1/2
2
2
1
1
FMISC, ALU
FANY, ALU
FANY
FMISC, ALU
FANY
FMISC
Notes
MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFD
PSHUFW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW
PINSRW
2
2
3
1
2
1
2
1
2
2
2
4
5
1
2
1
2
4
9
9
2
2
mm,r/m
1
2
2
FA/M
xmm,r/m
3
3
2
FA/M
mm,r/m
1
2
2
FA/M
xmm,r/m
xmm,r/m
xmm,r/m
xmm,xmm,i
mm,mm,i
xmm,xmm,i
mm,mm
xmm,xmm
r32,mm/xmm
r32,mm/x,i
mm,r32,i
xmm,r32,i
2
2
1
3
1
2
32
64
1
2
2
3
2
2
2
3
2
2
FA/M
FA/M
FA/M
FA/M
FA/M
FA/M
2
5
12
12
2
1
1/2
1.5
1/2
1
13
26
1
2
2
3
FADD
FMISC, ALU
FA/M
FA/M
mm,r/m
1
2
1/2
FA/M
xmm,r/m
mm,r/m
xmm,r/m
2
1
2
2
2
2
1
1/2
1
FA/M
FA/M
FA/M
Arithmetic instructions
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
PCMPEQ/GT B/W/D
PCMPEQ/GT B/W/D
2
2
2
2
2
2
1/2
1
1/2
1
1
1
2
2
2
2
1/2
1
2
3
FMISC, ALU
Moves 64 bits. Name of instruction differs
FANY, ALU
do.
FANY, ALU
do.
FA/M
FA/M, FMISC
FANY
FANY, FMISC
FMISC
FA/M
FMISC
FMISC
r64,mm/xmm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm
FA/M
FA/M, FMISC
FMISC
FMISC
PMULLW PMULHW
PMULHUW
PMULUDQ
PMULLW PMULHW
PMULHUW
PMULUDQ
PMADDWD
PMADDWD
PAVGB/W
PAVGB/W
PMIN/MAX SW/UB
PMIN/MAX SW/UB
PSADBW
PSADBW
Logic
PAND PANDN POR
PXOR
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
mm,r/m
1
3
1
FMUL
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
2
1
2
1
2
1
2
1
2
3
3
3
2
2
2
2
3
3
2
1
2
1/2
1
1/2
1
1
2
FMUL
FMUL
FMUL
FA/M
FA/M
FA/M
FA/M
FADD
FADD
mm,r/m
1
2
1/2
FA/M
xmm,r/m
2
2
1
FA/M
mm,i/mm/m
1
2
1/2
FA/M
x,i/x/m
xmm,i
2
2
2
2
1
1
FA/M
FA/M
1/3
FANY
Other
EMMS
1
Floating point XMM instructions
Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D, MOVLPS/D
MOVHPS/D, MOVLPS/D
MOVDDUP
MOVSH/LDUP
MOVNTPS/D
MOVMSKPS/D
Operands
Ops
Latency
Reciprocal throughput
Execution unit
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
2
2
2
2
4
5
1
2
1
2
2
4
3
1
2
2
1
2
2
1
1
1
FA/M
FANY FMISC
FMISC
r,r
1
2
1/2
FA/M
r,m
1
1
FMISC
m,r
r,r
r,r
m,r
r32,r
1
2
2
2
1
1
1
2
3
1
FMISC
2
2
2
8
Notes
FA/M
FMISC
FMISC
FA/M
SSE3
SSE3
FMISC
FADD
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
HADDPS/D
HSUBPS/D
MULSS/D
MULPS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPSS
RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D
r,r/m,i
r,r/m
3
2
3
3
2
3
FMUL
FMUL
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
2
4
3
1
2
2
2
4
1
2
1
3
3
2
2
2
4
8
8
2
5
5
5
8
4
5
6
8
14
12
10
9
2
3
8
1
2
2
2
3
1
2
1
2
2
2
2
2
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
FMISC
r,r/m
r,r/m
1
2
4
4
1
2
FADD
FADD
r,r/m
r,r/m
r,r/m
2
1
2
4
4
4
2
1
2
FADD
FMUL
FMUL
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
1
2
1
2
1
2
11-16
18-30
11-20
16-34
3
3
2
2
2
2
8-13
18-30
8-17
16-34
1
2
1
2
1
2
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD
r,r/m
1
2
1
FADD
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
2
2
2
FMUL
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
1
19
36
27
48
3
16
36
24
48
1
FMUL
FMUL
FMUL
FMUL
FMUL
SSE3
Low values are for round divisors, e.g. powers of 2.
do.
do.
do.
RSQRTPS
r,r/m
2
Other
LDMXCSR
STMXCSR
m
m
8
3
3
2
9
10
FMUL
K10
AMD K10
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the operand is listed as register or memory (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution unit: Indicates which execution unit is used for the macro-operations. ALU means any of the three integer ALUs. ALU0_1 means that ALU0 and ALU1 are both used. AGU means any of the three integer address generation units. FADD means floating point adder unit. FMUL means floating point multiplier unit. FMISC means floating point store and miscellaneous unit. FA/M means FADD or FMUL is used. FANY means any of the three floating point units can be used. Two macro-operations can execute simultaneously if they go to different execution units.
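The rule that two macro-operations can execute simultaneously only if they go to different execution units can be sketched for the floating point labels used in the tables. This is a simplified illustration: the dictionary and the function name `can_pair` are hypothetical, and a real scheduler has further constraints that are not modeled here.

```python
# Units that each floating point table label may use, following the
# explanation of the "Execution unit" column. Names are illustrative.
UNIT_CHOICES = {
    "FADD": {"FADD"},
    "FMUL": {"FMUL"},
    "FMISC": {"FMISC"},
    "FA/M": {"FADD", "FMUL"},           # FADD or FMUL
    "FANY": {"FADD", "FMUL", "FMISC"},  # any of the three units
}

def can_pair(label_a, label_b):
    """Return True if two macro-operations with these labels can be
    assigned to two different units in the same clock cycle."""
    return any(a != b
               for a in UNIT_CHOICES[label_a]
               for b in UNIT_CHOICES[label_b])

print(can_pair("FADD", "FADD"))   # both need the same single unit
print(can_pair("FADD", "FA/M"))   # the second one can go to FMUL
print(can_pair("FMISC", "FANY"))  # the second one can go to FADD or FMUL
```

The sketch only checks whether a conflict-free assignment exists for one pair; it does not model the decoders, the schedulers or retirement.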
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
Operands
r,r
r,i
r8,m8
r16,m16
r32,m32
r64,m64
m8,r8H
m8,r8L
m16/32/64,r
m,i
m64,i32
r,sr
sr,r/m
m,r
r,r
Ops
1
1
1
1
1
1
1
1
1
1
1
1
6
1
1
Latency
Reciprocal throughput
Execution unit
1
1
4
4
3
3
8
3
3
3
3
3-4
8-26
1
1/3
1/3
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
1/2
8
1
1/3
ALU
ALU
ALU, AGU
ALU, AGU
AGU
AGU
AGU
AGU
AGU
AGU
AGU
Notes
Any addr. mode. Add 1 clock if code segment base ≠ 0
AH, BH, CH, DH
Any other 8-bit reg.
Any addressing mode
from AMD manual
AGU
ALU
MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH(W)
SFENCE
LFENCE
MFENCE
IN
OUT
1
1
1
1
1
2
2
2
r
1
i
1
m
2
sr
2
9
9
r
1
m
3
6
DS/ES/FS/GS
SS
10
28
9
r16,[m]
2
r32/64,[m]
1
r32/64,[m]
1
4
1
1
r,m
10
r
1
m
1
m
1
m
1
6
1
4
r,i/DX
~270
i/DX,r
~300
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
r,m
r64,r32
r64,m32
r,r
r,m
r,r
r,m
r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m
1
1
1
1
1
1
1
1
1
1
9
12
16
4
30
4
1
4
1
4
1
21
5
6
3
10
26
16
6
3
1
2
3
1
1
1
1
4
1
4
1
1
7
5
6
7
5
13
1/2
1/3
1/2
1/3
1/2
1
19
5
1/2
1/2
1
1
3
6
1/2
1
8
16
11
6
1
1/3
1/3
2
1/3
1
10
1/3
1/2
1/2
1/2
8
1
33
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
AGU
ALU
ALU
ALU
1/3
1/2
1
1/3
1/2
1
1/3
1/2
1/3
2
5
6
7
5
13
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU0
ALU
ALU
AGU
AGU
AGU
Timing depends on hw
Any address size
≤ 2 source operands
W. scale or 3 opr.
3DNow
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
IDIV
IDIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL, SHR, SAR, ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r8
m8
r16/m16
r32/m32
r64/m64
r16/m16
r32/m32
r64/m64
1
3
2
2
1
1
1
2
1
1
3
3
3
1
1
r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
1
1
1
1
1
1
1
1
1
1
9
7
9
7
1
1
10
9
9
8
6
7
8
1
1
5
2
3
3
3
4
3
3
4
4
3
4
17
19
22
15-30
15-46
15-78
24-39
24-55
24-87
1
1
1
4
1
1
7
1
1
1
3
3
4
3
7
7
7
7
8
7
3
3
7.5
1
7
2
1
2
1
2
1
1
2
1
1
2
2
2
2
17
19
22
15-30
15-46
15-78
24-39
24-55
24-87
1/3
1/3
ALU0
ALU0_1
ALU0_1
ALU0_1
ALU0
ALU0
ALU0_1
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0_1
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
1/3
1/2
1
1/3
1/2
1/3
1
1/3
1/3
1
3
3
4
3
1
1
5
6
6
5
2
3
6
1/3
1/2
2
1/3
ALU
ALU, AGU
ALU, AGU
ALU
ALU, AGU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU
ALU, AGU
ALU, AGU
ALU
latency ax=3, dx=4
latency rax=4, rdx=5
Depends on number of significant bits in absolute value of dividend. See AMD software optimization guide.
BTC
BTR, BTS
BTC
BTR, BTS
BSF
BSR
BSF
BSR
POPCNT
LZCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD
m,i
m,i
m,r
m,r
r,r
r,r
r,m
r,m
r,r/m
r,r/m
r
m
Control transfer instructions
JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E/R)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
IRET
INT
i
BOUND
m
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
5
4
8
8
6
7
7
8
1
1
1
1
1
1
1
2
1
16-20
1
1
17-21
1
2
7
3
16-22
4
5
16-22
2
2
15-23
15-24
32
33
6
2
4
5
4
2
7
3
5
5
7
3
9
9
8
8
4
4
7
7
2
2
1
1
1.5
1.5
10
7
3
3
3
3
1
1
1/3
1/2
1/3
1/3
1/3
2/3
ALU, AGU
ALU, AGU
ALU, AGU
ALU, AGU
ALU
ALU
ALU, AGU
ALU, AGU
ALU
ALU
ALU
ALU, AGU
ALU
ALU
ALU
ALU
2
ALU
23-32
low values = real mode
2
2
ALU
ALU, AGU
1/3 - 2
2/3 - 2
3
2
ALU
ALU
ALU
ALU
3
3
ALU
ALU, AGU
3
3
ALU
ALU
25-33
2
23-32
3
3
24-33
3
3
24-35
24-35
81
42
SSE4.A / SSE4.2
SSE4.A, AMD only
low values = real mode
low values = real mode
low values = real mode
low values = real mode
low values = real mode
2
2
2
2
2
1
3
1
2
2
3
1
Other
recip. thrp.= 2 if jump
recip. thrp.= 2 if jump
2
2
2
1
3
1
2
2
3
1
real mode
real mode
values are for no jump
values are for no jump
values are per count
values are per count
values are per count
values are per count
values are per count
NOP (90)
Long NOP (0F 1F)
ENTER
LEAVE
CLI
STI
CPUID
RDTSC
RDPMC
1
0
1
0
i,0
12
2
8-9
16-17
22-50 47-164
30
13
1/3
1/3
ALU
ALU
12
3
5
27
3 ops, 5 clk if 16 bit
67
5
Floating point x87 instructions
Instruction
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
7
20
1
1
10
218
1
1
1
1
FCMOVcc
FFREE
FINCSTP, FDECSTP
st0,r
r
9
1
1
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
AX
AX
m16
m16
m16
2
3
2
3
12
r/m
m
r/m
m
r/m
m
1
2
1
2
1
2
1
1
1
1
2
1
2
6
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1
Latency
Reciprocal throughput
Execution unit
2
4
13
94
2
2
8
167
0
6
4
0
1/2
1/2
4
30
1/2
1
7
163
1/3
1
1
1
Notes
FA/M
FANY
FA/M
FMISC
FMISC
FMISC
FMISC
1/3
1/3
FMISC, FA/M
Low latency immediately after FCOMI
FANY
FANY
16
14
9
2
14
FMISC, ALU
Low latency immediately after FCOM FTST
FMISC, ALU
do.
FMISC, ALU
do.
FMISC, ALU
FMISC, ALU
faster if unchanged
Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
r/m
r
m
4
4
?
31
2
1
4
1
4
24
24
2
1
1
1
1
1
1
37
FADD
FADD,FMISC
FMUL
FMUL,FMISC
FMUL
FMUL,FMISC
FMUL
FADD
FADD
FADD
FADD, FMISC
FADD
FMISC, ALU
FPREM
FPREM1
1
1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
1
1
45
51
76
45
9
5
11
8
8
12
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
1
1
8
26
77
70
61
85
m
m
m
m
35
~51?
~90?
~125?
~119
151?
9
9
65
13
114
0
0
162
133
63
89
7
7
FMUL
FMUL
35
1
FMUL
FMISC
45?
29
41
30?
30?
44?
1/3
1/3
28
103
149
149
58
79
FANY
ALU
FMISC
FMISC
Integer MMX and XMM instructions
Instruction
Operands
Ops
Latency
Reciprocal throughput
Execution unit
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32,mm/x
1
2
1
1
2
1
1
3
6
4
3
6
2
2
1
3
1/2
1
3
1/2
1
MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
r64,(x)mm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,(x)mm
xmm,xmm
xmm,m
m,xmm
xmm,m
1
2
2
1
1
1
1
1
1
1
2
1
3
6
6
2
2.5
4
2
2
2.5
2
2
2
1
3
3
1/2
1/3
1/2
1/2
1
1/3
1/2
1
1/2
Notes
FADD
FANY
FADD
FMISC
Moves 64 bits. Name of instruction differs
do.
FMUL, ALU
do.
FA/M
FANY
FANY
?
FMISC
FANY
?
FMUL,FMISC
FADD
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFD
PSHUFW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW
INSERTQ
INSERTQ
EXTRQ
EXTRQ
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm
3
1
1
1
2
3
2
2
2
1/3
1/3
1
1
FANY
FANY
FMISC
FMUL,FMISC
mm,r/m
1
2
1/2
FA/M
xmm,r/m
1
3
1/2
FA/M
mm,r/m
1
2
1/2
FA/M
xmm,r/m
xmm,r/m
xmm,r/m
xmm,xmm,i
mm,mm,i
xmm,xmm,i
mm,mm
xmm,xmm
r32,mm/xmm
r32,(x)mm,i
(x)mm,r32,i
xmm,xmm
xmm,xmm,i,i
xmm,xmm
xmm,xmm,i,i
1
1
1
1
1
1
32
64
1
2
2
3
3
1
1
3
3
3
3
2
2
FA/M
FA/M
FA/M
FA/M
FA/M
FA/M
3
6
9
6
6
2
2
1/2
1/2
1/2
1/2
1/2
1/2
13
24
1
1
3
2
2
1/2
1/2
mm/xmm,r/m
1
1
2
2
1/2
1/2
FA/M
FA/M
mm/xmm,r/m
mm/xmm,r/m
mm/xmm,r/m
mm/xmm,r/m
mm/xmm,r/m
1
1
1
1
1
3
3
2
2
3
1
1
1/2
1/2
1
FMUL
FMUL
FA/M
FA/M
FADD
mm/xmm,r/m
1
2
1/2
FA/M
mm,i/mm/m
1
2
1/2
FA/M
x,i/(x)mm
xmm,i
1
1
3
3
1/2
1/2
FA/M
FA/M
Arithmetic instructions
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
mm/xmm,r/m
PCMPEQ/GT B/W/D
PMULLW PMULHW
PMULHUW
PMULUDQ
PMADDWD
PAVGB/W
PMIN/MAX SW/UB
PSADBW
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
FADD
FA/M
FA/M
FA/M
FA/M
FA/M
SSE4.A, AMD only
SSE4.A, AMD only
SSE4.A, AMD only
SSE4.A, AMD only
Other
EMMS
1
1/3
FANY
Floating point XMM instructions
Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D, MOVLPS/D
MOVHPS/D, MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVMSKPS/D
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
MULSS/D
Operands
Ops
Latency
Reciprocal throughput
Execution unit
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
1
1
2
1
1
3
1
1
1
2.5
2
2
2.5
2
3
2
2
2
1/2
1/2
1
1/2
1/2
2
1/2
1/2
1
FANY
?
FMUL,FMISC
FANY
?
FMISC
FA/M
?
FMISC
r,r
1
3
1/2
FA/M
r,m
1
4
1/2
FA/M
m,r
m,r
m,r
r32,r
r,r/m,i
r,r/m
1
2
1
1
1
1
3
3
3
1
3
1
1
1/2
1/2
FMISC
FMUL,FMISC
FMISC
FADD
FA/M
FA/M
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
1
2
3
3
1
1
1
2
2
1
1
2
3
3
2
2
2
7
8
7
4
4
4
7
7
4
4
7
14
14
8
8
1
1
2
2
1
1
1
1
1
1
1
1
3
3
1
1
FMISC
FADD,FMISC
FADD,FMISC
r,r/m
r,r/m
r,r/m
1
1
1
4
4
4
1
1
1
FADD
FADD
FMUL
FMISC
FMISC
FMISC
FMISC
FMISC
Notes
SSE4.A, AMD only
MULPS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPSS RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
1
1
1
1
1
1
r,r/m
1
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
1
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
Other
LDMXCSR
STMXCSR
m
m
4
16
18
20
20
3
2
2
2
2
1
13
15
17
17
1
1
1
1
1
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
FADD
FADD
FADD
FADD
1
FADD
2
1/2
FA/M
1
1
1
1
1
1
19
21
27
27
3
3
16
18
24
24
1
1
FMUL
FMUL
FMUL
FMUL
FMUL
FMUL
12
3
12
12
10
11
Obsolete 3DNow instructions
Instruction
Operands
Ops
Latency
Reciprocal throughput
Execution unit
Move and convert instructions
PF2ID
mm,mm
PI2FD
mm,mm
PF2IW
mm,mm
PI2FW
mm,mm
PSWAPD
mm,mm
1
1
1
1
1
5
5
5
5
2
1
1
1
1
1/2
FMISC
FMISC
FMISC
FMISC
FA/M
Integer instructions
PAVGUSB
PMULHRW
mm,mm
mm,mm
1
1
2
3
1/2
1
FA/M
FMUL
Floating point instructions
PFADD/SUB/SUBR
mm,mm
PFCMPEQ/GE/GT
mm,mm
PFMAX/MIN
mm,mm
PFMUL
mm,mm
PFACC
mm,mm
PFNACC, PFPNACC
mm,mm
PFRCP
mm,mm
1
1
1
1
1
1
1
4
2
2
4
4
4
3
1
1
1
1
1
1
1
FADD
FADD
FADD
FMUL
FADD
FADD
FMUL
Notes
3DNow extension
3DNow extension
3DNow extension
3DNow extension
PFRCPIT1/2
PFRSQRT
PFRSQIT1
mm,mm
mm,mm
mm,mm
1
1
1
Other
FEMMS
mm,mm
1
4
3
4
1
1
1
FMUL
FMUL
FMUL
1/3
FANY
Thank you to Xucheng Tang for doing the measurements on the K10.
Bulldozer
AMD Bulldozer
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe: Indicates which execution pipe or unit is used for the macro-operations:
Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1
Floating point and vector pipes:
P0: floating point add, mul, div, convert, shuffle, shift
P1: floating point add, mul, div, shuffle, shift
P2: move, integer add, boolean
P3: move, integer add, boolean, store
P01: can use either P0 or P1
P23: can use either P2 or P3
Two macro-operations can execute simultaneously if they go to different execution pipes.
Domain: Tells which execution unit domain is used:
ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.
There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.
An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
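The domain-crossing rules above can be written out as a small sketch. The function names `bypass_delay` and `fma_latency` and the string labels are hypothetical, chosen for illustration; the sketch encodes only the delays stated in the Domain explanation and nothing else about the pipeline.

```python
def bypass_delay(producer, consumer):
    """Extra clock cycles when a result crosses execution domains.

    producer: 'ivec', 'fp' or 'fma'
    consumer: 'ivec', 'fp', 'fma' or 'store'
    """
    if producer == "ivec" and consumer in ("fp", "fma"):
        return 1
    if producer in ("fp", "fma") and consumer in ("ivec", "store"):
        return 1
    # No extra latency between the fp and fma units or within one domain.
    return 0

def fma_latency(consumer):
    """Latency of an fma instruction as seen by the consuming instruction:
    5 to another fma, 6 to fp, and 6+1 to ivec or store."""
    base = 5 if consumer == "fma" else 6
    return base + bypass_delay("fma", consumer)

for target in ("fma", "fp", "ivec", "store"):
    print(target, fma_latency(target))
```

Keeping a chain of dependent operations inside one domain avoids the extra cycle at each crossing.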
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE
LFENCE
MFENCE
Arithmetic instructions
ADD, SUB
ADD, SUB
Operands
Ops
r,r
r,i
r,m
m,r
m,i
m,r
r,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r,r
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
4
4
r,m
2
2
1
1
2
8
9
1
2
34
14
2
2
~50
6
1
1
4
2
1
1
1
1
1
6
1
6
2
1
3
2
1
1
1
1
1
1
r
i
m
r
m
r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]
r
m
m
m
r,r
r,i
Latency
Reciprocal throughput
5
1
5
4
1
5
1
1
0.5
0.5
0.5
1
1
2
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
EX01
EX01
AG01
EX01 AG01
~50
2
1
1
1.5
4
9
1
1
19
8
EX01
2-3
2-3
Execution
pipes
Notes
all addr. modes
all addr. modes
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
89
0.25
89
EX01
EX01
0.5
0.5
EX01
EX01
Timing depends on hardware
any addr. size
16 bit addr. size
scale factor > 1 or 3 operands
all other cases
EX01
AMD 3DNow
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CDQ, CQO
CWD
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
r,m
m,r
m,i
r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
r
m
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64
r,r
r,i
r,m
m,r
m,i
r,r
1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
9
1
2
1
1
1
1
1
2
1
1
2
2
2
14
18
16
16
33
36
36
36
1
1
2
1
1
1
1
1
1
7-8
7-8
1
1
1
9
9
1
1
1
7-8
6
9
10
6
20
4
4
4
6
4
4
6
5
4
6
20
15-27
16-43
16-75
23
23-33
22-48
22-79
1
1
1
1
1
7-8
7-8
1
0.5
1
1
1
1
1
0.5
0.5
0.5
0.5
1
20
2
2
2
4
2
2
4
2
2
4
2
2
4
20
15-28
16-43
16-75
20
20-27
20-43
20-75
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01
0.5
0.5
0.5
1
1
0.5
EX01
EX01
EX01
EX01
EX01
EX01
TEST
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
LZCNT
POPCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
EXTRQ
EXTRQ
INSERTQ
INSERTQ
r,i
m,r
m,i
r
m
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r,r
r,r/m
r
m
r16/32,r16/32
r64,r64
r,r
x,i,i
x,x
x,x,i,i
x,x
Control transfer instructions
JMP
short/near
JMP
r
JMP
m
Jcc
short/near
fused CMP+Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
1
1
1
1
1
1
1
1
16
17
1
15
16
6
7
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
1
1
2
2
1
1
2
1
1
1
1
1
1
7
1
1
1
8
9
1
8
8
3
4
1
2
3
4
4
2
4
1
0.5
0.5
0.5
0.5
1
0.5
0.5
3
3.5
3.5
0.5
0.5
3.5
1
2
5
3
4
4
5
2
2
0.5
1
0.5
1
4
4
2
3
3
3
3
1
1
1
1
1
1
1
1
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX0
EX1
EX01
EX01
EX01
EX01
SSE4.A
SSE4.2
3
4
2
4
2
1
1
1
1
P1
P1
P1
P1
SSE4A
SSE4A
SSE4A
SSE4A
SSE4A
SSE4A
SSE4A
2
2
2
1-2
1-2
1-2
1-2
1-2
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping
CALL
CALL
CALL
RET
RET
BOUND
INTO
near
r
m
i
m
String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
CRC32
CRC32
CRC32
XGETBV
m,r
m,r
m,r
m8,r8
m8,r8
m,r16/32/64
m,r16/32/64
m64
m64
m128
m128
a,0
a,b
r32,r8
r32,r16
r32,r32
2
2
3
1
4
11
4
2
2
2
2
2-3
5
24
3
6n
3
2n
3 per 16B
5
2n
4 per 16B
3
7n
6
9n
3
3n
3
2n
3 per 16B
3
2n
3 per 16B
3
4n
3
4n
1
4
4
5
5
6
6
18
18
22
22
1
1
40
13
11+5b
2
37-63
36
22
3
5
5
4
EX1
EX1
EX1
EX1
EX1
for no jump
for no jump
small n
best case
small n
best case
~55
10
~51
15
~51
14
~52
15
~53
52
~94
3
5
6
0.25
0.25
43
22
16+4b
4
112-280
42
300
2
5
6
31
none
none
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
4
3
1
3
st0,r
r
AX
m16
m16
m16
r/m
m
r/m
m
r
m
m
r/m
r
m
1
2
1
2
1
2
2
1
1
1
2
2
1
1
1
1
1
1
1
10-162
160-170
Latency
Reciprocal throughput
2
8
14
61
2
8
9
240
0
12
8
3
0
~13
~13
5-6
5-6
10-42
2
2
~20
4
19-62
19-65
0.5
1
4
40
0.5
1
20
244
0.5
1
1
0.5
3
0.25
0.25
22
19
3
2
1
2
1
2
5-18
0.5
0.5
0.5
1
1
0.5
0.5
1
10-53
65-210
~160
0.5
65-210
~160
Execution pipes
Domain, notes
P01
fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp
P0 P1 P2 P3
P01
P0 P1 F3
P01
F3
P0 F3
P01
P0 P1 F3
none
none
P0 P2 P3
P0 P2 P3
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P0 P1 F3
P01
P01
P01
P0
P0
P0
P01
P01
P0 P1 P3
P0 P1 P3
inherit
fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
12-166
11-190
10-355
8
12
10
10-175
10-175
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
1
1
18
31
103
76
m864
m864
95-160
95-245
60-440
52
10
64-71
300
312
95-160
95-245
60-440
5
0.25
0.25
57
170
300
312
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
none
none
P0
P0
P0 P1 P2 P3
P0 P3
Integer MMX and XMM instructions
Instruction
Operands
Move instructions
MOVD
r32/64, mm/x
MOVD
mm/x, r32/64
MOVD
mm/x,m32
MOVD
m32,mm/x
MOVQ
mm/x,mm/x
MOVQ
mm/x,m64
MOVQ
m64,mm/x
MOVDQA
xmm,xmm
MOVDQA
xmm,m
MOVDQA
m,xmm
VMOVDQA
ymm,ymm
VMOVDQA
ymm,m256
VMOVDQA
m256,ymm
MOVDQU
xmm,xmm
MOVDQU
xmm,m
MOVDQU
m,xmm
LDDQU
xmm,m
VMOVDQU
ymm,m256
VMOVDQU
m256,ymm
MOVDQ2Q
mm,xmm
MOVQ2DQ
xmm,mm
MOVNTQ
m,mm
MOVNTDQ
m,xmm
MOVNTDQA
xmm,m
PACKSSWB/DW
(x)mm,r/m
PACKUSWB
(x)mm,r/m
PUNPCKH/LBW/WD/DQ
(x)mm,r/m
Ops
Latency
Reciprocal throughput
1
2
1
1
1
1
1
1
1
1
2
2
4
1
1
1
1
2
8
1
1
1
1
1
1
1
8
10
6
5
2
6
5
0
6
5
2
6
5
0
6
5
6
6
6
2
2
6
6
6
2
2
1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
3
0.25
0.5
1
0.5
1-2
10
0.5
0.5
2
2
0.5
1
1
1
2
1
Execution pipes
Notes
P23
P3
none
inherit domain
P3
P23
P3
none
P3
P2 P3
P23
P23
P3
P3
P1
P1
P1
inherit domain
PUNPCKHQDQ
xmm,r/m
PUNPCKLQDQ
xmm,r/m
PSHUFB
(x)mm,r/m
PSHUFD
xmm,xmm,i
PSHUFW
mm,mm,i
PSHUFL/HW
xmm,xmm,i
PALIGNR
(x)mm,r/m,i
PBLENDW
xmm,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,mm/x
PEXTRB/W/D/Q
r,x/mm,i
PINSRB/W/D/Q
x/mm,r,i
PMOVSXBW/BD/BQ/WD/WQ/DQ
xmm,xmm
PMOVZXBW/BD/BQ/WD/WQ/DQ
xmm,xmm
VPCMOV
x,x,x,x/m
VPCMOV
y,y,y,y/m
VPPERM
x,x,x,x/m
1
1
1
1
1
1
1
1
31
64
2
2
2
2
2
3
2
2
2
2
2
38
48
10
10
12
1
1
1
1
1
1
1
0.5
37
61
1
1
2
P1
P1
P1
P1
P1
P1
P1
P23
P3
P1 P3
P1 P3
P1 P3
P1
1
2
1
P1
SSE4.1
1
1
2
1
2
2
2
2
1
1
2
1
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
1
2
0.5
P23
(x)mm,r/m
x,x
x,m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
3
4
1
1
1
2
5
5
2
2
2
0.5
2
2
0.5
0.5
0.5
P23
P1 P23
P1 P23
P23
P23
P23
(x)mm,r/m
xmm,r/m
xmm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
1
1
1
1
1
1
4
5
4
4
4
4
2
1
2
1
1
1
1
0.5
P0
P0
P0
P0
P0
P0
P23
(x)mm,r/m
xmm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
x,x,i
1
2
1
1
2
8
2
4
2
2
4
8
0.5
1
0.5
0.5
1
4
P23
P1 P23
P23
P23
P23
P1 P23
Instruction      Operands    Ops  Latency  Reciprocal throughput  Execution pipes
VPCOMB/W/D/Q     x,x,x/m,i   1    2        0.5                    P23
VPCOMUB/W/D/Q    x,x,x/m,i   1    2        0.5                    P23
Arithmetic instructions
PADDB/W/D/Q/SB/SW/USB/USW
PSUBB/W/D/Q/SB/SW/USB/USW
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW
PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD
PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/SD UB/UW/UD
PHMINPOSUW
PABSB/W/D
PSIGNB/W/D
PSADBW
MPSADBW
SSE4.1
AVX
SSSE3
SSSE3
SSE4.1
SSE4.2
SSE4.1
SSE4.1
SSSE3
SSE4.1
SSSE3
SSSE3
SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7
VPHADDBW/BD/BQ/WD/WQ/DQ
VPHADDUBW/BD/BQ/WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
x,x/m
1
2
0.5
P23
AMD XOP
x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
1
1
1
1
1
1
1
1
1
1
2
2
4
5
4
4
5
4
4
4
0.5
0.5
1
2
1
1
2
1
1
1
P23
P23
P0
P0
P0
P0
P0
P0
P0
P0
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
1
2
0.5
P23
(x)mm,r/m
1
3
1
P1
(x)mm,i
xmm,i
xmm,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m
1
1
2
1
1
1
1
2
2
3
2
3
3
1
1
1
1
1
1
1
P1
P1
P1 P3
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP
x,x,i
x,x,i
x,x,i
x,x,i
27
27
7
7
17
10
14
7
10
10
3
4
P1 P2 P3
P1 P2 P3
P1 P2 P3
P1 P2 P3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
x,x/m,i
x,x
x,x
x,x
x,x
x,x
x,x,i
5
2
2
2
2
1
1
12
5
5
5
5
5
5
7
2
2
2
2
1
1
P1
P01
P01
P01
P01
P0
P0
pclmul
aes
aes
aes
aes
aes
aes
Execution pipes
Domain, notes
Other
EMMS
1
0.25
Floating point XMM and YMM instructions
Instruction
Operands
Ops
Latency
Reciprocal throughput
Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
x,x
y,y
1
2
0
2
0.25
0.5
x,m128
1
6
0.5
y,m256
2
6
1-2
m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x
1
4
8
1
1
1
5
5
6
2
6
5
1
3
10
0.5
0.5
1
x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i
1
2
1
1
2
7
8
7
2
10
1
1
1
1
1
P1 P3
P3
P1
P1 P3
1
6
2
P3
4
1
2
1
2
1
2
3
4
0.5
1
1
2
1
0.5
2
1
0.5
0.5
0.5
0.5
1
0.5
2
1
1
2
1
P3
P1
P1
P1
P1
P1
P1
P23
P23
P23
P23
P1
P1
P1
SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
P1
ivec
P23
P23
P23
P1
ivec
P1
ivec
P1
P1
P1 P3
ivec
ivec
1
1
2
1
2
1
2
8
10
1
2
1
2
1
1
2
2
1
2
2
2
1
1
2
2
1
2
2
2
2
3
3
2
2
4
2
2
2
2
2
2
6
6
6
6
2
2
2
2
10
none
P23
inherit domain
ivec
P3
P3
P2 P3
P01
fp
ivec
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
2
1
2
1
1
2
2
1
2
18
34
14
2
7
2
2
9
9
9
22
25
1
1
1
1
1
1
1
0.5
1
7
13
P1 P3
P23
P23
P1
P1
P23
P23
P01
P01
P0 P1 P2 P3
P0 P1 P2 P3
Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x
2
4
2
4
1
1
1
2
1
2
2
4
2
4
1
1
2
2
2
2
2
2
7
7
7
7
4
4
4
4
4
4
7
8
7
7
4
4
7
7
14
13
14
13
1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0
P0
P0 P1
P0 P1
P0
P0
P0
P0
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
x,x/m
x,x/m
1
1
5-6
5-6
0.5
0.5
P01
P01
fma
fma
VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
y,y,y/m
x,x/m
y,y,y/m
2
1
2
5-6
5-6
5-6
1
0.5
1
P01
P01
P01
fma
fma
fma
HADDPS/D HSUBPS/D
x,x
3
10
2
P01 P1
ivec/fma
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D
x,m128
4
2
P01 P1
ivec/fma
y,y,y
8
4
P01 P1
ivec/fma
y,y,m
10
4
P01 P1
ivec/fma
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
10
ivec
ivec
MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD
x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m
1
1
2
1
2
1
2
1
2
5-6
5-6
5-6
9-24
9-24
9-27
9-27
5
5
0.5
0.5
1
4.5-9.5
9-19
4.5-11
9-22
1
2
P01
P01
P01
P01
P01
P01
P01
P01
P01
fma
fma
fma
fp
fp
fp
fp
fp
fp
x,x/m
y,y,y/m
1
2
2
2
0.5
1
P01
P01
fp
fp
x,x/m
2
1
P01 P3
fp
x,x/m
1
2
0.5
P01
fp
VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/PD
y,y/m,i
2
DPPS
x,x,i
16
DPPS
x,m128,i
18
VDPPS
y,y,y,i
25
VDPPS
y,m256,i
29
DPPD
x,x,i
15
DPPD
x,m128,i
17
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
All other FMA4 instructions: same as above
2
4
1
1
P01
P0
fp
fp
4
25
5-6
5-6
5-6
2
6
7
13
13
5
6
0.5
0.5
1
P0
P01 P23
P01 P23
P01 P3
P01 P3
P01 P23
P01 P23
P01
P01
P01
fp
fma
fma
fma
fma
fma
fma
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4
Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD
27
15
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x
x,m
1
2
1
2
1
2
2
3
14-15
14-15
24-26
24-26
5
5
10
10
4.5-12
9-24
4.5-16.5
9-33
1
2
2
2
P01
P01
P01
P01
P01
P01
P01
P01
fp
fp
fp
fp
fp
fp
AMD XOP
AMD XOP
Logic
Instruction               Operands   Ops  Latency  Reciprocal throughput  Execution pipes  Domain
AND/ANDN/OR/XORPS/PD      x,x/m      1    2        0.5                    P23              ivec
VAND/ANDN/OR/XOR PS/PD    y,y,y/m    2    2        1                      P23              ivec
Other
VZEROUPPER
VZEROUPPER
9
16
4
5
32 bit mode
64 bit mode
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
m32
m32
m4096
m4096
m
m
17
32
1
2
67
116
122
177
10
19
136
176
196
250
6
10
4
19
136
176
196
250
P2 P3
P2 P3
P0 P3
P0 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
32 bit mode
64 bit mode
Piledriver
AMD Piledriver
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, x = 128 bit xmm register, y = 256 bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from instruction decoder to schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency listed does not include the memory operand where the listing for register and memory operand are joined (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins to when a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe: Indicates which execution pipe or unit is used for the macro-operations:
Integer pipes:
EX0: integer ALU, division
EX1: integer ALU, multiplication, jump
EX01: can use either EX0 or EX1
AG01: address generation unit 0 or 1
Floating point and vector pipes:
P0: floating point add, mul, div, convert, shuffle, shift
P1: floating point add, mul, div, shuffle, shift
P2: move, integer add, boolean
P3: move, integer add, boolean, store
P01: can use either P0 or P1
P23: can use either P2 or P3
Two macro-operations can execute simultaneously if they go to different execution pipes.
Domain: Tells which execution unit domain is used:
ivec: integer vector execution unit.
fp: floating point execution unit.
fma: floating point multiply/add subunit.
inherit: the output operand inherits the domain of the input operand.
ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.
There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of a fp or fma instruction, and when the output of a fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.
An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
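The difference between latency and reciprocal throughput can be illustrated with a rough cycle estimate. The Python sketch below is a simplified model under stated assumptions, not a simulator: the function names are hypothetical, the example figures are placeholders rather than measured values, and other pipeline bottlenecks are ignored.

```python
import math

# Simplified model: names and example numbers are assumptions for illustration.


def chain_cycles(n: int, latency: float) -> float:
    """n instructions where each one depends on the result of the previous one."""
    return n * latency


def independent_cycles(n: int, latency: float, recip_throughput: float) -> float:
    """n independent instructions of the same kind: one new instruction starts
    every recip_throughput cycles, and the last one needs latency to finish."""
    if n == 0:
        return 0.0
    return latency + (n - 1) * recip_throughput


def chains_to_saturate(latency: float, recip_throughput: float) -> int:
    """Number of independent dependency chains needed to hide the latency."""
    return math.ceil(latency / recip_throughput)


if __name__ == "__main__":
    lat, rt = 5, 0.5  # placeholder values for an instruction of interest
    print(chain_cycles(100, lat))              # dependency-bound estimate
    print(independent_cycles(100, lat, rt))    # throughput-bound estimate
    print(chains_to_saturate(lat, rt))
```

With these placeholder values a loop-carried dependency chain is limited by latency, while the same instructions spread over enough independent chains are limited by reciprocal throughput.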
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE
Operands
Ops
r8,r8
r16,r16
r32,r32
r64,r64
r,i
r,m
m,r
m,i
m,r
r16,r8
r32,r
r64,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r8,r8
r16,r16
r32,r32
r64,r64
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
4
4
r,m
2
2
1
1
2
8
9
1
2
34
14
2
2
~40
6
1
1
4
2
1
1
1
1
1
7
2
1
3
2
1
1
r
i
m
r
m
r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]
r
m
m
m
Latency
Reciprocal throughput
4
1
1
1
5
4
1
5
1
1
1
1
1
0.5
0.5
0.3
0.3
0.5
0.5
1
1
2
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
EX01
EX01
EX01 or AG01
EX01 or AG01
EX01
AG01
EX01 AG01
~40
2
1
1
1
4
9
1
1
18
8
EX01
2-3
2-3
Execution pipes
all addr. modes
all addr. modes
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
81
Notes
EX01
EX01
Timing depends on hw
any addr. size
16 bit addr. size
scale factor > 1 or 3 operands
all other cases
EX01
PREFETCHW
LFENCE
MFENCE
1
7
Arithmetic instructions
ADD, SUB
r,r
ADD, SUB
r,i
ADD, SUB
r,m
ADD, SUB
m,r
ADD, SUB
m,i
ADC, SBB
r,r
ADC, SBB
r,i
ADC, SBB
r,m
ADC, SBB
m,r
ADC, SBB
m,i
CMP
r,r
CMP
r,i
CMP
r,m
CMP
m,i
INC, DEC, NEG
r
INC, DEC, NEG
m
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
r8/m8
MUL, IMUL
r16/m16
MUL, IMUL
r32/m32
MUL, IMUL
r64/m64
IMUL
r16,r16/m16
IMUL
r32,r32/m32
IMUL
r64,r64/m64
IMUL
r16,(r16),i
IMUL
r32,(r32),i
IMUL
r64,(r64),i
IMUL
r16,m16,i
IMUL
r32,m32,i
IMUL
r64,m64,i
DIV
r8/m8
DIV
r16/m16
DIV
r32/m32
DIV
r64/m64
IDIV
r8/m8
IDIV
r16/m16
IDIV
r32/m32
IDIV
r64/m64
CBW, CWDE, CDQE
CDQ, CQO
CWD
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
10
1
2
1
1
1
1
1
2
1
1
2
2
2
9
7
2
2
9
7
2
2
1
1
2
17-22
13-26
12-40
13-71
17-21
13-26
13-40
13-71
1
1
1
Logic instructions
AND, OR, XOR
AND, OR, XOR
1
1
1
1
r,r
r,i
0.25
81
1
1
7-8
7-8
1
1
1
9
9
1
1
1
7-8
6
9
10
6
15
4
4
4
6
4
4
6
5
4
6
0.5
0.5
0.5
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
1
15
2
2
2
4
2
2
4
2
2
4
2
2
4
13-22
13-25
12-40
13-71
13-18
13-25
13-40
13-71
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01
0.5
0.5
EX01
EX01
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
TEST
TEST
NOT
NOT
ANDN
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
TZCNT
BEXTR
BEXTR
BLSI
BLSMSK
BLSR
BLCFILL
BLCI
BLCIC
BLCMSK
BLCS
BLSFILL
BLSI
r,m
m,r
m,i
r,r
r,i
m,r
m,i
r
m
r,r,r
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r
m
r16/32,r16/32
r64,r64
r,r
r,r
r,r,r
r,r,i
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
1
1
1
1
1
1
1
1
1
1
1
1
1
16
17
1
15
16
6
7
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
2
2
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
7-8
7-8
1
1
1
7-8
1
1
1
1
7
7
1
7
6
3
3
1
2
20
21
3
4
4
1
0.5
1
1
0.5
0.5
0.5
0.5
0.5
1
0.5
0.5
0.5
3
3
3.5
0.5
0.5
3.5
1
3
4
4
5
0.5
1
0.5
1
4
4
2
2
2
2
2
2
2
2
2
2
2
2
2
2
3
4
2
4
2
2
0.67
0.67
1
1
1
1
1
1
1
1
1
1
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX0
BMI1
SSE4.2
SSE4.2
LZCNT
BMI1
BMI1
AMD TBM
BMI1
BMI1
BMI1
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
BLSIC
T1MSKC
TZMSK
r,r
r,r
r,r
Control transfer instructions
JMP
short/near
JMP
r
JMP
m
Jcc
short/near
fused CMP+Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
CALL
near
CALL
r
CALL
m
RET
RET
i
BOUND
m
INTO
String instructions
LODS
REP LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
m8/m16
m32/m64
m,r
m,r
m,r
m,r8/16
m,r8/16
m,r32/64
m,r32/64
m64
m64
m128
m128
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
2
2
3
1
4
11
4
2
2
2
1-2
1-2
1-2
1-2
1-2
2
2
2
2
2
5
2
3
6n
6n
3
1n
3 per 16B
5
1-3n
4.5 per 16B
3
7n
6
9n
3
3n
2.5n
3
1n
3 per 16B
3
1n
3 per 16B
3
3-4n
3
4n
1
4
4
5
5
6
6
18
18
22
22
AMD TBM
AMD TBM
AMD TBM
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
for no jump
for no jump
small n
best case
small n
best case
~40
20
~39
23
~40
20
~40
25
~42
66
~80
1
1
40
0.25
0.25
40
2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping
none
none
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDPMC
CRC32
CRC32
CRC32
a,0
a,b
r32,r8
r32,r16
r32,r32
13
20+3b
2
38-64
4
36
21
3
5
5
3
5
6
21
16+4b
4
105-271
30
42
310
2
5
6
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
3
2
1
2
2
7
20
64
2
7
22
220
0
11
7
1
2
1
2
1
1
2
1
1
1
2
2
1
1
1
5-6
st0,r
r
AX
m16
m16
m16
r/m
m
r/m
m
r
m
m
r/m
r
m
Latency
Reciprocal throughput
3
0
5-6
9-40
2
2
~20
4
0.5
1
4
35
0.5
1
20
0.5
1
1
0.5
3
0.25
0.25
19
17
3
2
1
2
1
2
4-16
0.5
0.5
0.5
1
1
0.5
0.5
1
Execution pipes
Domain, notes
P01
fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp
P0 P1 P2 P3
P01
P0 P1 F3
P01
F3
P0 F3
P01
P0 P1 F3
none
none
P0 P2 P3
P0 P2 P3
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P0 P1 F3
P01
P01
P01
P0
inherit
fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
FPREM
FPREM1
1
1
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
17-60
17-60
1
14-50
1
10-162 60-210
160-170
~154
12-166 86-141
11-190 166-231
10-355 60-352
8
44
12
7
10
60-73
10-176
10-176
m864
m864
1
1
18
31
103
76
300
236
P0
P0
5-20
0.5
60-146
~154
86-141
86-204
60-352
5
5
P01
P01
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
P0 P1 P3
0.25
0.25
54
134
300
236
none
none
P0
P0
P0 P1 P2 P3
P0 P3
fp
fp
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
VMOVDQA
VMOVDQA
VMOVDQA
MOVDQU
MOVDQU
MOVDQU
LDDQU
VMOVDQU
VMOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
Operands
Ops
r32/64, mm/x
mm/x, r32/64
mm/x,m32
m32,mm/x
mm/x,mm/x
mm/x,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
ymm,ymm
ymm,m256
m256,ymm
xmm,xmm
xmm,m
m,xmm
xmm,m
ymm,m256
m256,ymm
mm,xmm
xmm,mm
m,mm
1
2
1
1
1
1
1
1
1
1
2
2
4
1
1
1
1
2
8
1
1
1
Latency
Reciprocal throughput
8
10
6
5
2
6
5
0
6
5
2
6
11
0
6
5
6
6
14
2
2
5
1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
17
0.25
0.5
1
0.5
1
20
0.5
0.5
2
Execution pipes
Notes
P3
P3
P23
P3
none
inherit domain
P3
P23
P3
none
P3
P2 P3
P23
P23
P3
inherit domain
MOVNTDQ
m,xmm
MOVNTDQA
xmm,m
(x)mm,r/m
PACKSSWB/DW
(x)mm,r/m
PACKUSWB
PUNPCKH/LBW/WD/DQ
(x)mm,r/m
PUNPCKHQDQ
xmm,r/m
PUNPCKLQDQ
xmm,r/m
PSHUFB
(x)mm,r/m
PSHUFD
xmm,xmm,i
PSHUFW
mm,mm,i
PSHUFL/HW
xmm,xmm,i
PALIGNR
(x)mm,r/m,i
PBLENDW
xmm,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,mm/x
PEXTRB/W/D/Q
r,x/mm,i
PINSRB/W/D/Q
x/mm,r,i
EXTRQ
x,i,i
EXTRQ
x,x
INSERTQ
x,x,i,i
INSERTQ
x,x
PMOVSXBW/BD/BQ/WD/WQ/DQ
x,x
PMOVZXBW/BD/BQ/WD/WQ/DQ
x,x
VPCMOV
x,x,x,x/m
VPCMOV
y,y,y,y/m
VPPERM
x,x,x,x/m
Arithmetic instructions
PADDB/W/D/Q/SB/SW/USB/USW
PSUBB/W/D/Q/SB/SW/USB/USW
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW
PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD
PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/SD UB/UW/UD
PHMINPOSUW
1
1
1
1
5
6
2
2
2
0.5
1
1
P3
1
1
1
1
1
1
1
1
1
31
64
2
2
2
1
1
1
1
2
2
2
3
2
2
2
2
2
36
59
10
10
12
3
1
1
1
1
1
1
1
1
1
1
1
0.5
59
92
1
1
2
1
1
1
1
P1
P1
P1
P1
P1
P1
P1
P1
P23
P3
P1 P3
P1 P3
P1 P3
P1
P1
P1
P1
P1
AMD SSE4A
AMD SSE4A
AMD SSE4A
AMD SSE4A
1
2
1
P1
SSE4.1
1
1
2
1
2
2
2
2
1
1
2
1
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
1
2
0.5
P23
(x)mm,r/m
x,x
x,m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
3
4
1
1
1
2
5
5
2
2
2
0.5
2
2
0.5
0.5
0.5
P23
P1 P23
P1 P23
P23
P23
P23
(x)mm,r/m
x,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
1
1
1
1
1
1
4
5
4
4
4
4
2
1
2
1
1
1
1
0.5
P0
P0
P0
P0
P0
P0
P23
(x)mm,r/m
x,r/m
1
2
2
4
0.5
1
P23
P1 P23
P1
P1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSE4.1
SSE4.2
SSE4.1
SSE4.1
SSSE3
SSE4.1
PABSB/W/D
PSIGNB/W/D
PSADBW
MPSADBW
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
x,x,i
1
1
2
8
2
2
4
8
0.5
0.5
1
4
P23
P23
P23
P1 P23
VPCOMB/W/D/Q
x,x,x/m,i
1
2
0.5
P23
VPCOMUB/W/D/Q
VPHADDBW/BD/BQ/WD/WQ/DQ
VPHADDUBW/BD/BQ/WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD
x,x,x/m,i
1
2
0.5
P23
SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7
x,x/m
1
2
0.5
P23
AMD XOP
x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
1
1
1
1
1
1
1
1
1
1
2
2
4
5
4
4
5
4
4
4
0.5
0.5
1
2
1
1
2
1
1
1
P23
P23
P0
P0
P0
P0
P0
P0
P0
P0
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
1
2
0.5
P23
(x)mm,r/m
1
3
1
P1
(x)mm,i
x,i
x,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m
1
1
2
1
1
1
1
2
2
3
2
3
3
1
1
1
1
1
1
1
P1
P1
P1 P3
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP
x,x,i
x,x,i
x,x,i
x,x,i
27
27
7
7
16
10
13
7
10
10
3
4
P1 P2 P3
P1 P2 P3
P1 P2 P3
P1 P2 P3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
x,x/m,i
x,x,x,i
x,x,m,i
x,x
x,x
x,x
x,x
x,x
x,x,i
5
6
7
2
2
2
2
1
1
12
12
12
5
5
5
5
5
5
7
7
7
2
2
2
2
1
1
P1
P1
P1
P01
P01
P01
P01
P0
P0
pclmul
pclmul
pclmul
aes
aes
aes
aes
aes
aes
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
VPCLMULQDQ
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
SSSE3
SSSE3
Other
EMMS
1
0.25
Floating point XMM and YMM instructions
Instruction
Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
Operands
Ops
Latency
Reciprocal throughput
x,x
y,y
1
2
0
2
0.25
0.5
x,m128
1
6
0.5
y,m256
2
6
1
m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
1
4
8
1
1
1
1
1
2
1
1
2
2
1
4
1
1
2
1
2
1
2
8
10
1
2
1
2
1
1
2
2
1
2
2
2
5
11
15
2
6
5
8
7
7
6
2
10
1
17
20
0.5
0.5
1
1
0.5
1
1
1
1
1
2
18
4
1
2
1
2
1
2
3
4
0.5
1
1
2
1
0.5
2
1
0.5
0.5
0.5
0.5
5
2
2
3
3
2
2
4
2
2
2
2
2
2
6
6
6
6
Execution pipes
Domain, notes
none
P23
inherit domain
ivec
P3
P3
P2 P3
P01
fp
P1
P01
P1 P3
P3
P1
P1 P3
ivec
P3
P3
P1
P1
P1
P1
P1
P1
P23
P23
P23
P23
P1
P1
P1
AMD SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
P1
ivec
P23
P23
P23
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
x,x
x,m128
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
1
1
2
2
1
2
2
2
1
2
1
1
2
2
1
2
18
34
Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x
x/m,x,i
x/m,y,i
x,x/m
y,x/m
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
2
6
2
6
2
7
2
13
7
13
~100
~190
1
0.5
2
1
1
2
1
1
0.5
1
1
2
1
1
0.5
1
~90
~180
P1
ivec
P1
ivec
P1
P1
P1 P3
P1 P3
P23
P23
P1
P1
P23
P23
P01
P01
P0 P1 P2 P3
P0 P1 P2 P3
ivec
ivec
2
4
2
4
1
1
1
2
1
2
2
4
2
4
2
1
2
2
2
2
2
2
2
4
2
4
8
7
8
8
4
4
4
4
4
4
8
8
8
7
8
4
7
7
13
12
13
12
8
8
8
8
1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
2
2
2
2
P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0 P23
P0
P0 P1
P0 P1
P0
P0 P3
P0
P0 P3
P0 P1
P0 P1
P0 P1
P0 P1
ivec/fp
ivec/fp
ivec/fp
ivec/fp
fp
fp
fp
fp
fp
fp
ivec/fp
ivec/fp
fp/ivec
fp/ivec
ivec/fp
fp
ivec/fp
fp/ivec
fp
fp
fp
fp
F16C
F16C
F16C
F16C
x,x/m
x,x/m
1
1
5-6
5-6
0.5
0.5
P01
P01
fma
fma
y,y,y/m
x,x/m
2
1
5-6
5-6
1
0.5
P01
P01
fma
fma
2
2
2
ivec
ivec
Instruction          Operands   Ops  Latency  Reciprocal throughput  Execution pipes  Domain
VADDSUBPS/D          y,y,y/m    2    5-6      1                      P01              fma
HADDPS/D HSUBPS/D    x,x        3    10       2                      P01 P1           ivec/fma
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD
x,m
4
2
P01 P1
ivec/fma
y,y,y/m
x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m
8
1
1
2
1
2
1
2
1
2
10
5-6
5-6
5-6
9-24
9-24
9-27
9-27
5
5
4
0.5
0.5
1
5-10
9-20
5-10
9-18
1
2
P01 P1
P01
P01
P01
P01
P01
P01
P01
P01
P01
ivec/fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
x,x/m
y,y,y/m
1
2
2
2
0.5
1
P01
P01
fp
fp
x,x/m
2
1
P01 P3
fp
x,x/m
1
2
0.5
P01
fp
VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/PD
y,y/m,i
2
DPPS
x,x,i
16
DPPS
x,m,i
18
VDPPS
y,y,y,i
25
VDPPS
y,m,i
29
DPPD
x,x,i
15
DPPD
x,m,i
17
VFMADD132SS/SD
x,x,x/m
1
VFMADD132PS/PD
x,x,x/m
1
VFMADD132PS/PD
y,y,y/m
2
All other FMA3 instructions: same as above
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
All other FMA4 instructions: same as above
2
4
1
1
P01
P0
fp
fp
4
25
5-6
5-6
5-6
2
6
7
13
13
5
6
1
1
1
P0
P01 P23
P01 P23
P01 P3
P01 P3
P01 P23
P01 P23
P01
P01
P01
5-6
5-6
5-6
0.5
0.5
1
P01
P01
P01
fp
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
FMA3
FMA3
FMA3
FMA3
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4
13-15
14-15
24-26
24-26
5
5
10
10
5-12
9-24
5-15
9-29
1
2
2
2
P01
P01
P01
P01
P01
P01
P01
P01
Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x
x,m
1
2
1
2
1
2
2
3
27
15
fp
fp
fp
fp
fp
fp
AMD XOP
AMD XOP
Logic
Instruction               Operands   Ops  Latency  Reciprocal throughput  Execution pipes  Domain
AND/ANDN/OR/XORPS/PD      x,x/m      1    2        0.5                    P23              ivec
VAND/ANDN/OR/XOR PS/PD    y,y,y/m    2    2        1                      P23              ivec
136
176
196
250
4
5
6
10
34
17
136
176
196
250
P2 P3
P2 P3
P2 P3
P2 P3
P0 P3
P0 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
P0 P1 P2 P3
32 bit mode
64 bit mode
32 bit mode
64 bit mode
m32
m32
m4096
m4096
m
m
9
16
17
32
7
2
67
116
122
177
Other
VZEROUPPER
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
Steamroller
AMD Steamroller
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, x = 128-bit xmm register, y = 256-bit ymm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of macro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 macro-operations use microcode.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. The latency listed does not include the memory operand where the listings for register and memory operands are joined (r/m).
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/3 indicates that the execution units can handle 3 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe: Indicates which execution pipe or unit is used for the macro-operations:
  Integer pipes:
    EX0: integer ALU, division
    EX1: integer ALU, multiplication, jump
    EX01: can use either EX0 or EX1
    AG01: address generation unit 0 or 1
  Floating point and vector pipes:
    P0: floating point add, mul, div. Integer add, mul, bool
    P1: floating point add, mul, div. Shuffle, shift, pack
    P2: Integer add. Bool, store
    P01: can use either P0 or P1
    P02: can use either P0 or P2
  Two macro-operations can execute simultaneously if they go to different execution pipes.
Domain: Tells which execution unit domain is used:
    ivec: integer vector execution unit.
    fp: floating point execution unit.
    fma: floating point multiply/add subunit.
    inherit: the output operand inherits the domain of the input operand.
    ivec/fma means the input goes to the ivec domain and the output comes from the fma domain.
  There is an additional latency of 1 clock cycle if the output of an ivec instruction goes to the input of an fp or fma instruction, and when the output of an fp or fma instruction goes to the input of an ivec or store instruction. There is no latency between the fp and fma units. All other latencies after memory load and before memory store instructions are included in the latency counts.
  An fma instruction has a latency of 5 if the output goes to another fma instruction, 6 if the output goes to an fp instruction, and 6+1 if the output goes to an ivec or store instruction.
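The rule for fma results in the Domain description can be written as a small lookup, which may help when adding up the latency of a dependency chain by hand. The following is only a sketch: the function names and the domain strings are assumptions made for illustration, and the cycle counts are the ones quoted in the text (5, 6, and 6+1).

```python
# Illustrative sketch only. The names below are hypothetical; the cycle
# counts are taken from the Domain description for Steamroller.
FMA_BASE_LATENCY = {"fma": 5, "fp": 6, "ivec": 6, "store": 6}
DOMAIN_CROSSING_PENALTY = {"fma": 0, "fp": 0, "ivec": 1, "store": 1}


def fma_result_latency(consumer_domain: str) -> int:
    """Latency in clock cycles of an fma instruction, depending on the
    domain of the instruction that consumes its output."""
    return (FMA_BASE_LATENCY[consumer_domain]
            + DOMAIN_CROSSING_PENALTY[consumer_domain])


def fma_chain_latency(consumer_domains) -> int:
    """Total latency of a chain of fma instructions, where each entry names
    the domain of the instruction that consumes that fma result."""
    return sum(fma_result_latency(d) for d in consumer_domains)
```

For example, two fma instructions feeding further fma instructions followed by one fma whose result is stored would be counted as `fma_chain_latency(["fma", "fma", "store"])`, that is 5 + 5 + 7 clock cycles under these assumptions.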
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVSX
MOVZX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
POP
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH/W
SFENCE
LFENCE
MFENCE
Operands
Ops
r8,r8
r16,r16
r32,r32
r64,r64
r,i
r,m
m,r
m,i
m,r
r,r
r,m
r,m
r64,r32
r64,m32
r,r
r,m
r8,r8
r16,r16
r32,r32
r64,r64
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
3
4
r,m
2
2
1
1
2
8
9
1
2
34
14
1
2
1
~38
6
1
1
4
2
1
1
1
1
1
7
1
7
2
1
3
2
1
1
r
i
m
r
m
sp
r16,[m]
r32,[m]
r32/64,[m]
r32/64,[m]
r
m
m
m
Latency Reciprocal
throughput
4
1
5
4
1
5
1
1
1
1
1
0.5
0.5
0.25
0.25
0.5
0.5
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
EX01
EX01
EX01 or AG01
EX01 or AG01
EX01
AG01
EX01 AG01
~38
2
1
1
1
4
9
1
1
19
8
EX01
2
2-3
2
Arithmetic instructions
Execution
pipes
all addr. modes
all addr. modes
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
0.5
2
1
1
0.5
0.5
0.5
0.5
~80
0.25
~80
Notes
EX01
EX01
Timing depends on
hw
any addr. size
16 bit addr. size
scale factor > 1
or 3 operands
all other cases
EX01
PREFETCHW
ADD, SUB
ADD, SUB
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA, AAS
DAA
DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CDQ, CQO
CWD
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
m,r
m,i
r,r
r,i
r,m
m,i
r
m
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r16,m16,i
r32,m32,i
r64,m64,i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64
r,r
r,i
r,m
m,r
m,i
r,r
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
10
16
20
4
10
1
2
1
1
1
1
1
2
1
1
2
2
2
9
7
2
2
9
7
2
2
1
1
2
1
1
1
1
1
1
1
1
7
7
1
1
1
9
9
1
1
1
7
6
8
10
6
15
4
4
4
6
4
4
6
5
4
6
17-22
15-25
13-39
13-70
17-22
14-25
13-39
13-70
1
1
1
1
1
7
7
1
0.5
0.5
0.5
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
1
15
2
2
2
4
2
2
4
2
2
4
2
2
4
13-17
15-25
13-39
13-70
13-17
14-24
13-39
13-70
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
0.5
1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX0
EX01
EX01
EX01
0.5
0.5
0.5
1
1
0.5
EX01
EX01
EX01
EX01
EX01
EX01
TEST
TEST
TEST
NOT
NOT
ANDN
SHL, SHR, SAR
ROL, ROR
RCL
RCL
RCL
RCR
RCR
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC, BTR, BTS
BTC, BTR, BTS
BSF
BSF
BSR
BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
POPCNT
POPCNT
LZCNT
TZCNT
BEXTR
BEXTR
BLSI
BLSMSK
BLSR
BLCFILL
BLCI
BLCIC
BLCMSK
BLCS
BLSFILL
BLSI
BLSIC
T1MSKC
TZMSK
r,i
m,r
m,i
r
m
r,r,r
r,i/CL
r,i/CL
r,1
r,i
r,cl
r,1
r,i
r,cl
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,r
r,r
r,m
r,r
r,m
r
m
r16/32,r16/32
r64,r64
r,r
r,r
r,r,r
r,r,i
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
r,r
1
1
1
1
1
1
1
1
1
16
17
1
15
16
6
7-8
8
1
1
7
2
4
10
6
8
7
9
1
1
1
1
2
2
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
1
1
7
1
1
1
1
7
7
1
7
7
3
4
1
2
3
4
4
1
0.5
0.5
0.5
0.5
1
0.5
0.5
0.5
3
4
4
0.5
0.5
3.5
1
2
5
3
4
4
5
0.5
1
0.5
1
4
4
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
3
4
2
4
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX01
EX0
BMI1
SSE4.2
SSE4.2
LZCNT
BMI1
BMI1
AMD TBM
BMI1
BMI1
BMI1
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
AMD TBM
Control transfer instructions
JMP short/near
JMP r
JMP m
Jcc short/near
fused CMP+Jcc short/near
J(E/R)CXZ short
LOOP short
LOOPE LOOPNE short
CALL near
CALL r
CALL m
RET
RET i
BOUND m
INTO
String instructions
LODS
REP LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
CMPXCHG
CMPXCHG
LOCK CMPXCHG
LOCK CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
m8/m16
m32/m64
1
1
1
1
1
1
1
1
2
2
3
1
4
11
4
2
2
2
1-2
1-2
1-2
1-2
1-2
2
2
2
2
2
5
2
3
6n
6n
3
1n
3 per 16B
5
~1n
4-5 per 16B
3
7n
6
9n
3
3n
2.5n
3
~1n
2 per 16B
3
~1n
~2 per 16B
3
3-4n
3
4n
m,r
m,r
m,r
m,r8
m,r16
m,r32/64
m8,r8
m16,r16
m,r32/64
m64
m64
m128
m128
1
4
4
5
6
6
5
6
6
18
18
24
24
a,0
a,b
1
1
8
13
11+5b
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
EX1
for no jump
for no jump
small n
best case
small n
best case
~39
9-12
~39
15
15
13
~40
~40
~40
~14
~42
~47
~80
0.25
0.25
4
21
20-30
2 if jumping
2 if jumping
2 if jumping
2 if jumping
2 if jumping
none
none
LEAVE
CPUID
XGETBV
RDTSC
RDTSCP
RDPMC
CRC32
CRC32
CRC32
r32,r8
r32,r16
r32,r32
2
38-64
4
44
44
22
3
5
7
3
5
6
3
100-300
30
78
105
360
2
5
6
rdtscp
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FLDCW
FNSTCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM FPREM1
Operands
Ops
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
1
1
8
60
1
2
13
239
1
1
2
1
8
1
1
3
2
1
2
2
7
11
52
2
7
14
222
0
11
7
1
2
1
2
1
1
2
1
1
1
2
2
1
1
1
1
5
st0,r
r
AX
m16
m16
m16
r/m
m
r/m
m
r
m
m
r/m
r
m
Latency Reciprocal
throughput
3
0
11
5
9-37
2
2
26
4
17-60
0.5
1
4
34
0.5
1
19
222
0.5
1
1
0.5
3
0.25
0.25
19
17
3
2
1
2
1
2
4-16
4
0.5
0.5
0.5
1
1
0.5
0.5
1
12-53
Execution
pipes
Domain, notes
P01
fp
fp
fp
fp
fp
fp
fp
fp
inherit
fp
fp
fp
fp
P0 P1 P2
P01
P0 P1 P2
P01
P01
P0 P2
P01
P0 P1 P2
none
none
P0 P2
P0 P2
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01
P01 P2
P01
P01
P01
P0
P0
inherit
fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
1
1
10-164
18-166
12-168
11-192
10-365
10
12
10-18
9-183
206
m864
m864
1
1
18
31
98
73
10-50
60-210
76-158
90-245
60-440
49
8
60-74
60-280
~390
256
166
5-20
0.5
60-165
90-165
90-210
60-365
5
5
0.25
0.25
63
131
256
166
P01
P01
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
none
none
P0
P0
P0 P1 P2
P0 P2
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
VMOVDQA
VMOVDQA
VMOVDQA
MOVDQU
MOVDQU
MOVDQU
LDDQU
VMOVDQU
VMOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
Operands
Ops
r32/64, mm/x
mm/x, r32/64
mm/x,m32
m32,mm/x
mm/x,mm/x
mm/x,m64
m64,mm/x
xmm,xmm
xmm,m
m,xmm
ymm,ymm
ymm,m256
m256,ymm
xmm,xmm
xmm,m
m,xmm
xmm,m
ymm,m256
m256,ymm
mm,xmm
xmm,mm
m,mm
m,xmm
xmm,m
1
2
1
1
1
1
1
1
1
1
2
2
2
1
1
1
1
2
2
1
1
1
1
1
Latency Reciprocal
throughput
4
5
2
3
2
2
3
0
2
3
2
3
4
0
2
3
2
3
4
1
1
3
3
2
1
1
0.5
1
0.5
0.5
1
0.25
0.5
1
0.5
1
1
0.25
0.5
1
0.5
1
1
0.5
0.5
1
1
0.5
Execution
pipes
Notes
P2
P02
none
inherit domain
P2
P02
P2
none
P2
P02
P02
P2
P2
inherit domain
(x)mm,r/m
PACKSSWB/DW
(x)mm,r/m
PACKUSWB
PUNPCKH/LBW/WD/DQ
(x)mm,r/m
PUNPCKHQDQ
xmm,r/m
PUNPCKLQDQ
xmm,r/m
PSHUFB
(x)mm,r/m
PSHUFD
xmm,xmm,i
PSHUFW
mm,mm,i
PSHUFL/HW
xmm,xmm,i
PALIGNR
(x)mm,r/m,i
PBLENDW
xmm,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,mm/x
PEXTRB/W/D/Q
r,x/mm,i
PINSRB/W/D/Q
x/mm,r,i
EXTRQ
x,i,i
EXTRQ
x,x
INSERTQ
x,x,i,i
INSERTQ
x,x
PMOVSXBW/BD/BQ/WD/WQ/DQ
x,x
PMOVZXBW/BD/BQ/WD/WQ/DQ
x,x
VPCMOV
x,x,x,x/m
VPCMOV
y,y,y,y/m
VPPERM
x,x,x,x/m
Arithmetic instructions
PADDB/W/D/Q/SB/SW/USB/USW
PSUBB/W/D/Q/SB/SW/USB/USW
PHADD/SUB(S)W/D
PCMPEQ/GT B/W/D
PCMPEQQ
PCMPGTQ
PMULLW PMULHW PMULHUW PMULUDQ
PMULLD
PMULDQ
PMULHRSW
PMADDWD
PMADDUBSW
PAVGB/W
PMIN/MAX SB/SW/SD UB/UW/UD
PHMINPOSUW
PABSB/W/D
PSIGNB/W/D
PSADBW
1
1
2
2
1
1
P1
P1
1
1
1
1
1
1
1
1
1
31
65
2
2
2
1
1
1
1
2
2
2
3
2
2
2
2
2
32
45
5
5
6
3
1
1
1
1
1
1
1
1
1
1
1
0.5
16
31
1
1
1
1
1
1
1
P1
P1
P1
P1
P1
P1
P1
P1
P02
P2
P0 P1 P2
P1 P2
P1 P2
P1
P1
P1
P1
P1
AMD SSE4A
AMD SSE4A
AMD SSE4A
AMD SSE4A
1
2
1
P1
SSE4.1
1
1
2
1
2
2
2
2
1
1
2
1
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
1
2
0.5
P02
(x)mm,r/m
x,x
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
3
1
1
1
2
5
2
2
2
0.5
2
0.5
0.5
0.5
P02
P02 2P1
P02
P02
P02
(x)mm,r/m
x,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
1
1
1
1
1
1
4
5
4
4
4
4
2
1
2
1
1
1
1
0.5
P0
P0
P0
P0
P0
P0
P02
(x)mm,r/m
x,r/m
(x)mm,r/m
(x)mm,r/m
(x)mm,r/m
1
2
1
1
2
2
4
2
2
4
0.5
1
0.5
0.5
1
P02
P1 P02
P02
P02
P02
SSE4.1
SSE4.1
SSSE3
SSE4.1
SSE4.2
SSE4.1
SSE4.1
SSSE3
SSE4.1
SSSE3
SSSE3
MPSADBW        x,x,i        8   8   4     P1 P02
VPCOMB/W/D/Q   x,x,x/m,i    1   2   0.5   P02
VPCOMUB/W/D/Q
VPHADDBW/BD/BQ/WD/WQ/DQ
VPHADDUBW/BD/BQ/WD/WQ/DQ
VPHSUBBW/WD/DQ
VPMACSWW/WD
VPMACSDD
VPMACSDQH/L
VPMACSSWW/WD
VPMACSSDD
VPMACSSDQH/L
VPMADCSWD
VPMADCSSWD
x,x,x/m,i
1
2
0.5
P02
SSE4.1
AMD XOP
latency 0 if i=6,7
AMD XOP
latency 0 if i=6,7
x,x/m
1
2
0.5
P02
AMD XOP
x,x/m
x,x/m
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
x,x,x/m,x
1
1
1
1
1
1
1
1
1
1
2
2
4
5
4
4
5
4
4
4
0.5
0.5
1
2
1
1
2
1
1
1
P02
P02
P0
P0
P0
P0
P0
P0
P0
P0
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
AMD XOP
(x)mm,r/m
1
2
0.5
P02
(x)mm,r/m
1
3
1
P1
(x)mm,i
x,i
x,r/m
x,x,x/m
x,x,i
x,x,x/m
x,x,x/m
1
1
2
1
1
1
1
2
2
14
3
2
3
3
1
1
1
1
1
1
1
P1
P1
P1 P2
P1
P1
P1
P1
SSE4.1
AMD XOP
AMD XOP
AMD XOP
AMD XOP
x,x,i
x,x,i
x,x,i
x,x,i
30
30
9
8
11
10
5
6
11
10
5
6
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
x,x/m,i
x,x,x,i
x,x,m,i
x,x
x,x
x,x
x,x
x,x
x,x,i
7
7
8
2
2
2
2
1
1
11
11
7
7
7
1
1
1
1
1
1
P1
P1
P1
P01
P01
P01
P01
P0
P0
pclmul
pclmul
pclmul
aes
aes
aes
aes
aes
aes
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
VPROTB/W/D/Q
VPROTB/W/D/Q
VPSHAB/W/D/Q
VPSHLB/W/D/Q
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
Encryption
PCLMULQDQ
VPCLMULQDQ
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
Other
EMMS
5
5
5
5
5
5
1
0.25
Floating point XMM and YMM instructions
Instruction
Move instructions
MOVAPS/D
MOVUPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
MOVNTSS/SD
SHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
Operands
Ops
Latency Reciprocal
throughput
x,x
y,y
1
2
0
2
0.25
0.5
x,m128
1
2
0.5
y,m256
2
2
1
m128,x
m256,y
m256,y
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
m,x
x,x/m,i
y,y,y/m,i
x,x,x/m
y,y,y/m
x,x/m,i
y,y/m,i
y,y,y,i
y,y,m,i
x,x/m,i
y,y,y/m,i
x,x/m,xmm0
y,y,y/m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
1
2
2
1
1
1
1
1
2
1
1
2
2
1
2
1
1
2
1
2
1
2
8
12
1
2
1
2
1
1
2
2
1
2
2
2
1
1
3
3
3
2
2
3
3
3
4
3
2
5
15
3
3
1
2
2
0.5
0.5
1
1
0.5
1
1
1
1
1
1
2-3
3
1
2
1
2
1
2
3.5
4
0.5
1
0.5
1
1
0.5
2
1
0.5
0.5
0.5
0.5
1
0.5
2
2
3
3
2
2
4
2
2
2
2
2
2
8
8
8
8
2
Execution
pipes
Domain, notes
none
P02
inherit domain
ivec
P2
P2
P2
P01
fp
P2
P1
P01
P1 P2
P2
P1
P1 P2
P1 P2
P2
P2
P2
P2
P2
P1
P1
P1
P1
P0 P2
P0 P2
P01
P01
P01
P01
P1
ivec
AMD SSE4A
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
fp
fp
ivec
P1
ivec
P02
P02
P02
P1
ivec
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
y,y
y,m256
x,x/m
y,y,y/m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
2
2
1
2
2
2
1
2
1
1
2
2
1
2
20
41
Conversion
CVTPD2PS
VCVTPD2PS
CVTPS2PD
VCVTPS2PD
CVTSD2SS
CVTSS2SD
CVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
x,x
x,y
x,x
y,x
x,x
x,x
x,x
y,y
x,x
y,y
x,x
y,x
x,x
x,y
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32/64
r32/64,x
x/m,x,i
x/m,y,i
x,x/m
y,x/m
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
2
P1
ivec
P1
P1
P1 P2
P1 P2
P02
P0 P2
P1
P1
P02
P02
P01
P01
P0 P1 P2
P0 P1 P2
ivec
ivec
10
2
10
2
9
2
10
9
9
~35
~35
2
1
1
2
1
1
0.5
1
1
2
1
1
0.5
1
8
16
2
4
2
4
1
1
1
2
1
2
2
4
2
4
2
1
2
2
2
2
2
2
2
4
2
4
6
6
6
6
4
4
4
4
4
4
7
7
7
7
6
5
7
7
13
12
12
12
7
7
7
7
1
2
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
2
2
2
2
P01
P01
P01
P01
P0
P0
P0
P0
P0
P0
P01
P01
P01
P01
P0 P2
P0
P0 P1
P0 P1
P0
P0 P2
P0
P0 P2
P0 P1
P0 P1
P0 P1
P0 P1
ivec/fp
ivec/fp
ivec/fp
ivec/fp
fp
fp
fp
fp
fp
fp
ivec/fp
ivec/fp
fp/ivec
fp/ivec
ivec/fp
fp
ivec/fp
fp/ivec
fp
fp
fp
fp
F16C
F16C
F16C
F16C
x,x/m
x,x/m
1
1
5-6
5-6
1
1
P01
P01
fma
fma
y,y,y/m
x,x/m
y,y,y/m
2
1
2
5-6
5-6
5-6
2
1
1
P01
P01
P01
fma
fma
fma
2
2
ivec
ivec
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULSD
MULPS MULPD
VMULPS VMULPD
DIVSS DIVPS
VDIVPS
DIVSD DIVPD
VDIVPD
RCPSS/PS
VRCPPS
CMPSS/D
CMPPS/D
VCMPPS/D
COMISS/D
UCOMISS/D
MAXSS/SD/PS/PD
MINSS/SD/PS/PD
x,x
4
10
2
P0 P1
ivec/fma
y,y,y/m
x,x/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y,y/m
x,x/m
y,y/m
8
1
1
2
1
2
1
2
1
2
10
5-6
5-6
5-6
9-17
9-17
9-32
9-32
5
5
4
0.5
0.5
1
4-6
9-12
4-13
9-27
1
2
P01 P1
P01
P01
P01
P01
P01
P01
P01
P01
P01
ivec/fma
fma
fma
fma
fp
fp
fp
fp
fp
fp
x,x/m
y,y,y/m
1
2
2
2
0.5
1
P01
P01
fp
fp
x,x/m
2
1
P01 P2
fp
x,x/m
1
2
0.5
P01
fp
VMAXPS/D VMINPS/D
y,y,y/m
2
ROUNDSS/SD/PS/PD
x,x/m,i
1
VROUNDSS/SD/PS/PD
y,y/m,i
2
DPPS
x,x,i
9
DPPS
x,m,i
10
VDPPS
y,y,y,i
13
VDPPS
y,m,i
15
DPPD
x,x,i
7
DPPD
x,m,i
8
VFMADD132SS/SD
x,x,x/m
1
VFMADD132PS/PD
x,x,x/m
1
VFMADD132PS/PD
y,y,y/m
2
All other FMA3 instructions: same as above
VFMADDSS/SD
x,x,x,x/m
1
VFMADDPS/PD
x,x,x,x/m
1
VFMADDPS/PD
y,y,y,y/m
2
All other FMA4 instructions: same as above
2
4
1
1
P01
P0
fp
fp
4
25
5-6
5-6
5-6
2
4
5
8
8
3
4
0.5
0.5
1
P0
P0 P1
P0 P1
P0 P1
P0 P1
P0 P1
P0 P1
P01
P01
P01
5-6
5-6
5-6
0.5
0.5
1
P01
P01
P01
fp
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
FMA3
FMA3
FMA3
FMA3
AMD FMA4
AMD FMA4
AMD FMA4
AMD FMA4
Math
SQRTSS/PS
VSQRTPS
SQRTSD/PD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
VFRCZSS/SD/PS/PD
VFRCZSS/SD/PS/PD
25
14
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x
x,m
1
2
1
2
1
2
2
4
12-13
12-13
26-29
27-28
5
5
10
4-9
9-18
4-18
9-37
1
2
2
2
P01
P01
P01
P01
P01
P01
P01
P01
fp
fp
fp
fp
fp
fp
AMD XOP
AMD XOP
x,x/m
1
2
0.5
P02
ivec
Logic
AND/ANDN/OR/XORPS/PD
VAND/ANDN/OR/XORPS/PD
Other
VZEROUPPER
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
y,y,y/m
2
m32
m32
m4096
m4096
m
m
9
16
17
32
9
2
59-67
104-112
121-137
191-209
2
1
4
5
6
10
36
17
78
160
147-166
291-297
P02
P02
P02
P0 P2
P0 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
P0 P1 P2
ivec
32 bit mode
64 bit mode
32 bit mode
64 bit mode
Bobcat
AMD Bobcat
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of micro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 micro-operations are micro-coded.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m).
The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value.
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe: Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to different execution pipes.
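As a worked illustration of the reciprocal throughput column, the sketch below converts a table value into instructions per clock cycle and into a best-case cycle count for a run of independent instructions of the same kind. The function names are assumptions made for this example, and the result is only a lower bound, since the text notes that other pipeline bottlenecks may limit the throughput.

```python
from fractions import Fraction


def instructions_per_clock(reciprocal_throughput) -> Fraction:
    """Convert a reciprocal throughput (clock cycles per instruction) into
    instructions per clock cycle. Accepts values such as "1/2" or "0.5"."""
    return 1 / Fraction(reciprocal_throughput)


def best_case_cycles(count: int, reciprocal_throughput) -> Fraction:
    """Lower bound on the clock cycles needed for `count` independent
    instructions of the same kind, ignoring all other bottlenecks."""
    return count * Fraction(reciprocal_throughput)


# A reciprocal throughput of 1/2 means 2 instructions per clock cycle.
print(instructions_per_clock("1/2"))
# 1000 independent instructions at 0.5 clocks each need at least this many clocks.
print(best_case_cycles(1000, "0.5"))
```

Exact fractions are used instead of floating point so that values such as 1/3 convert back to whole instructions per clock without rounding.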
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
Operands
r,r
r,i
r,m
m,r
m8,r8H
m,i
m,r
r,r
r,m
r64,r32
r64,m32
r,r
r,m
r,r
r,m
Ops Latency Reciprocal
throughput
1
1
1
1
1
1
1
1
1
1
1
1
1
2
3
1
4
4
7
6
1
5
1
5
1
1
20
0.5
0.5
1
1
1
1
1
0.5
1
0.5
1
0.5
1
1
Execution
pipe
I0/1
I0/1
AGU
AGU
AGU
AGU
AGU
I0/1
Notes
Any addr. mode
Any addr. mode
AH, BH, CH, DH
I0/1
I0/1
Timing dep. on hw
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
PREFETCH
SFENCE
LFENCE
MFENCE
r
i
m
r
m
r16,[m]
r32/64,[m]
r32/64,[m]
r64,[m]
r
m
m
m
Arithmetic instructions
ADD, SUB r,r/i
ADD, SUB r,m
ADD, SUB m,r
ADC, SBB r,r/i
ADC, SBB r,m
ADC, SBB m,r/i
CMP r,r/i
CMP r,m
INC, DEC, NEG r
INC, DEC, NEG m
AAA
AAS
DAA
DAS
AAD
AAM
MUL, IMUL r8/m8
MUL, IMUL r16/m16
MUL, IMUL r32/m32
MUL, IMUL r64/m64
IMUL r16,r16/m16
IMUL r32,r32/m32
IMUL r64,r64/m64
IMUL r16,(r16),i
IMUL r32,(r32),i
IMUL r64,(r64),i
DIV r8/m8
2
1
1
3
9
9
1
4
29
9
2
1
1
1
4
1
1
1
1
1
1
4
1
4
1
1
1
1
1
1
1
1
1
1
9
9
12
16
4
33
1
3
2
2
1
1
1
2
1
1
1
5
3
1
2-4
4
1
1
1
1
1
6-7
1
1
6
5
10
7
8
5
23
3
3-5
3-4
6-7
3
3
6
4
3
7
27
1
1
2
6
9
1
4
22
8
2
0.5
1
0.5
2
0.5
I0
I0/1
I0
I0/1
I0/1
0.5
1
1
1
~45
1
~45
I0/1
AGU
AGU
AGU
AGU
AGU
AGU
0.5
1
1
1
1
I0/1
0.5
1
0.5
I0/1
23
1
2
1
1
4
3
1
4
27
Any address size
no scale, no offset
w. scale or offset
RIP relative
AMD only
I0/1
I0/1
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
latency ax=3, dx=5
latency eax=3, edx=4
latency rax=6, rdx=7
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BTC, BTR, BTS
BTC
BTR, BTS
BTC
BTR, BTS
BSF, BSR
BSF, BSR
POPCNT
LZCNT
SETcc
SETcc
CLC, STC
CMC
CLD
STD
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64
1
1
1
1
1
1
1
1
1
33
49
81
29
37
55
81
1
1
33
49
81
29
37
55
81
I0
I0
I0
I0
I0
I0
I0
I0/1
I0/1
r,r
r,m
m,r
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
1
1
1
1
1
1
1
1
1
1
9
7
9
9
1
0.5
1
1
0.5
1
0.5
1
0.5
0.5
1
5
4
5
4
I0/1
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
r,r/i
m,i
m,i
m,r
m,r
r,r
r,m
r,r/m
r,r/m
r
m
1
1
10
9
9
8
6
7
8
1
1
5
2
5
4-5
8
8
11
11
9
8
1
1
1
1
1
2
7
7
1
1
1
1
1
5
4
6
5
18
3
4
18
2
16
15
6
12
5
1
1
I0/1
I0/1
I0/1
I0/1
I0/1
1
1
~15
~14
15
15
3
4
15
0.5
1
3
1
15
15
13
15
6
6
5
0.5
1
0.5
0.5
1
2
SSE4.A/SSE4.2
SSE4.A, AMD only
I0/1
I0/1
I0
I0,I1
Control transfer instructions
JMP short/near
JMP r
JMP m(near)
Jcc short/near
J(E/R)CXZ short
LOOP short
CALL near
CALL r
CALL m(near)
RET
RET i
BOUND m
INTO
1
1
1
1
2
8
2
2
5
1
4
8
4
2
2
2
1/2 - 2
1-2
4
2
2
2
~3
~4
4
2
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
4
5
4
2
7
2
5
6
7
6
~3
~3
2
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
recip. t. = 2 if jump
recip. t. = 2 if jump
values for no jump
values for no jump
values are per count
best case 6-7 B/clk
5
best case 5 B/clk
3
3
4
3
1
0
1
0
6
i,0
12
a,b
10+6b
2
30-52 70-830
26
14
0.5
0.5
6
values are per count
values are per count
I0/1
I0/1
36
34+6b
3
32 bit mode
87
8
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
Operands
r
m32/64
m80
m80
r
m32/64
m80
m80
r
Ops Latency Reciprocal
throughput
1
1
7
21
1
1
16
217
1
2
6
14
30
2
6
19
177
0
0.5
1
5
35
0.5
1
9
180
1
Execution
pipe
FP0/1
FP0/1
FP0/1
FP1
FP1
Notes
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FNSTCW
FLDCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
m
m
st0,r
r
AX
m16
m16
m16
r
m
m
r
m
m
r
m
m
r
m
r
m
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
1
1
1
12
1
1
2
2
3
12
1
1
2
1
1
2
1
1
2
1
1
1
1
1
2
1
2
5
1
1
9
6
7
1
~20
~20
3
3
5
5
19
2
2
m
1
1
3
3
3
19
19
19
2
1
1
1
2
1
1
2
11
11-16
11-19
1
31
1
4-44 27-105
11-51 51-94
11-75 48-110
~45
~113
9-75 49-163
5
8
7
9
30-56
~60
8
29
12
44
1
1
9
26
85
1
1
1
7
1
1
10
10
2
10
0
0
1
27-105
51-94
48-110
~113
49-163
0.5
0.5
30
78
163
FP1
FP1
FP0/1
FP1
FP1
FP1
FP1
FP0
FP1
FP0
FP0
FP0,FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0, FP1
FP0
FP1
FP0, FP1
FP1
FP1
FP1
FP0
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
ALU
FP0, FP1
FP0, FP1
FP0, FP1
FRSTOR
FXSAVE
FXRSTOR
m
m
m
80
71
111
123
105
118
FP0, FP1
FP0, FP1
FP0, FP1
Integer MMX and XMM instructions
Instruction
Operands
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32,(x)mm
1
1
1
1
3
2
1
7
7
5
6
6
5
6
1
3
1
1
3
1
2
FP0
FP0/1
FP0/1
FP0
FP1
FP1
FP1
r64,(x)mm
mm,r64
xmm,r64
mm,mm
xmm,xmm
mm,m64
xmm,m64
m64,(x)mm
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm
1
2
3
1
2
1
2
1
2
2
2
2
2
1
2
1
2
7
7
7
1
1
5
5
6
1
6
6
6-9
6-9
1
1
13
13
1
3
3
0.5
1
1
1
2
1
2
3
2-5.5
3-6
0.5
1
1.5
3
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP1
FP1
FP0/1
AGU
FP1
AGU
FP1
FP0/1
FP0/1
FP1
FP1
mm,r/m
1
1
0.5
FP0/1
xmm,r/m
3
2
2
FP0/1
mm,r/m
1
1
0.5
xmm,r/m
xmm,r/m
xmm,r/m
mm,mm
xmm,xmm
xmm,xmm,i
mm,mm,i
xmm,xmm,i
xmm,xmm,i
mm,mm
xmm,xmm
r32,(x)mm
2
2
1
1
6
3
1
2
20
32
64
1
1
1
1
2
3
2
1
2
19
146-1400
279-3000
8
1
1
0.5
1
3
2
0.5
2
12
130-1170
260-2300
2
MOVD (MOVQ)
MOVD (MOVQ)
MOVD (MOVQ)
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU, LDDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKLQDQ
PSHUFB
PSHUFB
PSHUFD
PSHUFW
PSHUFL/HW
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
Ops Latency Reciprocal
throughput
Execution
pipe
FP0, FP1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0, FP1
FP0, FP1
FP0
Notes
Moves 64 bits.
Name differs
do.
do.
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
PEXTRW
PINSRW
PINSRW
INSERTQ
INSERTQ
EXTRQ
EXTRQ
Arithmetic instructions
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
PHADD/SUBW/SW/D
PHADD/SUBW/SW/D
PCMPEQ/GT B/W/D
PCMPEQ/GT B/W/D
PMULLW PMULHW PMULHUW PMULUDQ
PMULLW PMULHW PMULHUW PMULUDQ
PMULHRSW
PMULHRSW
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAX SW/UB
PMIN/MAX SW/UB
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
Logic
PAND PANDN POR
PXOR
PAND PANDN POR
PXOR
r32,(x)mm,i
mm,r32,i
xmm,r32,i
xmm,xmm
xmm,xmm,i,i
xmm,xmm
xmm,xmm,i,i
2
2
3
3
3
1
1
12
10
10
3-4
3-4
1
2
2
6
3
3
1
2
FP0, FP1
FP0/1
FP0/1
FP0, FP1
FP0, FP1
FP0/1
FP0/1
mm,r/m
1
1
0.5
FP0/1
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
2
1
2
1
2
1
1
4
1
1
1
0.5
1
0.5
1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
mm,r/m
1
2
1
FP0
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
mm,r/m
xmm,r/m
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
2
2
2
1
2
1
2
1
2
0.5
1
0.5
1
0.5
1
0.5
1
2
2
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0
FP0, FP1
mm,r/m
1
1
0.5
FP0/1
xmm,r/m
2
1
1
FP0/1
SSE4.A, AMD only
SSE4.A, AMD only
SSE4.A, AMD only
SSE4.A, AMD only
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
mm,i/mm/m
1
1
1
FP0/1
xmm,i/xmm/m
2
2
1
1
1
1
FP0/1
FP0/1
0.5
FP0/1
xmm,i
Other
EMMS
1
Floating point XMM instructions
Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D, MOVLPS/D
MOVHPS/D, MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVDDUP
MOVDDUP
MOVSHDUP, MOVSLDUP
MOVSHDUP, MOVSLDUP
MOVMSKPS/D
SHUFPS/D
UNPCK H/L PS/D
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
Operands
Ops Latency Reciprocal
throughput
Execution
pipe
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,m
m,r
2
2
2
2
2
2
1
2
1
1
6
6
1
6-9
6-9
1
6
5
1
2
3
1
2-6
3-6
0.5
2
2
FP0/1
AGU
FP1
FP0/1
AGU
FP1
FP0/1
FP1
FP1
r,r
1
1
0.5
FP0/1
r,m
1
6
2
AGU
m,r
m,r
m,r
r,r
r,m64
1
2
1
2
2
5
12
12
2
7
3
3
2
1
2
FP1
FP1
FP1
FP0/1
FP0/1
r,r
2
1
1
FP0/1
r,m
r32,r
r,r/m,i
r,r/m
2
1
3
2
12
~6
2
1
3
2
2
1
AGU
FP0
FP0/1
FP0/1
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
2
4
3
1
2
2
2
4
1
2
1
5
5
5
4
4
5
4
6
4
5
4
2
3
3
1
4
2
4
3
2
2
1
FP1
FP0, FP1
FP0, FP1
FP1
FP1
FP1
FP1
FP0, FP1
FP1
FP1
FP1
Notes
SSE4.A, AMD only
SSE3
SSE3
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SS2SI
CVT(T)SD2SI
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
3
3
2
2
2
6
12
11
12
11
2
3
3
1
1
FP0, FP1
FP0, FP1
FP1
FP0, FP1
FP0, FP1
r,r/m
r,r/m
r,r/m
1
2
2
3
3
3
1
2
2
FP0
FP0
FP0
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
2
1
1
2
2
1
2
1
2
1
2
1
2
1
2
3
2
4
2
4
13
38
17
34
3
3
2
2
2
2
2
1
2
2
4
13
38
17
34
1
2
1
2
1
2
FP0
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
r,r/m
1
1
FP0
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
2
1
1
FP0/1
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
2
1
2
1
2
14
48
24
48
3
3
14
48
24
48
1
2
FP1
FP1
FP1
FP1
FP1
FP1
Other
LDMXCSR
STMXCSR
m
m
12
3
10
11
FP0, FP1
FP0, FP1
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDSUBPS/D
HADDPS/D
HSUBPS/D
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVPS
DIVSD
DIVPD
RCPSS
RCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
CMPccSS/D
CMPccPS/D
COMISS/D
UCOMISS/D
SSE3
SSE3
Jaguar
AMD Jaguar
List of instruction timings and macro-operation breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
Ops: Number of micro-operations issued from the instruction decoder to the schedulers. Instructions with more than 2 micro-operations are micro-coded.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latencies listed do not include memory operands where the operand is listed as register or memory (r/m).
The clock frequency varies dynamically, which makes it difficult to measure latencies. The values listed are measured after the execution of millions of similar instructions, assuming that this will make the processor boost the clock frequency to the highest possible value.
Reciprocal throughput: This is also called issue latency. This value indicates the average number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction of the same kind can begin to execute. A value of 1/2 indicates that the execution units can handle 2 instructions per clock cycle in one thread. However, the throughput may be limited by other bottlenecks in the pipeline.
Execution pipe: Indicates which execution pipe is used for the micro-operations. I0 means integer pipe 0. I0/1 means integer pipe 0 or 1. FP0 means floating point pipe 0 (ADD). FP1 means floating point pipe 1 (MUL). FP0/1 means either one of the two floating point pipes. Two micro-operations can execute simultaneously if they go to different execution pipes.
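The difference between the Latency and Reciprocal throughput columns matters most when deciding whether a loop is limited by a dependency chain or by the execution units. The sketch below is a deliberately simplified model, not a description of the measurement method used for these tables: it assumes a hypothetical function name, identical instructions, and no other bottlenecks, and it takes the larger of the latency bound and the throughput bound.

```python
def estimated_cycles(count, latency, reciprocal_throughput, chains=1):
    """Rough lower bound on clock cycles for `count` identical instructions
    split evenly over `chains` independent dependency chains.

    latency_bound: each chain must wait `latency` cycles per instruction.
    throughput_bound: the execution units accept one instruction every
    `reciprocal_throughput` cycles regardless of dependencies.
    """
    latency_bound = (count / chains) * latency
    throughput_bound = count * reciprocal_throughput
    return max(latency_bound, throughput_bound)


# Hypothetical instruction with latency 3 and reciprocal throughput 1:
for chains in (1, 3, 6):
    print(chains, estimated_cycles(1200, 3, 1, chains=chains))
```

Under these assumptions a single chain is limited by latency, while three or more independent chains reach the throughput limit, so adding further chains gives no additional gain.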
Integer instructions
Instruction    Operands    Ops   Latency   Reciprocal throughput   Execution pipe
Move instructions
MOV            r,r         1     1         0.5                     I0/1
MOV            r,i         1               0.5                     I0/1
MOV            r8/16,m     1     4         1                       AGU
MOV            m,r8/16     1     4         1                       AGU
MOV            r32/64,m    1     3         1                       AGU
MOV
MOV
MOVNTI
MOVZX, MOVSX
MOVZX, MOVSX
MOVSXD
m,r32/64
m,i
m,r
r,r
r,m
r64,r32
1
1
1
1
1
1
0
1
1
1
0.5
1
0.5
AGU
AGU
AGU
I0/1
6
1
4
1
Notes
Any addressing
mode
Any addressing
mode
Any addressing
mode
Any addressing
mode
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
BSWAP
MOVBE
MOVBE
PREFETCHNTA
PREFETCHT0/1/2
PREFETCHW
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC, NEG
INC, DEC, NEG
AAA
AAS
DAA
DAS
AAD
AAM
r64,m32
r,r
r,m
r8,r8
r,r
1
1
1
3
2
3
1
r,m
3
2
1
1
2
2
9
9
1
3
1
29
9
2
1
1
1
4
1
1
1
1
1
1
1
1
1
4
4
16
5
1
1
1
1
1
1
1
1
1
1
9
9
12
16
4
8
1
r
i
m
SP
r
m
SP
r16,[m]
r32/64,[m]
r32/64,[m]
r64,[m]
r
r,m
m,r
m
m
m
r,r/i
r,m
m,r
r,r/i
r,m
m,r/i
r,r/i
r,m
r
m
2
1
3
1
2
3
1
1
1
6
1
8
1
1
6
5
8
6
8
5
14
1
0.5
1
2
1
I0/1
I0/1
I0/1
Timing depends on
hw
3
1
1
1
1
6
8
1
2
2
18
8
2
0.5
1
0.5
2
0.5
1
0.5
1
1
~100
~100
~100
0.5
~45
~45
I0
I0/1
I0
I0/1
I0/1
I0/1
MOVBE
MOVBE
AGU
AGU
AGU
AGU
AGU
AGU
0.5
1
1
1
1
I0/1
0.5
1
0.5
1
I0/1
13
Any address size
1-2 comp., no scale
3 comp. or scale
RIP relative
I0/1
I0/1
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW, CWDE, CDQE
CWD, CDQ, CQO
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
ANDN
ANDN
TEST
TEST
TEST
NOT
NOT
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL,SHR,SAR,ROL,
ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHLD, SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
r8/m8
r16/m16
r32/m32
r64/m64
r16,r16/m16
r32,r32/m32
r64,r64/m64
r16,(r16),i
r32,(r32),i
r64,(r64),i
r8/m8
r16/m16
r32/m32
r64/m64
r8/m8
r16/m16
r32/m32
r64/m64
1
3
2
2
1
1
1
2
1
1
1
2
2
2
1
2
2
2
1
1
3
3
3
6
3
3
6
4
3
6
11-14
12-19
12-27
12-43
11-14
12-19
12-27
12-43
1
1
1
3
2
5
1
1
4
1
1
4
11-14
12-19
12-27
12-43
11-14
12-19
12-27
12-43
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0
I0/1
I0/1
r,i
r,r
r,m
m,r
r,r,r
r,r,m
r,i
r,r
r,m
r
m
r,i/CL
r,i/CL
r,1
r,i
r,i
r,CL
r,CL
1
1
1
1
1
2
1
1
1
1
1
1
1
1
9
7
9
7
1
1
0.5
0.5
1
1
0.5
1
0.5
0.5
1
0.5
1
0.5
0.5
1
5
4
5
4
I0/1
I0/1
m,i /CL
m,1
m,i
m,i
m,CL
m,CL
r,r,i
r,r,cl
m,r,i/CL
r,r/i
m,i
m,r
1
1
10
9
9
8
6
7
8
1
1
5
6
6
1
1
1
1
6
1
1
1
5
4
5
4
3
4
1
1
11
11
11
11
3
4
11
0.5
1
3
BMI1
BMI1
I0/1
I0/1
I0/1
I0/1
I0/1
I0/1
BTC, BTR, BTS
BTC
BTR, BTS
BTC, BTR, BTS
BSF
BSR
BSF, BSR
POPCNT
LZCNT
TZCNT
BLSI BLSR
BLSI BLSR
BLSMSK
BLSMSK
BEXTR
BEXTR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
r,r/i
m,i
m,i
m,r
r,r
r,r
r,m
r,r/m
r,r
r,r
r,r
r,m
r,r
r,m
r,r,r
r,m,r
r
m
2
5
4
8
7
8
8
1
1
2
2
3
2
3
1
2
1
1
1
1
1
2
2
4
4
1
1
2
2
2
1
1
1
11
11
11
4
4
4
0.5
0.5
1
1
2
1
2
0.5
1
0.5
1
0.5
1
1
2
Control transfer instructions
JMP
short/near
JMP
r
JMP
m(near)
Jcc
short/near
J(E/R)CXZ
short
LOOP
short
LOOPE LOOPNE
short
CALL
near
CALL
r
CALL
m(near)
RET
RET
i
1
1
1
1
2
8
10
2
2
5
1
4
2
2
2
0.5 - 2
1-2
5
6
2
2
2
3
3
BOUND
8
4
4
2
4
~5n
4
~2n
2/16B
7
~2n
2/16B
5
~6n
7
2
~3n
2
~n
1/16B
4
~1.5n
1/16B
3
~3n
4
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
m
SSE4A/SSE4.2
SSE4A/LZCNT
BMI1
BMI1
BMI1
BMI1
BMI1
BMI1
BMI1
I0/1
I0/1
I0
I0,I1
2 if jumping
2 if jumping
values are for no
jump
values are for no
jump
for small n
best case
for small n
best case
REP CMPS
Synchronization
LOCK ADD
XADD
LOCK XADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
XGETBV
RDTSC
RDTSCP
RDPMC
CRC32
CRC32
~6n
m,r
m,r
m,r
m,r8
m,r8
m,r16/32/64
m,r16/32/64
m64
m64
m128
m128
r,r
r,m
1
4
4
5
5
6
6
18
18
28
28
~3n
19
11
16
11
16
11
17
11
19
32
38
1
1
37
i,0
12
a,b
10+6b
2
30-59 70-230
5
34
34
30
3
3
4
0.5
0.5
46
I0/1
I0/1
18
17+3b
3
32 bit mode
5
41
42
27
2
2
rdtscp
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(T)(P)
FLDZ, FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FNSTSW
FNSTCW
Operands
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m
m
st0,r
r
AX
m16
m16
Ops Latency Reciprocal
throughput
1
1
7
21
1
1
10
217
1
1
1
1
12
1
1
2
2
3
2
4
9
24
2
3
9
167
0
8
4
7
1
0.5
1
5
29
0.5
1
7
168
1
1
1
1
7
1
1
11
11
2
Execution
pipe
FP0/1
FP0/1
FP0/1
FP1
FP1
FP1
FP1
FP1
FP0/1
FP1
FP1
FP1
FP1
FP0
Notes
FLDCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FADD(P),FSUB(R)(P)
FIADD,FISUB(R)
FMUL(P)
FMUL(P)
FIMUL
FDIV(R)(P)
FDIV(R)(P)
FIDIV(R)
FABS, FCHS
FCOM(P), FUCOM(P)
FCOM(P), FUCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
m16
12
r
m
m
r
m
m
r
m
m
1
1
2
1
1
1
1
1
2
1
1
1
1
1
2
1
2
5
1
1
r
m
r
m
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
1
1
4-44
11-51
11-76
11-45
9-75
5
7
8
8-51
61
m
m
1
1
9
27
88
80
3
5
22
2
8
11-54
11-56
35
30-139
38-93
55-122
55-177
44-167
27
9
32-37
30-120
~160
0
138-150
136
9
FP1
1
1
2
3
3
FP0
FP0
FP0,FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0, FP1
FP0
1FP1
FP0, FP1
FP1
FP1
22
22
22
2
1
1
1
2
1
1
2
4
35
1
30-151
30-120
~160
FP1
FP0
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
0.5
0.5
32
78
138-150
136
FP0/1
ALU
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
55-180
55-177
44-167
6
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD
MOVD
Operands
r32, mm
mm, r32
Ops Latency Reciprocal
throughput
1
2
4
6
1
1
Execution
pipe
FP0
FP0/1
Notes
MOVD
MOVD
MOVD
MOVD
MOVD
mm,m32
r32, x
x, r32
x,m32
m32,(x)mm
1
1
2
1
1
4
4
6
4
3
1
1
1
1
1
AGU
FP0
FP1
AGU
FP1
MOVD / MOVQ
r64,(x)mm
MOVQ
mm,r64
MOVQ
x,r64
MOVQ
mm,mm
MOVQ
x,x
MOVQ
(x)mm,m64
MOVQ
m64,(x)mm
MOVDQA
x,x
VMOVDQA
y,y
MOVDQA
x,m
VMOVDQA
y,m
MOVDQA
m,x
VMOVDQA
m,y
MOVDQU, LDDQU
x,m
MOVDQU
m,x
MOVDQ2Q
mm,x
MOVQ2DQ
x,mm
MOVNTQ
m,mm
MOVNTDQ
m,x
PACKSSWB/DW
PACKUSWB
mm,r/m
PACKSSWB/DW
PACKUSWB
x,r/m
PUNPCKH/LBW/WD/D
Q
mm,r/m
PUNPCKH/LBW/WD/D
Q
x,r/m
PUNPCKH/LQDQ
x,r/m
PSHUFB
mm,mm
PSHUFB
x,x
PSHUFD
x,x,i
PSHUFW
mm,mm,i
PSHUFL/HW
x,x,i
PALIGNR
x,x,i
PBLENDW
x,r/m
MASKMOVQ
mm,mm
MASKMOVDQU
x,x
PMOVMSKB
r32,(x)mm
PEXTRW
r32,(x)mm,i
PINSRW
mm,r32,i
PINSRB/W/D/Q
x,r,i
PINSRB/W/D/Q
x,m,i
PEXTRB/W/D/Q
r,x,i
PEXTRB/W/D/Q
m,x,i
INSERTQ
x,x
INSERTQ
x,x,i,i
EXTRQ
x,x
1
2
2
1
1
1
1
1
2
1
2
1
2
1
1
1
1
1
1
4
6
6
1
1
4
3
1
1
4
4
3
3
4
3
1
1
429
429
1
1
1
0.5
0.5
1
1
0.5
1
1
2
1
2
1
1
0.5
0.5
2
2
FP0
FP0/1
FP0/1
FP0/1
FP0/1
AGU
FP1
FP0/1
FP0/1
AGU
AGU
FP1
FP1
AGU
FP1
FP0/1
FP0/1
FP1
FP1
1
1
0.5
FP0/1
1
2
0.5
FP0/1
1
1
0.5
FP0/1
1
1
1
3
1
1
1
1
1
32
64
1
1
2
2
1
1
1
3
3
1
2
2
1
4
2
1
1
2
1
432
43-2210
3
4
8
7
0.5
0.5
0.5
2
0.5
0.5
0.5
0.5
0.5
17
34
1
1
1
1
1
1
1
2
2
0.5
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0, FP1
FP0, FP1
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0
FP1
FP0, FP1
FP0, FP1
FP0/1
3
2
2
1
Moves 64 bits. Name of instruction differs
do.
do.
AVX
AVX
AVX
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
SSE4.1
SSE4.1
SSE4.1
SSE4A, AMD only
SSE4A, AMD only
SSE4A, AMD only
EXTRQ
PMOVSXBW/BD/BQ/
WD/WQ/DQ
PMOVZXBW/BD/BQ/
WD/WQ/DQ
x,x,i,i
1
1
0.5
FP0/1
SSE4A, AMD only
x,x
1
2
0.5
FP0/1
SSE4.1
x,x
1
2
0.5
FP0/1
SSE4.1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
1
3
1
1
1
1
1
1
1
1
1
3
2
4
2
2
2
2
1
1
1
1
2
4
1
2
1
1
1
1
0.5
0.5
0.5
0.5
0.5
1
FP0
FP0 FP1
FP0
FP0
FP0
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
(x)mm,r/m
1
1
0.5
FP0/1
mm,i/mm/m
1
1
0.5
FP0/1
x,x
1
2
0.5
FP0/1
x,i
x,i
x,x/m
1
1
1
1
2
3
0.5
0.5
1
FP0/1
FP0/1
FP0
SSE4.1
x,x,i
x,m,i
x,x,i
x,m,i
x,x,i
x,m,i
9
10
9
10
3
4
5
5
5
9
9
2
2
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
Arithmetic instructions
PADDB/W/D/Q
PADDSB/W
PADDUSB/W
PSUBB/W/D/Q
PSUBSB/W
PSUBUSB/W
(x)mm,r/m
PHADD/SUBW/SW/D
mm,r/m
PHADD/SUBW/SW/D
x,r/m
PCMPEQ/GT B/W/D
mm,r/m
PCMPEQ/GT B/W/D
x,r/m
PCMPEQQ
(x)mm,r/m
PCMPGTQ
(x)mm,r/m
PMULLW PMULHW
PMULHUW
PMULUDQ
(x)mm,r/m
x,r/m
PMULLD
x,r/m
PMULDQ
PMULHRSW
(x)mm,r/m
PMADDWD
(x)mm,r/m
PMADDUBSW
(x)mm,r/m
PAVGB/W
(x)mm,r/m
PMIN/MAX SW/UB
(x)mm,r/m
PABSB/W/D
(x)mm,r/m
PSIGNB/W/D
(x)mm,r/m
PSADBW
(x)mm,r/m
MPSADBW
x,x,i
Logic
PAND PANDN POR
PXOR
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLL/RL W/D/Q
PSRAW/D
PSLLDQ, PSRLDQ
PTEST
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
9
2
Suppl. SSE3
Suppl. SSE3
SSE4.1
SSE4.2
SSE4.1
SSE4.1
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
Suppl. SSE3
SSE4.1
PCMPISTRM
PCMPISTRM
Encryption
PCLMULQDQ
AESDEC
AESDECLAST
AESENC
AESENCLAST
AESIMC
AESKEYGENASSIST
x,x,i
x,m,i
3
4
8
8
2
FP0/1
FP0/1
SSE4.2
SSE4.2
x,x/m,i
x,x
x,x
x,x
x,x
x,x
x,x,i
1
2
2
2
2
1
1
3
5
5
5
5
2
2
1
1
1
1
1
1
1
FP0
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
PCLMUL
AES
AES
AES
AES
AES
AES
0.5
FP0/1
Other
EMMS
1
Floating point XMM instructions
Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D
VMOVAPS/D
MOVAPS/D
VMOVAPS/D
MOVUPS/D
VMOVUPS/D
MOVUPS/D
VMOVUPS/D
MOVUPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHLPS, MOVLHPS
MOVHPS/D,
MOVLPS/D
MOVHPS/D,
MOVLPS/D
MOVNTPS/D
MOVNTSS/D
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
MOVMSKPS/D
VMOVMSKPS/D
Operands
Ops Latency Reciprocal
throughput
Execution
pipe
x,x
y,y
x,m
y,m
m,x
m,y
x,x
y,y
x,m
y,m
m,x
m,y
x,x
x,m
m,x
1
2
1
2
1
2
1
2
1
2
1
2
1
1
1
1
1
4
4
3
3
1
1
4
4
3
3
1
4
3
0.5
1
1
2
1
2
0.5
1
1
2
1
2
0.5
1
1
FP0/1
FP0/1
AGU
AGU
FP1
FP1
FP0/1
FP0/1
AGU
AGU
FP1
FP1
FP0/1
AGU
FP1
x,x
1
2
2
FP0/1
x,m
1
5
1
FP0/1
m,x
m,x
m,x
x,x
x,m64
y,y
y,m
x,x
x,m
y,y
y,m
r32,x
r32,y
1
1
1
1
1
2
2
1
1
2
2
1
1
4
429
1
1
1
0.5
1
1
2
0.5
1
1
2
1
1
FP1
FP1
FP1
FP0/1
AGU
FP0/1
AGU
FP0/1
AGU
FP0/1
AGU
FP0
FP0
2
2
1
1
3
3
Notes
SSE4A, AMD only
SSE3
SSE3
AVX
AVX
AVX
AVX
AVX
SHUFPS/D
VSHUFPS/D
UNPCK H/L PS/D
VUNPCK H/L PS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
x,x/m,i
y,y,y,i
x,x/m
y,y,y
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
1
2
1
2
1
1
1
1
1
1
2
2
1
2
19
36
2
2
2
2
3
3
1
12
Conversion
CVTPS2PD
VCVTPS2PD
CVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS/PD
VCVTDQ2PS/PD
CVT(T)PS2DQ
VCVT(T)PS2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SS2SI
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
x,x/m
y,x/m
x,x/m
x,y
x,x/m
x,x/m
x,x/m
y,y
x,x/m
y,y
x,x/m
y,y
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
x/m,x,i
x/m,y,i
x,x/m
y,x/m
x,x/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
x,x/m
y,y/m
Arithmetic
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
HADD/SUBPS/D
VHADD/SUBPS/D
MULSS/PS
VMULPS
6
1
13
15
15
21
32
0.5
1
0.5
1
1
1
0.5
1
1
1
1
2
1
2
16
22
FP0/1
FP0/1
FP0/1
FP0/1
FP0
FP1
FP0/1
FP1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP0/1
FP1
FP1
AVX
AVX
>300 clk if mask=0
>300 clk if mask=0
AVX
AVX
1
2
1
3
2
2
1
2
1
2
1
3
1
1
1
1
2
2
2
2
1
3
1
2
3
4
4
6
5
4
4
4
4
4
4
7
4
4
4
4
9
9
8
8
4
6
4
5
1
2
1
2
8
7
1
2
1
2
1
2
1
1
1
1
1
1
1
1
1
2
1
2
FP1
FP1
FP1
FP0, FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0, FP1
FP1
FP1
F16C
F16C
F16C
F16C
1
1
2
1
2
1
2
1
2
3
3
3
3
3
4
4
2
2
1
1
2
1
2
1
2
1
2
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP1
FP1
AVX
AVX
AVX
AVX
SSE3
SSE3
MULSD/PD
VMULPD
DIVSS
DIVPS
VDIVPS
DIVSD
DIVPD
VDIVPD
RCPSS
RCPPS
VRCPPS
MAXSS/D MINSS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
CMPccSS/D
CMPccPS/D
VCMPccPS/D
(U)COMISS/D
ROUNDSS/SD/PS/PD
VROUNDSS/D/PS/D
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m,i
y,y/m,i
x,x,i
x,m,i
y,y,y,i
y,m,i
x,x,i
x,m,i
1
2
1
1
2
1
1
2
1
1
2
1
1
2
1
1
2
1
1
2
5
6
10
12
3
4
4
4
14
19
38
19
19
38
2
2
2
2
2
2
2
2
2
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
VANDPS/D, etc.
x,x/m
y,y/m
1
2
Math
SQRTSS
SQRTPS
VSQRTPS
SQRTSD
SQRTPD
VSQRTPD
RSQRTSS/PS
VRSQRTPS
x,x/m
x,x/m
y,y/m
x,x/m
x,x/m
y,y/m
x,x/m
y,y/m
m
m
Other
LDMXCSR
STMXCSR
VZEROUPPER
VZEROUPPER
VZEROALL
VZEROALL
FXSAVE
FXSAVE
FXRSTOR
FXRSTOR
2
2
14
19
38
19
19
38
1
1
2
1
1
2
1
1
2
1
1
2
4
4
7
7
3
3
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP1
FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
FP0, FP1
1
1
0.5
1
FP0/1
FP0/1
1
2
2
1
2
2
1
2
16
21
42
27
27
54
2
2
16
21
42
27
27
54
1
2
FP1
FP1
FP1
FP1
FP1
FP1
FP1
FP1
12
3
21
37
41
73
66
58
115
123
9
13
8
12
30
46
58
90
66
58
189
197
FP0, FP1
FP0, FP1
4
4
11
12
9
66
58
189
198
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
32 bit mode
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
XSAVE
XSAVE
XRSTOR
XRSTOR
130
114
219
251
145
129
342
375
145
129
342
375
32 bit mode
64 bit mode
32 bit mode
64 bit mode
Intel Pentium
Intel Pentium and Pentium MMX
List of instruction timings
Explanation of column headings:
Operands: r = register, accum = al, ax or eax, m = memory, i = immediate data, sr = segment register, m32 = 32 bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably.
Pairability: u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in either pipe, np = not pairable.
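The pairability classes can be illustrated with a deliberately simplified pairing model. The Python sketch below is an assumption-laden illustration, not the full pairing rules: it only checks the pairability class of two adjacent instructions and ignores register dependences, prefixes, instruction length and the other conditions described in manual 3. The function name `issue_slots` is introduced here for illustration.

```python
def issue_slots(classes):
    """Count issue slots for a sequence of pairability classes.

    `classes` is a list of "u", "v", "uv" or "np", in program order.
    Two adjacent instructions share one slot when the first can go to
    the u-pipe and the second can go to the v-pipe. All other pairing
    conditions are ignored in this simplified model.
    """
    allowed = {"u", "v", "uv", "np"}
    for c in classes:
        if c not in allowed:
            raise ValueError(f"unknown pairability class: {c!r}")
    u_capable = {"u", "uv"}
    v_capable = {"v", "uv"}
    slots = 0
    i = 0
    while i < len(classes):
        pairs = (
            classes[i] in u_capable
            and i + 1 < len(classes)
            and classes[i + 1] in v_capable
        )
        i += 2 if pairs else 1
        slots += 1
    return slots
```

Under this model four consecutive uv instructions need two issue slots, whereas two consecutive u instructions cannot pair and need two slots on their own.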
Integer instructions (Pentium and Pentium MMX)
Instruction
NOP
MOV
MOV
MOV
MOV
XCHG
XCHG
XCHG
XLAT
PUSH
POP
PUSH
POP
PUSH
POP
PUSHF
POPF
PUSHA POPA
PUSHAD POPAD
LAHF SAHF
MOVSX MOVZX
LEA
LDS LES LFS LGS LSS
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
TEST
TEST
TEST
TEST
INC DEC
INC DEC
NEG NOT
Operands
r/m, r/m/i
r/m, sr
sr , r/m
m , accum
(E)AX, r
r,r
r,m
r/i
r
m
m
sr
sr
r , r/m
r,m
m
r , r/i
r,m
m , r/i
r , r/i
r,m
m , r/i
r , r/i
m , r/i
r,r
m,r
r,i
m,i
r
m
r/m
Clock cycles Pairability
1
uv
1
uv
1
np
>= 2 b)
np
1
uv h)
2
np
3
np
>15
np
4
np
1
uv
1
uv
2
np
3
np
1 b)
np
>= 3 b)
np
3-5
np
4-6
np
5-9 i)
np
5
np
2
np
3 a)
np
1
uv
4 c)
np
1
uv
2
uv
3
uv
1
u
2
u
3
u
1
uv
2
uv
1
uv
2
uv
1
f)
2
np
1
uv
3
uv
1/3
np
MUL IMUL
MUL IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW CWDE
CWD CDQ
SHR SHL SAR SAL
SHR SHL SAR SAL
SHR SHL SAR SAL
ROR ROL RCR RCL
ROR ROL
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
JMP CALL
JMP CALL
conditional jump
CALL JMP
RETN
RETN
RETF
RETF
J(E)CXZ
LOOP
BOUND
CLC STC CMC CLD STD
CLI STI
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
BSWAP
CPUID
r8/r16/m8/m16
all other versions
r8/m8
r16/m16
r32/m32
r8/m8
r16/m16
r32/m32
r,i
m,i
r/m, CL
r/m, 1
r/m, i(><1)
r/m, CL
r/m, i(><1)
r/m, CL
r, i/CL
m, i/CL
r, r/i
m, i
m, i
r, r/i
m, i
m, r
r , r/m
r/m
short/near
far
short/near
r/m
i
i
short
short
r,m
r
11
9 d)
17
25
41
22
30
46
3
2
1
3
4/5
1/3
1/3
4/5
8/10
7/9
4 a)
5 a)
4 a)
4 a)
9 a)
7 a)
8 a)
14 a)
7-73 a)
1/2 a)
1 e)
>= 3 e)
1/4/5/6 e)
2/5 e)
2/5 e)
3/6 e)
4/7 e)
5/8 e)
4-11 e)
5-10 e)
8
2
6-9
2
7+3*n g)
3
10+n g)
4
12+n g)
4
9+4*n g)
5
8+4*n g)
1 a)
13-16 a)
np
np
np
np
np
np
np
np
np
np
u
u
np
u
np
np
np
np
np
np
np
np
np
np
np
np
np
np
v
np
v
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
RDTSC 6-13 a) j) np
Notes:
a) This instruction has a 0FH prefix which takes one clock cycle extra to decode on a P1 unless preceded by a multi-cycle instruction.
b) Versions with FS and GS have a 0FH prefix. See note a.
c) Versions with SS, FS, and GS have a 0FH prefix. See note a.
d) Versions with two operands and no immediate have a 0FH prefix. See note a.
e) High values are for mispredicted jumps/branches.
f) Only pairable if register is AL, AX or EAX.
g) Add one clock cycle for decoding the repeat prefix unless preceded by a multi-cycle instruction (such as CLD).
h) Pairs as if it were writing to the accumulator.
i) 9 if SP divisible by 4 (imperfect pairing).
j) On P1: 6 in privileged or real mode; 11 in non-privileged; error in virtual mode. On PMMX: 8 and 13 clocks respectively.
Floating point instructions (Pentium and Pentium MMX)
Explanation of column headings:
Operands: r = register, m = memory, m32 = 32-bit memory operand, etc.
Clock cycles: The numbers are minimum values. Cache misses, misalignment, denormal operands, and exceptions may increase the clock counts considerably.
Pairability: + = pairable with FXCH, np = not pairable with FXCH.
i-ov: Overlap with integer instructions. i-ov = 4 means that the last four clock cycles can overlap with subsequent integer instructions.
fp-ov: Overlap with floating point instructions. fp-ov = 2 means that the last two clock cycles can overlap with subsequent floating point instructions. (WAIT is considered a floating point instruction here.)
Instruction
FLD
FLD
FBLD
FST(P)
FST(P)
FST(P)
FBSTP
FILD
FIST(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FNSTSW
FLDCW
FNSTCW
FADD(P)
FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FCHS FABS
Operand
r/m32/m64
m80
m80
r
m32/m64
m80
m80
m
m
AX/m16
m16
m16
r/m
r/m
r/m
r/m
Clock cycles Pairability
1
0
3
np
48-58
np
1
np
2 m)
np
3 m)
np
148-154
np
3
np
6
np
2
np
5 s)
np
6 q)
np
8
np
2
np
3
0
3
0
3
0
19/33/39 p)
0
1
0
i-ov
0
0
0
0
0
0
0
2
0
0
2
0
0
0
2
2
2
38 o)
0
fp-ov
0
0
0
0
0
0
0
2
0
0
2
0
0
0
2
2
2 n)
2
0
FCOM(P)(P) FUCOM
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM
FTST
FXAM
FPREM
FPREM1
FRNDINT
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
FNOP
FXCH
FINCSTP FDECSTP
FFREE
FNCLEX
FNINIT
FNSAVE
FRSTOR
WAIT
Notes:
m
n
o
p
q
r
s
r/m
m
m
m
m
r
r
m
m
1
6
6
22/36/42 p)
4
1
17-21
16-64
20-70
9-20
20-32
12-66
70
65-100 r)
89-112 r)
53-59 r)
103 r)
105 r)
120-147 r)
112-134 r)
1
1
2
2
6-9
12-22
124-300
70-95
1
0
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
np
0
2
2
38 o)
0
0
4
2
2
0
5
0
69 o)
2
2
2
2
2
36 o)
2
0
0
0
0
0
0
0
0
0
0
2
2
2
0
0
0
2
2
0
0
0
2
2
2
2
2
2
0
2
0
0
0
0
0
0
0
0
0
m) The value to store is needed one clock cycle in advance.
n) 1 if the overlapping instruction is also an FMUL.
o) Cannot overlap integer multiplication instructions.
p) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more. The precision is defined by bits 8-9 of the floating point control word.
q) The first 4 clock cycles can overlap with preceding integer instructions.
r) Clock counts are typical. Trivial cases may be faster, extreme cases may be slower.
s) May be up to 3 clocks more when output needed for FST, FCHS, or FABS.
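The dependence of the FDIV timing on the precision control field can be sketched as a small lookup. The Python fragment below is an illustration, not a measurement tool: the function name `fdiv_clocks` is introduced here, and the encoding of the precision control bits (00 = 24 bits, 10 = 53 bits, 11 = 64 bits, 01 reserved) is taken from the standard x87 control word layout rather than from the tables in this manual.

```python
# Precision control field (bits 8-9 of the x87 control word) mapped to the
# minimum FDIV clock counts listed for the Pentium and Pentium MMX.
FDIV_CLOCKS = {
    0b00: 19,  # 24-bit precision
    0b10: 33,  # 53-bit precision
    0b11: 39,  # 64-bit precision
}

def fdiv_clocks(control_word, integer_operand=False):
    """Return the minimum clock count for FDIV, or FIDIV when
    `integer_operand` is true, for a given x87 control word."""
    pc = (control_word >> 8) & 0b11
    if pc not in FDIV_CLOCKS:
        raise ValueError("reserved precision control value 01")
    clocks = FDIV_CLOCKS[pc]
    if integer_operand:
        clocks += 3  # FIDIV takes 3 clocks more
    return clocks
```

With the power-on default control word 037FH the field selects 64-bit precision, so `fdiv_clocks(0x037F)` gives 39 and `fdiv_clocks(0x037F, integer_operand=True)` gives 42.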
MMX instructions (Pentium MMX)
A list of MMX instruction timings is not needed because they all take one clock cycle, except the MMX
multiply instructions which take 3. MMX multiply instructions can be pipelined to yield a throughput of one
multiplication per clock cycle.
The EMMS instruction takes only one clock cycle, but the first floating point instruction after an EMMS
takes approximately 58 clocks extra, and the first MMX instruction after a floating point instruction takes
approximately 38 clocks extra. There is no penalty for an MMX instruction after EMMS on the PMMX.
There is no penalty for using a memory operand in an MMX instruction because the MMX arithmetic unit
is one step later in the pipeline than the load unit. But the penalty comes when you store data from an
MMX register to memory or to a 32-bit register: The data have to be ready one clock cycle in advance.
This is analogous to the floating point store instructions.
All MMX instructions except EMMS are pairable in either pipe. Pairing rules for MMX instructions are described in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
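The timing figures in this section lend themselves to a simple back-of-the-envelope estimate. The Python sketch below is only an illustration under stated assumptions: the multiplications are independent, one is started per clock cycle, and pairing, cache misses and store delays are ignored. The function names and the idea of adding the two approximate switching penalties into one round-trip cost are introduced here and are not measured values.

```python
MMX_MULTIPLY_LATENCY = 3      # clocks for an MMX multiply, from the text
EMMS_CLOCKS = 1               # EMMS itself
FP_AFTER_EMMS_PENALTY = 58    # approximate, first FP instruction after EMMS
MMX_AFTER_FP_PENALTY = 38     # approximate, first MMX instruction after FP

def pipelined_multiply_clocks(n):
    """Clocks for n independent, fully pipelined MMX multiplications."""
    if n <= 0:
        return 0
    # The first result appears after the full latency; each further
    # multiplication completes one clock later.
    return MMX_MULTIPLY_LATENCY + (n - 1)

def round_trip_overhead(round_trips):
    """Approximate extra clocks for switching MMX -> FP -> MMX."""
    per_trip = EMMS_CLOCKS + FP_AFTER_EMMS_PENALTY + MMX_AFTER_FP_PENALTY
    return round_trips * per_trip
```

Eight independent multiplications take about 10 clocks in this model, while a single switch from MMX code to floating point code and back costs roughly 97 clocks, which is why such switches are best kept out of inner loops.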
Pentium II and III
Intel Pentium II and Pentium III
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of μops that the instruction generates for each execution port.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
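The difference between latency and reciprocal throughput can be made concrete with a rough estimate for interleaved dependency chains. The Python sketch below is an illustration only: the function name `estimate_cycles` is introduced here, the numbers in the usage note are illustrative rather than taken from a specific table row, and the model ignores decoding, retirement and port conflicts with other instructions.

```python
def estimate_cycles(n_per_chain, chains, latency, recip_throughput):
    """Rough clock count for `chains` independent dependency chains,
    each consisting of `n_per_chain` instructions of the same kind.

    Within a chain every instruction must wait for the previous result
    (latency bound). Across chains the execution unit can start a new
    instruction every `recip_throughput` clocks (throughput bound).
    The slower of the two bounds dominates.
    """
    latency_bound = n_per_chain * latency
    throughput_bound = n_per_chain * chains * recip_throughput
    return max(latency_bound, throughput_bound)
```

For an instruction with latency 3 and reciprocal throughput 1, chains of 100 instructions take about 300 clocks whether one, two or three chains are interleaved, and about 400 clocks with four chains. Under this model the extra chains are free until the throughput limit is reached.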
Integer instructions (Pentium Pro, Pentium II and Pentium III)
Instruction
Operands
p0
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVSX MOVZX
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
POP
POP
PUSH
POP
PUSH
POP
r,r/i
r,m
m,r/i
r,sr
m,sr
sr,r
sr,m
r,r
r,m
r,r
r,m
r,r
r,m
r/i
r
(E)SP
m
m
sr
sr
p1
μops
p01 p2 p3
1
1
1
1
1
1
8
7
Latency
p4
1
1
5
8
1
1
1
1
1
1
1
3
4
1
1
1
2
1
5
2
8
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
high b)
Reciprocal
throughput
PUSHF(D)
POPF(D)
PUSHA(D)
POPA(D)
LAHF SAHF
LEA
LDS LES LFS LGS
LSS
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADD SUB AND OR XOR
ADC SBB
ADC SBB
ADC SBB
CMP TEST
CMP TEST
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS
AAD
AAM
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
CBW CWDE
CWD CDQ
SHR SHL SAR ROR
ROL
SHR SHL SAR ROR
ROL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
3
10
r,m
11
6
2
2
1
1
1
8
8
1
8
1
1 c)
m
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
8
1
1
1
2
2
3
1
1
1
1
3
1
1
1
1
1
1
1
1
1
1
1
1
1
r,(r),(i)
(r),m
r8
r16
r32
m8
m16
m32
1
1
1
1
2
3
3
2
2
2
1
2
2
4
15
4
4
19
23
39
19
23
39
1
1
1
1
1
1
1
1
1
1
1
1
r,i/CL
1
m,i/CL
r,1
r8,i/CL
r16/32,i/CL
m,1
m8,i/CL
m16/32,i/CL
r,r,i/CL
m,r,i/CL
r,r/i
m,r/i
r,r/i
m,r/i
r,r
r,m
r
1
1
4
3
1
4
4
2
2
1
4
3
2
3
2
1
1
1
1
1
1
6
1
6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
12
21
37
12
21
37
SETcc
JMP
JMP
JMP
JMP
JMP
conditional jump
CALL
CALL
CALL
CALL
CALL
RETN
RETN
RETF
RETF
J(E)CXZ
LOOP
LOOP(N)E
ENTER
ENTER
LEAVE
BOUND
CLC STC CMC
CLD STD
CLI
STI
INTO
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
BSWAP
NOP (90)
Long NOP (0F 1F)
CPUID
RDTSC
IN
OUT
PREFETCHNTA d)
PREFETCHT0/1/2 d)
SFENCE d)
Notes
m
short/near
far
r
m(near)
m(far)
short/near
near
far
r
m(near)
m(far)
1
1
1
1
2
21
1
1
1
21
1
1
1
1
1
2
4
1
1
2
3
28
1
28
i
23
23
i
short
short
short
i,0
a,b
ca.
r,m
7
2
2
1
2
1
8
8
12
18 +4b
2
6
1
4
1
2
1
1
3
3
1
2
1
1
2
2
2
1
2
1
1
2
2
2
2
2
1
1
1
2
2
1
b-1
1
2b
1
1
1
1
1
2
9
17
5
2
10+6n
1
1
a)
3
a)
2
4
2
ca. 5n
1
ca. 6n
12+7n
12+9n
r
1
1
1
1
0.5
1
23-48
31
18
18
m
m
>300
>300
1
1
1
1
6
a) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
b) Has an implicit LOCK prefix.
c) 3 if constant without base or index register.
d) P3 only.
Floating point x87 instructions (Pentium Pro, II and III)
Instruction
Operands
p0
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FLDZ
FLD1 FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
r
AX
m16
m16
m16
r
m
r
m
r
m
r
m
r
m
m
m
m
m
p1
μops
p01 p2 p3
Latency
p4
Reciprocal
throughput
1
1
2
2
2
38
1
1
2
2
2
165
3
2
1
2
2
3
1
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
6
6
6
6
1
1
23
33
30
1
1
2
2
1
1
1
1
0
5
5
⅓ f)
2
7
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
10
3
3-4
5
5-6
38 h)
38 h)
2
1
1
1
1
1
1
2
1
1
2 g)
2 g)
37
37
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
FNOP
FINCSTP FDECSTP
FFREE
FFREEP
FNCLEX
FNINIT
FNSAVE
FRSTOR
WAIT
Notes:
e)
f)
g)
h)
i)
r
r
56
15
1
17-97
18-110
17-48
36-54
31-53
21-102
25-86
1
1
1
2
27-103
29-130
66
103
98-107
13-143
44-143
69
e)
e)
e)
e)
e)
e)
e)
e,i)
3
13
141
72
2
e) Not pipelined.
f) FXCH generates 1 μop that is resolved by register renaming without going to any port.
g) FMUL uses the same circuitry as integer multiplication. Therefore, the combined throughput of mixed floating point and integer multiplications is 1 FMUL + 1 IMUL per 3 clock cycles.
h) FDIV latency depends on precision specified in control word: 64 bits precision gives latency 38, 53 bits precision gives latency 32, 24 bits precision gives latency 18. Division by a power of 2 takes 9 clocks. Reciprocal throughput is 1/(latency-1).
i) Faster for lower precision.
Integer MMX instructions (Pentium II and Pentium III)
Instruction
Operands
p0
MOVD MOVQ
MOVD MOVQ
MOVD MOVQ
PADD PSUB PCMP
PADD PSUB PCMP
PMUL PMADD
PMUL PMADD
PAND(N) POR PXOR
PAND(N) POR PXOR
PSRA PSRL PSLL
PSRA PSRL PSLL
PACK PUNPCK
PACK PUNPCK
EMMS
MASKMOVQ d)
r,r
mm,m32/64
m32/64,mm
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm/i
mm,m64
mm,mm
mm,m64
p1
μops
p01 p2 p3
1
1
1
1
1
1
1
1
Latency
p4
1
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
11
mm,mm
1
1
1
6 k)
2-8
Reciprocal
throughput
0.5
1
1
0.5
1
1
1
0.5
1
1
1
1
1
2 - 30
PMOVMSKB d)
MOVNTQ d)
PSHUFW d)
PSHUFW d)
PEXTRW d)
PINSRW d)
PINSRW d)
PAVGB PAVGW d)
PAVGB PAVGW d)
PMIN/MAXUB/SW d)
PMIN/MAXUB/SW d)
PMULHUW d)
PMULHUW d)
PSADBW d)
PSADBW d)
Notes:
d)
k)
r32,mm
m64,mm
mm,mm,i
mm,m64,i
r32,mm,i
mm,r32,i
mm,m16,i
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64
mm,mm
mm,m64
1
1
1
1
1
1
1
1
1
1
2
2
1
2
1
2
1
2
3
4
5
6
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1 - 30
1
1
1
1
1
0.5
1
0.5
1
1
1
2
2
d) P3 only.
k) The delay can be hidden by inserting other instructions between EMMS and any subsequent floating point instruction.
Floating point XMM instructions (Pentium III)
Instruction
Operands
p0
MOVAPS
MOVAPS
MOVAPS
MOVUPS
MOVUPS
MOVSS
MOVSS
MOVSS
MOVHPS MOVLPS
MOVHPS MOVLPS
MOVLHPS MOVHLPS
MOVMSKPS
MOVNTPS
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVTPS2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVTSS2SI
ADDPS SUBPS
ADDPS SUBPS
ADDSS SUBSS
ADDSS SUBSS
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32
m32,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
p1
1
μops
p01 p2 p3
2
2
2
4
4
1
1
1
1
1
1
1
Latency
p4
2
4
1
1
1
2
2
2
2
1
2
2
1
1
2
2
1
1
1
2
1
2
1
2
2
1
1
2
3
2
3
1
1
1
1
1
1
1
2
3
4
3
4
4
5
3
4
3
3
3
3
Reciprocal
throughput
1
2
2
4
4
1
1
1
1
1
1
1
2 - 15
1
2
1
1
2
2
1
2
2
2
1
1
MULPS
MULPS
MULSS
MULSS
DIVPS
DIVPS
DIVSS
DIVSS
AND(N)PS ORPS XORPS
AND(N)PS ORPS XORPS
MAXPS MINPS
MAXPS MINPS
MAXSS MINSS
MAXSS MINSS
CMPccPS
CMPccPS
CMPccSS
CMPccSS
COMISS UCOMISS
COMISS UCOMISS
SQRTPS
SQRTPS
SQRTSS
SQRTSS
RSQRTPS
RSQRTPS
RSQRTSS
RSQRTSS
RCPPS
RCPPS
RCPSS
RCPSS
SHUFPS
SHUFPS
UNPCKHPS UNPCKLPS
UNPCKHPS UNPCKLPS
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,m128
m32
m32
m4096
m4096
2
2
1
1
2
2
1
1
2
1
2
1
2
2
2
2
1
1
2
2
1
1
1
1
2
2
1
2
1
1
2
2
2
2
2
2
1
1
2
2
1
1
2
1
2
1
2
1
2
2
2
2
1
2
2
11
6
116
89
2
4
4
4
4
48
48
18
18
2
2
3
3
3
3
3
3
3
3
1
1
56
57
30
31
2
3
1
2
2
3
1
2
2
2
3
3
15
7
62
68
2
2
1
1
34
34
17
17
2
2
2
2
1
1
2
2
1
1
1
1
56
56
28
28
2
2
1
1
2
2
1
1
2
2
2
2
15
9
Pentium M
Intel Pentium M, Core Solo and Core Duo
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two.
p0: Port 0: ALU, etc.
p1: Port 1: ALU, jumps
p01: Instructions that can go to either port 0 or 1, whichever is vacant first.
p2: Port 2: load data, etc.
p3: Port 3: address generation for store
p4: Port 4: store data
Latency: This is the delay that the instruction generates in a dependency chain. (This is not the same as the time spent in the execution unit. Values may be inaccurate in situations where they cannot be measured exactly, especially with memory operands). The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN's and infinity increase the delays by 50-150 clocks, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind.
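The relation between the fused and unfused μop counts can be expressed in a few lines. The Python sketch below is an illustration under the assumption that every fused μop combines exactly two unfused μops, as the column explanation implies. The class name `UopBreakdown` and the example port assignments in the usage note are hypothetical and are not copied from a table row.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class UopBreakdown:
    """μop counts for one instruction, in both domains."""
    fused_domain: int                 # count at decode/rename/retire
    unfused_by_port: Dict[str, int]   # e.g. {"p3": 1, "p4": 1}

    def unfused_total(self) -> int:
        """Total μops sent to execution ports."""
        return sum(self.unfused_by_port.values())

    def fused_pairs(self) -> int:
        """Number of fused μops, assuming each fuses exactly two.

        A fused μop counts as one in the fused domain and as two in the
        unfused domain, so the difference equals the number of pairs.
        """
        return self.unfused_total() - self.fused_domain
```

A store-like instruction with one μop in the fused domain, one μop on port 3 and one on port 4 would be written `UopBreakdown(1, {"p3": 1, "p4": 1})`. Its unfused total is 2 and it contains one fused pair. A plain register instruction `UopBreakdown(1, {"p01": 1})` contains none.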
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG
Operands
r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r32
r,r
r,m
r,r
r,m
r,r
μops
fused
domain
1
1
1
2
1
2
8
8
2
1
1
2
2
3
μops unfused domain
p0
p1 p01 p2
p3
Latency Reciprocal
p4
throughput
1
0.5
1
1
1
1
1
1
8
7
1
1
1
1
1
1
5
8
1
1
1
1
2
2
0.5
1
1.5
2
1.5
1
1
1
1
1
1
3
1
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POP
POPF(D)
POPA(D)
LAHF SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
SFENCE/LFENCE/MFENCE
IN
OUT
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV IDIV
r,m
r
i
m
sr
r
(E)SP
m
sr
r,m
r
m
m
m
7
2
1
2
2
2
16
18
1
3
2
10
17
10
1
2
1
2
11
1
1
2
4
1
1
1
1
1
11
2
3
2
9
6
2
1
10
1
1
1
1
1
1
1
1
8
1
1
high b)
1
1
1
1
1
8
1
1
1
1
1
8
1
1
2
1
1
1
1
8
6
8
1
1
2
1
7
1
1
1
1
8
3
1
1
1
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r
m,i
r
m
r8
r16/r32
r,r
r,r,i
m8
m16/m32
r,m
r,m,i
r8
r16
r32
m8
m16
1
1
3
2
2
7
1
1
2
1
3
1
3
4
1
3
1
1
1
3
1
2
5
4
4
6
5
1
1
6
1
18
18
16
7
1
1
1
>300
>300
1
1
1
1
1
1
1
4
1
1
1
1
1
1
1
1
2
1
1
2
1
1
1
1
1
1
1
1
1
1
0.5
1
1
2
1
0.5
1
1
0.5
2
15
4
5
4
4
4
5
4
4
15-16 c)
15-24 c)
15-39 c)
15-16 c)
15-24 c)
1
1
1
1
1
1
1
1
12
12-20 c)
12-20 c)
12
12-20 c)
1
1
1
1
1
3
1
1
1
3
1
1
4
3
3
4
3
1
2
2
1
1
1
1
1
1
1
1
1
1
1
DIV IDIV
CBW CWDE
CWD CDQ
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
TEST
SHR SHL SAR ROR ROL
SHR SHL SAR ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR
RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD STD
m32
r,r/i
r,m
m,r/i
r,r/i
m,r
m,i
r,i/CL
m,i/CL
r,1
r8,i/CL
r8,i/CL
r16/32,i/CL
m,1
m8,i/CL
m8,i/CL
m16/32,i/CL
r,r,i/CL
m,r,i/CL
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
Control transfer instructions
JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
conditional jump
short/near
J(E)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
5
1
1
1
1
3
1
1
2
1
3
2
9
8
6
7
12
11
10
2
4
1
8
2
1
10
3
2
2
1
2
1
4
1
22
1
2
25
1
2
11
11
4
32
4
4
35
2
3
1
1
1
1
1
1
1
1
1
1
1
1
1
5
4
3
2
6
5
5
2
1
1
1
1
1
2
0.5
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
4
4
3
2
3
3
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
7
1
1
7
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
23
1
1
1
1
1
1
8
8
1
27
1
1
1
2
1
2
29
2
2
2
1
1
9
6
1
7
1
21
2
11
10
9
1
4
12-20 c)
1
1
1
1
2
2
15-39 c)
1
1
1
2
1
1
2
1
1
2
1
2
1
1
2
1
28
1
2
31
1
1
6
6
2
27
9
2
30
2
RETN
RETF
RETF
BOUND
INTO
i
i
r,m
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
CLI
STI
ENTER
ENTER
LEAVE
CPUID
RDTSC
Notes:
a)
b)
c)
3
27
27
15
5
1
24
24
7
2
6n
3
5n
6
6n
3
7n
6
9n
1
6
5
1
3
3
2
2
30
30
8
4
2
4
0.5
1
0.7
0.7
0.5
1.3
0.6
0.7
0.5
10+6n
1
ca. 5n a)
1
3
ca. 6n a)
1
2
12+7n
4
2
12+9n
1
1
2
1
1
1
1
1
1
2
0.5
1
9
17
i,0
a,b
12
ca.
3
38-59
13
10
18 +4b
2
1
1
b-1 2b
1
38-59
13
ca. 130
42
a) Faster under certain conditions: see manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
b) Has an implicit LOCK prefix.
c) High values are typical, low values are for round divisors. Core Solo/Duo is more efficient than Pentium M in cases with round values that allow an early-out algorithm.
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
μops
fused
domain
1
1
4
40
1
1
6
169
1
μops unfused domain
p0
p1 p01 p2
p3
Latency Reciprocal
p4
throughput
1
2
38
1
2
165
1
1
1
2
2
1
2
2
1
2
2
1
0
3
167
0.33 f)
FILD
FIST(P)
FISTTP g)
FLDZ
FLD1 FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE
FFREEP
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT
m
m
m
r
AX
m16
m16
m16
r
r
r
m
r
m
r
m
r
m
r
m
m
m
m
4
4
4
1
2
2
3
2
3
3
1
1
2
142
72
1
1
1
1
1
1
1
1
1
1
2
1
6
6
6
6
1
1
26
15
28
15
1
80-100
90-110
~ 20
~ 40
~ 55
~ 100
~ 85
1
2
3
2
2
1
2
2
3
1
1
1
1
1
2
142
72
1
1
1
1
1
1
1
1
2
7
3
1
19
3
1
2
131
91
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
3
3
5
5
9-38 c)
9-38 c)
1
1
1
1
1
1
3
5
9-38 c)
26
15
1
1
2
2
8-37 c)
8-37 c)
1
1
1
1
1
1
3
3
8-37 c)
4
1
1
37
19
28
15
1
80-100
90-110
~20
~40
~55
~100
~85
1
43
9
9 h)
8
80-110
100-130
~45
~60
~65
~140
~140
1
2
2
2
1
1
1
3
5
5
3
1
1
5
5
5
1
1
1
FNCLEX
FNINIT
Notes:
c)
f)
g)
3
14
3
14
13
27
c) High values are typical, low values are for low precision or round divisors.
f) FXCH generates 1 μop that is resolved by register renaming without going to any port.
g) SSE3 instruction only available on Core Solo and Core Duo.
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKHQDQ
PUNPCKHQDQ
Operands
μops fused domain
μops unfused domain: p0 p1 p01 p2
r32,mm
mm,r32
mm,m32
m32,mm
r32,xmm
xmm,r32
xmm,m32
m32, xmm
mm,mm
mm,m64
m64,mm
xmm,xmm
xmm,m64
m64, xmm
xmm, xmm
xmm, m128
m128, xmm
xmm, m128
m128, xmm
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
1
1
1
1
1
2
2
1
1
1
1
2
2
1
2
2
2
4
8
4
1
2
1
4
mm,mm
1
1
mm,m64
1
1
xmm,xmm
3
2
1
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm, m128
4
1
1
2
3
2
3
1
1
1
2
1
1
p3 p4
Latency
Reciprocal throughput
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
2
1
1
1
1
1
2
1
2
2
2
5-6
1
1
2-3 2-3
1
1
1
1
2
2
2
1
2
1
2
1
1
1
1
2
2
2
1
2
1
1
2
2
1
1
1
2
2
1
1
1
1
2
0.5
0.5
1
1
1
1
1
1
0.5
1
1
1
1
1
1
2
2
2-10
4-20
2
1
1
2
3
PUNPCKLQDQ
PUNPCKLQDQ
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
PINSRW
Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PADDQ PSUBQ
PADDQ PSUBQ
PADDQ PSUBQ
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULUDQ
PMULUDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDWD
PMADDWD
PAVGB/W
PAVGB/W
PAVGB/W
PAVGB/W
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PSADBW
PSADBW
PSADBW
xmm,xmm
xmm, m128
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm, m128,i
mm,mm
xmm,xmm
r32,mm
r32,xmm
r32,mm,i
r32,xmm,i
mm,r32,i
xmm,r32,i
1
1
1
2
3
4
2
3
3
8
1
1
2
4
1
2
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm
mm,m64
xmm,xmm
1
1
2
4
2
2
4
6
1
1
2
2
1
1
2
4
1
1
2
4
1
1
2
4
1
1
2
4
1
1
2
4
2
2
4
1
1
1
1
1
2
1
1
1
1
1
1
1
1
2
2
2
1
1
2
1
1
1
1
2
1
2
j)
1
2
1
1
2
2
2
2
4
4
1
1
2
2
1
1
2
2
1
1
2
2
1
1
2
3
1
1
1
1
1
2
1
2
1
0.5
1
1
2
1
1
2
2
0.5
1
1
2
1
1
2
2
1
1
2
2
1
1
2
2
0.5
1
1
2
0.5
1
1
2
1
1
2
1
2
2
1
2
2
1
1
1
2
2
1
2
1
1
2
2
1
1
2
2
1
1
2
2
2
2
4
1
2
1
1
1
2
3
3
3
3
4
4
4
4
3
3
3
3
1
1
1
2
1
1
1
2
1
1
1
1
1
2
2
1
1
4
4
4
PSADBW
xmm,m128
6
4
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PAND(N) POR PXOR
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
mm,mm
mm,m64
xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i
1
1
2
4
1
1
2
3
3
4
1
1
2
2
Other
EMMS
Notes:
g)
j)
k)
1
1
2
2
2
4
2
1
3
0.5
1
1
2
1
1
2
2
2
3
6 k)
6
1
1
2
1
1
2
2
1
1
1
3
11
2
11
g) SSE3 instruction only available on Core Solo and Core Duo.
j) Also uses some execution units under port 1.
k) You may hide the delay by inserting other instructions between EMMS and any subsequent floating point instruction.
Floating point XMM instructions
Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS/D
SHUFPS/D
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD
Operands
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
μops fused domain
2
2
2
4
8
1
2
1
1
1
1
1
2
3
4
2
2
4
4
4
2
3
μops unfused domain: p0 p1 p01 p2 p3 p4
Latency
Reciprocal throughput
2
2
2
2
2
2
1
1
1
1
2
2
4
4
1
1
1
1
1
1
1
j)
2
1
1
1
1
2
3
2
3
1
1
1
1
1
1
2
2
2
1
2
2
2
2
1
1
3-4
2
1
1
1
1
1
2
2
2
4
1
1
1
1
1
1
1
3
2
2
1
2
5
5
1
1
Conversion
CVTPS2PD
CVTPS2PD
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
MULSS
MULSD
MULSS
MULSD
MULPS
MULPD
MULPS
MULPD
DIVSS
DIVSD
DIVSS
DIVSD
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64
4
4
4
6
2
3
2
3
2
4
2
4
4
5
4
6
1
2
1
2
4
5
3
5
2
2
3
2
3
2
3
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,m32
xmm,m64
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,xmm
xmm,m32
xmm,m64
1
2
2
4
2
6?
3
1
1
2
2
2
2
4
4
1
1
2
2
2
1
3
3
2
2
1
1
3
1
4
2
2
2
2
2
2
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
2
3
2
4
1
4
2
3
3
1
5
1
1
1
4
2
4
4
1
4
1
1
1
1
1
2
2
2
?
3
1
1
1
1
2
2
2
2
1
1
1
1
3
1
3
3
1
1
1
2
2
2
2
4
4
4
4
2
2
4
4
1
1
2
1
1
2
2
1
1
3
3
3
3
3
7
4
4
5
4
5
4
5
4
5
9-18 c)
9-32 c)
9-18 c)
9-32 c)
3
3
3
3
2
2
2
2
2
2
2
2
2
2
3
3
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
1
2
2
2
4
2
1
2
1
2
2
4
2
4
8-17 c)
8-31 c)
8-17 c)
8-31 c)
DIVPS
DIVPD
DIVPS
DIVPD
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
RCPSS
RCPSS
RCPPS
RCPPS
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
2
2
4
4
1
2
2
4
1
2
1
2
2
4
1
2
2
4
2
2
2
2
Math
SQRTSS
SQRTSS
SQRTSD
SQRTSD
SQRTPS
SQRTPD
SQRTPS
SQRTPD
RSQRTSS
RSQRTSS
RSQRTPS
RSQRTPS
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,xmm
xmm,m128
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
2
3
1
2
2
2
4
4
1
2
2
4
2
2
1
1
2
2
2
2
Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D
xmm,xmm
xmm,m128
2
4
m32
m32
m4096
m4096
9
6
118
87
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
c)
g)
j)
2
2
1
1
2
2
1
1
1
3
2
1
1
1
2
2
1
1
2
2
3
3
3
3
3
1
2
1
3
2
6-30
1
5-58
1
8-56
16-114
2
2
1
1
3
2
3
1
3
2
2
2
9
6
32
43
16-34 c)
16-62 c)
16-34 c)
16-62 c)
3
1
4-28
4-28
4-57
4-57
16-55
16-114
16-55
16-114
1
1
2
2
2
1
1
44
20
12
63
72
43
43
c) High values are typical, low values are for round divisors.
g) SSE3 instruction only available on Core Solo and Core Duo.
j) Also uses some execution units under port 1.
16-34 c)
16-62 c)
16-34 c)
16-62 c)
1
1
2
2
1
1
1
1
2
2
1
1
2
2
Merom
Intel Core 2 (Merom, 65nm)
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).
Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. flt→int means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
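The fusion rule stated under "μops unfused domain" can be checked mechanically for any table row. The following Python sketch is illustrative only: the UopRow type, its field names and the two sample rows are assumptions made for this example and are not taken from the tables.

```python
from dataclasses import dataclass


@dataclass
class UopRow:
    """One table row. Field names are hypothetical, chosen for this sketch."""
    fused: int    # value under "μops fused domain"
    p015: int     # total μops going to ports 0, 1 and 5
    p2: int = 0   # memory read μops
    p3: int = 0   # memory write address μops
    p4: int = 0   # memory write data μops


def has_uop_fusion(row: UopRow) -> bool:
    # Rule from the column explanation: the instruction has μop fusion
    # if p015 + p2 + p3 + p4 exceeds the fused-domain count.
    return row.p015 + row.p2 + row.p3 + row.p4 > row.fused


# Made-up rows for illustration, not measured values.
register_form = UopRow(fused=1, p015=1)
load_and_operate_form = UopRow(fused=1, p015=1, p2=1)

print(has_uop_fusion(register_form))          # False
print(has_uop_fusion(load_and_operate_form))  # True
```

A blank cell in the tables is treated as zero here, which is why the port fields default to 0.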
Integer instructions
Instruction
Move instructions
MOV
Operands
r,r/i
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
1
1
x
x
x
int
Latency
Reciprocal throughput
1
0.33
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
CLFLUSH
IN
OUT
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
1
1
1
1
2
8
8
2
r,r
r,m
r,r
r,m
r,r
r,m
m8
1
1
2
2
3
7
2
1
1
2
2
17
18
1
4
2
10
24
10
1
2
1
2
11
1
1
2
2
2
4
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
1
1
2
2
2
4
1
1
1
3
r
i
m
sr
r
(E/R)SP
m
sr
r,m
r
m
m
m
1
4
3
1
x
x
x
x
x
x
x
1
1
4
5
1
1
1
1
1
1
1
1
x
1
2
2
3
x
1
x
x
x
x
x
x
x
x
x
1
1
1
1
1
15
9
x
x
x
3
9
23
2
1
2
1
2
11
x
x
x
x
x
1
1
x
x
x
x
1
1
1
1
1
8
1
1
1
1
1
1
1
1
1
1
1
1
1
1
8
1
1
1
1
1
2
x
x
x
1
1
1
2
2
3
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
2
3
3
1
1
1
1
1
16
16
2
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
0.33
1
1
int
int
int
int
int
int
int
int
int
int
2
2
high b)
4
3
2
2
1
1
1
1
1
7
8
1
1.5
17
20
1
4
1
4
240
1
6
2
2
7
1
1
1
6
7
0.33
1
1
1
17
1
1
8
9
9
117
0.33
1
1
2
2
0.33
1
0.33
1
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR
RCL
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r64
m8
m16
m32
m64
m64
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r16/32/64,i/cl
m,1
m8,i/cl
m8,i/cl
1
3
4
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
3
5
4
32
56
4
6
5
32
56
1
1
1
3
4
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
3
5
4
32
56
3
5
4
31
55
1
1
1
1
2
1
1
1
3
1
3
2
9
8
6
4
12
11
1
1
1
1
1
1
2
1
2
2
9
8
6
3
9
8
x
x
x
x
1
x
1
x
x
x
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
x
x
x
x
1
1
1
1
x
x
2
1
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
1
17
3
5
5
7
3
3
5
3
3
5
3
5
5
7
3
3
5
18
18-26
18-42
29-61
39-72
18
18-26
18-42
29-61
39-72
1
1
1
6
1
1
6
1
6
2
12
11
11
7
14
13
1
1.5
1.5
4
1
1
2
1
1
2
1
1.5
1.5
4
1
1
2
2
1
2
12
12-20 c)
12-36 c)
18-37 c)
28-40 c)
12
12-20 c)
12-36 c)
18-37 c)
28-40 c)
0.33
1
1
0.33
1
0.5
1
1
1
2
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD
m16/32/64,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
Control transfer instructions
JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e,i)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND i)
r,m
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
REP(N)E CMPS
10
2
3
1
10
2
1
11
3
2
2
1
2
1
7
6
7
2
2
1
9
1
1
8
1
2
2
1
1
1
7
6
1
30
1
1
31
1
1
2
11
11
3
43
3
4
44
1
3
32
32
15
5
1
30
1
1
29
1
1
2
11
11
2
43
2
3
42
1
x
30
30
13
5
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
3
2
4+7n - 14+6n
4
2
8+5n - 20+1.2n
8
5
1
1
1
7+7n - 13+n
4
3
7+8n - 17+7n
7
5
7+10n - 7+9n
x
x
x
x
1
2
1
1
1
x
x
x
1
1
1
2
1
1
2
2
2
1
1
1
1
1
1
1
1
5
1
2
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
13
2
7
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
0
int
int
int
int
int
int
int
int
int
int
int
1
5
6
2
1
1
0
0
0
0
1
1
5
1
1
2
1
1
0.33
4
14
1-2
76
1-2
1-2
68
1
1
1-2
5
5
2
75
2
2
75
2
2
78
78
8
3
1
1+5n - 21+3n
1
7+2n - 0.55n
1+3n - 0.63n
1
3+8n - 23+6n
3
2+7n - 22+5n
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)
i,0
a,b
1
1
3
12
1
1
3
10
3
46-100
29
23
2
x
x
x
x
x
x
x
x
x
1
1
1
int
int
int
int
int
int
int
int
int
0.33
1
8
8
180-215
64
54
a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST
FISTP
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
m
r
AX
m16
m16
m16
r
m
m
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
1
1
4
40
1
1
7
170
1
1
2
3
3
1
2
2
2
1
2
2
3
1
2
142
78
1
1
2
38
1
1
3
166
0 f)
1
1
1
1
1
2
2
2
1
1
1
1
1
2
x
x
1
2
2
2
1
x
x
1
x
x
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
2
2
1
1
1
1
1
1
1
1
2
Arithmetic instructions
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
Latency
Reciprocal throughput
1
3
4
45
1
3
4
164
0
6
6
6
6
2
1
184
169
1
1
3
20
1
1
5
166
1
1
1
1
1
1
2
2
2
1
2
10
8
1
2
192
177
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM FPREM1
FRNDINT
r
m
r
m
r
m
r
m
r
m
m
m
m
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2
2
2
2
2
2
2
2
1
1
1
1
21-27 21-27
7-15 7-15
27
82
1
~96
~100
~19
~53
~98
~70
27
82
1
~96
~100
~19
~53
~98
~70
1
2
4
15
1
2
4
15
1
1
1
1
1
1
1
1
1
1
1
2
2
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
float
float
float
float
float
float
float
float
float
1
3
1
1
5
2
2
6-38 d) 5-37 d)
5-37 d)
1
1
1
1
1
1
1
2
2
5-37 d)
2
1
1
16-56
22-29
41
170
6-69
~96
~115
~45
~96
~136
~119
float
float
float
float
1
1
15
63
d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.
Integer MMX and XMM instructions
Instruction
Operands
Move instructions
MOVD k)
MOVD k)
MOVD k)
MOVD k)
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
1
1
1
1
1
x
x
x
int
1
1
x
x
1
1
int
int
Latency
Reciprocal throughput
2
3
2
2
0.33
1
0.5
1
MOVQ
(x)mm, (x)mm
MOVQ
(x)mm,m64
MOVQ
m64, (x)mm
MOVDQA
xmm, xmm
MOVDQA
xmm, m128
MOVDQA
m128, xmm
MOVDQU
m128, xmm
MOVDQU
xmm, m128
LDDQU g)
xmm, m128
MOVDQ2Q
mm, xmm
MOVQ2DQ
xmm,mm
MOVNTQ
m64,mm
MOVNTDQ
m128,xmm
mm,mm
PACKSSWB/DW
PACKUSWB
mm,m64
xmm,xmm
PACKSSWB/DW
PACKUSWB
xmm,m128
mm,mm
PUNPCKH/LBW/WD/DQ
mm,m64
PUNPCKH/LBW/WD/DQ
xmm,xmm
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ xmm,m128
PUNPCKH/LQDQ
xmm,xmm
PUNPCKH/LQDQ
xmm, m128
PSHUFB h)
mm,mm
PSHUFB h)
mm,m64
PSHUFB h)
xmm,xmm
PSHUFB h)
xmm,m128
PSHUFW
mm,mm,i
PSHUFW
mm,m64,i
PSHUFD
xmm,xmm,i
PSHUFD
xmm,m128,i
PSHUFL/HW
xmm,xmm,i
PSHUFL/HW
xmm, m128,i
PALIGNR h)
mm,mm,i
PALIGNR h)
mm,m64,i
PALIGNR h)
xmm,xmm,i
PALIGNR h)
xmm,m128,i
MASKMOVQ
mm,mm
MASKMOVDQU
xmm,xmm
PMOVMSKB
r32,(x)mm
PEXTRW
r32,mm,i
PEXTRW
r32,xmm,i
PINSRW
mm,r32,i
PINSRW
mm,m16,i
PINSRW
xmm,r32,i
PINSRW
xmm,m16,i
1
1
1
1
1
1
9
4
4
1
1
1
1
1
1
3
4
1
1
3
4
1
2
1
2
4
5
1
2
2
3
1
2
2
2
2
2
4
10
1
2
3
1
2
3
4
1
Arithmetic instructions
PADD/SUB(U)(S)B/W/D (x)mm, (x)mm
PADD/SUB(U)(S)B/W/D
(x)mm,m
PADDQ PSUBQ
(x)mm, (x)mm
PADDQ PSUBQ
(x)mm,m
1
1
2
2
x
x
x
int
int
1
1
1
x
x
1
x
int
int
1
4
2
2
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
1
2
2
1
2
int
int
int
int
1
1
1
1
3
3
1
1
3
3
1
1
1
1
4
4
1
1
2
2
1
1
2
2
2
2
1
1
1
2
3
1
1
3
3
1
1
1
2
2
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
x
x
x
x
1
1
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
2
1
2
3
1
2
3
3-8
2-8
2-8
1
1
1
1
int
int
flt→int
int
int
int
flt→int
int
int
int
int
int
int
int
int
int
flt→int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
int
int
int
int
1
3
1
3
1
1
3
1
3
1
2
2
2
3
5
2
6
2
0.33
1
1
0.33
1
1
4
2
2
0.33
0.33
2
2
1
1
2
2
1
1
2
2
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
2-5
6-10
1
1
1
1
1
1.5
1.5
0.5
1
1
1
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW h)
PMULHRSW h)
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW h)
PMADDUBSW h)
PAVGB/W
PAVGB/W
PMIN/MAXUB/SW
PMIN/MAXUB/SW
PABSB PABSW PABSD
h)
PSIGNB PSIGNW
PSIGND h)
PSADBW
PSADBW
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
Other
EMMS
Notes:
g)
h)
k)
mm,mm
5
5
int
mm,m64
6
5
xmm,xmm
7
7
xmm,m128
mm,mm
mm,m64
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
8
3
4
5
6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
7
3
3
5
5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
(x)mm,(x)mm
(x)mm,m
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i
1
1
1
1
1
2
3
2
1
1
1
1
1
2
2
2
x
x
1
1
1
x
x
x
x
x
11
11
x
x
1
int
int
1
1
1
x
x
x
x
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
x
x
1
1
x
x
x
1
x
5
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
float
4
4
6
3
5
1
3
3
3
3
3
1
1
1
1
3
1
1
1
2
2
4
4
2
2
3
3
0.5
1
1
1
1
1
1
1
1
1
1
1
0.5
1
0.5
1
0.5
1
0.5
1
1
1
0.33
1
1
1
1
1
1
1
6
g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
Floating point XMM instructions
Instruction
Operands
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPS
SHUFPD
SHUFPD
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
1
1
1
4
9
1
1
1
2
2
1
1
1
1
3
4
1
2
1
2
1
2
3
4
1
2
1
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
2
2
2
2
2
2
2
2
1
1
1
1
2
3
2
2
2
2
2
2
2
2
2
2
1
1
1
1
2
2
2
2
Conversion
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
x
x
x
int
int
1
1
2
4
1
1
x
x
x
x
1
x
x
2
1
1
int
2
2
int
int
1
1
1
1
1
1
1
1
1
1
int
1
1
3
3
1
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
2
2
1
2
1
1
1
1
float
float
1
3
3
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
Latency
Reciprocal throughput
1
2
3
2-4
3-4
1
2
3
3
5
3
1
1
1
flt→int
flt→int
float
float
int
int
int
int
flt→int
int
float
float
3
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
4
1
1
1
3
1
4
2
2
3
3
4
4
0.33
1
1
2
4
0.33
1
1
1
1
1
1
1
2-3
2
2
1
1
1
1
1
1
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS
MULSS
MULSD
MULSD
MULPS
MULPS
MULPD
MULPD
DIVSS
DIVSS
DIVSD
DIVSD
DIVPS
DIVPS
DIVPD
DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64
1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32/64
xmm,xmm
1
1
1
1
1
1
6
7
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
6
6
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
3
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
3
3
4
4
4
3
4
3
3
3
9
5
4
5
4
5
6-18 d)
6-32 d)
6-18 d)
6-32 d)
3
3
3
3
3
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
1
1
1
1
1
1
3
3
2
2
1
1
1
1
1
1
1
1
5-17 d)
5-17 d)
5-31 d)
5-31 d)
5-17 d)
5-17 d)
5-31 d)
5-31 d)
2
2
1
1
1
1
1
1
1
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
xmm,m32/64
xmm,xmm
xmm,m128
1
1
1
1
1
1
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
Logic
AND/ANDN/OR/XORPS/D xmm,xmm
AND/ANDN/OR/XORPS/D xmm,m128
1
1
1
1
x
x
14
6
141
119
13
4
Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
d)
g)
m32
m32
m4096
m4096
1
1
1
1
float
float
float
1
6-29
1
float
float
float
float
float
float
int
int
1
1
1
1
1
1
x
x
x
x
3
6-58
3
1
1
1
145
164
d) Round divisors give low values.
g) SSE3 instruction set.
1
1
1
6-29
6-29
6-58
6-58
2
2
0.33
1
42
19
145
164
Wolfdale
Intel Core 2 (Wolfdale, 45nm)
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).
Unit: Tells which execution unit cluster is used. An additional delay of 1 clock cycle is generated if a register written by a μop in the integer unit (int) is read by a μop in the floating point unit (float) or vice versa. flt→int means that an instruction with multiple μops receives the input in the float unit and delivers the output in the int unit. Delays for moving data between different units are included under latency when they are unavoidable. For example, movd eax,xmm0 has an extra 1 clock delay for moving from the XMM-integer unit to the general purpose integer unit. This is included under latency because it occurs regardless of which instruction comes next. Nothing listed under unit means that additional delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
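As a rough illustration of how the latency, reciprocal throughput and unit columns are meant to be read together, the Python sketch below estimates clock counts for a dependency chain and for a series of independent instructions. It is a simplified model under stated assumptions, not a description of the measurement method: the function names and sample numbers are hypothetical, and decoding limits, port contention and memory effects are ignored. The conversion from time stamp counter ticks assumes that both clock frequencies stay constant during the measurement.

```python
def chain_cycles(n, latency, unit_crossings=0, crossing_delay=1):
    """Minimum core clock cycles for n instructions where each one
    depends on the previous result. Every int<->float unit crossing
    in the chain adds the extra delay described under 'Unit'."""
    return n * latency + unit_crossings * crossing_delay


def independent_cycles(n, reciprocal_throughput):
    """Approximate core clock cycles for n independent instructions
    of the same kind in the same thread."""
    return n * reciprocal_throughput


def tsc_to_core_cycles(tsc_ticks, core_hz, tsc_hz):
    """Convert reference clock ticks (time stamp counter) to core
    clock cycles, assuming constant frequencies."""
    return tsc_ticks * core_hz / tsc_hz


# Hypothetical instruction: latency 3, reciprocal throughput 1.
print(chain_cycles(100, 3))                    # 300
print(chain_cycles(100, 3, unit_crossings=2))  # 302
print(independent_cycles(100, 1))              # 100
print(tsc_to_core_cycles(1000, 2.0e9, 2.5e9))  # 800.0
```

The gap between the first and third figures shows why breaking a long dependency chain into independent operations can matter more than the choice of instruction.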
Integer instructions
Instruction
Operands
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
Latency
Reciprocal throughput
Move instructions
MOV
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSX MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
CLFLUSH
IN
OUT
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
1
1
1
1
1
2
8
8
2
1
r,r
r16/32,m
r64,m
r,r
r,m
r,r
r,m
1
x
x
x
1
2
2
3
x
1
x
x
x
x
x
x
x
x
x
x
x
x
m8
1
1
2
2
2
3
7
2
1
1
2
2
17
18
1
4
2
10
24
10
1
2
1
2
11
1
1
2
2
2
4
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
1
1
2
2
2
4
1
1
r
i
m
sr
r
(E/R)SP
m
sr
r,m
r
m
m
m
x
x
x
1
4
3
x
x
x
x
x
1
1
4
5
x
9
23
2
1
2
1
2
11
x
x
x
1
1
x
x
x
x
1
1
1
1
0.33
1
1
1
2
x
3
1
1
1
1
x
1
1
0.33
1
1
1
1
1
16
16
2
1
1
1
1
15
9
1
1
1
2
3
3
1
1
1
1
1
8
1
1
1
1
1
1
1
1
1
1
1
1
1
8
2
high b)
4
3
2
1
1
1
20
x
x
1
4
1
4
1
1
1
1
2
2
3
1
1
x
x
x
x
x
x
x
x
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
120
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
7
8
1
1.5
17
1
1
1
2
2
6
2
2
7
1
1
7
0.33
1
1
1
17
1
1
8
6
9
90
0.33
1
1
2
2
0.33
1
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
DIV IDIV
DIV IDIV
DIV IDIV
DIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
r
m
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r64
m8
m16
m32
m64
m64
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r,i/cl
m,1
1
3
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
4
7
7
32-38
56-62
4
7
7
32
56
1
1
1
1
2
1
1
1
3
1
3
2
9
8
6
4
1
1
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
4
7
7
x
x
x
x
x
x
x
x
x
1
x
x
1
x
x
x
1
1
x
x
1
17
3
5
5
7
3
3
5
3
3
5
3
5
5
7
3
3
5
x
x
x
1
1
1
1
x
x
2
1
x
x
x
x
1
1
1
1
1
3
7
6
31
55
1
1
1
1
1
1
1
1
2
1
2
2
9
8
6
3
x
x
x
x
x
x
x
x
x
x
x
x
x
x
56-62
1
x
x
1
1 2 1
x x x
2 3 2
9 10 13
x x x
1 2
2 3 2
x x x
x x x
x x x
x x x
x
x
32-38
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
6
1
1
1
1
1
1
1
1
1
1
0.33
1
1
1
1
1.5
1.5
4
1
1
2
1
1
2
1
1.5
1.5
4
1
1
2
2
1
2
9-18 c)
14-22 c)
14-23 c)
18-57 c)
34-88 c)
9-18
14-22 c)
14-23 c)
34-88 c)
39-72 c)
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
6
1
1
1
6
1
6
2
12
11
11
7
0.33
1
1
0.33
1
0.5
1
1
1
2
RCR
RCL
RCR RCL
SHLD SHRD
SHLD SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD
m8,i/cl
m8,i/cl
m,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
Control transfer instructions
JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e,i)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND i)
r,m
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP(N)E SCAS
CMPS
12
11
10
2
3
1
9
3
1
10
3
2
2
1
2
1
6
6
9
8
7
2
2
1
8
2
1
7
1
2
2
1
1
1
6
6
1
30
1
1
31
1
1
2
11
11
3
43
3
4
44
1
3
32
32
15
5
1
30
1
1
29
1
1
2
11
11
2
43
2
3
42
1
1
30
30
13
5
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
3
2
4+7n-14+6n
4
2
8+5n-20+1.2n
8
5
1
1
1
7+7n-13+n
4
3
7+8n-17+7n
7
5
1
5
6
2
1
1
1
1
1
0
0
0
1
2
1
1
1
x
x
x
1
1
1
1
4
1
1
1
1
1
1
14
13
13
2
7
1
0
0
1
2
1
1
2
2
2
1
1
1
1
1
1
1
1
1
1
0.33
3
14
1-2
76
1-2
1-2
68
1
1
1-2
5
5
2
75
2
2
75
2
2
78
78
8
3
1
1
1+5n-21+3n
1
1
1
7+2n-0.55n
5
1+3n-0.63n
1
1
3+8n-23+6n
2
3
REP(N)E CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)
7+10n-7+9n
i,0
a,b
1
1
3
12
1
1
3
10
3
53-117
13
23
2
2+7n-22+5n
x
x
x
x
x
x
x
x
x
1
0.33
1
8
8
1
1
53-211
32
54
a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results. The reciprocal throughput is only slightly less than the latency.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST
FISTP
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
m
r
AX
m16
m16
m16
r
m
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
1
1
4
40
1
1
7
171
1
1
2
3
3
1
2
2
2
1
2
2
3
1
2
141
1
1
2
38
1
1
2
x
1
x
x
3
167
0 f)
1
1
1
1
1
2
2
2
1
1
1
1
1
2
95
x
x
x
x
x
x
1
1
1
1
1
1
1
2
2
1
2
2
1
2
2
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
7 23 23
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
Latency
Reciprocal throughput
1
3
4
45
1
3
4
164
0
6
6
6
6
2
1
1
1
3
20
1
1
5
166
1
1
1
1
1
1
2
2
2
1
2
10
8
1
2
142
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
m
78
51
r
m
r
m
r
m
1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
26-29
28-35
17-19
1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
r
m
r
m
m
m
m
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)
1
2
4
15
1
2
4
15
x
x 27
1
1
1
1
1
1
1
1
1
1
1
float
1
1
5
2
2
6-21 d) 5-20 d)
6-21 d) 5-20 d)
1
1
1
1
1
1
43
~170
6-20
32-85
70-100
38-107
45
50-100
40-130
55-130
x
x
x
x
x
1
x
x
x
x
x
x
x
x
x
x
float
float
float
float
float
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
float
float
float
float
float
x
x
x
x
x
x
float
float
float
float
1
1
1
1
x
x
1
1
1
1
1
177
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
1
1
1
1
2
1
1
2
1
1
x
x
x
x
x
x
28
28
53-84
1
1
18-85
76-100
18105
19
19
57-65
19-100
23-87
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
x
3
3
5
6-21
1
2
2
5-20 d)
2
1
1
13-40
18-41
10-22
1
1
15
63
d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.
Integer MMX and XMM instructions
Instruction
Operands
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
Latency
Reciprocal throughput
Move instructions
MOVD k)
MOVD k)
MOVD k)
MOVD k)
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA j)
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW j)
PACKUSDW j)
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW j)
PMOVSX/ZXBW j)
PMOVSX/ZXBD j)
PMOVSX/ZXBD j)
PMOVSX/ZXBQ j)
PMOVSX/ZXBQ j)
PMOVSX/ZXWD j)
PMOVSX/ZXWD j)
PMOVSX/ZXWQ j)
PMOVSX/ZXWQ j)
PMOVSX/ZXDQ j)
PMOVSX/ZXDQ j)
PSHUFB h)
PSHUFB h)
PSHUFB h)
PSHUFB h)
r,(x)mm
m,(x)mm
(x)mm,r
(x)mm,m
v,v
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
xmm, m128
1
1
1
1
1
1
1
1
1
1
9
4
4
1
1
1
1
1
1
x
x
x
1
x
mm,mm
1
1
1
mm,m64
1
1
1
xmm,xmm
1
1
1
xmm,m128
xmm,xmm
xmm,m
mm,mm
mm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm, m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m16
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
mm,mm
mm,m64
xmm,xmm
xmm,m128
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
1
1
x
int
int
int
int
1
1
x
x
x
1
1
1
x
x
1
x
int
int
1
4
2
2
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
Reciprocal
throughput
1
2
2
1
2
1
2
int
int
int
int
1
1
int
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
int
1
2
0.33
1
0.5
1
0.33
1
1
0.33
1
1
4
2
2
0.33
0.33
2
2
1
1
1
1
1
2
3
2
2
1
2
3
1
2
3
3-8
2-8
2-8
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR h)
PALIGNR h)
PALIGNR h)
PALIGNR h)
PBLENDVB j)
PBLENDVB j)
PBLENDW j)
PBLENDW j)
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB j)
PEXTRB j)
PEXTRW
PEXTRW j)
PEXTRD j)
PEXTRD j)
PEXTRQ j,m)
PEXTRQ j,m)
PINSRB j)
PINSRB j)
PINSRW
PINSRW
PINSRD j)
PINSRD j)
PINSRQ j,m)
PINSRQ j,m)
Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W h)
PHADD(S)W
PHSUB(S)W h)
PHADDD PHSUBD h)
PHADDD PHSUBD h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ j)
PCMPEQQ j)
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW h)
PMULHRSW h)
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
x, m128,i
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
x,x,xmm0
x,m,xmm0
xmm,xmm,i
xmm,m,i
mm,mm
xmm,xmm
r32,(x)mm
r32,xmm,i
m8,xmm,i
r32,(x)mm,i
m16,(x)mm,i
r32,xmm,i
m32,xmm,i
r64,xmm,i
m64,xmm,i
xmm,r32,i
xmm,m8,i
(x)mm,r32,i
(x)mm,m16,i
xmm,r32,i
xmm,m32,i
xmm,r64,i
xmm,m64,i
1
2
1
2
1
2
2
3
1
1
2
2
1
1
4
10
1
2
2
2
2
2
2
2
2
1
2
1
2
1
2
1
2
1
1
1
1
1
1
2
3
1
1
2
2
1
1
1
4
1
2
2
2
2
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
3
1
1
2
2
1
1
v,v
(x)mm,m
v,v
(x)mm,m
1
1
2
2
1
1
2
2
x
x
x
x
x
x
x
x
v,v
3
3
1
2
(x)mm,m64
v,v
(x)mm,m64
v,v
(x)mm,m
xmm,xmm
xmm,m128
v,v
(x)mm,m
v,v
(x)mm,m
4
3
4
1
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
1
1
1
1
1
x
x
2
2
2
x
x
1
1
1
1
1
x
x
x
?
x
x
3
x
x
x
?
x
x
1
1
1
1
x
x
x
1
x
1
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
int
int
int
int
1
int
3
int
int
int
int
int
int
int
int
int
int
int
1
1
2
1
2
1
2
3
3
3
3
3
1
2
1
1
2
3
1
1
3
3
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2-5
6-10
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
1
1
1
2
2
2
2
0.5
1
1
1
1
1
1
1
PMULLD j)
xmm,xmm
PMULLD j)
xmm,m128
PMULDQ j)
xmm,xmm
PMULDQ j)
xmm,m128
PMULUDQ
v,v
PMULUDQ
(x)mm,m
PMADDWD
v,v
PMADDWD
(x)mm,m
PMADDUBSW h)
v,v
PMADDUBSW h)
(x)mm,m
PAVGB/W
v,v
PAVGB/W
(x)mm,m
PMIN/MAXSB j)
xmm,xmm
PMIN/MAXSB j)
xmm,m128
PMIN/MAXUB
v,v
PMIN/MAXUB
(x)mm,m
PMIN/MAXSW
v,v
PMIN/MAXSW
(x)mm,m
PMIN/MAXUW j)
xmm,xmm
PMIN/MAXUW j)
xmm,m
PMIN/MAXSD j)
xmm,xmm
PMIN/MAXSD j)
xmm,m128
PMIN/MAXUD j)
xmm,xmm
PMIN/MAXUD j)
xmm,m128
PHMINPOSUW j)
xmm,xmm
PHMINPOSUW j)
xmm,m128
PABSB PABSW PABSD h)
v,v
PABSB PABSW PABSD
h)
(x)mm,m
PSIGNB PSIGNW
PSIGND h)
v,v
PSIGNB PSIGNW
PSIGND h)
(x)mm,m
PSADBW
v,v
PSADBW
(x)mm,m
MPSADBW j)
xmm,xmm,i
MPSADBW j)
xmm,m,i
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PTEST j)
PTEST j)
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
Other
EMMS
Notes:
v,v
(x)mm,m
xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i
4
6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
1
4
5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
1
1
1
2
2
1
1
1
1
1
1
1
1
x
x
1
1
x
x
x
x
1
2
2
1
1
1
1
1
x
x
1
1
x
x
x
x
1
1
1
1
1
1
1
1
1
1
x
4
4
x
1
x
x
1
1
1
x
x
1
1
1
3
4
1
1
1
3
3
x
1
1
2
2
1
1
1
2
3
1
1
1
2
2
1
1
1
2
2
1
11
11
x
1
1
1
1
2
2
x
x
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
x
1
5
5
3
3
3
3
1
1
1
1
1
1
1
4
1
int
int
1
x
x
x
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
float
2
4
1
1
1
1
1
1
1
1
0.5
1
1
1
0.5
1
0.5
1
1
1
1
1
1
1
4
4
0.5
1
1
3
5
1
1
1
1
2
1
0.5
1
1
1
2
2
0.33
1
1
1
1
1
1
1
1
1
6
g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
j) SSE4.1 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
m) Only available in 64 bit mode.
Floating point XMM instructions
Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPS
SHUFPD
SHUFPD
BLENDPS/PD j)
BLENDPS/PD j)
BLENDVPS/PD j)
BLENDVPS/PD j)
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS
UNPCKH/LPS
UNPCKH/LPD
UNPCKH/LPD
EXTRACTPS j)
EXTRACTPS j)
INSERTPS j)
INSERTPS j)
Conversion
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
Operands
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
x,m32/64
m32/64,x
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Unit
1
x,m,xmm0
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
r32,xmm,i
m32,xmm,i
xmm,xmm,i
xmm,m32,i
1
1
1
4
9
1
1
1
2
2
1
1
1
1
1
2
1
2
1
1
2
2
1
2
1
2
1
1
1
2
2
2
1
2
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
2
2
2
2
2
2
2
2
x,x,xmm0
x
x
x
int
int
1
2
4
1
1
x
x
x
x
1
x
x
2
1
1
1
2
2
int
int
int
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
float
float
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
1
1
1
1
x
1
1
1
1
1
1
1
1
2
3
2-4
3-4
1
2
3
3
5
3
1
1
1
1
1
1
2
2
Latency
Reciprocal throughput
1
int
int
float
float
int
int
int
int
int
int
int
int
int
int
float
float
int
int
int
int
1
float
float
float
float
4
1
1
2
1
1
1
1
4
1
4
0.33
1
1
2
4
0.33
1
1
1
1
1
1
1
2-3
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS
MULSS
MULSD
MULSD
MULPS
MULPS
MULPD
MULPD
DIVSS
DIVSS
DIVSD
DIVSD
DIVPS
xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
xmm,xmm
1
1
1
1
1
1
3
4
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
1
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
x
x
1
1
1
1
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
2
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
3
2
3
3
4
4
3
3
4
4
4
3
4
3
2
2
2
2
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
1
1
3
1
1
3
1
1
7
3
3
6
1.5
1.5
4
1
1
5
1
1
4
1
1
5
1
1
6-13 d) 5-12 d)
5-12 d)
6-21 d) 5-20 d)
5-20 d)
6-13 d) 5-12 d)
DIVPS
DIVPD
DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D
CMPccSS/D
CMPccPS/D
CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
ROUNDSS/D j)
ROUNDSS/D j)
ROUNDPS/D j)
ROUNDPS/D j)
DPPS j)
DPPS j)
DPPD j)
DPPD j)
Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm
x,m32/64
xmm,xmm
x,m32/64
xmm,xmm
xmm,m128
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
4
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
4
4
4
1
1
1
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
xmm,xmm
xmm,m128
1
1
1
1
x
x
x
x
x
x
m32
m32
m4096
m4096
13
10
151
121
12
8
67
74
x
x
x
x
x
x
x
x
x 1
x
1 1
x 8 38 38
x 47
2
2
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
x
x
1
1
1
1
1
1
1
1
1
x
x
1
5-12 d)
6-21 d) 5-20 d)
5-20 d)
3
2
2
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
3
1
1
11
3
3
9
3
3
6-13
1
float
float
float
float
float
float
1
1
int
int
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
6-20
3
5-12
5-12
5-19
5-19
2
2
Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
d) Round divisors give low values.
g) SSE3 instruction set.
0.33
1
38
20
145
150
Nehalem
Intel Nehalem
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p2 + p3 + p4 exceeds the number listed under μops fused domain. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p2: The number of μops going to port 2 (memory read).
p3: The number of μops going to port 3 (memory write address).
p4: The number of μops going to port 4 (memory write data).
Domain: Tells which execution unit domain is used: "int" = integer unit (general purpose registers), "ivec" = integer vector unit (SIMD), "fp" = floating point unit (XMM and x87 floating point). An additional "bypass delay" is generated if a register written by a μop in one domain is read by a μop in another domain. The bypass delay is 1 clock cycle between the "int" and "ivec" units, and 2 clock cycles between the "int" and "fp", and between the "ivec" and "fp" units.
The bypass delay is indicated under latency only where it is unavoidable because either the source operand or the destination operand is in an unnatural domain such as a general purpose register (e.g. eax) in the "ivec" domain. For example, the PEXTRW instruction executes in the "int" domain. The source operand is an xmm register and the destination operand is a general purpose register. The latency for this instruction is indicated as 2+1, where 2 is the latency of the instruction itself and 1 is the bypass delay, assuming that the xmm operand is most likely to come from the "ivec" domain. If the xmm operand comes from the "fp" domain then the bypass delay will be 2 rather than one.
The flags register can also have a bypass delay. For example, the COMISS instruction (floating point compare) executes in the "fp" domain and returns the result in the integer flags. Almost all instructions that read these flags execute in the "int" domain. Here the latency is indicated as 1+2, where 1 is the latency of the instruction itself and 2 is the bypass delay from the "fp" domain to the "int" domain.
The bypass delay from the memory read unit to any other unit and from any unit to the memory write unit are included in the latency figures in the table. Where the domain is not listed, the bypass delays are either unlikely to occur or unavoidable and therefore included in the latency figure.
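The μop fusion rule and the bypass delays described above can be expressed as two small helper functions. The following Python sketch is only an illustration: the function names are hypothetical, the domain labels follow the "int", "ivec" and "fp" names used in the Domain column, and the row values passed to has_uop_fusion are made-up inputs rather than figures quoted from the tables.

```python
# Bypass delays on Nehalem in core clock cycles, as stated in the
# explanation of the Domain column. Keys are unordered domain pairs.
BYPASS_DELAY = {
    frozenset({"int", "ivec"}): 1,
    frozenset({"int", "fp"}): 2,
    frozenset({"ivec", "fp"}): 2,
}


def bypass_delay(producer_domain, consumer_domain):
    """Extra cycles when a value written in one domain is read in another."""
    if producer_domain == consumer_domain:
        return 0
    return BYPASS_DELAY[frozenset({producer_domain, consumer_domain})]


def has_uop_fusion(fused_uops, p015, p2=0, p3=0, p4=0):
    """Apply the rule from the text: an instruction has uop fusion if
    p015 + p2 + p3 + p4 exceeds the fused-domain uop count."""
    return p015 + p2 + p3 + p4 > fused_uops


# PEXTRW example from the text: latency 2 plus the ivec -> int bypass delay.
pextrw_latency = 2 + bypass_delay("ivec", "int")

# Illustrative (hypothetical) table rows:
# one fused uop made of one execution uop plus one memory read uop ...
row_with_load = has_uop_fusion(fused_uops=1, p015=1, p2=1)
# ... and a register-only instruction with a single uop.
row_register_only = has_uop_fusion(fused_uops=1, p015=1)
```

With these inputs, pextrw_latency evaluates to 3, matching the 2+1 notation in the tables, and only the row with a memory read is reported as fused.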
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity greatly increase the delays, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NaN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
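The difference between the two figures can be illustrated with a rough estimate. The Python sketch below is an idealized model, not a measurement method: the function name estimate_cycles and the chains parameter are assumptions made for this example, and the model ignores decoding limits, port contention with other instructions, cache misses and bypass delays.

```python
def estimate_cycles(n, latency, reciprocal_throughput, chains=1):
    """Idealized lower bound on core clock cycles for n instructions of
    the same kind, split evenly over `chains` independent dependency chains.

    Within one chain each instruction must wait for the previous result,
    so the chain costs latency cycles per instruction. Across all chains
    the execution units cannot sustain more than one instruction per
    reciprocal_throughput cycles.
    """
    if chains < 1:
        raise ValueError("chains must be at least 1")
    latency_bound = n * latency / chains
    throughput_bound = n * reciprocal_throughput
    return max(latency_bound, throughput_bound)


# Hypothetical instruction with latency 3 and reciprocal throughput 1:
single_chain = estimate_cycles(100, latency=3, reciprocal_throughput=1)
four_chains = estimate_cycles(100, latency=3, reciprocal_throughput=1, chains=4)
```

With a single dependency chain the estimate is limited by latency (300 cycles for 100 instructions in this example). With four independent chains the latency bound drops to 75 cycles, so the throughput bound of 100 cycles becomes the limit.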
Integer instructions
Instruction
Move instructions
MOV
MOV a)
MOV a)
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D) i)
POP
POP
POP
POP
POPF(D/Q)
POPA(D) i)
LAHF SAHF
SALC i)
LEA a)
BSWAP
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
Operands
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
r,r/i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
1
1
1
1
1
2
6
6
2
1
r,r
1
1
r,m
r,r
r,m
r,r
r,m
1
2
2
3
7
2
1
1
2
2
3
18
1
3
2
7
8
10
1
2
1
1
1
9
1
r
i
m
sr
r
(E/R)SP
m
sr
r,m
r32
r64
m
m
x
x
x
1
3
2
x
x
x
x
x
x
x
1
1
3
4
1
1
1
1
1
1
1
1
x
1
2
2
3
x
1
x
x
x
x
x
x
x
x
x
1
1
1
1
1
2
2
x
x
x
1
x
x
2
x
1
x
2
7
2
1
2
1
1
1
3
x
x
x
x
x
x
x
1
1
1
x
x
x
x
x
1
1
1
5
1
8
6
1
1
1
1
1
1
1
8
1
1
1
1
1
1
1
8
1
Latency
Reciprocal throughput
int
int
int
int
int
int
int
int
int
~270
0.33
1
1
1
1
1
13
14
1
int
1
0.33
2
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
2
3
3
2
20 b)
5
3
2
1
4
1
1
3
2
1
1
1
1
1
1
8
1
5
1
15
14
8
0.33
1
1
1
1
15
1
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS DAA DAS i)
AAD i)
AAM i)
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV c)
DIV c)
DIV c)
DIV c)
IDIV c)
IDIV c)
IDIV c)
IDIV c)
CBW CWDE CDQE
CWD CDQ CQO
POPCNT ℓ)
POPCNT ℓ)
CRC32 ℓ)
CRC32 ℓ)
m
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r16,m16
r32,m32
r64,m64
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64
r,r
r,m
r,r
r,m
1
2
3
2
1
1
1
2
2
2
4
1
1
1
3
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
3
1
1
1
1
1
1
4
6
6
~40
4
8
7
~60
1
1
1
1
1
1
1
x
x
x
1
1
1
2
2
3
1
1
1
1
1
3
5
1
3
3
3
1
1
1
1
1
1
1
3
3
2
1
1
1
1
1
1
4
6
6
x
4
8
7
x
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
x
x
1
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
1
1
1
1
x
x
2
1
x
x
x
x
1
1
1
1
1
1
1
x
x
x
1
x
x
x
x
x
2
4
3
x
2
5
3
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
1
x
x
x
x
x
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
9
23
5
1
6
2
2
7
1
1
1
6
3
15
20
3
5
5
3
3
3
3
3
3
3
3
5
5
3
3
3
3
11-21
17-22
17-28
28-90
10-22
18-23
17-28
37-100
1
1
3
3
0.33
1
1
2
2
0.33
1
0.33
1
1
2
7
1
2
2
2
1
1
1
1
1
2
1
2
2
2
1
1
1
1
1
1
7-11
7-12
7-17
19-69
7-11
7-12
7-17
26-86
1
1
1
1
1
1
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR RCL
RCR
RCL
RCR RCL
RCR RCL
RCR
RCL
RCR RCL
SHLD
SHLD
SHRD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC STC CMC
CLD
STD
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r8,i/cl
r8,i/cl
r16/32/64,i/cl
m,1
m8,i/cl
m8,i/cl
m16/32/64,i/cl
r,r,i/cl
m,r,i/cl
r,r,i/cl
m,r,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
Control transfer instructions
JMP
short/near
JMP i)
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
Fused compare/test and branch e)
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL i)
far
CALL
r
CALL
m(near)
CALL
m(far)
1
1
2
1
1
1
3
1
3
2
9
8
6
4
12
11
10
2
3
2
3
1
9
2
1
10
3
1
2
1
2
1
2
2
1
1
1
1
1
1
2
1
2
2
9
8
6
3
9
8
7
2
2
2
2
1
8
2
1
7
3
1
1
1
1
1
2
2
1
31
1
1
31
1
1
2
6
11
2
46
3
4
47
1
31
1
1
31
1
1
2
6
11
2
46
2
3
47
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
1
1
1
1
1
x
x
x
?
x
x
x
?
1
11
1
1
1
x
x
1
1
1
1
1
1
1
9
?
?
?
?
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
1
6
1
1
6
1
6
2
13
11
12-13
7
16
14
15
3
8
4
9
1
1
6
6
3
3
1
1
0
0
0
0
0
0.33
1
1
0.33
1
0.5
1
1
1
2
12-13
1
1
1
5
1
1
1
1
1
1
0.33
4
5
2
67
2
2
73
2
2
2
4
7
2
74
2
2
79
RETN
RETN
RETF
RETF
BOUND i)
INTO i)
String instructions
LODS
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
Notes:
a)
b)
c)
e)
i)
ℓ)
i
i
r,m
small n
large n
small n
large n
a,0
a,b
1
3
39
40
15
4
1
2
39
40
13
4
1
1
int
int
int
int
int
int
2
2
1
11+4n
3
1
60+n
2.5/16 bytes
5
2
13+6n
2/16 bytes
3
2
37+6n
5
3
65+8n
x
1
1
5
11
34+7b
3
25-100
22
28
1
1
5
9
1
1
x
x
1
x
x
x
x
x
x
1
x
x
x
1
x
x
x
2
x
x
x
x
x
x
x
x
x
x
x
x
3
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
2
2
120
124
7
5
1
40+12n
1
12+n
1 clk / 16 bytes
4
12+n
1 clk / 16 bytes
1
40+2n
4
42+2n
0.33
1
9
8
79+5b
~200
5
~200
24
40-60
a) Applies to all addressing modes.
b) Has an implicit LOCK prefix.
c) Low values are for small results, high values for high results.
e) See manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" for restrictions on macro-op fusion.
i) Not available in 64 bit mode.
ℓ) SSE4.2 instruction set.
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
Operands
r
m32/64
m80
m80
r
m32/m64
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
1
1
4
41
1
1
1
1
2
38
1
1
1
x
1
1
x
x
1
2
3
1
1
float
float
float
float
float
float
Latency
Reciprocal throughput
1
3
4
45
1
4
1
1
2
20
1
1
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP g)
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
m80
m80
r
m
m
m
r
AX
m16
m16
m16
r
m
m
r
m
r
m
r
m
r
m
r
m
m
m
m
7
208
1
1
3
3
1
2
2
2
2
3
2
2
1
2
143
79
3
204
0 f)
1
1
1
1
2
2
2
2
2
1
1
1
2
89
52
1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
25
35
17
1
1
1
1
1
1
1
1
1
1
2
1
2
2
2
2
1
1
25
35
17
24
17
1
~100
~100
~100
19
~55
~100
~82
24
17
1
~100
~100
~100
19
~55
~100
~82
x
x
x
x
x
x
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x 8 23 23
x 27
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
1
x
x
x
x
x
x
x
1
1
1
1
1
1
2
1
1
2
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
5
242
0
6
7
7
2+2
7
5
1
178
156
5
245
1
1
1
1
1
2
2
2
1
2
31
1
1
4
178
156
float
3
1
float
1
float
5
1
float
1
float 7-27 d) 7-27 d)
float 7-27 d) 7-27 d)
float
1
1
float
1
1
float
1
float
1
float
1
float
1
float
3
2
float
5
2
float 7-27 d) 7-27 d)
float
1
float
1
float
1
float
14
float
19
float
22
float
12
float
13
float
~27
float 40-100
float 40-100
float ~110
float
58
float
~80
float ~115
float ~120
Other
FNOP
WAIT
FNCLEX
FNINIT
Notes:
d)
f)
g)
1
1
1
2
2
x
3
3
~190 ~190 x
x
x
x
float
float
float
float
x
x
x
1
1
17
77
d) Round divisors or low precision give low values.
f) Resolved by register renaming. Generates no μops in the unfused domain.
g) SSE3 instruction set.
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD k)
MOVD k)
MOVD k)
MOVD k)
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU g)
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA j)
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW j)
PACKUSDW j)
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW j)
PMOVSX/ZXBW j)
PMOVSX/ZXBD j)
PMOVSX/ZXBD j)
Operands
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
xmm, m128
m128, xmm
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
xmm, m128
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
mm,mm
1
1
1
mm,m64
1
1
1
xmm,xmm
1
1
x
x
xmm,m128
xmm,xmm
xmm,m
(x)mm, (x)mm
(x)mm,m
xmm,xmm
xmm, m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
int
1
1
x
x
1
x
ivec
1
1
x
x
x
ivec
1
1
1
x
x
1
x
ivec
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
ivec
ivec
1
1
1
1
1
ivec
Latency
Reciprocal throughput
1+1
3
1+1
2
1
2
3
1
2
3
2
3
2
1
1
~270
~270
2
0.33
1
0.33
1
0.33
1
1
0.33
1
1
1
1
1
0.33
0.33
2
2
1
1
1
1
2
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
1
1
1
1
1
1
0.5
2
2
2
0.5
2
0.5
1
1
2
1
2
PMOVSX/ZXBQ j)
PMOVSX/ZXBQ j)
PMOVSX/ZXWD j)
PMOVSX/ZXWD j)
PMOVSX/ZXWQ j)
PMOVSX/ZXWQ j)
PMOVSX/ZXDQ j)
PMOVSX/ZXDQ j)
PSHUFB h)
PSHUFB h)
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR h)
PALIGNR h)
PBLENDVB j)
PBLENDVB j)
PBLENDW j)
PBLENDW j)
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB j)
PEXTRB j)
PEXTRW
PEXTRW j)
PEXTRD j)
PEXTRD j)
PEXTRQ j,m)
PEXTRQ j,m)
PINSRB j)
PINSRB j)
PINSRW
PINSRW
PINSRD j)
PINSRD j)
PINSRQ j,m)
PINSRQ j,m)
Arithmetic instructions
PADD/SUB(U)
(S)B/W/D/Q
PADD/SUB(U)
(S)B/W/D/Q
PHADD/SUB(S)W/D h)
PHADD/SUB(S)W/D h)
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ j)
PCMPEQQ j)
xmm,xmm
xmm,m16
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m64
(x)mm, (x)mm
(x)mm,m
mm,mm,i
mm,m64,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
xmm,m,xmm0
xmm,xmm,i
xmm,m,i
mm,mm
xmm,xmm
r32,(x)mm
r32,xmm,i
m8,xmm,i
r32,(x)mm,i
m16,(x)mm,i
r32,xmm,i
m32,xmm,i
r64,xmm,i
m64,xmm,i
xmm,r32,i
xmm,m8,i
(x)mm,r32,i
(x)mm,m16,i
xmm,r32,i
xmm,m32,i
xmm,r64,i
xmm,m64,i
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
2
2
2
1
2
1
2
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
4
1
2
2
2
2
2
1
2
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
1
x
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
(x)mm, (x)mm
1
1
x
x
(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128
1
3
4
1
1
1
1
1
3
3
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
ivec
2
ivec
1
ivec
ivec
float
ivec
2+2
2+1
ivec
2+1
ivec
2+1
ivec
2+1
ivec
1+1
ivec
1+1
ivec
1+1
ivec
1+1
ivec
1
ivec
3
ivec
1
ivec
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
0.5
1
0.5
1
0.5
1
0.5
1
1
1
1
1
0.5
1
2
7
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
2
1.5
3
0.5
2
0.5
2
PCMPGTQ ℓ)
PCMPGTQ ℓ)
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW h)
PMULHRSW h)
PMULLD j)
PMULLD j)
PMULDQ j)
PMULDQ j)
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW h)
PMADDUBSW h)
PAVGB/W
PAVGB/W
PMIN/MAXSB j)
PMIN/MAXSB j)
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW j)
PMIN/MAXUW j)
PMIN/MAXU/SD j)
PMIN/MAXU/SD j)
PHMINPOSUW j)
PHMINPOSUW j)
PABSB PABSW PABSD
h)
PABSB PABSW PABSD
h)
PSIGNB PSIGNW
PSIGND h)
PSIGNB PSIGNW
PSIGND h)
PSADBW
PSADBW
MPSADBW j)
MPSADBW j)
PCLMULQDQ n)
AESDEC, AESDECLAST,
AESENC, AESENCLAST
n)
AESIMC n)
AESKEYGENASSIST n)
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
1
1
1
1
1
1
2
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
(x)mm,(x)mm
1
1
x
x
(x)mm,m
1
1
x
x
(x)mm,(x)mm
1
1
x
x
(x)mm,m
(x)mm,(x)mm
(x)mm,m
xmm,xmm,i
xmm,m,i
xmm,xmm,i
1
1
1
3
4
1
1
1
3
3
x
x
x
x
x
x
x
x
x
x
x
x
x
ivec
3
ivec
3
ivec
6
ivec
3
ivec
3
ivec
3
ivec
3
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
ivec
1
ivec
3
ivec
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
1
ivec
1
ivec
3
ivec
5
1
1
1
1
1
x
x
x
x
x
x
ivec
0.5
12
~5
~5
~5
~2
~2
~2
1
0.33
1
1
1
0.5
2
1
3
1
2
8
1
x
x
1
1
1
1
1
1
1
1
0.5
1
1
2
0.5
2
0.5
2
1
2
1
2
1
3
1
xmm,xmm
xmm,xmm
xmm,xmm,i
(x)mm,(x)mm
(x)mm,m
1
1
1
1
1
1
2
1
1
1
x
x
3
1
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
ivec
PTEST j)
PTEST j)
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
xmm,xmm
xmm,m128
mm,mm/i
mm,m64
xmm,i
xmm,xmm
xmm,m128
xmm,i
2
2
1
1
1
2
3
1
2
2
1
1
1
2
2
1
x
x
String instructions
PCMPESTRI ℓ)
PCMPESTRI ℓ)
PCMPESTRM ℓ)
PCMPESTRM ℓ)
PCMPISTRI ℓ)
PCMPISTRI ℓ)
PCMPISTRM ℓ)
PCMPISTRM ℓ)
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
8
9
9
10
3
4
4
6
8
8
9
10
3
4
4
5
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
11
11
x
x
x
Other
EMMS
Notes:
g)
h)
j)
k)
ℓ)
m)
n)
x
x
x
x
x
1
1
1
1
1
x
x
ivec
3
ivec
1
ivec
ivec
1
2
ivec
1
1
1
1
2
1
2
1
1
ivec
ivec
ivec
ivec
ivec
ivec
ivec
ivec
14
14
7
7
8
8
7
7
5
6
6
6
2
2
2
5
1
1
x
x
x
1
1
1
1
1
float
6
g) SSE3 instruction set.
h) Supplementary SSE3 instruction set.
j) SSE4.1 instruction set.
k) MASM uses the name MOVD rather than MOVQ for this instruction even when moving 64 bits.
ℓ) SSE4.2 instruction set.
m) Only available in 64 bit mode.
n) Only available on newer models.
Floating point XMM instructions
Instruction
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS/D
SHUFPS/D
BLENDPS/PD j)
Operands
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
μops fused domain
μops unfused domain: p015 p0 p1 p5 p2 p3 p4
Domain
1
1
1
1
1
1
1
1
2
2
1
1
1
1
2
1
1
1
float
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
float
float
1
1
1
1
1
1
1
1
1
1
float
float
float
Latency
Reciprocal throughput
1
2
3
2
3
1
2
3
3
5
1
1+2
~270
1
1
1
1
1
1-4
1-3
1
1
1
2
1
1
1
2
1
1
1
BLENDPS/PD j)
BLENDVPS/PD j)
BLENDVPS/PD j)
MOVDDUP g)
MOVDDUP g)
MOVSH/LDUP g)
MOVSH/LDUP g)
UNPCKH/LPS/D
UNPCKH/LPS/D
EXTRACTPS j)
EXTRACTPS j)
INSERTPS j)
INSERTPS j)
Conversion
CVTPD2PS
CVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
xmm,m128,i
x,x,xmm0
xmm,m,xmm0
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
r32,xmm,i
m32,xmm,i
xmm,xmm,i
xmm,m32,i
2
2
3
1
1
1
1
1
1
1
2
1
3
1
2
2
1
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m64
xmm,xmm
xmm,m32
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m64
xmm,xmm
xmm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,mm
xmm,m64
mm,xmm
mm,m128
xmm,r32
xmm,m32
r32,xmm
r32,m32
xmm,r32
xmm,m32
r32,xmm
r32,m64
2
2
2
2
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
2
1
1
2
2
2
2
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
2
2
2
2
1
1
1
1
2
1
1
1
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
1
1
1
1
1
1
1
1
1
2
2
1
1
float
float
float
float
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
2
float
2
1
2
1
1
?
1
1
1
1
x
x
1
1
1
1
?
1
1
1
1
1
?
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
float
float
float
1
1
1+2
1
float
float
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
4
4
2
1
3+2
3+2
4+2
4+2
3+2
3+2
ivec/float
6
float/ivec
6
float
float
float
float
float
float
float
float
3+2
float
float
float
float
3
1
1
1
1
1
1
1
1
3+2
4+2
3+2
3
1
2
2
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
1
1
1
1
ADDSUBPS/D g)
ADDSUBPS/D g)
HADDPS HSUBPS g)
HADDPS HSUBPS g)
HADDPD HSUBPD g)
HADDPD HSUBPD g)
MULSS MULPS
MULSS MULPS
MULSD MULPD
MULSD MULPD
DIVSS DIVPS
DIVSS DIVPS
DIVSD DIVPD
DIVSD DIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D CMPccPS/D
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m128
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
1
1
3
4
3
4
1
1
1
1
1
1
1
1
1
1
1
1
3
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
xmm,xmm
1
1
1
xmm,m
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m32/64
xmm,xmm
xmm,m128
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
xmm,xmm,i
1
1
1
xmm,m128,i
xmm,xmm,i
xmm,m128,i
xmm,xmm,i
xmm,m128,i
2
4
6
3
4
1
4
5
3
3
xmm,xmm
xmm,m
xmm,xmm
xmm,m
xmm,xmm
xmm,m
1
2
1
2
1
1
1
1
1
1
1
1
xmm,xmm
xmm,m128
1
1
1
1
m32
m32
m4096
m4096
6
2
141
112
6
1
141
90
1
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
float
3
5
3
4
5
7-14
7-22
3
3
1
1
2
2
2
2
1
1
1
1
7-14
7-14
7-22
7-22
2
2
1
CMPccSS/D CMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
ROUNDSS/D
ROUNDPS/D j)
ROUNDSS/D
ROUNDPS/D j)
DPPS j)
DPPS j)
DPPD j)
DPPD j)
Math
SQRTSS/PS
SQRTSS/PS
SQRTSD/PD
SQRTSD/PD
RSQRTSS/PS
RSQRTSS/PS
1
x
x
x
1
2
x
x
x
1
1
1
1
float
float
float
float
float
float
float
float
1
1
x
x
x
1
1
1
1
1
3
3
1
1
1
1
1
1
1
3
1
11
1
2
9
1
3
7-18
1
float
float
float
float
float
float
7-18
7-18
7-32
7-32
2
2
1
1
float
float
1
1
1
1
1
float
float
float
float
float
1+2
7-32
3
Logic
AND/ANDN/OR/XORPS/D
AND/ANDN/OR/XORPS/D
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
Notes:
1
1
x
x
x
x
x
x
x 1
1
1 1
x 5 38 38
x 42
90
1
1
5
1
90
100
g) SSE3 instruction set.
Sandy Bridge
Intel Sandy Bridge
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under μops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write μop in the second clock cycle, but a read port can receive an address-calculation μop in the second clock cycle. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p23: The number of μops going to port 2 or 3 (memory read or address calculation).
p4: The number of μops going to port 4 (memory write data).
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
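The difference between the latency and reciprocal throughput columns can be illustrated with a rough first-order estimate. The sketch below is not part of the measured tables: the function names and the example figures (latency 3, reciprocal throughput 1) are assumptions chosen for illustration, and the model ignores cache misses, decoding limits and port contention from other instructions.

```python
import math

def chain_cycles(n, latency):
    """Cycles for n instructions where each depends on the previous result."""
    return n * latency

def independent_cycles(n, reciprocal_throughput):
    """Cycles for n independent instructions of the same kind in one thread."""
    return n * reciprocal_throughput

def interleaved_cycles(n, latency, reciprocal_throughput, chains):
    """First-order estimate for n instructions split evenly over several
    independent dependency chains: the slower of the two limits dominates."""
    per_chain = math.ceil(n / chains)
    return max(per_chain * latency, n * reciprocal_throughput)

if __name__ == "__main__":
    # Hypothetical instruction: latency 3, reciprocal throughput 1.
    print(chain_cycles(1000, 3))                      # one long dependency chain
    print(independent_cycles(1000, 1))                # fully independent
    print(interleaved_cycles(1000, 3, 1, chains=2))   # still latency bound
    print(interleaved_cycles(1000, 3, 1, chains=4))   # throughput bound
```

Under these assumptions, roughly latency divided by reciprocal throughput independent chains are needed before throughput rather than latency becomes the limiting factor.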
The latencies and throughputs listed below for addition and multiplication using
full size YMM registers are obtained only after a warm-up period of a thousand
instructions or more. The latencies may be one or two clock cycles longer and
the reciprocal throughputs double the values for shorter sequences of code.
There is no warm-up effect when vectors are 128 bits wide or less.
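The μop fusion rule given in the column explanation (fusion is present when p015 + p23 + p4 exceeds the fused-domain count) and the 1+ notation can be checked mechanically when the tables are processed by a script. The following is a minimal sketch under stated assumptions: the `Row` class, its field names and the example rows are hypothetical and are not quoted from the tables.

```python
from dataclasses import dataclass

@dataclass
class Row:
    """One table row; field names mirror the column headings (hypothetical)."""
    fused: int      # μops fused domain
    p015: int = 0   # μops going to ports 0, 1 and 5
    p23: int = 0    # μops going to port 2 or 3
    p4: int = 0     # μops going to port 4

def parse_port_cell(cell):
    """Split a cell such as '1+' into (count, is_two_cycle_256_bit_access)."""
    text = cell.strip()
    return int(text.rstrip("+")), text.endswith("+")

def unfused_total(row):
    """Sum of the unfused-domain μops over all ports."""
    return row.p015 + row.p23 + row.p4

def has_uop_fusion(row):
    """True if the unfused total exceeds the fused-domain count."""
    return unfused_total(row) > row.fused

if __name__ == "__main__":
    store = Row(fused=1, p23=1, p4=1)   # illustrative values only
    reg_op = Row(fused=1, p015=1)       # illustrative values only
    print(has_uop_fusion(store), has_uop_fusion(reg_op))
    print(parse_port_cell("1+"))
```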
Integer instructions
Instruction
Move instructions
MOV
Operands
r,r/i
μops fused domain; μops unfused domain (p015, p0, p1, p5, p23, p4); Latency; Reciprocal throughput; Comments
1
1
x
x
x
1
MOV
r,m
1
1
MOV
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
m,r
m,i
m,r
r,r
1
1
2
1
1
1
1
r,m
1
r,r
r,m
r,r
r,m
2
2
3
8
2
2
3
x
3
1
1
2
3
16
1
1
2
9
18
1
3
1
1
2
XLAT
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
LEA
BSWAP
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
r
i
m
r
(E/R)SP
m
r,m
r,m
r32
r64
m
m
r,r/i
r,m
m,r/i
r,same
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
1
x
x
1
1
1
x
2
0.5
3
1
1
1
~350
1
1
2
0
x
x
x
x
x
x
x
x
x
x
x
x
0
8
10
1
3
1
1
1
2
1
1
2
3
2
1
2
1
1
2
1
2
2
4
1
1
1
3
1
1
1
0
2
2
3
1
1
1
1
x
x
x
x
2
1
1
1
2
1
8
1
1
2
1
8
1
1
1
1
1
8
2
25
7
3
2
1
x
1
1
1
1
3
1
2
1
2
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
0.5
1
2
1
1
1
1
1
1
implicit
lock
1
1
1
1
1
8
0.5
0.5
1
18
9
1
1
0.5
1
1
1
0.5
0.5
4
33
6
1
1
2
1
1
2
1
1
2
all addressing
modes
1
6
0
2
2
7
1
1
1
6
0.5
1
0.25
1
1
1.5
0.5
2
not 64 bit
not 64 bit
not 64 bit
simple
complex
or rip relative
AAA AAS
2
2
4
not 64 bit
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
3
3
8
1
4
3
2
1
2
1
1
1
3
2
1
1
2
1
1
10
11
10
x
10
10
9
x
4
2
20
3
4
4
3
3
4
3
3
3
not 64 bit
not 64 bit
not 64 bit
r,r
r,m
r,r
r,m
3
3
8
1
4
3
2
1
2
1
1
1
4
3
2
1
2
1
1
10
11
10
34-56
10
10
9
59-138
1
1
1
2
1
1
1
1
1
1
r,r/i
r,m
m,r/i
r,same
r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,i
1
1
2
1
1
1
1
3
3
5
1
1
1
1
0
1
1
1
1
3
3
1
CBW
CWDE
CDQE
CWD
CDQ
CQO
POPCNT
POPCNT
CRC32
CRC32
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
r8
r16
r32
r64
r,r
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r,m
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
20-24
21-25
20-28
30-94
21-24
21-25
20-27
40-103
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
3
1
1
1
1
1
1
3
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
11
1
2
2
1
1
1
1
1
1
2
2
2
1
1
1
1
11-14
11-14
11-18
22-76
11-14
11-14
11-18
25-84
0.5
1
0.5
1
1
0.5
1
1
1
1
1
1
2
1
6
0
1
1
1
2
1
2
1
2
1
0.5
1
0.25
0.5
0.5
2
2
4
1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
ROR ROL
ROR ROL
ROR ROL
RCR
RCR
RCR
RCR
RCR
RCR
RCL
RCL
RCL
RCL
RCL
SHRD SHLD
SHRD SHLD
SHRD SHLD
SHRD SHLD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC CMC
CLD STD
m,i
r,cl
m,cl
r8,1
r16/32/64,1
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
m,r,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
Control transfer instructions
JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
Fused arithmetic and
branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET
BOUND
INTO
String instructions
LODS
short
short
short
near
r
m
i
r,m
4
3
5
high
3
8
11
8
11
3
8
11
8
11
1
3
4
5
1
10
2
1
11
3
1
1
1
2
1
1
3
3
3
3
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
7
11
3
2
3
2
3
15
4
2
7
11
2
1
2
2
2
13
4
3
2
2
2
1
high
2
5
3
8
7
8
7
3
8
7
8
7
1
4
3
1
8
1
1
7
1
1
1
1
1
0
1
3
1
x
x
x
x
5
2
6
x
x
x
x
2
1
2
1
6
2
1
x
1
1
x
2
x
1
3
1
x
x
x
1
x
x
x
1
1
1
x
2
2
4
high
2
5
6
5
6
2
6
6
6
6
0.5
2
2
4
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25
1
4
x
x
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
0
0
0
0
2
2
2
1-2
0
1-2
2-4
5
5
2
2
2
2
2
7
6
1
fast if not
jumping
not 64 bit
not 64 bit
REP LODS
STOS
REP STOS
5n+12
3
2n
REP STOS
1.5/16B
1/16B
MOVS
REP MOVS
5
2n
1.5 n
REP MOVS
3/16B
1/16B
SCAS
REP SCAS
CMPS
REP CMPS
3
6n+47
5
8n+80
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDTSCP
RDPMC
1
1
1
n
worst
case
best case
4
1
1
a,0
a,b
~2n
1
worst
case
best case
1
2n+45
4
2n+80
0
0
0.25
0.25
7
7
12
10
49+6b
3
3
31-75
21
23
35
2
decode
only 1
per clk
11
8
1
84+3b
1
7
100-250
28
36
42
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
r
μops fused domain; μops unfused domain (p015, p0, p1, p5, p23, p4); Latency; Reciprocal throughput; Comments
1
1
4
43
1
1
7
246
1
1
3
3
1
2
2
3
1
1
2
40
1
1
1
1
1
1
0
6
7
7
1
1
2
3
0
1
1
1
1
2
2
3
1
2
1
3
4
45
1
4
5
1
2
3
1
1
1
1
1
1
1
1
1
2
3
1
1
2
21
1
1
5
252
0.5
1
2
2
2
2
2
2
SSE3
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
AX
m16
m16
m16
2
2
3
2
1
1
143
90
2
1
2
1
1
1
1
2
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
41-87
17
1
2
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
r
m
m
r
m
r
m
r
m
r
m
r
m
m
m
m
2
1
1
1
1
1
1
1
1
1
1
8
5
1
3
1
1
1
1
1
1
1
1
1
1
5
1
10-24
1
1
1
1
1
1
1
2
1
1
1
1
3
1
1
4
1
1
1
1
17
21
26-50
22
27
27
17
17
1
1
64-100 x
20-110 x
20-110 x
53-118 x
12
10
10-24
47-100
47-115
43-123
61-69
1
102 102
28-91 x
Other
FNOP
WAIT
FNCLEX
FNINIT
1
2
5
26
1
2
5
26
1
1
1
1
1
166
165
1
1
1
1
10-24
10-24
1
1
1
1
1
1
1
1
2
1
2
21
26-50
130
93-146
1
Integer MMX and XMM instructions
1
1
22
81
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW
PMOVSX/ZXBW
PMOVSX/ZXBD
PMOVSX/ZXBD
PMOVSX/ZXBQ
PMOVSX/ZXBQ
PMOVSX/ZXWD
PMOVSX/ZXWD
PMOVSX/ZXWQ
PMOVSX/ZXWQ
PMOVSX/ZXDQ
PMOVSX/ZXDQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
Operands
μops fused domain; μops unfused domain (p015, p0, p1, p5, p23, p4); Latency; Reciprocal throughput; Comments
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
x, m128
m128, x
x, m128
mm, x
x,mm
m64,mm
m128,x
x, m128
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
x
x
x
mm,mm
1
1
1
mm,m64
1
1
1
x,x
1
1
x
x
x,m128
x,x
x,m
(x)mm,(x)mm
(x)mm,m
x,x
x, m128
x,x
x,m64
x,x
x,m32
x,x
x,m16
x,x
x,m64
x,x
x,m32
x,x
x,m64
(x)mm,(x)mm
(x)mm,m
mm,mm,i
mm,m64,i
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
1
x
x
x
1
1
1
1
1
x
x
x
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
3
1
3
1
3
3
1
3
3
3
3
3
1
1
~300
~300
1
0.5
0.5
1
0.5
1
0.5
1
0.5
1
SSE3
1
0.5
1
1
1
0.5
SSE4.1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
PBLENDW
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB
PEXTRB
PEXTRW
PEXTRW
PEXTRD
PEXTRD
PEXTRQ
PEXTRQ
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD
PINSRD
PINSRQ
PINSRQ
x,x,i
x,m128,i
x,x,i
x, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
x,m,xmm0
x,x,i
x,m,i
mm,mm
x,x
r32,(x)mm
r32,x,i
m8,x,i
r32,(x)mm,i
m16,(x)mm,i
r32,x,i
m32,x,i
r64,x,i
m64,x,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
x,r64,i
x,m64,i
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
3
2
3
2
2
2
2
2
2
2
2
1
1
1
1
1
1
2
2
1
1
1
4
1
2
1
2
1
2
2
2
2
2
1
2
1
2
1
2
1
(x)mm, (x)mm
(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
x,same
x,same
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
1
1
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
3
3
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
1
1
x
x
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
2
1
1
1
2
4
1
x
2
2
1
1
2
1
1
2
1
1
1
1
2
2
1
2
1
2
1
2
1
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
1
6
1
1
1
1
2
1
1
1
1
1
0.5
1
0.5
1
0.5
1
0.5
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1,
64b
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1,
64 b
Arithmetic instructions
PADD/SUB(U,S)B/W/D/Q
PADD/SUB(U,S)B/W/D/Q
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PSUBxx, PCMPGTx
PCMPEQx
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
5
1
0
0
5
1
5
1
5
1
5
1
5
1
0.5
0.5
1.5
1.5
0.5
0.5
0.5
0.5
1
1
0.25
0.5
1
1
1
1
1
1
1
1
1
1
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.2
SSE4.2
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAXSB
PMIN/MAXSB
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW
PMIN/MAXUW
PMIN/MAXU/SD
PMIN/MAXU/SD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x,i
x,m,i
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PXOR
PTEST
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
(x)mm,(x)mm
(x)mm,m
x,same
x,x
x,m128
mm,mm/i
mm,m64
x,i
x,x
x,m128
x,i
1
1
1
1
1
1
1
1
2
3
1
1
1
0
2
2
1
1
1
2
2
1
x
x
x
x
x
x
1
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
8
8
8
8
3
4
3
4
8
7
8
7
3
3
3
3
x,x,i
18
18
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
Encryption instructions
PCLMULQDQ
5
1
5
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
5
1
x
x
x
x
1
1
1
1
1
1
1
1
1
x
x
x
x
1
1
1
1
5
1
x
x
x
x
6
0
1
1
1
x
x
x
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
1
1
x
x
x
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
0.5
0.5
1
1
1
1
1
2
1
1
4
1
11-12
1
3
1
11
1
14
0.5
0.25
1
1
1
2
1
1
1
1
SSE4.1
SSE4.1
4
4
4
4
3
3
3
3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
8
CLMUL
AESDEC, AESDECLAST,
AESENC, AESENCLAST
AESIMC
AESKEYGENASSIST
x,x
x,x
x,x,i
Other
EMMS
2
2
11
2
2
11
31
31
8
8
4
2
8
AES
AES
AES
18
Floating point XMM and YMM instructions
Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VSHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
BLENDPS/PD
VBLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
Operands
μops fused domain; μops unfused domain (p015, p0, p1, p5, p23, p4); Latency; Reciprocal throughput; Comments
x,x
y,y
x,m128
1
1
1
y,m256
m128,x
1
1
m256,y
x,x
x,m32/64
m32/64,x
x,m64
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i
x,m128,i
y,y,y,i
y, y,m256,i
x,x,x/i
y,y,y/i
x,x,m
y,y,m
x,m,i
y,m,i
y,y,y,i
y,y,m,i
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,xmm0
x,m,xmm0
y,y,y,y
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
2
2
2
2
1
2
1
2
1
2
2
3
2
1
1
1
1
1
1
1+
1
1
1
1+
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
1
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
1
4
1
1
3
1
1
0.5
4
3
1
1
AVX
3
1
3
3
3
3
1
2
2
~300
~300
1
1
1
0.5
1
1
1
1
1
1
1
25
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
1
1
1
1
1
AVX
1
1
1+
1
1
1
1+
1
1+
2
1+
1
1
1
1+
2
1
2
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
VUNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
y,y,m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y,y
y,y,m256
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
3
1
1
1
1
1
2
2
2
1
1
1
1
1
1
1
1
2
3
1
2
1
2
1
2
3
3
4
4
2
1
x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
2
2
2
2
2
2
2
2
2
3
2
2
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
1
2
2
2
1
1
1
1
1
1
1
1
1
2
2
x
x
1
1+
1
3
1
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1+
1
1
1
1
1
3
1
4
1
1
1+
1
1
1
1+
2
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1+
1 1
1 1+
4
1
4
1+
3
1
3
1
1
1
1
4
1
3
1
1
1
1
1
1
1
1
1
1
1
3
1
3
1+
3
1
3
1+
1
1
4
1
1
1
0.5
1
1
1
1
1
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
AVX
SSE3
SSE3
AVX
AVX
AVX
AVX
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
ADDSUBPS/D
VADDSUBPS/D
VADDSUBPS/D
HADDPS/D HSUBPS/D
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULPS
MULSS MULPS
VMULPS
VMULPS
MULSD MULPD
MULSD MULPD
VMULPD
VMULPD
DIVSS DIVPS
DIVSS DIVPS
VDIVPS
VDIVPS
DIVSD DIVPD
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32
x,m32
r32,x
r32,m64
2
3
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
1
2
2
2
2
2
2
2
2
1
1
2
1
2
2
2
2
2
1
2
2
2
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128
y,y,y
y,y,m256
x,x
x,m128
1
1
1
1
1
1
1
1
1
1
3
4
1
1
1
1
1
1
1
1
1
1
3
3
1
1
1
1
1
1
1
1
1
1
1
1
2
2
y,y,y
3
3
1
2
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
4
1
1
1
1
1
1
1
1
1
1
3
4
1
3
1
1
1
1
1
1
1
1
1
1
3
3
1
1
2
5
1
4
1
5
1+
4
1
1
4
1
1
1
4
1
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
4
1
4
1
4
1
4
1
3
1
3
1
3
1+
3
1
3
1+
5
1
5
1+
5
1
5
1+
5
1
5
1+
10-14
1
1
1
21-29
1+
10-22
1
1
1
1
1
1
2
2
1
1
1
1
1
1
1.5
1.5
1
1
1.5
1.5
1
1
AVX
AVX
AVX
AVX
1
1
1
1
1
1
1
1
1
1
2
2
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
2
AVX
2
1
1
1
1
1
1
1
1
10-14
10-14
20-28
20-28
10-22
AVX
AVX
AVX
AVX
AVX
AVX
AVX
DIVSD DIVPD
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D CMPccPS/D
x,m
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256
1
3
4
1
1
3
4
1
3
3
1
1
3
3
1
2
2
1
1
2
1
x,x
1
1
1
x,m128
y,y,y
y,y,m256
x,x
x,m32/64
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,i
x,m128,i
y,y,y,i
y,m256,i
x,x,i
x,m128,i
2
1
2
2
2
1
1
1
1
1
1
1
2
1
2
4
6
4
6
3
4
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
4
5
4
5
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
2
1
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
1
1
3
4
1
2
3
4
1
1
3
4
1
1
3
3
1
1
3
3
1
1
3
3
x,x
x,m128
1
1
1
1
1
1
y,y,y
1
1
1
y,y,m256
1
1
1
1
1
21-45
1+
5
1
1
7
1+
3
10-22
20-44
20-44
1
1
2
2
AVX
AVX
AVX
AVX
1
CMPccSS/D CMPccPS/D
VCMPccPS/D
VCMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
VMAXPS/D VMINPS/D
ROUNDSS/SD/PS/PD
ROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD
Math
SQRTSS/PS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS
1
1
1
3
1+
2
1
3
1
3
1
3
1+
3
1
3
1+
12
1
12
1+
1
1
1
9
1
1
1
10-14
1
1+
1
1
10-21
1
21-43
1+
1
1
5
1
7
1+
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
4
2
4
2
2
10-14
10-14
21-28
21-28
10-21
10-21
21-43
21-43
1
1
2
2
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX
Logic
AND/ANDN/OR/XORPS/PD
AND/ANDN/OR/XORPS/PD
VAND/ANDN/OR/XORPS/
PD
VAND/ANDN/OR/XORPS/
PD
1
1
1
1
1
AVX
1
AVX
1
1+
(V)XORPS/PD
x/y,x/y,same
1
Other
VZEROUPPER
4
VZEROALL
12
VZEROALL
LDMXCSR
STMXCSR
VSTMXCSR
FXSAVE
FXRSTOR
XSAVEOPT
m32
m32
m32
m4096
m4096
m
0
0
0.25
2
1
11
20
3
3
3
3
3
3
130
116
100-161
1
1
1
1
1
1
1
9
3
1
1
68
72
1
1
60-500
AVX
AVX,
32 bit
AVX,
64 bit
AVX
Ivy Bridge
Intel Ivy Bridge
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, y = 256 bit ymm register, same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.
μops fused domain: The number of μops at the decode, rename, allocate and retirement stages in the pipeline. Fused μops count as one.
μops unfused domain: The number of μops for each execution port. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if the sum of the numbers listed under p015 + p23 + p4 exceeds the number listed under μops fused domain. A number indicated as 1+ under a read or write port means a 256-bit read or write operation using two clock cycles for handling 128 bits each cycle. The port cannot receive another read or write μop in the second clock cycle, but a read port can receive an address-calculation μop in the second clock cycle. An x under p0, p1 or p5 means that at least one of the μops listed under p015 can optionally go to this port. For example, a 1 under p015 and an x under p0 and p5 means one μop which can go to either port 0 or port 5, whichever is vacant first. A value listed under p015 but nothing under p0, p1 and p5 means that it is not known which of the three ports these μops go to.
p015: The total number of μops going to port 0, 1 and 5.
p0: The number of μops going to port 0 (execution units).
p1: The number of μops going to port 1 (execution units).
p5: The number of μops going to port 5 (execution units).
p23: The number of μops going to port 2 or 3 (memory read or address calculation).
p4: The number of μops going to port 4 (memory write data).
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
The latencies and throughputs listed below for addition and multiplication using
full size YMM registers are obtained only after a warm-up period of a thousand
instructions or more. The latencies may be one or two clock cycles longer and
the reciprocal throughputs double the values for shorter sequences of code.
There is no warm-up effect when vectors are 128 bits wide or less.
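Because the tables are expressed in core clock cycles rather than the reference clock cycles counted by the time stamp counter, a measurement taken with the time stamp counter has to be rescaled before it can be compared with the figures listed here. The sketch below shows one way to do that conversion. It is not the test method used for these tables; the function names are hypothetical, and it assumes that both frequencies are known and that the core frequency stays constant during the measurement.

```python
def tsc_to_core_cycles(tsc_ticks, tsc_hz, core_hz):
    """Rescale time stamp counter ticks to core clock cycles.

    Assumes a constant core frequency over the measured interval.
    """
    return tsc_ticks * core_hz / tsc_hz

def cycles_per_instruction(tsc_ticks, tsc_hz, core_hz, instructions):
    """Average core clock cycles per instruction for a measured sequence."""
    return tsc_to_core_cycles(tsc_ticks, tsc_hz, core_hz) / instructions

if __name__ == "__main__":
    # Hypothetical measurement: 3000 reference ticks at a 3.0 GHz reference
    # clock while the core runs at 3.6 GHz, covering 1200 instructions.
    print(tsc_to_core_cycles(3000, 3_000_000_000, 3_600_000_000))
    print(cycles_per_instruction(3000, 3_000_000_000, 3_600_000_000, 1200))
```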
Integer instructions
Instruction
Operands
μops fused domain; μops unfused domain (p015, p0, p1, p5, p23, p4); Latency; Reciprocal throughput; Comments
Move instructions
MOV
MOV
MOV
r,i
r8/16,r8/16
r32/64,r32/64
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
MOV
MOV
MOV
r8/16,m8/16
r32/64,m32/64
r,m
1
1
1
1
x
x
x
MOV
MOV
MOVNTI
MOVSX MOVSXD
MOVZX
MOVZX
m,r
m,i
m,r
r,r
r16,r8
r32/64,r8
1
1
2
1
1
1
MOVZX
MOVSX MOVZX
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
r32/64,r16
r16,m8
r32/64,m
r,r
r,m
r,r
r,m
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
1
2
1
1
1
x
x
x
x
x
x
2
2
3
7
2
2
3
x
x
x
x
x
x
x
x
x
x
2
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
LEA
r16,m
r32/64,m
3
1
1
2
2
3
19
1
3
2
9
18
1
3
2
1
LEA
r32/64,m
1
1
r32
r64
m
m
1
2
1
1
2
3
2
1
2
BSWAP
BSWAP
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
r
i
m
(E/R)SP
r
(E/R)SP
m
1
1
1
1
1
x
x
x
x
x
x
x
x
x
2
x
x
x
8
10
1
3
2
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
x
1
1
1
2
1
1
8
1
1
2
1
8
3
1
1
1
1
1
8
1
1
0.5
0.5
1
3
~340
1
1
0-1
1
1
1
0.33
0.33
0.25
1
3
2
0.33
0.5
0.5
2
0.67
~0.8
1
x
1
1
2
25
7
3
1
1
2-4
1
3
1
1
2
1
1
43
43
4
36
6
3
1
1
may be
elimin.
64 b abs
address
may be
elimin.
implicit
lock
1
1
1
1
1
1
8
0.5
0.5
1
18
9
1
1
1
0.5
1
1
1
2
2
2
2
1
x
0.33
0.33
0.25
1
2
1
2
3
1
1
1
1
1
0-1
not 64 bit
not 64 bit
not 64 bit
1-2 components
3 components
or RIP
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
POPCNT
POPCNT
CRC32
CRC32
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
r8
r16
r32
r64
r,r
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r,m
r16,m16,i
r32,m32,i
r64,m64,i
r8
r16
r32
r64
r8
r16
r32
r64
r,r
r,m
r,r
r,m
1
1
2
2
2
4
1
1
1
3
2
3
3
8
1
4
3
2
1
2
1
1
1
4
3
2
1
2
1
1
11
11
10
35-57
11
11
9
59-134
1
1
1
2
1
1
1
1
1
1
1
1
1
2
2
3
1
1
1
1
2
3
3
8
1
4
3
2
1
2
1
1
1
3
2
1
1
2
1
1
11
11
10
x
11
11
9
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
2
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
x
x
x
x
x
x
x
x
x
x
x
1
1
2
1
1
2
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
6
2
2
7-8
1
1
1
6
4
4
2
20
3
4
4
3
3
4
3
3
3
19-22
20-24
19-27
29-94
20-23
20-24
19-26
28-103
Logic instructions
x
x
x
x
1
1
1
1
x
x
x
x
x
x
1
1
1
1
1
1
3
1
3
1
0.33
0.5
1
1
1
2
0.33
0.5
0.33
1
8
1
2
2
1
1
1
1
1
1
2
2
2
1
1
1
1
9
10
11
22-76
8
8
8-11
26-88
not 64 bit
not 64 bit
not 64 bit
not 64 bit
0.33
0.5
1
1
1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCL RCR
RCL RCR
RCL RCR
RCL RCR
RCL RCR
SHRD SHLD
SHRD SHLD
SHRD SHLD
SHRD SHLD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC CMC
CLD STD
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
m,r,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
Control transfer instructions
JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
Fused arithmetic and
branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET
short
short
short
near
r
m
i
1
1
2
1
1
1
3
2
5
2
1
4
2
5
3
8
11
8
11
1
3
4
5
1
10
2
1
11
3
1
1
1
2
1
1
3
1
1
1
1
1
1
1
2
3
2
1
3
2
3
3
8
8
8
8
1
3
4
4
1
9
1
1
8
2
1
1
1
1
0
1
3
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
1
1
2
1
2
1
1
1
1
2
1
2
1
1
2
5
2
1
2
1
2
1
2
1
5
1
2
1
1
1
1
2
1
1
1
3
1
x
x
1
1
1
0.33
0.5
1
0.33
0.5
0.5
2
1
4
1
0.5
2
1
4
2
5
6
5
6
0.5
2
2
4
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25
0.33
4
x
x
1
1
1
1
1
1
1
1
1
0
0
0
0
2
2
2
1-2
1
1
1
0
1-2
2
7
11
2
2
3
2
3
2
7
11
1
1
1
1
2
x
x
x
x
6
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
2
x
1
x
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1-2
4-5
6
2
2
2
2
2
short form
fast if no
jump
fast if no
jump
BOUND
INTO
r,m
15
4
13
4
x
x
x
String instructions
LODS
REP LODS
STOS
REP STOS
3
~5n
3
many
2
x
x
1
x
x
REP STOS
many
MOVS
REP MOVS
5
2n
REP MOVS
4/16B
SCAS
REP SCAS
CMPS
REP CMPS
3
~6n
5
~8n
2
4
8
7
5
9
14
18
22
24
3
5
5
3
6
11
15
19
21
Synchronization instructions
XADD
LOCK XADD
LOCK ADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP / Long NOP
PAUSE
ENTER
ENTER
LEAVE
XGETBV
CPUID
RDTSC
RDPMC
RDRAND
m,r
m,r
m,r
m,r
m,r
m,r
m,r
m,r
m,r
a,0
a,b
r
2
7
6
x
1
1
x
1
not 64 bit
not 64 bit
~2n
1
1
n
worst
case
best
case
1/16B
2
x
x
x
2
1
4
n
worst
case
best
case
1/16B
x
x
x
1
1
~2n
3
x
x
x
2
4
~2n
1
0
7
7
12
9
45+7b
3
2
8
37-82
21
35
13
12
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
2
1
2
2
2
2
2
2
x
x
x
2
x
x
x
1
1
1
1
1
1
1
1
1
1
7
22
22
7
22
7
22
16
27
0.25
10
8
1
84+3b
6
9
XGETBV
100-340
x
x
x
1
27
39
104-117 RDRAND
Floating point x87 instructions
Instruction
Operands
μops fused domain; μops unfused domain (p015, p0, p1, p5, p23, p4); Latency; Reciprocal throughput; Comments
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FADD(P) FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
r
AX
m16
m16
m16
r
m
m
r
m
r
m
r
m
r
m
r
m
m
m
m
1
1
4
43
1
1
7
243
1
1
3
3
1
2
2
3
2
2
3
2
1
1
143
90
1
1
2
40
1
1
1
1
2
1
2
1
2
1
1
1
1
2
3
2
2
2
2
1
2
28
41
17
1
1
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
25
17
1
21-78
25
17
1
1
1
0
6
7
7
1
2
3
1
1
2
3
0
1
1
1
1
2
2
3
2
1
2
1
1
1
1
2
1
3
5
45
1
4
5
1
1
1
1
1
1
1
1
x
1
1
1
1
1
2
x
2
1
1
1
1
1
2
4
1
1
1
1
1
3
5
1
10-24
1
1
1
3
1
1
1
4
5
1
1
1
1
17
x
x
1
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
2
1
2
1
1
2
21
1
1
5
252
0.5
1
1
2
x
x
x
x
x
x
2
2
1
1
3
1
1
1
167
162
1
1
1
1
8-18
8-18
1
1
1
1
1
1
2
2
21-26
27-50
22
2
1
2
12
19
11
49
10
10-23
47-106
49
10
8-17
47-106
SSE3
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
23-100
20-110
16-23
42
42
56
56
102 102
28-72
Other
FNOP
WAIT
FNCLEX
FNINIT
1
2
5
26
1
2
5
26
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
48-115
50-123
~68
90-106
82
130
94-150
48-115
50-123
~68
1
1
22
80
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA MOVDQU
MOVDQA MOVDQU
MOVDQA MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PUNPCKH/LQDQ
PMOVSX/ZXBW
Operands
μops fused domain
μops unfused domain: p015 p0 p1 p5 p23 p4
Latency
Reciprocal throughput
Comments
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
x, m128
mm, x
x,mm
m64,mm
m128,x
x, m128
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
mm,mm
1
1
1
mm,m64
1
1
1
x,x
1
1
x
x
x,m128
x,x
x,m
(x)mm,(x)mm
(x)mm,m
x,x
x, m128
x,x
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
x
x
x
1
1
1
1
2
1
x
x
1
1
1
x
x
1
1
1
1
1
1
x
1
1
1
3
1
3
1
3
3
0-1
3
3
3
1
1
~360
~360
3
1
1
1
0.5
0.33
0.5
1
0.25
0.5
1
0.5
1
0.33
1
1
0.5
1
1
1
1
eliminat.
SSE3
SSE4.1
1
1
0.5
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
1
1
1
1
SSE4.1
SSE4.1
SSE4.1
PMOVSX/ZXBW
PMOVSX/ZXBD
PMOVSX/ZXBD
PMOVSX/ZXBQ
PMOVSX/ZXBQ
PMOVSX/ZXWD
PMOVSX/ZXWD
PMOVSX/ZXWQ
PMOVSX/ZXWQ
PMOVSX/ZXDQ
PMOVSX/ZXDQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
PBLENDW
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRB
PEXTRB
PEXTRW
PEXTRW
PEXTRD
PEXTRD
PEXTRQ
PEXTRQ
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD
PINSRD
PINSRQ
PINSRQ
x,m64
x,x
x,m32
x,x
x,m16
x,x
x,m64
x,x
x,m32
x,x
x,m64
(x)mm,(x)mm
(x)mm,m
mm,mm,i
mm,m64,i
xmm,x,i
x,m128,i
x,x,i
x, m128,i
(x)mm,(x)mm,i
(x)mm,m,i
x,x,xmm0
x,m,xmm0
x,x,i
x,m,i
mm,mm
x,x
r32,(x)mm
r32,x,i
m8,x,i
r32,(x)mm,i
m16,(x)mm,i
r32,x,i
m32,x,i
r64,x,i
m64,x,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
x,r64,i
x,m64,i
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
2
2
3
1
2
4
10
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
4
1
2
1
1
1
2
1
2
1
2
1
2
1
1
1
1
1
(x)mm, (x)mm
(x)mm,m
(x)mm, (x)mm
(x)mm,m64
(x)mm,(x)mm
(x)mm,m
x,x
1
1
3
4
1
1
1
1
1
3
3
1
1
1
1
x
1
1
1
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
2
4
1
2
2
2
1
1
1
1
2
2
1
1
1
1
2
2
1
2
1
2
1
2
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
1
6
1
1
1
1
1
1
1
1
1
1
0.5
1
0.5
1
0.5
1
0.5
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
Arithmetic instructions
PADD/SUB(U,S)B/W/D/Q
PADD/SUB(U,S)B/W/D/Q
PHADD/SUB(S)W/D
PHADD/SUB(S)W/D
PCMPEQ/GTB/W/D
PCMPEQ/GTB/W/D
PCMPEQQ
1
1
3
1
1
1
1
0.5
0.5
1.5
1.5
0.5
0.5
0.5
SSSE3
SSSE3
SSE4.1
PCMPEQQ
PCMPGTQ
PCMPGTQ
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/MAXSB
PMIN/MAXSB
PMIN/MAXUB
PMIN/MAXUB
PMIN/MAXSW
PMIN/MAXSW
PMIN/MAXUW
PMIN/MAXUW
PMIN/MAXU/SD
PMIN/MAXU/SD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x
x,m
x,x
x,m128
x,x
x,m128
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
(x)mm,(x)mm
(x)mm,m
x,x,i
x,m,i
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
3
Logic instructions
PAND(N) POR PXOR
PAND(N) POR PXOR
PTEST
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
(x)mm,(x)mm
(x)mm,m
x,x
x,m128
mm,mm/i
mm,m64
xmm,i
x,x
x,m128
x,i
1
1
2
3
1
1
1
2
3
1
1
1
2
2
1
1
1
2
2
1
x
x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
5
1
5
1
5
1
5
1
5
1
5
1
5
1
5
1
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
x
1
1
x
x
1
1
1
1
1
1
1
String instructions
1
1
1
1
1
1
1
1
1
1
1
1
5
1
x
x
x
x
1
1
1
1
1
x
x
x
x
1
1
1
1
5
1
1
1
1
1
x
x
x
x
x
x
x
x
6
1
1
1
1
1
1
1
x
x
x
x
x
x
1
2
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
0.5
0.5
1
1
1
1
0.33
0.5
1
1
1
1
1
1
1
0.5
SSE4.1
SSE4.2
SSE4.2
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
Encryption instructions
PCLMULQDQ
PCLMULQDQ
AESDEC, AESDECLAST,
AESENC, AESENCLAST
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
8
8
8
8
3
4
3
4
8
7
8
7
3
3
3
3
3
3
3
3
3
3
3
3
1
1
1
1
4
3
4
3
x,x,i
x,m,i
18
18
18
17
x
x
x
x
x
x
x,x
2
2
x
x
1
x,m
x,x
x,m
x,x,i
x,m,i
3
2
3
11
11
2
2
2
11
10
x
x
1
2
2
x
x
31
31
4
3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
14
8
8
CLMUL
CLMUL
4
1
AES
1
2
2
8
7
AES
AES
AES
AES
AES
1
12
1
4
4
4
4
3
1
3
11
1
1
AESDEC, AESDECLAST,
AESENC, AESENCLAST
AESIMC
AESIMC
AESKEYGENASSIST
AESKEYGENASSIST
Other
EMMS
x
x
x
x
1
14
1
10
1
18
Floating point XMM and YMM instructions
Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVH/LPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
Operands
μops fused domain
μops unfused domain: p015 p0 p1 p5 p23 p4
Latency
Reciprocal throughput
Comments
x,x
y,y
x,m128
1
1
1
y,m256
m128,x
1
1
m256,y
x,x
x,m32/64
m32/64,x
x,m64
m64,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i
1
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1+
1
1
1
1+
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1+
0-1
0-1
3
≤1
≤1
0.5
elimin.
elimin.
4
3
1
1
AVX
4
1
3
3
4
3
1
2
2
~380
~380
1
2
1
0.5
1
1
1
1
1
1
1
2
1
AVX
AVX
SHUFPS/D
VSHUFPS/D
VSHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
BLENDPS/PD
BLENDPS/PD
VBLENDPS/PD
VBLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VMOVDDUP
VMOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
VMOVSH/LDUP
VMOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
VUNPCKH/LPS/D
VUNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
x,m128,i
y,y,y,i
y, y,m256,i
x,x,x/i
y,y,y/i
x,x,m
y,y,m
x,m,i
y,m,i
y,y,y,i
y,y,m,i
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,xmm0
x,m,xmm0
y,y,y,y
y,y,m,y
x,x
x,m64
y,y
y,m256
x,m32
y,m32
y,m64
y,m128
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y,y
y,y,m256
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
x,x,m128
y,y,m256
m128,x,x
m256,y,y
2
1
2
1
1
2
2
2
2
1
2
1
2
1
2
2
3
2
3
1
1
1
1
1
2
2
2
1
1
1
1
1
1
1
1
2
3
1
2
1
2
1
2
3
3
4
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
x,x
x,m128
x,y
x,m256
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
x
x
x
x
x
x
x
x
1
x
x
x
x
x
x
x
x
1
1
1+
1
1
1
1+
1
1+
2
1+
1
1
1
1+
2
1
2
1+
1
3
1
3
4
5
5
5
1
3
1
1
1
1
1
1
1
1
1
1
1
1
1+
1
1
1
1
1
1
1
1
1
1
1
2
2
1
0
1
1
1
1
2
2
2
2
1
1
1
1
x
x
1
1+
x
x
1
1
1
x
x
x
x
x
x
x
x
1
1
1
1
1
1
1
1
1
1
1
1+
2
1
1
1
1
2
4
1
1
1
1
1+
1 1
1 1+
2
4
4
5
4
1
4
1+
1
1
1
1
1
1
1
1
1
1
1
0.5
0.5
0.5
1
1
1
1
1
1
0.5
1
1
0.5
1
1
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX
AVX
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE3
SSE3
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX
1
1
1
1
AVX
AVX
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32
x,m32
r32,x
r32,m64
x,v,i
m,v,i
v,x
v,m
2
2
2
2
2
3
2
2
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
2
2
2
3
3
2
2
2
2
2
1
2
2
2
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
1
2
2
2
2
2
1
2
2
2
1
2
2
3
2
2
1
Arithmetic
ADDSS/D SUBSS/D
ADDSS/D SUBSS/D
ADDPS/D SUBPS/D
ADDPS/D SUBPS/D
VADDPS/D VSUBPS/D
VADDPS/D VSUBPS/D
ADDSUBPS/D
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
1
1
1
1
1
1
4
1
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
1
3
1+
3
1
3
1+
1
1
1
1
1
1
1
1
4
1
5
1
4
1
5
1+
1
1
1
1
1
1
1
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
4
1
1
4
1
1
1
1
1
1
4
1
4
1
4
1
4
1
1
4
1
4
1
1
10
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
6
1
3
1
3
1
3
1+
3
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
1
1
1
1
F16C
F16C
F16C
F16C
1
1
1
1
1
1
1
AVX
AVX
SSE3
ADDSUBPS/D
VADDSUBPS/D
VADDSUBPS/D
HADDPS/D HSUBPS/D
HADDPS/D HSUBPS/D
VHADDPS/D
VHSUBPS/D
VHADDPS/D
VHSUBPS/D
MULSS MULPS
MULSS MULPS
VMULPS
VMULPS
MULSD MULPD
MULSD MULPD
VMULPD
VMULPD
DIVSS DIVPS
DIVSS DIVPS
VDIVPS
VDIVPS
DIVSD DIVPD
DIVSD DIVPD
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D CMPccPS/D
x,m128
y,y,y
y,y,m256
x,x
x,m128
1
1
1
3
4
1
1
1
3
3
1
1
1
1
1
2
2
y,y,y
3
3
1
2
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256
4
1
1
1
1
1
1
1
1
1
1
3
4
1
1
3
4
1
1
3
4
3
1
1
1
1
1
1
1
1
1
1
3
3
1
1
3
3
1
1
3
3
1
2
x,x
1
1
1
x,m128
y,y,y
y,y,m256
x,x
x,m32/64
x,x
x,m32/64
x,x
x,m128
y,y,y
y,y,m256
x,x,i
x,m128,i
y,y,i
y,m256,i
x,x,i
x,m128,i
y,y,y,i
y,m256,i
x,x,i
x,m128,i
2
1
2
2
2
1
1
1
1
1
1
1
2
1
2
4
6
4
6
3
4
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
4
5
4
5
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2
2
1
1
2
2
1
3
1+
5
1
5
1+
5
1
5
1+
5
1
5
1+
10-13
1
1
1
19-21
1+
10-20
1
1
1
20-35
1+
5
1
1
1
7
1+
3
1
1
1
2
2
SSE3
AVX
AVX
SSE3
SSE3
2
AVX
2
1
1
1
1
1
1
1
1
7
7
14
14
8-14
8-14
16-28
16-28
1
1
2
2
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
1
CMPccSS/D CMPccPS/D
VCMPccPS/D
VCMPccPS/D
COMISS/D UCOMISS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
MAXPS/D MINPS/D
VMAXPS/D VMINPS/D
VMAXPS/D VMINPS/D
ROUNDSS/SD/PS/PD
ROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
VROUNDSS/SD/PS/PD
DPPS
DPPS
VDPPS
VDPPS
DPPD
DPPD
1
1
1
1
1
1
1
1
1
3
1+
1
3
1
3
1
3
1+
3
1
3
1+
1
2
1
2
1
1
12
1
12
1+
9
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
4
2
4
1
1
AVX
AVX
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
Math
SQRTSS/PS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
1
1
3
4
1
1
3
4
1
1
3
4
1
1
3
3
1
1
3
3
1
1
3
3
x,x
x,m128
1
1
y,y,y
1
1
2
2
1
1
2
2
1
1
2
2
11
1
1
1
19
1+
16
1
1
1
28
1+
5
1
1
1
1+
7
1
1
1
1
1
1
1
1
y,y,m256
1
1
1
m32
m32
m4096
m4096
m
4
12
20
3
3
130
116
100-161
0
2
2
2
2
7
7
14
14
8-14
8-14
16-28
16-28
1
1
2
2
AVX
AVX
AVX
AVX
AVX
AVX
Logic
AND/ANDN/OR/XORPS/PD
AND/ANDN/OR/XORPS/PD
VAND/ANDN/OR/XORPS/PD
VAND/ANDN/OR/XORPS/PD
Other
VZEROUPPER
VZEROALL
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
XSAVEOPT
1
1
1
1
1
1
1
1
1
AVX
1
AVX
1
11
9
3
1
66
68
AVX
32 bit
64 bit
1+
1
1
1
6
7
60-500
Haswell
Intel Haswell
List of instruction timings and μop breakdown
Explanation of column headings:
Instruction:
Name of instruction. Multiple names mean that these instructions have the same data.
Instructions with or without V name prefix behave the same unless otherwise noted.
Operands:
i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register,
(x)mm = mmx or xmm register, y = 256 bit ymm register, v = any vector register (mmx,
xmm, ymm). same = same register for both operands. m = memory operand, m32 = 32 bit memory operand, etc.
μops fused domain:
The number of μops at the decode, rename and allocate stages in the pipeline. Fused μops count as one.
μops unfused domain:
The total number of μops for all execution ports. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if this number is higher than the number under fused domain. Some operations are not counted here if they do not go to any execution port or if the counters are inaccurate.
μops each port:
The number of μops for each execution port. p0 means a μop to execution port 0. p01 means a μop that can go to either port 0 or port 1. p0 p1 means two μops going to port 0 and 1, respectively.
Port 0: Integer, f.p. and vector ALU, mul, div, branch
Port 1: Integer, f.p. and vector ALU
Port 2: Load
Port 3: Load
Port 4: Store
Port 5: Integer and vector ALU
Port 6: Integer ALU, branch
Port 7: Store address
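The port notation used in the "μops each port" column can be read mechanically. The following Python sketch is not part of the original tables; it only illustrates one way to split an entry such as p23 p0156 into its parts. The function names parse_port_column and count_port_uops are hypothetical, and the reading of a leading count (2p0156 taken as two μops that can each go to port 0, 1, 5 or 6) is an assumption based on how such entries appear in the tables.

```python
import re

def parse_port_column(cell):
    """Split a 'uops each port' entry into (count, ports) pairs.

    'p0 p1'  -> one uop for port 0 and one uop for port 1
    'p01'    -> one uop that can go to either port 0 or port 1
    '2p0156' -> assumed to mean two uops, each for any of ports 0, 1, 5, 6
    """
    parsed = []
    for token in cell.split():
        match = re.fullmatch(r"(\d*)p(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognised port token: {token!r}")
        # A missing count prefix means a single uop.
        count = int(match.group(1)) if match.group(1) else 1
        # Each digit after 'p' names one candidate execution port.
        ports = tuple(int(digit) for digit in match.group(2))
        parsed.append((count, ports))
    return parsed

def count_port_uops(cell):
    """Sum the uops listed in one 'uops each port' entry."""
    return sum(count for count, _ports in parse_port_column(cell))

if __name__ == "__main__":
    print(parse_port_column("p23 p0156"))
    print(count_port_uops("2p0156 p23"))
```

The sum returned by count_port_uops does not have to equal the number in the unfused domain column, because some operations are not counted there, as noted above.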
Latency:
This is the delay that the instruction generates in a dependency chain. The numbers are
minimum values. Cache misses, misalignment, and exceptions may increase the clock
counts considerably. Where hyperthreading is enabled, the use of the same execution
units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput:
The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
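The difference between latency and reciprocal throughput can be illustrated with a small calculation. The Python sketch below is an illustration only and is not taken from the tables: the function names are hypothetical, the numbers in the example (latency 3, reciprocal throughput 1 and 0.25) are illustrative values rather than measurements for a specific instruction, and the results are idealized minimums that ignore all other bottlenecks such as decoding, cache misses and port contention with other instructions.

```python
def dependent_chain_cycles(n, latency):
    """Minimum core clock cycles for n instructions in a dependency chain.

    Each instruction has to wait for the result of the previous one,
    so the latencies add up.
    """
    return n * latency

def independent_cycles(n, reciprocal_throughput):
    """Minimum core clock cycles for n independent instructions of the
    same kind in the same thread.

    Nothing waits for a previous result, so the reciprocal throughput
    (average cycles per instruction) is the limiting figure.
    """
    return n * reciprocal_throughput

def instructions_per_cycle(reciprocal_throughput):
    """Convert a reciprocal throughput into instructions per clock cycle."""
    return 1.0 / reciprocal_throughput

if __name__ == "__main__":
    # Illustrative values: latency 3, reciprocal throughput 1.
    print(dependent_chain_cycles(1000, 3))   # chained: latency bound
    print(independent_cycles(1000, 1))       # independent: throughput bound
    # A reciprocal throughput of 0.25 corresponds to 4 instructions per cycle.
    print(instructions_per_cycle(0.25))
```

With these illustrative values, a chain of 1000 dependent instructions needs at least 3000 cycles, while 1000 independent instructions of the same kind need at least 1000 cycles.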
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
Operands
μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments
r,i
r8/16,r8/16
r32/64,r32/64
r8l,m
r8h,m
r16,m
r32/64,m
1
1
1
1
1
1
1
1
1
1
2
1
2
1
p0156
p0156
p0156
p23 p0156
p23
p23 p0156
p23
m,r
1
2
p237 p4
2
0.25
0.25
0.25
0.5
0.5
0.5
0.5
3
1
1
0-1
may be elim.
all addressing
modes
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
m,i
m,r
r,r
1
2
1
2
2
1
p237 p4
p23 p4
p0156
r16,m8
r,m
1
1
2
1
p23 p0156
p23
r,r
r,m
r,r
r,m
2
3
3
8
3
2
2
3
3
4
19
1
3
3
9
18
1
3
2
2p0156
2p0156 p23
3p0156
r16,m
2
3
3
8
3
1
1
2
2
3
11
1
3
2
9
18
1
3
2
LEA
r32/64,m
1
LEA
r32/64,m
LEA
BSWAP
BSWAP
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
PREFETCHNTA/0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
~400
1
1
1
0.25
0.5
0.5
2
2
21
7
3
0.5
1
1
implicit lock
p23
p23 2p0156
2p237 p4
2
p06
3p0156
p1 p0156
1
1
4
2
1
1
1
1
1
8
0.5
4
1
18
9
1
1
1
1
p15
1
0.5
1
1
p1
3
1
r32/64,m
1
1
p1
r32
r64
r16,m16
r32,m32
r64,m64
m16,r16
m32,r32
m64,r64
1
2
3
2
3
2
2
3
1
2
3
2
3
3
3
4
p15
p06 p15
2p0156 p23
p15 p23
2p0156 p23
p06 p237 p4
p15 p237 p4
p06 p15 p237 p4
m
1
1
p23
0.5
2
3
2
2
2
none counted
p23 p4
p23 p4
4
33
5
1
1
1
2
p0156
p0156 p23
r
i
m
stack pointer
r
stack pointer
m
r,r/i
r,m
p237 p4
p237 p4
p4 2p237
p0156 p237 p4
p1 p4 p237 p06
1
1
2
1
all other
combinations
0.5
1
0.5
0.5
0.5
1
1
1
0.25
0.5
not 64 bit
not 64 bit
not 64 bit
16 or 32 bit
address size
1 or 2 components in
address
3 components
in address
rip relative
address
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
ADD SUB
m,r/i
2
4
2p0156 2p237 p4
6
1
ADC SBB
ADC SBB
ADC SBB
r,r/i
r,m
m,r/i
2
2
4
2
3
6
2p0156
2p0156 p23
2
3p0156 2p237 p4
7
1
1
2
CMP
CMP
INC DEC NEG
NOT
INC DEC NOT
NEG
AAA
AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MULX
MULX
MULX
MULX
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
POPCNT
POPCNT
r,r/i
m,r/i
r
1
1
1
1
2
1
p0156
p0156 p23
p0156
1
1
1
0.25
0.5
0.25
m
m
3
2
2
2
3
3
8
1
4
3
2
1
4
3
2
1
1
2
1
1
2
1
1
3
3
2
2
9
11
10
36
9
10
9
59
1
1
1
2
1
1
1
1
4
4
2
2
3
3
8
1
4
3
2
2
5
4
3
1
2
2
1
1
3
2
2
3
4
2
3
9
11
10
36
9
10
9
59
1
1
1
2
1
1
1
2
p0156 2p237 p4
p0156 2p237 p4
p1 p0156
p1 p56
p1 2p0156
p1 2p0156
p0 p1 p5 p6
p1
p1 p0156
p1 p0156
p1 p6
p1 p23
p1 3p0156 p23
p1 2p0156 p23
p1 p6 p23
p1
p1 p23
p1 p0156
p1
p1
p1 p0156 p23
p1 p23
p1 p23
p1 2p056
p1 2p056 p23
p1 p6
p1 p6 p23
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0156
p0156
p0156
p0156
p06
p06
p1
p1 p23
6
6
4
6
4
4
21
3
4
4
3
1
1
r8
r16
r32
r64
m8
m16
m32
m64
r,r
r,m
r16,r16,i
r32,r32,i
r64,r64,i
r16,m16,i
r32,m32,i
r64,m64,i
r32,r32,r32
r32,r32,m32
r64,r64,r64
r64,r64,m64
r8
r16
r32
r64
r8
r16
r32
r64
r,r
r,m
3
4
3
3
4
4
22-25
23-26
22-29
32-96
23-26
23-26
22-29
39-103
1
1
1
1
1
1
3
8
1
2
2
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
9
9
9-11
21-74
8
8
8-11
24-81
1
1
not 64 bit
not 64 bit
not 64 bit
not 64 bit
not 64 bit
AVX2
AVX2
AVX2
AVX2
SSE4.2
SSE4.2
CRC32
CRC32
r,r
r,m
1
1
1
2
p1
p1 p23
3
1
1
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
r,r/i
r,m
m,r/i
1
1
2
1
2
4
p0156
p0156 p23
1
2p0156 2p237 p4
6
0.25
0.5
1
r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,1
m,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
r,r,cl
m,r,cl
r,r,r
r,m,r
r,r,i
r,m,i
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
1
1
1
3
3
5
2
1
4
3
5
3
4
8
11
8
11
1
3
4
4
5
1
2
1
2
1
10
2
1
10
3
1
1
1
2
1
1
1
3
1
1
1
1
1
2
1
4
3
6
2
1
5
3
6
3
6
8
11
8
11
1
5
4
4
7
1
2
1
2
1
10
2
1
11
4
1
2
1
3
0
1
1
3
1
2
1
2
p0156
p0156 p23
p06
2p06 p237 p4
3p06
3p06 2p23 p4
2p06
p06
2p06 2p237 p4
3p06
1
2p06 p0156
2
p0156
6
p0156
6
p1
3
p0156
p0156
3
4
p06
p06 p23
p06
p06 p23
p06
1
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
SHRD SHLD
SHRD SHLD
SHLD
SHRD
SHRD SHLD
SHLX SHRX SARX
SHLX SHRX SARX
RORX
RORX
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC
CMC
CLD STD
LZCNT
LZCNT
TZCNT
TZCNT
r,r
r,m
r,r
r,m
p06 p23
p06
2p06 p23 p4
p1
p1 p23
p06
p06 p237 p4
none
p0156
p0156
p15 p6
p1
p1 p23
p1
p1 p23
1
2
1
1
2
1
1
1
3
1
0.25
0.5
0.5
2
2
4
1
0.5
2
2
4
2
3
6
6
6
6
1
2
2
2
4
0.5
0.5
0.5
0.5
0.5
5
0.5
0.5
5
2
1
1
0.5
1
0.25
0.25
SSE4.2
SSE4.2
short form
BMI2
BMI2
BMI2
BMI2
1
3
3
4
1
1
1
1
LZCNT
LZCNT
BMI1
BMI1
ANDN
ANDN
BLSI BLSMSK
BLSR
BLSI BLSMSK
BLSR
BEXTR
BEXTR
BZHI
BZHI
PDEP
PDEP
PEXT
PEXT
r,r,r
r,r,m
r,r
1
1
1
1
2
1
p15
p15 p23
p15
r,m
1
2
p15 p23
r,r,r
r,m,r
r,r,r
r,m,r
r,r,r
r,r,m
r,r,r
r,r,m
2
3
1
1
1
1
1
1
2
3
1
2
1
2
1
2
2p0156
2p0156 p23
p15
p15 p23
p1
p1 p23
p1
p1 p23
Control transfer instructions
JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
1
1
1
1
1
1
2
1
p6
p6
p23 p6
p6
1-2
2
2
1-2
Conditional jump
1
1
p06
0.5-1
1
1
p6
1-2
1
1
p06
0.5-1
2
7
11
2
2
3
1
3
15
4
2
7
11
3
3
4
2
4
15
4
p0156 p6
0.5-2
5
6
2
2
3
1
2
8
5
3
2
5n+12
3
<2n
2.6/32B
3
2
2p0156 p23
p0156 p23
3
p23 p0156 p4
MOVS
REP MOVS
REP MOVS
5
~2n
4/32B
5
SCAS
REP SCAS
3
≥6n
3
Fused arithmetic
and branch
Fused arithmetic
and branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET
BOUND
INTO
String instructions
LODSB/W
LODSD/Q
REP LODS
STOS
REP STOS
REP STOS
short/near
short
short
short
near
r
m
i
r,m
p237 p4 p6
p237 p4 p6
2p237 p4 p6
p237 p6
p23 2p6 p015
2p23 p4 2p0156
p23 2p0156
1
1
1
2
1
3
3
0.5
0.5
0.5
BMI1
BMI1
BMI1
0.5
BMI1
0.5
1
0.5
0.5
1
1
1
1
BMI1
BMI1
BMI2
BMI2
BMI2
BMI2
BMI2
BMI2
1
1
~2n
1
~0.5n
1/32B
4
~1.5 n
1/32B
1
≥2n
predicted
taken
predicted not
taken
predicted
taken
predicted not
taken
not 64 bit
not 64 bit
worst case
best case
aligned by 32
worst case
best case
aligned by 32
CMPS
REP CMPS
5
≥8n
5
Synchronization instructions
XADD
m,r
LOCK XADD
m,r
LOCK ADD
m,r
CMPXCHG
m,r
LOCK CMPXCHG
m,r
CMPXCHG8B
m,r
LOCK CMPXCHG8B
m,r
CMPXCHG16B
m,r
LOCK CMPXCHG16B
m,r
4
9
8
5
10
15
19
22
24
5
9
8
6
10
15
19
22
24
1
1
0
0
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
XGETBV
RDTSC
RDPMC
RDRAND
a,0
a,b
r
2p23 3p0156
5
5
12
12
~14+7b ~45+7b
3
3
8
8
15
15
34
34
17
17
4
≥2n
7
19
19
8
19
9
19
15
25
none
none
0.25
0.25
p05 3p6
9
8
~87+2b
2p0156 p23
6
9
24
37
~320
p23 16p0156
XGETBV
RDRAND
Floating point x87 instructions
Instruction
Operands
Move instructions
FLD
r
FLD
m32/64
FLD
m80
FBLD
m80
FST(P)
r
FST(P)
m32/m64
FSTP
m80
FBSTP
m80
FXCH
r
FILD
m
FIST(P)
m
FISTTP
m
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
r
FNSTSW
AX
FNSTSW
m16
FLDCW
m16
μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments
1
1
4
43
1
1
7
238
2
1
3
3
1
2
2
3
2
2
3
1
1
4
43
1
2
7
226
0
2
3
3
1
2
2
3
2
3
3
p01
p23
2p01 2p23
p01
p4 p237
3p0156 2p23 2p4
none
p01 p23
p1 p23 p4
p1 p23 p4
p01
2p01
2p01
2p0 p5
p0 p0156
p0 p4 p237
p01 p23 p6
1
3
4
47
1
4
1
0
6
7
7
2
6
7
0.5
0.5
2
22
0.5
1
5
265
0.5
1
1
2
1
2
2
2
1
1
2
SSE3
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P)
FSUB(R)(P)
FADD(P)
FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
Other
FNOP
WAIT
FNCLEX
FNINIT
m16
3
1
1
147
90
p237 p4 p6
p01
p01
0
r
m
m
2
1
1
147
90
r
1
1
p1
3
m
r
m
r
m
1
1
1
1
1
1
1
1
1
2
3
2
2
2
2
1
2
28
41
17
2
1
2
1
2
1
1
1
2
2
3
3
3
3
3
1
2
28
41
17
p1 p23
p0
p0 p23
p0
p0 p23
p0
p0
p1
p1 p23
2p01
3p01
2p1 p23
p0 p1 p23
p0 p1 p23
2p1 p23
p1
2p1
r
m
r
m
m
m
m
25-75
17
1
71-100
110
70-120
58-89
55-417
55-228
110-121
78-160
1
2
5
26
10-24
1
1
19
27
11
17
1
1
2
5
26
5
p0
p01
p01
p0156
Integer MMX and XMM instructions
49-125
15
10-23
47-106
112
52-123
63-68
58-680
58-360
130
96-156
1
0.5
0.5
150
164
1
1
1
1
8-18
8-18
1
1
1
1
1
1.5
2
2
2
1
2
13
17
23
11
8-17
0.5
1
22
83
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA/U
MOVDQA/U
MOVDQA/U
VMOVDQA/U
VMOVDQA/U
VMOVDQA/U
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
VMOVNTDQ
MOVNTDQA
VMOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/L
BW/WD/DQ
PUNPCKH/L
BW/WD/DQ
PUNPCKH/L
QDQ
PUNPCKH/L
QDQ
Operands
μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
r64,(x)mm
(x)mm,r64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
2
1
1
2
p0
p237 p4
p5
p23
p0
p5
p015
p23
p237 p4
p015
p23
p237 p4
1
3
1
3
1
1
1
3
3
0-1
3
3
1
1
1
0.5
1
1
0.33
0.5
1
0.33
0.5
1
y,y
y,m256
m256,y
x, m128
mm, x
x,mm
m64,mm
m128,x
m256,y
x, m128
y,m256
1
1
1
1
2
1
1
1
1
1
1
1
1
2
1
2
1
2
2
2
1
1
p015
p23
p237 p4
p23
p01 p5
p015
p237 p4
p237 p4
p237 p4
p23
p23
0-1
3
4
3
1
1
~400
~400
~400
3
3
0.33
0.5
1
0.5
1
0.33
1
1
1
0.5
0.5
mm,mm
3
3
p5
2
2
mm,m64
3
3
p23 2p5
x,x / y,y,y
1
1
p5
1
1
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
1
1
1
2
1
2
p23 p5
p5
p23 p5
1
1
1
1
v,v / v,v,v
1
1
p5
1
1
v,m / v,v,m
1
2
p23 p5
x,x / y,y,y
1
1
p5
x,m / y,y,m
2
2
p23 p5
PMOVSX/ZX BW
BD BQ DW DQ
x,x
1
1
p5
PMOVSX/ZX BW
BD BQ DW DQ
x,m
1
2
p23 p5
VPMOVSX/ZX BW
BD BQ DW DQ
y,x
1
1
p5
may be elim.
AVX
may be elim.
AVX
AVX
SSE3
AVX2
SSE4.1
AVX2
2
SSE4.1
SSE4.1
1
1
1
1
1
3
1
SSE4.1
1
SSE4.1
1
AVX2
VPMOVSX/ZX BW
BD BQ DW DQ
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD/Q
PINSRD/Q
VINSERTI128
VINSERTI128
y,m
v,v / v,v,v
v,m / v,v,m
mm,mm,i
mm,m64,i
v,v,i
v,m,i
v,v,i
v,m,i
v,v,i / v,v,v,i
v,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,v,i
v,v,m,i
y,y,y
y,y,m
y,y,i
y,m,i
y,y,y,i
y,y,m,i
mm,mm
x,x
v,v,m
m,v,v
r,v
r32,x,i
m8,x,i
x,y,i
m,y,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
y,y,x,i
y,y,m,i
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
2
1
2
1
1
1
2
1
2
4
10
3
4
1
2
2
1
2
2
2
2
2
2
2
1
2
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
2
1
2
1
2
1
2
1
2
4
10
3
4
1
2
3
1
2
2
2
2
2
2
2
1
2
p5 p23
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
2p5
2p5 p23
2p5
2p5 p23
p5
p23 p5
p015
p015 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p0 p4 2p23
4p04 2p56 4p23
p23 2p5
p0 p1 p4 p23
p0
p0 p5
p23 p4 p5
p5
p23 p4
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p015 p23
VPBROADCAST
B/W/D/Q
x,x
1
1
VPBROADCAST
B/W
x,m8/16
3
VPBROADCAST
D/Q
x,m32/64
VPBROADCAST
B/W/D/Q
VPBROADCAST
B/W
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
VPBLENDVB
VPBLENDVB
PBLENDW
PBLENDW
VPBLENDD
VPBLENDD
VPERMD
VPERMD
VPERMQ
VPERMQ
VPERM2I128
VPERM2I128
MASKMOVQ
MASKMOVDQU
VPMASKMOVD/Q
VPMASKMOVD/Q
PMOVMSKB
PEXTRB/W/D/Q
PEXTRB/W/D/Q
VEXTRACTI128
VEXTRACTI128
3
4
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
0.33
0.5
1
1
1
1
1
1
1
6
2
1
1
1
1
1
1
2
1
2
1
2
1
1
0.5
p5
1
1
AVX2
3
p01 p23 p5
5
1
AVX2
1
1
p23
4
0.5
AVX2
y,x
1
1
p5
3
1
AVX2
y,m8/16
3
3
p01 p23 p5
7
1
AVX2
1
1
1
1
1
2
2
1
1
3
3
3
13-413
14-438
4
13-14
3
2
3
4
2
2
2
AVX2
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX2
AVX2
VPBROADCAST
D/Q
y,m32/64
y,m128
x,[r+s*x],x
y,[r+s*y],y
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y
1
1
20
34
15
22
12
20
14
22
1
1
20
34
15
22
12
20
14
22
p23
p23
5
3
0.5
0.5
9
12
8
7
7
9
7
9
PADD/SUB(S,US)
B/W/D/Q
v,v / v,v,v
1
1
p15
1
0.5
PADD/SUB(S,US)
B/W/D/Q
v,m / v,v,m
1
2
p15 p23
v,v / v,v,v
3
3
p1 2p5
v,m / v,v,m
4
4
p1 2p5 p23
v,v / v,v,v
1
1
p15
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
1
1
1
1
1
2
1
2
1
2
p15 p23
p15
p15 p23
p0
p0 p23
v,v / v,v,v
1
1
p0
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
x,x / y,y,y
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
1
1
1
2
3
1
1
1
1
1
1
1
1
1
1
2
1
2
2
3
1
2
1
2
1
2
1
2
1
2
p0 p23
p0
p0 p23
2p0
2p0 p23
p0
p0 p23
p0
p0 p23
p0
p0 p23
p0
p0 p23
p15
p15 p23
x,x / y,y,y
1
1
p15
x,m / y,y,m
1
2
p15 p23
VBROADCASTI128
VPGATHERDD
VPGATHERDD
VPGATHERQD
VPGATHERQD
VPGATHERDQ
VPGATHERDQ
VPGATHERQQ
VPGATHERQQ
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
Arithmetic instructions
PHADD(S)W/D
PHSUB(S)W/D
PHADD(S)W/D
PHSUB(S)W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PMULL/HW
PMULHUW
PMULL/HW
PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PMIN/PMAX
SB/SW/SD
UB/UW/UD
0.5
3
1
1
5
5
5
10
5
5
5
5
1
1
2
SSSE3
2
SSSE3
0.5
0.5
0.5
0.5
1
1
SSE4.1
SSE4.1
SSE4.2
SSE4.2
1
1
1
1
2
2
1
1
1
1
1
1
1
1
0.5
0.5
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
0.5
SSE4.1
0.5
SSE4.1
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
Logic instructions
PAND PANDN
POR PXOR
PAND PANDN
POR PXOR
PTEST
PTEST
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
PSLLDQ
PSRLDQ
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
x,x
x,m128
v,v
v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
x,x,i / v,v,v,i
x,m,i / v,v,m,i
1
1
1
1
1
1
1
1
3
4
1
2
1
2
1
2
1
2
3
4
p0
p0 p23
p15
p15 p23
p15
p15 p23
p0
p0 p23
p0 2p5
p0 2p5 p23
5
v,v / v,v,v
1
1
p015
1
0.33
v,m / v,v,m
v,v
v,m
1
2
2
2
2
3
p015 p23
p0 p5
p0 p5 p23
2
0.5
1
1
mm,mm
1
1
p0
1
1
mm,m64
1
2
p0 p23
x,x / v,v,x
2
2
p0 p5
x,m / v,v,m
2
2
p0 p23
v,i / v,v,i
1
1
p0
1
1
v,v,v
3
3
2p0 p5
2
2
AVX2
v,v,m
4
4
2p0 p5 p23
2
AVX2
x,i / v,v,i
1
1
p5
1
1
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
8
8
9
9
3
4
3
4
8
8
9
9
3
4
3
4
6p05 2p16
11
4
4
5
5
3
3
3
3
1
1
5
6
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
1
2
1
1
3p0 2p16 2p5 p23
3p0 2p16 4p5
6p05 2p16 p23
3p0
3p0 p23
3p0
3p0 p23
1
1
0.5
0.5
0.5
0.5
1
1
2
2
10
11
10
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
Encryption instructions
PCLMULQDQ
x,x,i
PCLMULQDQ
x,m,i
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,x
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,m
AESIMC
x,x
AESIMC
x,m
AESKEYGENASSIST
x,x,i
AESKEYGENASSIST
x,m,i
Other
EMMS
3
4
3
4
2p0 p5
2p0 p5 p23
7
2
2
CLMUL
CLMUL
1
1
p5
7
1
AES
2
2
3
2
2
3
p5 p23
2p5
2p5 p23
14
1.5
2
2
AES
AES
AES
10
10
2p0 8p5
10
9
AES
10
10
2p0 p23 7p5
8
AES
31
31
13
Floating point XMM and YMM instructions
Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVHPS/D
MOVLPS/D
MOVLPS/D
MOVHLPS
MOVLHPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
Operands
μops fused domain
μops unfused domain
μops each port
Latency
Reciprocal throughput
Comments
x,x
y,y
1
1
1
1
p5
p5
0-1
0-1
1
1
x,m128
1
1
p23
3
0.5
y,m256
1
1
p23
3
0.5
m128,x
1
2
p237 p4
3
1
m256,y
x,x
x,m32/64
m32/64,x
x,m64
m64,x
x,m64
m64,x
x,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i / v,v,v,i
x,m,i / v,v,m,i
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
2
2
2
2
2
1
1
1
1
2
2
1
2
p237 p4
p5
p23
p237 p4
p23 p5
p4 p237
p23 p5
p4 p237
p5
p5
p0
p0
p4 p237
p4 p237
p5
p5 p23
4
1
3
3
4
3
4
3
1
1
3
2
~400
~400
1
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
may be elim.
may be elim.
AVX
AVX
AVX
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
VPERMPS
VPERMPS
VPERMPD
VPERMPD
BLENDPS/PD
BLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VPGATHERDPS
VPGATHERDPS
VPGATHERQPS
VPGATHERQPS
VPGATHERDPD
VPGATHERDPD
VPGATHERQPD
VPGATHERQPD
Conversion
CVTPD2PS
CVTPD2PS
v,v,i
v,m,i
v,v,v
v,v,m
y,y,y,i
y,y,m,i
y,y,y
y,y,m
y,y,i
y,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
v,v
v,m
x,m32
y,m32
x,x
y,x
y,m64
y,x
y,m128
v,v
v,m
x,x / v,v,v
x,m / v,v,m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
v,v,m
m128,x,x
m256,y,y
x,[r+s*x],x
y,[r+s*y],y
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y
1
2
1
2
1
2
1
1
1
2
1
2
2
3
2
3
1
1
1
1
1
1
1
1
1
1
1
1
1
2
3
1
2
1
2
1
2
3
4
4
20
34
15
22
12
20
14
22
1
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
1
1
1
1
1
1
1
1
1
1
1
2
2
3
1
2
1
2
1
2
3
4
4
20
34
15
22
12
20
14
22
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p015
p015 p23
2p5
2p5 p23
2p5
2p5 p23
p5
p23
p23
p23
p5
p5
p23
p5
p23
p5
p23
p5
p5 p23
p0 p5
p0 p5 p23
p5
p23 p4
p5
p23 p5
p5
p015 p23
2p5 p23
p0 p1 p4 p23
p0 p1 p4 p23
x,x
x,m128
2
2
2
3
p1 p5
p1 p5 p23
1
1
3
3
3
1
2
2
1
3
4
5
1
3
5
3
3
1
3
1
4
3
4
1
4
3
4
4
13
14
4
1
1
1
1
1
1
1
1
1
1
0.33
0.5
2
2
2
2
1
0.5
0.5
0.5
1
1
0.5
1
0.5
1
0.5
1
1
1
1
1
1
1
1
1
2
2
1
2
9
12
8
7
7
9
7
9
1
1
AVX
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX2
AVX2
AVX
AVX2
AVX
SSE3
SSE3
SSE3
SSE3
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
Arithmetic
ADDSS/D PS/D
SUBSS/D PS/D
ADDSS/D PS/D
SUBSS/D PS/D
ADDSUBPS/D
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,m32
r32,x
r32,m32
x,r32/64
x,m32
r32/64,x
r32,m64
x,v,i
m,v,i
v,x
v,m
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
2
2
2
2
4
2
2
2
3
2
3
2
2
2
2
2
2
1
2
1
2
1
2
1
2
2
2
2
2
2
3
2
3
1
2
2
2
2
2
2
3
2
2
2
3
2
2
2
3
2
4
2
2
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p0 p5
p0 p23
p0 p5
p0 p23
p0 p5
p0 p23
p1
p1 p23
p1
p1 p23
p1
p1 p23
p1
p1 p23
p1 p5
p1 p23
p1 p5
p1 p23
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p1
p1 p23
p1 p5
p1 p23
p1 p5
p1 p23
p1 p5
p1 p5 p23
p1 p5
p1 p23
p0 p1
p0 p1 p23
p1 p5
p1 p23
p0 p1
p0 p1 p23
p1 p5
p1 p4 p5 p23
p1 p5
p1 p23
5
x,x / v,v,v
1
1
p1
3
1
x,m / v,v,m
x,x / v,v,v
1
1
2
1
p1 p23
p1
3
1
1
4
2
5
2
3
3
3
3
4
6
4
6
4
4
4
4
4
4
4
4
4
4
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
3
1
1
1
1
1
1
3
3
1
1
3
3
1
1
1
1
1
1
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
F16C
F16C
F16C
F16C
SSE3
ADDSUBPS/D
HADDPS/D
HSUBPS/D
HADDPS/D
HSUBPS/D
MULSS/D PS/D
MULSS/D PS/D
DIVSS DIVPS
DIVSS DIVPS
DIVSD DIVPD
DIVSD DIVPD
VDIVPS
VDIVPS
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D
CMPccPS/D
CMPccSS/D
CMPccPS/D
(U)COMISS/D
(U)COMISS/D
MAXSS/D PS/D
MINSS/D PS/D
MAXSS/D PS/D
MINSS/D PS/D
x,m / v,v,m
1
2
p1 p23
x,x / v,v,v
3
3
p1 2p5
x,m / v,v,m
x,x / v,v,v
x,m / v,v,m
x,x
x,m
x,x
x,m
y,y,y
y,y,m256
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256
4
1
1
1
1
1
1
3
4
3
4
1
1
3
4
4
1
2
1
2
1
2
3
4
3
4
1
2
3
4
p1 2p5 p23
p01
p01 p23
p0
p0 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
x,x / v,v,v
1
1
p1
x,m / v,v,m
x,x
x,m32/64
2
1
2
2
1
2
p1 p23
p1
p1 p23
x,x / v,v,v
1
1
p1
x,m / v,v,m
1
2
p1 p23
ROUNDSS/D PS/D
v,v,i
2
2
2p1
6
ROUNDSS/D PS/D
v,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,i
x,m128,i
3
4
6
3
4
3
4
6
3
4
2p1 p23
2p0 p1 p5
14
2p0 p1 p5 p23 p6
p0 p1 p5
p0 p1 p5 p23
9
v,v,v
1
1
p01
5
v,v,m
1
2
p01 p23
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m128
1
1
3
4
1
1
3
4
1
1
1
2
3
4
1
2
3
4
1
2
p0
p0 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
DPPS
DPPS
DPPD
DPPD
VFMADD...
(all FMA instr.)
VFMADD...
(all FMA instr.)
Math
SQRTSS/PS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD/PD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
5
5
10-13
10-20
18-21
19-35
5
7
3
1
SSE3
2
SSE3
2
0.5
0.5
7
7
8-14
8-14
14
14
16-28
16-28
1
1
2
2
SSE3
AVX
AVX
AVX
AVX
AVX
AVX
1
1
1
1
3
1
1
11
19
16
28-29
5
2
SSE4.1
2
2
4
1
1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
0.5
FMA
0.5
FMA
7
7
14
14
8-14
8-14
16-28
16-28
1
1
AVX
AVX
AVX
AVX
VRSQRTPS
VRSQRTPS
y,y
y,m256
3
4
3
4
2p0 p15
2p0 p15 p23
7
2
2
AND/ANDN/OR/XORPS/PD
x,x / v,v,v
1
1
p5
1
1
AND/ANDN/OR/XORPS/PD
x,m / v,v,m
1
2
p5 p23
1
Other
VZEROUPPER
4
4
none
1
VZEROALL
12
12
none
10
20
3
3
3
130
116
224
173
20
3
4
none
p0 p6 p23
p0 p4 p6 p237
AVX
AVX
Logic
VZEROALL
LDMXCSR
STMXCSR
VSTMXCSR
FXSAVE
FXRSTOR
XSAVE
XRSTOR
XSAVEOPT
m32
m32
m32
m4096
m4096
m
6
7
8
3
1
1
68
72
84
111
AVX
AVX,
32 bit
AVX,
64 bit
AVX
Broadwell
Intel Broadwell
List of instruction timings and μop breakdown
Explanation of column headings:
Instruction:
Name of instruction. Multiple names mean that these instructions have the same data.
Instructions with or without V name prefix behave the same unless otherwise noted.
Operands:
i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register,
(x)mm = mmx or xmm register, y = 256 bit ymm register, v = any vector register (mmx,
xmm, ymm). same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.
μops fused domain:
The number of μops at the decode, rename and allocate stages in the pipeline. Fused
μops count as one.
μops unfused domain:
The total number of μops for all execution ports. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if this number is higher than the number under fused domain. Some operations are not counted here if they do not go to any
execution port or if the counters are inaccurate.
µops each port:
The number of μops for each execution port. p0 means a µop to execution port 0.
p01 means a µop that can go to either port 0 or port 1. p0 p1 means two µops going to
port 0 and 1, respectively.
Port 0: Integer, f.p. and vector ALU, mul, div, branch
Port 1: Integer, f.p. and vector ALU
Port 2: Load
Port 3: Load
Port 4: Store
Port 5: Integer and vector ALU
Port 6: Integer ALU, branch
Port 7: Store address
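The port notation described above can be read mechanically. The following Python sketch is only an illustration of that notation, not part of the measurement method; the function name parse_ports and its return format are assumptions made for this example. It assumes an optional leading count followed by "p" and the digits of the ports the µop may go to, and treats entries such as "none" as empty.

```python
import re

def parse_ports(spec: str) -> list[tuple[int, tuple[int, ...]]]:
    """Parse a 'µops each port' entry such as '2p0156 2p237 p4'.

    Returns a list of (count, ports) pairs: `count` µops, each of which
    can go to any one of the listed ports. (Hypothetical helper.)
    """
    if spec.strip().startswith("none"):
        return []  # e.g. 'none' or 'none counted'
    groups = []
    for token in spec.split():
        match = re.fullmatch(r"(\d*)p(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        count = int(match.group(1)) if match.group(1) else 1
        ports = tuple(int(digit) for digit in match.group(2))
        groups.append((count, ports))
    return groups

print(parse_ports("p01"))    # [(1, (0, 1))]  one µop, port 0 or port 1
print(parse_ports("p0 p1"))  # [(1, (0,)), (1, (1,))]  two µops, one per port
print(parse_ports("2p0156 2p237 p4"))
```

The pairs keep the distinction between p01 (one µop with a choice of ports) and p0 p1 (two µops), which is the point the explanation above makes.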
Latency:
This is the delay that the instruction generates in a dependency chain. The numbers are
minimum values. Cache misses, misalignment, and exceptions may increase the clock
counts considerably. Where hyperthreading is enabled, the use of the same execution
units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput:
The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
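As a rough illustration of how the latency and reciprocal throughput columns are used, the Python sketch below estimates the cycle count of a long run of identical instructions in the two extreme cases: every instruction depends on the previous one, or all instructions are independent. The function names and the example numbers are assumptions chosen for illustration. The estimate ignores start-up effects, the front end, cache misses and competition from other instructions.

```python
def chain_cycles(n: int, latency: float) -> float:
    """Approximate cycles for n instructions in one dependency chain:
    each instruction must wait for the result of the previous one."""
    return n * latency

def independent_cycles(n: int, reciprocal_throughput: float) -> float:
    """Approximate cycles for n independent instructions of the same kind."""
    return n * reciprocal_throughput

# Hypothetical instruction with latency 3 and reciprocal throughput 0.5
n = 1000
print(chain_cycles(n, 3))          # 3000
print(independent_cycles(n, 0.5))  # 500.0
```

Real code usually falls between the two estimates, depending on how many independent dependency chains it contains.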
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
Operands
µops fused domain
µops unfused domain
µops each port
Latency
Reciprocal throughput
Comments
r,i
r8/16,r8/16
r32/64,r32/64
r8l,m
r8h,m
r16,m
r32/64,m
1
1
1
1
1
1
1
1
1
1
2
1
2
1
p0156
p0156
p0156
p23 p0156
p23
p23 p0156
p23
m,r
1
2
p237 p4
2
0.25
0.25
0.25
0.5
0.5
0.5
0.5
3
1
1
0-1
may be elim.
all addressing
modes
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
m,i
m,r
r,r
1
2
1
2
2
1
p237 p4
p23 p4
p0156
r16,m8
r,m
1
1
2
1
p23 p0156
p23
r,r
r,m
r,r
r,m
1
2
3
8
3
2
2
3
3
4
19
1
3
3
9
18
1
3
2
p06
p06 p23
3p0156
r16,m
1
2
3
8
3
1
1
2
2
3
11
1
3
2
9
18
1
3
2
LEA
r32/64,m
1
LEA
r32/64,m
LEA
BSWAP
BSWAP
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
PREFETCHNTA/0/1/2
PREFETCHW
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
1
1
0.25
0.5
0.5
1
2
21
7
3
implicit lock
p06
3p0156
p1 p05
1
1
2-4
1
p15
1
0.5
1
1
p1
3
1
r32/64,m
1
1
p1
r32
r64
r16,m16
r32,m32
r64,m64
m16,r16
m32,r32
m64,r64
1
2
3
2
3
2
2
3
1
2
3
2
3
3
3
4
p15
p06 p15
2p0156 p23
p15 p23
2p0156 p23
p06 p237 p4
p15 p237 p4
p06 p15 p237 p4
m
1
1
p23
0.5
m
1
2
3
2
1
3
2
p23
none counted
p23 p4
p23 p4
1
4
33
6
1
1
p0156
r
stack pointer
m
r,r/i
p23
p23 2p0156
2p237 p4
2
1
1
2
1
all other
combinations
0.5
0.5
1
2
1
1
1
1
1
8
0.5
4
1
18
8
1
1
1
r
i
m
stack pointer
p23 2p0156
p237 p4
p237 p4
p4 2p237
p0156 p237 p4
p1 p4 p237 p06
~400
1
0.5
1
0.5-1
0.5
0.5
1
1
1
0.25
not 64 bit
not 64 bit
not 64 bit
16 or 32 bit
address size
1 or 2 components in
address
3 components
in address
rip relative
address
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
PREFETCHW
ADD SUB
ADD SUB
r,m
m,r/i
1
2
2
4
ADC SBB
ADC SBB
ADC SBB
r,r/i
r,m
m,r/i
1
2
4
1
2
6
CMP
CMP
INC DEC NEG
NOT
INC DEC NOT
NEG
AAA
AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MULX
MULX
MULX
MULX
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
r,r/i
m,r/i
r
1
1
1
m
m
3
2
2
2
3
3
8
1
4
3
2
1
4
3
2
1
1
2
1
1
2
1
1
3
3
2
2
9
11
10
36
9
10
9
59
1
1
1
2
1
1
r8
r16
r32
r64
m8
m16
m32
m64
r,r
r,m
r16,r16,i
r32,r32,i
r64,r64,i
r16,m16,i
r32,m32,i
r64,m64,i
r32,r32,r32
r32,r32,m32
r64,r64,r64
r64,r64,m64
r8
r16
r32
r64
r8
r16
r32
r64
p0156 p23
0.5
1
2p0156 2p237 p4
6
p06
p06 p23
1
3p0156 2p237 p4
7
1
1
2
1
2
1
p0156
p0156 p23
p0156
1
1
1
0.25
0.5
0.25
4
4
2
2
3
3
8
1
4
3
2
2
5
4
3
1
2
2
1
1
3
2
2
3
4
2
3
9
11
10
36
9
10
9
59
1
1
1
2
1
1
p0156 2p237 p4
p0156 2p237 p4
p1 p56
p1 p056
p1 2p056
p1 2p056
p0 p1 p5 p6
p1
p1 p0156
p1 p0156
p1 p6
p1 p23
p1 3p0156 p23
p1 2p0156 p23
p1 p6 p23
p1
p1 p23
p1 p0156
p1
p1
p1 p0156 p23
p1 p23
p1 p23
p1 2p056
p1 2p056 p23
p1 p5
p1 p6 p23
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0156
p0156
p0156
p0156
p06
p06
6
6
4
6
4
6
21
3
4
4
3
1
1
3
4
3
3
4
4
22-25
23-26
22-29
32-95
23-26
23-26
22-29
39-103
1
1
1
1
1
1
7
1
2
2
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
9
9
9
21-73
6
6
6
24-81
not 64 bit
not 64 bit
not 64 bit
not 64 bit
not 64 bit
AVX2
AVX2
AVX2
AVX2
POPCNT
POPCNT
CRC32
CRC32
r,r
r,m
r,r
r,m
1
1
1
1
1
2
1
2
p1
p1 p23
p1
p1 p23
3
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
r,r/i
r,m
m,r/i
1
1
2
1
2
4
p0156
p0156 p23
1
2p0156 2p237 p4
6
r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,1
m,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
r,r,cl
m,r,cl
r,r,r
r,m,r
r,r,i
r,m,i
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
1
1
1
3
3
5
2
1
4
3
5
3
4
8
11
8
11
1
3
4
4
5
1
2
1
2
1
10
2
1
10
2
1
1
1
2
1
1
1
3
1
1
1
2
1
4
3
6
2
1
5
3
6
3
6
8
11
8
11
1
5
4
4
7
1
2
1
2
1
10
2
1
10
2
1
2
1
3
0
1
1
3
1
2
p0156
p0156 p23
p06
2p06 p237 p4
3p06
3p06 2p23 p4
2p06
p06
2p06 2p237 p4
3p06
3p06 p23 p4
2p06 p0156
1
1
1
p0156
6
p0156
6
p1
3
p0156
p0156
3
4
p06
p06 p23
p06
p06 p23
p06
1
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
SHRD SHLD
SHRD SHLD
SHLD
SHRD
SHRD SHLD
SHLX SHRX SARX
SHLX SHRX SARX
RORX
RORX
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC
CMC
CLD STD
LZCNT
LZCNT
r,r
r,m
p06 p23
p06
p06 p23
p1
p1 p23
p06
p06 p237 p4
none
p0156
p0156
p15 p6
p1
p1 p23
3
2
1
1
2
2
1
1
1
3
1
1
3
1
1
1
1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
0.25
0.5
1
0.25
0.5
0.5
2
2
4
1
0.5
2
2
4
2
3
6
6
6
6
1
2
2
2
4
0.5
0.5
0.5
0.5
0.5
5
0.5
0.5
5
0.5
1
1
0.5
1
0.25
0.25
1
4
1
1
short form
BMI2
BMI2
BMI2
BMI2
LZCNT
LZCNT
TZCNT
TZCNT
ANDN
ANDN
BLSI BLSMSK
BLSR
BLSI BLSMSK
BLSR
BEXTR
BEXTR
BZHI
BZHI
PDEP
PDEP
PEXT
PEXT
r,r
r,m
r,r,r
r,r,m
r,r
1
1
1
1
1
1
2
1
2
1
p1
p1 p23
p15
p15 p23
p15
r,m
1
2
p15 p23
r,r,r
r,m,r
r,r,r
r,m,r
r,r,r
r,r,m
r,r,r
r,r,m
2
3
1
1
1
1
1
1
2
3
1
2
1
2
1
2
2p0156
2p0156 p23
p15
p15 p23
p1
p1 p23
p1
p1 p23
Control transfer instructions
JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
1
1
1
1
1
1
2
1
p6
p6
p23 p6
p6
1-2
2
2
1-2
Conditional jump
1
1
p06
0.5-1
1
1
p6
1-2
1
1
p06
0.5-1
2
7
11
2
2
3
1
3
15
4
2
7
11
3
3
4
2
4
15
4
p0156 p6
0.5-2
5
6
2
2
3
1
2
8
5
3
2
5n+12
3
<2n
2.6/32B
3
2
2p0156 p23
p0156 p23
3
p23 p0156 p4
5
~2n
4/32B
5
Fused arithmetic
and branch
Fused arithmetic
and branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET
BOUND
INTO
String instructions
LODSB/W
LODSD/Q
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
short/near
short
short
short
near
r
m
i
r,m
p237 p4 p6
p237 p4 p6
2p237 p4 p6
p237 p6
p23 2p6 p015
2p23 p4 2p0156
3
1
1
1
2
1
3
3
1
1
0.5
0.5
0.5
BMI1
BMI1
BMI1
BMI1
BMI1
0.5
BMI1
0.5
1
0.5
0.5
1
1
1
1
BMI1
BMI1
BMI2
BMI2
BMI2
BMI2
BMI2
BMI2
1
1
~2n
1
~0.5n
1/32B
4
< 1n
1/32B
predicted
taken
predicted not
taken
predicted
taken
predicted not
taken
not 64 bit
not 64 bit
worst case
best case
aligned by 32
worst case
best case
aligned by 32
SCAS
REP SCAS
CMPS
REP CMPS
3
≥6n
5
≥8n
3
p23 2p0156
5
2p23 3p0156
Synchronization instructions
XADD
m,r
LOCK XADD
m,r
LOCK ADD
m,r
CMPXCHG
m,r
LOCK CMPXCHG
m,r
CMPXCHG8B
m,r
LOCK CMPXCHG8B
m,r
CMPXCHG16B
m,r
LOCK CMPXCHG16B
m,r
4
9
8
5
10
15
19
22
24
5
9
8
6
10
15
19
22
24
1
1
0
0
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
XGETBV
RDTSC
RDTSCP
RDPMC
RDRAND
RDSEED
a,0
a,b
r
r
5
5
12
12
~14+7b ~45+7b
3
3
8
8
15
15
21
21
34
34
16
16
16
16
1
≥2n
4
≥2n
6
21
21
7
21
8
21
15
27
none
none
0.25
0.25
p05 3p6
9
8
~87+2b
2p0156 p23
5
5
24
30
37
~230
~230
p23 15p0156
p23 15p0156
XGETBV
RDTSCP
RDRAND
RDSEED
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
µops fused domain
µops unfused domain
µops each port
Latency
Reciprocal throughput
Comments
1
1
4
43
1
1
7
238
2
1
3
3
1
2
1
1
4
43
1
2
7
226
0
2
3
3
1
2
p01
p23
2p01 2p23
p01
p4 p237
3p0156 2p23 2p4
none
p01 p23
p1 p23 p4
p1 p23 p4
p01
2p01
1
3
4
47
1
4
5
269
0
6
7
7
0.5
0.5
2
22
0.5
1
5
267
0.5
1
1
2
1
2
SSE3
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P)
FSUB(R)(P)
FADD(P)
FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P)
FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
Other
FNOP
r
m
m
2
3
2
2
3
2
1
1
152
95
2
3
2
3
3
3
1
1
152
95
2p01
2p0 p5
p0 p0156
p0 p4 p237
p01 p23 p6
p237 p4 p6
p01
p01
r
1
1
p1
m
r
m
r
m
1
1
1
1
1
1
1
1
1
2
2
1
2
1
2
1
1
1
2
2
p1 p23
p0
p0 p23
p0
p0 p23
p0
p0
p1
p1 p23
2p01
3
2
2
2
2
1
2
28
28
17
3
3
3
3
3
1
2
28
28
17
3p01
2p1 p23
p0 p1 p23
p0 p1 p23
2p1 p23
p1
2p1
27
17
1
75-100
70-100
70-110
16-86
55-96
56
71-102
27-71
27
17
1
p0
1
1
p01
r
AX
m16
m16
m16
r
m
r
m
m
m
m
173
175
2
2
1
1
2
1
0.5
0.5
173
175
3
1
2
6
6
7
6
0
5
10-15
1
1
3
7
3
6
20-24
23-48
11
125
12
10-23
48-106
49-112
52-124
63-68
92
74
132
97-147
1
1
1
4-5
4-5
1
1
1
1
1
1.5
2
2
2
1
2
13
13
23
130
11
4-9
0.5
WAIT
FNCLEX
FNINIT
2
5
26
2
5
26
p01
p0156
1
22
84
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA/U
MOVDQA/U
MOVDQA/U
VMOVDQA/U
VMOVDQA/U
VMOVDQA/U
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
VMOVNTDQ
MOVNTDQA
VMOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/L
BW/WD/DQ
PUNPCKH/L
BW/WD/DQ
PUNPCKH/L
QDQ
PUNPCKH/L
QDQ
Operands
r32/64,(x)mm
µops fused domain
µops unfused domain
µops each port
Latency
Reciprocal throughput
Comments
r64,(x)mm
(x)mm,r64
(x)mm,(x)mm
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
2
1
1
2
p0
p237 p4
p5
p23
p0
p5
p015
p23
p237 p4
p015
p23
p237 p4
1
3
1
3
1
1
1
3
3
0-1
3
3
1
1
1
0.5
1
1
0.33
0.5
1
0.25
0.5
1
y,y
y,m256
m256,y
x, m128
mm, x
x,mm
m64,mm
m128,x
m256,y
x, m128
y,m256
1
1
1
1
2
1
1
1
1
1
1
1
1
2
1
2
1
2
2
2
1
1
p015
p23
p237 p4
p23
p01 p5
p015
p237 p4
p237 p4
p237 p4
p23
p23
0-1
3
4
3
1
1
~400
~400
~400
3
3
0.25
0.5
1
0.5
1
0.33
1
1
1
0.5
0.5
mm,mm
3
3
p5
2
2
mm,m64
3
3
p23 2p5
x,x / y,y,y
1
1
p5
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
1
1
1
2
1
2
v,v / v,v,v
1
v,m / v,v,m
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
AVX
may be elim.
AVX
AVX
SSE3
AVX2
SSE4.1
AVX2
2
1
1
p23 p5
p5
p23 p5
1
1
1
1
1
p5
1
1
1
2
p23 p5
x,x / y,y,y
1
1
p5
x,m / y,y,m
2
2
p23 p5
may be elim.
1
1
1
1
SSE4.1
SSE4.1
PMOVSX/ZX BW
BD BQ DW DQ
x,x
1
1
p5
PMOVSX/ZX BW
BD BQ DW DQ
x,m
1
2
p23 p5
VPMOVSX/ZX BW
BD BQ DW DQ
y,x
1
1
p5
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD/Q
PINSRD/Q
VINSERTI128
VINSERTI128
y,m
v,v / v,v,v
v,m / v,v,m
mm,mm,i
mm,m64,i
v,v,i
v,m,i
v,v,i
v,m,i
v,v,i / v,v,v,i
v,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,v,i
v,v,m,i
y,y,y
y,y,m
y,y,i
y,m,i
y,y,y,i
y,y,m,i
mm,mm
x,x
v,v,m
m,v,v
r,v
r32,x,i
m8,x,i
x,y,i
m,y,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
y,y,x,i
y,y,m,i
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
2
1
2
1
1
1
2
1
2
4
10
3
4
1
2
2
1
2
2
2
2
2
2
2
1
2
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
2
1
2
1
2
1
2
1
2
4
10
3
4
1
2
3
1
2
2
2
2
2
2
2
1
2
p5 p23
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
2p5
2p5 p23
2p5
2p5 p23
p5
p23 p5
p015
p015 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p0 p4 2p23
4p04 2p56 4p23
p23 2p5
p0 p1 p4 p23
p0
p0 p5
p23 p4 p5
p5
p23 p4
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p015 p23
VPBROADCAST
B/W/D/Q
x,x
1
1
VPBROADCAST
B/W
x,m8/16
3
3
VPMOVSX/ZX BW
BD BQ DW DQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
VPBLENDVB
VPBLENDVB
PBLENDW
PBLENDW
VPBLENDD
VPBLENDD
VPERMD
VPERMD
VPERMQ
VPERMQ
VPERM2I128
VPERM2I128
MASKMOVQ
MASKMOVDQU
VPMASKMOVD/Q
VPMASKMOVD/Q
PMOVMSKB
PEXTRB/W/D/Q
PEXTRB/W/D/Q
VEXTRACTI128
VEXTRACTI128
1
SSE4.1
1
SSE4.1
1
AVX2
AVX2
SSSE3
SSSE3
3
4
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
1
1
0.33
0.5
1
1
1
1
1
1
1
6
2
1
1
1
1
1
1
2
1
2
1
2
1
1
0.5
p5
1
1
AVX2
p01 p23 p5
5
1
AVX2
1
3
1
1
1
1
1
2
2
1
1
3
3
3
18-500
18-500
4
15
3
2
3
4
2
2
2
SSSE3
SSSE3
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX2
AVX2
VPBROADCAST
D/Q
x,m32/64
1
1
p23
4
0.5
AVX2
VPBROADCAST
B/W/D/Q
y,x
1
1
p5
3
1
AVX2
VPBROADCAST
B/W
y,m8/16
3
3
p01 p23 p5
7
1
AVX2
y,m32/64
y,m128
x,[r+s*x],x
y,[r+s*y],y
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y
1
1
10
14
9
10
7
9
7
9
1
1
10
14
9
10
7
9
7
9
p23
p23
5
3
0.5
0.5
6
7
6
6
5
6
5
6
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
PADD/SUB(S,US)
B/W/D/Q
v,v / v,v,v
1
1
p15
1
0.5
PADD/SUB(S,US)
B/W/D/Q
v,m / v,v,m
1
2
p15 p23
v,v / v,v,v
3
3
p1 2p5
v,m / v,v,m
4
4
p1 2p5 p23
v,v / v,v,v
1
1
p15
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
1
1
1
1
1
2
1
2
1
2
p15 p23
p15
p15 p23
p0
p0 p23
v,v / v,v,v
1
1
p0
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
x,x / y,y,y
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
1
1
1
2
3
1
1
1
1
1
1
1
1
1
1
2
1
2
2
3
1
2
1
2
1
2
1
2
1
2
p0 p23
p0
p0 p23
2p0
2p0 p23
p0
p0 p23
p0
p0 p23
p0
p0 p23
p0
p0 p23
p15
p15 p23
VPBROADCAST
D/Q
VBROADCASTI128
VPGATHERDD
VPGATHERDD
VPGATHERQD
VPGATHERQD
VPGATHERDQ
VPGATHERDQ
VPGATHERQQ
VPGATHERQQ
Arithmetic instructions
PHADD(S)W/D
PHSUB(S)W/D
PHADD(S)W/D
PHSUB(S)W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PMULL/HW
PMULHUW
PMULL/HW
PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
0.5
3
1
1
5
5
5
10
5
5
5
5
1
2
SSSE3
2
SSSE3
0.5
0.5
0.5
0.5
1
1
SSE4.1
SSE4.1
SSE4.2
SSE4.2
1
1
1
1
2
2
1
1
1
1
1
1
1
1
0.5
0.5
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
Logic instructions
PAND PANDN
POR PXOR
PAND PANDN
POR PXOR
PTEST
PTEST
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
PSLLDQ
PSRLDQ
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
x,x / y,y,y
1
1
p15
x,m / y,y,m
x,x
x,m128
v,v
v,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
x,x,i / v,v,v,i
x,m,i / v,v,m,i
1
1
1
1
1
1
1
1
1
3
4
2
1
2
1
2
1
2
1
2
3
4
p15 p23
p0
p0 p23
p15
p15 p23
p15
p15 p23
p0
p0 p23
p0 2p5
p0 2p5 p23
v,v / v,v,v
1
1
p015
1
0.33
v,m / v,v,m
v,v
v,m
1
2
2
2
2
3
p015 p23
p0 p5
p0 p5 p23
2
0.5
1
1
mm,mm
1
1
p0
1
1
mm,m64
1
2
p0 p23
x,x / v,v,x
2
2
p0 p5
x,m / v,v,m
2
2
p0 p23
v,i / v,v,i
1
1
p0
1
1
v,v,v
3
3
2p0 p5
2
2
AVX2
v,v,m
4
4
2p0 p5 p23
2
AVX2
x,i / v,v,i
1
1
p5
1
1
x,x,i
x,m128,i
x,x,i
8
8
9
8
8
9
6p05 2p16
4
3p0 2p16 2p5 p23
4
4
11
3p0 2p16 4p5
1
5
1
1
5
6
0.5
SSE4.1
0.5
1
1
0.5
0.5
0.5
0.5
1
1
2
2
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
1
2
1
1
11
SSE4.2
SSE4.2
SSE4.2
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
Encryption instructions
PCLMULQDQ
x,x,i
PCLMULQDQ
x,m,i
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,x
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,m
AESIMC
x,x
AESIMC
x,m
AESKEYGENASSIST
x,x,i
AESKEYGENASSIST
x,m,i
Other
EMMS
9
3
4
3
4
9
3
4
3
4
6p05 2p16 p23
3p0
3p0 p23
3p0
3p0 p23
5
3
3
11
3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
1
2
1
2
p0
p0 p23
5
1
1
CLMUL
CLMUL
1
1
p5
7
1
AES
2
2
3
2
2
3
p5 p23
2p5
2p5 p23
14
1.5
2
2
AES
AES
AES
10
10
2p0 8p5
10
9
AES
10
10
2p0 p23 7p5
8
AES
31
31
3
11
12
Floating point XMM and YMM instructions
Instruction
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVHPS/D
MOVLPS/D
MOVLPS/D
MOVHLPS
MOVLHPS
Operands
µops fused domain
µops unfused domain
µops each port
Latency
Reciprocal throughput
Comments
x,x
y,y
1
1
1
1
p5
p5
0-1
0-1
1
1
x,m128
1
1
p23
3
0.5
y,m256
1
1
p23
3
0.5
m128,x
1
2
p237 p4
3
1
m256,y
x,x
x,m32/64
m32/64,x
x,m64
m64,x
x,m64
m64,x
x,x
x,x
1
1
1
1
1
1
1
1
1
1
2
1
1
2
2
2
2
2
1
1
p237 p4
p5
p23
p237 p4
p23 p5
p4 p237
p23 p5
p4 p237
p5
p5
4
1
3
3
4
3
4
3
1
1
1
1
0.5
1
1
1
1
1
1
1
may be elim.
may be elim.
AVX
AVX
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
VPERMPS
VPERMPS
VPERMPD
VPERMPD
BLENDPS/PD
BLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
MOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VPGATHERDPS
VPGATHERDPS
VPGATHERQPS
VPGATHERQPS
VPGATHERDPD
VPGATHERDPD
VPGATHERQPD
r32,x
r32,y
m128,x
m256,y
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,i
v,m,i
v,v,v
v,v,m
y,y,y,i
y,y,m,i
y,y,y
y,y,m
y,y,i
y,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
v,v
v,m
x,m32
y,m32
x,x
y,x
y,m64
y,x
y,m128
v,v
v,m
x,x / v,v,v
x,m / v,v,m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
v,v,m
m128,x,x
m256,y,y
x,[r+s*x],x
y,[r+s*y],y
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
1
1
1
1
1
2
1
2
1
2
1
2
1
1
1
2
1
2
2
3
2
3
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
2
1
2
1
2
3
4
4
10
14
9
10
7
9
7
1
1
2
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
2
3
2
3
1
1
1
1
1
1
1
1
1
1
1
1
2
2
3
1
2
1
2
1
2
3
4
4
10
14
9
10
7
9
7
p0
p0
p4 p237
p4 p237
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p015
p015 p23
2p5
2p5 p23
2p5
2p5 p23
p5
p23
p23
p23
p5
p5
p23
p5
p23
p5
p23
p5
p5 p23
p0 p5
p0 p5 p23
p5
p23 p4
p5
p23 p5
p5
p015 p23
2p5 p23
p0 p1 p4 p23
p0 p1 p4 p23
3
3
~400
~400
1
1
1
3
3
3
1
2
2
1
3
4
5
1
3
5
3
4
1
3
1
4
3
4
1
4
3
4
4
15
16
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.33
0.5
2
2
2
2
1
0.5
0.5
0.5
1
1
0.5
1
0.5
1
0.5
1
1
1
1
1
1
1
1
1
2
2
1
1
6
7
6
6
5
6
5
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX2
AVX2
AVX
AVX2
AVX
SSE3
SSE3
SSE3
SSE3
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
VPGATHERQPD
Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
y,[r+s*y],y
9
9
x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,r64
x,m32
r32,x
r32,m32
x,r32/64
x,m32
r32/64,x
r32,m64
x,v,i
m,v,i
v,x
v,m
2
2
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
1
1
2
2
2
2
2
2
2
3
1
2
2
2
2
2
2
2
3
2
2
2
3
2
3
2
3
2
2
2
2
2
2
1
2
1
2
1
2
1
2
2
2
2
2
2
3
2
3
1
2
2
2
2
2
2
3
2
3
2
2
3
2
2
2
3
2
3
2
2
6
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p0 p5
p0 p23
p0 p5
p0 p23
p0 p5
p0 p23
p1
p1 p23
p1
p1 p23
p1
p1 p23
p1
p1 p23
p1 p5
p1 p23
p1 p5
p1 p23
p1 p5
p1 p5 p23
p1 p5
p1 p5 p23
p1
p1 p23
p1 p5
p1 p23
p1 p5
p1 p23
p1 p5
p1 p5 p23
p1 p5
p1 2p5
p1 p23
p0 p1
p0 p1 p23
p1 p5
p1 p23
p0 p1
p0 p1 p23
p1 p5
p1 p4 p23
p1 p5
p1 p23
4
5
4
2
5
2
3
3
3
3
4
6
4
6
4
4
4
4
4
5
4
4
4
4-6
4-6
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
4
3
1
1
1
1
1
1
3
4
3
1
1
3
3
1
1
1
1
1
1
AVX2
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
F16C
F16C
F16C
F16C
Arithmetic
ADDSS/D PS/D
SUBSS/D PS/D
ADDSS/D PS/D
SUBSS/D PS/D
ADDSUBPS/D
ADDSUBPS/D
HADDPS/D
HSUBPS/D
HADDPS/D
HSUBPS/D
MULSS/D PS/D
MULSS/D PS/D
DIVSS
DIVPS
DIVSS DIVPS
DIVSD
DIVPD
DIVSD DIVPD
VDIVPS
VDIVPS
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
VRCPPS
VRCPPS
CMPccSS/D
CMPccPS/D
CMPccSS/D
CMPccPS/D
(U)COMISS/D
(U)COMISS/D
MAXSS/D PS/D
MINSS/D PS/D
MAXSS/D PS/D
MINSS/D PS/D
3
1
p1 p23
p1
p1 p23
3
1
1
1
SSE3
SSE3
3
p1 2p5
5
2
SSE3
4
1
1
1
1
1
1
1
1
3
4
3
4
1
1
3
4
4
1
2
1
1
2
1
1
2
3
4
3
4
1
2
3
4
p1 2p5 p23
p01
p01 p23
p0
p0
p0 p23
p0
p0
p0 p23
2p0 p15
2p0 p15 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
2
0.5
0.5
2.5
5
3-5
4-5
8
4-5
10
10
16
16
1
1
2
2
SSE3
x,x / v,v,v
1
1
p1
x,m / v,v,m
x,x
x,m32/64
2
1
2
2
1
2
p1 p23
p1
p1 p23
x,x / v,v,v
1
1
p1
x,m / v,v,m
1
2
p1 p23
ROUNDSS/D PS/D
v,v,i
2
2
2p1
ROUNDSS/D PS/D
v,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,i
x,m128,i
3
4
6
3
4
3
4
6
3
4
2p1 p23
2p0 p1 p5
2p0 p1 p5 p23 p6
p0 p1 p5
p0 p1 p5 p23
7
v,v,v
1
1
p01
5
v,v,m
1
2
p01 p23
x,x
1
1
p0
DPPS
DPPS
DPPD
DPPD
VFMADD...
(all FMA instr.)
VFMADD...
(all FMA instr.)
Math
SQRTSS
x,x / v,v,v
1
1
p1
x,m / v,v,m
x,x / v,v,v
x,m / v,v,m
1
1
1
2
1
2
x,x / v,v,v
3
x,m / v,v,m
x,x / v,v,v
x,m / v,v,m
x,x
x,x
x,m
x,x
x,x
x,m
y,y,y
y,y,m256
y,y,y
y,y,m256
x,x
x,m128
y,y
y,m256
3
11
11
10-14
10-14
17
19-23
5
7
3
AVX
AVX
AVX
AVX
AVX
AVX
1
1
1
1
3
1
1
6
12
11
2
SSE4.1
2
2
4
1
1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
0.5
FMA
0.5
FMA
4
SQRTPS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD
SQRTPD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
VRSQRTPS
VRSQRTPS
x,x
x,m128
y,y
y,m256
x,x
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
1
1
3
4
1
1
1
3
4
1
1
3
4
1
2
3
4
1
1
2
3
4
1
2
3
4
p0
p0 p23
2p0 p15
2p0 p15 p23
p0
p0
p0 p23
2p0 p15
2p0 p15 p23
p0
p0 p23
2p0 p15
2p0 p15 p23
11
AND/ANDN/OR/XORPS/PD
x,x / v,v,v
1
1
p5
AND/ANDN/OR/XORPS/PD
x,m / v,v,m
1
2
p5 p23
1
Other
VZEROUPPER
4
4
none
1
VZEROALL
12
12
none
10
20
3
3
111
141
107
115
174
224
172
173
114
20
3
4
none
p0 p6 p23
p0 p4 p6 p237
19
15-16
15-16
27-29
5
7
7
4-7
14
14
4-8
8-14
4-14
16-28
16-28
1
1
2
2
AVX
AVX
AVX
AVX
AVX
AVX
Logic
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXSAVE
FXRSTOR
FXRSTOR
XSAVE
XSAVE
XRSTOR
XRSTOR
XSAVEOPT
m32
m32
m4096
m4096
m4096
m4096
m
1
6
7
66
66
80
80
70
84
111
112
51
1
8
3
1
66
66
80
80
70
84
111
112
51
AVX
AVX,
32 bit
AVX,
64 bit
32 bit mode
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
Skylake
Intel Skylake
List of instruction timings and μop breakdown
Explanation of column headings:
Instruction:
Name of instruction. Multiple names mean that these instructions have the same data.
Instructions with or without V name prefix behave the same unless otherwise noted.
Operands:
i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register,
(x)mm = mmx or xmm register, y = 256 bit ymm register, v = any vector register (mmx,
xmm, ymm). same = same register for both operands. m = memory operand, m32 = 32-bit memory operand, etc.
μops fused domain:
The number of μops at the decode, rename and allocate stages in the pipeline. Fused
μops count as one.
μops unfused domain:
The total number of μops for all execution ports. Fused μops count as two. Fused macro-ops count as one. The instruction has μop fusion if this number is higher than the number under fused domain. Some operations are not counted here if they do not go to any
execution port or if the counters are inaccurate.
µops each port:
The number of μops for each execution port. p0 means a µop to execution port 0.
p01 means a µop that can go to either port 0 or port 1. p0 p1 means two µops going to
port 0 and 1, respectively.
Port 0: Integer, f.p. and vector ALU, mul, div, branch
Port 1: Integer, f.p. and vector ALU
Port 2: Load
Port 3: Load
Port 4: Store
Port 5: Integer and vector ALU
Port 6: Integer ALU, branch
Port 7: Store address
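The port assignments give a simple lower bound on the reciprocal throughput of an instruction repeated many times: a group of k µops that can be spread over n ports needs at least k/n clock cycles. The Python sketch below computes this per-group bound from a "µops each port" entry. It is a simplified estimate under stated assumptions, not the method used for the measurements: the name port_bound is hypothetical, groups that share ports are treated separately, and front-end or retirement limits are ignored, so measured values can be higher.

```python
import re
from fractions import Fraction

def port_bound(spec: str) -> Fraction:
    """Per-group lower bound (in core clock cycles) on the reciprocal
    throughput implied by a 'µops each port' entry such as 'p237 p4'.
    (Hypothetical helper for illustration.)"""
    bound = Fraction(0)
    for token in spec.split():
        match = re.fullmatch(r"(\d*)p(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        count = int(match.group(1) or 1)   # leading number, default 1
        n_ports = len(match.group(2))      # one digit per usable port
        bound = max(bound, Fraction(count, n_ports))
    return bound

print(port_bound("p0156"))    # 1/4
print(port_bound("p23"))      # 1/2
print(port_bound("p237 p4"))  # 1, limited by the single store port
```

A measured reciprocal throughput that is well above this bound suggests that something other than the execution ports is the bottleneck.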
Latency:
This is the delay that the instruction generates in a dependency chain. The numbers are
minimum values. Cache misses, misalignment, and exceptions may increase the clock
counts considerably. Where hyperthreading is enabled, the use of the same execution
units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
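Because the time stamp counter runs at the reference clock frequency, a count taken with it has to be rescaled before it can be compared with the figures in these tables. The Python sketch below shows the conversion under the assumption that both frequencies are known and stay constant during the measurement. The function name and the frequencies in the example are hypothetical.

```python
def core_cycles(reference_cycles: float, core_hz: float, reference_hz: float) -> float:
    """Convert a time stamp counter difference (reference clock cycles)
    to core clock cycles, assuming both frequencies are constant."""
    elapsed_seconds = reference_cycles / reference_hz
    return elapsed_seconds * core_hz

# Hypothetical example: 3000 reference cycles measured while the core
# runs at 3.0 GHz and the time stamp counter counts at 2.0 GHz.
print(core_cycles(3000, 3.0e9, 2.0e9))
```

If the core frequency changes during the test, for example because of turbo or power saving modes, this scaling is no longer valid; a performance counter that counts core clock cycles directly avoids the problem.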
Reciprocal throughput:
The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
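One practical use of the two figures together is to estimate how many independent dependency chains a loop needs before it is limited by throughput rather than by latency. The Python sketch below shows the arithmetic. The function name and the numbers in the example are illustrative assumptions, not values taken from the tables, and the estimate ignores register pressure and other bottlenecks.

```python
import math

def chains_needed(latency: float, reciprocal_throughput: float) -> int:
    """Number of independent dependency chains needed so that a new
    instruction can start every `reciprocal_throughput` cycles while
    each chain waits `latency` cycles for its previous result."""
    return math.ceil(latency / reciprocal_throughput)

# Hypothetical instruction with latency 4 and reciprocal throughput 0.5
print(chains_needed(4, 0.5))  # 8
# Hypothetical instruction with latency 3 and reciprocal throughput 1
print(chains_needed(3, 1))    # 3
```

For example, a summation loop would use this many separate accumulators, under these assumptions, to keep the execution units busy.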
Integer instructions
Instruction
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
Operands
µops fused domain
µops unfused domain
µops each port
Latency
Reciprocal throughput
Comments
r,i
r8/16,r8/16
r32/64,r32/64
r8l,m
r8h,m
r16,m
r32/64,m
1
1
1
1
1
1
1
1
1
1
2
1
2
1
p0156
p0156
p0156
p23 p0156
p23
p23 p0156
p23
m,r
1
2
p237 p4
2
0.25
0.25
0.25
0.5
0.5
0.5
0.5
2
1
1
0-1
may be elim.
all addressing
modes
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVZX
MOVSX MOVZX
MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF SAHF
SALC
LEA
m,i
m,r
r,r
1
2
1
2
2
1
p237 p4
p23 p4
p0156
r16,m8
r,m
1
1
2
1
p23 p0156
p23
r,r
r,m
r,r
r,m
1
2
3
8
3
2
2
3
3
4
19
1
3
3
9
18
1
3
2
p06
p06 p23
3p0156
r16,m
1
2
3
8
3
1
1
2
2
3
11
1
3
2
9
18
1
3
2
LEA
r32/64,m
1
LEA
r32/64,m
LEA
BSWAP
BSWAP
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
PREFETCHNTA/0/1/2
PREFETCHW
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
1
1
0.25
0.5
0.5
1
2
23
7
3
implicit lock
p23
p23 2p0156
2p237 p4
2
p06
3p0156
p1 p05
1
1
2-4
1
p15
1
0.5
1
1
p1
3
1
r32/64,m
1
1
p1
r32
r64
r16,m16
r32,m32
r64,m64
m16,r16
m32,r32
m64,r64
1
2
3
2
3
2
2
3
1
2
3
2
3
3
3
4
p15
p06 p15
2p0156 p23
p15 p23
2p0156 p23
p06 p237 p4
p15 p237 p4
p06 p15 p237 p4
m
1
1
p23
0.5
m
1
2
4
2
1
4
2
p23
none counted
p23 p4
p23 p4
1
4
33
6
1
1
p0156
r
stack pointer
m
r,r/i
1
1
2
1
all other
combinations
0.5
0.5
1
2
1
1
1
1
1
8
0.5
3
1
20
8
1
1
1
r
i
m
stack pointer
p23 2p0156
p237 p4
p237 p4
p4 2p237
p0156 p237 p4
p1 p4 p237 p06
~400
1
0.5
1
0.5-1
0.5
0.75
1
1
1
0.25
not 64 bit
not 64 bit
not 64 bit
16 or 32 bit
address size
1 or 2 components in
address
3 components
in address
rip relative
address
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
MOVBE
PREFETCHW
ADD SUB
ADD SUB
r,m
m,r/i
1
2
2
4
ADC SBB
ADC SBB
ADC SBB
r,r/i
r,m
m,r/i
1
2
4
1
2
6
CMP
CMP
INC DEC NEG
NOT
INC DEC NOT
NEG
AAA
AAS
DAA DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
MULX
MULX
MULX
MULX
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
r,r/i
m,r/i
r
1
1
1
m
m
3
2
2
2
3
3
11
1
4
3
2
1
4
3
2
1
1
2
1
1
2
1
1
3
3
2
2
10
10
10
36
11
10
10
57
1
1
1
2
1
1
r8
r16
r32
r64
m8
m16
m32
m64
r,r
r,m
r16,r16,i
r32,r32,i
r64,r64,i
r16,m16,i
r32,m32,i
r64,m64,i
r32,r32,r32
r32,r32,m32
r64,r64,r64
r64,r64,m64
r8
r16
r32
r64
r8
r16
r32
r64
p0156 p23
0.5
1
2p0156 2p237 p4
5
p06
p06 p23
1
3p0156 2p237 p4
5
1
1
2
1
2
1
p0156
p0156 p23
p0156
1
1
1
0.25
0.5
0.25
4
4
2
2
3
3
11
1
4
3
2
2
5
4
3
1
2
2
1
1
3
2
2
3
4
2
3
10
10
10
36
11
10
10
57
1
1
1
2
1
1
p0156 2p237 p4
p0156 2p237 p4
p1 p56
p1 p056
p1 2p056
p1 2p056
p0 p1 p5 p6
p1
p1 p0156
p1 p0156
p1 p6
p1 p23
p1 3p0156 p23
p1 2p0156 p23
p1 p6 p23
p1
p1 p23
p1 p0156
p1
p1
p1 p0156 p23
p1 p23
p1 p23
p1 2p056
p1 2p056 p23
p1 p5
p1 p6 p23
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0 p1 p5 p6
p0156
p0156
p0156
p0156
p06
p06
5-6
5-6
4
4
4
4
23
3
4
4
3
1
1
3
4
3
3
4
4
23
23
26
35-88
24
23
26
42-95
1
1
1
1
1
1
7
1
2
1
1
1
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
6
6
6
21-83
6
6
6
24-90
not 64 bit
not 64 bit
not 64 bit
not 64 bit
not 64 bit
AVX2
AVX2
AVX2
AVX2
POPCNT
POPCNT
CRC32
CRC32
r,r
r,m
r,r
r,m
1
1
1
1
1
2
1
2
p1
p1 p23
p1
p1 p23
3
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
r,r/i
r,m
m,r/i
1
1
2
1
2
4
p0156
p0156 p23
1
2p0156 2p237 p4
5
r,r/i
m,r/i
r,i
m,i
r,cl
m,cl
r,1
r,i
m,i
r,cl
m,cl
r,1
m,1
r,i
m,i
r,cl
m,cl
r,r,i
m,r,i
r,r,cl
r,r,cl
m,r,cl
r,r,r
r,m,r
r,r,i
r,m,i
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r,m
r
m
1
1
1
3
3
5
2
1
4
3
5
3
4
8
11
8
11
1
3
4
4
5
1
2
1
2
1
10
2
1
10
3
1
1
1
2
1
1
1
3
1
1
1
2
1
4
3
6
2
1
5
3
6
3
6
8
11
8
11
1
5
4
4
7
1
2
1
2
1
10
2
1
11
4
1
2
1
3
0
1
1
3
1
2
p0156
p0156 p23
p06
2p06 p237 p4
3p06
3p06 2p23 p4
2p06
p06
2p06 2p237 p4
3p06
3p06 p23 p4
2p06 p0156
1
1
1
p0156
6
p0156
6
p1
3
p0156
p0156
3
4
p06
p06 p23
p06
p06 p23
p06
1
TEST
TEST
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
ROR ROL
ROR ROL
ROR ROL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
RCR RCL
SHRD SHLD
SHRD SHLD
SHLD
SHRD
SHRD SHLD
SHLX SHRX SARX
SHLX SHRX SARX
RORX
RORX
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
BSF BSR
SETcc
SETcc
CLC
STC
CMC
CLD STD
LZCNT
LZCNT
r,r
r,m
p06 p23
p06
p06 p4 p23
p1
p1 p23
p06
p06 p237 p4
none
p0156
p0156
p15 p6
p1
p1 p23
3
2
1
1
2
2
1
1
1
3
1
1
3
1
1
1
1
SSE4.2
SSE4.2
SSE4.2
SSE4.2
0.25
0.5
1
0.25
0.5
0.5
2
2
4
1
0.5
2
2
4
2
3
6
6
6
6
1
2
2
2
4
0.5
0.5
0.5
0.5
0.5
5
0.5
0.5
5
1
1
1
0.5
1
0.25
0.25
1
4
1
1
short form
BMI2
BMI2
BMI2
BMI2
LZCNT
LZCNT
TZCNT
TZCNT
ANDN
ANDN
BLSI BLSMSK
BLSR
BLSI BLSMSK
BLSR
BEXTR
BEXTR
BZHI
BZHI
PDEP
PDEP
PEXT
PEXT
r,r
r,m
r,r,r
r,r,m
r,r
1
1
1
1
1
1
2
1
2
1
p1
p1 p23
p15
p15 p23
p15
r,m
1
2
p15 p23
r,r,r
r,m,r
r,r,r
r,m,r
r,r,r
r,r,m
r,r,r
r,r,m
2
3
1
1
1
1
1
1
2
3
1
2
1
2
1
2
2p0156
2p0156 p23
p15
p15 p23
p1
p1 p23
p1
p1 p23
Control transfer instructions
JMP
short/near
JMP
r
JMP
m
Conditional jump
short/near
1
1
1
1
1
1
2
1
p6
p6
p23 p6
p6
1-2
2
2
1-2
Conditional jump
1
1
p06
0.5-1
1
1
p6
1-2
1
1
p06
0.5-1
2
7
11
2
2
3
1
2
7
11
3
3
4
2
2
15
5
p0156 p6
p237 p4 p6
p237 p4 p6
2p237 p4 p6
p237 p6
0.5-2
5
6
3
2
3
1
2
8
6
3
2
5n+12
3
<2n
2.6/32B
3
2
2p0156 p23
p0156 p23
3
p23 p0156 p4
5
~2n
4/32B
5
Fused arithmetic and branch
Fused arithmetic and branch
J(E/R)CXZ
LOOP
LOOP(N)E
CALL
CALL
CALL
RET
RET
BOUND
INTO
String instructions
LODSB/W
LODSD/Q
REP LODS
STOS
REP STOS
REP STOS
MOVS
REP MOVS
REP MOVS
short/near
short
short
short
near
r
m
i
r,m
15
5
2p23 p4 2p0156
3
1
1
1
2
1
3
3
1
1
0.5
0.5
0.5
BMI1
BMI1
BMI1
BMI1
BMI1
0.5
BMI1
0.5
1
0.5
0.5
1
1
1
1
BMI1
BMI1
BMI2
BMI2
BMI2
BMI2
BMI2
BMI2
1
1
~2n
1
~0.5n
1/32B
4
< 1n
1/32B
predicted taken
predicted not taken
predicted taken
predicted not taken
not 64 bit
not 64 bit
worst case
best case
aligned by 32
worst case
best case
aligned by 32
SCAS
REP SCAS
CMPS
REP CMPS
3
≥6n
5
≥8n
3
p23 2p0156
5
2p23 3p0156
Synchronization instructions
XADD
m,r
LOCK XADD
m,r
LOCK ADD
m,r
CMPXCHG
m,r
LOCK CMPXCHG
m,r
CMPXCHG8B
m,r
LOCK CMPXCHG8B
m,r
CMPXCHG16B
m,r
LOCK CMPXCHG16B
m,r
4
9
8
5
10
16
20
23
25
5
9
8
6
10
16
20
23
25
1
1
0
0
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
XGETBV
RDTSC
RDTSCP
RDPMC
RDRAND
RDSEED
a,0
a,b
r
r
1
≥2n
4
≥2n
5
18
18
6
18
11
19
16
26
none
none
4
4
12
12
~14+7b ~45+7b
3
3
15
15
20
20
22
22
35
35
16
16
16
16
0.25
0.25
p6
8
~87+2b
2p0156 p23
5
9
25
32
40
~460
~460
p23 15p0156
p23 15p0156
XGETBV
RDTSCP
RDRAND
RDSEED
Floating point x87 instructions
Instruction
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
Operands
r
m32/64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
µops fused domain
µops unfused domain
µops each port
Latency
Reciprocal throughput
Comments
1
1
4
43
1
1
7
244
2
1
3
3
1
2
1
1
4
43
1
2
7
226
0
2
3
3
1
2
p05
p23
2p01 2p23
p05
p4 p237
3p0156 2p23 2p4
none
p05 p23
p5 p23 p4
p1 p23 p4
p05
2p05
1
3
4
46
1
3
4
264
0
5
7
7
0.5
0.5
2
22
0.5
1
5
266
0.5
1
1
2
1
2
SSE3
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P)
FSUB(R)(P)
FADD(P)
FSUB(R)(P)
FMUL(P)
FMUL(P)
FDIV(R)(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P)
FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
Math
FSCALE
FXTRACT
FSQRT
FSIN
FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
Other
FNOP
r
m
m
2
4
2
2
3
2
1
1
133
89
2
4
2
3
3
3
1
1
133
89
2p05
p0 p1 p56
p0 p0156
p0 p4 p237
p01 p23 p6
p237 p4 p6
p05
p05
r
1
1
p5
m
r
m
r
m
2
1
2
1
1
1
1
1
1
2
3
1
3
1
2
1
1
1
2
2
p5 p23
p0
p0 p23
p0
p0 p23
p0
p0
p5
p5 p23
p0 p5
3
3
2
2
2
1
2
31
31
17
3
4
3
3
3
1
2
31
31
17
p5
2p5 p23
p0 p5 p23
p0 p5 p23
2p5 p23
p5
2p5
27
17
1
53-105
53-105
55-120
16-90
40-100
56
40-112
30-160
27
17
1
p0
1
1
p05
r
AX
m16
m16
m16
r
m
r
m
m
m
m
176
175
2
2
2
1
2
1
0.5
0.5
176
175
3
1
3
6
6
7
6
0
5
14-16
1
1
3
1
1
1
4-5
4-5
1
1
1
1
1
1
2
1
3
6
26-30
30-57
21
130
11
14-21
50-120
50-130
55-150
65-80
103
77
140-160
100-160
2
1
2
17
17
11
130
11
4-7
0.5
WAIT
FNCLEX
FNINIT
2
5
18
2
5
18
p05
p156
2
22
78
Integer MMX and XMM instructions
Instruction
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA/U
MOVDQA/U
MOVDQA/U
VMOVDQA/U
VMOVDQA/U
VMOVDQA/U
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
VMOVNTDQ
MOVNTDQA
VMOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PACKUSDW
PACKUSDW
PUNPCKH/L
BW/WD/DQ
PUNPCKH/L
BW/WD/DQ
PUNPCKH/L
QDQ
PUNPCKH/L
QDQ
Operands
r32/64,(x)mm
µops fused domain
µops unfused domain
µops each port
Latency
Reciprocal throughput
Comments
r64,(x)mm
(x)mm,r64
mm,mm
x,x
(x)mm,m64
m64, (x)mm
x,x
x, m128
m128, x
y,y
y,m256
m256,y
x, m128
mm, x
x,mm
m64,mm
m128,x
m256,y
x, m128
y,m256
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
1
1
1
2
2
1
2
1
1
1
1
1
2
1
1
2
1
1
2
1
2
2
2
2
2
2
2
p0
p237 p4
p5
p23
p0
p5
p05
p015
p23
p237 p4
p015
p23
p237 p4
p015
p23
p237 p4
p23
p0 p5
p0 p15
p237 p4
p237 p4
p237 p4
p23 p015
p23 p015
2
3
2
2
2
1
1
1
2
3
0-1
2
3
0-1
3
3
3
2
2
~418
~450
~400
3
3
1
1
1
0.5
1
1
0.5
0.33
0.5
1
0.25
0.5
1
0.25
0.5
1
0.5
1
1
1
1
1
0.5
0.5
mm,mm
3
3
p5
2
2
mm,m64
3
3
p23 2p5
x,x / y,y,y
1
1
p5
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
1
1
1
2
1
2
v,v / v,v,v
1
v,m / v,v,m
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
may eliminate
AVX
AVX
SSE3
AVX2
SSE4.1
AVX2
2
1
1
p23 p5
p5
p23 p5
1
1
1
1
1
p5
1
1
1
2
p23 p5
x,x / y,y,y
1
1
p5
x,m / y,y,m
1
2
p23 p5
may eliminate
1
1
1
1
SSE4.1
SSE4.1
PMOVSX/ZX BW
BD BQ DW DQ
x,x
1
1
p5
PMOVSX/ZX BW
BD BQ DW DQ
x,m
1
2
p23 p5
VPMOVSX/ZX BW
BD BQ DW DQ
y,x
1
1
p5
PINSRB
PINSRB
PINSRW
PINSRW
PINSRD/Q
PINSRD/Q
VINSERTI128
VINSERTI128
y,m
v,v / v,v,v
v,m / v,v,m
mm,mm,i
mm,m64,i
v,v,i
v,m,i
v,v,i
v,m,i
v,v,i / v,v,v,i
v,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,v,i
v,v,m,i
y,y,y
y,y,m
y,y,i
y,m,i
y,y,y,i
y,y,m,i
mm,mm
x,x
v,v,m
m,v,v
r,v
r32,x,i
m8,x,i
x,y,i
m,y,i
x,r32,i
x,m8,i
(x)mm,r32,i
(x)mm,m16,i
x,r32,i
x,m32,i
y,y,x,i
y,y,m,i
2
1
2
1
2
1
1-2
1
2
1
2
1
2
2
3
1
2
1
2
1
1
1
2
1
2
4
10
2
3
1
2
2
1
2
2
2
2
2
2
2
1
2
2
1
2
1
2
1
2
1
2
1
2
1
2
2
3
1
2
1
2
1
2
1
2
1
2
4
10
2
3
1
2
3
1
2
2
2
2
2
2
2
1
2
p5 p23
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
p5
p23 p5
p015
p015 p23
2p015
2p015 p23
p5
p23 p5
p015
p015 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p0 p4 2p23
4p04 2p56 4p23
p23 p015
p0 p4 p23
p0
p0 p5
p23 p4 p5
p5
p23 p4
2p5
p23 p5
p5
p23 p5
2p5
p23 p5
p5
p015 p23
VPBROADCAST
B/W/D/Q
x,x
1
1
VPBROADCAST
B/W
x,m8/16
2
2
VPMOVSX/ZX BW
BD BQ DW DQ
PSHUFB
PSHUFB
PSHUFW
PSHUFW
PSHUFD
PSHUFD
PSHUFL/HW
PSHUFL/HW
PALIGNR
PALIGNR
PBLENDVB
PBLENDVB
VPBLENDVB
VPBLENDVB
PBLENDW
PBLENDW
VPBLENDD
VPBLENDD
VPERMD
VPERMD
VPERMQ
VPERMQ
VPERM2I128
VPERM2I128
MASKMOVQ
MASKMOVDQU
VPMASKMOVD/Q
VPMASKMOVD/Q
PMOVMSKB
PEXTRB/W/D/Q
PEXTRB/W/D/Q
VEXTRACTI128
VEXTRACTI128
1
SSE4.1
1
SSE4.1
1
AVX2
AVX2
SSSE3
SSSE3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
1
0.33
0.5
1
1
1
1
1
1
2
6
0.5
1
1
1
1
1
1
2
1
2
1
2
1
1
0.5
p5
1
1
AVX2
p23 p5
7
1
AVX2
1
3
1
1
1
1
1
1
2
1
1
3
3
3
~450
18-500
4
14
2-3
3
3
4
3
3
3
SSSE3
SSSE3
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX2
AVX2
VPBROADCAST
D/Q
x,m32/64
1
1
p23
4
0.5
AVX2
VPBROADCAST
B/W/D/Q
y,x
1
1
p5
3
1
AVX2
VPBROADCAST
B/W
y,m8/16
2
2
p23 p5
7
1
AVX2
y,m32/64
y,m128
x,[r+s*x],x
y,[r+s*y],y
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y
1
1
4
4
5
4
5
4
5
4
1
1
4
4
5
4
5
4
5
4
p23
p23
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
3
3
0.5
0.5
4
5
2
4
2
4
2
4
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
PADD/SUB(S,US)
B/W/D/Q
v,v / v,v,v
1
1
p015
1
0.33
PADD/SUB(S,US)
B/W/D/Q
v,m / v,v,m
1
2
p015 p23
v,v / v,v,v
3
3
p01 2p5
v,m / v,v,m
4
4
p01 2p5 p23
mm,mm
1
1
p0
1
1
x,x / y,y,y
1
1
p01
1
0.5
x,m / y,y,m
v,v / v,v,v
v,m / v,v,m
v,v / v,v,v
v,m / v,v,m
1
1
1
1
1
2
1
2
1
2
p01 p23
p01
p01 p23
p5
p5 p23
mm,mm
1
1
p0
5
1
x,x / y,y,y
1
1
p01
5
0.5
x,m / y,y,m
mm,mm
x,x / y,y,y
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
x,x / y,y,y
x,m / y,y,m
mm,mm
x,x / y,y,y
x,m / y,y,m
1
1
1
1
2
3
1
1
1
1
1
2
1
1
2
2
3
1
2
1
1
2
p01 p23
p0
p01
p01 p23
2p01
2p01 p23
p01
p01 p23
p0
p01
p01 p23
VPBROADCAST
D/Q
VBROADCASTI128
VPGATHERDD
VPGATHERDD
VPGATHERQD
VPGATHERQD
VPGATHERDQ
VPGATHERDQ
VPGATHERQQ
VPGATHERQQ
Arithmetic instructions
PHADD(S)W/D
PHSUB(S)W/D
PHADD(S)W/D
PHSUB(S)W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQB/W/D
PCMPGTB/W/D
PCMPEQQ
PCMPEQQ
PCMPGTQ
PCMPGTQ
PMULL/HW
PMULHUW
PMULL/HW
PMULHUW
PMULL/HW
PMULHUW
PMULHRSW
PMULHRSW
PMULHRSW
PMULLD
PMULLD
PMULDQ
PMULDQ
PMULUDQ
PMULUDQ
PMULUDQ
0.5
3
1
3
5
5
10
5
5
5
2
SSSE3
2
SSSE3
0.5
0.5
0.5
1
1
0.5
1
0.5
0.5
1
1
0.5
0.5
1
0.5
0.5
SSE4.1
SSE4.1
SSE4.2
SSE4.2
SSSE3
SSSE3
SSSE3
SSE4.1
SSE4.1
SSE4.1
SSE4.1
PMADDWD
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PMADDUBSW
PAVGB/W
PAVGB/W
PAVGB/W
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PHMINPOSUW
PHMINPOSUW
PABSB/W/D
PABSB/W/D
PABSB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSIGNB/W/D
PSADBW
PSADBW
MPSADBW
MPSADBW
Logic instructions
PAND PANDN
POR PXOR
PAND PANDN
POR PXOR
PAND PANDN
POR PXOR
PTEST
PTEST
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
mm,mm
x,x / y,y,y
x,m / y,y,m
mm,mm
x,x / y,y,y
x,m / y,y,m
mm,mm
x,x / y,y,y
x,m / y,y,m
1
1
1
1
1
1
1
1
1
1
1
2
1
1
2
1
1
2
p0
p01
p01 p23
p0
p01
p01 p23
p0
p01
p01 p23
5
5
mm,mm
1
1
p0
x,x / y,y,y
1
1
p01
x,m / y,y,m
x,x
x,m128
mm,mm
x,x / y,y
x,m / y,m
mm,mm
x,x / y,y,y
x,m / y,y,m
v,v / v,v,v
v,m / v,v,m
x,x,i / v,v,v,i
x,m,i / v,v,m,i
1
1
1
1
1
1
1
1
1
1
1
2
3
2
1
2
1
1
2
1
1
2
1
2
2
3
p01 p23
p0
p0 p23
p0
p01
p01 p23
p0
p01
p01 p23
p5
p5 p23
2p5
2p5 p23
mm,mm
1
1
p05
1
0.5
x,x / y,y,y
1
1
p015
1
0.33
v,m / v,v,m
v,v
v,m
1
2
2
2
2
3
p015 p23
p0 p5
p0 p5 p23
3
0.5
1
1
mm,mm
1
1
p0
1
1
mm,m64
2
2
p0 p23
x,x / v,v,x
2
2
p01 p5
x,m / v,v,m
2
2
p01 p23
1
0.5
0.5
1
0.5
0.5
1
0.5
0.5
SSSE3
SSSE3
SSSE3
1
1
SSE4.1
1
0.5
SSE4.1
0.5
1
1
1
0.5
0.5
1
0.5
0.5
1
1
2
2
SSE4.1
SSE4.1
SSE4.1
SSSE3
SSSE3
SSSE3
SSSE3
SSSE3
SSSE3
5
5
1
1
4
1
1
1
1
3
4
1
1
1
0.5
SSE4.1
SSE4.1
SSE4.1
SSE4.1
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
PSLLW/D/Q
PSRLW/D/Q
PSRAW/D/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
VPSLLVD/Q
VPSRAVD
VPSRLVD/Q
PSLLDQ
PSRLDQ
String instructions
PCMPESTRI
PCMPESTRI
PCMPESTRM
PCMPESTRM
PCMPISTRI
PCMPISTRI
PCMPISTRM
PCMPISTRM
mm,i
1
1
p0
1
1
x,i / y,y,i
1
1
p01
1
0.5
v,v,v
1
1
p01
1
0.5
AVX2
v,v,m
1
2
p01 p23
0.5
AVX2
x,i / v,v,i
1
1
p5
1
1
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
x,x,i
x,m128,i
8
8
9
9
3
4
3
4
8
8
9
9
3
4
3
4
6p05 2p16
12
3p0 2p16 4p5
6p05 2p16 p23
3p0
3p0 p23
3p0
3p0 p23
4
4
5
5
3
3
3
3
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
SSE4.2
1
2
1
2
p5
p5 p23
7
1
1
CLMUL
CLMUL
1
1
p0
4
1
AES
2
2
3
2
2
3
p0 p23
2p0
2p0 p23
8
1.5
2
2
AES
AES
AES
13
13
p0 p5
12
12
AES
13
13
12
AES
10
10
Encryption instructions
PCLMULQDQ
x,x,i
PCLMULQDQ
x,m,i
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,x
AESDEC,
AESDECLAST,
AESENC,
AESENCLAST
x,m
AESIMC
x,x
AESIMC
x,m
AESKEYGENASSIST
x,x,i
AESKEYGENASSIST
x,m,i
Other
EMMS
3p0 2p16 2p5 p23
p05
9
12
9
6
Floating point XMM and YMM instructions
Instruction
Operands
µops fused domain
µops unfused domain
µops each port
Latency
Reciprocal throughput
Comments
Move instructions
MOVAPS/D
VMOVAPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVAPS/D
MOVUPS/D
VMOVAPS/D
VMOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVHPS/D
MOVLPS/D
MOVLPS/D
MOVHLPS
MOVLHPS
MOVMSKPS/D
VMOVMSKPS/D
MOVNTPS/D
VMOVNTPS/D
SHUFPS/D
SHUFPS/D
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERMILPS/PD
VPERM2F128
VPERM2F128
VPERMPS
VPERMPS
VPERMPD
VPERMPD
BLENDPS/PD
BLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
VBLENDVPS/PD
VBLENDVPS/PD
MOVDDUP
MOVDDUP
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSS
VBROADCASTSD
VBROADCASTSD
VBROADCASTF128
MOVSH/LDUP
x,x
y,y
1
1
1
1
p015
p015
0-1
0-1
0.25
0.25
x,m128
1
1
p23
2
0.5
y,m256
1
1
p23
3
0.5
m128,x
1
2
p237 p4
3
1
m256,y
x,x
x,m32/64
m32/64,x
x,m64
m64,x
x,m64
m64,x
x,x
x,x
r32,x
r32,y
m128,x
m256,y
x,x,i / v,v,v,i
x,m,i / v,v,m,i
v,v,i
v,m,i
v,v,v
v,v,m
y,y,y,i
y,y,m,i
y,y,y
y,y,m
y,y,i
y,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,xmm0
x,m,xmm0
v,v,v,v
v,v,m,v
v,v
v,m
x,m32
y,m32
x,x
y,x
y,m64
y,x
y,m128
v,v
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
1
2
1
2
1
2
1
1
1
2
1
2
1
2
2
3
1
1
1
1
1
1
1
1
1
1
2
1
1
2
2
2
2
2
1
1
1
1
2
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
1
2
2
3
1
1
1
1
1
1
1
1
1
1
p237 p4
p5
p23
p237 p4
p23 p5
p4 p237
p23 p5
p4 p237
p5
p5
p0
p0
p4 p237
p4 p237
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p5
p5 p23
p015
p015 p23
p015
p015 p23
2p015
2p015 p23
p5
p23
p23
p23
p5
p5
p23
p5
p23
p5
3
1
3
3
4
3
4
3
1
1
2
3
~400
~400
1
1
1
0.5
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0.33
0.5
1
1
1
1
1
0.5
0.5
0.5
1
1
0.5
1
0.5
1
1
1
3
3
3
1
1
2
1
3
2
3
1
3
3
3
3
1
may eliminate
may eliminate
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
AVX2
AVX2
SSE4.1
SSE4.1
SSE4.1
SSE4.1
AVX
AVX
SSE3
SSE3
AVX
AVX
AVX2
AVX2
AVX
AVX2
AVX
SSE3
MOVSH/LDUP
UNPCKH/LPS/D
UNPCKH/LPS/D
EXTRACTPS
EXTRACTPS
VEXTRACTF128
VEXTRACTF128
INSERTPS
INSERTPS
VINSERTF128
VINSERTF128
VMASKMOVPS/D
VMASKMOVPS/D
VMASKMOVPS/D
VPGATHERDPS
VPGATHERDPS
VPGATHERQPS
VPGATHERQPS
VPGATHERDPD
VPGATHERDPD
VPGATHERQPD
VPGATHERQPD
Conversion
CVTPD2PS
CVTPD2PS
VCVTPD2PS
VCVTPD2PS
CVTSD2SS
CVTSD2SS
CVTPS2PD
CVTPS2PD
VCVTPS2PD
VCVTPS2PD
CVTSS2SD
CVTSS2SD
CVTDQ2PS
CVTDQ2PS
VCVTDQ2PS
VCVTDQ2PS
CVT(T) PS2DQ
CVT(T) PS2DQ
VCVT(T) PS2DQ
VCVT(T) PS2DQ
CVTDQ2PD
CVTDQ2PD
VCVTDQ2PD
VCVTDQ2PD
CVT(T)PD2DQ
CVT(T)PD2DQ
VCVT(T)PD2DQ
VCVT(T)PD2DQ
v,m
x,x / v,v,v
x,m / v,v,m
r32,x,i
m32,x,i
x,y,i
m128,y,i
x,x,i
x,m32,i
y,y,x,i
y,y,m128,i
v,v,m
m128,x,x
m256,y,y
x,[r+s*x],x
y,[r+s*y],y
x,[r+s*x],x
x,[r+s*y],x
x,[r+s*x],x
y,[r+s*x],y
x,[r+s*x],x
y,[r+s*y],y
1
1
1
2
2
1
2
1
2
1
2
2
4
4
4
4
5
4
5
4
5
4
1
1
2
2
3
1
2
1
2
1
2
2
4
4
4
4
5
4
5
4
5
4
p23
p5
p5 p23
p0 p5
p4 p5 p23
p5
p23 p4
p5
p23 p5
p5
p015 p23
p015 p23
p0 p4 p23
p0 p4 p23
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
p0 p1 p23 p5
x,x
x,m128
x,y
x,m256
x,x
x,m64
x,x
x,m64
y,x
y,m128
x,x
x,m32
x,x
x,m128
y,y
y,m256
x,x
x,m128
y,y
y,m256
x,x
x,m64
y,x
y,m128
x,x
x,m128
x,y
x,m256
2
2
2
2
2
2
2
1
2
1
2
1
1
1
1
1
1
1
1
1
2
2
2
1
2
3
2
2
2
3
2
3
2
3
2
2
2
2
2
2
1
2
1
2
1
2
1
2
2
2
2
2
2
3
2
3
p01 p5
p01 p5 p23
p01 p5
p01 p5 p23
p01 p5
p01 p5 p23
p01 p5
p01 p5 p23
p01 p5
p01 p5 p23
p01 p5
p01 p5 p23
p01
p01 p23
p01
p01 p23
p01
p01 p23
p01
p01 p23
p01 p5
p01 p23
p01 p5
p01 p23
p01 p5
p01 p23 p5
p01 p5
p01 p23 p5
3
1
5
3
6
1
4
3
5
3
13
13
12
13
5
7
5
5
7
5
4
4
4
4
5
7
5
7
0.5
1
1
1
1
1
1
1
1
1
0.5
0.5
1
1
4
5
2
4
2
4
2
4
1
1
1
1
1
1
1
0.5
1
0.5
2
2
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
1
0.5
1
0.5
1
1
1
1
SSE3
SSE3
SSE3
SSE4.1
SSE4.1
AVX
AVX
SSE4.1
SSE4.1
AVX
AVX
AVX
AVX
AVX
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX2
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
AVX
CVTPI2PS
CVTPI2PS
CVT(T)PS2PI
CVT(T)PS2PI
CVTPI2PD
CVTPI2PD
CVT(T) PD2PI
CVT(T) PD2PI
CVTSI2SS
CVTSI2SS
CVTSI2SS
CVT(T)SS2SI
CVT(T)SS2SI
CVT(T)SS2SI
CVTSI2SD
CVTSI2SD
CVT(T)SD2SI
CVT(T)SD2SI
VCVTPS2PH
VCVTPS2PH
VCVTPH2PS
VCVTPH2PS
Arithmetic
ADDSS/D PS/D
SUBSS/D PS/D
ADDSS/D PS/D
SUBSS/D PS/D
ADDSUBPS/D
ADDSUBPS/D
HADDPS/D
HSUBPS/D
HADDPS/D
HSUBPS/D
MULSS/D PS/D
MULSS/D PS/D
DIVSS
DIVPS
DIVSS DIVPS
DIVSD
DIVPD
DIVSD DIVPD
VDIVPS
VDIVPS
VDIVPD
VDIVPD
RCPSS/PS
RCPSS/PS
CMPccSS/D
CMPccPS/D
CMPccSS/D
CMPccPS/D
(U)COMISS/D
x,mm
x,m64
mm,x
mm,m128
x,mm
x,m64
mm,x
mm,m128
x,r32
x,r64
x,m32
r32,x
r64,x
r32,m32
x,r32/64
x,m32
r32/64,x
r32,m64
x,v,i
m,v,i
v,x
v,m
2
1
2
2
2
1
2
2
2
3
1
2
3
3
2
1
2
3
2
3
2
1
2
2
2
2
2
2
2
3
2
3
2
2
3
3
2
2
2
3
2
3
2
2
p0 p1
p01 p23
p0 p5
p0 p23
p01 p5
p01 p23
p01 p5
p01 p23 p5
p01 p5
p01 2p5
p1 p23
2p01
2p01 p5
2p01 p23
p01 p5
p01 p23
p0 p1
2p01 p23
p01 p5
p01 p4 p23
p01 p5
p01 p23
x,x / v,v,v
1
1
p01
4
0.5
x,m / v,v,m
x,x / v,v,v
x,m / v,v,m
1
1
1
2
1
2
p01 p23
p01
p01 p23
4
0.5
0.5
0.5
SSE3
SSE3
x,x / v,v,v
3
3
p01 2p5
6
2
SSE3
x,m / v,v,m
x,x / v,v,v
x,m / v,v,m
x,x
x,x
x,m
x,x
x,x
x,m
y,y,y
y,y,m256
y,y,y
y,y,m256
v,v
v,m
4
1
1
1
1
1
1
1
1
1
1
1
4
1
1
4
1
2
1
1
2
1
1
2
1
2
1
4
1
2
p1 2p5 p23
p01
p01 p23
p0
p0
p0 p23
p0
p0
p0 p23
p0
p0 p23
p0
p0 p23
p0
p0 p23
2
0.5
0.5
3
3
3-5
4
4
4
5
5
8
8
1
1
SSE3
x,x / v,v,v
1
1
p01
x,m / v,v,m
x,x
2
1
2
1
p01 p23
p0
6
7
5
5
6
7
6
7
6
6
5-7
5-7
4
11
11
13-14
13-14
11
13-14
4
4
2
3
1
1
1
0.5
1
1
2
2
3
1
1
1
2
2
1
1
1
1
1
1
0.5
0.5
1
F16C
F16C
F16C
F16C
AVX
AVX
AVX
AVX
(U)COMISS/D
MAXSS/D PS/D
MINSS/D PS/D
MAXSS/D PS/D
MINSS/D PS/D
x,m32/64
2
2
p0 p23
x,x / v,v,v
1
1
p01
x,m / v,v,m
1
2
p01 p23
ROUNDSS/D PS/D
v,v,i
2
2
2p01
8
ROUNDSS/D PS/D
v,m,i
x,x,i / v,v,v,i
x,m,i / v,v,m,i
x,x,i
x,m128,i
3
4
6
3
4
3
4
6
3
4
2p01 p23
3p01 p5
13
3p01 p23 p5 p6
2p01 p5
2p01 p23 p5
9
v,v,v
1
1
p01
4
v,v,m
1
2
p01 p23
x,x
x,m128
y,y
y,m256
x,x
x,x
x,m128
y,y
y,m256
v,v
v,m
1
1
1
4
1
1
1
1
4
1
1
1
2
1
4
1
1
2
1
4
1
2
p0
p0 p23
p0
p0 p23
p0
p0
p0 p23
p0
p0 p23
p0
p0 p23
AND/ANDN/OR/XORPS/PD
x,x / v,v,v
1
1
p015
AND/ANDN/OR/XORPS/PD
x,m / v,v,m
1
2
p015 p23
0.5
Other
VZEROUPPER
4
4
none
1
VZEROALL
25
25
p0 p1 p5 p6
12
34
4
3
106
136
105
121
247
304
257
257
168
34
4
4
p0 p1 p5 p6
p0 p5 p6 p23
p0 p4 p6 p237
DPPS
DPPS
DPPD
DPPD
VFMADD...
(all FMA instr.)
VFMADD...
(all FMA instr.)
Math
SQRTSS/PS
SQRTSS/PS
VSQRTPS
VSQRTPS
SQRTSD
SQRTPD
SQRTSD/PD
VSQRTPD
VSQRTPD
RSQRTSS/PS
RSQRTSS/PS
1
4
0.5
0.5
12
12
15-16
15-16
15-16
4
1
SSE4.1
1
1.5
1.5
1
1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
SSE4.1
0.5
FMA
0.5
FMA
3
3
6
6
4-6
4-6
4-6
9-12
9-12
1
1
AVX
AVX
AVX
AVX
Logic
VZEROALL
LDMXCSR
STMXCSR
FXSAVE
FXSAVE
FXRSTOR
FXRSTOR
XSAVE
XSAVE
XRSTOR
XRSTOR
XSAVEOPT
m32
m32
m4096
m4096
m4096
m4096
m
1
5
5
78
64
76
77
107
107
122
122
74
0.33
12
3
2
78
64
76
77
107
107
122
122
74
AVX
AVX, 32 bit
AVX, 64 bit
32 bit mode
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
32 bit mode
64 bit mode
Pentium 4
Intel Pentium 4
List of instruction timings and μop breakdown
This list is measured for a Pentium 4, model 2. Timings for model 3 may be more like the values for
P4E, listed on the next sheet.
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64-bit mmx register, xmm = 128-bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc.
μops: Number of μops issued from the instruction decoder and stored in the trace cache.
Microcode: Number of additional μops issued from microcode ROM.
Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under “How the values were measured”.
Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.
Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.
Port: The port through which each μop goes to an execution unit. Two independent μops can start to execute simultaneously only if they are going through different ports.
Execution unit: Use this information to determine additional latency. When an instruction with more than one μop uses more than one execution unit, only the first and the last execution unit are listed.
Execution subunit: Throughput measures apply only to instructions executing in the same subunit.
Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
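The latency, additional latency and reciprocal throughput columns can be combined into rough cycle estimates. The following Python sketch is only an illustration under stated assumptions: the names Step, chain_latency, instructions_per_clock and throughput_bound are hypothetical, and the figures in the example are made up rather than taken from the tables. It sums the latencies along a dependency chain, adds the additional latency whenever the next dependent instruction runs in a different execution unit, and converts a reciprocal throughput into instructions per clock.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One instruction in a dependency chain; all figures are clock cycles."""
    latency: float             # value from the "Latency" column
    additional_latency: float  # value from the "Additional latency" column
    unit: str                  # label from the "Execution unit" column


def _unit_key(unit):
    # The text states that there is no additional latency between ALU0 and
    # ALU1, so both labels are mapped to the same key (naming is assumed).
    return "alu" if unit in ("alu0", "alu1") else unit


def chain_latency(steps):
    """Estimate the total delay of a chain where each step feeds the next."""
    total = 0.0
    for i, step in enumerate(steps):
        total += step.latency
        if i + 1 < len(steps):
            next_unit = _unit_key(steps[i + 1].unit)
            if next_unit != _unit_key(step.unit):
                # Next dependent instruction is in a different execution unit.
                total += step.additional_latency
    return total


def instructions_per_clock(reciprocal_throughput):
    """A reciprocal throughput of 0.25 corresponds to 4 per clock."""
    return 1.0 / reciprocal_throughput


def throughput_bound(count, reciprocal_throughput):
    """Lower bound in cycles for `count` independent instructions that all
    execute in the same execution subunit."""
    return count * reciprocal_throughput


if __name__ == "__main__":
    # Hypothetical figures, chosen only to exercise the functions.
    chain = [
        Step(0.5, 0.5, "alu0"),
        Step(0.5, 0.5, "alu1"),
        Step(2.0, 1.0, "mmx"),
        Step(6.0, 1.0, "fp"),
    ]
    print(chain_latency(chain))
    print(instructions_per_clock(0.25))
```

These estimates are lower bounds in the same sense as the table values: cache misses, misalignment and exceptions are not modelled.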
Integer instructions
alu0/1
alu0/1
load
load
store
store
86
86
86
86
86
86
86
86
sse2
386
386
386
386
ppro
86
86
86
86
186
86
86
86
186
86
86
86
86
186
86
86
386
386
386
86
86
86
86
486
alu0/1
load
alu0
alu0/1
alu0/1
alu0/1
int,alu
int,alu
int,alu
int
alu0/1
int
int,alu
86
sse
sse
Notes
Instruction set
r,m
r
r,r/i
m
m
Subunit
r,[r+r/i]
r,[r+r+i]
r,[r*i]
r,[r+r*i]
r,[r+r*i+i]
0
0.5 0.5-1 0.25 0/1
0
0.5 0.5-1 0.25 0/1
0
2
0
1
2
0
3
0
1
2
0
1
2
0
0
2 0,3
2
6
4
12
0
14
0
≈33
0
0.5 0.5-1 0.25 0/1
0
2
0
1
2
0
0.5 0.5-1 0.5 0
0
3 0.5-1 1 2,0
0
6
0
3
0
1.5 0.5-1 1 0/1
8 >100
0
3
0
1
2
0
1
2
0
2
4
7
4
10
10
19
0
1
0
1
8
14
5
13
8
52
16
14
0
0.5 0.5-1 0.25 0/1
0
1 0.5-1 0.5 0/1
0
4 0.5-1 1
1
0
4 0.5-1 1
1
0
4 0.5-1 1
1
0
4
0
4
1
0
0.5 0.5-1 0.5 0/1
0
5
0
1
1
7
15
0
7
0
2
64
>1000
2
6
2
6
Execution unit
r
m
sr
Port
Reciprocal throughput
r
i
m
sr
Additional latency
1
1
1
2
1
3
4
4
2
1
1
1
2
3
3
4
4
2
2
3
4
4
4
2
4
4
4
4
1
2
3
2
3
1
1
3
4
3
8
4
4
Latency
r,r
r,i
r32,m
r8/16,m
m,r
m,i
r,sr
sr,r/m
m,r32
r,r
r,m
r,r
r,m
r,r/m
r,r
r,m
Microcode
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX
MOVZX
MOVSX
MOVSX
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D)
PUSHA(D)
POP
POP
POP
POPF(D)
POPA(D)
LEA
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
BSWAP
IN, OUT
PREFETCHNTA
PREFETCHT0/1/2
Operands
μops
Instruction
c
b, c
a, q
c
c
a, e
d
SFENCE
LFENCE
MFENCE
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC
INC, DEC
NEG
NEG
AAA, AAS
DAA, DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
IDIV
IDIV
IDIV
CBW
CWD, CDQ
CWDE
Logic instructions
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
4
4
4
r,r
r,m
m,r
r,r
r,i
r,m
m,r
r,r
r,m
r
m
r
m
r8/32
r16
m8/32
m16
r32,r
r32,(r),i
r16,r
r16,r,i
r16,m16
r32,m32
r,m,i
r8/m8
r16/m16
r32/m32
r8/m8
r16/m16
r32/m32
r,r
r,m
m,r
r,r
r,m
r
1
2
3
4
3
4
4
1
2
2
4
1
3
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
2
2
1
1
2
3
1
2
1
2
2
2
40
38
100
0
0.5 0.5-1 0.25
0
1 0.5-1 1
0
≥8
≥4
4
6
0
6
0
6
0
6
6
8
0
8
7
≥9
8
0
0.5 0.5-1 0.25
0
1 0.5-1 1
0
0.5 0.5-1 0.5
0
4
≥4
0
0.5 0.5-1 0.5
0
≥3
27 90
57 100
10 22
22 56
6
16
0
8
7
17
0
8
7-8 16
0
8
10 16
0
8
0
14
0
4.5
0
14
0
4.5
5
16
0
9
5
15
0
8
7
15
0
10
0
14
0
8
7
14
0
10
20 61
0
24
18 53
0
23
21 50
0
23
24 61
0
24
22 53
0
23
20 50
0
23
0
1 0.5-1 1
0
1 0.5-1 0.5
0
0.5 0.5-1 0.5
0
0
0
0
0
0
0.5
≥1
≥8
0.5
≥1
0.5
0.5-1 0.5
0.5-1 ≥ 1
≥4
0.5-1 0.5
0.5-1 ≥ 1
0.5-1 0.5
sse
sse2
sse2
0/1
alu0/1
1
1
1
int,alu
int,alu
int,alu
0/1
alu0/1
0/1
alu0/1
0
alu0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0/1
0
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
alu0
alu0/1
alu0
0
alu0
0
alu0
0
alu0
fpmul
fpdiv
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpmul
fpdiv
fpdiv
fpdiv
fpdiv
fpdiv
fpdiv
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
386
386
386
186
386
386
186
86
86
386
86
86
386
86
86
386
86
86
86
86
86
86
c
c
c
c
c
a
a
a
a
a
c
c
c
c
c
NOT
SHL, SHR, SAR
SHL, SHR, SAR
ROL, ROR
ROL, ROR
RCL, RCR
RCL, RCR
RCL, RCR
SHL,SHR,SAR,ROL,ROR
RCL, RCR
RCL, RCR
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BT
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BSF, BSR
BSF, BSR
SETcc
SETcc
CLC, STC
CMC
CLD
STD
CLI
STI
m
r,i
r,CL
r,i
r,CL
r,1
r,i
r,CL
4
1
2
1
2
1
4
4
0
0
0
0
0
0
15
15
1
0
1
0
1
0
0
≥4
1
1
1
1
1
15
14
m,i/CL
m,1
m,i/CL
r,r,i/CL
m,r,i/CL
r,i
r,r
m,i
m,r
r,i
r,r
m,i
m,r
r,r
r,m
r
m
4
4
4
4
4
3
2
4
4
3
2
4
4
2
3
3
4
3
3
4
4
4
4
7-8 10
0
7
10
0
18 18-28
14 14
0
18 14
0
0
4
0
0
4
0
0
4
0
12 12
0
0
6
0
0
6
0
7
18
0
15 14
0
0
4
0
0
4
0
0
5
0
0
5
0
0
10
0
0
10
0
7
52
0
5
48
0
5
35
12 43
1
4
3
3
4
1
4
4
3
4
4
4
4
4
4
4
0
28
0
0
31
0
4
4
0
34
4
4
38
0
0
33
Control transfer instructions
JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
4
6
4
6
4
16
16
0
118
4
4
11
0
0
0
2
0
8
9
2
2
11
1
1
1
1
1
1
1
int
int
int
int
int
int
int
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
10
10
14
14
14
2
1
2
12
2
4
8
14
2
3
1
3
2
2
52
48
35
43
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
mmxsh
1
118
4
4
11
2-4
2-4
2-4
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
alu0
branch
alu0
alu0
branch
branch
alu0
alu0
alu0
alu0
branch
branch
branch
branch
alu0
alu0
branch
branch
alu0
alu0
branch
branch
86
186
86
186
86
86
186
86
86
86
86
386
386
386
386
386
386
386
386
386
386
386
386
386
386
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
d
d
d
d
d
d
d
d
d
d
d
d
d
d
RETF
IRET
ENTER
ENTER
LEAVE
BOUND
INTO
INT
i
4
33 11
4
48 24
4
12 26
4 45+24n
4
0
3
4
14 14
4
5
18
4
84 644
i,0
i,n
m
i
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
CPUID
RDTSC
Notes:
a)
b)
c)
d)
e)
q)
4
4
4
4
4
4
4
4
4
4
0
0
26
128+16n
3
14
18
3
6
5n ≈ 4n+36
2
6
2n+3 ≈ 3n+10
4
6
≈163+1.1n
3
≈ 40+6n
≈4n
5
≈ 50+8n
≈4n
1
0
0
1
0
0
4
2
4 39-81
4
7
86
86
186
186
186
186
86
86
6
86
86
6
86
86
4
86
86
6
86
86
8
86
86
0.25 0/1
0.25 0/1
alu0/1
alu0/1
200-500
80
86
ppro
sse2
p5
p5
a) Add 1 μop if source is a memory operand.
b) Uses an extra μop (port 3) if a SIB byte is used. A SIB byte is needed if the memory operand has more than one pointer register, or a scaled index, or ESP is used as base pointer.
c) Add 1 μop if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
d) Has (false) dependence on the flags in most cases.
e) Not available on PMMX.
q) Latency is 12 in 16-bit real or virtual mode, 24 in 32-bit protected mode.
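The SIB-byte condition in the notes above can be written as a small predicate. This is a minimal sketch of the rule exactly as the note states it, not a full x86 addressing-mode encoder; the function names needs_sib and extra_store_uops and their parameters are hypothetical.

```python
def needs_sib(base=None, index=None, scale=1):
    """Return True if the memory operand needs a SIB byte according to the
    note: more than one pointer register, a scaled index, or ESP as base.

    base and index are register names such as "EAX", or None if absent.
    """
    more_than_one_pointer = base is not None and index is not None
    scaled_index = index is not None and scale != 1
    esp_as_base = base is not None and base.upper() == "ESP"
    return more_than_one_pointer or scaled_index or esp_as_base


def extra_store_uops(base=None, index=None, scale=1):
    """Extra μops for an instruction carrying that note: one more μop
    (port 3) when a SIB byte is used, otherwise none."""
    return 1 if needs_sib(base, index, scale) else 0


if __name__ == "__main__":
    print(needs_sib(base="EAX"))                 # single pointer register
    print(needs_sib(base="EAX", index="EBX"))    # two pointer registers
    print(needs_sib(index="ECX", scale=4))       # scaled index
    print(needs_sib(base="ESP"))                 # ESP used as base pointer
```

A practical reading of the rule: where the note applies, addressing through a single non-ESP register avoids the extra μop.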
Floating point x87 instructions
0
2
mov
load
Notes
1
1
Instruction set
0
0
Subunit
6
≈7
Execution unit
0
0
Port
Reciprocal throughput
1
1
Additional latency
r
m32/64
Latency
Microcode
Move instructions
FLD
FLD
Operands
μops
Instruction
87
87
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FIST
FIST
FISTP
FLDZ
FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FADD,FSUB(R)
FIADD,FISUB(R)
FIADD,FISUB(R)
FMUL(P)
FMUL
FIMUL
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FIDIV(R)
FABS
FCHS
FCOM(P), FUCOM(P)
FCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
m80
m80
r
m32/64
m80
m80
r
m16
m32/64
m16
m32/64
m
st0,r
r
AX
AX
m16
m16
m16
r
m
m16
m32
r
m
m16
m32
r
m
m16
m32
r
m
r
m16
m32
3
3
1
2
3
3
1
3
2
3
2
3
1
2
4
3
1
4
6
4
4
4
4
75
0
0
8
311
0
3
0
0
0
0
0
0
0
0
0
0
0
4
4
7
1
2
3
3
1
2
3
3
1
2
3
3
1
1
1
2
2
3
4
3
1
1
3
6
6
0
0
4
0
0
0
4
0
0
0
4
0
0
0
0
0
0
0
4
0
0
0
15
84
84
6
≈7
0
0
≈ 10
≈ 10
≈ 10
≈ 10
≈ 10
0
2-4
1
0
11
11
0
0
0
(3)
5
5
6
5
7
7
7
7
43
43
43
43
2
2
2
2
2
10
1
1
0
1
1
1
1
1
0
0
0
0
1
1
0
0
0
0
2
2
2
23
212
212
0
0
0
0
6
2
90
2
1
0
2-3 0
8
0
400 0
1
0
6
2
1
2
2-4 0
2-3 0
2-4 0
2
0
2
0
4
1
4
0
1
0
3
1
3
1
6
0
6
0
(8) 0,2
1
1
6
2
2
2
6
2
43
43
43
43
1
1
1
1
1
3
6
2
1
1
15
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0,1
1
1,2
1
1
0,1
1
1
load
load
mov
store
store
store
mov
load
load
store
store
store
mov
mov
fp
mov
mov
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
287
287
87
87
87
add
add
add
add
mul
mul
mul
mul
div
div
div
div
misc
misc
misc
misc
misc
misc
misc
misc
misc
misc
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
87
87
87
87
387
e
f
g, h
g, h
g, h
g, h
Math
FSQRT
FLDPI, etc.
FSIN
FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
Notes:
e)
f)
g)
h)
i)
1
2
6
6
7
6
3
3
3
3
3
11
0
43
0
≈150 ≈180
≈175 ≈207
≈178 ≈216
≈160 ≈230
92 ≈187
24 57
15 20
45 ≈165
60 ≈200
134 ≈242
0
1
2
4
6
4
4
4
4
0
0
4
29
174
96
69
94
0
0
1
0
43
3
≈170
≈207
≈211
≈200
≈153
66
20
63
90
≈220
456
528
132
208
1
1
1
1
1
1
1
1
1
1
1
1
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
1
0
1
0
96
1
172
420 0,1
532
96
208
div
87
87
387
387
387
87
87
87
87
87
87
87
mov
mov
87
87
87
87
87
87
sse
sse
g, h
i
i
e) Not available on PMMX.
f) The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput are 143.
g) Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 23, double precision: 38, long double precision (default): 43.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
i) Takes 6 μops more and 40-80 clocks more when XMM registers are disabled.
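The FLDCW rule and the precision-dependent timing in the notes above lend themselves to a small model. The sketch below is illustrative only: the names are hypothetical, and it assumes that the first FLDCW in a sequence, which has no preceding FLDCW, takes the slow path.

```python
# Latency and reciprocal throughput by precision setting, from the notes.
PRECISION_CLOCKS = {"single": 23, "double": 38, "long double": 43}

FLDCW_FAST = 3    # new value equals the value before the preceding FLDCW
FLDCW_SLOW = 143  # all other cases


def precision_dependent_clocks(precision="long double"):
    """Clock count for an instruction whose timing depends on the precision
    setting in the control word; long double is the default setting."""
    return PRECISION_CLOCKS[precision]


def fldcw_total_latency(initial_cw, loads):
    """Sum the FLDCW latencies for a sequence of control word loads."""
    current = initial_cw
    before_previous = None  # control word value before the preceding FLDCW
    total = 0
    for new_value in loads:
        if before_previous is not None and new_value == before_previous:
            total += FLDCW_FAST
        else:
            # Assumption: the first FLDCW is treated as the slow case.
            total += FLDCW_SLOW
        before_previous = current
        current = new_value
    return total


if __name__ == "__main__":
    # Alternating between the same two control words stays on the fast path
    # after the first load.
    print(fldcw_total_latency(0x037F, [0x0F7F, 0x037F, 0x0F7F, 0x037F]))
    print(precision_dependent_clocks("single"))
```

The model shows why code that switches rounding mode should alternate between two fixed control words where possible: any third value pays the slow latency again.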
Integer MMX and XMM instructions
0
1
2
0
fp
mmx
load
fp
alu
mmx
mmx
mmx
sse2
Notes
1
2
1
2
Instruction set
1
0
0
1
Subunit
5
2
≈8
10
Execution unit
0
0
0
0
Port
Reciprocal throughput
2
2
1
2
Additional latency
r32, mm
mm, r32
mm,m32
r32, xmm
Latency
Microcode
Move instructions
MOVD
MOVD
MOVD
MOVD
Operands
μops
Instruction
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKHBW/WD/DQ/QDQ
PUNPCKLBW/WD/DQ/QDQ
PSHUFD
PSHUFL/HW
PSHUFW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
PINSRW
Arithmetic instructions
PADDB/W/D
PADD(U)SB/W
PSUBB/W/D
PSUB(U)SB/W
PADDQ, PSUBQ
PADDQ, PSUBQ
PCMPEQB/W/D
PCMPGTB/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PMULUDQ
PAVGB/W
xmm, r32
xmm,m32
m32, r
mm,mm
xmm,xmm
r,m64
m64,r
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
mm,xmm
xmm,mm
m,mm
m,xmm
2
1
2
1
1
1
2
1
1
2
4
4
3
2
3
2
0
0
0
0
0
0
0
0
0
0
0
6
0
0
0
0
6
≈8
≈8
6
2
≈8
≈8
6
≈8
≈8
2
1
2
1
2
1
2
1
1
2
2
2
2
2
75
18
1
2
0,1
0
1
2
0
0
2
0
2
0
0,1
0,1
0
0
mm,r/m
1
0
2
1
1
1
mmx
shift
mmx
a
xmm,r/m
1
0
4
1
2
1
mmx
shift
mmx
a
mm,r/m
1
0
2
1
1
1
mmx
shift
mmx
a
xmm,r/m
1
0
4
1
2
1
mmx
shift
sse2
a
xmm,r/m
xmm,xmm,i
xmm,xmm,i
mm,mm,i
mm,mm
xmm,xmm
r32,r
r32,mm,i
r32,xmm,i
mm,r32,i
xmm,r32,i
1
1
1
1
4
4
2
3
3
2
2
0
0
0
0
4
6
0
0
0
0
0
2
4
2
2
1
1
1
1
shift
shift
shift
shift
sse2
sse2
sse2
mmx
sse
sse2
a
1
1
1
1
1
1
1
1
1
0
0
0,1
1
1
1
1
mmx
mmx
mmx
mmx
mov
mov
7
8
9
3
4
2
2
2
1
7
10
3
2
2
2
2
mmx-alu0
mmx-int
mmx-int
int-mmx
int-mmx
sse
sse
sse2
sse
sse2
r,r/m
1
0
2
1
1,2
1
mmx
alu
mmx
a,j
r,r/m
mm,r/m
xmm,r/m
1
1
1
0
0
0
2
2
4
1
1
1
1,2
1
2
1
1
1
mmx
mmx
fp
alu
alu
add
mmx
sse2
sse2
a,j
a
a
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
1
1
0
0
0
0
0
0
2
6
6
6
6
2
1
1
1
1
1
1
1,2
1,2
1,2
1,2
1,2
1,2
1
1
1
1
1
1
mmx
fp
fp
fp
fp
mmx
alu
mul
mul
mul
mul
alu
mmx
mmx
sse
mmx
sse2
sse
a,j
a,j
a,j
a,j
a,j
a,j
8
8
1
0
0
1
0
1
1
mmx
load
mov
mmx
load
mov
mov
load
mov
load
mov
mov-mmx
mov-mmx
shift
shift
sse2
sse2
mmx
mmx
sse2
mmx
mmx
sse2
sse2
sse2
sse2
sse2
k
k
sse2
sse2
mov
mov
sse
sse2
PMIN/MAXUB
PMIN/MAXSW
PSADBW
Logic
PAND, PANDN
POR, PXOR
PSLL/RLW/D/Q, PSRAW/D
PSLLDQ, PSRLDQ
Other
EMMS
Notes:
a)
j)
k)
r,r/m
r,r/m
r,r/m
1
1
1
0
0
0
2
2
4
1
1
1
1,2
1,2
1,2
1
1
1
mmx
mmx
mmx
alu
alu
alu
sse
sse
sse
a,j
a,j
a,j
r,r/m
r,r/m
1
1
0
0
2
2
1
1
1,2
1,2
1
1
mmx
mmx
alu
alu
mmx
mmx
a,j
a,j
r,i/r/m
xmm,i
1
1
0
0
2
4
1
1
1,2
2
1
1
mmx
mmx
shift
shift
mmx
sse2
a,j
a
4
11
12
12
0
mmx
a) Add 1 μop if source is a memory operand.
j) Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
k) It may be advantageous to replace this instruction by two 64-bit moves.
Floating point XMM instructions
2
2
2
1
1
1
0
2
0
0
2
0
1
1
2
0
1
1
0
4
2
sse
0
0
0
0
0
0
2
4
3
2
2
2
0
0
1
1
1
1
sse
sse/2
sse
sse
sse
sse
2
2
≈7
0
1
0
4
2
0
0
mov
mmx
mmx
shift
shift
mmx
mmx
shift
shift
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
MOVHPS/D, MOVLPS/D
MOVNTPS/D
MOVMSKPS/D
SHUFPS/D
UNPCKHPS/D
UNPCKLPS/D
6
4
4
2
1
1
1
1
fp
mmx
mmx
mmx
shift
shift
shift
Notes
m,r
m,r
r32,r
r,r/m,i
r,r/m
r,r/m
1
1
2
1
2
8
2
2
1
2
2
2
0
mov
Instruction set
3
0
0
Subunit
r,m
6
≈7
≈7
6
Execution unit
0
0
0
0
0
6
0
0
0
0
0
0
Port
Reciprocal throughput
1
1
2
1
4
4
1
1
1
2
1
1
Additional latency
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r
Operands
Latency
Microcode
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS
MOVSD
MOVSS, MOVSD
MOVSS, MOVSD
MOVHLPS
MOVLHPS
MOVHPS/D, MOVLPS/D
μops
Instruction
k
k
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
4
2
4
4
1
3
1
2
4
4
3
3
3
4
2
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
7
10
14
10
4
9
4
9
10
11
7
11
10
15
8
8
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
4
2
6
6
2
4
2
2
4
5
2
3
3
6
2.5
2.5
1
1
1
1
1
1
1
1
1
1
0,1
0,1
1
1
1
1
mmx
fp-mmx
mmx
mmx
fp
mmx-fp
fp
fp-mmx
mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp
fp
shift
sse2
shift
shift
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
1
1
1
2
0
0
0
0
0
0
0
0
4
4
6
23
39
38
69
4
1
1
1
0
0
0
0
1
2
2
2
23
39
38
69
4
1
1
1
1
1
1
1
1
fp
fp
fp
fp
fp
fp
fp
mmx
r,r/m
1
0
4
1
2
1
r,r/m
r,r/m
1
2
0
0
4
6
1
1
2
3
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
1
0
2
1
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
2
2
0
0
0
0
0
0
23
39
38
69
4
4
0
0
0
0
1
1
m
m
4
4
8
4
98
Arithmetic
ADDPS/D ADDSS/D
SUBPS/D SUBSS/D
MULPS/D MULSS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPPS RCPSS
MAXPS/D
MAXSS/D MINPS/D
MINSS/D
CMPccPS/D
CMPccSS/D
COMISS/D UCOMISS/D
Other
LDMXCSR
STMXCSR
Notes:
sse2
a
sse2
sse2
sse2
a
sse2
a
sse
a
a
a
a
a
sse2
sse
a
add
add
mul
div
div
div
div
sse
sse
sse
sse
sse
sse2
sse2
sse
a
a
a
a,h
a,h
a,h
a,h
a
fp
add
sse
a
1
1
fp
fp
add
add
sse
sse
a
a
2
1
mmx
alu
sse
a
23
39
38
69
3
4
1
1
1
1
1
1
fp
fp
fp
fp
mmx
mmx
div
div
div
div
sse
sse
sse2
sse2
sse
sse
a,h
a,h
a,h
a,h
a
a
100
6
1
1
sse2
sse2
sse2
sse
sse2
sse
sse2
sse
sse
a
a
a
a
a
a
a
a) Add 1 μop if source is a memory operand.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
k) It may be advantageous to replace this instruction by two 64-bit moves.
Prescott
Intel Pentium 4 w. EM64T (Prescott)
List of instruction timings and μop breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate constant, r = any register, r32 = 32-bit register, etc., mm = 64 bit mmx register, xmm = 128 bit xmm register, sr = segment register, m = any memory operand including indirect operands, m64 means 64-bit memory operand, etc., mabs = memory operand with 64-bit absolute address.
μops: Number of μops issued from instruction decoder and stored in trace cache.
Microcode: Number of additional μops issued from microcode ROM.
Latency: This is the delay that the instruction generates in a dependency chain if the next dependent instruction starts in the same execution unit. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs, infinity and exceptions increase the delays. The latency of moves to and from memory cannot be measured accurately because of the problem with memory intermediates explained above under “How the values were measured”.
Additional latency: This number is added to the latency if the next dependent instruction is in a different execution unit. There is no additional latency between ALU0 and ALU1.
Reciprocal throughput: This is also called issue latency. This value indicates the number of clock cycles from when the execution of an instruction begins until a subsequent independent instruction can begin to execute in the same execution subunit. A value of 0.25 indicates 4 instructions per clock cycle in one thread.
Port: The port through which each μop goes to an execution unit. Two independent μops can start to execute simultaneously only if they are going through different ports.
Execution unit: Use this information to determine additional latency. When an instruction with more than one μop uses more than one execution unit, only the first and the last execution units are listed.
Execution subunit: Throughput measures apply only to instructions executing in the same subunit.
Instruction set: Indicates the compatibility of an instruction with other 80x86 family microprocessors. The instruction can execute on microprocessors that support the instruction set indicated.
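The Latency, Additional latency and Reciprocal throughput columns combine in a simple way when estimating a dependency chain. The sketch below, in Python, adds the additional latency only when the next dependent instruction runs in a different execution unit, and treats ALU0 and ALU1 as one unit, as stated above. The class `Op`, the helper names and all numeric values are hypothetical illustrations, not figures from the tables.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    latency: float             # "Latency" column
    additional_latency: float  # "Additional latency" column
    unit: str                  # "Execution unit" column


def same_unit(a: str, b: str) -> bool:
    """ALU0 and ALU1 are treated as one unit: no additional latency between them."""
    alus = {"alu0", "alu1"}
    return a == b or (a in alus and b in alus)


def chain_latency(ops):
    """Minimum clock count for a chain where each op depends on the previous one."""
    total = 0.0
    for op, next_op in zip(ops, ops[1:] + [None]):
        total += op.latency
        # The additional latency is paid only when crossing to another unit.
        if next_op is not None and not same_unit(op.unit, next_op.unit):
            total += op.additional_latency
    return total


def instructions_per_clock(reciprocal_throughput: float) -> float:
    """A reciprocal throughput of 0.25 corresponds to 4 instructions per clock."""
    return 1.0 / reciprocal_throughput


# Hypothetical three-instruction chain that bounces between two units.
chain = [Op("op1", 2, 1, "mmx"), Op("op2", 4, 1, "fp"), Op("op3", 2, 1, "mmx")]
print(chain_latency(chain))          # 2 + 1 + 4 + 1 + 2
print(instructions_per_clock(0.25))
```

The result is a lower bound in the same sense as the table values: cache misses, misalignment and exceptions are not modelled.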
Integer instructions
86
86
x64
Notes
alu0/1
alu0/1
alu0/1
Instruction set
0.25 0/1
0.25 0/1
0.5 0/1
Subunit
0
0
0
Execution unit
1
1
Port
Reciprocal throughput
0
0
0
Additional latency
1
1
1
Latency
r,r
r8/16/32,i
r64,i32
Microcode
Move instructions
MOV
MOV
MOV
Operands
μops
Instruction
c
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVZX
MOVZX
MOVZX
MOVSX
MOVSX
MOVSX
MOVSXD
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LEA
LEA
LEA
LEA
LEA
LEA
LAHF
SAHF
SALC
LDS, LES, ...
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVSB
REP MOVSW
r64,i64
r8/16,m
r32/64,m
m,r
m,i
m64,i32
r,sr
sr,r/m
r,mabs
mabs,r
m,r32
r,r
r16,r8
r,m
r16,r8
r32/64,r8/16
r,m
r64,r32
r,r/m
r,r
r,m
r
i
m
sr
r
m
sr
r,[m]
r,[r+r/i]
r,[r+r+i]
r,[r*i]
r,[r+r*i]
r,[r+r*i+i]
r,m
2
0
0
2
0
3
0
1
0
2
0
1
0
2
0
2
0
1
2
1
8
3
0
3
0
2
0
1
0
1
0
2
0
2
0
1
0
2
0
2
0
2
0
1
0
1
0
2
0
3
0
1
0
1
0
3
0
9.5
0
3
0
2
0
2
6 ≈100
4
0
6
2
0
2
2
0
2
3
0
2
1
3
1
3
1
9
2
0
1
0
2
6
1
8
1
8
2
16
1
0
1
0
2.5
0
2
0
3.5
0
3
0
3.5
0
2
0
3.5
0
3
0
3.5
0
1
0
4
0
1
0
5
0
2
0
0
2
10
1
3
8
1
5n ≈ 4n+50
1
2
8
1 2.5n ≈ 3n
1
4
8
9 ≈.3n ≈.3n
1 ≈.5-1.1n≈ .6-1.4n
1
1
1
2
2
2
8
27
1
2
2
0.25
1
1
1
0.5
1
0.5
3
1
2
2
2
9
9
16
1
10
30
70
15
0.25
0.25
0.5
1
1
1
1
28
8
1
2
2
0
0,3
0,3
alu1
load
load
store
store
store
0/1
0/1
2
0
0
2
0
alu0/1
alu0/1
load
alu0
alu0
load
alu0
0/1
alu0/1
0/1
0/1
0/1
1
0,1
1
1
0/1
1
alu0/1
alu0/1
alu0/1
alu
alu0,1
alu
int
alu0/1
int
x64
86
86
86
86
x64
86
86
x64
x64
sse2
386
386
386
386
386
386
x64
PPro
86
86
86
86
186
86
86
86
186
86
86
86
86
186
86
86
86
386
386
386
86
86
86
86
86
86
8
86
86
8
86
86
86
b,c
a,q
l
l
c
c
a,c,o
a,c,o
a
a,e
m
m
p
n
d,n
m
m
REP MOVSD
REP MOVSQ
BSWAP
IN, OUT
PREFETCHNTA
PREFETCHT0/1/2
SFENCE
LFENCE
MFENCE
Arithmetic instructions
ADD, SUB
ADD, SUB
ADD, SUB
ADC, SBB
ADC, SBB
ADC, SBB
ADC, SBB
CMP
CMP
INC, DEC
INC, DEC
NEG
NEG
AAA, AAS
DAA, DAS
AAD
AAM
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
MUL, IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
r
r,r/i
m
m
r,r
r,m
m,r
r,r/i
r,m
m,r
m,i
r,r
r,m
r
m
r
m
r8
r16
r32
r64
m8
m16
m32
m64
r16,r16
r16,r16,i
r32,r32
r32,(r32),i
r64,r64
r64,(r64),i
r16,m16
r32,m32
r64,m64
r,m,i
r8/m8
r16/m16
r32/m32
r64/m64
1
1
1
1
1
1
1
1
1
1
2
3
3
2
2
3
1
2
2
4
1
3
1
1
2
2
1
4
3
1
2
2
3
2
1
2
1
1
1
1
2
2
2
3
1
1
1
1
≈1.1n ≈ 1.4 n
86
x64
alu
≈1.1n ≈ 1.4 n
0
52
0
0
2
2
4
1
0
0
0
0
5
6
5
0
0
0
0
0
0
10
16
5
17
0
0
0
5
0
5
0
6
0
0
0
0
0
0
0
0
0
0
20
19
21
31
1
1
5
10
10
20
22
1
1
1
5
1
5
26
29
13
71
10
11
11
11
10
11
11
11
10
11
10
10
10
10
10
10
10
10
74
73
76
63
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
>1000
1
1
50
50
124
86
sse
sse
sse
sse2
sse2
0.25 0/1
1
2
10
1
10
1
10
10
0.25 0/1
1
0.5 0/1
3
0.5 0
3
2.5
2.5
2.5
2.5
2.5
2.5
2.5
2.5
2.5
1-2.5
34
34
34
52
486
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
alu0/1
int,alu
int,alu
alu0/1
alu0/1
alu0
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
int
mul
fpdiv
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
mul
fpdiv
fpdiv
fpdiv
fpdiv
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
x64
86
86
86
x64
386
186
386
386
x64
x64
386
386
x64
186
86
86
386
x64
c
c
c
c
c
m
m
m
m
a
a
a
a
IDIV
IDIV
IDIV
IDIV
CBW
CWD
CDQ
CQO
CWDE
CDQE
SCAS
REP SCAS
CMPS
REP CMPS
Logic
AND, OR, XOR
AND, OR, XOR
AND, OR, XOR
TEST
TEST
NOT
NOT
SHL
SHR, SAR
SHR, SAR
SHL
SHR, SAR
SHR, SAR
ROL, ROR
ROL, ROR
ROL, ROR
ROL, ROR
RCL, RCR
RCL
RCR
RCL
RCR
SHL, SHR, SAR
ROL, ROR
SHL, SHR, SAR
ROL, ROR
RCL, RCR
RCL, RCR
RCL, RCR
SHLD, SHRD
SHLD
SHRD
SHLD, SHRD
SHLD
r8/m8
r16/m16
r32/m32
r64/m64
r,r
r,m
m,r
r,r
r,m
r
m
r,i
r8/16/32,i
r64,i
r,CL
r8/16/32,CL
r64,CL
r8/16/32,i
r64,i
r8/16/32,CL
r64,CL
r,1
r,i
r,i
r,CL
r,CL
m8/16/32,i
m8/16/32,i
m8/16/32,cl
m8/16/32,cl
m8/16/32,1
m8/16/32,i
m8/16/32,cl
r8/16/32,r,i
r64,r64,i
r64,r64,i
r8/16/32,r,cl
r64,r64,cl
1
21
76
0
1
19
79
0
1
19
79
0
1
58
96
0
2
0
2
0
2
0
2
0
1
0
1
0
1
0
7
0
2
0
2
0
1
0
1
0
1
3
0
1 ≈ 54+6n
≈ 4n
1
5
1 ≈ 81+8n
≈ 5n
34
34
34
91
1
1
1
1
1
1
8
1
2
3
1
2
1
3
1
1
1
2
2
2
1
1
2
2
1
2
2
1
1
3
3
2
2
2
3
2
3
4
3
4
4
0.5
1
2
0.5
1
0.5
2
0.5
0.5
2
2
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
11
11
11
11
6
6
6
6
5
13
13
0
5
7
0
5
1
1
5
1
1
1
5
1
1
7
2
2
8
1
7
2
8
7
31
25
31
25
10
10
10
10
27
38
37
8
10
10
9
14
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
0
0/1
0/1
0/1
0/1
0/1
int
int
int
int
alu0
alu0/1
alu0/1
alu0/1
alu0/1
alu0/1
fpdiv
fpdiv
fpdiv
fpdiv
86
86
386
x64
86
86
386
x64
386
x64
86
a
a
a
a
86
10
86
86
1
7
2
8
7
31
25
31
25
27
38
37
7
8
0
alu0
0
alu0
0
alu0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
86
86
86
86
86
86
86
186
186
x64
86
86
x64
186
x64
86
x64
86
186
186
86
86
86
86
86
86
86
86
86
386
x64
x64
386
x64
c
c
c
c
c
d
d
d
d
d
d
d
d
d
d
d
d
d
d
SHRD
SHLD, SHRD
SHLD, SHRD
BT
BT
BT
BT
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BTR, BTS, BTC
BSF, BSR
SETcc
SETcc
CLC, STC
CMC
CLD, STD
r64,r64,cl
m,r,i
m,r,CL
r,i
r,r
m,i
m,r
r,i
r,r
m,i
m,r
r,r/m
r
m
3
3
2
1
2
3
2
1
2
3
2
2
2
3
2
3
1
8
8
8
0
0
0
7
0
0
6
10
0
0
0
0
0
8
12
20
20
8
9
8
10
8
9
28
14
16
9
9
Control transfer instructions
JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Jcc
short/near
J(E)CXZ
short
LOOP
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
IRET
BOUND
m
INT
i
INTO
1
2
3
3
2
1
4
4
3
3
4
4
2
4
4
1
2
1
2
2
1
0
25
0
0
28
0
0
0
0
29
0
0
32
0
0
30
30
49
11
67
4
0
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
LEAVE
CLI
STI
CPUID
RDTSC
1
0
1
0
1
2
4
0
1
5
1
11
1 49-90
1
12
15
0
0
5
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
10
10
8
9
8
10
8
9
10
14
4
1
2
8
1
1
1
1
1
1
1
1
1
1
1
1
1
1
x64
386
386
386
386
386
386
386
386
386
386
386
386
386
86
86
86
53
1
154
15
10
157
2-4
4
4
7
160
7
9
160
7
7
160
160
325
12
470
26
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0.25 0/1
0.25 0/1
50
5
52
64
300-500
100
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
alu1
int
int
alu0
branch
alu0
alu0
branch
branch
alu0
alu0
alu0
alu0
branch
branch
branch
branch
alu0
alu0
branch
branch
alu0
alu0
branch
branch
alu0/1
alu0/1
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
86
186
86
86
86
ppro
sse2
186
86
86
p5
p5
d
d
d
d
d
m
m
m
m
RDPMC (bit 31 = 1)
RDPMC (bit 31 = 0)
MONITOR
MWAIT
1
4
37
154
100
240
p5
p5
(sse3)
(sse3)
Notes:
a) Add 1 μop if source is a memory operand.
b) Uses an extra μop (port 3) if SIB byte used.
c) Add 1 μop if source or destination, but not both, is a high 8-bit register (AH, BH, CH, DH).
d) Has (false) dependence on the flags in most cases.
e) Not available on PMMX.
l) Move accumulator to/from memory with 64 bit absolute address (opcodes A0 to A3).
m) Not available in 64 bit mode.
n) Not available in 64 bit mode on some processors.
o) MOVSX uses an extra μop if the destination register is smaller than the biggest register size available. Use a 32 bit destination register in 16 bit and 32 bit mode, and a 64 bit destination register in 64 bit mode for optimal performance.
p) LEA with a direct memory operand has 1 μop and a reciprocal throughput of 0.25. This also applies if there is a RIP-relative address in 64-bit mode. A sign-extended 32-bit direct memory operand in 64-bit mode without RIP-relative address takes 2 μops because of the SIB byte. The throughput is 1 in this case. You may use a MOV instead.
q) These values are measured in 32-bit mode. In 16-bit real mode there is 1 microcode μop and a reciprocal throughput of 17.
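The notes about extra μops for a memory source operand and for a high 8-bit register can be applied mechanically to a table value. The following Python sketch shows one way to do that. The function name `adjusted_uops`, its parameters and the flags selecting which notes apply are hypothetical, and the base counts in the example are placeholders, not values from the tables.

```python
HIGH8 = {"ah", "bh", "ch", "dh"}


def adjusted_uops(base_uops, dst, src, src_is_memory=False,
                  memory_note=True, high8_note=True):
    """Apply the 'add 1 μop' notes to a base μop count from the table.

    memory_note: the row carries the note about a memory source operand.
    high8_note:  the row carries the note about high 8-bit registers.
    """
    extra = 0
    if memory_note and src_is_memory:
        extra += 1
    # One extra μop if source or destination, but not both, is AH/BH/CH/DH.
    if high8_note and ((dst.lower() in HIGH8) != (src.lower() in HIGH8)):
        extra += 1
    return base_uops + extra


print(adjusted_uops(1, "eax", "ebx"))                    # no note applies
print(adjusted_uops(1, "ah", "bl"))                      # only dst is high 8-bit
print(adjusted_uops(1, "ah", "bh"))                      # both are high 8-bit
print(adjusted_uops(1, "eax", "m", src_is_memory=True))  # memory source
```

Only rows that actually list the corresponding note letter should have the adjustment applied, which is why both notes can be switched off per call.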
Floating point x87 instructions
0
0
mov
load
load
load
mov
store
store
store
mov
load
load
store
store
mov
87
87
87
87
87
87
87
87
87
87
87
87
sse3
87
Notes
0
0
2
2
2
0
0
0
0
0
2
2
0
0
0
Instruction set
7
7
1
1
8
90
1
2
10
400
1
8
2
2.5
2.5
2
Subunit
0
0
Execution unit
7
Port
Reciprocal throughput
0
0
3
74
0
0
6
311
0
2
0
0
0
0
Additional latency
1
1
3
3
1
2
3
3
1
3
2
3
3
1
Latency
r
m32/64
m80
m80
r
m32/64
m80
m80
r
m16
m32/64
m
m
Microcode
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FIST(P)
FISTTP
FLDZ
Operands
μops
Instruction
FLD1
FCMOVcc
FFREE
FINCSTP, FDECSTP
FNSTSW
FSTSW
FNSTSW
FNSTCW
FLDCW
Arithmetic instructions
FADD(P),FSUB(R)(P)
FADD,FSUB(R)
FIADD,FISUB(R)
FIADD,FISUB(R)
FMUL(P)
FMUL
FIMUL
FIMUL
FDIV(R)(P)
FDIV(R)
FIDIV(R)
FIDIV(R)
FABS
FCHS
FCOM(P), FUCOM(P)
FCOM(P)
FCOMPP, FUCOMPP
FCOMI(P)
FICOM(P)
FICOM(P)
FTST
FXAM
FRNDINT
FPREM
FPREM1
Math
FSQRT
FLDPI, etc.
FSIN, FCOS
FSINCOS
FPTAN
FPATAN
FSCALE
FXTRACT
F2XM1
FYL2X
FYL2XP1
st0,r
r
AX
AX
m16
m16
m16
r
m
m16
m32
r
m
m16
m32
r
m
m16
m32
r
m
r
m16
m32
2
4
3
1
4
6
2
4
3
0
0
0
0
0
0
3
0
6
1
2
3
3
1
2
3
3
1
2
3
3
1
1
1
2
2
3
3
3
1
1
3
8
9
0
0
3
0
0
0
3
0
0
0
3
3
0
0
0
0
0
0
3
0
0
0
14
86
92
1
2
3
5
8
4
3
4
3
3
3
0
0
≈100
≈150
≈170
97
25
16
190
63
58
5
1
0
0
0
0
6
6
7
6
8
8
8
8
45
45
45
45
3
3
3
3
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
28
220
220
1
1
1
45
1
≈200
≈200
≈270
≈250
96
27
≈270
≈170
≈170
2
4
3
1
3
3
8
3
10
0
1
0
0
1
1
0
0
0,2
mov
fp
mov
mov
1
1
6
2
2
2
8
3
45
45
45
45
1
1
1
1
1
3
8
2
1
1
16
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0,1
1
1,2
1
1
0,1
1
1
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
1
1
1
1
1
1
1
1
1
1
1
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
fp
45
2
≈200
≈200
≈270
≈250
87
PPro
87
87
287
287
87
87
87
add
add
add
add
mul
mul
mul
mul
div
div
div
div
misc
misc
misc
misc
misc
misc
misc
misc
misc
misc
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
87
PPro
87
87
87
87
87
87
387
div
87
87
387
387
87
87
87
87
87
87
87
fp
fp
e
f
g,h
g,h
g,h
g,h
g,h
Other
FNOP
(F)WAIT
FNCLEX
FNINIT
FNSAVE
FRSTOR
FXSAVE
FXRSTOR
Notes:
e)
f)
g)
h)
i)
1
2
1
1
2
2
2
2
0
0
4
30
181
96
121
118
1
0
0
0
1
1
120
200
500
570
0
0
1
mov
mov
0,1
160
244
87
87
87
87
87
87
sse
sse
i
i
e) Not available on PMMX.
f) The latency for FLDCW is 3 when the new value loaded is the same as the value of the control word before the preceding FLDCW, i.e. when alternating between the same two values. In all other cases, the latency and reciprocal throughput are > 100.
g) Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 32, double precision: 40, long double precision (default): 45.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
i) Takes fewer microcode μops when XMM registers are disabled, but the throughput is the same.
Integer MMX and XMM instructions
7
2
0
1
7
0
10
1
alu
shift
shift
sse2
mmx
mmx
mmx
sse2
sse2
sse2
mmx
mmx
sse2
mmx
mmx
sse2
sse2
sse2
sse2
sse2
sse3
Notes
1
1
0
fp
1
mmx
2
load
0
fp
1
mmx
2
load
0,1
0
mov
1
mmx
2
load
0
mov
0
mov
2
load
0
mov
2
load
0
mov
2
load
0,1 mov-mmx
Instruction set
7
4
1
1
1
1
2
1
2
1
2
1
2
1
1
2
23
8
2.5
2
Subunit
1
1
Execution unit
6
3
Port
Reciprocal throughput
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
Additional latency
2
1
1
1
2
1
2
1
1
1
2
1
1
2
4
4
4
3
Latency
r32, mm
mm, r32
mm,m32
r32, xmm
xmm, r32
xmm,m32
m32, r
mm,mm
xmm,xmm
r,m64
m64,r
xmm,xmm
xmm,m
m,xmm
xmm,m
m,xmm
xmm,m
mm,xmm
Microcode
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
Operands
μops
Instruction
k
k
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVDDUP
MOVSHDUP
MOVSLDUP
PACKSSWB/DW
PACKUSWB
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKHBW/WD/DQ/QDQ
PUNPCKLBW/WD/DQ/QDQ
PSHUFD
PSHUFL/HW
PSHUFW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRW
PINSRW
Arithmetic instructions
PADDB/W/D
PADD(U)SB/W
PSUBB/W/D
PSUB(U)SB/W
PADDQ, PSUBQ
PADDQ, PSUBQ
PCMPEQB/W/D
PCMPGTB/W/D
PMULLW PMULHW
PMULHUW
PMADDWD
PMULUDQ
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PSADBW
Logic
PAND, PANDN
POR, PXOR
PSLL/RLW/D/Q, PSRAW/D
PSLLDQ, PSRLDQ
xmm,mm
m,mm
m,xmm
xmm,xmm
2
3
2
1
0
0
0
0
10
1
2
1
2
4
4
2
0,1 mov-mmx
0
mov
0
mov
1
mmx
xmm,xmm
1
0
4
1
2
1
mm,r/m
1
0
2
1
2
xmm,r/m
1
0
4
1
mm,r/m
1
0
2
xmm,r/m
1
0
xmm,r/m
xmm,xmm,i
xmm,xmm,i
mm,mm,i
mm,mm
xmm,xmm
r32,r
r32,mm,i
r32,xmm,i
r,r32,i
1
1
1
1
1
1
2
2
2
2
r,r/m
sse2
shift
sse
sse2
sse3
mmx
shift
sse3
1
mmx
shift
mmx
a
4
1
mmx
shift
mmx
a
1
2
1
mmx
shift
mmx
a
4
1
4
1
mmx
shift
sse2
a
0
0
0
0
4
6
0
0
0
0
2
4
2
2
1
1
1
1
2
2
2
1
10
12
3
2
3
2
shift
shift
shift
shift
sse2
sse2
sse
sse
sse
sse2
a
1
0
2
1
1,2
1
mmx
alu
mmx
a,j
r,r/m
mm,r/m
xmm,r/m
1
1
1
0
0
0
2
2
5
1
1
1
1,2
1
2
1
1
1
mmx
mmx
fp
alu
alu
add
mmx
sse2
sse2
a,j
a
a
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
2
7
7
7
7
2
2
2
4
1
1
1
1
1
1
1
1
1
1,2
1,2
1,2
1,2
1,2
1,2
1,2
1,2
1,2
1
1
1
1
1
1
1
1
1
mmx
fp
fp
fp
fp
mmx
mmx
mmx
mmx
alu
mul
mul
mul
mul
alu
alu
alu
alu
mmx
mmx
sse
mmx
sse2
sse
sse
sse
sse
a,j
a,j
a,j
a,j
a,j
a,j
a,j
a,j
a,j
r,r/m
r,r/m
1
1
0
0
2
2
1
1
1,2
1,2
1
1
mmx
mmx
alu
alu
mmx
mmx
a,j
a,j
r,i/r/m
xmm,i
1
1
0
0
2
4
1
1
1,2
2
1
1
mmx
mmx
shift
shift
mmx
sse2
a,j
7
7
7
4
1
mmx
1
mmx
1
mmx
1
mmx
0
mov
0
mov
0,1 mmx-alu0
1 mmx-int
1 mmx-int
1 int-mmx
sse
sse
sse2
sse
Other
EMMS
10
10
12
0
mmx
Notes:
a) Add 1 μop if source is a memory operand.
j) Reciprocal throughput is 1 for 64 bit operands, and 2 for 128 bit operands.
k) It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.
Floating point XMM instructions
0
Conversion
CVTPS2PD
CVTPD2PS
CVTSD2SS
CVTSS2SD
CVTDQ2PS
CVTDQ2PD
CVT(T)PS2DQ
CVT(T)PD2DQ
CVTPI2PS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
xmm,mm
1
2
3
2
1
3
1
2
4
0
0
0
0
0
0
0
0
0
MOVHPS/D, MOVLPS/D
1
1
0
4
2
1
1
4
2
1
1
5
4
4
2
1
1
1
1
4
10
14
8
5
10
5
11
12
1
1
1
1
1
1
1
1
1
4
2
6
6
2
4
2
2
6
0
2
0
0
2
0
1
1
2
0
1
1
2
0
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
1
mov
mov
mmx
mmx
shift
shift
mmx
mmx
shift
shift
fp
mmx
mmx
mmx
shift
shift
shift
mmx
fp-mmx
mmx
mmx
fp
mmx-fp
fp
fp-mmx
mmx
shift
sse2
shift
shift
sse2
sse2
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse
sse3
sse3
sse
sse
sse
sse
sse
sse2
a
sse2
sse2
sse2
a
sse2
a
sse
Notes
7
MOVHPS/D, MOVLPS/D
2
4
1
1
2
1
2
8
2
2
1
2
2
2
2
2
2
2
4
3
2
2
2
Instruction set
0
0
Subunit
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Execution unit
1
1
2
1
4
4
1
1
1
2
1
1
2
2
1
1
2
2
1
2
1
7
MOVSH/LDUP
MOVDDUP
MOVNTPS/D
MOVMSKPS/D
SHUFPS/D
UNPCKHPS/D
UNPCKLPS/D
r,r
r,m
m,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r
r,m
m,r
r,r
r,r
m,r
r32,r
r,r/m,i
r,r/m
r,r/m
Port
Reciprocal throughput
Additional latency
Latency
Microcode
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVUPS/D
MOVSS
MOVSD
MOVSS, MOVSD
MOVSS, MOVSD
MOVHLPS
MOVLHPS
Operands
μops
Instruction
k
k
a
a
a
a
a
a
CVTPI2PD
CVT(T)PS2PI
CVT(T)PD2PI
CVTSI2SS
CVTSI2SD
CVT(T)SD2SI
CVT(T)SS2SI
xmm,mm
mm,xmm
mm,xmm
xmm,r32
xmm,r32
r32,xmm
r32,xmm
4
3
4
3
4
2
2
0
0
0
0
0
0
0
12
8
12
20
20
12
17
1
0
1
1
1
1
1
5
2
3
4
5
4
4
1
0,1
0,1
1
1
1
1
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp-mmx
fp
fp
sse2
sse
sse2
sse
sse2
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
3
1
1
1
1
1
2
0
0
0
0
0
0
0
0
0
0
5
5
5
13
7
32
41
40
71
6
1
1
1
1
1
1
1
1
1
1
2
2
2
5-6
2
23
41
40
71
4
1
1
1
1
1
1
1
1
1
1
fp
fp
fp
fp
fp
fp
fp
fp
fp
mmx
r,r/m
1
0
5
1
2
1
r,r/m
r,r/m
1
2
0
0
5
6
1
1
2
3
Logic
ANDPS/D ANDNPS/D
ORPS/D XORPS/D
r,r/m
1
0
2
1
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
r,r/m
1
1
1
1
2
2
0
0
0
0
0
0
32
41
40
71
5
6
1
1
1
1
1
1
m
m
2
3
11
0
Arithmetic
ADDPS/D ADDSS/D
SUBPS/D SUBSS/D
ADDSUBPS/D
HADDPS/D HSUBPS/D
MULPS/D MULSS/D
DIVSS
DIVPS
DIVSD
DIVPD
RCPPS RCPSS
MAXPS/D
MAXSS/D MINPS/D
MINSS/D
CMPccPS/D
CMPccSS/D
COMISS/D UCOMISS/D
Other
LDMXCSR
STMXCSR
Notes:
a)
h)
k)
a
a
a
a
a
sse2
sse
a
a
add
add
add
add
mul
div
div
div
div
sse
sse
sse3
sse3
sse
sse
sse
sse2
sse2
sse
a
a
a
a
a
a,h
a,h
a,h
a,h
a
fp
add
sse
a
1
1
fp
fp
add
add
sse
sse
a
a
2
1
mmx
alu
sse
a
32
41
40
71
3
4
1
1
1
1
1
1
fp
fp
fp
fp
mmx
mmx
div
div
div
div
sse
sse
sse2
sse2
sse
sse
a,h
a,h
a,h
a,h
a
a
13
3
1
1
sse
sse
a) Add 1 μop if source is a memory operand.
h) Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit.
k) It may be advantageous to replace this instruction by two 64-bit moves or LDDQU.
Atom
Intel Atom
List of instruction timings and μop breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of μops from the decoder or ROM.
Unit: Tells which execution unit is used. Instructions that use the same unit cannot execute simultaneously.
ALU0 and ALU1 mean integer unit 0 or 1, respectively.
ALU0/1 means that either unit can be used. ALU0+1 means that both units are used.
Mem means memory in/out unit.
FP0 means floating point unit 0 (includes multiply, divide and other SIMD instructions).
FP1 means floating point unit 1 (adder).
MUL means multiplier, shared between FP and integer units.
DIV means divider, shared between FP and integer units.
np means not pairable: Cannot execute simultaneously with any other instruction.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NaNs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NaN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
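The Unit column can be read as a resource requirement: two instructions may execute simultaneously only if they do not need the same unit. The Python sketch below interprets the notation described above ("ALU0/1", "ALU0+1", comma-separated lists such as "ALU0, Mem", and "np") and checks whether two Unit entries conflict. It is a simplified model based only on these column definitions. The function names are hypothetical, and any other pairing restrictions of the processor are not modelled.

```python
import itertools
import re


def alternatives(unit):
    """Expand a Unit entry into the possible sets of units it can occupy.

    Returns None for "np" (not pairable).
    """
    if unit.strip().lower() == "np":
        return None
    per_component = []
    for comp in unit.split(","):
        comp = comp.strip().upper()
        m = re.fullmatch(r"([A-Z]+)(\d)([/+])(\d)", comp)
        if m is None:
            per_component.append([{comp}])             # e.g. "MEM", "ALU0", "MUL"
        else:
            prefix, a, sep, b = m.groups()
            first, second = prefix + a, prefix + b
            if sep == "/":
                per_component.append([{first}, {second}])  # either unit
            else:
                per_component.append([{first, second}])    # both units
    # Combine one choice per component into a full resource set.
    return [set().union(*choice) for choice in itertools.product(*per_component)]


def can_pair(unit_a, unit_b):
    """True if some assignment lets both instructions use disjoint units."""
    alts_a, alts_b = alternatives(unit_a), alternatives(unit_b)
    if alts_a is None or alts_b is None:
        return False
    return any(a.isdisjoint(b) for a in alts_a for b in alts_b)


print(can_pair("ALU0/1", "ALU0/1"))       # one can go to each integer unit
print(can_pair("ALU0+1", "ALU0/1"))       # first one occupies both units
print(can_pair("FP0, Mul", "ALU0, Mul"))  # both need the shared multiplier
```

A check like this is useful when interleaving independent instructions: entries that share only an "either unit" choice can still pair, while entries that both need Mem, Mul or Div cannot.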
Integer instructions
Move instructions
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
Operands
μops
Unit
r,r
r,i
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
m,r
r,r/m
r,r
r,m
r,r
r,m
1
1
1
1
1
1
2
7
8
1
1
1
1
3
4
ALU0/1
ALU0/1
ALU0, Mem
ALU0, Mem
ALU0, Mem
Latency
Reciprocal throughput
1
1
1-3
1
1
ALU0, Mem
ALU0
ALU0+1
1
2
6
6
1/2
1/2
1
1
1
1
5
21
26
2.5
1
2
3
6
6
Remarks
All addr. modes
All addr. modes
Implicit lock
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA
DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
r
i
m
sr
r
(E/R)SP
m
sr
3
1
1
2
3
14
9
1
1
3
7
19
16
1
1
2
r,m
r
m
m
m
1
1
10
1
1
1
1
1
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
1
1
1
1
1
1
1
1
1
1
13
13
20
21
4
10
3
4
3
8
2
1
6
2
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
np
np
np
np
ALU0+1
ALU0/1
AGU1
ALU0
6
1
1
1
2
1
7
1-4
1
30
1
1
30
1
1
1/2
1
1
1
1/2
1
1
2
2
2
1/2
1
1/2
Mem
Mem
ALU0/1
ALU0/1, Mem
ALU0/1
ALU0/1
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
6
1
1
5
6
12
11
1
1
6
31
28
12
2
1/2
5
2
2
2
2
1
1
1
16
12
20
25
7
24
7
6
6
14
6
5
13
5
Not in x64 mode
Not in x64 mode
Not in x64 mode
4 clock latency
on input register
Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
7
6
6
14
5
2
11
5
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR
RCL
RCR
RCL
SHLD
SHLD
SHLD
SHLD
SHLD
SHLD
SHRD
SHRD
SHRD
SHRD
SHRD
SHRD
BT
BT
BT
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r/m8
r/m16
r/m32
r/m64
r/m8
r/m16
r/m32
r/m64
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r,1
r/m,i/cl
r/m,i/cl
r16,r16,i
r32,r32,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r16,r16,i
r32,r32,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r,r/i
m,r
m,i
1
7
3
5
4
8
9
12
12
38
26
29
29
60
2
1
1
2
1
1
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Mul
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0, Div
ALU0
ALU0
ALU0
ALU0
ALU0
ALU0
5
14
6
7
7
14
22
33
49
183
38
45
61
207
5
1
1
5
1
1
1
ALU0/1
1
1
ALU0/1, Mem
1
ALU0/1, Mem
1
1
ALU0/1
1
1
ALU0/1, Mem
1
ALU0
1
1
ALU0
1
1
ALU0
1
1
ALU0
1
5
ALU0
7
2
ALU0
1
12-17
ALU0
12-15
14-20
ALU0
14-18
10
ALU0
10
2
ALU0
5
10
ALU0
11
9
ALU0
9
2
ALU0
5
9
ALU0
10
8
ALU0
8
2
ALU0
5
10
ALU0
9
7
ALU0
8
2
ALU0
5
9
ALU0
9
1
ALU1
1
9
10
2
5
2
14
22
33
49
183
38
45
61
207
1/2
1
1
1/2
1
1
1
1
1
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1-2 more if mem
1
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
SETcc
CLC STC
CMC
CLD
STD
r,r/i
m,r
m,i
r,r/m
r
m
Control transfer instructions
JMP
short/near
JMP
far
JMP
r
JMP
m(near)
JMP
m(far)
Conditional jump
short/near
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL
far
CALL
r
CALL
m(near)
CALL
m(far)
RETN
RETN
i
RETF
RETF
i
BOUND
r,m
INTO
String instructions
LODS
REP LODS
STOS
REP STOS
MOVS
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
1
10
3
10
1
2
1
1
5
6
ALU1
ALU1
ALU1
ALU0+1
ALU0/1
2
1
29
1
2
30
1
3
8
8
1
37
1
2
38
1
1
36
36
11
4
ALU1
a,0
1
2
5
1/2
2
7
25
2
66
4
7
78
2
7
8
8
3
65
18
20
64
6
6
80
80
10
6
ALU1
np
np
3
5n+11
2
3n+10
4
4n+11
3
5n+16
5
6n+16
1
1
5
14
1
11
6
16
2
6
3n+50
5
2n+4
6
2n - 4n
6
3n+60
7
4n+40
ALU0/1
ALU0/1
Not in x64 mode
Not in x64 mode
Not in x64 mode
fastest for high n
1/2
1/2
24
23
Not in x64 mode
ENTER
LEAVE
CPUID
RDTSC
RDPMC
a,b
20+6b
4
40-80
16
24
6
100-170
29
48
Floating point x87 instructions
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
Operands
μops
r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
1
1
4
52
1
3
8
189
1
1
3
3
1
2
2
3
4
4
2
3
1
1
166
83
1
3
9
92
1
7
12
221
1
7
11
11
1
1
1
1
1
1
1
5
3
3
3
3
1
5
5
71
1
1
1
1
r
AX
m16
m16
m16
m
m
r/m
r/m
r/m
r/m
r
m
m
m
m
Unit
Latency
Reciprocal throughput
9
1
321
177
Mul
Div
Mul
Div
1
1
1
10
92
1
9
13
221
1
6
9
9
1
8
10
9
10
10
8
9
1
1
321
177
1
2
71
1
1
1
1
10
9
9
73
9
1
Remarks
SSE3
FXAM
FPREM
FPREM1
FRNDINT
1
26
37
19
1
~110
~130
48
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X FYL2XP1
FPTAN
FPATAN
30
15
1
9
112
25
63
100
91
56
24
71
~260
~260
~100
~220
~300
~300
Other
FNOP
WAIT
FNCLEX
FNINIT
1
2
4
23
Div
5
1
1
5
26
74
Integer MMX and XMM instructions
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
Operands
μops
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
1
1
1
1
1
1
1
1
1
1
3
4
4
1
1
1
1
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm, (x)mm
1
1
1
Unit
Latency
Reciprocal throughput
Mem
Mem
4
5
3
4
1
4
5
1
4
5
6
6
6
1
1
~400
~450
2
1
1
1
1/2
1
1
1/2
1
1
6
6
6
1
1
1
3
FP0
FP0
FP0
1
1
1
1
1
1
Mem
Mem
FP0/1
Mem
Mem
FP0/1
Mem
Mem
Mem
Mem
Mem
Remarks
PSHUFB
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PINSRW
PEXTRW
Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
PSADBW
PSADBW
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PABSB PABSW PABSD
PSIGNB PSIGNW PSIGND
Logic instructions
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
mm,mm
xmm,xmm
mm,mm,i
xmm,xmm,i
xmm,xmm,i
xmm, xmm,i
mm,mm
xmm,xmm
r32,(x)mm
(x)mm,r32,i
r32,(x)mm,i
1
4
1
1
1
1
1
2
1
1
2
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm,(x)mm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
mm,mm
xmm,xmm
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm
1
2
7
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,i
xmm,i
Other
EMMS
FP0
FP0
FP0
FP0
FP0
Mem
Mem
1
6
1
1
1
1
4
3
5
FP0/1
1
6
1
1
1
1
2
7
2
1
5
1/2
5
8
FP0/1
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Mul
FP0/1
FP0/1
FP0/1
FP0/1
1
5
8
6
1
4
5
4
5
4
5
4
5
4
5
4
5
1
1
1
1
1
FP0/1
1
1/2
1
2
1
1
FP0/1
FP0
FP0
FP0
1
5
1
1
1/2
5
1
1
9
1/2
1
2
1
2
1
2
1
2
1
2
1
2
1/2
1/2
1/2
1/2
9
Floating point XMM instructions
Operands
μops
Unit
Latency
Reciprocal throughput
Remarks
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
xmm,m32/64
m32/64,xmm
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,xmm,i
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
1
1
1
4
3
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,mm
mm,xmm
xmm,mm
mm,xmm
xmm,r32
r32,xmm
xmm,r32
r32,xmm
4
3
4
3
3
3
3
3
1
1
3
4
3
3
3
3
Arithmetic
ADDSS SUBSS
ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS
HADDPD HSUBPD
MULSS
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
1
1
1
3
1
3
5
5
1
FP0/1
Mem
Mem
Mem
Mem
FP0/1
Mem
Mem
Mem
Mem
Mem
FP0
Mem
FP0
FP0
FP0
FP0
FP0
FP0
FP1
FP1
FP1
FP1
FP1
FP1
FP0+1
FP0+1
FP0, Mul
1
4
5
6
6
1
4
5
5
4
4
1
4
~500
1
1
1
1
1
1
1/2
1
1
6
6
1/2
1
1
1
1
1
1
2
3
1
1
1
1
1
11
10
7
6
6
6
7
6
6
4
7
7
7
10
8
10
11
10
6
6
6
6
6
6
5
1
6
7
6
8
6
8
5
5
5
6
5
6
8
8
4
1
1
1
6
1
6
7
7
1
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D
CMPccPS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
MAXPS/D MINPS/D
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
1
1
6
3
3
6
6
1
5
1
3
4
1
3
FP0, Mul
FP0, Mul
FP0, Mul
FP0, Div
FP0, Div
FP0, Div
FP0, Div
FP0
FP0
FP0
FP0
FP0
5
5
9
31
60
64
122
4
9
5
6
9
5
6
2
2
9
31
60
64
122
1
8
1
6
9
1
6
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
3
5
3
5
1
5
FP0, Div
FP0, Div
FP0, Div
FP0, Div
FP0
FP0
31
63
60
121
4
9
31
63
60
121
1
8
Logic
ANDPS/D
ANDNPS/D
ORPS/D
XORPS/D
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
1
1
1
1
FP0/1
FP0/1
FP0/1
FP0/1
1
1
1
1
1/2
1/2
1/2
1/2
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
m32
m32
m4096
m4096
4
4
121
116
5
14
142
149
6
15
144
150
Silvermont
Intel Silvermont
List of instruction timings and μop breakdown
Explanation of column headings:
Instruction: Instruction name. cc means any condition code. For example, Jcc can be JB, JNE, etc.
Operands: i = immediate data, r = register, mm = 64 bit mmx register, x = 128 bit xmm register, (x)mm = mmx or xmm register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of μops from the decoder or ROM. A μop that goes to multiple units is counted as one.
Unit: Tells which execution unit is used. Instructions that use the same unit cannot execute simultaneously.
    IP0 and IP1 mean integer port 0 or 1 and their associated pipelines.
    IP0/1 means that either integer unit can be used.
    IP0+1 means that both units are used at the same time.
    Mem means memory execution cluster.
    FP0 means floating point port 0 (includes multiply, divide, convert and shuffle).
    FP1 means floating point port 1 (adder).
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread. Delays in the decoders are included in the latency and throughput timings. Values of 4 or more are often caused by bottlenecks in the decoders and microcode ROM rather than the execution units.
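As a rough illustration of how the latency and reciprocal throughput columns are meant to be read, the following Python sketch estimates the clock count for a series of identical instructions. The function name and the example figures (latency 5, reciprocal throughput 2) are assumptions chosen for illustration only; they are not taken from the tables, and real code is also limited by decoders, memory access and surrounding instructions.

```python
def estimate_cycles(count, latency, reciprocal_throughput, dependent):
    """Rough clock-cycle estimate for `count` identical instructions.

    dependent=True : every instruction needs the result of the previous
                     one, so the chain is limited by the latency column.
    dependent=False: the instructions are independent, so the series is
                     limited by the reciprocal throughput column.
    """
    if count <= 0:
        return 0
    if dependent:
        return count * latency
    return count * reciprocal_throughput


# Hypothetical instruction: latency 5, reciprocal throughput 2.
chain_cycles = estimate_cycles(100, 5, 2, dependent=True)         # 500
independent_cycles = estimate_cycles(100, 5, 2, dependent=False)  # 200
```

The gap between the two results shows why breaking a long dependency chain into several independent chains can pay off when the latency is much higher than the reciprocal throughput.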
Integer instructions
Move instructions
MOV
MOV
MOV
MOV
MOV
MOVNTI
MOVSX MOVZX MOVSXD
MOVSX MOVZX MOVSXD
MOVSX MOVZX MOVSXD
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
Operands
μops
Unit
r,r
r,i
r,m
m,r
m,i
m,r
r16,r8
r16,m8
r32/64,r/m
r,r
r,m
r,r
r,m
1
1
1
1
1
1
2
3
1
1
1
3
3
4
1
IP0/1
IP0/1
Mem
Mem
Mem
Mem
IP0
IP0
IP0
IP0/1
1
1
3
4
IP0/1
5
10
5
r
IP0+1
Latency Reciprocal
throughput
1
2
1/2
1/2
1
1
1
2
4
10
1
1
1
5
10
5
1
Remarks
All addr. modes
All addr. modes
Implicit lock
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
LEA
LEA
LEA
LEA
BSWAP
MOVBE
MOVBE
MOVBE
PREFETCHNTA
PREFETCHT0/1/2
PREFETCHNTW
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC
NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA
DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
i
m
r
(E/R)SP
m
r,[r+d]
r,[r+r*s]
r,[r+r*s+d]
r,[rip+d]
r16,[m]
r
r16,m16
r32/64,m32/64
m,r
m
m
m
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
r
m
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
1
3
18
10
2
2
6
21
17
1
1
2
1
1
1
1
2
1
1
1
1
1
1
1
2
2
1
IP0+1
IP0+1
IP0+1
1
1
1
1
1
1
1
1
1
1
1
13
13
20
21
4
11
3
4
3
3
2
1
1
IP0/1
IP0/1, Mem
IP0/1, Mem
IP0/1
IP0
IP0/1
IP1
IP0+1
IP0/1
IP0
IP0/1
IP0/1
IP0/1
IP0
IP0
IP0
IP0
IP0
IP0
IP0
1
2
6
1
1
2
4
1
1
6
2
6
1
1
1
6
12
12
16
16
5
24
5
5
5
7
4
3
5
1
5
29
10
1
3
6
47
14
1
1
4
1
1
2
0.5
4
1
2
1
1
1
1
1
8
14
7
1/2
1
1
2
2
2
1/2
1
1/2
1/2
1
16
5
5
5
7
4
1
2
Not in x64 mode
Not in x64 mode
Not in x64 mode
latency to flag=2
Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
Not in x64 mode
IMUL
IMUL
IMUL
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW
CWDE
CDQE
CWD
CDQ
CQO
POPCNT
POPCNT
POPCNT
CRC32
CRC32
CRC32
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
SHR SHL SAR
ROR ROL
ROR ROL
RCR
RCL
RCR
RCR
RCL
RCL
SHLD
SHLD
SHLD
SHLD
SHLD
SHLD
SHRD
SHRD
SHRD
r16,r16,i
r32,r32,i
r64,r64,i
m8
m16
m32
m64
r/m8
r/m16
r/m32
r/m64
r/m8
r/m16
r/m32
r/m64
r16,r16
r32,r32
r64,r64
r32,r8
r32,r16
r32,r32
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r,1
r,1
r,i/cl
m,i/cl
r,i/cl
m,i/cl
r16,r16,i
r32,r32,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r16,r16,i
r32,r32,i
r64,r64,i
2
1
1
3
5
4
4
9
12
12
23
26
29
29
44
2
1
1
2
1
1
2
1
1
2
1
1
IP0
IP0
IP0
IP0
IP0
IP0
IP0
IP0, FP0
IP0, FP0
IP0, FP0
IP0, FP0
IP0, FP0
IP0, FP0
IP0, FP0
IP0, FP0
IP0
IP0
IP0
IP0
IP0
IP0
1
IP0/1
1
IP0/1, Mem
1
IP0/1, Mem
1
IP0/1
1
IP0/1, Mem
1
IP0
1
IP0
1
IP0
1
IP0
7
IP0
1
IP0
11
IP0
14
IP0
13
IP0
16
IP0
10
IP0
1
IP0
10
IP0
9
IP0
2
IP0
9
IP0
8
IP0
2
IP0
8-10
IP0
4
3
5
14
24
25-29
25-39
34-94
24-35
37-41
29-46
47-107
4
1
1
4
1
1
4
3
3
4
6
3
1
6
1
1
1
1
9
2
12
13
12
14
10
2
10
10
4
10
10
4
10
4
1
2
19
19-23
19-31
25-94
25
30-32
29-38
47-107
4
1
1
4
6
1
1/2
1
1
1/2
1
1
1
1
1
2
2 more if mem
4 more if mem
2 more if mem
2 more if mem
2 more if mem
2 more if mem
3 more if mem
4 more if mem
3 more if mem
SHRD
SHRD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
CLC STC
CMC
CLD
STD
7
2
2
1
7
1
1
8
1
10
1
1
1
4
5
IP0
IP0
IP0
IP0+1
Control transfer instructions
JMP
short/near
JMP
r
JMP
m(near)
Conditional jump
short/near
J(E/R)CXZ
short
LOOP
short
LOOP(N)E
short
CALL
near
CALL
r
CALL
m
RET
RET
i
BOUND
r,m
INTO
1
1
1
1
2
7
8
1
1
3
1
1
10
4
IP1
String instructions
LODS
REP LODS
STOS
3
~4n
2
5
~2n
4
REP STOS
MOVS
~0.12B
5
~0.1B
6
REP MOVS
SCAS
REP SCAS
CMPS
REP CMPS
~ 0.2B
3
~5n
6
~6n
~0.15B
5
~3n
6
~3n
6
4
1
8
6
13
6
10
10
10
11
14
Synchronization instructions
XADD
LOCK XADD
LOCK ADD
CMPXCHG
LOCK CMPXCHG
CMPXCHG8B
r16,r16,cl
r32,r32,cl
r64,r64,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r/m
r/m
m,r
m,r
m,r
m,r
m,r
m,r
IP0+1
IP0+1
IP0+1
IP0+1
IP0/1
10
4
4
1
9
1
1
10
2
1
IP0+1
IP0+1
2 more if mem
2 more if mem
2 more if mem
1
1
10
1
10
1
1
1
7
35
2
2
2
1-2
2-15
10-20
IP1
2
9
14
3
3
10
7
Not in x64 mode
Not in x64 mode
per byte, best
case
per byte, best
case
LOCK CMPXCHG8B
CMPXCHG16B
LOCK CMPXCHG16B
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDTSCP
RDPMC
RDRAND
m,r
m,r
m,r
11
19
17
14
24
27
1
1
6
15
19+6b
4
31-80
13
15
19
15
IP0/1
IP0/1
Operands
μops
Unit
r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m
m
m
1
1
5
59
1
1
8
204
2
1
6
7
1
1
2
3
2
4
2
4
1
1
166
82
a,0
a,b
r
1/2
1/2
24
14
59+5b
5
54-108
29
25
19
~1472
Floating point x87 instructions
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FIST(P)
FISTTP
FLDZ
FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FABS
r
AX
m16
m16
m16
m
m
r/m
r/m
r/m
1
1
1
1
FP0+1
Latency Reciprocal
throughput
1
3
9
68
1
4
9
239
1
6
9
6
240
174
0.5
1
8
68
0.5
2
9
239
1
2
9
13
1
7
7
6
9
11
4
5
0.5
0.5
240
174
3
5
39
1
1
2
37
1
6
~9
~6
~5
1
FP1
FP0
FP0
Remarks
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
r/m
r
m
m
m
m
1
1
1
1
3
3
3
3
1
1
27
27
18
1
5
5
5
6
7
32-57
32-57
26
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
27
15
1
18
110
9
34
61
101
63
Other
FNOP
WAIT
FNCLEX
FNINIT
1
2
4
19
20
13-40
40-170
40-170
39-90
80-140
154
45-200
85-190
1
1
1
1
5
6
39
5
1
1
32-57
32-57
26
66
20
13-40
40-170
0.5
4
24
65
Integer MMX and XMM instructions
Move instructions
MOVD MOVQ
MOVD MOVQ
MOVD MOVQ
MOVD MOVQ
MOVQ
MOVDQA
MOVDQA MOVDQU
MOVDQA MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
Operands
μops
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
x, x
x, m128
m128, x
x, m128
mm, x
x,mm
m64,mm
m128,x
x, m128
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Unit
Mem
Mem
FP0/1
FP0/1
Mem
Mem
Mem
Mem
Mem
Latency Reciprocal
throughput
4
4
3
3
1
1
3
4
3
1
1
~370
~370
3
1
1
1
1
0.5
0.5
1
1
1
1
1
1
1
1
Remarks
PACKSSWB/DW
PACKUSWB
PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PMOVSX/ZX BW BD BQ DW
DQ
PMOVSX/ZX BW BD BQ DW
DQ
PSHUFB
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PALIGNR
PBLENDVB
PBLENDVB
PBLENDW
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PINSRW
PINSRB/D/Q
PINSRB/D/Q
PEXTRW
PEXTRB/W
PEXTRQ
PEXTRB/W
PEXTRD
PEXTRQ
Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PADDQ PSUBQ
PHADD(S)W PHSUB(S)W
PHADD(S)W PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PCMPEQQ
PCMPGTQ
PMULL/HW PMULHUW
PMULL/HW PMULHUW
PMULHRSW
PMULHRSW
PMULLD
PMULDQ
PMULUDQ
PMULUDQ
PMADDWD
PMADDWD
PMADDUBSW
PMADDUBSW
(x)mm, (x)mm
x,x
(x)mm, (x)mm
(x)mm, (x)mm
1
1
1
1
FP0
FP0
FP0
FP0
x,x
1
x,m
mm,mm
x,x
mm,mm,i
x,x,i
x,x,i
x,x,i
x,x,xmm0
x,m,xmm0
x,x/m,i
mm,mm
x,x
r32,(x)mm
(x)mm,r32,i
x,r32,i
x,m8,i
r32,(x)mm,i
r32,x,i
r64,x,i
m8/16,x,i
m32,x,i
m64,x,i
1
1
4
1
1
1
1
2
3
1
1
3
1
1
1
1
2
2
2
5
4
4
(x)mm, (x)mm
(x)mm, (x)mm
(x)mm, m
mm, mm
x, x/m
(x)mm, (x)mm
(x)mm,(x)mm
x, x
x, x
mm,mm
x, x
mm,mm
x, x
x, x
x, x
mm,mm
x, x
mm,mm
x, x
mm,mm
x, x
1
2
3
5
7-8
3-4
1
2
1
1
1
1
1
7
1
1
1
1
1
1
1
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
Mem
Mem
1
1
1
1
1
1
1
1
1
1
1
1
5
1
1
1
1
4
1
1
5
1
1
1
1
4
5
1
1
5
1
1
1
1
4
4
7
6
5
8
1
~370
~370
4
3
3
5
5
7
FP0/1
FP0/1
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
1
4
6
9
5-6
1
4
5
4
5
4
5
11
5
4
5
4
5
4
5
0.5
4
5
6
9
5-6
0.5
4
2
1
2
1
2
11
2
1
2
1
2
1
2
+1 if mem
+1 if mem
PSADBW
PSADBW
MPSADBW
MPSADBW
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PMIN/PMAX
SB/SW/SD
UB/UW/UD
PHMINPOSUW
PABSB PABSW PABSD
PSIGNB PSIGNW PSIGND
mm,mm
x, x
x,x,i
x,m,i
(x)mm,(x)mm
(x)mm,(x)mm
(x)mm,(x)mm
1
1
3
4
1
1
1
FP0
FP0
4
5
7
FP0/1
FP0/1
FP0/1
1
1
1
1
2
6
6
0.5
0.5
0.5
x,x
x,x
(x)mm,(x)mm
1
1
1
FP0
FP0/1
1
5
1
1
2
0.5-1
(x)mm,(x)mm
1
FP0/1
1
0.5-1
(x)mm,(x)mm
x,x
(x)mm,(x)mm
(x)mm,i
x,i
1
1
2
1
1
FP0/1
FP0
FP0
FP0
1
1
2
1
1
0.5
1
2
1
1
String instructions
PCMPESTRI
PCMPESTRM
PCMPISTRI
PCMPISTRM
x,x,i
x,x,i
x,x,i
x,x,i
9
8
6
5
FP0
FP0
FP0
FP0
21
17
17
13
21
17
17
13
+1 if mem
+1 if mem
+1 if mem
+1 if mem
Encryption instructions
PCLMULQDQ
x,x,i
8
FP0
10
10
+1 if mem
Logic instructions
PAND(N) POR PXOR
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
Other
EMMS
9
10
Floating point XMM instructions
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D MOVLPS/D
MOVHPS/D MOVLPS/D
MOVLHPS MOVHLPS
BLENDPS/PD
BLENDVPS/PD
BLENDVPS/PD
Operands
μops
Unit
x, x
x,m128
m128,x
x,m128
m128,x
x, x
x,m32/64
m32/64,x
x,m64
m64,x
x,x
x,x/m,i
x,x,xmm0
x,m,xmm0
1
1
1
1
1
1
1
1
1
1
1
1
2
3
FP0/1
Mem
Mem
Mem
Mem
FP0/1
Mem
Mem
Mem
Mem
FP0
FP0+1
FP0+1
Latency Reciprocal
throughput
1
3
4
3
4
1
3
4
3
4
1
1
4
5
0.5
1
1
1
1
0.5
1
1
1
1
1
1
4
5
Remarks
INSERTPS
INSERTPS
EXTRACTPS
EXTRACTPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD
x,x,i
x,m32,i
r32,x,i
m32,x,i
r32,x
m128,x
x,x,i
x,x,i
x, x
x, x
x, x
x, x
1
3
2
4
1
1
1
1
1
1
1
1
FP0
Mem
FP0
FP0
FP0
FP0
FP0
FP0
5
4
~370
1
1
1
1
1
1
1
5
4
5
1
1
1
1
1
1
1
1
Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32
r32,x
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
3
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
5
4
5
4
5
5
5
5
4
4
5
5
5
5
5
5
2
2
2
2
2
2
2
2
2
2
2
2
2
1
2
1
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
x, x
1
1
1
1
1
1
4
4
1
1
1
1
1
1
6
6
1
5
1
1
1
FP1
FP1
FP1
FP1
FP1
FP1
3
3
3
4
3
4
6
6
4
5
5
7
19
34
39
69
4
9
3
1
1
1
2
1
2
6
5
1
2
2
4
17
32
39
69
1
8
1
1
1
Arithmetic
ADDSS SUBSS
ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS
HADDPD HSUBPD
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D PS/D
COMISS/D UCOMISS/D
MAXSS/D MINSS/D
1
4
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP0
FP1
FP1
FP1
3
+1 if mem
+1 if mem
MAXPS MINPS
MAXPD MINPD
DPPS
DPPD
x, x
x, x
x,x,i
x,x,i
x,x,i
x,x,i
1
1
1
1
9
5
FP1
FP1
FP0
FP0
FP0
FP0
3
4
4
5
15
12
1
2
2
2
12
8
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
x, x
x, x
x, x
x, x
x, x
x, x
1
5
1
5
1
5
FP0
FP0
FP0
FP0
FP0
FP0
20
40
35
70
4
9
18
40
33
70
1
8
Logic
ANDPS/D
ANDNPS/D
ORPS/D
XORPS/D
x, x
x, x
x, x
x, x
1
1
1
1
FP0/1
FP0/1
FP0/1
FP0/1
1
1
1
1
0.5
0.5
0.5
0.5
Other
LDMXCSR
STMXCSR
FXSAVE
FXSAVE
FXRSTOR
FXRSTOR
m32
m32
m4096
m4096
m4096
m4096
5
4
115
123
114
123
10
12
132
143
118
122
8
11
132
143
118
122
ROUNDSS/D
ROUNDPS/D
+1 if mem
+1 if mem
32 bit mode
64 bit mode
32 bit mode
64 bit mode
VIA Nano 2000
VIA Nano 2000 series
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of micro-operations from the decoder or ROM. Note that the VIA Nano 2000 processor has no reliable performance monitor counter for μops. Therefore the number of μops cannot be determined except in simple cases.
Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
    I1: Integer add, Boolean, shift, etc.
    I2: Integer add, Boolean, move, jump.
    I12: Can use either I1 or I2, whichever is vacant first.
    MA: Multiply, divide and square root on all operand types.
    MB: Various integer and floating point SIMD operations.
    MBfadd: Floating point addition subunit under MB.
    SA: Memory store address.
    ST: Memory store.
    LD: Memory load.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
    Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
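The note on additional latency between units can be sketched as follows. This Python example adds a transfer delay whenever two consecutive instructions in a dependency chain execute in different units. The function name, the example chain and the delay values are hypothetical placeholders; the actual transfer latencies are listed in manual 3 and are not reproduced here.

```python
def chain_latency(chain, transfer_delay):
    """Total latency of a dependency chain, in clock cycles.

    chain          : list of (unit, latency) tuples in dependency order.
    transfer_delay : dict mapping (from_unit, to_unit) to the extra
                     clocks needed to move data between those units.
    """
    total = 0
    previous_unit = None
    for unit, latency in chain:
        # Add the transfer delay only when the data changes unit.
        if previous_unit is not None and unit != previous_unit:
            total += transfer_delay.get((previous_unit, unit), 0)
        total += latency
        previous_unit = unit
    return total


# Hypothetical chain: MB (1 clock) -> MA (3 clocks) -> MB (1 clock).
example_chain = [("MB", 1), ("MA", 3), ("MB", 1)]
# Hypothetical transfer delays of 1 clock in each direction.
example_delays = {("MB", "MA"): 1, ("MA", "MB"): 1}

with_transfers = chain_latency(example_chain, example_delays)  # 7
without_transfers = chain_latency(example_chain, {})           # 5
```

The listed latencies alone would give 5 clocks for this chain; the two unit crossings add 2 more under the assumed delays.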
Integer instructions
Operands μops
Port
Latency
Reciprocal
throughput
Remarks
Move instructions
MOV
MOV
r,r
r,i
1
1
I2
I2
1
1
1
1
MOV
MOV
MOV
MOV
MOV
MOV
MOV
r,m
m,r
m,i
r,sr
m,sr
sr,r
sr,m
1
1
1
LD
SA, ST
SA, ST
2
2
1
1.5
1.5
1
2
20
20
20
20
Latency 4 on
pointer register
MOVNTI
MOVSX MOVSXD
MOVZX
MOVSX MOVSXD
MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE
MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA
m,r
r,r
r,m
r,m
r,r
r,m
r,r
r,m
m
r
i
m
sr
1
2
1
2
3
SA, ST
2
1.5
I2
LD, I2
LD
I1, I2
LD, I1
I2
1
3
2
2
5
3
20
6
1
1
1
1
2
3
20
SA, ST
SA, ST
LD, SA, ST
8
r
(E/R)SP
m
sr
LD
9
1
1
r,m
1
SA
1
1
9
1
r
1
I2
1
1
30
30
1-2
1-2
14
14
14
I12
LD I12
1
LD I12 SA ST
5
1
1/2
1
2
1
1
2
1/2
1
1/2
m
m
m
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
I1
I1
1-2
1-2
2
17
8
15
1.25
4
5
20
9
12
1
1
6
1
LD
LD
1
2
3
1
2
3
1
2
1
3
I1
LD I1
LD I1 SA ST
I12
LD I12
I12
LD I12 SA ST
5
1
1
5
37
37
22
Implicit lock
Not in x64 mode
Not in x64 mode
Not in x64 mode
3 clock latency on
input register
Not in x64 mode
Not in x64 mode
Not in x64 mode
DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
MUL IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
SHLD
SHRD
SHLD SHRD
SHLD SHRD
SHLD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
24
23
30
r8
r16
r32
r64
r16,r16
r32,r32
r64,r64
r16,r16,i
r32,r32,i
r64,r64,i
r8
r16
r32
r64
r8
r16
r32
r64
1
1
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
r,i/cl
r,1
r,i/cl
r16,r16,i
r32,r32,i
r64,r64,i
r64,r64,i
r16,r16,cl
r32,r32,cl
r64,r64,cl
r64,r64,cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
1
2
3
1
2
1
1
1
1
2
2
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
MA
I1
I1
7-9
7-9
7-9
8-10
4-6
4-6
5-7
4-6
4-6
5-7
26
27-35
25-41
148-183
26
27-35
23-39
187-222
1
1
I12
LD I12
1
LD I12 SA ST
5
1
I12
LD I12
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
I1
1
1
1
28+3n
11
7
33
43
11
7
33
43
1
2
10
8
3
1
1
2
1
1
2
26
27-35
25-41
148-183
26
27-35
23-39
187-222
1
1
1/2
1
2
1/2
1
1
1
1
28+3n
11
7
33
43
11
7
33
43
1
8
1
2
10
8
2
Not in x64 mode
Not in x64 mode
Not in x64 mode
Extra latency to
other ports
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
do.
SETcc
SETcc
CLC STC CMC
CLD STD
r
m
I1
2
1
1
3
3
I1
3
3
I2
3
58
3
I2
3
3
55
1-3-8
3
3
1-3-8
1 if not jumping.
3 if jumping.
8 if >2 jumps in 16
bytes block
do.
do.
Control transfer instructions
JMP
JMP
short/near
far
JMP
JMP
JMP
Conditional jump
r
m(near)
m(far)
short/near
1
J(E/R)CXZ
LOOP
LOOP(N)E
short
short
short
1-3-8
1-3-8
25
1-3-8
1-3-8
25
CALL
CALL
near
far
3
72
3
72
CALL
CALL
CALL
r
m(near)
m(far)
3
4
72
3
3
72
3
3
39
39
3
3
39
39
13
7
RETN
RETN
RETF
RETF
BOUND
INTO
i
i
r,m
String instructions
LODSB/W/D/Q
REP LODSB/W/D/Q
STOSB/W/D/Q
REP STOSB/W/D/Q
1
3n+22
1-2
Small: 2n+2,
Big: 6 bytes
per clock
MOVSB/W/D/Q
REP MOVSB/W/D/Q
2
Small: 2n+45,
Big: 6 bytes
per clock
SCASB/W/D/Q
REP SCASB
1
2.2n
8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
8 if >2 jumps in 16
bytes block
do.
Not in x64 mode
Not in x64 mode
REP SCASW/D/Q
Small: 2n+50
Big: 5 bytes
per clock
CMPSB/W/D/Q
REP CMPSB/W/D/Q
Other
NOP (90)
Long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
6
2.4n+24
1
1
All
I12
a,0
a,b
4
53-173
40
1
1/2
25
23
52+5b
4
Blocks all ports
39
40
Floating point x87 instructions
Operands μops
Port and
Unit
Latency
Reciprocal
throughput
1
2
2
MB
LD MB
LD MB
1
3
3
MB
MB SA ST
MB SA ST
1
I2
1
4
4
54
1
5
5
125
0
7
5
5
6
5
5
1
1
1
54
1
1-2
1-2
125
1
1
MB
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FILD
FIST(T)(P)
FIST(T)(P)
FIST(T)(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m16
m32
m64
m16
m32
m64
r
AX
m16
m16
m16
2
13
1
1
I2
MB
m
m
0
321
195
Arithmetic instructions
1
10
2
5
3
13
2
1
1
321
195
Remarks
FADD(P) FSUB(R)(P)
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
r/m
r/m
r/m
r/m
r
m
m
m
m
1
1
1
1
1
1
1
1
1
MB
MA
MA
MB
MB
MB
MB
MB
MB
2
4
15-42
1
1
MB
1
2
15-42
1
1
1
1
1
2
4
42
2
1
41
Lower precision:
Lat: 4, Thr: 2
151-171
106-155
29
Math
FSCALE
FXTRACT
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
39
36-57
73
51-159
270-360
50-200
~60
~170
300-370
~170
Other
FNOP
WAIT
FNCLEX
FNINIT
1
1
MB
I12
0
1
1/2
57
85
Integer MMX and XMM instructions
Operands μops
Port and
Unit
Latency
Reciprocal
throughput
3
2-3
4
2-3
1
2-3
2-3
1
2-3
1
1-2
1
1
1
1
1-2
1
1
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
MOVDQA
r32/64,(x)mm
m32/64,(x)mm
(x)mm,r32/64
(x)mm,m32/64
(x)mm, (x)mm
(x)mm,m64
m64, (x)mm
xmm, xmm
xmm, m128
1
1
SA ST
1
1
1
1
1
1
LD
MB
LD
SA ST
MB
LD
Remarks
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
PACKSSWB/DW
PACKUSWB
PUNPCKH/LBW/WD/
DQ
PUNPCKH/LQDQ
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PINSRW
m128, xmm
m128, xmm
xmm, m128
xmm, m128
mm, xmm
xmm,mm
m64,mm
m128,xmm
1
1
1
1
1
1
3
3
SA ST
SA ST
LD
LD
MB
MB
2-3
2-3
2-3
2-3
1
1
~300
~300
1-2
1-2
1
1
1
1
2
2
v,v
1
MB
1
1
v,v
v,v
v,v
mm,mm,i
x,x,i
x,x,i
x,x,i
mm,mm
xmm,xmm
r32,(x)mm
r32,(x)mm,i
(x)mm,r32,i
1
1
1
1
1
1
1
MB
MB
MB
MB
MB
MB
MB
1
1
1
1
1
1
1
3
3
9
1
1
1
1
1
1
1
1-3
1-3
1
1
9
v,v
v,v
1
1
MB
MB
1
1
1
1
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
v,v
3
3
1
1
1
1
MB
MB
MB
MA
MA
MA
1
1
1
MB
MB
MB
MB
3
3
1
3
3
3
4
10
2
1
1
1
3
3
1
1
1
1
2
8
1
1
1
1
Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PMULL/HW PMULHUW
PMULHRSW
PMULUDQ
PMADDWD
PMADDUBSW
PSADBW
PAVGB/W
PMIN/MAXUB
PMIN/MAXSW
PABSB PABSW PABSD
v,v
1
MB
1
1
PSIGNB PSIGNW
PSIGND
v,v
1
MB
1
1
Logic instructions
PAND(N) POR PXOR
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
v,v
v,v
v,i
x,i
1
1
1
1
MB
MB
MB
MB
1
1
1
1
1
1
1
1
Other
EMMS
1
MB
1
Floating point XMM instructions
Operands μops
Port and
Unit
Latency
Reciprocal
throughput
1
1
1
1
1
1
1
1
MB
LD
SA ST
LD
SA ST
MB
LD
SA ST
1
MB
1
1
1
1
1
1
MB
MB
MB
MB
MB
MB
1
2-3
2-3
2-3
2-3
1
2-3
2-3
6
6
6
2
1
3
~300
1
1
1
1
1
1
1
1
1-2
1
1-2
1
1
1-2
1
1
1-2
1-2
1
1
2.5
1
1
1
1
1
1
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD
xmm,xmm
xmm,m128
m128,xmm
xmm,m128
m128,xmm
xmm,xmm
x,m32/64
m32/64,x
xmm,m64
xmm,m64
m64,xmm
m64,xmm
xmm,xmm
r32,xmm
m128,xmm
xmm,xmm,i
xmm,xmm,i
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
CVTSS2SD
CVTDQ2PS
CVT(T)PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T)PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,mm
mm,xmm
xmm,mm
mm,xmm
xmm,r32
r32,xmm
xmm,r32
r32,xmm
Arithmetic
ADDSS SUBSS
xmm,xmm
3-4
15
3-4
15
3
2
4
3
4
3
4
3
5
4
5
4
1
MBfadd
2-3
1
Remarks
ADDSD SUBSD
ADDPS SUBPS
ADDPD SUBPD
ADDSUBPS
ADDSUBPD
HADDPS HSUBPS
HADDPD HSUBPD
MULSS
MULSD
MULPS
MULPD
DIVSS
DIVSD
DIVPS
DIVPD
RCPSS
RCPPS
CMPccSS/D
CMPccPS/D
COMISS/D UCOMISS/D
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
MAXSS/D MINSS/D
MAXPS/D MINPS/D
xmm,xmm
xmm,xmm
xmm,xmm
Math
SQRTSS
SQRTPS
SQRTSD
SQRTPD
RSQRTSS
RSQRTPS
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
Logic
ANDPS/D
ANDNPS/D
ORPS/D
XORPS/D
xmm,xmm
xmm,xmm
xmm,xmm
xmm,xmm
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
m32
m32
m4096
m4096
1
1
1
1
1
1
1
MBfadd
MBfadd
2-3
2-3
2-3
2-3
2-3
5
5
3
4
3
4
15-22
15-36
42-82
24-70
5
14
2
2
1
1
MBfadd
MBfadd
3
2
2
1
1
1
MA
MA
MA
MA
33
126
62
122
5
14
33
126
62
122
5
11
MB
MB
MB
MB
1
1
1
1
1
1
1
1
45
13
208
232
29
13
208
232
1
1
1
1
1
1
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MBfadd
MA
MA
MA
MA
MA
MA
MA
MA
1
1
1
1
1
3
3
1
2
1
2
15-22
15-36
42-82
24-70
5
11
1
1
VIA-specific instructions
Instruction        Conditions           Clock cycles, approximately
XSTORE             Data available       160-400 clock giving 8 bytes
XSTORE             No data available    50-80 clock giving 0 bytes
REP XSTORE         Quality factor = 0   4800 clock per 8 bytes
REP XSTORE         Quality factor > 0   19200 clock per 8 bytes
REP XCRYPTECB      128 bits key         44 clock per 16 bytes
REP XCRYPTECB      192 bits key         46 clock per 16 bytes
REP XCRYPTECB      256 bits key         48 clock per 16 bytes
REP XCRYPTCBC      128 bits key         54 clock per 16 bytes
REP XCRYPTCBC      192 bits key         59 clock per 16 bytes
REP XCRYPTCBC      256 bits key         63 clock per 16 bytes
REP XCRYPTCTR      128 bits key         43 clock per 16 bytes
REP XCRYPTCTR      192 bits key         46 clock per 16 bytes
REP XCRYPTCTR      256 bits key         48 clock per 16 bytes
REP XCRYPTCFB      128 bits key         54 clock per 16 bytes
REP XCRYPTCFB      192 bits key         59 clock per 16 bytes
REP XCRYPTCFB      256 bits key         63 clock per 16 bytes
REP XCRYPTOFB      128 bits key         54 clock per 16 bytes
REP XCRYPTOFB      192 bits key         59 clock per 16 bytes
REP XCRYPTOFB      256 bits key         63 clock per 16 bytes
REP XSHA1                               3 clock per byte
REP XSHA256                             4 clock per byte
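The "clock per 16 bytes" and "clock per byte" figures can be converted to a data rate once a clock frequency is known. The following Python sketch shows the arithmetic. The helper names and the 1.0 GHz clock frequency are assumptions for illustration; the tables do not state a clock frequency, and the result is only as good as the approximate clock counts.

```python
def clocks_per_byte(clocks, bytes_per_block):
    """Average clock cycles spent per byte of data."""
    return clocks / bytes_per_block


def bytes_per_second(clocks, bytes_per_block, clock_hz):
    """Approximate data rate for a given core clock frequency in Hz."""
    return bytes_per_block * clock_hz / clocks


ASSUMED_CLOCK_HZ = 1.0e9  # hypothetical 1.0 GHz core clock

# 44 clock per 16 bytes is the figure listed above for
# REP XCRYPTECB with a 128 bits key.
ecb128_cpb = clocks_per_byte(44, 16)                      # 2.75
ecb128_rate = bytes_per_second(44, 16, ASSUMED_CLOCK_HZ)  # about 364 million bytes/s

# 3 clock per byte is the figure listed above for REP XSHA1.
sha1_rate = bytes_per_second(3, 1, ASSUMED_CLOCK_HZ)      # about 333 million bytes/s
```

The rate scales linearly with the clock frequency, so a different assumed frequency only changes the final multiplication.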
Nano 3000
VIA Nano 3000 series
List of instruction timings and μop breakdown
Explanation of column headings:
Operands: i = immediate data, r = register, mm = 64 bit mmx register, xmm = 128 bit xmm register, (x)mm = mmx or xmm register, sr = segment register, m = memory, m32 = 32-bit memory operand, etc.
μops: The number of micro-operations from the decoder or ROM. Note that the VIA Nano 3000 processor has no reliable performance monitor counter for μops. Therefore the number of μops cannot be determined except in simple cases.
Port: Tells which execution port or unit is used. Instructions that use the same port cannot execute simultaneously.
    I1: Integer add, Boolean, shift, etc.
    I2: Integer add, Boolean, move, jump.
    I12: Can use either I1 or I2, whichever is vacant first.
    MA: Multiply, divide and square root on all operand types.
    MB: Various integer and floating point SIMD operations.
    MBfadd: Floating point addition subunit under MB.
    SA: Memory store address.
    ST: Memory store.
    LD: Memory load.
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NANs and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay.
    Note: There is an additional latency for moving data from one unit or subunit to another. A table of these latencies is given in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs". These additional latencies are not included in the listings below where the source and destination operands are of the same type.
Reciprocal throughput: The average number of clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
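The statement that instructions using the same port cannot execute simultaneously, together with the I1, I2 and I12 port labels, suggests a simple lower bound on the clock count for a group of integer instructions. The Python sketch below computes that bound under the simplifying assumptions that every instruction is a single μop, occupies its port for exactly one clock, and is independent of the others. The function name is hypothetical, and real code is further limited by decoding, latencies and the other ports.

```python
import math


def integer_port_bound(n_i1, n_i2, n_i12):
    """Lower bound on clocks for single-μop integer instructions.

    n_i1  : number of instructions that can only use port I1
    n_i2  : number of instructions that can only use port I2
    n_i12 : number of instructions that can use either I1 or I2

    Each port handles at most one instruction per clock, so the bound
    is the larger of the busiest fixed port and an even split of all
    instructions over the two ports.
    """
    total = n_i1 + n_i2 + n_i12
    return max(n_i1, n_i2, math.ceil(total / 2))


# Three I1-only, one I2-only and two I12 instructions: the I12
# instructions go to I2, so both ports are busy for 3 clocks.
balanced = integer_port_bound(3, 1, 2)    # 3
# Five I1-only instructions dominate regardless of the rest.
i1_limited = integer_port_bound(5, 1, 2)  # 5
```

A mix that leans heavily on one fixed port (for example shifts, which are listed under I1) is bounded by that port even when I2 is idle.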
Integer instructions
Operands
μops
Port
Latency Reciprocal
throughput
MOV
MOV
r,r
r,i
1
1
I2
I12
1
1
1
1/2
MOV
MOV
MOV
MOV
MOV
MOV
r,m
m,r
m,i
r,sr
m,sr
sr,r
1
1
1
LD
SA, ST
SA, ST
I12
2
2
1
1.5
1.5
1/2
1.5
20
Remarks
Move instructions
20
Latency 4 on pointer
register
MOV
MOVNTI
MOVSX MOVZX
MOVSXD
MOVSX MOVSXD
MOVZX
CMOVcc
CMOVcc
XCHG
XCHG
XLAT
PUSH
PUSH
PUSH
PUSH
PUSHF(D/Q)
PUSHA(D)
POP
POP
POP
POP
POPF(D/Q)
POPA(D)
LAHF
SAHF
SALC
LEA
BSWAP
LDS LES LFS LGS LSS
PREFETCHNTA
PREFETCHT0/1/2
LFENCE MFENCE
SFENCE
Arithmetic instructions
ADD SUB
ADD SUB
ADD SUB
ADC SBB
ADC SBB
ADC SBB
CMP
CMP
INC DEC NEG NOT
INC DEC NEG NOT
AAA
AAS
DAA
sr,m
m,r
r,r
r64,r32
r,m
r,m
r,r
r,m
r,r
r,m
m
r
i
m
sr
r
(E/R)SP
m
sr
1
1
2
1
1
3
3
1
1
3
9
2
SA, ST
I12
LD, I12
LD
I12
LD, I12
I12
LD, I1
SA, ST
SA, ST
LD, SA, ST
20
2
1
1
3
2
1
5
3
18
6
1
1
10
20
1.5
1/2
1
1
1
1/2
1
1.5
18
2
1-2
1-2
2
6
2
15
1.25
4
2
11
1
12
1
1
6
1
1
1
1
28
28
1
1
2
LD
3
3
16
1
1
2
r,m
r
1
1
m
m
m
12
1
1
I1
I1
SA
I2
LD
LD
Implicit lock
Not in x64 mode
Not in x64 mode
Not in x64 mode
Extra latency to other
ports
15
r,r/i
r,m
m,r/i
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r
m
1
2
3
1
2
3
1
2
1
3
12
12
14
I12
LD I12
1
LD I12 SA ST
5
1
I1
LD I1
LD I1 SA ST
I12
LD I12
I12
LD I12 SA ST
5
1
1
5
1/2
1
2
1
1
2
1/2
1
1/2
37
22
22
Not in x64 mode
Not in x64 mode
Not in x64 mode
DAS
AAD
AAM
MUL IMUL
MUL IMUL
MUL IMUL
r8
r16
r32
14
7
13
1
3
3
I2
I2
I2
2
3
3
MUL IMUL
IMUL
IMUL
r64
r16,r16
r32,r32
3
1
1
MA
I2
I2
8
2
2
8
1
1
IMUL
IMUL
IMUL
r64,r64
r16,r16,i
r32,r32,i
1
1
1
MA
I2
I2
5
2
2
2
1
1
IMUL
DIV
DIV
DIV
DIV
IDIV
IDIV
IDIV
IDIV
CBW CWDE CDQE
CWD CDQ CQO
r64,r64,i
r8
r16
r32
r64
r8
r16
r32
r64
1
MA
MA
MA
MA
MA
MA
MA
MA
MA
I2
I2
5
22-24
24-28
22-30
145-162
21-24
24-28
18-26
182-200
1
1
Logic instructions
AND OR XOR
AND OR XOR
AND OR XOR
TEST
TEST
SHR SHL SAR
ROR ROL
RCR RCL
RCR RCL
SHLD SHRD
SHLD SHRD
SHLD
SHRD
BT
BT
BT
BTR BTS BTC
BTR BTS BTC
BTR BTS BTC
BSF BSR
SETcc
SETcc
1
1
r,r/i
r,m
m,r/i
r,r/i
m,r/i
r,i/cl
r,i/cl
r,1
r,i/cl
r16,r16,i/cl
r32,r32,i/cl
r64,r64,i/cl
r64,r64,i/cl
r,r/i
m,r
m,i
r,r/i
m,r
m,i
r,r
r8
m
24
24
31
1
I12
2
LD I12
3
LD I12 SA ST
1
I12
2
LD I12
1
I12
1
I1
1
I1
5+2n
I1
2
I1
2
I1
16
I1
23
I1
1
I1
6
I1
2
I1
2
I1
8
I1
5
I1
2
I1
1
I1
2
1
5
1
1
1
1
28+3n
2
2
32
42
1
2
10
8
2
1
Not in x64 mode
Not in x64 mode
Not in x64 mode
Extra latency to other
ports
Extra latency to other
ports
Extra latency to other
ports
2
22-24
24-28
22-30
145-162
21-24
24-28
18-26
182-200
1
1
1/2
1
2
1/2
1
1/2
1
1
28+3n
2
2
32
42
1
8
1
2
10
8
2
1
2
CLC STC CMC
CLD STD
3
3
I1
I1
3
3
3
3
Control transfer instructions
JMP
JMP
short/near
far
1
14
I2
3
3
50
JMP
JMP
JMP
r
m(near)
m(far)
2
2
17
I2
3
3
3
3
42
Conditional jump
J(E/R)CXZ
LOOP
LOOP(N)E
short/near
short
short
short
1
2
2
5
CALL
CALL
near
far
CALL
CALL
CALL
r
m(near)
m(far)
RETN
RETN
RETF
RETF
BOUND
INTO
i
i
r,m
I2
1-3-8
1-3-8
1-3-8
24
1-3-8
1-3-8
1-3-8
24
2
17
3
3
58
2
3
19
3
4
3
3
54
3
4
20
20
9
3
3
3
3
3
49
49
13
7
String instructions
LODSB/W/D/Q
REP LODSB/W/D/Q
STOSB/W/D/Q
2
3n
1
REP STOSB/W/D/Q
MOVSB/W/D/Q
1
3n+27
1-2
Small:
n+40, Big:
6-7
bytes/clk
3
2
Small:
2n+20,
Big: 6-7
bytes/clk
3
1
2.4n
Small:
2n+31,
Big: 5
bytes/clk
REP MOVSB/W/D/Q
SCASB/W/D/Q
REP SCASB
REP SCASW/D/Q
8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
1 if not jumping.
3 if jumping.
8 if >2 jumps in 16
bytes block
8 if >2 jumps in 16
bytes block
Not in x64 mode
8 if >2 jumps in 16
bytes block
do.
8 if >2 jumps in 16
bytes block
do.
Not in x64 mode
Not in x64 mode
CMPSB/W/D/Q
REP CMPSB/W/D/Q
Other
NOP (90)
long NOP (0F 1F)
PAUSE
ENTER
ENTER
LEAVE
CPUID
RDTSC
RDPMC
5
a,0
a,b
0-1
0-1
2
10
6
2.2n+30
I12
I12
3
0
0
2
55-146
1/2
1/2
6
21
52+5b
2
Sometimes fused
37
40
Floating point x87 instructions
Operands
μops
Port
r
m32/m64
m80
m80
r
m32/m64
m80
m80
r
m16
m32
m64
m16
m32
m64
MB
LD MB
LD MB
m
m
1
2
2
36
1
3
3
80
1
3
2
2
3
3
3
1
3
1
1
3
5
3
1
1
122
115
r/m
1
Latency Reciprocal
throughput
Move instructions
FLD
FLD
FLD
FBLD
FST(P)
FST(P)
FSTP
FBSTP
FXCH
FILD
FILD
FILD
FIST(T)(P)
FIST(T)(P)
FIST(T)(P)
FLDZ FLD1
FLDPI FLDL2E etc.
FCMOVcc
FNSTSW
FNSTSW
FLDCW
FNSTCW
FINCSTP FDECSTP
FFREE(P)
FNSAVE
FRSTOR
Arithmetic instructions
FADD(P) FSUB(R)(P)
r
AX
m16
m16
m16
MB
MB SA ST
MB SA ST
I2
1
4
4
54
1
5
5
125
0
7
5
5
6
5
5
MB
319
196
1
10
2
1
2
8
2
1
1
319
196
2
1
MB
2
I2
MB
0
MB
1
1
1
54
1
1-2
1-2
125
1
Remarks
FMUL(P)
FDIV(R)(P)
FABS
FCHS
FCOM(P) FUCOM
FCOMPP FUCOMPP
FCOMI(P) FUCOMI(P)
FIADD FISUB(R)
FIMUL
FIDIV(R)
FICOM(P)
FTST
FXAM
FPREM
FPREM1
FRNDINT
r/m
r/m
r/m
r
m
m
m
m
Math
FSCALE
FXTRACT
1
MA
MA
MB
MB
MB
MB
MB
MB
4
14-23
1
1
MB
11
2
38
~130
~130
27
22
13
37
57
1
1
1
1
1
3
3
3
3
1
15
FSQRT
FSIN FCOS
FSINCOS
F2XM1
FYL2X
FYL2XP1
FPTAN
FPATAN
2
2
14-23
1
1
1
1
1
2
4
16
2
1
38
Less at lower
precision
73
~150
270-360
50-200
~50
~50
300-370
~180
Other
FNOP
WAIT
FNCLEX
FNINIT
1
1
MB
I12
0
1
1/2
59
84
Integer MMX and XMM instructions
Operands
μops
Port
r,(x)mm
m,(x)mm
(x)mm,r
(x)mm,m
v,v
(x)mm,m64
m64, (x)mm
x,x
1
1
1
1
1
1
1
1
MB
SA ST
I2
LD
MB
LD
SA ST
MB
Latency Reciprocal
throughput
Move instructions
MOVD
MOVD
MOVD
MOVD
MOVQ
MOVQ
MOVQ
MOVDQA
3
2
4
2
1
2
2
1
1
1-2
1
1
1
1
1-2
1
Remarks
MOVDQA
MOVDQA
MOVDQU
MOVDQU
LDDQU
MOVDQ2Q
MOVQ2DQ
MOVNTQ
MOVNTDQ
MOVNTDQA
PACKSSWB/DW
PACKUSWB
PACKUSDW
PUNPCKH/LBW/WD/DQ
PUNPCKH/LQDQ
PSHUFB
PSHUFW
PSHUFL/HW
PSHUFD
PBLENDVB
PBLENDW
PALIGNR
MASKMOVQ
MASKMOVDQU
PMOVMSKB
PEXTRW
PEXTRB/D/Q
PINSRW
PINSRB/D/Q
PMOVSX/ZXBW/BD/
BQ/WD/WQ/DQ
x, m128
m128, x
m128, x
x, m128
x, m128
mm, x
x,mm
m64,mm
m128,x
x,m128
1
1
1
1
1
1
1
2
2
1
LD
SA ST
SA ST
LD
LD
MB
MB
2
2
2
2
2
1
1
~360
~360
2
1
1-2
1-2
1
1
1
1
2
2
1
v,v
x,x
v,v
v,v
v,v
mm,mm,i
x,x,i
x,x,i
x,x,xmm0
x,x,i
x,x,i
mm,mm
x,x
r32,(x)mm
r32 ,(x)mm,i
r32/64,x,i
(x)mm,r32,i
x,r32/64,i
1
1
1
1
1
1
1
1
1
1
1
MB
MB
MB
MB
MB
MB
MB
MB
MB
MB
MB
1
1
1
1
1
1
1
1
2
1
1
1
1
2
2
MB
MB
MB
MB
3
3
3
5
5
1
1
1
1
1
1
1
1
2
1
1
1-2
1-2
1
1
1
1
1
x,x
1
MB
1
1
v,v
v,v
1
1
MB
MB
1
1
1
1
v,v
v,v
v,v
x,x
v,v
v,v
x,x
v,v
x,x
v,v
v,v
v,v
x,x,i
v,v
3
3
1
1
1
1
1
1
1
1
7
1
1
1
MB
MB
MB
MB
MA
MA
MA
MA
MA
MA
3
3
1
1
3
3
3
3
3
4
10
2
2
1
3
3
1
1
1
1
1
1
1
2
8
1
1
1
Arithmetic instructions
PADD/SUB(U)(S)B/W/D
PADDQ PSUBQ
PHADD(S)W
PHSUB(S)W
PHADDD PHSUBD
PCMPEQ/GTB/W/D
PCMPEQQ
PMULL/HW PMULHUW
PMULHRSW
PMULLD
PMULUDQ
PMULDQ
PMADDWD
PMADDUBSW
PSADBW
MPSADBW
PAVGB/W
MB
MB
MB
PMIN/MAXSW
PMIN/MAXUB
PMIN/MAXSB/D
PMIN/MAXUW/D
PHMINPOSUW
PABSB PABSW PABSD
PSIGNB PSIGNW
PSIGND
Logic instructions
PAND(N) POR PXOR
PTEST
PSLL/RL/RAW/D/Q
PSLL/RL/RAW/D/Q
PSLL/RLDQ
v,v
v,v
x,x
x,x
x,x
1
1
1
1
1
MB
MB
MB
MB
MB
1
1
1
1
2
1
1
1
1
1
v,v
1
MB
1
1
v,v
1
MB
1
1
v,v
v,v
v,v
(x)xmm,i
x,i
1
1
1
1
1
MB
MB
MB
MB
MB
1
3
1
1
1
1
1
1
1
1
1
MB
Operands
μops
Port
x,x
x,m128
m128,x
x,m128
m128,x
x,x
x,m32/64
m32/64,x
x,m64
x,m64
m64,x
m64,x
x,x
r32,x
m128,x
x,x,i
x,x,i
x,x
x,x
x,x
x,x
1
1
1
1
2
1
1
2
2
2
3
1
1
MB
LD
SA ST
LD
SA ST
MB
LD
SA ST
x,x
x,x
x,x
2
1
2
Other
EMMS
1
Floating point XMM instructions
Move instructions
MOVAPS/D
MOVAPS/D
MOVAPS/D
MOVUPS/D
MOVUPS/D
MOVSS/D
MOVSS/D
MOVSS/D
MOVHPS/D
MOVLPS/D
MOVHPS/D
MOVLPS/D
MOVLHPS MOVHLPS
MOVMSKPS/D
MOVNTPS/D
SHUFPS
SHUFPD
MOVDDUP
MOVSH/LDUP
UNPCKH/LPS
UNPCKH/LPD
Conversion
CVTPD2PS
CVTSD2SS
CVTPS2PD
2
1
1
1
1
1
1
MB
MB
MB
MB
MB
MB
Latency Reciprocal
throughput
1
2
2
2
2
1
2-3
2-3
6
6
6
2
1
3
~360
1
1
1
1
1
1
1
1
1
1
1
1
1
1-2
1
1
1-2
1-2
1
1
1-2
1
1
1
1
1
1
5
2
5
2
1
Remarks
CVTSS2SD
CVTDQ2PS
CVT(T) PS2DQ
CVTDQ2PD
CVT(T)PD2DQ
CVTPI2PS
CVT(T)PS2PI
CVTPI2PD
CVT(T) PD2PI
CVTSI2SS
CVT(T)SS2SI
CVTSI2SD
CVT(T)SD2SI
x,x
x,x
x,x
x,x
x,x
x,mm
mm,x
x,mm
mm,x
x,r32
r32,x
x,r32
r32,x
1
1
1
2
Instruction           Operands  μops  Port    Latency  Reciprocal throughput
Arithmetic
ADDSS SUBSS           x,x       1     MBfadd  2        1
ADDSD SUBSD           x,x       1     MBfadd  2        1
ADDPS SUBPS           x,x       1     MBfadd  2        1
ADDPD SUBPD           x,x       1     MBfadd  2        1
ADDSUBPS              x,x       1     MBfadd  2        1
ADDSUBPD              x,x       1     MBfadd  2        1
HADDPS HSUBPS         x,x       3     MBfadd  5        3
HADDPD HSUBPD         x,x       3     MBfadd  5        3
MULSS                 x,x       1     MA      3        1
MULSD                 x,x       1     MA      4        2
MULPS                 x,x       1     MA      3        1
MULPD                 x,x       1     MA      4        2
DIVSS                 x,x       1     MA      13       13
DIVSD                 x,x       1     MA      13-20    13-20
DIVPS                 x,x       1     MA      24       24
DIVPD                 x,x       1     MA      21-38    21-38
RCPSS                 x,x       1     MA      5        5
RCPPS                 x,x       3     MA      14       11
CMPccSS/D             x,x       1     MBfadd  2        1
CMPccPS/D             x,x       1     MBfadd  2        1
COMISS/D UCOMISS/D    x,x       1     MBfadd  3        1
MAXSS/D MINSS/D       x,x       1     MBfadd  2        1
MAXPS/D MINPS/D       x,x       1     MBfadd  2        1
Math
SQRTSS                x,x       1     MA      33       33
SQRTPS                x,x       1     MA      64       64
SQRTSD                x,x       1     MA      62       62
SQRTPD                x,x       1     MA      122      122
RSQRTSS               x,x       1             5        5
RSQRTPS               x,x       3             14       11
Logic
ANDPS/D               x,x       1     MB      1        1
MB
2
1
2
2
2
1
2
1
2
3
2
5
4
5
4
4
4
5
4
5
4
1
1
1
2
2
1
1
2
1
1
ANDNPS/D
ORPS/D
XORPS/D
x,x
x,x
x,x
Other
LDMXCSR
STMXCSR
FXSAVE
FXRSTOR
m32
m32
m4096
m4096
1
1
1
MB
MB
MB
1
1
1
1
1
1
31
13
97
201
VIA-specific instructions
Instruction      Conditions          Clock cycles, approximately
XSTORE           Data available      160-400 clock giving 8 bytes
XSTORE           No data available   50-80 clock giving 0 bytes
REP XSTORE       Quality factor = 0  1300 clock per 8 bytes
REP XSTORE       Quality factor > 0  5455 clock per 8 bytes
REP XCRYPTECB    128 bits key        15 clock per 16 bytes
REP XCRYPTECB    192 bits key        17 clock per 16 bytes
REP XCRYPTECB    256 bits key        18 clock per 16 bytes
REP XCRYPTCBC    128 bits key        29 clock per 16 bytes
REP XCRYPTCBC    192 bits key        33 clock per 16 bytes
REP XCRYPTCBC    256 bits key        37 clock per 16 bytes
REP XCRYPTCTR    128 bits key        23 clock per 16 bytes
REP XCRYPTCTR    192 bits key        26 clock per 16 bytes
REP XCRYPTCTR    256 bits key        27 clock per 16 bytes
REP XCRYPTCFB    128 bits key        29 clock per 16 bytes
REP XCRYPTCFB    192 bits key        33 clock per 16 bytes
REP XCRYPTCFB    256 bits key        37 clock per 16 bytes
REP XCRYPTOFB    128 bits key        29 clock per 16 bytes
REP XCRYPTOFB    192 bits key        33 clock per 16 bytes
REP XCRYPTOFB    256 bits key        37 clock per 16 bytes
REP XSHA1                            5 clock per byte
REP XSHA256                          5 clock per byte
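
The approximate cycle counts listed for the REP XCRYPT instructions can be converted into throughput figures. The following is a minimal sketch in Python, not part of the measurements themselves. The function names are hypothetical, and the 1.8 GHz clock frequency in the usage part is an assumed example value that does not come from these tables.

```python
# Approximate clock cycles per 16-byte block for the VIA REP XCRYPT*
# instructions, copied from the VIA-specific instructions table.
# Keys are (cipher mode, key size in bits).
CYCLES_PER_BLOCK = {
    ("ECB", 128): 15, ("ECB", 192): 17, ("ECB", 256): 18,
    ("CBC", 128): 29, ("CBC", 192): 33, ("CBC", 256): 37,
    ("CTR", 128): 23, ("CTR", 192): 26, ("CTR", 256): 27,
    ("CFB", 128): 29, ("CFB", 192): 33, ("CFB", 256): 37,
    ("OFB", 128): 29, ("OFB", 192): 33, ("OFB", 256): 37,
}

BLOCK_BYTES = 16  # the table gives cycles per 16 bytes


def bytes_per_clock(mode, key_bits):
    """Approximate throughput in bytes per clock cycle (hypothetical helper)."""
    return BLOCK_BYTES / CYCLES_PER_BLOCK[(mode, key_bits)]


def megabytes_per_second(mode, key_bits, clock_hz):
    """Approximate throughput in MB/s (10**6 bytes) at a given core clock."""
    return bytes_per_clock(mode, key_bits) * clock_hz / 1e6


if __name__ == "__main__":
    ASSUMED_CLOCK_HZ = 1.8e9  # assumption for illustration only
    for (mode, key_bits) in sorted(CYCLES_PER_BLOCK):
        print(f"{mode} {key_bits}-bit key: "
              f"{bytes_per_clock(mode, key_bits):.2f} bytes/clk, "
              f"{megabytes_per_second(mode, key_bits, ASSUMED_CLOCK_HZ):.0f} MB/s")
```

The figures obtained this way are only as accurate as the approximate cycle counts in the table and ignore any setup overhead per REP instruction.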