Rev. 1.0 HM 10/7/2010
®
* Other brands and names may be claimed as the property of others.
Assumptions
Speed Limitations
x86 Architecture Progression
Architecture Enhancements
Intel® x86 Architectures
Audience: Understands generic x86 architecture
Knows some assembly language
– Flavor used here: gas, as used in ddd disassembly
– Result on the right-hand side:
    mov [temp], %eax    # a load into register %eax
    add %eax, %ebx      # new integer sum is in register %ebx
– Different from Microsoft* masm and tasm
Understand some architectural concepts:
– Caches, Multi-level caches, (some MESI)
– Threading, multi-threaded code
– Blocking (cache), blocking (aka tiling), blocking (thread synch.)
Causes of pipeline stalls
– Control flow change
– Data dependence, registers and data
NOT discussed: asm, VTune, CISC vs. RISC
Performance Limiters
Register Starvation
Processor-Memory Gap
Processor Stalls
Store Forwarding
Misc Limitations:
– Spin-Lock in Multi Thread
– Misaligned Data
– Denorm Floats
Architectural limitations the programmer or compiler can overcome:
– Indirect limitations: stall via branch, call, return
– Incidental limits: resource constraint
– Historical limits: register starved x86
– Technological: ALU speed vs. memory access speed
– Logical limits: data- and resource dependence
How many regs needed (compiler or programmer)?
– Infinite is perfect
– 1024 is very good
– 64 acceptable
– 16 is crummy
– 4+4 is x86
– 1 is saa (single-accumulator architecture)
Formally on x86: 16 regs. Quick test:
– ax, bx, cx, dx
– si, di
– bp, sp, ip
– cs, ds, ss, es, fs, gs, flags
Of which ax, bx, cx, dx are GPRs, almost
Rest can be used as better temps
ax & dx used for * and /, cx for loop
Absence of regs causes
– Spurious memory spills and loads
– False data dependences (not "dependencies")
Except for the single-accumulator architecture, no other architecture is more register starved than x86

Instruction stream (added ops, memory latency):
    mov %eax, [mem1]    # spill %eax to memory
    use stuff, %eax     # %eax needed for other work
    mov [mem1], %eax    # reload %eax

Instruction stream (false data dependence):
    mov %eax, [tmp]     # spill %eax
    add %ebx, %eax
    imul %ecx
    mov %eax, [prod]
    mov [tmp], %eax     # reload creates a false data dependence
No solution in the ISA: x86 has had 4 GPRs since the 8086
Improved via internal register renaming
– Pentium® Pro has hundreds of internal regs
Added registers in MMX
– Visible to you, programmer and compiler
– fp(0) .. fp(7): 80 bits as FP, 64 bits as MMX; but note: context switch
Added registers in SSE
– xmm(0) .. xmm(7) 128 bits
[Chart: processor vs. memory performance over time, log scale. "Moore's Law": µProc performance grows ~60%/yr, DRAM only ~7%/yr; the processor-memory performance gap grows ~50%/yr. Source: David Patterson, UC Berkeley]
[Chart: attacking the gap over time. Instruction level: caches, multilevel caches; Intel® Pentium II processor: out-of-order execution (~30%). Thread level: Intel® Xeon™ processor: Hyperthreading Technology (~30%).]
Hyperthreading Technology: feeds two threads to exploit shared execution units
Memory speed has NOT kept up with advances in processor speed
– Avg. integer add ~0.16 ns (Xeon), but memory accesses take ~10 ns or more
CPU hardware resource utilization is only ~35% on average
– Limited due to memory stalls and dependences
Possible solutions to the memory speed mismatch?
The memory speed mismatch is a major source of CPU stalls
Cache provided
Methods to manipulate cache
Tools provided to pre-fetch data
– At risk of a superfluous fetch if control flow changes
A stalled cycle is a cycle in which the processor cannot receive or schedule new instructions
– Total Cycles = Total Stall Cycles + Productive Cycles
– Stalls waste processor cycles
– Perfmon, Linux ps, top, and other system tools show stalled cycles as busy CPU cycles
– Intel® VTune Analyzer is used to monitor stalls (HP* PFMon)
Stalls occur because:
– An instruction needs a resource that is not available
– Dependences (control- or data-) between instructions
– The processor / instruction waits for some signal or event
Sample resource limitations:
– Registers
– Execution ports
– Execution units
– Load / store ports
– Internal buffers (ROBs, WOBs, etc.)
Sample events:
– Exceptions, cache misses, TLB misses, etc.
– Common to all: they hold up compute progress
Change in flow of control causes stalls
Processors handle control dependences:
– Via branch prediction hardware
– Conditional move to avoid branch & pipeline stall
Instruction stream (conditional branch: barrier, predict):
    mov [%ebp+8], %eax
    cmp 1, %eax
    jg bigger           # barrier (predict)
    mov 1, %eax
    . . .
bigger:

Instruction stream (call: barrier, predict):
    dec %ecx
    push %eax
    call rfact          # barrier (predict)
    mov %ecx, [%ebp+8]
    mul %ecx
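The conditional-move idea above can be sketched in C. Writing a select without a branch lets the compiler emit a cmov instead of a predicted jump, so a mispredictable branch never enters the pipeline. This is a sketch; function names are illustrative, and the compiler ultimately decides which instruction to emit.

```c
#include <stdint.h>

/* Branch-free select: returns the larger of a and b.
   Compilers typically lower this ternary to a cmov,
   removing the unpredictable branch. */
int32_t branchless_max(int32_t a, int32_t b)
{
    return (a > b) ? a : b;
}

/* Fully explicit mask version, useful when you must
   guarantee no branch is emitted. */
int32_t mask_max(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);   /* all ones if a > b, else 0 */
    return (a & mask) | (b & ~mask);
}
```

On random data the masked form avoids the ~50% misprediction rate a plain `if` would incur; on well-predicted data the plain branch can be just as fast.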
Data dependence limits performance
The programmer / compiler cannot remove true dependences; hardware helps:
– Xeon has register renaming to avoid false data dependences
– Supports out-of-order execution to hide the effects of dependences

Instruction stream (memory latency):
    . . .
    mov eax, [ebp+8]
    cmp eax, 1

Instruction stream (false data dependence):
    mov [temp], eax
    add eax, ebx
    mul ecx
    mov [prod], eax
    mov eax, [temp]
    . . .
D-side
– DTLB Misses
– Memory Hierarchy L1, L2 and L3 misses
Core
– Store Buffer Stalls
– Load/Store splits
– Store forwarding hazard
– Loading partial/misaligned data
– Branch Mispredicts
I-side
– Streaming Buffer Misses
– ITLB Misses
– TC misses
– 64K Aliasing conflicts
Misc
– Machine Clears
Reduce processor stalls by prefetching data
Reduce control-flow changes by conditional move
Reduce false dependences by using register temps from the MMX (FP) and XMM pools
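Data prefetching can be sketched in portable C using `__builtin_prefetch`, a GCC/Clang builtin (not part of the slide's toolchain; shown here as one way to express the idea). The prefetch distance of 16 elements is a tuning guess, not a rule.

```c
#include <stddef.h>

/* Sum an array while prefetching ahead, so memory latency
   overlaps with computation instead of stalling it. */
long sum_with_prefetch(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        if (i + 16 < n)
            /* args: address, 0 = read, 1 = low temporal locality */
            __builtin_prefetch(&a[i + 16], 0, 1);
#endif
        sum += a[i];
    }
    return sum;
}
```

As the slide warns, a prefetch issued just before a control-flow change may fetch data that is never used; it costs an instruction slot and bus bandwidth either way.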
Causes:
1) Too many WC streams
2) WB loads/stores contending for fill buffers to access L2 cache or memory
Detection (VTune, event-based sampling):
– Ext. Bus Partial Write Trans.
– L2 Cache Request
– Ext. Bus Burst Read Trans.
– Ext. Bus RFO Trans.
[Diagram: first-level cache -> fill/WC buffers -> second-level cache -> FSB -> memory. An incomplete WC buffer drains as 3 "partial" 8-byte bus transactions; a complete WC buffer drains as 1 bus transaction.]
Partial writes reduce actual front-side bus bandwidth
– ~3x lower for PIII
– ~7x lower for the Pentium 4 processor, due to the longer cache line
Store forwarding: a load from an address just stored to can receive its data directly from the store buffer, faster than via a memory access.
Large penalty for non-forwarding cases (1.1-1.3x)

Will forward:
– Load aligned with the store (same start address)
– Load contained in the store
– Load contained in a single store

Forwarding penalty:
– Load overlaps multiple stores
– Load not contained in, or misaligned with, the store
– 128-bit forwards that are not 16-byte aligned (16-byte boundaries)

MSVC before 7.0 generates these penalty patterns; the Intel compiler does not.
Pick the right compiler for HLL programs
Use VTune to check asm code
In asm programs, ensure loads after stores are:
– Contained in the stored data (subset or proper subset)
– In a single previous store, not in the sum of multiple stores
– Thus do store-combining: assemble data together, then store once
– Load and store start at the same address
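The store-combining advice can be sketched in C: instead of several narrow stores followed by one wide load (which cannot be forwarded from multiple store-buffer entries), assemble the value in a register and issue one store. The field layout below is illustrative only.

```c
#include <stdint.h>

/* Assemble four bytes in a register, then store once.
   A following 4-byte load of *dst hits a single store
   and can be forwarded; four separate byte stores could not. */
uint32_t combine_then_store(uint32_t *dst,
                            uint8_t b0, uint8_t b1,
                            uint8_t b2, uint8_t b3)
{
    uint32_t word = (uint32_t)b0
                  | ((uint32_t)b1 << 8)
                  | ((uint32_t)b2 << 16)
                  | ((uint32_t)b3 << 24);
    *dst = word;        /* one store */
    return *dst;        /* load is contained in that single store */
}
```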
Spin-lock in multi-threaded code
– Don't use a busy wait just because you have (almost) a second processor for the second thread
Misaligned data
– Don't align data on an arbitrary boundary just because the architecture can fetch from any address
Dumb errors
– Failure to use the proper tool (library, compiler, performance analyzer)
– Failure to use tiling (aka blocking) or SW pipelining
Denormalized floats
Use pause, when applicable!
– New NetBurst instruction
Use compiler switches to align data on addresses divisible by the size of the largest individual data object
– Who cares about wasting 7 bytes to force 8-byte alignment?
Be smart, pick the right tools
– Instruct the compiler to SW pipeline
– In asm, manually SW pipeline; note this is easier on EPIC than on VLIW, which sometimes lacks prologue/epilogue support
– Enable the compiler to partition larger data structures into smaller suitable blocks, for improved locality
– Cache-parameter dependent
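A spin-wait using pause can be sketched in C11. `_mm_pause()` maps to the pause instruction on SSE2-capable x86 (it hints that this is a spin loop, saving power and avoiding a memory-order mispredict on loop exit); the fallback macro is an assumption for other targets, and the helper names are illustrative.

```c
#include <stdatomic.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define cpu_relax() _mm_pause()   /* emits the pause instruction */
#else
#define cpu_relax() ((void)0)     /* no-op on non-x86 targets */
#endif

/* Spin until the flag is acquired, pausing between retries
   instead of hammering the bus with test-and-set attempts. */
void spin_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        cpu_relax();
}

void spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

On a Hyperthreaded processor this matters twice: the pausing thread also stops starving its sibling logical CPU of shared execution resources.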
Exercise: first of 2 labs, this one a "two-minute" exercise:
Turn on your computer, verify Linux is alive
Verify you have available:
– Editor to modify program
– Intel C++ compiler, text command icc, with -g
– Debugger ddd, with disassembly ability
Source program vscal.cpp
Linux commands: ls, vi, icc, mkdir, etc.
Covered: key causes that render execution slower than possible:
More registers are at your disposal than it seems
The von Neumann bottleneck can be softened via cache use and data prefetch
Stalls can be reduced by conditional move and by avoiding false dependences
Use (time-limited) capabilities, such as proper store forwarding
Note the new pause instruction
Abstract & Objectives
x86 Nomenclature & Notation
Intel® Architecture Progress
Pentium 4 Abstract
Abstract: high-level introduction to the history and evolution of increasingly powerful, backward-compatible 16-bit and 32-bit x86 processors.
Objectives: understand processor generations and architectural features, by learning
– Progressive architectural capabilities
– Names of corresponding Intel processors
– Explanation, description of capabilities
– Minor FP incompatibility
The objective is not an introduction to:
– x86 assembly language (assumed known)
– Itanium® processor family, now in its 3rd generation
– Intel tools (C++, VTune)
– Performance tools: MS Perfmon, Linux ps, emon, HP PFMon, etc.
– Performance benchmarks, performance counters
– Differentiation of Intel vs. competitor products
– CISC vs. RISC
Legend for the processor slides that follow:
– Processor name, initial launch date, final clock speed: e.g. Pentium® II, 2H98, 450 MHz
– Architecturally visible enhancement list (can be empty): e.g. MMX, BX chipset
– Architectural speedup technique, invisible except for higher speed: e.g. dynamic branch prediction enhanced
Processor family: description
– Northwood (Pentium): Pentium® Willamette shrink. Consumer and business desktop processor. HT not enabled, though capable.
– NW E-Step (Pentium): Pentium HT errata corrected. Desktop processor.
– Prescott (Pentium): Consumer and business desktop processor. Replaces NW. Offers 6 PNI: Prescott New Instructions. First processor with LaGrande technology (trusted computing).
– Prestonia (DP Xeon™): DP slated for workstations and entry-level servers. Based on NW core. HT enabled. 512 kB L2 cache. No L3. 3 GHz processor.
– Nocona (DP Xeon): DP based on Prescott core. Targeted for 3.06 GHz. 533 MHz (quad-pumped) bus, i.e. bus speed is 133 MHz. 1 MB L2 cache. HT enabled. About to be launched.
– Foster (MP Xeon): MP based on Willamette core. 1 MB L3 cache, 256 kB L2, HT enabled. For higher-end servers.
– Gallatin (MP Xeon): MP based on NW core. 1 or 2 MB L3 cache, 512 kB L2 cache. For high-end servers. See 8-way HP DL 760 and IBM x440. HT enabled.
– Potomac (MP Xeon): MP based on Prescott core. 533 MHz (quad-pumped) bus. 1 MB L2 cache, 8 MB L3 cache. HT enabled, yet to be launched.
Note: lower clock rates for MP versions, due to higher circuit complexity and bus load.
Feature               | Pentium® III   | Pentium® III     | Pentium® 4       | Northwood
MHz                   | 450-600 MHz    | 600 MHz-1.13 GHz | 1.5 GHz          | 2+ GHz
L2 Cache              | 512k off-die   | 256k on-die      | 256k on-die      | 512k on-die
Execution Type        | Dynamic        | Dynamic          | Intel® NetBurst™ | Intel® NetBurst™
System Bus            | 100 MHz        | 133 MHz          | 400 MHz (4x100)  | 400/533 MHz (4x100/133)
MMX™ Technology       | Yes            | Yes              | Yes              | Yes
Streaming SIMD Ext.   | Yes            | Yes              | Yes              | Yes
Streaming SIMD Ext. 2 | No             | No               | Yes              | Yes
Manufacturing Process | .25 micron     | .18 micron       | .18 micron       | .13 micron
Chipset               | ICH-1          | ICH-2            | ICH-2            | ICH-2
8087, co-processor of the 8086: off-chip FP computation, extended 80-bit FP format for DP
MMX: multi-media extensions
– MMX regs aliased with the FP register stack
– Needs a context switch
– FP regs also called ST(i) regs
SSE: Streaming SIMD Extensions, already since the Pentium III
WNI: 144 new instructions, using additional data types for existing opcodes and using previously reserved opcodes
XMM: 8 new 128-bit registers, in addition to MMX
SSE2: multiple integer ops and multiple DP FP ops; part of the 144 WNI
– Regs unchanged in Pentium® 4 from PIII
– Ops added
NetBurst: generic term for Hyperthreading, quad-pumped bus, new trace cache, etc.
Note: an architectural feature ages with the next generation but survives, due to the compatibility requirement. Hence it is interesting not only for historical reasons: you need to know it!
Xeon™ MP Processor "Gallatin":
– Physical addressing: 64 GB (PAE-36; 36-bit since Pentium Pro)
– System bus bandwidth: 3.2 GB/s (400 MHz)
– External cache: L3, 1 or 2 MB
– On-die cache: L2 512 KB; L1 12K µop trace cache, 8K data
– Hyperthreading Technology: 2x logical CPU
– Pipeline stages: 20
– Issue ports: 6 (incl. 2x ALU)
– Registers: 24 (126 internal)
– Execution units: 8 integer, 1 multimedia, 2 floating point
– Core frequency: 2.0+ GHz
– 3 instructions / clock cycle
Note: Physical Address Extension, 36-bit PAE addresses, since Pentium® Pro

Xeon™ Processor MP memory hierarchy:
– External memory: 64 GB, 3.2 GB/s bus
– L3: 2 MB, 8-way, 128 B lines, 21+ clks, 12.8 GB/s
– L2 (unified): 512 KB, 8-way, 128 B lines, 7+ clks
– TC: 12 KB, 64 B lines, 2 clks
– L1 (DL0): 8 KB, 64 B lines, 2 clks
Abstract & Objectives
Faster Clock
Caches: Advantage, Cost, Limitation
Multi-Level Cache-Coherence in MP
Register Renaming
Speculative, Out of Order Execution
Branch Prediction, Code Straightening
Abstract: outline generic techniques that overcome performance limitations
Objectives: understand the cost of architectural techniques (tricks) in terms of resources (silicon space) and of lost performance if guessed incorrectly
– Caches: cost silicon, can slow down
– Branch prediction: costs silicon, can be wrong
– Prefetch: costs an instruction, may be superfluous
– Superscalar: may not find a second op
Objective is not to explain detail of Intel processor architecture
Not to claim Intel invented techniques; academia invented many
Not to show all techniques; some apply mainly to EPIC or VLIW architectures
No hype, no judgment, just the facts please!
Making CISC fast:
– Decompose circuitry into multiple simple, sequential modules
The resulting modules are smaller and thus can be fast:
– High clock rate
– Shorter speed-paths
That's what we call a pipelined architecture
More modules -> simpler modules -> faster clock -> super-pipelined
Super-pipelining is NOT goodness per se:
– Saves no silicon
– Execution time per instruction does not improve
– May get worse, due to delay cycles
But:
– Instructions retired per unit time improves
– Especially in the absence of (large numbers of) control-flow stalls
The Xeon™ processor pipeline has 20 stages

[Diagram: overlapped instruction flows — I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store — with a new instruction entering each clock.]

Intel® NetBurst™ µarchitecture, 20-stage pipeline:
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

The beautiful model breaks upon control transfer
Abstract & Objectives
High Speed, Long Pipe
Multiprocessing
MMX Operations
SSE Operations
SSE2 Operations
Willamette New Instructions WNI
Cacheability Instructions
Pause Instruction
NetBurst, Hyperthreading
SW Tools
Abstract: emphasizing the Pentium® 4 processors, show the progressively more powerful architectural features introduced in Intel processors. Refer to the speed problems from module 2 and the general solutions explained in module 3.
Objective: not only understand the various processor product names and supported features (Intel marketing names), but understand how they work, and what their limitations and costs are.
The objective is not to show that Intel's techniques are the only ones, or the best possible. They are just a good trade-off in light of conflicting constraints:
– Clock speed vs. small # of pipes
– Small transistor count vs. high performance
– Large caches vs. small silicon space
– Grandiose architecture vs. backward compatibility
– Need for a large register file vs. register-starved x86
– Wish for two full on-die processors vs. preserving silicon space
Basic Pentium® Pro pipeline (intro at 733 MHz, .18µ):
1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Basic NetBurst™ microarchitecture pipeline (1.4 GHz at .18µ; 2.2 GHz at .13µ):
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

Hyper-pipelined technology enables industry-leading performance and clock rate
Match pipe functions to clocks/stages (1-20):
– Trace Cache/Next IP: read from the Branch Target Buffer; 2 clks
– Trace Cache Fetch: read decoded µops from the TC; 2 clks
– Drive: drive µops to the Allocator; 1 clk
– Allocate: allocate resources for execution; 1 clk
– Rename: rename logical regs to physical regs; 2 clks
– Queue: write µop into the µop queue to wait for scheduling; 1 clk
– Schedule: write to schedulers; compute dependencies; 3 clks
– Dispatch: send µops to the appropriate execution unit; 2 clks
– Register File: read the register file; 2 clks
– Execute: execute the µops on the correct port; 1 clk
– Flags: compute flags (zero, negative, etc.); 1 clk
– Branch Check: compare the actual branch to the predicted; 1 clk
– Drive: drive the branch result to the BTB at the front; 1 clk
Def: execution of 1 task by >= 2 processors
Flynn Model (1960s):
– Single-Instruction, Single-Data Stream (SISD) Architecture (PDP-11)
– Single-Instruction, Multiple-Data Stream (SIMD) Architecture (Array Processors, Solomon, Illiac IV, BSP, TMC)
– Multiple-Instruction, Single-Data Stream (MISD) Architecture (possibly: pipelined, VLIW, EPIC)
– Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture (possibly: EPIC when SW-pipelined, true multiprocessor)
Performance gain from doubling processors (per-processor gain):
2 processors: 0.90 | 4: 0.81 | 8: 0.73 | 16: 0.59 | 32: 0.43 | 64: 0.25 | 128: 0.11
Gain follows the law of diminishing returns
Scaling is more visible with a large cache.

SPECint_rate_base2000 (frequency scaling):
– (2P) 2.2 GHz, 400 MHz bus, 512 KB cache: 1.00
– (2P) 3.06 GHz, 533 MHz bus, 512 KB cache: 1.25
– (2P) 3.06 GHz, 533 MHz bus, 1 MB cache: 1.40

OLTP:
– (2P) 2.2 GHz, 400 MHz bus, 512 KB cache: 1.00
– (2P) 3.06 GHz, 533 MHz bus, 512 KB cache: 1.31
– (2P) 3.06 GHz, 533 MHz bus, 1 MB cache: 1.68
Source: Intel Corporation
Based on Intel internal projections. System configuration assumptions: 1) two Intel® Xeon™ processor 2.8GHz with 512KB L2 cache in an E7500 chipset-based server platform, 16GB memory, Hyperthreading enabled; 2) Four Intel® Xeon™ processor MP 1.6GHz with 1MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 3)
Four Intel® Xeon™ processor MP 2.0GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 4) Four Intel® Xeon™ processor MP
2.8GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other source of information to evaluate the performance of systems or components they are considering purchasing.
Which processor is better?
– (2P) Intel® Xeon™ processor @ 2.8 GHz, 533 MHz bus, no L3: 1.00
– (4P) Intel® Xeon™ processor MP 2.0 GHz, 400 MHz bus, 2 MB L3: 2.00
Source: TPC.org
The Xeon processor MP is targeted for OLTP
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other source of information to evaluate the performance of systems or components they are considering purchasing.
Add (wrap-around): paddw mm0, mm3 (packed add on words)
  mm0: F000h | a2 | a1 | a0
  mm3: 3000h | b2 | b1 | b0
  mm0: 2000h | a2+b2 | a1+b1 | a0+b0

Add (saturation): paddusw mm0, mm3 (packed add with unsigned saturation on words)
  mm0: F000h | a2 | a1 | a0
  mm3: 3000h | b2 | b1 | b0
  mm0: FFFFh | a2+b2 | a1+b1 | a0+b0
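The per-lane difference between the two adds can be sketched in scalar C (illustrative helper names; the real instructions do this for four lanes at once):

```c
#include <stdint.h>

/* Wrap-around add, as paddw does per 16-bit lane. */
uint16_t addw(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);          /* overflow wraps modulo 2^16 */
}

/* Unsigned saturating add, as paddusw does per 16-bit lane:
   the sum clamps at 0xFFFF instead of wrapping. */
uint16_t addusw(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;    /* widen so the carry is visible */
    return (sum > 0xFFFFu) ? 0xFFFFu : (uint16_t)sum;
}
```

With the slide's values, F000h + 3000h wraps to 2000h but saturates to FFFFh.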
Multiply-low: pmullw mm0, mm3 (multiply low, words)
  mm0: a3 | a2 | a1 | a0
  mm3: b3 | b2 | b1 | b0
  mm0: a3*b3 | a2*b2 | a1*b1 | a0*b0   (low 16 bits of each product)

Multiply-high: pmulhw mm1, mm4 (multiply high, words)
  mm1: a3 | a2 | a1 | a0
  mm4: b3 | b2 | b1 | b0
  mm1: c3 | c2 | c1 | c0   (c_i = high 16 bits of a_i*b_i)
Multiply-add: pmaddwd mm1, mm4 (packed multiply and add 4 words to 2 doublewords)
  mm1: a3 | a2 | a1 | a0
  mm4: b3 | b2 | b1 | b0
  mm1: a3*b3 + a2*b2 | a1*b1 + a0*b0
Note: this instruction does not have a saturation option.
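The pmaddwd operation can be sketched in scalar C (a hypothetical helper, not the actual instruction encoding): multiply four signed 16-bit pairs, then add adjacent 32-bit products.

```c
#include <stdint.h>

/* Scalar sketch of pmaddwd: four signed 16-bit multiplies,
   with adjacent 32-bit products summed into two doublewords.
   Index 0 is the least-significant lane. */
void pmaddwd_sketch(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];  /* low dword  */
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];  /* high dword */
}
```

This is the core step of a fixed-point dot product, which is why pmaddwd shows up in filter and DCT inner loops.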
Unpack, interleaved merge:
punpcklwd mm0, mm1 (unpack low words into doublewords)
  mm1: b3 | b2 | b1 | b0
  mm0: a3 | a2 | a1 | a0
  mm0: b1 | a1 | b0 | a0
punpckhwd mm0, mm1 (unpack high words into doublewords)
  mm1: b3 | b2 | b1 | b0
  mm0: a3 | a2 | a1 | a0
  mm0: b3 | a3 | b2 | a2
Zero-extend from small data elements to bigger data elements by using the unpack instruction with zeros in one of the operands.
Pack: packusdw mm0, mm1 (pack with unsigned saturation (signed) doublewords into words)
  mm1: A | B
  mm0: C | D
  mm0: A' | B' | C' | D'
psllq MM0, 8 (packed shift left logical quadword)
  MM0: 703F 0000 FFD9 4364h -> 3F00 00FF D943 6400h

psllw MM0, 8 (packed shift left logical words)
  MM0: 703Fh | DF00h | 81DBh | 007Fh -> 3F00h | 0000h | DB00h | 7F00h
pcmpgtw ; compare greater, words (generates a mask)
  51 > 73 ? -> 000...00
   3 > 2 ?  -> 111...11
   5 > 5 ?  -> 000...00
  23 > 6 ?  -> 111...11
IA-INT registers (32-bit): EAX ... EDI
– Fourteen 32-bit registers
– Direct register access
– Scalar data only

Streaming SIMD Extension registers (128-bit integer): XMM0 ... XMM7
– Eight 128-bit registers, referred to as XMM0-XMM7
– Single-precision / double-precision / 128-bit integer
– Direct access to registers
– Use simultaneously with FP / MMX™ Technology
– Data array only
MMX™ Technology / IA-FP registers (80 / 64 bit): FP0 or MM0 ... FP7 or MM7
– Eight 64-bit registers, XOR eight 80-bit FP regs
– Direct access to regs
– FP data / data array
– x87 remains aliased with the SIMD integer registers
– Context switch
Full precision: ADD, SUB, MUL, DIV, SQRT
– Floating point (packed/scalar)
– Full 23-bit precision

Approximate precision: RCP (reciprocal), RSQRT (reciprocal square root)
– Perspective correction / projection
– Vector normalization
– Very fast
– Return at least 11 bits of precision
MULPS: multiply packed single-FP — mulps xmm1, xmm2
  xmm1: X4 | X3 | X2 | X1
  xmm2: Y4 | Y3 | Y2 | Y1
  xmm1: X4*Y4 | X3*Y3 | X2*Y2 | X1*Y1
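What mulps computes can be sketched lane-by-lane in scalar C (a hypothetical helper; the real instruction does all four lanes in one operation, destination overwritten in place just as here):

```c
/* Scalar sketch of mulps: element-wise multiply of four packed
   single-precision floats; xmm1 plays the destination role. */
void mulps_sketch(float xmm1[4], const float xmm2[4])
{
    for (int i = 0; i < 4; i++)
        xmm1[i] *= xmm2[i];
}
```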
CMPPS: compare packed single-FP — cmpps xmm0, xmm1, 1 (predicate 1 = less than)
  xmm0: 1.1 | 7.3 | 2.3 | 5.6
  xmm1: 8.6 | 2.3 | 3.5 | 1.2
  xmm0: 111...11 | 000...00 | 111...11 | 000...00
IA-INT registers (32-bit): EAX ... EDI
– Fourteen 32-bit registers
– Direct register access
– Scalar data only

Streaming SIMD Extension registers (scalar / packed SIMD-SP, SIMD-DP, 128-bit integer): XMM0 ... XMM7
– Eight 128-bit registers, referred to as XMM0-XMM7
– Single-precision array / double-precision array / 128-bit integer
– Direct access to registers
– Use simultaneously with FP / MMX™ Technology
– Data array only

MMX™ Technology / IA-FP registers (80 / 64 bit): FP0 or MM0 ... FP7 or MM7
– Eight 64-bit registers, XOR eight 80-bit FP regs
– Direct access to regs
– FP data / data array
– x87 remains aliased with the SIMD integer registers
– Context switch
New 64-bit double-precision floating point instructions
New / enhanced 128-bit wide SIMD integer instructions
– Superset of the MMX™ technology instruction set
No forced context switching on SSE registers (unlike MMX™/x87 registers)

Instruction types:
– Standard x87 (SP, DP, EP)
– 64-bit SIMD int. (4x16, 8x8)
– 128-bit SIMD int. (8x16, 16x8)
– Single-precision SIMD FP (4x32)
– Double-precision SIMD FP (2x64)
– Cache management (memory streaming / prefetch)

Backward compatible with all existing MMX™ & SSE code
New Instructions
Extended SIMD Integer Instructions
New SIMD Double-precision FP Instructions
New Cacheability Instructions
Fully Integrated into Intel Architecture
– Use previously reserved opcodes
– Same addressing modes as MMX™ / SSE ops
– Several MMX™ / SSE mnemonics are repeated
– New Extended SIMD functionality is obtained by specifying 128-bit registers (xmm0-xmm7) as src/dst.
Same instruction categories as the SIMD single-precision FP instructions
Operate on both elements of packed data in parallel -> SIMD
Some instructions have scalar or packed versions
[Format: packed X2 | X1, or scalar in the low element; each double is S (bit 63), exponent (bits 62-52), significand (bits 51-0)]
IEEE 754 compliant FP arithmetic
– Not bit-exact with x87: 80-bit internal vs. 64-bit memory
Usable in all modes: real, virtual x86, SMM, and protected (16-bit & 32-bit)
Arithmetic FP instructions can be:
– Packed or Scalar
– Single-precision or Double-precision

ASM    | Intrinsic     | Meaning
addps  | _mm_add_ps()  | Add Packed Single
addpd  | _mm_add_pd()  | Add Packed Double
addss  | _mm_add_ss()  | Add Scalar Single
addsd  | _mm_add_sd()  | Add Scalar Double
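The packed/scalar distinction can be sketched in scalar C (hypothetical helper names; the real instructions operate on XMM registers in one step): packed forms touch every lane, scalar forms touch only the low lane and pass the rest through from the destination.

```c
/* Sketch of addps: all four single-precision lanes are added. */
void addps_sketch(float dst[4], const float src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] += src[i];
}

/* Sketch of addss: only lane 0 is added;
   lanes 1..3 of dst are left unchanged. */
void addss_sketch(float dst[4], const float src[4])
{
    dst[0] += src[0];
}
```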
Packed & scalar FP instructions operate on packed single- or double-precision floating point elements
– Packed instructions operate on 4 (SP) or 2 (DP) floats:
    addps: X4|X3|X2|X1 op Y4|Y3|Y2|Y1 -> X4opY4|X3opY3|X2opY2|X1opY1
    addpd: X2|X1 op Y2|Y1 -> X2opY2|X1opY1
– Scalar instructions operate only on the least-significant element:
    addss: X4|X3|X2|X1 op Y4|Y3|Y2|Y1 -> X4|X3|X2|X1opY1
    addsd: X2|X1 op Y2|Y1 -> X2|X1opY1
All MMX™/SSE integer instructions operate on 128-bit wide data in XMM registers
Additionally, some new functionality:
– MOVDQA, MOVDQU: 128-bit aligned/unaligned moves
– PADDQ, PSUBQ: 64-bit add/subtract for mm & xmm regs
– PMULUDQ: packed 32 x 32 bit multiply
– PSLLDQ, PSRLDQ: 128-bit byte-wise shifts
– PSHUFD: shuffle four doublewords in an xmm register
– PSHUFLW/PSHUFHW: shuffle four words in the lower/upper half of an xmm reg
– PUNPCKL/HQDQ: interleave lower/upper quadwords
– Full 128-bit conversions: 4 ints <-> 4 SP floats
New 128-bit data types for fixed-point integer data:
– 16 packed bytes (8-bit elements, bits 127...0)
– 8 packed words (16-bit elements)
– 4 packed doublewords (32-bit elements)
– 2 quadwords (64-bit elements)
Computation:
– ADD, SUB, MUL, DIV, SQRT, MAX, MIN
– Full 52-bit precision mantissa (packed & scalar)

Logic:
– AND, ANDN, OR, XOR
– Operate uniformly on the entire 128-bit register
– Must use DP instructions for double-precision data

Data formatting:
– MOVAPD, MOVUPD: 128-bit DP moves (aligned/unaligned)
– MOVH/LPD, MOVSD: 64-bit DP moves
– SHUFPD: shuffle packed doubles; select data using a 2-bit immediate operand
The new packed & scalar FP instructions operate on packed double-precision floating point elements
– Packed instructions operate on 2 numbers:
    addpd: X2|X1 op Y2|Y1 -> X2opY2|X1opY1
– Scalar instructions operate on the least-significant number:
    addsd: X2|X1 op Y2|Y1 -> X2|X1opY1
SHUFPD: shuffle packed double-FP
  XMM1: x2 | x1    XMM2: y2 | y1

SHUFPD XMM1, XMM2, 3   // binary 11
  XMM1: y2 | x2

SHUFPD XMM1, XMM2, 2   // binary 10
  XMM1: y2 | x1
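The immediate-bit selection rule can be sketched in C (a hypothetical helper mirroring the instruction's semantics): bit 0 of the immediate picks the low result element from the destination, bit 1 picks the high result element from the source.

```c
/* Scalar sketch of SHUFPD dst, src, imm:
     result low  = imm bit 0 ? dst[high] : dst[low]
     result high = imm bit 1 ? src[high] : src[low]
   Index 0 is the low element. */
void shufpd_sketch(double dst[2], const double src[2], int imm)
{
    double lo = (imm & 1) ? dst[1] : dst[0];
    double hi = (imm & 2) ? src[1] : src[0];
    dst[0] = lo;
    dst[1] = hi;
}
```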
Branching:
– CMPPD, CMPSD: compare & mask (packed/scalar)
– COMISD: scalar compare and set status flags
– MOVMSKPD: store a 2-bit mask of the DP sign bits in a reg32

Type conversion:
– CVT...: convert DP to SP & 32-bit integer with rounding (packed/scalar)
– CVTT...: convert DP to 32-bit integer with truncation (packed/scalar)
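The CVT vs. CVTT distinction can be sketched in C (hypothetical helpers): CVT forms convert with rounding, CVTT forms truncate toward zero, which is exactly what a C cast does. The half-away rounding below is a simplification; the hardware default is round-half-to-even.

```c
/* Rounding conversion, in the spirit of CVTPD2DQ.
   (Simplified: rounds halves away from zero, not to even.) */
int cvt_round(double x)
{
    return (int)(x >= 0 ? x + 0.5 : x - 0.5);
}

/* Truncating conversion, as CVTTPD2DQ: toward zero, like a C cast. */
int cvt_trunc(double x)
{
    return (int)x;
}
```

CVTT exists precisely because C's cast semantics are truncation; without it, compilers had to switch the x87 rounding mode around every float-to-int cast.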
CMPPD: Compare Packed Double-FP

CMPPD XMM0, XMM1, 1   // 1 = less than

With XMM0 = [12.3 | 1.1] and XMM1 = [3.5 | 8.6], each element comparison produces an all-ones or all-zeros mask:
– 1.1 < 8.6 is true → 1111111…111
– 12.3 < 3.5 is false → 0000000…000
XMM0 = [0000000…000 | 1111111…111]
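The compare-and-mask pattern pairs naturally with MOVMSKPD, which compresses the two element masks into an integer. A minimal sketch (assuming SSE2; the function name is illustrative):

```c
#include <emmintrin.h>

/* CMPPD with predicate 1 ("less than", _mm_cmplt_pd) yields an
   all-ones/all-zeros mask per element; MOVMSKPD (_mm_movemask_pd)
   stores the two sign bits as a 2-bit integer mask. */
int cmplt_mask(const double a[2], const double b[2])
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    __m128d mask = _mm_cmplt_pd(va, vb);  /* CMPPD ..., 1        */
    return _mm_movemask_pd(mask);         /* bit i set iff a[i] < b[i] */
}
```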
On-die trace cache for decoded uops (TC)
– Holds 12K uops
8K on-die, 1st-level data cache (L1)
– 64-byte line size (Pentium Pro was 32 bytes)
– Ultra-fast, multiple accesses per instruction
256K on-die, 2nd-level write-back, unified data and instruction cache (L2)
– 128-byte line size
– Operates at full processor clock frequency
PREFETCH instructions return 128 bytes to L2
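A typical use of the PREFETCH instructions is to request data a few cache lines ahead of a streaming loop. A hedged sketch (the prefetch distance of 16 elements is a tuning assumption, not a fixed rule; _mm_prefetch is the SSE intrinsic form):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Sum an array while software-prefetching ahead of the current
   position, so the loads at a[i] tend to hit in cache. */
double sum_with_prefetch(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)  /* stay in bounds while prefetching ahead */
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
```

The prefetch is purely a hint: correctness is unchanged whether or not the line arrives early.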
MMX™/SSE cacheability instructions preserved
New Functionality:
– CLFLUSH: Cache line flush
– LFENCE / MFENCE: Load Fence / Memory Fence
– PAUSE: Pause execution
– MASKMOVDQU: Masked move of 128-bit integer data
– MOVNTPD: Streaming store of two 64-bit DP FP elements
– MOVNTDQ: Streaming store of 128-bit integer data
– MOVNTI: Streaming store of 32-bit integer data
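A streaming-store copy using MOVNTPD might look like the following sketch (assumptions: SSE2 support, a 16-byte-aligned destination, and an even element count; the SFENCE makes the non-temporal stores globally visible before later stores):

```c
#include <emmintrin.h>

/* MOVNTPD (_mm_stream_pd) writes two doubles around the caches,
   avoiding cache pollution for data that will not be re-read soon.
   dst must be 16-byte aligned; n is assumed even. */
void stream_copy(double *dst, const double *src, int n)
{
    for (int i = 0; i < n; i += 2)
        _mm_stream_pd(&dst[i], _mm_loadu_pd(&src[i]));
    _mm_sfence();  /* order streamed data before subsequent stores */
}
```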
Willamette implementation supports:
– Writing to uncacheable buffer (e.g. AGP) with full line-writes
– Re-reading same buffer with full line-reads
– New in WNI, compared to Katmai/CuMine
Integer streaming store (MOVNTI)
– Operates on integer registers (e.g., EAX, EBX)
– Useful for the OS: moves raw bits directly, avoiding the need to save FP state
CLFLUSH: Cache line containing m8 flushed and invalidated from all caches in the coherency domain
Linear address based; allowed by user code
Potential usage:
– Allows incoherent (AGP) I/O data to be mapped as
WB for high read performance and flushed when updated
– Example: video encode stream
– Precise control of dirty-data eviction may increase performance by scheduling evictions during idle memory cycles
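The video-encode usage above amounts to flushing a freshly written buffer line by line. A hedged sketch (assumes 64-byte cache lines as on this L1, and SSE2 support for _mm_clflush; the function name is illustrative):

```c
#include <emmintrin.h>

/* CLFLUSH (_mm_clflush) evicts and invalidates the cache line
   containing the given linear address from all caches in the
   coherency domain; usable from user code. */
void flush_buffer(const void *buf, unsigned long bytes)
{
    const char *p = (const char *)buf;
    for (unsigned long off = 0; off < bytes; off += 64)  /* 64B lines */
        _mm_clflush(p + off);
    _mm_sfence();  /* order the flushes against later stores */
}
```

Flushing does not change the data, only where it lives; memory still holds the written values afterward.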
Capabilities introduced over time to enable software managed coherence:
– Write combining with the Pentium Pro processor
– SFence and memory streaming with Streaming SIMD Extensions
The new Willamette fences complete the tool set for full software coherence management:
– LFence: strong load order
  – Blocks younger loads from passing a prior load instruction
  – All loads preceding an LFence will complete before loads following it
– MFence
  – Achieves the effect of LFence and SFence instructions executed at the same time
  – Necessary because issuing an SFence followed by an LFence does not prevent a load from passing a prior store
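The classic pattern these fences support is handing a payload from a producer to a consumer through a flag. A minimal single-threaded sketch of the fence placement (the variable names are illustrative; in real multi-threaded code the two functions would run on different logical processors):

```c
#include <emmintrin.h>

static volatile int payload, ready;

/* SFENCE guarantees the payload store completes before the flag
   store becomes visible to other processors. */
void producer(int value)
{
    payload = value;
    _mm_sfence();
    ready = 1;
}

/* LFENCE guarantees the payload load cannot be hoisted above the
   flag load that observed ready == 1. */
int consumer(void)
{
    while (!ready)
        ;           /* spin until the flag is visible */
    _mm_lfence();
    return payload;
}
```

Where both load and store ordering are needed at one point, a single MFENCE (_mm_mfence) replaces the SFENCE/LFENCE pair.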
PAUSE is architecturally a NOP on all IA-32 processor generations
– Usable since Willamette without checking the processor type
PAUSE is a hint to the processor that the code is a spin-wait or otherwise non-performance-critical. A processor that uses the hint can:
– Significantly improve the performance of spin-wait loops, with no negative performance impact, by inserting an implementation-dependent delay that helps processors with dynamic (a.k.a. out-of-order) execution exit the spin loop faster
– Significantly reduce power consumption during spin-wait loops
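The recommended shape of such a loop, sketched with the _mm_pause intrinsic (the function name is illustrative; production code would also bound the spin and fall back to an O/S wait, as discussed later):

```c
#include <emmintrin.h>

/* Spin-wait with PAUSE in the loop body: on a Hyper-Threading
   processor the hint yields execution resources to the sibling
   logical processor; on older CPUs PAUSE executes as a NOP. */
void spin_until_set(volatile int *flag)
{
    while (!*flag)
        _mm_pause();
}
```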
[Block diagram: Pentium 4 microarchitecture. The Bus Unit connects the System Bus to the 2nd-level cache (8-way) and the 1st-level data cache (4-way). The Front End (Fetch/Decode, Trace Cache, Microcode ROM) feeds the Out-of-Order Core, Execution, and Retirement, supported by BTBs/Branch Prediction. Frequently and less frequently used paths are distinguished.]
[Block diagram: execution resources: BTB, uCode ROM, store AGU, load AGU, four ALUs, FP move, FP store, FMul, FAdd, and MMX/SSE units.]
Quad-pumped bus keeps the caches loaded
Stores the most recent instructions as µops in the TC to enhance instruction issue
Improves program execution:
– Issues up to 3 µops per clock
– Dispatches up to 6 µops to execution units per clock
– Retires up to 3 µops per clock
Feeds back branch and data information to keep required instructions and data available
Ability of processor to run multiple threads
– Duplicate architecture state creates illusion to SW of Dual Processor (DP)
– Execution unit shared between two threads, but dedicated if one stalls
Effect of Hyperthreading on Xeon Processor:
– CPU utilization increases to 50% (from ~35%)
– About 30% performance gain for some applications with the same processor frequency
Hyperthreading Technology Results:
1. More performance with enabled applications
2. Better responsiveness with existing applications
Almost two Logical Processors
Architecture state (registers) and APIC duplicated
Share execution units, caches, branch prediction, control logic and buses
[Diagram: one physical processor with duplicated Architecture State and Advanced Programmable Interrupt Controllers (one per logical processor), shared on-die caches and processor execution resources, attached to the System Bus.]
* APIC: Advanced Programmable Interrupt Controller. Handles interrupts sent to a specified logical processor.
Hyperthreading Technology Performance for Dual Processor Servers

[Chart: Hyper-Threading Technology performance gains on an Intel® Xeon™ processor 2.8GHz with 512KB cache, Microsoft Windows* 2000. Relative performance: HT off = 1.00; WebBench / web server performance = 1.21; Trade2 / Java apps server performance = 1.19.]

Enhancements in bandwidth, throughput, and thread-level parallelism with Hyperthreading Technology deliver an acceleration of performance.

Source: Veritest (Sep 2002). Comparisons based on Intel internal measurements with preproduction hardware. HTT on and off configurations with the Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI 438 MegaRAID* controller v1.48 16MB EDO RAM, Dell PowerVault 210S disk array.

Hyperthreading Technology increases performance by ~20% on some server applications.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
Intel® Xeon™ processor 2.8GHz with 512KB cache: Hyperthreading Technology performance gains

Performance gains whether running multiple tasks within one application or multiple applications at once.

[Chart: relative performance with HT off = 1.00. Multi-threaded applications (CHARMm*, 3DSM*5, D2cluster*, BLAST*, Lightwave 3D*7.5) and multi-tasking workloads (Patran* + Nastran*, multiple compiles, 3ds max* + Photoshop*, compile + regression, Maya* multiple renderings) show gains between 1.15 and 1.37.]

Source: Intel Corporation. With and without Hyperthreading Technology on the following system configuration: Intel Xeon processor 2.80 GHz/533 MHz system bus with 512KB L2 cache, Intel® E7505 chipset-based pre-release platform, 1GB PC2100 DDR CL2 CAS2-2-2, (2) 18GB Seagate* Cheetah ST318452LW 15K Ultra160 SCSI hard drives using an Adaptec 39160 SCSI adapter (BIOS 3.10.0), nVidia* Quadro4 Pro 980XGL 128MB AGP 8x graphics card with driver version 40.52, Windows XP* Professional build 2600.

Hyperthreading Technology increases performance by 15-37% on workstation applications.
Type        Description                                                     Example
Shared      Each logical processor can use, evict, or allocate any part    Cache, WC buffers,
            of the resource                                                 VTune reg., MS-ROM
Duplicated  Each logical processor has its own set of resources             APIC, registers, TSC, IP
Split       Resources are hard-partitioned in half                          Load/store buffers, ITLB, ROB, IAQ
Tagged      Resource entries are tagged with the logical processor ID       Trace Cache, DTLB
Buffering Queues separate major pipeline logic blocks
Buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block
[Diagram: buffering queues separate Fetch, Decode, TC/MSROM, Rename/Allocate, OOO Execute, and Retirement. The queues in the front end are duplicated; the queues in the out-of-order back end are partitioned.]
Front End
– Execution Trace Cache
– Microcode Store ROM (MSROM)
– ITLB and Branch Prediction
– IA-32 Instruction Decode
– Micro-op Queue

[Diagram: the Bus Unit connects the System Bus to an optional 3rd-level cache (server product), the 2nd-level cache, and the 1st-level cache (4-way). Fetch/Decode, Trace Cache, MSROM, and BTBs/Branch Prediction feed OOO Execution and Retirement.]
Responsible for delivering instructions to the later pipe stages
Trace cache hit
– The requested instruction trace is present in the trace cache
Trace cache miss
– The requested instruction is brought into the trace cache from the L2 cache
[Diagram: Front End: two instruction pointers (one per logical processor) feed the Trace Cache, which feeds the Micro-Op Queue.]

Two separate instruction pointers
Two logical processors arbitrate for access to the TC each cycle
If one logical processor stalls, the other uses the full bandwidth of the TC
Two major types of parallel programming models
– Domain decomposition
– Functional decomposition
Domain Decomposition
– Multiple threads working on subsets of the data
Functional Decomposition
– Different computation on the same data
– E.g., motion estimation vs. color conversion
Both models can be implemented on HT processors
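The domain-decomposition model boils down to giving each thread a contiguous chunk of the data. A small C helper sketching that partitioning (the function is illustrative; a real implementation would hand each [begin, end) range to a worker thread):

```c
/* Domain decomposition: compute thread t's half-open range
   [begin, end) when n items are split across nthreads, spreading
   any remainder so chunk sizes differ by at most one element. */
void chunk_bounds(int n, int nthreads, int t, int *begin, int *end)
{
    int base  = n / nthreads;   /* minimum chunk size       */
    int extra = n % nthreads;   /* first `extra` chunks get +1 */
    *begin = t * base + (t < extra ? t : extra);
    *end   = *begin + base + (t < extra ? 1 : 0);
}
```

Functional decomposition, by contrast, would assign whole stages (e.g. motion estimation vs. color conversion) to different threads rather than splitting the data.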
O/S thread implementations may differ
Microsoft Win32
– NT threads (supports 1-1 O/S level threading)
– Fibers (supports M-N user level threading)
Linux
– Native Linux threads (severely broken & inefficient)
– IBM Next Generation POSIX Threads (NGPT): IBM's attempt to fix Linux native threads
– Red Hat Native POSIX Thread Library for Linux (NPTL): supports 1-1 O/S-level threading intended to be POSIX compliant
Others
– Pthreads (generic Posix compliant thread)
– Sun Solaris Light Weight Processes (lwp), Sun Solaris user level threads
Thread Model Issues Somewhat Orthogonal to HT
All UP OS and Legacy MP OS
– Backward compatible, but will not take advantage of Hyperthreading Technology
Enabled MP OS
– OS with basic Hyperthreading Technology functionality
Optimized MP OS
– OS with optimized Hyperthreading Technology support

Fully compatible with ALL existing O/S… but only an optimized O/S enables the most benefits
Windows XP
– Windows XP
– Windows XP Professional
Windows 2003
– Enterprise
– Data Center
Enabled
– RedHat Enterprise Server (versions 7.3, 8.0)
– RedHat Advanced Server 2.1
– SuSE (8.0, 9.0)
An HT-enabled O/S sees two processors for each HT physical processor
– Enumerates the first logical processor of every physical processor first
Schedules processors almost the same as a regular SMP
– Thread priority determines the schedule, but CPU dispatch matters
– The O/S independently submits a code stream for a thread to each logical processor and can independently interrupt or halt each logical processor (no change)
[Diagram: two physical processors, each exposing two logical processors. CPUID reports initial APIC IDs 00000000 and 00000001 for logical processors 0 and 1 of physical processor 0, and 00000010 and 00000011 for those of physical processor 1.]
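The APIC-ID layout above can be sketched as a toy mapping in C. This is a simplification of my own for illustration: real code derives the number of logical processors per package from CPUID rather than assuming two:

```c
/* With exactly two logical processors per package, the initial APIC
   ID encodes the physical package in its upper bits and the logical
   processor number in bit 0. */
int physical_of(int apic_id) { return apic_id >> 1; }
int logical_of(int apic_id)  { return apic_id & 1;  }
```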
Avoid coding practices that disable hyperthreaded processors, e.g.:
– Avoid 64KB aliasing
– Avoid processor-serializing events (e.g., FP denormals, self-modifying code, etc.)
Avoid spin locks
– Minimize lock contention to fewer than two threads per lock
– Use PAUSE and O/S synchronization when spin-wait loops must be implemented
In addition, follow multi-threading best practices:
– Use O/S services to block waiting threads
– Spin as briefly as possible before yielding to the O/S
– Avoid false sharing
– Avoid unintended synchronizations (C runtime, C++ template library implementations)
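The false-sharing item deserves a concrete shape: padding per-thread data to a full cache line keeps two logical processors from contending for one line. A sketch assuming the Pentium 4's 64-byte L1 line (the struct and array names are illustrative):

```c
/* Avoiding false sharing: pad each thread's counter out to a full
   64-byte cache line so no two threads' hot data share a line. */
#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  /* fill the rest of the line */
};

/* One slot per thread; each thread touches only its own line. */
struct padded_counter counters[4];
```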
Intel ThreadChecker Tool
– Itemization of parallelization bugs and source
– ThreadChecker class
OpenMP
– Thread model in which programmer introduces parallelism or threading via directives or pragmas
Intel VTune™ Analyzer
– Provides analysis and drills down to source code
– ThreadChecker Integration
GuideView
– Parallel performance tuning
Intel C/C++ Compiler
– Support for SSE and SSE2 using C++ classes, intrinsics, and assembly
– Improved vectorization and prefetch insertion
– Profile-guided optimizations
– G7 compiler switch for Pentium® 4 optimizations
Register Viewing Tool (RVT)
– Shows contents of XMM registers as they are updated
– Plugs into Microsoft* Visual Studio*
Microsoft* Visual Studio* 6.0 Processor Pack*
– Support for SSE and SSE2 instructions, including intrinsics
– Available for free download from Microsoft*
Microsoft* Visual Studio* .NET
– Provides improved support for the Intel® NetBurst™ micro-architecture
– Recognizes XMM registers
Hyperthreading is not a full dual-core processor
Hyper-threading does not deliver multiprocessor scaling

[Diagram comparison:
– Dual Processor: two separate processors, each with its own cache, architecture state, APIC, and processor core.
– Hyper-Threading: one processor core and one on-die cache, with duplicated architecture state and APICs.
– Dual Core: two processor cores sharing an on-die cache, each with its own architecture state and APIC.]
Branch: transfer of control to an address different from the next instruction. Unconditional or conditional.
Branch Prediction: the ability to guess the target of a conditional branch. Can be wrong, in which case we have a mispredict.
CISC: complex instruction set computer.
Compiler: a tool translating high-level instructions into low-level machine instructions. The result can be asm source (ASCII) or binary machine code.
EPIC (Explicitly Parallel Instruction Computing): a new architecture jointly defined by Intel® and HP. It is the foundation of the new 64-bit Instruction Set Architecture.
Explicit parallelism: Intended ability of two tasks to be executed by design (explicitly) at the same time.
Task can be as simple as an instruction, or as complex as a complete program.
Implicit parallelism: Incidental ability of two or more tasks to be executed at the same time. Example: sequence of integer add and FP convert instructions without common registers or memory addresses, executed on a target machine that happens to have respective HW modules available.
Instruction Set Architecture (ISA): the architecturally visible instructions that perform software functions and direct operations within the processor. HP and Intel® jointly developed a new 64-bit ISA. This ISA integrates technical concepts from the EPIC technology.
Memory latency: the time to move data from memory to the processor, at the request of the processor.
Mispredict: a wrong guess as to where the new flow of control will continue as a result of a branch (or similar control-flow instruction).