Intel® x86 Architecture


Architecture of:

Intel® Pentium® 4,

Intel® Xeon™,

Intel® Xeon™ MP Architecture

Rev. 1.0 HM 10/7/2010

Derived from Herbert G. Mayer’s

2003 Presentation for:

Intel Software College


* Other brands and names may be claimed as the property of others.

Agenda

 Assumptions

 Speed Limitations

 x86 Architecture Progression

 Architecture Enhancements

 Intel ® x86 Architectures


Assumptions

 Audience: Understands generic x86 architecture

 Knows some assembly language

– Flavor used here: gas (AT&T syntax), as used in ddd disassembly

– Result is on the right-hand side:

– mov temp, %eax is a load into register %eax

– add %eax, %ebx leaves the new integer sum in register %ebx

– Different from Microsoft* masm and tasm (result on the left)

 Understand some architectural concepts:

– Caches, Multi-level caches, (some MESI)

– Threading, multi-threaded code

– Blocking (cache), blocking (aka tiling), blocking (thread synch.)

 Causes of pipeline stalls

– Control flow change

– Data dependence, registers and data

 NOT discussed: asm, VTune, CISC vs. RISC


Speed Limitations


Agenda

 Performance Limiters

 Register Starvation

 Processor-Memory Gap

 Processor Stalls

 Store Forwarding

 Misc Limitations:

– Spin-Lock in Multi Thread

– Misaligned Data

– Denorm Floats


Performance Limiters

 Architectural limitations the programmer or compiler can overcome:

– Indirect limitations: stall via branch, call, return

– Incidental limits: resource constraint

– Historical limits: register starved x86

– Technological: ALU speed vs. memory access speed

– Logical limits: data- and resource dependence


Register Starvation

 How many regs needed (compiler or programmer)?

– Infinite is perfect

– 1024 is very good

– 64 acceptable

– 16 is crummy

– 4+4 is x86

– 1 is saa (single-accumulator architecture)

 Formally on x86: 16 regs. Quick test:

– ax, bx, cx, dx

– si, di

– bp, sp, ip

– cs, ds, ss, es, fs, gs, flags

 Of which ax, bx, cx, dx are GPRs, almost

 Rest can be used as better temps

 ax & dx used for * and /, cx for loop


Register Starvation

 Absence of regs causes

– Spurious memory spills and loads

– False data dependences (not "dependencies")

 Except for single-accumulator architectures, no other arch is more register starved than x86

Instruction stream (added ops, memory latency):

mov %eax, [mem1]
use stuff, %eax
mov [mem1], %eax

Instruction stream (false DD):

mov %eax, [tmp]
add %ebx, %eax
imul %ecx
mov %eax, [prod]
mov [tmp], %eax
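The false-dependence effect can be sketched in C (function names are illustrative, not from the deck): reusing one temporary serializes two independent products, while distinct temporaries, like hardware register renaming, let them execute in parallel.

```c
#include <stdint.h>

/* Serialized form: one temporary is reused, creating a false
   dependence between two independent products. */
int32_t dot2_serialized(const int32_t *a, const int32_t *b) {
    int32_t tmp;
    int32_t sum = 0;
    tmp = a[0] * b[0];
    sum += tmp;
    tmp = a[1] * b[1];   /* must wait for the previous use of tmp */
    sum += tmp;
    return sum;
}

/* Renamed form: independent temporaries, as register renaming
   would provide; both products can execute in parallel. */
int32_t dot2_renamed(const int32_t *a, const int32_t *b) {
    int32_t t0 = a[0] * b[0];
    int32_t t1 = a[1] * b[1];
    return t0 + t1;
}
```

Both forms compute the same result; only the dependence chain differs.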


And the Programmer?

 No solution in ISA, x86 had 4 GPRs since 8086

 Improved via internal register renaming

– Pentium ® Pro has hundreds of internal regs

 Added registers in mmx

– Visible to you, programmer and compiler

– fp(0) .. fp(7), 80-bits as FP, 64 bits as mmx, but note: context switch

 Added registers in SSE

– xmm(0) .. xmm(7) 128 bits


Processor-Memory Gap

[Chart, log scale over time: CPU performance (µProc, "Moore's Law") grows 60%/yr; DRAM performance grows 7%/yr; the Processor-Memory Performance Gap grows 50%/yr.]

Source: David Patterson, UC Berkeley

Bridging the Gap: Trend

[Chart: CPU vs. DRAM performance over time. Successive bridging techniques: caches, then multilevel caches; Intel® Pentium® II processor: out-of-order execution (~30%, instruction level); Intel® Xeon™ processor: Hyperthreading Technology (~30%, thread level).]

Hyperthreading Technology: feeds two threads to exploit shared execution units


Impact of Memory Latency

 Memory speed has NOT kept up with advance in processor speed

– Avg. integer add ~ 0.16 ns (Xeon), but memory accesses take ~10 ns or more

 CPU hardware resource utilization is only 35% on average

– Limited due to memory stalls and dependencies

 Possible solutions to memory speed mismatch?

Memory speed mismatch is a major source of CPU stalls


And the Programmer?

 Cache provided

 Methods to manipulate cache

 Tools provided to pre-fetch data

– At risk of superfluous fetch, if control-flow change
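A pre-fetch sketch in C, assuming GCC/Clang: `__builtin_prefetch` is that compiler family's hint (it typically compiles to an x86 prefetch instruction). The look-ahead distance of 16 elements is an illustrative guess, not a tuned value.

```c
#include <stddef.h>

#ifndef __GNUC__
#define __builtin_prefetch(...) ((void)0)  /* no-op on other compilers */
#endif

/* Sum an array while hinting the next cache lines in.
   If control flow changed mid-loop, the prefetched data
   could indeed be superfluous, as the slide warns. */
long sum_with_prefetch(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}
```

The prefetch is only a hint; correctness never depends on it.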


Processor Stalls

 Stalled cycle is a cycle in which processor cannot receive or schedule new instructions

– Total Cycles = Total Stall Cycles + Productive Cycles

– Stalls waste processor cycles

– Perfmon, Linux ps, top, and other system tools show stalled cycles as busy CPU cycles

– Intel® VTune Analyzer used to monitor stalls (HP* PFmon)


[Chart: share of stalled vs. unstalled cycles]

Why Stalls Occur!

 Stalls occur because:

– Instruction needs resource not available

– Dependences [sic] (control- or data-) between instructions

– Processor / instruction waits for some signal or event

 Sample resource limitations:

– Registers

– Execution ports

– Execution units

– Load / store ports

– Internal buffers (ROBs, WOBs, etc.)

 Sample events:

– Exceptions, cache misses, TLB misses, etc.

– What they have in common: they hold up compute progress


Control Dependences (CD)

 Change in flow of control causes stalls

 Processors handle control dependences:

– Via branch prediction hardware

– Conditional move to avoid branch & pipeline stall

Instruction stream (barrier: predict at jg):

mov [%ebp+8], %eax
cmp 1, %eax
jg bigger
mov 1, %eax
. . .
bigger:

Instruction stream (barrier: predict at call/return):

dec %ecx
push %eax
call rfact
mov %ecx, [%ebp+8]
mul %ecx


Data Dependences (DD)

 Data dependence limits performance

 Programmer / compiler cannot fully solve this

– Xeon™ has register renaming to avoid false data dependences

– Supports out-of-order execution to hide effects of dependences

Instruction stream (memory latency):

mov eax, [ebp+8]
cmp eax, 1
. . .
bigger:

Instruction stream (false DD):

mov [temp], eax
add eax, ebx
mul ecx
mov [prod], eax
mov eax, [temp]


Xeon Processor Stalls

 D-side

– DTLB Misses

– Memory hierarchy: L1, L2 and L3 misses

 Core

– Store Buffer Stalls

– Load/Store splits

– Store forwarding hazard

– Loading partial/misaligned data

– Branch Mispredicts

 I-side

– Streaming Buffer Misses

– ITLB Misses

– TC misses

– 64K Aliasing conflicts

 Misc

– Machine Clears


And the Programmer?

 Reduce processor stalls by prefetching data

 Reduce control-flow changes by using conditional moves

 Reduce false dependences by using register temps from the mmx (fp) and xmm pools
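The conditional-move idea can be shown in C (an illustrative sketch): a simple ternary like this is a prime candidate for the compiler to emit cmov instead of a conditional jump, removing the mispredict risk. Whether cmov is actually emitted depends on compiler and flags.

```c
/* Branchless floor-clamp: same result as an if/else, but shaped so
   the compiler can select the value with cmov rather than branching. */
int clamp_floor(int x, int lo) {
    return (x < lo) ? lo : x;
}
```

Worth checking the generated assembly (e.g. in ddd's disassembly view) to confirm the cmov.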


Partial Writes: WC buffers

Causes:

1) Too many WC streams

2) WB loads/stores contending for fill buffers to access L2 cache or memory

Detection (VTune), event-based sampling:

– Ext. Bus Partial Write Trans.

– L2 Cache Request

– Ext. Bus Burst Read Trans.

– Ext. Bus RFO Trans.

[Diagram: first-level cache feeding fill/WC buffers, second-level cache, FSB, memory. An incomplete WC buffer (8B + 8B + 8B + empty) drains as 3 "partial" 8-byte bus transactions; a complete WC buffer drains as 1 bus transaction.]

Partial writes reduce actual front-side bus bandwidth

– ~3x lower for PIII

– ~7x lower for Pentium 4 processor due to longer cache line


Store Forwarding Guidelines

Store forward: a load from an address just stored to can receive its data directly from the store buffer, more quickly than via a memory access.

Large penalty for non-forwarding cases (1.1-1.3x)

Will forward:

– Load aligned with store

– Load contained in store

– Load contained in single store

Forwarding penalty:

– Load not aligned with, not contained in, or spanning multiple stores

– 128-bit forwards must be 16-byte aligned (16-byte boundaries)

MSVC* before 7.0 generates such non-forwarding cases; the Intel Compiler does not.


And the Programmer?

 Pick the right compiler for HLL programs

 Use VTune to check asm code

 In asm programs, ensure loads after stores are:

– Contained in stored data, subset or proper subset

– In single previous store, not in sum of multiple stores

– Thus do store-combining: assemble together, then store

– Both data start on same address
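Store-combining can be sketched in C (an illustrative model, not the deck's asm): assemble the two halves in a register-sized temporary, perform one full-width store, then load at the same size and address, exactly the forwarding-friendly pattern the guidelines describe.

```c
#include <stdint.h>
#include <string.h>

/* Instead of two 16-bit stores followed by a 32-bit load (a classic
   store-forwarding failure), combine first, then store once.
   Little-endian half placement, as on x86. */
uint32_t combine_halves(uint16_t lo, uint16_t hi) {
    uint32_t word = (uint32_t)lo | ((uint32_t)hi << 16);  /* store-combining */
    uint8_t buf[4];
    memcpy(buf, &word, 4);   /* one full-width store */
    uint32_t out;
    memcpy(&out, buf, 4);    /* full-width load: same size, same start address */
    return out;
}
```

The memcpy calls model the store/load pair; compilers lower them to plain moves.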


Misc Limitations

 Spin-Lock in Multi Thread

– Don't busy-wait just because you have (almost) a second processor for the second thread

 Misaligned data

– Don't align data on arbitrary boundary, just because architecture can fetch from any address

 Dumb errors

– Failing to use the proper tool (library, compiler, performance analyzer)

– Failing to use tiling (aka blocking) or SW pipelining

 Denormalized Floats


And the Programmer?

 Use pause, when applicable!

– New NetBurst instruction
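A spin-wait using pause can be sketched in C: `_mm_pause()` is the compiler intrinsic for the pause instruction (in `immintrin.h` on x86 compilers). The non-x86 fallback and the bounded-iteration shape are assumptions for portability, not part of the deck.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define spin_relax() _mm_pause()   /* pause: de-pipelines the spin, frees shared units */
#else
#define spin_relax() ((void)0)     /* assumed fallback on non-x86 */
#endif

/* Bounded spin-wait on a flag; returns the flag's final value.
   Each idle iteration issues pause so a Hyperthreading sibling
   thread is not starved of execution resources. */
int spin_wait(const volatile int *flag, long max_iters) {
    long i = 0;
    while (!*flag && i < max_iters) {
        spin_relax();
        i++;
    }
    return *flag;
}
```

In real code the flag would be set by another thread; the bound just keeps the sketch testable.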

 Use compiler switches to align data on addresses divisible by the size of the largest individual data object

– Who cares about wasting 7 bytes to force 8-byte alignment?

 Be smart, pick the right tools

– Instruct compiler to SW pipeline

– In asm, manually SW pipeline; note this is easier on EPIC than on VLIW, since prologue and epilogue can sometimes be omitted

– Enable compiler to partition larger data structures into smaller suitable blocks, for improved locality

– cache parameter dependent
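Blocking for locality can be sketched in C (tile size is an illustrative, cache-parameter-dependent choice, as the slide says): a tiled transpose visits the matrix in small square blocks so both source and destination lines stay cache-resident.

```c
#define TILE 16   /* tile edge; should be tuned to cache line / cache size */

/* Tiled (blocked) transpose of an n x n matrix: the two outer loops
   walk TILE x TILE blocks, the two inner loops stay inside one block,
   improving locality over a plain row-by-row transpose. */
void transpose_tiled(const double *src, double *dst, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The same partitioning idea applies to matrix multiply and other large-array kernels.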


And the Programmer?

 First of 2 labs; this one is a "two-minute" exercise:

 Turn on your computer, verify Linux is alive

 Verify you have available:

– Editor to modify program

– Intel C++ compiler, text command icc, with -g

– Debugger ddd, with disassembly ability

 Source program vscal.cpp

 Linux commands: ls, vi, icc, mkdir, etc.


Module Summary

Covered: key causes that render execution slower than possible:

 More registers are at your disposal than it seems

 von Neumann bottleneck can be softened via cache use and data pre-fetch

 Stalls can be reduced by conditional move, avoiding false dependences

 Use (time limited) capabilities, such as proper store forwarding

 Note new Pause instruction


x86 Architecture

Progression


Agenda: x86 Arch. Progression

 Abstract & Objectives

 x86 Nomenclature & Notation

 Intel® Architecture Progress

 Pentium 4 Abstract


Abstract & Objectives: x86 Architecture Progression

 Abstract: High-level introduction to history and evolution of increasingly powerful 16-bit and

32-bit x86 processors that are backwards compatible.

 Objectives: understand processor generations and architectural features, by learning

– Progressive architectural capabilities

– Names of corresponding Intel processors

– Explanation, description of capabilities

– FP incompatibility, minor


Non-Objectives

 Objective is not introduction of:

– x86 assembly language, assumed known

– Itanium® processor family, now in 3rd generation

– Intel tools (C++, VTune)

– Performance tools: MS Perfmon, Linux ps, emon,

HP PFMon, etc.

– Performance benchmarks, performance counters

– Differentiation Intel vs. competitor products

– CISC vs. RISC


x86 Nomenclature & Notation

Example: Pentium® II, 2H98, 450 MHz (processor name, initial launch date, final clock speed)

MMX, BX chipset (architecturally visible enhancement list; can be empty)

Dynamic branch prediction enhanced (architectural speedup technique, invisible except for higher speed)


Intel® Architecture Progress


Intel ® Pentium ® 4 Processors

Processor family / description:

– Northwood (Pentium®): Willamette shrink. Consumer and business desktop processor. HT not enabled, though capable.

– NW E-Step (Pentium®): HT errata corrected. Desktop processor.

– Prescott (Pentium®): Consumer and business desktop processor. Replaces NW. Offers 6 PNI: Prescott New Instructions. First processor with LaGrande technology (trusted computing).

– Prestonia DP (Xeon™): DP slated for workstations and entry-level servers. Based on NW core. HT enabled. 512 kB L2 cache. No L3. 3 GHz processor.

– Nocona DP (Xeon™): DP based on Prescott core. Targeted for 3.06 GHz. 533 MHz (quad-pumped) bus, i.e. bus speed is 133 MHz. 1 MB L2 cache. HT enabled. About to be launched.

– Foster MP (Xeon™): MP based on Willamette core. 1 MB L3 cache, 256 kB L2, HT enabled. For higher-end servers.

– Gallatin MP (Xeon™): MP based on NW core. 1 or 2 MB L3 cache, 512 kB L2 cache. For high-end servers. See 8-way HP DL760 and IBM x440. HT enabled.

– Potomac MP (Xeon™): MP based on Prescott core. 533 MHz (quad-pumped) bus. 1 MB L2 cache, 8 MB L3 cache. HT enabled, yet to be launched.


Note: lower clock rates for MP versions, due to higher circuit complexity and bus load.

Processor Generation Comparison

– Pentium® III processor: 450-600 MHz; L2 cache: 512k off-die; execution type: Dynamic; system bus: 100 MHz; MMX™ technology: Yes; SSE: Yes; SSE2: No; process: .25 micron; chipset: ICH-1

– Pentium® III processor: 600 MHz-1.13 GHz; L2 cache: 256k on-die; execution type: Dynamic; system bus: 133 MHz; MMX™ technology: Yes; SSE: Yes; SSE2: No; process: .18 micron; chipset: ICH-2

– Pentium® 4 processor: 1.5 GHz; L2 cache: 256k on-die; execution type: Intel® NetBurst™ microarchitecture; system bus: 400 MHz (4x100 MHz); MMX™ technology: Yes; SSE: Yes; SSE2: Yes; process: .18 micron; chipset: ICH-2

– Northwood: 2+ GHz; L2 cache: 512k on-die; execution type: Intel® NetBurst™ microarchitecture; system bus: 400/533 MHz (4x100/133 MHz); MMX™ technology: Yes; SSE: Yes; SSE2: Yes; process: .13 micron; chipset: ICH-2


Intel® Architecture Progress

 8087 co-processor of 8086: off-chip FP computation, extended 80-bit FP format for DP

 MMX : multi-media extensions

– MMX regs aliased with FP register stack

– Needs context switch

– FP regs also called ST(i) regs

 SSE : Streaming SIMD extension already since

Pentium III

 WNI : 144 new instructions, using additional data types for existing opcodes, using previously reserved opcodes


Intel® Architecture Progress

 XMM : 8 new 128-bit registers, in addition to

MMX

 SSE2 : multiple integer ops and multiple DP FP ops: part of 144 WNI

– Regs unchanged in Pentium ® 4 from P III

– Ops added

 NetBurst : generic term for: HyperThreading & quad-pumped bus & new Trace Cache & etc.


Note: architectural feature ages with next generation, but survives, due to compatibility requirement. Hence is interesting not only for historical reasons:

You need to know it!


Xeon™ MP Abstract

Xeon™ MP Processor "Gallatin":

– Physical addressing: 64 GB (PAE-36; 36-bit, since Pentium Pro)

– System bus bandwidth: 3.2 GB/s (400 MHz bus)

– External cache: L3, 1 or 2 MB

– On-die cache: L2 512 KB; L1 12K TC, 8K D

– Hyperthreading Technology: 2x logical CPU

– Pipeline stages: 20

– Core frequency: 2.0+ GHz

– Instructions/clock-cycle: 3

[Diagram also labels: 2x ALU; issue ports 1-6; 24 registers (126); execution units: 8 integer, 1 multimedia, 2 floating point]


Xeon™ Memory Hierarchy

Note: Physical Address Extension, 36-bit PAE addresses, since Pentium® Pro

– External memory: 64 GB; 3.2 GB/s bus to the processor

– L3: 2 MB, 8-way, 128B lines, 21+ CLKS; 12.8 GB/s to L2

– L2 (unified): 512 KB, 8-way, 128B lines, 7+ CLKS

– TC: 12KB, 64B lines, 2 CLKS

– L1 (DL0): 8 KB, 64B lines, 2 CLKS


Architecture

Enhancements


Agenda: Architecture Enhancements

 Abstract & Objectives

 Faster Clock

 Caches: Advantage, Cost, Limitation

 Multi-Level Cache-Coherence in MP

 Register Renaming

 Speculative, Out of Order Execution

 Branch Prediction, Code Straightening


Abstract & Objectives:

Architecture Enhancements

 Abstract: Outline generic techniques that overcome performance limitations

 Objectives: understand the cost of architectural techniques (tricks) in terms of resources (silicon space) and of lost performance when a guess is incorrect

– Caches: cost silicon, can slow down

– Branch prediction: costs silicon, can be wrong

– Prefetch: costs instruction, may be superfluous

– Superscalar: may not find a second op


Non-Objectives

 Objective is not to explain detail of Intel processor architecture

 Not to claim Intel invented techniques; academia invented many

 Not to show all techniques; some apply mainly to EPIC or VLIW architectures

 No hype, no judgment, just the facts please!


Faster Clock

 CISC:

– Decompose circuitry into multiple simple, sequential modules

 Resulting modules are smaller and thus can be fast:

– high clock rate

– Shorter speed-paths

 That's what we call: pipelined architecture

 More modules -> simpler modules -> faster clock -> super-pipelined

 Super-pipelining NOT goodness per-se:

– Saves no silicon

– Execution time per instruction does not improve

– May get worse, due to delay cycles

 But:

– Instructions retired per unit time improves

– Especially in absence of (large number of) control-flow stalls


Faster Clock

 Xeon™ processor pipeline has 20 stages

[Diagram: classic pipeline with stages I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store, overlapped across successive instructions]

Intel® NetBurst™ µarchitecture: 20-stage pipeline

1-2: TC Nxt IP; 3-4: TC Fetch; 5: Drive; 6: Alloc; 7-8: Rename; 9: Que; 10-12: Sch; 13-14: Disp; 15-16: RF; 17: Ex; 18: Flgs; 19: Br Ck; 20: Drive

 Beautiful model breaks upon control transfer


Intel ® x86

Architectures


Agenda: Intel x86 Architectures

 Abstract & Objectives

 High Speed, Long Pipe

 Multiprocessing

 MMX Operations

 SSE Operations

 SSE2 Operations

 Willamette New Instructions WNI

 Cacheability Instructions

 Pause Instruction

 NetBurst, Hyperthreading

 SW Tools


Abstract & Objectives:

Intel ® x86 Architectures

 Abstract: Emphasizing Pentium ® 4 processors, show progressively more powerful architectural features introduced in Intel processors. Refer to speed problems solved from module 2 and general solutions explained in module 3.

 Objective: you not only understand the various processor product names and supported features (Intel marketing names), but understand how they work, and what their limitations and costs are.


Non-Objectives

 Objective is not to show Intel's techniques are the only ones, or the best possible. They are just a good trade-off in light of conflicting constraints:

– Clock speed vs. small # of pipes

– Small transistor count vs. high performance

– Large caches vs. small silicon space

– Grandiose architecture vs. backward compatibility

– Need for large register file vs. register-starved x86

– Wish to have two full on-die processors vs. preserving silicon space


High Speed, Long NetBurst™ Pipe

Basic Pentium® Pro pipeline (intro at 733 MHz, .18µ):

1-2: Fetch; 3-5: Decode; 6: Rename; 7: ROB Rd; 8: Rdy/Sch; 9: Dispatch; 10: Exec

Basic NetBurst™ micro-architecture pipeline:

1-2: TC Nxt IP; 3-4: TC Fetch; 5: Drive; 6: Alloc; 7-8: Rename; 9: Que; 10-12: Sch; 13-14: Disp; 15-16: RF; 17: Ex; 18: Flgs; 19: Br Ck; 20: Drive

Hyper-pipelined technology enables industry-leading performance and clock rate


NetBurst™ intro clocks: 1.4 GHz at .18µ; 2.2 GHz at .13µ

Check Your Progress

Match pipe functions to clocks/stages 1-20:

– Trace Cache/Next IP: read from Branch Target Buffer; 2 clks

– Trace Cache Fetch: read decoded µops from TC; 2 clks

– Drive: drive µops to the Allocator; 1 clk

– Allocate: allocate resources for execution; 1 clk

– Rename: rename logical regs to physical regs; 2 clks

– Queue: write µop into µop queue to wait for scheduling; 1 clk

– Schedule: write to schedulers; compute dependencies; 3 clks

– Dispatch: send µops to appropriate execution unit; 2 clks

– Register File: read the register file; 2 clks

– Execute: execute the µops on the correct port; 1 clk

– Flags: compute flags (0, negative, etc.); 1 clk

– Branch Check: compare actual branch to predicted; 1 clk

– Drive: drive the branch result to BTB at front; 1 clk


Multiprocessing, SMP

 Def: Execution of 1 task by >= 2 processors

 Flynn Model (1966):

– Single-Instruction, Single-Data Stream (SISD)

Architecture (PDP-11)

– Single-Instruction, Multiple-Data Stream (SIMD)

Architecture (Array Processors, Solomon, Illiac IV,

BSP, TMC)

– Multiple-Instruction, Single-Data Stream (MISD)

Architecture (possibly: pipelined, VLIW, EPIC)

– Multiple-Instruction, Multiple-Data Stream

Architecture (possibly: EPIC when SW-pipelined, true multiprocessor)


MP Scalability Caveat

Performance gain from doubling processors:

Processors: 2, 4, 8, 16, 32, 64, 128

Gain: 0.90, 0.81, 0.73, 0.59, 0.43, 0.25, 0.11

Gain follows law of diminishing returns


Intel® Xeon™ Processor Scaling

[Chart: SPECint_rate_base2000 and OLTP throughput for three 2P configurations. SPECint_rate_base2000: 1.00 (2.2 GHz, 400 MHz bus, 512 KB cache), 1.25 (3.06 GHz, 533 MHz bus, 512 KB cache), 1.40 (3.06 GHz, 533 MHz bus, 1 MB cache). OLTP: 1.00, 1.31, 1.68 for the same configurations. Frequency scaling ~1.39x; scaling is more visible with the large cache.]

Source: Intel Corporation

Based on Intel internal projections. System configuration assumptions: 1) two Intel® Xeon™ processor 2.8GHz with 512KB L2 cache in an E7500 chipset-based server platform, 16GB memory, Hyperthreading enabled; 2) Four Intel® Xeon™ processor MP 1.6GHz with 1MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 3)

Four Intel® Xeon™ processor MP 2.0GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled; 4) Four Intel® Xeon™ processor MP

2.8GHz with 2MB L3 cache in a GC-HE chipset-based server platform, 32GB memory, Hyperthreading enabled

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other source of information to evaluate the performance of systems or components they are considering purchasing.


Intel® Xeon™ MP vs. Xeon™ Relative OLTP Performance

Which processor is better?

– (2P) Intel® Xeon™ processor @ 2.8 GHz, 533 MHz bus, no L3: 1.00

– (4P) Intel® Xeon™ processor MP 2.0 GHz, 400 MHz bus, 2 MB L3: 2.00

Source: TPC.org

Xeon processor MP targeted for OLTP


MMX Integer Operations

 Add (wraparound): paddw mm0, mm3 (packed add on words)

Each of the four word lanes adds independently: a3+b3, a2+b2, a1+b1, a0+b0

Example, one word lane: mm0 = F000h, mm3 = 3000h, result mm0 = 2000h (sum wraps)

 Add (saturation): paddusw mm0, mm3 (packed add with unsigned saturation on words)

Example, one word lane: mm0 = F000h, mm3 = 3000h, result mm0 = FFFFh (sum saturates)


MMX Arithmetic Operations

 Multiply-low: pmullw mm0, mm3 (multiply low, words)

Each word lane ai is multiplied by bi; the low 16 bits of each 32-bit product are written back: mm0 = (low(a3*b3), low(a2*b2), low(a1*b1), low(a0*b0))

 Multiply-high: pmulhw mm1, mm4 (multiply high, words)

Each word lane ai is multiplied by bi; the high 16 bits of each product are written back: mm1 = (high(a3*b3), high(a2*b2), high(a1*b1), high(a0*b0))


MMX Arithmetic Operations

 Multiply-Add: pmaddwd mm1, mm4 (packed multiply and add, 4 words to 2 doublewords)

Word products are formed pairwise and adjacent products are summed: mm1 = (a3*b3 + a2*b2, a1*b1 + a0*b0)

Note: this instruction does not have a saturation option.
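The pmaddwd behavior can be modeled in C (a scalar sketch of the 4-words-to-2-doublewords reduction, not the real instruction):

```c
#include <stdint.h>

/* Model of pmaddwd: multiply four signed 16-bit pairs and add
   adjacent 32-bit products, yielding two doublewords. */
void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}
```

This pairwise multiply-accumulate shape is why pmaddwd is the workhorse of integer dot products and filters.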


MMX Convert Operations

 Unpack, interleaved merge: punpcklwd mm0, mm1 (unpack low words into doublewords)

mm0 = (a3, a2, a1, a0), mm1 = (b3, b2, b1, b0), result mm0 = (b1, a1, b0, a0)

 punpckhwd mm0, mm1 (unpack high words into doublewords)

mm0 = (a3, a2, a1, a0), mm1 = (b3, b2, b1, b0), result mm0 = (b3, a3, b2, a2)

 Zero extend from small data elements to bigger data elements by using the unpack instruction, with zeros in one of the operands.
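The zero-extension trick can be modeled in C (a scalar sketch of interleaving data words with a zero operand, little-endian lane order as on x86):

```c
#include <stdint.h>

/* Interleaving each 16-bit data word with a zero word (as punpcklwd
   against a zeroed register does) widens the element to 32 bits:
   low half = data, high half = 0. */
void zext16to32_via_unpack(const uint16_t in[2], uint32_t out[2]) {
    for (int i = 0; i < 2; i++)
        out[i] = (uint32_t)in[i] | ((uint32_t)0 << 16);  /* zero goes in the high word */
}
```

The result is the same as a per-element unsigned widening cast, but the SIMD form does all lanes at once.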


MMX Convert Operations

 Pack: packusdw mm0, mm1 (pack with unsigned saturation, signed doublewords into words)

mm0 = (C, D), mm1 = (A, B), result mm0 = (A', B', C', D'), each doubleword saturated down to a word


MMX Shift Operations

 psllq MM0, 8 (packed shift left logical quadword)

MM0 = 703F 0000 FFD9 4364h, result MM0 = 3F00 00FF D943 6400h (the entire 64-bit value shifted left 8 bits)

 psllw MM0, 8 (packed shift left logical words)

MM0 = 703Fh DF00h 81DBh 007Fh, result MM0 = 3F00h 0000h DB00h 7F00h (each 16-bit word shifted left 8 bits independently)


MMX Compare Operations

 pcmpgtw: compare greater, words (generates a mask)

(51, 3, 5, 23) > (73, 2, 5, 6), result (000...00, 111...11, 000...00, 111...11)

Each word lane where the first operand is greater yields all ones; otherwise all zeros.


SSE Registers

IA-INT registers (EAX ... EDI, 32-bit):

– Fourteen 32-bit registers

– Direct register access

– Scalar data only

Streaming SIMD Extension registers (XMM0 ... XMM7, 128-bit integer):

– Eight 128-bit registers

– Single-precision / double-precision / 128-bit integer

– Direct access to registers

– Referred to as XMM0-XMM7

– Use simultaneously with FP / MMX™ technology

– Data array only


MMX™ Technology / IA-FP registers (FP0 or MM0 ... FP7 or MM7; 80-bit FP, 64-bit MMX):

– Eight 64-bit registers

– Xor eight 80-bit FP regs (one use or the other)

– Direct access to regs

– FP data / data array

– x87 remains aliased with SIMD integer registers

– Context-switch required between uses

SSE Arithmetic Operations

Full precision: ADD, SUB, MUL, DIV, SQRT

– Floating point (packed/scalar)

– Full 23-bit precision

Approximate precision: RCP (reciprocal), RSQRT (reciprocal square root)

– Perspective correction / projection

– Vector normalization

– Very fast

– Return at least 11 bits of precision


SSE Arithmetic Operations

MULPS: Multiply Packed Single-FP

mulps xmm1, xmm2

xmm1 = (X4, X3, X2, X1), xmm2 = (Y4, Y3, Y2, Y1), result xmm1 = (X4*Y4, X3*Y3, X2*Y2, X1*Y1)
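The packed multiply can be modeled in C (a scalar sketch of the four-lane semantics, not the real instruction or intrinsic):

```c
/* Model of mulps: element-wise multiply of four packed
   single-precision floats; dst plays the role of xmm1. */
void mulps_model(float dst[4], const float src[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] *= src[i];
}
```

One instruction thus does the work of four scalar multiplies.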


SSE Compare Operation

CMPPS: Compare Packed Single-FP

cmpps xmm0, xmm1, 1 (predicate 1 = less-than)

xmm0 = (1.1, 7.3, 2.3, 5.6), xmm1 = (8.6, 2.3, 3.5, 1.2), result xmm0 = (111...11, 000...00, 111...11, 000...00)

Each lane where the condition holds yields an all-ones mask; otherwise all zeros.


SSE2 Registers

SSE2 registers are identical to the SSE registers.

IA-INT registers (EAX ... EDI, 32-bit):

– Fourteen 32-bit registers

– Direct register access

– Scalar data only

Streaming SIMD Extension registers (XMM0 ... XMM7, 128-bit; scalar / packed SIMD-SP, SIMD-DP, 128-bit integer):

– Eight 128-bit registers

– Single-precision array / double-precision array / 128-bit integer

– Direct access to registers

– Referred to as XMM0-XMM7

– Use simultaneously with FP / MMX™ technology

– Data array only

MMX™ Technology / IA-FP registers (FP0 or MM0 ... FP7 or MM7):

– Eight 64-bit registers

– Xor eight 80-bit FP regs

– Direct access to regs

– FP data / data array

– x87 remains aliased with SIMD integer registers

– Context-switch


SSE2 Register Use

 New 64-bit double-precision floating point instructions

 New / enhanced 128-bit wide SIMD integer instructions

– Superset of MMX™ technology instruction set

 No forced context switching on SSE registers (unlike MMX™/x87 registers)

Instruction types:

– Standard x87 (SP, DP, EP)

– 64-bit SIMD int. (4x16, 8x8)

– 128-bit SIMD int. (8x16, 16x8)

– Single-precision SIMD FP (4x32)

– Double-precision SIMD FP (2x64)

– Cache management (memory streaming/prefetch)

Backward compatible with all existing MMX™ & SSE code


Willamette New Instructions

 New Instructions

 Extended SIMD Integer Instructions

 New SIMD Double-precision FP Instructions

 New Cacheability Instructions

 Fully Integrated into Intel Architecture

– Use previously reserved opcodes

– Same addressing modes as MMX™ / SSE ops

– Several MMX™ / SSE mnemonics are repeated

– New Extended SIMD functionality is obtained by specifying 128-bit registers (xmm0-xmm7) as src/dst.


SIMD Double-Precision FP Ops

 Same instruction categories as SIMD singleprecision FP instructions

 Operate on both elements of packed data, in parallel -> SIMD

 Some instructions have scalar or packed versions (packed: X2, X1; scalar: X1 only)

Double-precision FP format: bit 63: S (sign); bits 62-52: exponent; bits 51-0: significand

 IEEE 754 Compliant FP Arithmetic

– Not bit exact with x87 : 80 bit internal vs 64 bit mem

 Usable in all modes: real, virtual x86, SMM, and protected (16-bit & 32-bit)


FP Instruction Syntax

 Arithmetic FP instructions can be:

– Packed or Scalar

– Single-precision or Double-precision

  ASM     Intrinsic        Meaning
  addps   _mm_add_ps()     Add Packed Single
  addpd   _mm_add_pd()     Add Packed Double
  addss   _mm_add_ss()     Add Scalar Single
  addsd   _mm_add_sd()     Add Scalar Double
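The table maps directly onto compiler intrinsics. Below is a minimal sketch, assuming an SSE2-capable x86 compiler and the standard <emmintrin.h> header; the helper names are illustrative:

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE set */

/* addps adds all four single-precision lanes; addss adds only lane 0
 * and passes the upper lanes of the first operand through unchanged. */
static void add_packed_ps(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

static void add_scalar_ss(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* The double-precision forms behave the same way, on two lanes. */
static void add_packed_pd(const double a[2], const double b[2], double out[2])
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```

For {1,2,3,4} + {10,20,30,40}, add_packed_ps yields {11,22,33,44}, while add_scalar_ss yields {11,2,3,4}: only the least-significant element is added.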


New SSE2 Data Types

 Packed & scalar FP instructions operate on packed single- or double-precision floating-point elements

– Packed instructions operate on 4 (SP) or 2 (DP) elements:

  addps: [X4 X3 X2 X1] op [Y4 Y3 Y2 Y1] -> [X4opY4 X3opY3 X2opY2 X1opY1]
  addpd: [X2 X1] op [Y2 Y1] -> [X2opY2 X1opY1]

– Scalar instructions operate only on the right-most (least-significant) element; the upper elements of the destination pass through unchanged:

  addss: [X4 X3 X2 X1] op [Y4 Y3 Y2 Y1] -> [Y4 Y3 Y2 X1opY1]
  addsd: [X2 X1] op [Y2 Y1] -> [Y2 X1opY1]


Extended SIMD Integer Ops

 All MMX™/SSE integer instructions now also operate on 128-bit wide data in the XMM registers

 Additionally, some new functionality:

– MOVDQA, MOVDQU: 128-bit aligned/unaligned moves

– PADDQ, PSUBQ: 64-bit add/subtract for mm & xmm regs

– PMULUDQ: packed 32 x 32-bit multiply

– PSLLDQ, PSRLDQ: 128-bit byte-wise shifts

– PSHUFD: shuffle four doublewords in an xmm register

– PSHUFLW, PSHUFHW: shuffle four words in the lower/upper half of an xmm reg

– PUNPCKLQDQ, PUNPCKHQDQ: interleave low/high quadwords

– Full 128-bit conversions: 4 ints <-> 4 SP floats
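A short sketch of two of the new integer operations through intrinsics (assuming SSE2 and <emmintrin.h>; the helper names are made up for illustration):

```c
#include <emmintrin.h>

/* PSHUFD: each 2-bit field of the immediate picks one source dword.
 * _MM_SHUFFLE(0,1,2,3) encodes 0x1B, which reverses the four lanes. */
static __m128i reverse_dwords(__m128i v)
{
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* PADDQ: the new 64-bit lane-wise add. */
static __m128i add_qwords(__m128i a, __m128i b)
{
    return _mm_add_epi64(a, b);
}
```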


New SIMD Integer Data Formats

 New 128-bit data types for fixed-point integer data (bits 127..0):

– 16 packed bytes (16 x 8-bit)

– 8 packed words (8 x 16-bit)

– 4 packed doublewords (4 x 32-bit)

– 2 quadwords (2 x 64-bit)


New DP Instruction Categories

Computation

 ADD, SUB, MUL, DIV, SQRT, MAX, MIN (packed & scalar)

– Full 52-bit precision mantissa

Logic

 AND, ANDN, OR, XOR

– Operate uniformly on the entire 128-bit register

– Must use the DP instructions for double-precision data

Data Formatting

 MOVAPD, MOVUPD – 128-bit DP moves (aligned/unaligned)

 MOVHPD, MOVLPD, MOVSD – 64-bit DP moves

 SHUFPD – shuffle packed doubles; select data using a 2-bit immediate operand


DP Packed & Scalar Operations

 The new packed & scalar FP instructions operate on packed double-precision floating-point elements

– Packed instructions operate on two numbers:

  addpd: [X2 X1] op [Y2 Y1] -> [X2opY2 X1opY1]

– Scalar instructions operate on the least-significant number only:

  addsd: [X2 X1] op [Y2 Y1] -> [Y2 X1opY1]

SHUFPD Instruction

SHUFPD: Shuffle Packed Double-FP

Given XMM1 = [x2 x1] and XMM2 = [y2 y1], the low bit of the immediate selects the low result element from XMM1 and the high bit selects the high result element from XMM2:

  SHUFPD XMM1, XMM2, 3   // binary 11: XMM1 = [y2 x2]
  SHUFPD XMM1, XMM2, 2   // binary 10: XMM1 = [y2 x1]
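The same selection is available from C through the _mm_shuffle_pd intrinsic. A sketch under the usual SSE2 assumptions (arrays hold the low element first, so x[0] is x1):

```c
#include <emmintrin.h>

/* SHUFPD imm: bit 0 selects the low result element from the first
 * operand (x1 or x2), bit 1 the high result element from the second
 * operand (y1 or y2). */
static void shufpd(const double x[2], const double y[2], int imm, double out[2])
{
    __m128d a = _mm_loadu_pd(x), b = _mm_loadu_pd(y);
    /* the immediate must be a compile-time constant, so enumerate */
    switch (imm & 3) {
    case 0:  _mm_storeu_pd(out, _mm_shuffle_pd(a, b, 0)); break; /* [y1 x1] */
    case 1:  _mm_storeu_pd(out, _mm_shuffle_pd(a, b, 1)); break; /* [y1 x2] */
    case 2:  _mm_storeu_pd(out, _mm_shuffle_pd(a, b, 2)); break; /* [y2 x1] */
    default: _mm_storeu_pd(out, _mm_shuffle_pd(a, b, 3)); break; /* [y2 x2] */
    }
}
```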

New DP instruction Categories, Cont'd

Branching

 CMPPD, CMPSD – compare & mask (packed/scalar)

 COMISD – scalar compare, set status flags

 MOVMSKPD – store 2-bit mask of the DP sign bits in a reg32

Type Conversion

 CVT – convert DP to SP & 32-bit integer w/ rounding (packed/scalar)

 CVTT – convert DP to 32-bit integer w/ truncation (packed/scalar)
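The rounding/truncation distinction is visible from C. A sketch assuming SSE2, <emmintrin.h>, and the default MXCSR rounding mode (round to nearest even):

```c
#include <emmintrin.h>

/* CVTPD2DQ: two DP floats -> two 32-bit ints, using the current
 * rounding mode. The result occupies the low 64 bits of the xmm reg. */
static void dp_to_int_rounded(const double in[2], int out[2])
{
    _mm_storel_epi64((__m128i *)out, _mm_cvtpd_epi32(_mm_loadu_pd(in)));
}

/* CVTTPD2DQ: same conversion, but always truncates toward zero,
 * matching C cast semantics without having to change MXCSR. */
static void dp_to_int_truncated(const double in[2], int out[2])
{
    _mm_storel_epi64((__m128i *)out, _mm_cvttpd_epi32(_mm_loadu_pd(in)));
}
```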


Compare & Mask Operation

CMPPD: Compare Packed Double-FP

CMPPD XMM0, XMM1, 1   // predicate 1 = less than

  XMM0 = [12.3  1.1], XMM1 = [3.5  8.6]

  12.3 < 3.5 ? no  -> 0000000…000
   1.1 < 8.6 ? yes -> 1111111…111

  XMM0 = [0000000…000  1111111…111]
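The per-element masks combine naturally with MOVMSKPD from the previous slide to produce a branchable value. A sketch under the usual SSE2 assumptions:

```c
#include <emmintrin.h>

/* CMPLTPD writes an all-ones mask in each element where a < b and
 * all-zeros otherwise; MOVMSKPD compresses the two sign bits into a
 * 2-bit integer, so bit i is set iff a[i] < b[i]. */
static int less_than_mask(const double a[2], const double b[2])
{
    __m128d m = _mm_cmplt_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(m);
}
```

With the slide's values a = {1.1, 12.3} and b = {8.6, 3.5}, the result is 1: only the low comparison holds.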


Cache Enhancements

 On-die trace cache (TC) for decoded µops

– Holds 12K µops

 8K on-die 1st-level data cache (L1)

– 64-byte line size (Pentium Pro was 32 bytes)

– Ultra-fast, multiple accesses per instruction

 256K on-die 2nd-level write-back, unified data and instruction cache (L2)

– 128-byte line size

– Operates at full processor clock frequency

 PREFETCH instructions return 128 bytes to L2


New Cacheability Instructions

 MMX™/SSE cacheability instructions preserved

 New Functionality:

– CLFLUSH: Cache line flush

– LFENCE / MFENCE: Load Fence / Memory Fence

– PAUSE: Pause execution

– MASKMOVDQU: Mask move 128-bit integer data

– MOVNTPD: Streaming store of two 64-bit DP FP elements

– MOVNTDQ: Streaming store with 128-bit integer data

– MOVNTI: Streaming store with 32-bit integer data


Streaming Stores

 Willamette implementation supports:

– Writing to uncacheable buffer (e.g. AGP) with full line-writes

– Re-reading same buffer with full line-reads

– New in WNI, compared to Katmai/CuMine

 Integer streaming store

– Operates on integer registers (e.g., EAX, EBX)

– Useful for the OS: avoids the need to save FP state, just moves raw bits
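A sketch of an integer streaming copy (assuming SSE2 and <emmintrin.h>; _mm_stream_si32 compiles to MOVNTI):

```c
#include <emmintrin.h>

/* Non-temporal copy: the stores bypass the caches, so a large buffer
 * written once does not evict useful data. The SFENCE orders the
 * weakly-ordered streaming stores before anyone reads the buffer. */
static void copy_stream(int *dst, const int *src, int n)
{
    int i;
    for (i = 0; i < n; ++i)
        _mm_stream_si32(dst + i, src[i]);   /* movnti */
    _mm_sfence();
}
```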


Detail: Cache Line Flush

 CLFLUSH: the cache line containing m8 is flushed and invalidated from all caches in the coherency domain

 Linear-address based; allowed in user code

 Potential usage:

– Allows incoherent (AGP) I/O data to be mapped as WB for high read performance, and flushed when updated

– Example: video encode stream

– Precise control of dirty-data eviction may increase performance by scheduling writebacks during idle memory cycles
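A sketch of flushing a freshly written buffer line by line (assuming SSE2 and a 64-byte line size; production code should query the actual line size via CPUID):

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>

/* Flush every cache line covering [buf, buf+bytes) so an incoherent
 * reader (e.g. an AGP device) observes the updated data. */
static void flush_buffer(const void *buf, size_t bytes)
{
    const char *end = (const char *)buf + bytes;
    const char *p   = (const char *)((uintptr_t)buf & ~(uintptr_t)63);
    for (; p < end; p += 64)
        _mm_clflush(p);
    _mm_mfence();   /* order the flushes before subsequent accesses */
}
```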


Detail: Fences

 Capabilities introduced over time to enable software-managed coherence:

– Write combining with the Pentium Pro processor

– SFENCE and memory streaming with the Streaming SIMD Extensions

 The new Willamette fences complete the tool set for full software coherence management:

– LFENCE: strong load order

– Blocks younger loads from passing a prior load instruction

– All loads preceding an LFENCE complete before any load after the LFENCE

– MFENCE: achieves the effect of LFENCE and SFENCE executed at the same time

– Necessary, because issuing an SFENCE followed by an LFENCE does not prevent a load from passing a prior store
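A sketch of the classic use: publishing data written with weakly-ordered (streaming) stores through a flag. The function names are illustrative; in a real program producer and consumer run on different processors:

```c
#include <emmintrin.h>

static int payload;
static volatile int ready;

static void producer(int value)
{
    payload = value;   /* imagine streaming stores here */
    _mm_sfence();      /* all prior stores drain before the flag is set */
    ready = 1;
}

static int consumer(void)
{
    while (!ready)
        ;              /* spin; a real loop would use PAUSE */
    _mm_lfence();      /* later loads cannot pass the flag load */
    return payload;
}
```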


Pause Instruction

 PAUSE is architecturally a NOP on all IA-32 processor generations

 Usable since Willamette; not necessary to check the processor type

 PAUSE is a hint to the processor that the code is a spin-wait or otherwise non-performance-critical. A processor that uses the hint can:

– Significantly improve the performance of spin-wait loops, without negative side effects, by inserting an implementation-dependent delay that helps processors with dynamic (a.k.a. out-of-order) execution exit the spin loop faster

– Significantly reduce power consumption during spin-wait loops
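The canonical spin-wait loop then looks as follows (a sketch; _mm_pause from <emmintrin.h> emits PAUSE, which decodes as REP NOP and is therefore safe on older processors):

```c
#include <emmintrin.h>

/* Spin until *flag becomes non-zero. PAUSE tells the out-of-order
 * core not to speculate ahead in the loop and throttles the spin,
 * freeing execution resources for the sibling logical processor. */
static void spin_wait(const volatile int *flag)
{
    while (*flag == 0)
        _mm_pause();
}
```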


NetBurst™ µArchitecture Overview

Block diagram (frequently vs. less frequently used paths omitted):

– Bus Unit, attached to the System Bus

– 1st-level cache (data, 4-way) and 2nd-level cache (8-way)

– Front End: Fetch/Decode, Trace Cache, Microcode ROM

– Out-of-Order Core: Execution, then Retirement

– BTBs / Branch Prediction feeding back into the front end

NetBurst™ µArchitecture

Execution resources shown in the original diagram:

– BTB and µcode ROM

– Store AGU, Load AGU

– Four integer ALUs

– FP move, FP store, FMul, FAdd, MMX™, SSE units

NetBurst™ µArchitecture Summary

 Quad-pumped bus keeps the caches loaded

 Stores the most recent instructions as µops in the TC to enhance instruction issue

 Improves program execution:

– Issues up to 3 µops per clock

– Dispatches up to 6 µops to the execution units per clock

– Retires up to 3 µops per clock

 Feeds back branch and data information to keep required instructions and data available


What is Hyperthreading?

 Ability of the processor to run multiple threads

– Duplicated architecture state creates the illusion of a dual processor (DP) to software

– Execution units are shared between the two threads, but dedicated to one if the other stalls

 Effect of Hyperthreading on the Xeon processor:

– CPU utilization increases to ~50% (from ~35%)

– About 30% performance gain for some applications at the same processor frequency

Hyperthreading Technology Results:

1. More performance with enabled applications

2. Better responsiveness with existing applications


Hyperthreading Implementation

 Almost two logical processors:

– Architecture state (registers) and APIC* are duplicated

– Execution units, caches, branch prediction, control logic and buses are shared

(Diagram: two Architecture State blocks, each with its own Advanced Programmable Interrupt Controller, sharing the on-die caches, the processor execution resources, and the system bus.)

* APIC: Advanced Programmable Interrupt Controller; handles interrupts sent to a specific logical processor


Benefits to Xeon™ Processor

Hyperthreading Technology performance for dual-processor servers:

 Enhancements in bandwidth, throughput and thread-level parallelism with Hyperthreading Technology deliver an acceleration of performance

 Normalized results (HT off = 1.00): WebBench / web-server performance 1.21; Trade2 / Java application-server performance 1.19

 Hyperthreading Technology increases performance by ~20% on some server applications

Source: Veritest (Sep 2002). Comparisons based on Intel internal measurements with pre-production hardware. Both tests used HTT on and off configurations of: Intel® Xeon™ processor 2.80 GHz with 512KB L2 cache, Intel® Server Board SE7501WV2 with Intel® E7501 chipset, 2GB DDR, Microsoft Windows* 2000 Server SP2, Intel® PRO/1000 Gigabit Server adapter, AMI 438 MegaRAID* controller v1.48 16MB EDO RAM, Dell PowerVault 210S disk array.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.


Hyperthreading for Workstation

Intel® Xeon™ processor 2.8GHz with 512KB cache

Hyperthreading Technology performance gains, whether running:

 Multiple tasks within one application

 Multiple applications at once

Normalized results (HT off = 1.00) ranged from 1.15 to 1.37:

– Multi-threaded applications: CHARMm*, 3DSM*5, D2cluster*, BLAST*, Lightwave 3D* 7.5

– Multi-tasking workloads: Patran* + Nastran*, multiple compiles, 3ds max* + Photoshop*, compile + regression, Maya* multiple renderings

Hyperthreading Technology increases performance by 15-37% on workstation applications

Source: Intel Corporation. With and without Hyperthreading Technology on the following system configuration: Intel Xeon processor 2.80 GHz/533 MHz system bus with 512KB L2 cache, Intel® E7505 chipset-based pre-release platform, 1GB PC2100 DDR CL2 CAS2-2-2, (2) 18GB Seagate* Cheetah ST318452LW 15K Ultra160 SCSI hard drives using an Adaptec 39160 SCSI adapter BIOS 3.10.0, nVidia* Quadro4 Pro 980XGL 128MB AGP 8x graphics card with driver version 40.52, Windows XP* Professional build 2600.


Hyperthreading Resources

Type        Description                                                   Example
Shared      Each logical processor can use, evict or                      Caches, WC buffers, VTune reg., MS-ROM
            allocate any part of the resource
Duplicated  Each logical processor has its own set of the resource        APIC, registers, TSC, IP
Split       Resource is hard-partitioned in half                          Load/store buffers, ITLB, ROB, IAQ
Tagged      Resource entries are tagged with the logical-processor ID     Trace cache, DTLB


Xeon Processor Pipeline

(Simplified)

 Buffering queues separate the major pipeline logic blocks

 Each buffering queue is either partitioned or duplicated to ensure independent forward progress through each logic block

Pipeline: Fetch -> Queue -> Decode -> Queue -> TC/MS-ROM -> Queue -> Rename/Allocate -> Queue -> OOO Execute -> Queue -> Retirement

(Per the diagram, the front-end buffering queues are duplicated; the queues from rename/allocate onward are partitioned.)


HT in NetBurst

Block diagram: System Bus -> Bus unit -> 2nd-level cache -> Front End (Fetch/Decode, Trace Cache, MS-ROM) -> 1st-level cache (data, 4-way) -> OOO Execution -> Retirement, with BTBs / branch prediction feeding back. An optional 3rd-level cache is available on server products.

The Front End comprises:

– Execution Trace Cache

– Microcode Store ROM (MSROM)

– ITLB and branch prediction

– IA-32 instruction decode

– Micro-op queue


Front End

 Responsible for delivering instructions to the later pipe stages

 Trace cache hit

– The requested instruction trace is present in the trace cache

 Trace cache miss

– The requested instructions are brought into the trace cache from the L2 cache


Trace Cache Hit

(Front End: each logical processor's IP feeds the shared Trace Cache, which feeds the Micro-Op Queue.)

 Two separate instruction pointers, one per logical processor

 The two logical processors arbitrate for access to the TC each cycle

 If one logical processor stalls, the other uses the full bandwidth of the TC


Programming Models

 Two major types of parallel programming models

– Domain decomposition

– Functional decomposition

 Domain Decomposition

– Multiple threads working on subsets of the data

 Functional Decomposition

– Different computations on the same data

– E.g., motion estimation vs. color conversion, etc.

Both models can be implemented on HT processors
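As a sketch of domain decomposition (OpenMP splits the loop's iteration space across threads, each thread reducing its own subset; compile with the compiler's OpenMP switch, e.g. -fopenmp for GCC; without it the pragma is ignored and the loop runs serially with the same result):

```c
/* Domain decomposition: every thread sums a contiguous chunk of the
 * array; the reduction clause combines the private partial sums. */
static double sum_array(const double *a, int n)
{
    double total = 0.0;
    int i;
    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```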


Threading Implementation

 O/S thread implementations may differ

 Microsoft Win32

– NT threads (supports 1-1 O/S level threading)

– Fibers (supports M-N user level threading)

 Linux

– Native Linux threads (severely broken & inefficient)

– IBM Next Generation POSIX Threads (NGPT): IBM's attempt to fix the native Linux threads

– Red Hat Native POSIX Thread Library for Linux (NPTL): supports 1-1 O/S-level threading and aims to be POSIX compliant

 Others

– Pthreads (generic POSIX-compliant threads)

– Sun Solaris Light Weight Processes (LWP), Sun Solaris user-level threads

Thread Model Issues Somewhat Orthogonal to HT


OS Implications of HT

– All UP (uniprocessor) OS / legacy MP OS: backward compatible, but will not take advantage of Hyperthreading

– Enabled MP OS: OS with basic Hyperthreading Technology functionality

– Optimized MP OS: OS with optimized Hyperthreading Technology support

Fully compatible with ALL existing O/S, but only an optimized O/S enables the most benefits


HT Optimized OS

 Windows XP

– Windows XP

– Windows XP Professional

 Windows 2003

– Enterprise

– Data Center

 Enabled

– RedHat Enterprise Server (version 7.3, 8.0)

– RedHat Advanced Server 2.1

– Suse (8.0, 9.0)


OS Scheduling

 An HT-enabled O/S sees two processors for each physical HT processor

– It enumerates the first logical processor of every physical processor first

 Schedules processors almost the same as regular SMP

– Thread priority determines the schedule, but which CPU a thread is dispatched to matters

– The O/S independently submits a thread's code stream to a logical processor and can independently interrupt or halt each logical processor (no change from SMP)

(Diagram: physical processor 0 carries logical processors 0 and 1 with APIC IDs 00000000 and 00000001; physical processor 1 carries logical processors 0 and 1 with APIC IDs 00000010 and 00000011, as reported via CPUID.)

Thread Management

 Avoid coding practices that penalize hyperthreaded processors, e.g.:

– Avoid 64KB aliasing

– Avoid processor-serializing events (e.g., FP denormals, self-modifying code, etc.)

 Avoid spin locks

– Minimize lock contention to fewer than two threads per lock

– Use PAUSE and O/S synchronization when spin-wait loops must be implemented

 In addition, follow multi-threading best practices:

– Use O/S services to block waiting threads

– Spin as briefly as possible before yielding to the O/S

– Avoid false sharing

– Avoid unintended synchronization (C runtime, C++ template library implementations)
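One of the cheapest fixes on the list is false sharing. A sketch, assuming the 64-byte line size of these processors (the names are illustrative):

```c
/* Two per-thread counters packed next to each other share one cache
 * line and ping-pong it between the logical/physical processors.
 * Padding each counter out to a full line keeps them independent. */
#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* keep neighbors off this line */
};

static struct padded_counter counters[2];  /* one per thread */

static void bump(int tid)
{
    counters[tid].value++;
}
```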


Threading Tools

 Intel ThreadChecker Tool

– Itemization of parallelization bugs and source

– ThreadChecker class

 OpenMP

– Thread model in which programmer introduces parallelism or threading via directives or pragmas

 Intel VTune™ Analyzer

– Provides analysis and drills down to source code

– ThreadChecker integration

 GuideView

– Parallel performance tuning


Software Tools

Intel C/C++ Compiler

– Support for SSE and SSE2 using C++ classes, intrinsics, and assembly

– Improved vectorization and prefetch insertion

– Profile-guided optimizations

– G7 compiler switch for Pentium® 4 optimizations

Register Viewing Tool (RVT)

– Shows the contents of the XMM registers as they are updated

– Plugs into Microsoft* Visual Studio*

Microsoft* Visual Studio* 6.0 Processor Pack*

– Support for SSE and SSE2 instructions, including intrinsics

– Available for free download from Microsoft*

Microsoft* Visual Studio* .NET

– Provides improved support for the Intel® NetBurst™ micro-architecture

– Recognizes XMM registers


Hyperthreading is NOT:

 Hyperthreading is not a full, dual-core processor

 Hyperthreading does not deliver multiprocessor scaling

(Diagrams: Dual processor = two complete processors, each with its own cache, architecture state and APIC. Hyper-Threading = a single processor core and on-die cache carrying two architecture states and two APICs. Dual core = two processor cores sharing an on-die cache, each core with its own architecture state and APIC.)

Backup


TERMS

 Branch: transfer of control to an address different from that of the next instruction. Unconditional or conditional.

 Branch Prediction: the ability to guess the target of a conditional branch. The guess can be wrong, in which case we have a mis-predict.

 CISC: complex instruction set computer

 Compiler: a tool translating high-level instructions into low-level machine instructions. Output can be asm source (ASCII) or binary machine code.

 EPIC (Explicitly Parallel Instruction Computing): a new architecture jointly defined by Intel® and HP. It is the foundation of the new 64-bit Instruction Set Architecture.



 Explicit parallelism: the intended ability of two tasks to be executed by design (explicitly) at the same time. A task can be as simple as an instruction, or as complex as a complete program.

 Implicit parallelism: the incidental ability of two or more tasks to be executed at the same time. Example: a sequence of integer add and FP convert instructions without common registers or memory addresses, executed on a target machine that happens to have the respective HW modules available.



 Instruction Set Architecture (ISA): the architecturally visible instructions that perform software functions and direct operations within the processor. HP and Intel® jointly developed a new 64-bit ISA. This ISA integrates technical concepts from the EPIC technology.

 Memory latency: the time to move data from memory to the processor, at the request of the processor.

 Mispredict: a wrong guess of where the new flow of control will continue as a result of a branch (or similar control-flow instruction).
