Computer Sciences Department
University of Wisconsin-Madison
Multifacet Project ( www.cs.wisc.edu/multifacet )
August 2004
Full Disclosure: Consult for Sun & US NSF
© 2004 Mark D. Hill
Wisconsin Multifacet Project
• Issues
– Moore’s Law, etc.
– Instruction Level Parallelism for More Performance
– But Memory Latency Growing (e.g., one miss costs ~200 FP multiplies)
• Must Exploit Memory Level Parallelism
– At Thread: Runahead & Continual Flow Pipeline
– At Processor: Simultaneous Multithreading
– At Chip: Chip Multiprocessing
• Computer Architecture Drivers
– Moore’s Law, Microprocessors, & Caching
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
• Improving MLP of a Core or Chip
• CMP Systems
• Parameters
– $16 base
– 59% growth/year
– 40 years
• Initially, $16 buys a book
• 3rd year's $64 buys a computer game
• 16th year's $27,000 buys a car
• 22nd year's $430,000 buys a house
• 40th year's >$1 billion buys a lot
You have to find fundamentally new ways to spend money!
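A minimal sketch of the compounding behind these milestones (parameters from the slide; the printed values land near the book/game/car/house figures above):

/* compound.c: $16 growing 59%/year for 40 years, per the slide */
#include <stdio.h>

int main(void) {
    double dollars = 16.0;                     /* base amount */
    for (int year = 1; year <= 40; year++) {
        dollars *= 1.59;                       /* 59% growth per year */
        if (year == 3 || year == 16 || year == 22 || year == 40)
            printf("year %2d: $%.0f\n", year, dollars);
    }
    return 0;
}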
• First Microprocessor in 1971
– Processor on one chip
– Intel 4004
– 2300 transistors
– Barely a processor
– Could access 300 bytes of memory (0.0003 megabytes)
• Use more and faster transistors in parallel
• Other technologies improving rapidly
– Magnetic disk capacity
– DRAM capacity
– Fiber-optic network bandwidth
• Other aspects improving slowly
– Delay to memory
– Delay to disk
– Delay across networks
• Computer Implementor’s Challenge
– Design with dissimilarly expanding resources
– To Double computer performance every two years
– A.k.a., (Popular) Moore’s Law
• VAX-11/780
– 1 instruction time ≈ 1 memory access time
• Now
– 100s of instruction times ≈ 1 memory access time
• Caching Applied Recursively
– Registers
– Level-one cache
– Level-two cache
– Memory
– Disk
– (File Server)
– (Proxy Cache)
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
– Pipelining & Out-of-Order
– Intel P3, P4, & Banias
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
• Improving MLP of a Core or Chip
• CMP Systems
• Non-Pipelined (faster via Bit Level Parallelism (BLP)) [timeline diagram]
• Pipelined (ILP + BLP; 1st RISC microprocessors) [timeline diagram]
• SuperScalar (& Pipelined) [timeline diagram]
• Add Cache Misses (shown in red in the diagram) [timeline diagram]
What if the misses are data independent?
• Out-of-Order (& SuperScalar & Pipelined) [timeline diagram]
• In-order fetch, decode, rename, & issuing of instructions with good branch prediction
• Out-of-order speculative execution of instructions in “window”, honoring data dependencies
• In-order retirement, preserving sequential instruction semantics
• “CISC” Twist to Out-of-Order
– In-order front end cracks x86 instructions into micro-ops (like RISC instructions)
– Out-of-order execution
– In-Order retirement of micro-ops in x86 instruction groups
• Used in Pentium Pro, II, & III
– 3-way superscalar of micro-ops
– 10-stage pipeline (determines branch misprediction penalty)
– Sophisticated branch prediction
– Deep pipeline allowed scaling for many generations
• Follows the basic approach of the P6 core
• Trace Cache stores dynamic micro-op sequences
• 20-stage pipeline (determines branch misprediction penalty)
• 128 active micro-ops (48 loads & 24 stores)
• Deep pipeline to allow scaling for many generations
• Why? I can speculate
• Too Much Power?
– More transistors
– Higher-frequency transistors
– Designed before power became first-order design constraint
• Too Little Performance? Time/Program =
– Instructions/Program * Cycles/Instruction * Time/Cycle
• For a fixed x86 instruction count, performance tracks Instructions/Cycle * Frequency
• Did Pentium 4's Instructions/Cycle losses outweigh its Frequency gains?
• Intel moving away from marketing with frequency!
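Written out, the identity the last bullets lean on, plus an illustrative worked instance (the 1.5x and 0.6x figures below are made up, not measured Pentium 4 numbers):

\[
\frac{\text{Time}}{\text{Program}} =
\frac{\text{Instructions}}{\text{Program}} \times
\frac{\text{Cycles}}{\text{Instruction}} \times
\frac{\text{Time}}{\text{Cycle}}
\;\Rightarrow\;
\text{Performance} \propto \frac{\text{Instructions}}{\text{Cycle}} \times f
\]

\[
\text{e.g., } 1.5\times f \text{ with } 0.6\times \text{IPC} \Rightarrow 1.5 \times 0.6 = 0.9\times \text{ performance}
\]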
• For laptops, but now more general
– Key: a feature must add 1% performance for each 3% power it costs
– Why: Increasing voltage for 1% perf. costs 3% power
• Techniques
– Enhance Intel SpeedStep™
– Shorter pipeline (more like P6)
– Better branch predictor (e.g., loops)
– Special handling of memory stack
– Fused micro-ops
– Lower power transistors (off critical path)
• Worry about power & energy (not this talk)
• Memory latency too great for out-of-order cores to tolerate (coming next)
Memory Level Parallelism for Thread, Processor, & Chip!
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
– Cause & Effect
• Improving MLP of Thread
• Improving MLP of a Core or Chip
• CMP Systems
• Out-of-Order (& Super-Scalar & Pipelined) [timeline diagram]
• But Off-Chip Misses are now hundreds of cycles [timeline diagram]
Good Case! (the misses overlap)
• More Realistic Case [timeline diagram: instructions I1-I4 in a 4-instruction window]
• Why does the yellow instruction block?
– Assumes a 4-instruction window (maximum outstanding)
– The yellow instruction awaits “instruction - 4” (the 1st cache miss)
– Actual windows are 32-64 instructions, but an L2 miss is slower still
• Key Insight: Memory-Level Parallelism (MLP)
[Chou, Fahs, & Abraham, ISCA 2004]
• Good Case: MLP = 2 [diagram: compute & memory phases, two misses per memory phase]
• Bad Case: MLP = 1 [diagram: compute & memory phases, one miss per memory phase]
• MLP = # Off-Chip Accesses / # Memory Phases
• Execution has Compute & Memory Phases
– Compute Phases largely overlap Memory Phases
– In the limit, as Memory Latency increases, …
• Compute Phases are hidden by Memory Phases
– Execution Time = # Memory Phases * Memory Latency
• Execution Time =
(# Off-Chip Accesses / MLP) * Memory Latency
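Substituting the MLP definition makes the last step explicit; the 600-access, 300-cycle numbers below are illustrative only:

\[
\#\text{MemPhases} = \frac{\#\text{OffChipAccesses}}{\text{MLP}}
\;\Rightarrow\;
\text{ExecTime} \approx \frac{\#\text{OffChipAccesses}}{\text{MLP}} \times \text{MemLatency}
\]

\[
\text{e.g., } \frac{600}{2} \times 300 = 90{,}000 \text{ cycles; raising MLP to 4 halves it to } 45{,}000.
\]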
• Execution Time =
(# Off-Chip Accesses / MLP) * Memory Latency
• Reduce # Off-Chip Accesses
– E.g., better caches or compression (Multifacet)
• Reduce Memory Latency
– E.g., on-chip memory controller (AMD)
• Increase MLP (next slides)
• Processor changes that don’t affect MLP don’t help!
• Issue window and reorder buffer size
• Instruction fetch off-chip accesses
• Unresolvable mispredicted branches
• Load and branch issue restrictions
• Serializing instructions
• Depending on data from off-chip memory accesses
• For addresses
– Bad: Pointer chasing with poor locality
– Good: Arrays, where address calculation is separate from the data (see the sketch after this list)
• For unpredictable branch decisions
– Bad: Branching on data values with poor locality
– Good: Iterative loops with highly predictable branching
• But, as a programmer, can you tell which accesses go off-chip?
• Also: very poor instruction locality
& frequent system calls, context switches, etc.
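A minimal sketch of the good/bad contrast above (my illustration, not the talk's; assume both structures greatly exceed the caches):

/* Bad: each load's address comes from the previous load, so off-chip
 * misses serialize (MLP ~ 1). */
#include <stddef.h>

struct node { struct node *next; int value; };

long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;            /* must wait for this miss to find next */
    return sum;
}

/* Good: addresses a+0, a+1, ... are computed without the loaded data,
 * so the hardware can overlap many misses (high MLP). */
long sum_array(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];                /* address independent of loaded data */
    return sum;
}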
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
– Runahead, Continual Flow Pipeline
• Improving MLP of a Core or Chip
• CMP Systems
• Base Out-of-Order, MLP = 1 [timeline diagram: misses I1-I4, blocked by the 4-instruction window]
• With Runahead, MLP = 2:
1. Normal mode
2. Checkpoint
3. Runahead mode
4. Restore checkpoint
5. Normal mode (but faster)
Runahead Execution [Dundas ICS97, Mutlu HPCA03]
1. Execute normally until instruction M's off-chip access blocks issue of more instructions
2. Checkpoint processor
3. Discard instruction M, set M's destination register to poisoned, & speculatively Runahead
– Instructions propagate poisoned from source to destination
– Seek off-chip accesses to start prefetches & increase MLP
4. Restore checkpoint when off-chip access M returns
5. Resume normal execution (hopefully faster)
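A toy, trace-driven model of the payoff (entirely my simplification: LAT and RUN_DEPTH are made-up parameters, and real runahead operates on a live pipeline, not a trace):

/* runahead_model.c: misses close together overlap under runahead */
#include <stdio.h>

#define LAT 300        /* off-chip latency in cycles (illustrative) */
#define RUN_DEPTH 100  /* how far runahead mode gets ahead (illustrative) */

int main(void) {
    int miss_at[] = {10, 15, 400, 405};  /* instruction numbers that miss */
    int n = (int)(sizeof miss_at / sizeof miss_at[0]);

    long blocking = 0, runahead = 0;
    for (int i = 0; i < n; i++) {
        blocking += LAT;                 /* blocking core pays every miss */
        /* runahead overlaps a miss with the previous one if it can be
         * reached while that previous miss is still outstanding */
        if (i == 0 || miss_at[i] - miss_at[i - 1] > RUN_DEPTH)
            runahead += LAT;             /* starts a new memory phase */
    }
    printf("blocking: %ld stall cycles, runahead: %ld stall cycles\n",
           blocking, runahead);          /* 1200 vs 600: MLP rose 1 -> 2 */
    return 0;
}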
Continual Flow Pipelines [Srinivasan et al., ASPLOS 2004]
Simplified Example:
– On off-chip access M, free many pipeline resources, but SAVE rather than discard
– Keep decoding instructions
– SAVE instructions dependent on M
– Execute instructions independent of M
– When M completes, execute the SAVED instructions
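A minimal sketch of that save/drain flow (my illustration; a real slice buffer holds decoded micro-ops and renamed registers, not indices):

/* cfp_model.c: park M's dependents aside, drain them once M returns */
#include <stdio.h>

enum { INDEPENDENT = 0, DEPENDS_ON_M = 1 };

int main(void) {
    int insn[] = {INDEPENDENT, DEPENDS_ON_M, INDEPENDENT,
                  DEPENDS_ON_M, INDEPENDENT};
    int n = 5, saved[5], nsaved = 0;

    for (int i = 0; i < n; i++) {
        if (insn[i] == DEPENDS_ON_M)
            saved[nsaved++] = i;        /* SAVE: dependents leave the window */
        else
            printf("execute insn %d while M is outstanding\n", i);
    }
    /* M's data returns: execute only the saved dependents, exactly once */
    for (int j = 0; j < nsaved; j++)
        printf("execute saved insn %d after M returns\n", saved[j]);
    return 0;
}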
• Runahead
– Discards dependent instructions
– Speculatively executes independent instructions
– When the miss returns, re-executes both dependent & independent instructions
• Continual Flow Pipeline
– Saves dependent instructions
– Executes independent instructions
– When miss returns, executes only saved dependent instructions
• Assessment
– Both allow MLP to break past window limits
– Both limited by branch prediction accuracy on unresolved branches
– Continual Flow Pipeline sounds even more appealing
– But may not be worthwhile (vs. Runahead), & it raises memory-ordering issues
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
• Improving MLP of a Core or Chip
– Core: Simultaneous Multithreading
– Chip: Chip Multiprocessing
• CMP Systems
• Runahead & Continual Flow seek MLP for a Thread
• More MLP for a Processor?
– More parallel off-chip accesses for a processor?
– Yes: Simultaneous Multithreading
• More MLP for a Chip?
– More parallel off-chip accesses for a chip?
– Yes: Chip Multiprocessing
• Exploit workload Thread Level Parallelism (TLP)
• Turn a physical processor into S logical processors
• Need S copies of architectural state, S=2, 4, (8?)
– PC, Registers, PSW, etc. (small!)
• Completely share
– Caches, functional units, & datapaths
• Manage via threshold sharing, partition, etc.
– Physical registers, issue queue, & reorder buffer
• Intel calls this Hyper-Threading in the Pentium 4
– 1.4x performance for S=2 with little extra area, but added complexity
– But Pentium 4 is now dead & there is no Hyper-Threading in Banias
• Programming
– Supports finer-grained sharing than old-style MP
– But gains are less than S, and S is small
• Have a Multi-Threaded Workload?
– Hides off-chip latencies better than Runahead
– E.g., 4 threads w/ MLP 1.5 each → chip MLP = 6 (see the sketch after this list)
• Have a Single-Threaded Workload?
– Base SMT is no help
– Many “Helper Thread” Ideas
• Expect SMT in processors for servers
• Probably SMT even in processors for clients
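A workload-side sketch of that arithmetic (mine, not the talk's): two software threads, each an MLP~1 pointer chase, expose independent miss streams that an SMT core (or a CMP) can overlap. NODES is an arbitrary size; compile with -pthread.

/* smt_workload.c: two independent miss streams from two threads */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES 1000000

struct node { struct node *next; long pad[7]; };  /* ~one cache line */

static void *walk(void *arg) {
    long steps = 0;
    for (struct node *p = arg; p != NULL; p = p->next)
        steps++;                      /* dependent loads: MLP ~ 1 per thread */
    printf("walked %ld nodes\n", steps);
    return NULL;
}

static struct node *build_list(void) {
    struct node *head = NULL;
    for (int i = 0; i < NODES; i++) {
        struct node *n = malloc(sizeof *n);
        n->next = head;
        head = n;
    }
    return head;
}

int main(void) {
    pthread_t t1, t2;
    /* two separate lists -> the threads' misses are independent */
    pthread_create(&t1, NULL, walk, build_list());
    pthread_create(&t2, NULL, walk, build_list());
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done; an SMT core can overlap the two walks' misses");
    return 0;
}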
• Not worthwhile to spend the whole (growing) transistor budget on cache
• Replicate Processor
• Private L1 Caches
– Low latency
– High bandwidth
• Shared L2 Cache
– Larger than if private
[Block diagram: a single CPU]
Alpha core: 1-issue, in-order, 500MHz
Next few slides from Luiz Barroso's ISCA 2000 presentation of Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
[Block diagram: CPU with split L1 I$ and D$]
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
[Block diagram: eight CPUs, each with L1 I$ and D$, around the intra-chip switch]
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
[Block diagram: L2 banks added between the L1 caches and the ICS]
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
[Block diagram: memory controllers added at the chip edge]
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec (8 banks @ 1.6GB/sec)
[Block diagram: home (HE) & remote (RE) protocol engines added]
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec
Protocol Engines (HE & RE): programmable, 1K instr., even/odd interleaving
[Block diagram: four chip-to-chip links added]
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec
Protocol Engines (HE & RE): programmable, 1K instr., even/odd interleaving
System Interconnect: 4 links @ 8GB/s, 4-port Xbar router, topology independent, 32GB/sec total bandwidth
[Bar chart from the Piranha talk: normalized execution time, split into CPU, L2Hit, and L2Miss components, for P1 (500MHz, 1-issue), INO (1GHz, 1-issue, in-order), OOO (1GHz, 4-issue, out-of-order), and Piranha P8 (eight 500MHz 1-issue cores), on OLTP and DSS; OLTP bars roughly 233, 145, 100, and 34 respectively.]
• Piranha's performance margin: 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses, so it better utilizes the memory system
• Programming
– Supports finer-grained sharing than old-style MP
– But not as fine as SMT (yet)
– Many cores can make performance gain large
• Can Yield MLP for Chip!
– Can do CMP of SMT processors
– C cores of S-way SMT with T-way MLP per thread
– Yields Chip MLP of C*S*T (e.g., 8*2*2 = 32)
• Most Servers have Multi-Threaded Workload
• CMP is a Server Inflection Point
– Expect >10x performance for less cost,
implying >>10x cost-performance
• Most Clients (Today) have Single-Threaded Workloads
– Base CMP is no help
– Use Thread Level Speculation?
– Use Helper Threads?
• CMPs for Clients?
– Depends on Threads
– CMP costs significant chip area (unlike SMT)
• Computer Architecture Drivers
• Instruction Level Parallelism (ILP) Review
• Memory Level Parallelism (MLP)
• Improving MLP of Thread
• Improving MLP of a Core or Chip
• CMP Systems
– Small, Medium, but Not Large
– Wisconsin Multifacet Token Coherence
• Use One CMP (with C cores of S-way SMT)
– C starts 2-4 and grows to 16-ish
– S starts at 2, may stay at 2 or grow to 4
– Fits on your desk!
• Directly Connect CMP (C) to Memory Controller (M) or DRAM [small diagram]
• If Threads Useful
– >10X Performance vs. Uniprocessor
– >>10X Cost-Performance vs. non-CMP SMP
• Commodity Server!
• Use 2-16 CMPs (each with C cores of S-way SMT)
– Small: 4*4*2 = 32 threads
– Large: 16*16*4 = 1024 threads
• Connect CMPs (C) & Memory Controllers (M), or DRAM directly
[Diagrams: Processor-Centric, Memory-Centric, and Dance Hall topologies]
• 1000s of CMPs?
• Will not happen in the commercial market
– Instead will network CMP systems into clusters
– Enhances availability & reduces cost
– Poor latency acceptable
• Market for large scientific machines probably ~$0 Billion
• Market for large government machines similar
– Nevertheless, government can make this happen (like bombers)
• The rest of us will use
– A small- or medium-CMP system
– A cluster of small- or medium-CMP systems
Wisconsin Multifacet ( www.cs.wisc.edu/multifacet )
• Designing Commercial Servers
• Availability: SafetyNet Checkpointing [ISCA 2002]
• Programability: Flight Data Recorder [ISCA 2003]
• Methods: Simulating a $2M Server on a $2K PC
[Computer 2003]
• Performance: Cache Compression [ISCA 2004]
• Simplicity & Performance: Token Coherence (next)
• Coherence Invariant (for any memory block at any time):
– One writer or multiple readers
• Conventionally implemented with distributed Finite State Machines
• Invariant only indirectly enforced (bus order, acks, blocking, etc.)
• Token Coherence Directly Enforces it (see the sketch after this list)
– Each memory block has T tokens
– Token counts stored with the data (even in messages)
– A processor needs all T tokens to write
– A processor needs at least one token to read
• Last year: Glueless Multiprocessor
– Speedup 17-54% vs directory
• This Year: Medium CMP Systems
– Flat for correctness
– Hierarchical for performance
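A minimal sketch of the token-counting rules above (my illustration, not the Multifacet implementation, under simplified assumptions: one block, T = 4, no data transfer or races shown):

/* token.c: the two token-coherence rules, directly checked */
#include <stdbool.h>
#include <stdio.h>

#define T 4                         /* tokens per memory block */

struct cache { int tokens; };       /* tokens this processor holds */

bool can_read(const struct cache *c)  { return c->tokens >= 1; }
bool can_write(const struct cache *c) { return c->tokens == T; }

/* token counts travel with data messages; here we just move counts */
static void transfer(struct cache *from, struct cache *to, int k) {
    from->tokens -= k;
    to->tokens   += k;
}

int main(void) {
    struct cache p0 = { T }, p1 = { 0 };   /* p0 starts with all tokens */
    printf("p0 can write? %d\n", can_write(&p0));      /* 1: holds all T */
    transfer(&p0, &p1, 1);                 /* hand p1 a single token */
    printf("p0 can write? %d, p1 can read? %d\n",
           can_write(&p0), can_read(&p1)); /* 0 and 1: a reader blocks writers */
    return 0;
}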
Must Exploit Memory Level Parallelism!
At Thread : Runahead & Continual Flow Pipeline
At Processor : Simultaneous Multithreading
At Chip : Chip Multiprocessing
Talk to be filed: Google Mark Hill > Publications > 2004