15-740/18-740 Computer Architecture
Lecture 3: SIMD, MIMD, and ISA Principles

Prof. Onur Mutlu
Carnegie Mellon University
Fall 2011, 9/14/2011
Review of Last Lecture
- Vector processors
  - Advantages and disadvantages
  - Amortization of control overhead over multiple data elements
  - Vector vs. scalar code execution time
- SIMD vs. MIMD
- Concept of on-chip networks
Today
- Wrap up intro to on-chip networks
- ISA-level tradeoffs
Review: SIMD vs. MIMD
- SIMD:
  - Concurrency arises from performing the same operations on different pieces of data
- MIMD:
  - Concurrency arises from performing different operations on different pieces of data
  - Control/thread parallelism: execute different threads of control in parallel → multithreading, multiprocessing
  - Idea: Use multiple processors to solve a problem
Review: Flynn's Taxonomy of Computers
- Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966
- SISD: Single instruction operates on single data element
- SIMD: Single instruction operates on multiple data elements
  - Array processor
  - Vector processor
- MISD: Multiple instructions operate on single data element
  - Closest form: systolic array processor, streaming processor
- MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  - Multiprocessor
  - Multithreaded processor
Review: On-Chip Network Based Multi-Core Systems
- A scalable multi-core is a distributed system on a chip

[Figure: a mesh of processing elements (PEs: cores, L2 banks, memory controllers, accelerators, etc.), each attached to a router (R). The router detail shows input ports with virtual-channel buffers (VC 0/1/2) for the East, West, North, South, and local PE inputs; control logic with a routing unit (RC), VC allocator (VA), and switch allocator (SA); and a crossbar feeding the East, West, North, South, and PE outputs.]
Review: Idea of On-Chip Networks
- Problem: Connecting many cores with a single bus is not scalable
  - Single point of connection limits communication bandwidth
    - What if multiple core pairs want to communicate with each other at the same time?
  - Electrical loading on the single bus limits bus frequency
- Idea: Use a network to connect cores
  - Connect neighboring cores via short links
  - Communicate between cores by routing packets over the network
- Dally and Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," DAC 2001.
Review: Bus
+ Simple
+ Cost-effective for a small number of nodes
+ Easy to implement coherence (snooping)
- Not scalable to a large number of nodes (limited bandwidth, electrical loading → reduced frequency)
- High contention

[Figure: multiple processors, each with a private cache, sharing a single bus to memory.]
Review: Crossbar
- Every node connected to every other
- Good for a small number of nodes
+ Least contention in the network: high bandwidth
- Expensive
- Not scalable due to quadratic cost
- Used in core-to-cache-bank networks in IBM POWER5 and Sun Niagara I/II

[Figure: an 8x8 crossbar connecting nodes 0-7 on one side to nodes 0-7 on the other.]
Review: Mesh
- O(N) cost
- Average latency: O(sqrt(N))
- Easy to lay out on-chip: regular and equal-length links
- Path diversity: many ways to get from one node to another
- Used in the Tilera 100-core and many on-chip network prototypes
Review: Torus
- Mesh is not symmetric on edges: performance very sensitive to placement of task on edge vs. middle
- Torus avoids this problem
+ Higher path diversity than mesh
- Higher cost
- Harder to lay out on-chip
- Unequal link lengths
Review: Torus, continued
- Weave nodes to make inter-node latencies ~constant

[Figure: a folded torus layout that interleaves ("weaves") the nodes so that all links have roughly equal length.]
Review: Example NoC: 100-core Tilera Processor
- Wentzlaff et al., "On-Chip Interconnection Architecture of the Tile Processor," IEEE Micro 2007.
The Need for QoS in the On-Chip Network
- One can create malicious applications that continuously access the same resource → deny service to less aggressive applications
The Need for QoS in the On-Chip Network
- Need to provide packet scheduling mechanisms that ensure applications' service requirements (bandwidth/latency) are satisfied
- Grot et al., "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-Effective QoS Scheme for Networks-on-Chip," MICRO 2009.
On-Chip Networks: Some Questions
- Is mesh/torus the best topology?
- How do you design the router?
  - Energy-efficient, low latency
  - What is the routing algorithm? Is it adaptive or deterministic?
- How does the router prioritize between different threads'/applications' packets?
  - How does the OS/application communicate the importance of applications to the routers?
  - How does the router provide bandwidth/latency guarantees to applications that need them?
- Where do you place different resources? (e.g., memory controllers)
- How do you maintain cache coherence?
- How does the OS scheduler place tasks?
- How is data placed in distributed caches?
What is Computer Architecture?
- The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.
- We will soon distinguish between the terms architecture, microarchitecture, and implementation.
Why Study Computer Architecture?
Moore's Law
- Moore, "Cramming more components onto integrated circuits," Electronics Magazine, 1965.
Why Study Computer Architecture?
- Make computers faster, cheaper, smaller, more reliable
  - By exploiting advances and changes in underlying technology/circuits
- Enable new applications
  - Life-like 3D visualization 20 years ago?
  - Virtual reality?
  - Personal genomics?
- Adapt the computing stack to technology trends
  - Innovation in software is built on trends and changes in computer architecture
    - > 50% performance improvement per year
- Understand why computers work the way they do
An Example: Multi-Core Systems
[Die photo (credit: AMD Barcelona): a multi-core chip with four cores (CORE 0-3), private L2 caches (L2 CACHE 0-3), a shared L3 cache, a DRAM interface, a DRAM memory controller, and DRAM banks.]
Unexpected Slowdowns in Multi-Core
[Figure: two applications share the memory system; a low-priority "memory performance hog" (Core 1) severely slows down a high-priority application (Core 0).]
- Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," USENIX Security 2007.
Why the Disparity in Slowdowns?
[Figure: a multi-core chip with two cores, running matlab and gcc, each with a private L2 cache, connected via the interconnect to the DRAM memory controller and the shared DRAM memory system (DRAM Banks 0-3). The unfairness arises in the shared DRAM memory system.]
DRAM Bank Operation
[Figure: a DRAM bank. A row decoder selects one row out of many; the selected row's contents are latched into the row buffer; a column mux then selects the requested column and drives out the data. For the access sequence (Row 0, Column 0), (Row 0, Column 1), (Row 0, Column 85), (Row 1, Column 0): the first access opens Row 0, the next two accesses hit in the row buffer, and the final access to Row 1 is a row-buffer conflict, since the bank must close Row 0 and open Row 1.]
DRAM Controllers
- A row-conflict memory access takes significantly longer than a row-hit access
- Current controllers take advantage of the row buffer
- Commonly used scheduling policy (FR-FCFS) [Rixner 2000]* (sketched in code below):
  (1) Row-hit first: Service row-hit memory accesses first
  (2) Oldest-first: Then service older accesses first
- This scheduling policy aims to maximize DRAM throughput

*Rixner et al., "Memory Access Scheduling," ISCA 2000.
*Zuravleff and Robinson, "Controller for a synchronous DRAM ...," US Patent 5,630,096, May 1997.
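A minimal C sketch of the FR-FCFS selection rule described above; the request-queue structure and field names are illustrative, not taken from any real controller:

    /* FR-FCFS sketch: among pending requests to a bank, prefer row hits
       (request's row == currently open row); break ties by age. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t row;      /* DRAM row this request targets */
        uint64_t arrival;  /* arrival time; lower = older   */
        bool     valid;
    } Request;

    int frfcfs_pick(const Request *q, size_t n, uint32_t open_row) {
        int best = -1;
        bool best_hit = false;
        for (size_t i = 0; i < n; i++) {
            if (!q[i].valid) continue;
            bool hit = (q[i].row == open_row);
            if (best < 0 ||
                (hit && !best_hit) ||                                  /* (1) row-hit first */
                (hit == best_hit && q[i].arrival < q[best].arrival)) { /* (2) oldest first  */
                best = (int)i;
                best_hit = hit;
            }
        }
        return best;  /* index of the request to schedule, or -1 if none pending */
    }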
The Problem
- Multiple threads share the DRAM controller
- DRAM controllers are designed to maximize DRAM throughput
- DRAM scheduling policies are thread-unfair
  - Row-hit first: unfairly prioritizes threads with high row buffer locality
    - Threads that keep on accessing the same row
  - Oldest-first: unfairly prioritizes memory-intensive threads
- The DRAM controller is vulnerable to denial of service attacks
  - Can write programs to exploit unfairness (see the sketch below)
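As a concrete illustration of the last point, here is a hedged sketch of such a program, in the spirit of the streaming "memory performance hog" from Moscibroda and Mutlu (2007); the buffer size and stride are invented for illustration:

    /* Stream sequentially through a buffer much larger than the caches,
       so nearly every access goes to DRAM and hits the currently open
       row. Under row-hit-first scheduling, this thread's requests keep
       getting prioritized over other threads' row-conflict requests. */
    #include <stdlib.h>

    #define HOG_BYTES (512UL << 20)   /* 512 MiB: far larger than any cache */
    #define LINE      64              /* one access per cache line          */

    void hog(void) {
        volatile char *buf = malloc(HOG_BYTES);
        if (!buf) return;
        for (;;)                                       /* run forever */
            for (size_t i = 0; i < HOG_BYTES; i += LINE)
                buf[i]++;   /* sequential accesses -> high row-buffer locality */
    }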
Fundamental Concepts
What is Computer Architecture?
- The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.
- Traditional definition: "The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior as distinct from the organization of the data flow and controls, the logic design, and the physical implementation." Gene Amdahl, IBM Journal of R&D, April 1964
Levels of Transformation
[Two views of the computing stack. Simple view: Problem → Algorithm → Program → ISA → Microarchitecture → Circuits → Electrons. Expanded view: Problem → Algorithm → User Programs → Runtime System (VM, OS, MM) → ISA → Microarchitecture → Circuits/Technology → Electrons.]
Levels of Transformation
- ISA
  - Agreed-upon interface between software and hardware
    - SW/compiler assumes, HW promises
  - What the software writer needs to know to write system/user programs
- Microarchitecture
  - Specific implementation of an ISA
  - Not visible to the software
- Microprocessor
  - ISA, uarch, circuits
  - "Architecture" = ISA + microarchitecture
ISA vs. Microarchitecture
- What is part of the ISA vs. the uarch?
  - Gas pedal: interface for "acceleration"
  - Internals of the engine: implement "acceleration"
  - Add instruction vs. adder implementation
    - Bit-serial, ripple-carry, carry-lookahead adders
- The implementation (uarch) can vary as long as it satisfies the specification (ISA)
  - x86 ISA has many implementations: 286, 386, 486, Pentium, Pentium Pro, ...
- Uarch usually changes faster than ISA
  - Few ISAs (x86, SPARC, MIPS, Alpha) but many uarchs
  - Why?
ISA
- Instructions
  - Opcodes, Addressing Modes, Data Types
  - Instruction Types and Formats
  - Registers, Condition Codes
- Memory
  - Address space, Addressability, Alignment
  - Virtual memory management
- Call, Interrupt/Exception Handling
- Access Control, Priority/Privilege
- I/O
- Task Management
- Power and Thermal Management
- Multi-threading support, Multiprocessor support
Microarchitecture
- Implementation of the ISA under specific design constraints and goals
- Anything done in hardware without exposure to software
  - Pipelining
  - In-order versus out-of-order instruction execution
  - Memory access scheduling policy
  - Speculative execution
  - Superscalar processing (multiple instruction issue?)
  - Clock gating
  - Caching? Levels, size, associativity, replacement policy
  - Prefetching?
  - Voltage/frequency scaling?
  - Error correction?
Design Point
- A set of design considerations and their importance
  - Leads to tradeoffs in both ISA and uarch
- Considerations:
  - Cost
  - Performance
  - Maximum power consumption
  - Energy consumption (battery life)
  - Availability
  - Reliability and Correctness (or is it?)
  - Time to Market
- Design point determined by the "Problem" space (application space)
Tradeoffs: Soul of Computer Architecture
- ISA-level tradeoffs
- Uarch-level tradeoffs
- System and task-level tradeoffs
- How to divide the labor between hardware and software
ISA-level Tradeoffs: Semantic Gap
- Where to place the ISA? Semantic gap
  - Closer to high-level language (HLL) or closer to hardware control signals? → Complex vs. simple instructions
  - RISC vs. CISC vs. HLL machines
    - FFT, QUICKSORT, POLY, FP instructions?
    - VAX INDEX instruction (array access with bounds checking; see the sketch below)
- Tradeoffs:
  - Simple compiler, complex hardware vs. complex compiler, simple hardware
    - Caveat: Translation (indirection) can change the tradeoff!
  - Burden of backward compatibility
  - Performance?
    - Optimization opportunity: Example of VAX INDEX instruction: who (compiler vs. hardware) puts more effort into optimization?
    - Instruction size, code size
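To make the semantic gap concrete, the sketch below writes out in C roughly what a single VAX INDEX instruction computes (a bounds check plus one step of array-index arithmetic); the function name is invented, and the trap is modeled as abort():

    /* Approximate semantics of VAX INDEX: one CISC instruction performs
       a subscript bounds check and one step of a (possibly
       multi-dimensional) array index computation. */
    #include <stdint.h>
    #include <stdlib.h>

    int32_t vax_index(int32_t subscript, int32_t low, int32_t high,
                      int32_t size, int32_t index_in) {
        if (subscript < low || subscript > high)
            abort();                          /* subscript-range trap */
        return (index_in + subscript) * size; /* accumulated offset   */
    }

On a simple-instruction ISA, the compiler emits the compares, branch, add, and multiply explicitly, and is then free to optimize them (e.g., hoist the bounds check out of a loop).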
x86: Small Semantic Gap: String Operations
- REP MOVS DEST SRC
- How many instructions does this take in Alpha? (see the sketch below)
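A hedged sketch of the answer: a single REP MOVSB (copy ECX bytes from [ESI] to [EDI], assuming a forward copy) becomes an explicit loop on a load/store ISA like Alpha. The C below maps roughly one-to-one to the instructions a compiler must emit:

    /* One x86 string instruction, expanded for a load/store ISA. Each
       iteration costs roughly five or six Alpha instructions: a load,
       a store, two pointer increments, a count decrement, and a
       conditional branch. */
    void rep_movsb(char *dst, const char *src, unsigned long count) {
        while (count != 0) {   /* loop test + branch */
            *dst = *src;       /* load + store       */
            dst++;
            src++;             /* two pointer adds   */
            count--;           /* count decrement    */
        }
    }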
Small Semantic Gap Examples in VAX
- FIND FIRST
  - Find the first set bit in a bit field
  - Helps OS resource allocation operations
- SAVE CONTEXT, LOAD CONTEXT
  - Special context switching instructions
- INSQUEUE, REMQUEUE
  - Operations on doubly linked lists
- INDEX
  - Array access with bounds checking
- STRING operations
  - Compare strings, find substrings, ...
- Cyclic Redundancy Check instruction
- EDITPC
  - Implements editing functions to display fixed-format output
- Digital Equipment Corp., "VAX11 780 Architecture Handbook," 1977-78.
Small versus Large Semantic Gap
- CISC vs. RISC
  - Complex instruction set computer → complex instructions
    - Initially motivated by "not good enough" code generation
  - Reduced instruction set computer → simple instructions
    - John Cocke, mid 1970s, IBM 801
      - Goal: enable better compiler control and optimization
- RISC motivated by
  - Memory stalls (no work done in a complex instruction when there is a memory stall?)
    - When is this correct?
  - Simplifying the hardware → lower cost, higher frequency
  - Enabling the compiler to optimize the code better
    - Find fine-grained parallelism to reduce stalls
Small versus Large Semantic Gap
- John Cocke's RISC (large semantic gap) concept:
  - Compiler generates control signals: open microcode
- Advantages of small semantic gap (complex instructions)
  + Denser encoding → smaller code size → saves off-chip bandwidth, better cache hit rate (better packing of instructions)
  + Simpler compiler
- Disadvantages
  - Larger chunks of work → compiler has less opportunity to optimize
  - More complex hardware → translation to control signals and optimization needs to be done by hardware
- Read Colwell et al., "Instruction Sets and Beyond: Computers, Complexity, and Controversy," IEEE Computer 1985.
ISA-level Tradeoffs: Instruction Length
- Fixed length: all instructions have the same length
  + Easier to decode a single instruction in hardware
  + Easier to decode multiple instructions concurrently
  -- Wasted bits in instructions (Why is this bad?)
  -- Harder-to-extend ISA (how to add new instructions?)
- Variable length: instructions have different lengths (determined by opcode and sub-opcode)
  + Compact encoding (Why is this good?)
    - Intel 432: Huffman encoding (sort of); 6- to 321-bit instructions. How?
  -- More logic to decode a single instruction
  -- Harder to decode multiple instructions concurrently (see the sketch below)
- Tradeoffs
  - Code size (memory space, bandwidth, latency) vs. hardware complexity
  - ISA extensibility and expressiveness
  - Performance? Smaller code vs. imperfect decode
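To see why variable length hurts concurrent decode, consider a hypothetical encoding (not x86's and not the 432's) in which the top two bits of the first byte give the instruction length. Locating instruction i+1 requires decoding instruction i first, a serial dependence that fixed-length ISAs avoid:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical format: top 2 bits of the first byte encode length. */
    static size_t insn_length(uint8_t first_byte) {
        static const size_t len[4] = { 1, 2, 4, 8 };
        return len[first_byte >> 6];
    }

    /* Record the byte offsets of the first n instructions. Each offset
       depends on the lengths of all previous instructions, so the
       boundaries cannot be found in parallel. */
    void find_boundaries(const uint8_t *code, size_t n, size_t *offset) {
        size_t pc = 0;
        for (size_t i = 0; i < n; i++) {
            offset[i] = pc;
            pc += insn_length(code[pc]);
        }
    }

With a fixed 4-byte format, offset[i] is simply 4*i, computable for all instructions at once.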
ISA-level Tradeoffs: Uniform Decode
- Uniform decode: the same bits in each instruction correspond to the same meaning
  - Opcode is always in the same location
  - Ditto operand specifiers, immediate values, ...
  - Many "RISC" ISAs: Alpha, MIPS, SPARC
  + Easier decode, simpler hardware
  + Enables parallelism: generate target address before knowing the instruction is a branch
  -- Restricts instruction format (fewer instructions?) or wastes space
- Non-uniform decode
  - E.g., the opcode can be the 1st-7th byte in x86
  + More compact and powerful instruction format
  -- More complex decode logic (e.g., more logic to speculatively generate branch target)
x86 vs. Alpha Instruction Formats
[Figure: the variable-length x86 instruction format vs. the fixed-length 32-bit Alpha instruction formats.]
ISA-level Tradeoffs: Number of Registers
- Affects:
  - Number of bits used for encoding register address
  - Number of values kept in fast storage (register file)
  - (uarch) Size, access time, power consumption of register file
- Large number of registers:
  + Enables better register allocation (and optimizations) by compiler → fewer saves/restores
  -- Larger instruction size
  -- Larger register file size
  -- (Superscalar processors) More complex dependency check logic
ISA-level Tradeoffs: Addressing Modes
- An addressing mode specifies how to obtain an operand of an instruction
  - Register
  - Immediate
  - Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, ...)
- More modes:
  + help better support programming constructs (arrays, pointer-based accesses); see the sketch below
  -- make it harder for the architect to design
  -- too many choices for the compiler?
    - Many ways to do the same thing complicates compiler design
    - Read Wulf, "Compilers and Computer Architecture"
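An illustrative example of the first point: the same array access, lowered with and without a scaled-index addressing mode. The assembly in the comments is schematic, not exact x86 or Alpha syntax:

    /* Source:  x = a[i];   (a: int*, i: index)
     *
     * With a scaled-index mode (x86-style), the address computation
     * folds into the load itself:
     *     mov  eax, [rbx + rsi*4]    ; one instruction: base + index*4
     *
     * Without such a mode (load/store RISC-style), the address
     * arithmetic takes explicit instructions:
     *     sll  t0, i, 2              ; t0 = i * 4
     *     add  t0, t0, a             ; t0 = a + i*4
     *     ldw  x, 0(t0)              ; load
     */
    int load_elem(const int *a, long i) {
        return a[i];  /* the compiler picks a lowering like one of the above */
    }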
x86 vs. Alpha Instruction Formats
[Figure, repeated from earlier: the variable-length x86 format vs. the fixed-length 32-bit Alpha formats.]
x86
[Figure: x86 instruction encodings illustrating the register, register indirect, absolute, and register + displacement addressing modes.]

x86
[Figure: x86 instruction encodings illustrating the indexed (base + index) and scaled (base + index*4) addressing modes.]
Other ISA-level Tradeoffs
- Load/store vs. memory/memory
- Condition codes vs. condition registers vs. compare&test
- Hardware interlocks vs. software-guaranteed interlocking
- VLIW vs. single instruction vs. SIMD
- 0-, 1-, 2-, 3-address machines (stack, accumulator, 2- or 3-operand)
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Aligned vs. unaligned access
- Supported data types
- Software- vs. hardware-managed page fault handling
- Granularity of atomicity
- Cache coherence (hardware vs. software)
- ...
Programmer vs. (Micro)architect
- Many ISA features designed to aid programmers
- But they complicate the hardware designer's job
- Virtual memory
  - vs. overlay programming
  - Should the programmer be concerned about the size of code blocks?
- Unaligned memory access
  - Compiler/programmer needs to align data
- Transactional memory?
- VLIW vs. SIMD? Superscalar execution vs. SIMD?
Transactional Memory
THREAD 1 and THREAD 2 both execute the same lock-based enqueue:

    enqueue (Q, v) {
        Node_t node = malloc(...);
        node->val = v;
        node->next = NULL;
        acquire(lock);
        if (Q->tail)
            Q->tail->next = node;
        else
            Q->head = node;
        Q->tail = node;
        release(lock);
    }

With transactional memory, each thread performs the same operation without locks:

    begin-transaction
    ...
    enqueue (Q, v);  // no locks
    ...
    end-transaction
Transactional Memory
- A transaction is executed atomically: ALL or NONE
- If there is a data conflict between two transactions, only one of them completes; the other is rolled back
  - Both write to the same location
  - One reads from the location another writes
ISA-level Tradeoff: Supporting TM
- Still under research
- Pros:
  - Could make programming with threads easier
  - Could improve parallel program performance vs. locks. Why?
- Cons:
  - What if it does not pan out?
  - All future microarchitectures might have to support the new instructions (for backward compatibility reasons)
  - Complexity?
- How does the architect decide whether or not to support TM in the ISA? (How to evaluate the whole stack)
ISA-level Tradeoffs: Instruction Pointer
- Do we need an instruction pointer in the ISA?
  - Yes: control-driven, sequential execution
    - An instruction is executed when the IP points to it
    - IP automatically changes sequentially (except for control flow instructions)
  - No: data-driven, parallel execution
    - An instruction is executed when all its operand values are available (data flow); see the sketch below
- Tradeoffs: MANY high-level ones
  - Ease of programming (for average programmers)?
  - Ease of compilation?
  - Performance: extraction of parallelism?
  - Hardware complexity?
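A minimal sketch of the data-driven firing rule; the structures and names are illustrative, not taken from any real dataflow machine:

    /* Each instruction counts its outstanding operands and becomes
       ready the moment that count reaches zero -- no instruction
       pointer involved. */
    #define MAX_CONSUMERS 4

    typedef struct Insn {
        int          missing;                  /* operands not yet produced      */
        struct Insn *consumers[MAX_CONSUMERS]; /* instructions using this result */
        int          nconsumers;
    } Insn;

    /* Called when an instruction finishes: deliver its result token to
       each consumer; a consumer whose last operand just arrived fires. */
    void complete(Insn *done, void (*fire)(Insn *)) {
        for (int i = 0; i < done->nconsumers; i++) {
            Insn *c = done->consumers[i];
            if (--c->missing == 0)
                fire(c);  /* all operands available: execute, in any order */
        }
    }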
The Von-Neumann Model
[Figure: the von Neumann machine organization: MEMORY accessed through a memory address register and a memory data register; a PROCESSING UNIT containing the ALU and temporary registers; a CONTROL UNIT containing the instruction pointer (IP) and instruction register; and INPUT/OUTPUT.]
The Von-Neumann Model
- Stored program computer (instructions in memory)
- One instruction at a time
- Sequential execution
- Unified memory
  - The interpretation of a stored value depends on the control signals
- All major ISAs today use this model
- Underneath (at the uarch level), the execution model is very different
  - Multiple instructions at a time
  - Out-of-order execution
  - Separate instruction and data caches
Fundamentals of Uarch Performance Tradeoffs
An idealized machine has three parts, each perfect:
- Instruction Supply
  - Zero-cycle latency (no cache miss)
  - No branch mispredicts
  - No fetch breaks
- Data Path (Functional Units)
  - Perfect data flow (reg/memory dependencies)
  - Zero-cycle interconnect (operand communication)
  - Enough functional units
  - Zero-latency compute?
- Data Supply
  - Zero-cycle latency
  - Infinite capacity
  - Zero cost
We will examine all of these throughout the course (especially data supply)
How to Evaluate Performance Tradeoffs

Execution time = time/program
               = (# instructions / program) × (# cycles / instruction) × (time / cycle)

- instructions/program: determined by the algorithm, the program, the ISA, and the compiler
- cycles/instruction: determined by the ISA and the microarchitecture
- time/cycle: determined by the microarchitecture, logic design, circuit implementation, and technology
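A quick worked example (all numbers invented for illustration): a program that executes 2 × 10^9 instructions at an average CPI of 1.5 on a 2 GHz clock (0.5 ns/cycle) takes

    2 × 10^9 instructions × 1.5 cycles/instruction × 0.5 ns/cycle = 1.5 s

Halving any one factor, via a better algorithm, a lower-CPI microarchitecture, or a faster clock, halves execution time; this is why tradeoffs at every level of the stack show up in the same equation.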
Improving Performance
- Reducing instructions/program
- Reducing cycles/instruction (CPI)
- Reducing time/cycle (clock period)
Improving Performance (Reducing Exec Time)
- Reducing instructions/program
  - More efficient algorithms and programs
  - Better ISA?
- Reducing cycles/instruction (CPI)
  - Better microarchitecture design
    - Execute multiple instructions at the same time
    - Reduce latency of instructions (1-cycle vs. 100-cycle memory access)
- Reducing time/cycle (clock period)
  - Technology scaling
  - Pipelining
Improving Performance: Semantic Gap
- Reducing instructions/program
  - Complex instructions: small code size (+)
  - Simple instructions: large code size (--)
- Reducing cycles/instruction (CPI)
  - Complex instructions: (can) take more cycles to execute (--)
    - REP MOVS
    - How about ADD with condition code setting?
  - Simple instructions: (can) take fewer cycles to execute (+)
- Reducing time/cycle (clock period)
  - Does instruction complexity affect this?
    - It depends