Review of Chapters 3 & 4
Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 3 Review

- Baseline: simple MIPS 5-stage pipeline
  - IF, ID, EX, MEM, WB
- How to exploit Instruction-Level Parallelism (ILP) to improve performance?
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
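The CPI decomposition can be checked with a quick calculation; the stall rates below are made-up illustrative numbers, not figures from the text.

```python
# Pipeline CPI = ideal CPI + structural + data-hazard + control stall
# contributions, each measured in stall cycles per instruction.
def pipeline_cpi(ideal, structural, data_hazard, control):
    return ideal + structural + data_hazard + control

# Hypothetical stall rates (invented for illustration): ideal CPI 1.0,
# 0.05 structural, 0.20 data-hazard, 0.15 control stalls per instruction.
cpi = pipeline_cpi(1.0, 0.05, 0.20, 0.15)
print(round(cpi, 2))  # 1.4 -- every stall term adds directly to the CPI
```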
Hazards & Stalls

- Structural Hazards
  - Cause: resource contention
  - Solution: add more resources & better scheduling
- Data Hazards
  - Cause: dependences
    - True data dependence: a property of the program (RAW)
    - Name dependence: reuse of registers (WAR & WAW)
  - Solution: loop unrolling, dynamic scheduling, register renaming, hardware speculation
- Control Hazards
  - Cause: branch instructions change the program flow
  - Solution: loop unrolling, branch prediction, hardware speculation
Ideal CPI

- Multiple Issue
Loop Unrolling (pp. 161)

- Finds that the loop iterations are independent
- Uses different registers to avoid unnecessary constraints (name dependences)
- Eliminates the extra test and branch instructions (control dependence)
- Interchanges the load and store instructions where possible (to fill stall cycles)
- Schedules the code to avoid/mitigate stalls while maintaining the true data dependences
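A minimal sketch of the idea in Python (the text unrolls a MIPS loop on pp. 161; this is only a scalar analogy): unrolling by 4 removes three of every four loop-overhead tests, and the four independent bodies mimic the use of distinct registers.

```python
# Original loop: one loop test per element.
def add_scalar(x, s):
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

# Unrolled by 4: one loop test per 4 elements; the four statements touch
# distinct elements (like distinct registers, avoiding name dependences),
# so they can be scheduled independently. Assumes len(x) % 4 == 0.
def add_scalar_unrolled(x, s):
    for i in range(0, len(x), 4):
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
    return x

print(add_scalar_unrolled([1, 2, 3, 4, 5, 6, 7, 8], 10))
```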
Branch Prediction

- 1-bit or 2-bit predictor: local predictor
  - Uses the past results of the branch itself as an indicator
- Correlated predictor: global predictor
  - Uses the past results of correlated branches as an indicator
- (m, n) predictor: two-level predictor
  - The number of bits in an (m, n) predictor is 2^m × n × number of prediction entries
- Tournament predictor: an adaptive one
  - Combines the local & global predictors
  - Selects the right predictor for a particular branch
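A 2-bit local predictor can be sketched as a table of saturating counters (the table size, starting state, and branch pattern below are invented for illustration), and the (m, n) storage formula is easy to check: a (2, 2) predictor with 1K entries needs 2^2 × 2 × 1024 = 8192 bits.

```python
# 2-bit saturating counter per entry: states 0..3, predict taken if >= 2.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries   # start weakly not-taken (an assumption)

    def predict(self, pc):
        return self.table[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# Storage cost of an (m, n) correlating predictor, in bits.
def predictor_bits(m, n, entries):
    return 2**m * n * entries

p = TwoBitPredictor()
hits = 0
# A loop-style pattern: taken 8 times, not-taken once, taken 8 times.
for outcome in [True] * 8 + [False] + [True] * 8:
    hits += p.predict(0x40) == outcome
    p.update(0x40, outcome)
print(hits)  # 15 of 17 correct: the 2-bit counter survives the single exit
print(predictor_bits(2, 2, 1024))  # 8192
```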
Dynamic Scheduling

- Hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior.
- Simple pipeline:
  - In-order issue, in-order execution, in-order completion
- Dynamic scheduling:
  - In-order issue, out-of-order execution, out-of-order completion
  - Out-of-order execution results in WAR & WAW hazards
  - Out-of-order completion results in unexpected exception behavior
Dynamic Scheduling

- Addressing the WAW & WAR hazards caused by out-of-order execution
  - Tomasulo's approach: register renaming (reservation stations, common data bus)
  - Pipeline steps: Issue, Execute, Write Result
  - Basic structure of Tomasulo's algorithm: pp. 173
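A toy sketch of why renaming removes WAR and WAW hazards (not Tomasulo's actual hardware, which renames implicitly through reservation-station tags): every write is given a fresh physical register, so only true RAW dependences remain.

```python
# Instructions are (dest, src1, src2) triples over architectural registers.
def rename(instrs):
    mapping = {}      # architectural name -> current physical register
    next_phys = 0
    out = []
    for dest, s1, s2 in instrs:
        # Reads use the current mapping, preserving true RAW dependences.
        p1 = mapping.get(s1, s1)
        p2 = mapping.get(s2, s2)
        # Each write gets a brand-new physical register, so two writes to
        # the same architectural register (WAW) or a write after a pending
        # read (WAR) can no longer conflict.
        mapping[dest] = f"p{next_phys}"
        next_phys += 1
        out.append((mapping[dest], p1, p2))
    return out

# F6 is written twice (WAW) and read between the writes (WAR); after
# renaming, the two writes target different physical registers.
prog = [("F6", "F2", "F4"), ("F8", "F6", "F2"), ("F6", "F2", "F2")]
print(rename(prog))
```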
Dynamic Scheduling

- Addressing the unexpected exception behavior caused by out-of-order completion
  - Hardware speculation: reorder buffer (passes the results along, guaranteeing in-order completion)
  - Pipeline steps: Issue, Execute, Write Result, Commit
  - Basic structure of hardware speculation: pp. 185
- Now: pipeline with dynamic scheduling
  - In-order issue, out-of-order execution, in-order completion
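A minimal sketch of the reorder-buffer idea (invented data structures, not the book's hardware tables on pp. 185): results may arrive out of order, but entries retire strictly from the head of the buffer, so architectural state is always updated in program order.

```python
from collections import OrderedDict

class ReorderBuffer:
    def __init__(self):
        self.entries = OrderedDict()   # preserves issue (program) order

    def issue(self, tag):
        self.entries[tag] = None       # allocated, no result yet

    def write_result(self, tag, value):
        self.entries[tag] = value      # out-of-order arrival is fine

    def commit(self):
        # Retire only finished entries, and only from the head: an
        # unfinished older instruction blocks all younger ones.
        done = []
        while self.entries and next(iter(self.entries.values())) is not None:
            done.append(self.entries.popitem(last=False))
        return done

rob = ReorderBuffer()
for tag in ["i1", "i2", "i3"]:
    rob.issue(tag)
rob.write_result("i3", 30)   # i3 finishes first...
print(rob.commit())          # ...but cannot commit past unfinished i1: []
rob.write_result("i1", 10)
rob.write_result("i2", 20)
print(rob.commit())          # now i1, i2, i3 retire in program order
```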
Decreasing the CPI

- Multiple Issue
  - Statically scheduled superscalar processors
  - VLIW (very long instruction word) processors
  - Dynamically scheduled superscalar processors
  - See the summary table on pp. 194
Chapter 4 Review

- SISD (single instruction, single data) architecture
  - Examples in Chapter 3
- SIMD (single instruction, multiple data) architecture: exploiting data-level parallelism
  - Vector architecture
  - Multimedia SIMD instruction set extensions
  - Graphics processing units (GPUs)
  - Relies on data independence among the elements
Vector Architecture

- Primary components: VMIPS
  - Vector registers
  - Vector functional units
  - Vector load/store unit
  - A set of scalar registers
- Basic structure of a vector architecture: pp. 265
Vector Architecture

- Execution time depends on:
  - Length of the operand vectors
  - Structural hazards among the operations
  - Data dependences
- Convoy: the set of vector instructions that could potentially execute together
  - No structural hazards within a convoy
  - Chaining: addresses data dependences within a convoy
- Chime: the unit of time taken to execute one convoy
Vector Architecture

- Execution time
  - A vector sequence of m convoys with a vector length of n takes m chimes, approximately m × n clock cycles
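A quick check of the chime model (the convoy count is an invented example, not one worked in the text): with m convoys and vector length n, the sequence takes about m chimes ≈ m × n cycles, ignoring start-up overhead.

```python
# Chime model: with one lane, each convoy produces one result per clock,
# so a convoy takes ~n cycles and m convoys take ~m * n cycles.
def vector_cycles(m_convoys, n_length):
    return m_convoys * n_length

# Hypothetical example: 3 convoys operating on 64-element vectors.
m, n = 3, 64
print(vector_cycles(m, n))       # 192 cycles, i.e. 3 chimes
print(vector_cycles(m, n) // n)  # cycles per element = number of convoys
```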
Vector Architecture

- Executes a single vector faster than one element per clock cycle
  - Multiple lanes
- Handles programs where the vector lengths are not the same as the length of the vector register
  - Vector-Length Register (VLR): MTC1 VLR, R1
  - Strip mining: used when the vector length is longer than MVL
- Handles IF statements in vector loops
  - Vector mask registers: CVM, POP
- Supplying bandwidth for vector load/store units
  - Memory banks: allow multiple independent data accesses
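Strip mining can be sketched in scalar Python (the book's version is VMIPS assembly; MVL = 64 is an assumed machine limit): a first strip of n mod MVL elements, then full MVL-length strips, setting VLR for each strip.

```python
MVL = 64  # assumed maximum vector length of the machine

# Break a length-n vector operation into strips of at most MVL elements.
# Returns (start index, VLR value) for each strip.
def strip_mine(n):
    strips = []
    low = 0
    vl = n % MVL or MVL            # first strip handles the remainder
    while low < n:
        strips.append((low, vl))
        low += vl
        vl = MVL                   # all later strips are full length
    return strips

print(strip_mine(200))  # one 8-element strip, then three 64-element strips
```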
Vector Architecture

- Handles multidimensional arrays
  - Stride: LVWS V1, (R1, R2) and SVWS (R1, R2), V1
- Handles sparse matrices
  - Gather-scatter: LVI V1, (R1, V2) and SVI (R1, V2), V1
- Programming vector architectures: program structures affect performance
  - Most of these optimizations are spent on improving memory accesses, and most of them are modifications to the vector instruction set
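A scalar sketch of what the strided and indexed loads do (Python stand-ins for the VMIPS LVWS/LVI semantics above; the memory contents and addresses are invented):

```python
# LVWS V1, (R1, R2): load a vector from base address R1 with stride R2.
def load_with_stride(memory, base, stride, vl):
    return [memory[base + i * stride] for i in range(vl)]

# LVI V1, (R1, V2): gather -- load memory[R1 + V2[i]] for each element.
def gather(memory, base, index_vector):
    return [memory[base + idx] for idx in index_vector]

mem = list(range(100, 200))  # fake memory: mem[a] == 100 + a

# Column access of a row-major 8x8 matrix => stride 8.
print(load_with_stride(mem, 2, 8, 4))   # mem[2], mem[10], mem[18], mem[26]
# Sparse-matrix access through an index vector.
print(gather(mem, 0, [3, 17, 42]))
```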
SIMD Instruction Set Extensions for Multimedia

- Observation: many media applications operate on narrower data types than the 32-bit processors were optimized for
  - 8 bits represent each of the three primary colors
  - 8 bits for transparency
- Limitations
  - The number of data operands is fixed in the opcode
  - Does not offer the more sophisticated addressing modes of vector architectures: stride & gather-scatter
  - Does not offer mask registers
- Roofline visual performance model

SIMD Implementations

- Intel MMX (1996)
  - Eight 8-bit integer ops or four 16-bit integer ops
- Streaming SIMD Extensions (SSE) (1999)
  - Eight 16-bit integer ops
  - Four 32-bit integer/fp ops or two 64-bit integer/fp ops
- Advanced Vector Extensions (AVX) (2010)
  - Four 64-bit integer/fp ops
- Operands must be in consecutive and aligned memory locations
- Generally designed to accelerate carefully written libraries rather than for compilers
- Advantages over vector architecture:
  - Costs little to add to the standard ALU and is easy to implement
  - Requires little extra state, making context switches easy
  - Requires little extra memory bandwidth
  - No virtual-memory problems of cross-page access and page faults
Graphics Processing Units

- Challenges:
  - Not simply getting good performance on the GPU
  - Coordinating the scheduling of computation on the system processor and the GPU, and the transfer of data between system memory and GPU memory
- Heterogeneous architecture & computing
  - CPU + GPU
  - Individual memories for the CPU & GPU
  - Like a distributed system on a node
- CUDA or OpenCL languages
- Programming model is "Single Instruction, Multiple Thread" (SIMT)




Graphical Processing Units: Threads, Blocks, Grids

- A thread is associated with each data element
- Threads are organized into blocks
- Blocks are organized into a grid
- GPU hardware handles thread management, not applications or the OS
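The thread/block/grid hierarchy maps each thread to one data element; the standard CUDA-style global-index computation can be sketched as follows (the block size of 256 and element count are assumed example values):

```python
# CUDA-style mapping: global index = blockIdx * blockDim + threadIdx.
# Each thread handles one data element; surplus threads are masked off
# by a guard (if global_index < n) inside the kernel.
def global_index(block_idx, block_dim, thread_idx):
    return block_idx * block_dim + thread_idx

# Number of blocks needed to cover n elements (ceiling division).
def blocks_for(n_elements, block_dim=256):
    return (n_elements + block_dim - 1) // block_dim

n = 1000
print(blocks_for(n))              # 4 blocks of 256 threads cover 1000 elements
print(global_index(3, 256, 231))  # 999: the last useful thread
```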

Graphical Processing Units: NVIDIA GPU Architecture

- Similarities to vector machines:
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
- Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Graphical Processing Units: Terminology

- Threads of SIMD instructions
  - Each has its own PC
  - Thread scheduler uses a scoreboard to dispatch
  - No data dependences between threads!
  - Keeps track of up to 48 threads of SIMD instructions
  - Hides memory latency
- Thread block scheduler schedules blocks to SIMD processors
- Within each SIMD processor:
  - 32 SIMD lanes
  - Wide and shallow compared to vector processors