Short Course on Advanced Topics: from Emerging Media/Network Processors to Internet Computing

Topic 3: Fundamentals of Media Processor Designs

Overview of High-Performance Processors
• Multiple-issue, out-of-order, dynamic-window processors
• VLIW and vector processors
• Systolic and reconfigurable processors
• Hardwired stream processors
• Thread-level parallelism

Multimedia extension
• Media benchmarks/workloads
• Streaming media processing, sub-word parallelism
• Intel MMX/SSE media extensions
• IA-64 multimedia instructions

Media processors
• IMAGINE: media processing with streams
• IVRAM: extending Intelligent RAM with a vector unit
• Trimedia: the price-performance challenge for media processing

Digital Signal Processing (DSP)
• In the 1970s, DSP for telecommunications required higher performance than available microprocessors could deliver
• Computationally intensive: dominated by the vector dot product (multiply, multiply-add)
• Real-time requirements
• Streaming data, high memory bandwidth, simple memory access patterns
• Predictable program flow: nested loops, few branches, large basic blocks
• Sensitivity to numeric error

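To make the dominant kernel concrete, here is a minimal C sketch of the fixed-point dot product behind FIR filtering and correlation; the function name and the 16/32-bit widths are illustrative assumptions, not from the slides:

    #include <stdint.h>

    /* Dot product of two length-n vectors: one multiply-add per sample. */
    int32_t dot_product(const int16_t *x, const int16_t *h, int n)
    {
        int32_t acc = 0;                  /* wide accumulator limits rounding/overflow error */
        for (int i = 0; i < n; i++)
            acc += (int32_t)x[i] * h[i];  /* the multiply-accumulate (MAC) a DSP issues every cycle */
        return acc;
    }
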
Early DSPs

Single-cycle multiplier

Streamlined multiply-add operation





Separate instruction / data memory for high
memory bandwidth
Specialized addressing hardware, autoincrement
Complex instruction set, combine multiple
operations in single instruction
Special-purpose, fixed-function hardware, lack
flexibility and programmability
TI TMS32010, 1982
4
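Recasting the dot product with pointer post-increments shows, as a rough illustration (not TMS32010 code), what one complex DSP instruction encoded: a multiply, an accumulate, and two autoincrementing operand fetches.

    #include <stdint.h>

    /* Each loop body below corresponds to a single complex DSP instruction. */
    int32_t mac_autoinc(const int16_t *x, const int16_t *h, int n)
    {
        int32_t acc = 0;
        while (n--)
            acc += (int32_t)(*x++) * (*h++);   /* MAC with autoincrement addressing */
        return acc;
    }
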
Today’s DSPs (from 1995)
• Adapt general-purpose processor designs
  • RISC-like instruction sets
  • Multiple issue: VLIW, vector, SIMD, superscalar, chip multiprocessing
• Programmability and compatibility
  • Easier to program, a better compiler target
  • Better compatibility with future architectures
• TI TMS320C62xx family
  • RISC instruction set
  • 8-issue VLIW design

General-Purpose Processors
• A growing number of applications (e.g. cellular phones) involve DSP tasks (a $6 billion DSP market in 2000)
• Add architectural features to boost performance on common DSP tasks
• Extended multimedia instruction sets, adapted to and integrated with the existing hardware of almost all high-performance microprocessors: Intel MMX/SSE
• New architectures that encompass DSP + general-purpose processing and exploit high parallelism: Stanford Imagine, etc.
• Future directions? Graphics processors?

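The flavor of these multimedia extensions can be shown with a small sketch using Intel SSE2 intrinsics; the function and the choice of a saturating 16-bit add are illustrative (n is assumed to be a multiple of 8):

    #include <emmintrin.h>
    #include <stdint.h>

    /* Add two arrays of 16-bit samples, eight at a time, with signed saturation. */
    void add_samples(int16_t *dst, const int16_t *a, const int16_t *b, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i *)&a[i]);
            __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]);
            /* PADDSW: eight parallel 16-bit saturating adds in one instruction */
            _mm_storeu_si128((__m128i *)&dst[i], _mm_adds_epi16(va, vb));
        }
    }
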
Media Processing
• Digital signal processing, 2D/3D graphics rendering, image/audio compression/decompression
• Real-time constraints, high performance density
• Large amounts of data parallelism, latency tolerance
• Streaming data, very little global data reuse
• Computationally intensive: 100-200 arithmetic operations per data element
• Requires efficient hardware mapping of the algorithm flow: special-purpose media processors
• Or: extend the instruction set / hardware of general-purpose processors

Multimedia Applications
• Image/video/audio compression (JPEG/MPEG/GIF/PNG)
• Front end of the 3D graphics pipeline (geometry, lighting)
• High-quality additive audio synthesis (Todd Hodes, UCB): vectorize across oscillators
• Adobe Photoshop
• Pixar Cray X-MP, Stellar, Ardent, Microsoft Talisman MSP
• Image processing
• Speech recognition
  • Front end: filters/FFTs
  • Phoneme probabilities: neural net
  • Back end: Viterbi/beam search

High-Performance Processors
• Exploit instruction-level parallelism
  • Superscalar, VLIW, vector SIMD, systolic array, etc.
  • Flexible (superscalar, VLIW) vs. regular (vector, systolic)
  • Data communication: through registers (VLIW, vector) vs. forwarding (superscalar, systolic, vector chaining)
  • Ratio of computation to memory accesses; data reuse ratio
  • Hardware (superscalar) vs. software (VLIW, vector, systolic) to discover ILP
• Exploit thread-level parallelism
  • Parallel computation (programming) models: streaming, macro-dataflow, SPMD, etc.
  • Data communication and data-sharing behavior
  • Multiprocessor synchronization requirements

Instruction-Level Parallelism

    for (I = 1000; I > 0; I--)
        X[I] = X[I] + S;

    Loop:  load   F0,0(R1)
           add    F4,F0,F2
           store  F4,0(R1)
           addui  R1,R1,#-8
           bne    R1,R2,Loop

(Data dependences chain the load, add, and store; the bne imposes the control dependence.)
• Limited instruction-level parallelism (ILP)
• Data dependences: true (RAW), anti (WAR), output (WAW)
• Control dependence: determines program flow

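A minimal C illustration of the three dependence classes named in the bullets (variable names are chosen for exposition, not from the slides):

    a = b + c;   /* writes a                                                  */
    d = a + e;   /* RAW (true): reads the a written above                     */
    a = f + g;   /* WAR (anti) with the read of a just before it,
                    WAW (output) with the first write of a                    */
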
Dynamic Out-of-order Execution
• In-order fetch/issue, out-of-order execution, in-order completion: maintains precise interrupts
• Uses a reorder buffer (ROB) to hold the results of uncommitted instructions
• Registers are renamed to ROB entries, which drive dependent instructions
• Instructions commit in order: each is removed from the ROB and its result written to the architectural register
• Memory disambiguation
• Discovers ILP dynamically: flexible, costly, well suited to integer programs

[Figure: FP op queue feeding reservation stations and FP adders, with a reorder buffer and FP register file]

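A minimal C sketch of in-order commit from a reorder buffer (all sizes, types, and field names are illustrative assumptions):

    #include <stdbool.h>

    #define ROB_SIZE 32
    struct rob_entry { bool done; int dest; long value; };

    static struct rob_entry rob[ROB_SIZE];
    static int head = 0, tail = 0;   /* head = oldest in-flight instr., tail = next free slot */
    static long arch_reg[32];        /* architectural register file */

    /* Retire completed instructions strictly in program order; an
       unfinished instruction at the head stalls all younger ones. */
    void commit(void)
    {
        while (head != tail && rob[head].done) {
            arch_reg[rob[head].dest] = rob[head].value;
            head = (head + 1) % ROB_SIZE;
        }
    }
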
Fetch / Issue Unit

[Figure: instruction fetch with branch prediction streams instructions to the out-of-order execution unit, which feeds back branch outcomes for correction]

• Must fetch beyond branches: branch prediction
• Must feed the execution unit at high bandwidth: trace cache
• Must utilize instruction / trace cache bandwidth: next-line prediction
• Instruction fetch is decoupled from execution
• Issue logic (+ renaming) is often included in the fetch unit
• Needs efficient (1-cycle) broadcast + wakeup + schedule logic for dependent-instruction scheduling

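As a sketch of the branch prediction mentioned above, here is the classic table of 2-bit saturating counters in C; the table size and PC hashing are illustrative assumptions, not any specific processor's design:

    #include <stdbool.h>
    #include <stdint.h>

    #define PHT_SIZE 4096
    static uint8_t pht[PHT_SIZE];    /* 0-1 predict not-taken, 2-3 predict taken */

    bool predict(uint32_t pc)
    {
        return pht[(pc >> 2) % PHT_SIZE] >= 2;
    }

    void train(uint32_t pc, bool taken)
    {
        uint8_t *c = &pht[(pc >> 2) % PHT_SIZE];
        if (taken  && *c < 3) (*c)++;   /* saturate toward strongly taken */
        if (!taken && *c > 0) (*c)--;   /* saturate toward strongly not-taken */
    }
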
Superscalar Out-of-order Execution

    Loop:  load   F0,0(R1)      ; iteration i
           add    F4,F0,F2
           store  F4,0(R1)
           addui  R1,R1,#-8
           bne    R1,R2,Loop
           load   F0,0(R1)      ; iteration i+1, fetched via branch prediction
           add    F4,F0,F2
           store  F4,0(R1)
           addui  R1,R1,#-8
           bne    R1,R2,Loop

• Branch prediction lets fetch continue past the loop branch
• Register renaming (e.g. of R1 and F0) removes the false dependences between iterations
• Hardware discovers the ILP: the most flexible approach

VLIW Approach – Static Multiple Issue
• Wide instructions hold multiple independent operations
• Loop unrolling, procedure inlining, trace scheduling, etc. enlarge basic blocks
• The compiler discovers independent operations and packs them into long instructions
• Difficulties:
  • Code size: clever encoding
  • Lock-step execution: hardware can allow unsynchronized execution
  • Binary code compatibility: object-code translation
  • Compiler techniques to improve ILP
  • Compiler optimization with hardware support
• Better suited to applications with predictable control flow: media / signal processing

VLIW Approach – Example

The loop is unrolled seven times and scheduled into nine VLIW instructions (one per row); the last three stores use positive offsets because R1 has already been decremented by 56:

    Memory Ref 1       Memory Ref 2       FP Operation 1    FP Operation 2    Integer/Branch
    Load F0,0(R1)      Load F6,-8(R1)
    Load F10,-16(R1)   Load F14,-24(R1)
    Load F18,-32(R1)   Load F22,-40(R1)   Add F4,F0,F2      Add F8,F6,F2
    Load F26,-48(R1)                      Add F12,F10,F2    Add F16,F14,F2
                                          Add F20,F18,F2    Add F24,F22,F2
    Store F4,0(R1)     Store F8,-8(R1)    Add F28,F26,F2
    Store F12,-16(R1)  Store F16,-24(R1)                                      Addui R1,R1,#-56
    Store F20,24(R1)   Store F24,16(R1)
    Store F28,8(R1)                                                           Bne R1,R2,Loop

Vector Processor
• Single-instruction, multiple-data: exploits regular data parallelism; less flexible than VLIW
• Highly pipelined, tolerates memory latency
• Requires high memory bandwidth (cache-less)
• Better suited to large scientific applications with heavy loop structures; also good for media applications
• Dynamic vector chaining, compound instructions
• Example:

    /* after vector loop blocking */
    Vload  V1,0(R1)
    Vadd   V2,V1,F2
    Vstore V2,0(R1)

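The "vector loop blocking" in the comment is strip mining; here is a minimal C sketch, with the maximum vector length MVL as an assumed parameter:

    #define MVL 64   /* illustrative maximum vector length */

    /* Strip-mined form of "for (I = n; I > 0; I--) X[I] += S":
       each outer iteration maps to one Vload / Vadd / Vstore sequence. */
    void vector_add_scalar(double *X, double S, int n)
    {
        for (int i = 0; i < n; i += MVL) {
            int vl = (n - i < MVL) ? n - i : MVL;   /* short final strip */
            for (int j = i; j < i + vl; j++)
                X[j] += S;
        }
    }
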
Systolic Array, Reconfigurable Processor
• Systolic array: fixed function, fixed wiring
  [Figure: operands (… 8(R1), 0(R1)) and coefficients (… F2, F2) stream through an adder pipeline]
  • Avoids register communication, but inflexible
• Reconfigurable hardware: MIT Raw, Stanford Smart Memories
  • A general-purpose engine is limited for media applications
  • Fixed-function, fixed-wire designs are too restrictive
  • Reconfigurable hardware provides compiler-programmable interconnections and system structure to suit the application
  • Exploits thread-level parallelism

Thread-Level Parallelism
• Many applications, such as database transactions, scientific computations, and server applications, exhibit high-level (thread-level) parallelism
• Two basic approaches:
  • Execute each thread on a separate processor: the old parallel-processing approach
  • Execute multiple threads on a single processor
    • Duplicate per-thread state (PC, registers, etc.) but share functional units, the memory hierarchy, etc.; thread-switching cost is minimal compared with context switching
    • Thread switching: coarse-grained vs. fine-grained
    • Simultaneous multithreading (SMT): thread-level and instruction-level parallelism are exploited simultaneously, with multiple threads issuing in the same cycle

Simultaneous Multithreading

Simultaneous multithreading is a processor design
that combines hardware multithreading with superscalar
processor technology to allow multiple threads to issue
instructions each cycle.



Unlike other hardware multithreaded architectures (such as the Tera
MTA), in which only a single hardware context (i.e., thread) is active
on any given cycle, SMT permits all thread contexts to
simultaneously compete for and share processor resources.
Unlike conventional superscalar processors, which suffer from a lack
of per-thread instruction-level parallelism, simultaneous
multithreading uses multiple threads to compensate for low singlethread ILP.
The performance consequence is significantly higher instruction
throughput and program speedups on a variety of workloads that
include commercial databases, web servers and scientific
applications in both multiprogrammed and parallel environments.
19
Comparison of Multithreading

[Figure: issue-slot occupancy over time for a superscalar, coarse-grained MT, fine-grained MT, and SMT processor]

Performance of SMT
• SMT shows better throughput than a superscalar; however, threads contend for the caches

Summary
• Application-driven architecture studies
• Media applications
  • Computationally intensive, abundant parallelism, predictable control flow, real-time constraints
  • Memory intensive, streaming data access
  • 8-, 16-, and 24-bit data structures
• Suitable architectures
  • Dynamically scheduled, out-of-order processors are inefficient and overkill
  • VLIW, vector, and reconfigurable processors, or sub-word parallelism on general-purpose processors
  • Special handling of memory access