Evaluation of Static and Dynamic Scheduling for Media Processors

Jason Fritts
Assistant Professor
Department of Computer Science
Co-Author: Wayne Wolf
Overview
● Media Processing – Present and Future
● Evaluation Environment
● Dynamic vs. Static Architectures
● Effects of High Frequency
● Conclusions
● Future Research
Multimedia Applications
● Wide range of applications
— Communication
– video conferencing
– World Wide Web
– digital/video libraries
– videophones
— Entertainment
– video/computer games
– movies
– animation
— Computer Vision
– image understanding
– surveillance
– tracking
— Education
– interactive learning
– virtual classrooms
— Art and Architecture
Multimedia is primarily a communication medium.
Future of Multimedia
Multimedia industry evolves with processor performance.
Multimedia is moving towards advanced representations.
[Figure: processing performance vs. time, progressing from image compression to video compression to object-based multimedia]
Current Media Processing Solutions
● Application-specific processors
— high performance at low cost
— very limited flexibility
● Multimedia extensions to general-purpose processors
— good programmability at little added cost
— some speedup with subword parallelism
— optimized for general-purpose processing
● Current “programmable” media processors
— good performance
– specialized hardware
– subword parallelism
– ILP
— good programmability (w/ special programming libraries)
— moderate frequency
Future Media Processors
● Increasing Performance
— high frequency
— improved ILP
● Cost is Major Barrier
— high resource costs are primary barrier to using such mechanisms
— smaller market for media processing prohibits high resource costs
— media processors currently much more expensive per MIPS
● Diminishing Costs
— increasing market for media processing
— decreasing power per MIPS
— demonstrated by recently announced TI C64x => frequencies up to 1.1 GHz

Processor         | Frequency       | Power     | Execution Units | L2 Cache                  | VLSI Technology
TI C62x           | up to 300 MHz   | up to 2 W | 8               | varies, w/ up to 7 Mb mem |
Intel Pentium III | 500 MHz - 1 GHz | 13 - 16 W | 5               | 32 KB L1, 256 KB L2       |
Evaluation Environment
MediaBench Benchmark Suite
● Developed at UCLA
[CLee97] “MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems,” MICRO-30, 1997.
● Excellent combination of applications
— video: MPEG-2
— audio: ADPCM coder
— graphics: Mesa
— image: JPEG, EPIC, Ghostscript
— security: PGP, Pegwit
— speech: GSM, G.721, Rasta
● Augmented for greater representation of future multimedia
— MPEG-4 object-oriented video
— H.263 very-low bitrate video
IMPACT Environment
● Aggressive ILP research compiler
— Three levels of optimizations
– Classical: classical optimizations only
– Superscalar: adds loop unrolling and superblock formation
– Hyperblock: adds hyperblock optimization
● Architecture-independent evaluation
— large, generic instruction set
— retargetable back-end
● Performance analysis tools
— parameterizable simulator
— statistical and cycle-accurate simulation
— models VLIW and in-order superscalar architectures
— expanded tools to include out-of-order superscalar architectures
Dynamic vs. Static Architectures
Related Research
● Media processors currently statically scheduled
— TI C6x
— TriMedia TM-1000, TM-2000
— Equator/Hitachi MAP1000
● Research-based media processors
[CLee97] “MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems,” MICRO-30, 1997.
[CLee98] “Media Architecture: General Purpose vs. Multiple Application-Specific Programmable Processors,” DAC-35, 1998.
[PPirsch97] “On Implementation of Media Processors,” IEEE Signal Processing Magazine, vol. 14, no. 4, July 1997.
[SRixner99] “Media Processors Using Streams,” SPIE Photonics West – Media Processors ’99, 1999.
● Static vs. dynamic scheduling
[PChang91] “Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors,” MICRO-24, 1991.
Base Architecture Model
● Architecture model
— 8-issue media processor
— operation latencies targeting 500 MHz to 1 GHz processor frequency
— 64 integer and floating-point registers
— pipeline: 1 fetch, 2 decode, 1 write back, variable execute stages
— 1024-entry 2-bit branch predictor
● L1 Cache
— 16 KB direct-mapped L1 instruction cache w/ 256 byte lines
— 32 KB direct-mapped L1 data cache w/ 64 byte lines
● On-Chip L2 Cache
— 256 KB 4-way set associative w/ 64 byte lines
● External Memory
— 6:1 Processor to bus frequency ratio
[Figure: memory hierarchy diagram (datapath, L1 instruction and data caches, 8 write buffers, L2 cache, 8 write buffers, external memory) with latency labels of 3 cycles, 15 cycles (D-cache), 20 cycles (I-cache), and 50 cycles; bus frequency = 1/6 processor frequency]
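As a rough illustration of the cache parameters above, the sketch below derives line counts, set counts, and address-field widths from the stated sizes. The helper name `cache_geometry` and its result keys are hypothetical, and the code assumes power-of-two sizes; it is not taken from the simulator used in the study.

```python
def cache_geometry(size_bytes, line_bytes, ways=1):
    """Derive basic geometry for a set-associative cache.

    Assumes size_bytes, line_bytes, and ways are powers of two.
    ways=1 corresponds to a direct-mapped cache.
    """
    lines = size_bytes // line_bytes          # total cache lines
    sets = lines // ways                      # lines are grouped into sets
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


KB = 1024
l1_instr = cache_geometry(16 * KB, 256)        # 16 KB direct-mapped, 256-byte lines
l1_data = cache_geometry(32 * KB, 64)          # 32 KB direct-mapped, 64-byte lines
l2 = cache_geometry(256 * KB, 64, ways=4)      # 256 KB 4-way, 64-byte lines

for name, geom in (("L1 instr", l1_instr), ("L1 data", l1_data), ("L2", l2)):
    print(name, geom)
```

Under these assumptions the L1 instruction cache holds only 64 lines because of its long 256-byte lines, while the L2 cache has 4096 lines arranged in 1024 sets.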
Static vs. Dynamic Scheduling
● Architectures for static and dynamic scheduling
— VLIW and in-order superscalar perform comparably (5% difference)
— out-of-order superscalar has 64% better performance on average
– out-of-order issue with 32-entry issue-reorder buffer
– early branch evaluation
– large degree of dynamic control speculation
[Figure: IPC (0 to 4) by application for VLIW, in-order superscalar, and out-of-order superscalar; applications: cjpeg, djpeg, epic, g721dec, g721enc, gs, gsmdecode, gsmencode, h263dec, h263enc, mipmap, mpeg2dec, mpeg2enc, mpeg4dec, osdemo, pegwitdec, pegwitenc, pgpdecode, rasta, rawcaudio, rawdaudio, texgen, unepic, AVERAGE]
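To make the difference between in-order and out-of-order issue concrete, here is a minimal toy model. It is only a sketch: the names `Op` and `simulate` are hypothetical, the instruction sequence is invented, and the model ignores branches, caches, and functional-unit limits. It uses a 3-cycle load and a 1-cycle ALU operation, matching the base model latencies.

```python
from dataclasses import dataclass


@dataclass
class Op:
    latency: int       # cycles until the result is available
    deps: tuple = ()   # indices of earlier ops whose results this op reads


def simulate(ops, width, window=None):
    """Return the cycle at which the last result becomes available.

    window=None models in-order issue: issue stops at the first op whose
    operands are not ready. An integer window models out-of-order issue
    from the `window` oldest unissued ops.
    """
    ready_at = [None] * len(ops)       # cycle when each result is available
    pending = list(range(len(ops)))    # unissued ops in program order
    cycle = 0
    while pending:
        issued = []
        candidates = pending if window is None else pending[:window]
        for i in candidates:
            if len(issued) == width:
                break
            operands_ready = all(
                ready_at[d] is not None and ready_at[d] <= cycle
                for d in ops[i].deps
            )
            if operands_ready:
                issued.append(i)
            elif window is None:
                break                  # in-order: younger ops may not pass
        for i in issued:
            ready_at[i] = cycle + ops[i].latency
            pending.remove(i)
        cycle += 1
    return max(ready_at)


# Three load-use pairs: each ALU op depends on the load just before it.
ops = [Op(3), Op(1, (0,)), Op(3), Op(1, (2,)), Op(3), Op(1, (4,))]
in_order_cycles = simulate(ops, width=2)
out_of_order_cycles = simulate(ops, width=2, window=32)
print(in_order_cycles, out_of_order_cycles)
```

For this contrived sequence the in-order model needs 10 cycles and the out-of-order model 5. A compiler could also hoist the loads at compile time here, so the toy shows the issue mechanism only and says nothing about the measured 64% gap.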
Scheduling Variations across Compiler Methods
● Compared compilation models across architectures
— hyperblock demonstrates best performance
– 12% increase over superblock on out-of-order superscalar
– only 2% increase over superblock otherwise
– gain likely does not warrant resources for predication
[Figure: IPC (0 to 3) by compilation method (Classical, Superscalar, Hyperblock) for VLIW, in-order superscalar, and out-of-order superscalar, each with and without perfect caches]
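Hyperblock formation relies on predication: branches are converted into predicate-guarded operations so that several control paths can be scheduled as one block. The sketch below mimics that if-conversion in Python. The function names are hypothetical and the code assumes `lo <= hi`. Python's conditional expression stands in for a predicated move, so this illustrates the transformation rather than the compiler's actual output.

```python
def clamp_branching(x, lo, hi):
    """Original form: two data-dependent branches."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def clamp_predicated(x, lo, hi):
    """If-converted form: compute predicates, then guarded assignments."""
    p_lo = x < lo                     # predicate define
    p_hi = x > hi                     # predicate define
    result = x
    result = lo if p_lo else result   # executes "under" predicate p_lo
    result = hi if p_hi else result   # executes "under" predicate p_hi
    return result


samples = [-5, 0, 17, 255, 300]
print([clamp_predicated(x, 0, 255) for x in samples])
```

The predicated version always executes both guarded assignments, which is the resource cost that the slide weighs against the 2% to 12% gain.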
Scheduling Variations across Processor Widths
● Compared processor widths across architectures
— performance gains level off beyond 4 issue slots
— 3-4 issue slots sufficient for these compiler methods
— 2-issue out-of-order superscalar outperforms 8-issue VLIW and 8-issue in-order superscalar
[Figure: IPC (0 to 2.5) vs. issue width (0 to 10) for VLIW, in-order superscalar, and out-of-order superscalar]
Effects of High Frequency
Impact of Higher Frequencies
● Increasing frequency
— Causes greater wire delays and fewer levels of logic per cycle
— Leads to:
– deeper pipelines
– longer operation latencies
– increased communication costs
● Compared three different processor frequency models
● Compared immediate vs. delayed bypassing

Instruction               | Model 1     | Model 2 (Base)  | Model 3
Frequency Range           | 250-500 MHz | 500 MHz – 1 GHz | 1-2 GHz
Processor-Bus Freq. Ratio | 4:1         | 6:1             | 8:1
ALU                       | 1           | 1               | 1
Branches                  | 1           | 1               | 1
Store                     | 1           | 2               | 3
Load                      | 2           | 3               | 4
Floating-Point            | 3           | 4               | 5
Multiply                  | 3           | 5               | 7
Divide                    | 10          | 20              | 30
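One way to read the latency table is to convert cycle counts into absolute time. The sketch below does this for a few operations, assuming each model runs at the top of its frequency range (500 MHz, 1 GHz, 2 GHz). That choice and the names `MODELS` and `latency_ns` are illustrative assumptions, not part of the original study.

```python
# Cycle latencies copied from the table; frequencies are an assumed
# operating point at the top of each model's range.
MODELS = {
    "Model 1": {"freq_mhz": 500, "load": 2, "multiply": 3, "divide": 10},
    "Model 2": {"freq_mhz": 1000, "load": 3, "multiply": 5, "divide": 20},
    "Model 3": {"freq_mhz": 2000, "load": 4, "multiply": 7, "divide": 30},
}


def latency_ns(cycles, freq_mhz):
    """Convert a latency in cycles to nanoseconds at the given frequency."""
    return cycles * 1000.0 / freq_mhz


for name, model in MODELS.items():
    times = {
        op: latency_ns(model[op], model["freq_mhz"])
        for op in ("load", "multiply", "divide")
    }
    print(name, times)
```

Under this assumption, cycle counts rise from model to model while the absolute latencies stay flat or shrink, which is consistent with fewer levels of logic fitting in each cycle.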
Comparison of Frequency Models
● Results from doubling processor frequency
— average IPC degradation of 15%
– 2/3 of degradation from longer operation latencies
– 1/3 of degradation from longer memory latencies
— performance increase of 70% from doubling frequency
— out-of-order superscalar and superscalar compilation least susceptible to IPC degradation at higher frequencies
[Figure: IPC difference (%) (0 to 25) by compilation/simulation method for the m1 to m2 and m2 to m3 transitions; x-axis labels combine compilation method (Classical, Superscalar, Hyperblock) with architecture (VLIW, in-order superscalar, out-of-order superscalar)]
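The 70% figure follows from the other two numbers if performance is taken as IPC times frequency. A small check, with the hypothetical helper `net_speedup`:

```python
def net_speedup(freq_ratio, ipc_degradation):
    """Overall speedup when frequency scales by freq_ratio and IPC
    drops by the fraction ipc_degradation (performance = IPC * frequency)."""
    return freq_ratio * (1.0 - ipc_degradation)


speedup = net_speedup(freq_ratio=2.0, ipc_degradation=0.15)
print(f"net speedup: {speedup:.2f}x")
```

Doubling frequency with a 15% IPC loss gives 2 x 0.85 = 1.70, a 70% performance increase.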
Impact of Delayed Bypassing
● Results from delaying bypassing one cycle
— average IPC degradation of 32%
— out-of-order superscalar and superscalar compilation least susceptible to IPC degradation
[Figure: IPC difference (%) (0 to 45) by compilation method (Classical, Superscalar, Hyperblock) for VLIW, in-order superscalar, and out-of-order superscalar]
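A simple way to see why delayed bypassing hurts is to look at a chain of dependent operations. The sketch below assumes that delaying bypassing by one cycle adds one cycle before each consumer can read its producer's result. The chain itself and the names `MODEL2_LATENCY` and `chain_cycles` are hypothetical; the latencies are those of Model 2 (Base).

```python
# Operation latencies in cycles for Model 2 (Base).
MODEL2_LATENCY = {
    "alu": 1, "branch": 1, "store": 2, "load": 3,
    "fp": 4, "multiply": 5, "divide": 20,
}


def chain_cycles(kinds, bypass_delay=0):
    """Cycles until the final result of a fully dependent chain is usable.

    Each op consumes the previous op's result, so latencies add up;
    bypass_delay is the assumed extra wait per producer-consumer hop.
    """
    return sum(MODEL2_LATENCY[kind] + bypass_delay for kind in kinds)


chain = ["load", "alu", "alu", "multiply", "alu"]
immediate = chain_cycles(chain)
delayed = chain_cycles(chain, bypass_delay=1)
print(immediate, delayed)
```

This worst-case chain grows from 11 to 16 cycles. Real code has independent work to overlap with the extra delay, so the measured average degradation is smaller than this fully serial example suggests.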
Conclusions
● VLIW and in-order superscalar perform comparably
— Only 5% average difference in performance
● Out-of-order superscalar has significantly higher performance
— 64% better average performance than VLIW
— 2-issue out-of-order superscalar outperforms both 8-issue VLIW and 8-issue in-order superscalar
● Compilation and Processor Width
— Hyperblock compilation is best, but likely not worth overhead
— Processor widths of 3-4 issue slots sufficient for these compilation methods
● Effects of High Frequency
— Doubling processor frequency decreases IPC by 15%
— Delayed bypassing decreases IPC by 32%
— Out-of-order scheduling and superscalar compilation up to 30% less susceptible to high frequency effects
Areas for Future Work
● Advanced Compilation Methods
— Software pipelining
● Impact of Subword Parallelism
— Current work only evaluates scheduling mechanisms on ILP-based code
— How does inclusion of subword parallelism affect performance?
— Anticipate greater impact from dynamic aspects:
– Subword parallelism primarily used across loop iterations with regular control flow
– Subword parallelism reduces regularity, giving dynamic aspects greater weight
● Evaluating DSP Features
— DSP operations: multiply-accumulate, saturation arithmetic, etc.
— Low-overhead looping
● Evaluate Performance with Specialized Functional Units
— Motion estimation, DCT, variable-bit rate coding, etc.
— Support specialized media functions with reconfigurable co-processor?