Evaluation of Static and Dynamic Scheduling for Media Processors

Jason Fritts
Assistant Professor
Department of Computer Science
Co-Author: Wayne Wolf

Overview
● Media Processing – Present and Future
● Evaluation Environment
● Dynamic vs. Static Architectures
● Effects of High Frequency
● Conclusions
● Future Research

Multimedia Applications
● Wide range of applications
  — Communication
    – video conferencing
    – World Wide Web
    – digital/video libraries
    – videophones
  — Entertainment
    – video/computer games
    – movies
    – animation
  — Computer Vision
    – image understanding
    – surveillance
    – tracking
  — Education
    – interactive learning
    – virtual classrooms
  — Art and Architecture
Multimedia is primarily a communication medium

Future of Multimedia
The multimedia industry evolves with processor performance.
Multimedia is moving towards advanced representations
[Figure: processing performance vs. time, progressing from Image Compression to Video Compression to Object-Based Multimedia]

Current Media Processing Solutions
● Application-specific processors
  — high performance at low cost
  — very limited flexibility
● Multimedia extensions to general-purpose processors
  — good programmability at little added cost
  — some speedup with subword parallelism
  — optimized for general-purpose processing
● Current “programmable” media processors
  — good performance
    – specialized hardware
    – subword parallelism
    – ILP
  — good programmability (w/ special programming libraries)
  — moderate frequency

Future Media Processors
● Increasing Performance
  — high frequency
  — improved ILP
● Cost is Major Barrier
  — high resource costs are primary barrier to using such mechanisms
  — smaller market for media processing prohibits high resource costs
  — media processors currently much more expensive per MIPS
● Diminishing Costs
  — increasing market for media processing
  — decreasing power per MIPS
  — demonstrated by recently announced TI C64x => frequencies up to 1.1 GHz

  Processor          TI C62x                      Intel Pentium III
  Frequency          up to 300 MHz                500 MHz - 1 GHz
  Power              up to 2 W                    13 - 16 W
  Execution Units    8                            5
  L2 Cache           varies, w/ up to 7 Mb mem    32 KB L1, 256 KB L2
  VLSI Technology

Evaluation Environment

MediaBench Benchmark Suite
● Developed at UCLA
  [CLee97] “MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems,” MICRO-30, 1997.
● Excellent combination of applications
  — video: MPEG-2
  — audio: ADPCM coder
  — graphics: Mesa
  — image: JPEG, EPIC, Ghostscript
  — security: PGP, Pegwit
  — speech: GSM, G.721, Rasta
● Augmented for greater representation of future multimedia
  — MPEG-4 object-oriented video
  — H.263 very-low-bitrate video

IMPACT Environment
● Aggressive ILP research compiler
  — Three levels of optimizations
    – Classical - classical optimizations only
    – Superscalar - adds loop unrolling and superblock formation
    – Hyperblock - adds hyperblock optimization
● Architecture-independent evaluation
  — large, generic instruction set
  — retargetable back-end
● Performance analysis tools
  — parameterizable simulator
  — statistical and cycle-accurate simulation models
  — VLIW and in-order superscalar architectures
  — expanded tools to include out-of-order superscalar architectures

Dynamic vs. Static Architectures

Related Research
● Media processors currently statically-scheduled
  — TI C6x
  — TriMedia TM-1000, TM-2000
  — Equator/Hitachi MAP1000
● Research-based media processors
  [CLee97] “MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems,” MICRO-30, 1997.
  [CLee98] “Media Architecture: General Purpose vs. Multiple Application-Specific Programmable Processors,” DAC-35, 1998.
  [PPirsch97] “On Implementation of Media Processors,” IEEE Signal Processing Magazine, vol. 14, no. 4, July 1997.
  [SRixner99] “Media Processors Using Streams,” SPIE Photonics West – Media Processors ’99, 1999.
● Static vs. dynamic scheduling
  [PChang91] “Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors,” MICRO-24, 1991.

Base Architecture Model
● Architecture model
  — 8-issue media processor
  — operation latencies targeting 500 MHz to 1 GHz processor frequency
  — 64 integer and floating-point registers
  — pipeline: 1 fetch, 2 decode, 1 write back, variable execute stages
  — 1024-entry 2-bit branch predictor
● L1 Cache
  — 16 KB direct-mapped L1 instruction cache w/ 256 byte lines
  — 32 KB direct-mapped L1 data cache w/ 64 byte lines
● On-Chip L2 Cache
  — 256 KB 4-way set associative w/ 64 byte lines
● External Memory
  — 6:1 processor to bus frequency ratio
[Figure: memory hierarchy diagram with blocks Datapath, L1 Instr Cache, L1 Data Cache, 8 Write Buffers, L2 Cache, External Memory; latencies shown: 3 cycles, 15 cycles (D-cache), 20 cycles (I-cache), 50 cycles; bus frequency = 1/6 processor frequency]

Static vs. Dynamic Scheduling
● Architectures for static and dynamic scheduling
  — VLIW and in-order superscalar perform comparably (5% difference)
  — out-of-order superscalar has 64% better performance on average
    – out-of-order issue with 32-entry issue-reorder buffer
    – early branch evaluation
    – large degree of dynamic control speculation
[Figure: IPC by application (cjpeg, djpeg, epic, unepic, g721dec, g721enc, gs, gsmdecode, gsmencode, h263dec, h263enc, mipmap, mpeg2dec, mpeg2enc, mpeg4dec, osdemo, pegwitdec, pegwitenc, pgpdecode, rasta, rawcaudio, rawdaudio, texgen, AVERAGE) for VLIW, in-order superscalar, and out-of-order superscalar]

Scheduling Variations across Compiler Methods
● Compared compilation models across architectures
  — hyperblock demonstrates best performance
    – 12% increase over superblock on out-of-order superscalar
    – only 2% increase over superblock otherwise
    – gain likely does not warrant resources for predication
[Figure: IPC vs. compilation method (Classical, Superscalar, Hyperblock) for VLIW, in-order superscalar, and out-of-order superscalar, each with and without perfect caches]

Scheduling Variations across Processor Widths
● Compared processor widths across architectures
  — performance gains diminish beyond 4 issue slots
  — 3-4 issue slots sufficient for these compiler methods
  — 2-issue out-of-order superscalar outperforms 8-issue VLIW and 8-issue in-order superscalar
[Figure: IPC vs. issue width for VLIW, in-order superscalar, and out-of-order superscalar]

Effects of High Frequency

Impact of Higher Frequencies
● Increasing frequency
  — Causes greater wire delays and fewer levels of logic per cycle
  — Leads to:
    – deeper pipelines
    – longer operation latencies
    – increased communication costs
● Compared three different processor frequency models
● Compared immediate vs. delayed bypassing

  Instruction (latency in cycles)   Model 1        Model 2 (Base)     Model 3
  Frequency Range                   250-500 MHz    500 MHz – 1 GHz    1-2 GHz
  Processor-Bus Freq. Ratio         4:1            6:1                8:1
  ALU                               1              1                  1
  Branches                          1              1                  1
  Store                             1              2                  3
  Load                              2              3                  4
  Floating-Point                    3              4                  5
  Multiply                          3              5                  7
  Divide                            10             20                 30

Comparison of Frequency Models
● Results from doubling processor frequency
  — average IPC degradation of 15%
    – 2/3 of degradation from longer operation latencies
    – 1/3 of degradation from longer memory latencies
  — performance increase of 70% from doubling frequency
  — out-of-order superscalar and superscalar compilation least susceptible to IPC degradation at higher frequencies
[Figure: IPC difference (%) by compilation/simulation method (classical, superscalar, and hyperblock compilation on VLIW, in-order superscalar, and out-of-order superscalar) for the steps m1 to m2 and m2 to m3]

Impact of Delayed Bypassing
● Results from delaying bypassing one cycle
  — average IPC degradation of 32%
  — out-of-order superscalar and superscalar compilation least susceptible to IPC degradation
[Figure: IPC difference (%) vs. compilation method (Classical, Superscalar, Hyperblock) for VLIW, in-order superscalar, and out-of-order superscalar]

Conclusions
● VLIW and in-order superscalar perform comparably
  — Only 5% average difference in performance
● Out-of-order superscalar has significantly higher performance
  — 64% better average performance than VLIW
  — 2-issue out-of-order superscalar outperforms both 8-issue VLIW and 8-issue in-order superscalar
● Compilation and Processor Width
  — Hyperblock compilation is best, but likely not worth overhead
  — Processor widths of 3-4 issue slots sufficient for these compilation methods
● Effects of High Frequency
  — Doubling processor frequency decreases IPC by 16%
  — Delayed bypassing decreases IPC by 32%
  — Out-of-order scheduling and superscalar compilation up to 30% less susceptible to high frequency effects

Areas for Future Work
● Advanced Compilation Methods
  — Software pipelining
● Impact of Subword Parallelism
  — Current work only evaluates scheduling mechanisms on ILP-based code
  — How does inclusion of subword parallelism affect performance?
  — Anticipate greater impact from dynamic aspects:
    – Subword parallelism primarily used across loop iterations with regular control flow
    – Subword parallelism reduces regularity, giving dynamic aspects greater weight
● Evaluating DSP Features
  — DSP operations: multiply-accumulate, saturation arithmetic, etc.
  — Low-overhead looping
● Evaluate Performance with Specialized Functional Units
  — Motion estimation, DCT, variable bit-rate coding, etc.
  — Support specialized media functions with reconfigurable co-processor?
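
Worked check of the frequency results
The figures on the Comparison of Frequency Models slide (15% average IPC degradation, 70% performance increase from doubling frequency) can be related with a short calculation. This is only a sketch: it assumes performance is proportional to IPC times clock frequency, which the slides do not state explicitly, and the function name `relative_performance` is hypothetical.

```python
def relative_performance(freq_scale: float, ipc_loss: float) -> float:
    """Speedup when the clock is scaled by freq_scale and IPC falls by ipc_loss.

    Assumes performance is proportional to IPC x clock frequency
    (an assumption of this sketch, not a formula taken from the slides).
    """
    return freq_scale * (1.0 - ipc_loss)


# Figures quoted on the slides: doubling frequency costs about 15% of IPC.
IPC_LOSS = 0.15
speedup = relative_performance(freq_scale=2.0, ipc_loss=IPC_LOSS)
print(f"speedup from doubling frequency: {speedup:.2f}x")

# The slides attribute 2/3 of the IPC loss to longer operation latencies
# and 1/3 to longer memory latencies.
op_latency_loss = IPC_LOSS * 2 / 3
memory_loss = IPC_LOSS * 1 / 3
print(f"IPC loss from operation latencies: {op_latency_loss:.0%}")
print(f"IPC loss from memory latencies:    {memory_loss:.0%}")
```

Under that assumption, a 15% IPC loss at twice the clock gives a speedup of 2 x 0.85 = 1.70, which matches the 70% performance increase reported on the slide.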