Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine

Leonid Oliker, Future Technologies Group, Computational Research Division, LBNL
www.nersc.gov/~oliker
Sourav Chatterji, Jason Duell, Manikandan Narayanan

Motivation
• Commodity cache-based SMP clusters run at a small percentage of peak for memory-intensive problems (especially irregular ones)
• The "gap" between processor performance and DRAM access time continues to grow (60%/yr vs. 7%/yr)
• Power and packaging are becoming significant bottlenecks
• Better software is improving some problems: ATLAS, FFTW, Sparsity, PHiPAC
• Alternative architectures allow tighter integration of processor and memory
• Can we build HPC systems with high-end media processor technology?
  – VIRAM: PIM technology combines embedded DRAM with a vector coprocessor to exploit its large bandwidth potential
  – Imagine: stream-aware memory supports the large processing potential of SIMD-controlled VLIW clusters

Motivation
• General-purpose processors are badly suited to data-intensive operations
  – Large caches are not useful
  – Low memory bandwidth
  – Superscalar methods of increasing ILP are inefficient
  – High power consumption
• Application-specific ASICs: good, but expensive and slow to design
• Solution: general-purpose "memory-aware" processors
  – Large number of ALUs, to exploit data parallelism
  – Huge memory bandwidth, to keep the ALUs busy
  – Concurrency, to overlap memory access with computation

VIRAM Overview
• MIPS core (200 MHz)
• Main memory system
  – 8 banks with 13 MB of on-chip DRAM
  – Large 6.4 GB/s on-chip peak bandwidth
  – Cache-less
• Vector unit
  – Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  – Single issue, in order
• Low power consumption: 2.0 W
• Peak vector performance: 1.6/3.2/6.4 Gops; 1.6 Gflops (single precision)
• Fabricated by IBM; taped out 02/2003
• Load/store and arithmetic instructions are deeply pipelined (15 stages) to hide DRAM access latency
• We use a simulator with Cray's vcc compiler

VIRAM Vector Lanes
• The parallel lane design has advantages in performance, design complexity, and scalability
• Each lane has 2 ALUs (1 for FP) and receives identical control signals
• Vector instructions specify 64-way parallelism; the hardware executes 8-way
• 8 KB vector register file, partitioned into 32 vector registers
• Variable data widths: 4 lanes for 64-bit data, 8 for 32-bit, 16 for 16-bit
  – Each time the data width is cut in half, the number of elements per register (and the peak rate) doubles
• Limitations: no 64-bit FP, and the compiler does not generate fused MADD

VIRAM Power Efficiency
[Chart: MOPS/Watt on a log scale (0.1–1000) for VIRAM, R10K, P-III, P4, Sparc, and EV6 on the Transitive, GUPS, SPMV, Hist, and Mesh benchmarks]
• Comparable performance at a lower clock rate
• Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model

Stream Processing
• Example: stereo depth extraction
  – Data and functional parallelism
  – High computation rate
  – Little data reuse
  – Producer-consumer and spatial locality
• Other examples: multimedia, signal processing, graphics
• Stream: an ordered set of records (homogeneous, arbitrary data type)
• Stream programming: data is streams, computation is kernels
  – A kernel loops through all stream elements (in sequential order)
  – It performs a compound (multiword) operation on each stream element
  – Vectors, by contrast, perform a single arithmetic operation on each vector element and store the result back in a register (contrast sketched below)
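To make the kernel/vector distinction concrete, here is a minimal sketch in plain C. It is an illustration only: the record type and the operation are assumptions for this example, not code from either platform (Imagine kernels are written in KernelC, and the VIRAM codes here are vectorized C).

    #include <stddef.h>

    typedef struct { float re, im; } Complex;   /* one stream record */

    /* Stream style: the kernel applies a compound (multiword) operation to
       each record in order -- here a complex multiply, i.e. 6 arithmetic
       ops per record, with intermediates kept local to the cluster. */
    void cmul_kernel(const Complex *a, const Complex *b,
                     Complex *out, size_t n) {
        for (size_t k = 0; k < n; k++) {
            out[k].re = a[k].re * b[k].re - a[k].im * b[k].im;
            out[k].im = a[k].re * b[k].im + a[k].im * b[k].re;
        }
    }

    /* Vector style: each instruction applies one arithmetic operation
       across a whole vector, writing the intermediate result back to a
       vector register. */
    void vadd(const float *x, const float *y, float *z, size_t n) {
        for (size_t k = 0; k < n; k++)
            z[k] = x[k] + y[k];
    }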
Imagine Overview
• "Vector VLIW" processor; coprocessor to an off-chip host processor
• The host sends instructions to the stream controller; the SC issues commands to the on-chip modules
• 8 arithmetic clusters controlled in SIMD with VLIW instructions
• Central 128 KB Stream Register File (SRF) @ 32 GB/s
  – The SRF can overlap computation with memory access (double buffering)
  – The SRF can reuse intermediate results (producer-consumer locality)
• Stream-aware memory system with 2.7 GB/s off-chip bandwidth
• 544 GB/s inter-cluster communication

Imagine Arithmetic Clusters
• 400 MHz clock, 8 clusters with 6 functional units each (48 FUs total)
• Clusters read and write streams to the SRF
• Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratchpad, and 1 communication unit
• 32-bit architecture; subword operations support 16- and 8-bit data (no 64-bit support)
• Local registers on the functional units hold 16 words each (1.5 KB total)
• Clusters receive VLIW-style instructions broadcast from the microcontroller

VIRAM and Imagine

                            VIRAM         Imagine
  Memory bandwidth (GB/s)   6.4           2.7 off-chip, 32 SRF
  Peak Flops (32-bit)       1.6 GF/s      20 GF/s
  Peak Flop/Word            1             30 (memory), 2.5 (SRF)
  Speed (MHz)               200           400
  Chip area                 15x18 mm      12x12 mm
  Data widths (bits)        64/32/16      32/16/8
  Transistors               130 x 10^6    21 x 10^6
  Power consumption         2 Watts       10 Watts

• Imagine has an order of magnitude higher peak performance
• VIRAM has twice the memory bandwidth and lower power consumption
• Notice the peak Flop/Word ratios

SQMAT: Architectural Probe
[Chart: % of peak for a 3x3 matrix multiply on VIRAM and Imagine vs. vector/stream length L = 8 to 1024]
• Sqmat: a scalable synthetic probe that controls computational intensity and vector length
• Imagine's stream model requires a large number of ops per word to amortize memory references
  – Poor use of the SRF: no producer-consumer locality
  – Long streams help hide memory latency, but reach only 7% of algorithmic peak
• VIRAM performs well at a low ops/word ratio (40% of peak at L=256)
  – The vector pipeline overlaps computation and memory, and on-chip DRAM provides high bandwidth at low latency

SQMAT: Performance Crossover
[Chart: cycles and MFLOPS for VIRAM and Imagine vs. vector/stream length L = 8 to 1024]
• Large number of ops per word: N^10, where N=3x3
• Crossover points: L=64 (cycles), L=256 (MFlops)
• Imagine's power becomes apparent at long streams: almost 4x VIRAM at L=1024
• Codes at this end of the spectrum benefit greatly from the Imagine architecture

VIRAM/Imagine Optimization
• Optimization strategy: speed up the slower of computation or memory
  – Memory waiting on the ALUs: restructure the computation for better kernel performance
  – ALUs memory-starved: add computation to obtain better memory performance
• Subtle overlap effects: vector chaining, stream double buffering
• Example optimization: RGB→YIQ conversion from EEMBC (data layout sketched after this list)
  – Input format: R1G1B1R2G2B2R3G3B3…
  – Required format: R1R2R3… G1G2G3… B1B2B3…
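A scalar sketch of the data movement involved, assuming a 64-element chunk (one maximum vector length) and an image size that is a multiple of 64; the real VIRAM code performs the unpacking with vector in-register shuffles rather than the scalar index arithmetic shown here.

    #include <stddef.h>
    #include <stdint.h>

    /* Naive layout conversion: three stride-3 walks over interleaved
       RGBRGB... data. On VIRAM, strided loads run at roughly half speed,
       and only 4 address generators serve the 8 addresses needed. */
    void deinterleave_strided(const uint8_t *rgb, uint8_t *r, uint8_t *g,
                              uint8_t *b, size_t npix) {
        for (size_t k = 0; k < npix; k++) {
            r[k] = rgb[3*k + 0];
            g[k] = rgb[3*k + 1];
            b[k] = rgb[3*k + 2];
        }
    }

    /* Optimized idea: load contiguous chunks (unit stride), then extract
       the components in registers -- extra ALU work (packing/unpacking)
       traded for well-behaved memory accesses. Assumes npix % 64 == 0. */
    void deinterleave_unit_stride(const uint8_t *rgb, uint8_t *r, uint8_t *g,
                                  uint8_t *b, size_t npix) {
        uint8_t chunk[3 * 64];                   /* one vector-length chunk */
        for (size_t base = 0; base < npix; base += 64) {
            for (size_t j = 0; j < 3 * 64; j++)  /* unit-stride load */
                chunk[j] = rgb[3*base + j];
            for (size_t k = 0; k < 64; k++) {    /* in-register shuffles
                                                    stand in for this loop */
                r[base + k] = chunk[3*k + 0];
                g[base + k] = chunk[3*k + 1];
                b[base + k] = chunk[3*k + 2];
            }
        }
    }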
VIRAM RGB→YIQ Optimization
• VIRAM: poor memory performance
  – Strided accesses (~1/2 performance): RGBRGBRGB… loaded with stride 3 → RRR… GGG… BBB…
  – Only 4 address generators for 8 addresses (sufficient only for 64-bit data)
  – Word operations on byte data (1/4th performance)
• Optimization: replace strided accesses with unit-stride accesses, using in-register shuffles
  – Increases computational overhead (packing and unpacking)

VIRAM RGB→YIQ Results
[Chart: VIRAM RGB→YIQ integer ops (M/sec), original vs. optimized, on small/medium/large inputs]

                 Kernel (cycles)   Memory (cycles)
  Unoptimized    114               95
  Optimized      108               17
  (chunk size 64)

• Functional units are used instead of the memory system to extract the components, increasing the computational overhead

Imagine RGB→YIQ Optimization
• Imagine's bottleneck is computation, due to a poor ALU schedule
  – Unoptimized: 15 cycles per pixel
• Software pipelining makes the VLIW schedule denser
  – Optimized: 8 cycles per pixel

Imagine RGB→YIQ Results
[Chart: Imagine RGB→YIQ integer ops (M/sec), original vs. optimized, on small/medium/large inputs]

                 Kernel (cycles)   Memory (cycles)
  Unoptimized    2153              1167
  Optimized      1147              1165
  (chunk size 1024, software pipelined)

• The optimized kernel takes only half the cycles per element
• Memory is now the new bottleneck

EEMBC Benchmarks

  Benchmark        Width (VIRAM/Imagine)   Application Area   Remarks
  Vector addition  32/32 bits              Microbenchmark     c[i]=a[i]+b[i]
  RGB→YIQ          32/32 bits              EEMBC Consumer     Color conversion
  RGB→CMYK         16/8 bits               EEMBC Consumer     Color conversion
  Gray filter      16/32 bits              EEMBC Consumer     3x3 convolution
  Autocorrelation  16/32 bits              EEMBC Telecom      Dot product

[Chart: integer ops (G/sec) and 64K-element vector bandwidth (GB/sec) for VIRAM and Imagine on RGB→YIQ, RGB→CMYK, and autocorrelation (speech, pulse)]
• Vector addition: one add per element; performance is limited by the memory system
• RGB→(YIQ, CMYK): VIRAM is limited by processing (cannot use the available bandwidth)
• Gray filter: difficult to implement efficiently on Imagine (sliding 3x3 window)
• Autocorrelation: uses short streams, and Imagine's host latency is high

Scientific Kernels: SPMV Performance

                         VIRAM                     Imagine
  Matrix       Metric    CRS    SegSum   Ellpack   CRS    Streams   Ellpack
  LSHAPE       % Peak    2.8%   7.4%     31%       1.1%   0.8%      1.2%
  (1008 rows,  Cycles    67K    24K      5.6K      40K    48K       38K
  6958 nnz)    MFlop/s   44     118      496       136    114       149
  LARGEDIS     % Peak    3.2%   8.4%     32%       1.5%   0.6%      6.3%
  (10000 rows, Cycles    802K   567K     641K      742K   1840K     754K
  117820 nnz)  MFlop/s   91     135      511       192    77        870

• Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32 ops/cycle
• LSHAPE: finite element matrix; LARGEDIS: pseudo-random nonzeros
• Imagine lacks irregular access, so the matrix is reordered before the KernelC kernel runs (baseline kernel sketched below)
• VIRAM is better suited to this class of applications (low computation per memory access)
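For reference, a minimal CRS (compressed row storage) SpMV in plain C; single precision is assumed (VIRAM has no 64-bit FP), and the array names follow common CRS convention rather than either platform's actual code. The gather x[colind[j]] is the irregular, data-dependent access that the stream model cannot express directly.

    #include <stddef.h>

    /* y = A*x for a sparse matrix A stored in CRS format. */
    void spmv_crs(size_t nrows,
                  const size_t *rowptr,  /* nrows+1 offsets into val/colind */
                  const size_t *colind,  /* column index of each nonzero */
                  const float  *val,     /* nonzero values */
                  const float  *x,       /* dense input vector */
                  float        *y)       /* dense output vector */
    {
        for (size_t i = 0; i < nrows; i++) {
            float sum = 0.0f;
            for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += val[j] * x[colind[j]];  /* irregular gather from x */
            y[i] = sum;
        }
    }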
Scientific Kernels: Complex QR Decomposition

                  VIRAM    Imagine
  % of Peak       34.1%    65.5%
  Total Cycles    5189K    712K
  MFlop/s         546      10480
  (Matrix: MITRE RT_STRAP, 192x96 complex)

• A=QR, with Q orthogonal and R upper triangular; blocked Householder variant, rich in level-3 BLAS operations
• Complex elements increase ops/word and locality (1 complex MUL = 6 ops)
• VIRAM uses a CLAPACK port (insertion of vector directives)
• Imagine requires complex indexing of the matrix stream (each iteration works on a smaller matrix)
• Imagine sustains over 10 GFlop/s (19x VIRAM): this kernel is well suited to the architecture
• Low VIRAM performance is due to strided accesses and compiler limitations

Overview
• The two designs have significantly different balances of memory organization
• Relative performance depends on computational intensity
• Programming complexity is high for both approaches, although VIRAM is based on established vector technology
• For well-suited applications, the Imagine processor can sustain over 10 GFlop/s (simulated results)
• A large amount of homogeneous computation is required to saturate Imagine, while VIRAM can operate on small vector sizes
• Imagine can take advantage of producer-consumer locality
• Both offer a significant reduction in power and space
• Both may be used as coprocessors in future-generation architectures

Next Generation
• CODE: next generation of VIRAM
  – More functional units / faster clock speed
  – Local registers per unit instead of a single register file
  – Looking more like Imagine…
• Multi-VIRAM architecture: network interface issues?
• Brook: new language for Imagine
  – Eliminates exposure of hardware details (# of clusters)
• Streaming Supercomputer: multi-Imagine configuration
  – Streams can be used for functional/data parallelism
• Currently evaluating the DIVA architecture