Short Course on Advanced Topics: from Emerging Media/Network Processors to Internet Computing
Topic 3: Fundamentals of Media Processor Designs

Overview of High-Performance Processors
- Multimedia extensions
- Multiple-issue, out-of-order, dynamic-window processors
- VLIW and vector processors
- Systolic and reconfigurable processors
- Hardwired stream processors
- Thread-level parallelism
Media benchmarks/workloads
- Streaming media processing, sub-word parallelism
- Intel MMX/SSE media extensions
- IA-64 multimedia instructions
Media processors
- IMAGINE: media processing with streams
- IVRAM: extend Intelligent RAM with a vector unit
- Trimedia: the price-performance challenge for media processing

Digital Signal Processing (DSP)
- In the 1970s, DSP in telecommunications required higher performance than the available microprocessors
- Computationally intensive: dominated by vector dot products (multiply, multiply-add)
- Real-time requirements
- Streaming data, high memory bandwidth, simple memory access patterns
- Predictable program flow: nested loops, few branches, large basic blocks
- Sensitivity to numeric error

Early DSPs
- Single-cycle multiplier
- Streamlined multiply-add operation
- Separate instruction/data memories for high memory bandwidth
- Specialized addressing hardware, autoincrement
- Complex instruction set, combining multiple operations in a single instruction
- Special-purpose, fixed-function hardware; lacks flexibility and programmability
- Example: TI TMS32010, 1982

Today's DSPs (from 1995)
- Adopt general-purpose processor design techniques
- Programmability and compatibility
- RISC-like instruction set
- Multiple-issue: VLIW approach, vector SIMD, superscalar, chip multiprocessing
- Easier to program, a better compiler target
- Better compatibility with future architectures
- Example: TI TMS320C62xx family: RISC instruction set, 8-issue VLIW design

General-Purpose Processors
- Notice the growing set of applications (e.g., cellular phones) built on DSP tasks ($6 billion DSP market in 2000)
- Add architectural features to boost performance on common DSP tasks
- Extended multimedia instruction sets, adapted to and integrated with the existing hardware in almost all high-performance microprocessors: Intel MMX/SSE
- New architectures that encompass DSP plus general-purpose processing and exploit high parallelism: Stanford Imagine, etc.
- Future directions? Graphics processors?

Media Processing
- Digital signal processing, 2D/3D graphics rendering, image/audio compression/decompression
- Real-time constraints, high performance density
- Large amounts of data parallelism, latency tolerance
- Streaming data, very little global data reuse
- Computationally intensive: 100-200 arithmetic operations per data element
- Requires efficient hardware mapping onto the algorithm flow: special-purpose media processors
- Or extend the instruction set / hardware of general-purpose processors

Multimedia Applications
- Image/video/audio compression (JPEG/MPEG/GIF/PNG)
- Front end of the 3D graphics pipeline (geometry, lighting)
- High-quality additive audio synthesis (Todd Hodes, UCB): vectorize across oscillators
- Adobe Photoshop, Pixar, Cray X-MP, Stellar, Ardent, Microsoft Talisman, MSP image processing
- Speech recognition: front end (filters/FFTs), phoneme probabilities (neural net), back end (Viterbi/beam search)
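The sub-word parallelism that these media workloads rely on (the MMX/SSE extensions mentioned above) can be made concrete with a small sketch. The example below brightens an 8-bit image with SSE2 saturating adds, processing 16 pixels per instruction; the function and variable names are illustrative, not from the course material, and the scalar tail handles lengths that are not a multiple of 16.

    /* Minimal sketch of sub-word (SIMD) parallelism in the MMX/SSE style:
     * add a constant to 16 unsigned 8-bit pixels per instruction, with
     * saturation instead of wrap-around. Requires SSE2. */
    #include <emmintrin.h>
    #include <stddef.h>

    void brighten_u8(unsigned char *pix, size_t n, unsigned char delta)
    {
        __m128i vdelta = _mm_set1_epi8((char)delta);      /* replicate delta into 16 lanes */
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(pix + i)); /* 16 pixels */
            v = _mm_adds_epu8(v, vdelta);                  /* 16 saturating 8-bit adds */
            _mm_storeu_si128((__m128i *)(pix + i), v);
        }
        for (; i < n; i++) {                               /* scalar tail */
            unsigned int s = pix[i] + delta;
            pix[i] = (unsigned char)(s > 255 ? 255 : s);
        }
    }

One SSE register here carries sixteen 8-bit data elements, which is exactly the performance-density argument the media-extension slides make.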
High-Performance Processors
- Exploit instruction-level parallelism
  - Superscalar, VLIW, vector SIMD, systolic array, etc.
  - Flexible (superscalar, VLIW) vs. regular (vector, systolic)
  - Data communication: through registers (VLIW, vector) vs. forwarding (superscalar, systolic, vector chaining)
  - Ratio of computation to memory accesses; data reuse ratio
  - Hardware (superscalar) vs. software (VLIW, vector, systolic) to discover ILP
- Exploit thread-level parallelism
  - Parallel computation (programming) model: streaming, macro-dataflow, SPMD, etc.
  - Data communication and data sharing behavior
  - Multiprocessor synchronization requirements

Instruction-Level Parallelism
    for (I = 1000; I > 0; I--) X[I] = X[I] + S;

    Loop: load  F0,0(R1)
          add   F4,F0,F2
          store F4,0(R1)
          addui R1,R1,#-8
          bne   R1,R2,Loop
- Limited instruction-level parallelism (ILP): the load-add-store chain is serialized by data dependences, and the branch imposes control dependence
- Data dependence: true (RAW), anti (WAR), output (WAW)
- Control dependence: determines program flow

Dynamic Out-of-Order Execution
- In-order fetch/issue, out-of-order execution, in-order completion: maintains precise interrupts
- A reorder buffer holds the results of uncommitted instructions; registers are renamed to reorder-buffer entries to drive dependent instructions
- Instructions commit in order: removed from the reorder buffer, results written to the architectural registers
- Memory disambiguation
- Discovers ILP dynamically: flexible, costly, well suited to integer programs
  [Figure: Tomasulo-style datapath with FP operation queue, FP registers, reservation stations, FP adders, and reorder buffer]

Fetch / Issue Unit
- Instruction fetch (with branch prediction) supplies a stream of instructions to the out-of-order execution unit, which feeds back correctness information on branch results
- Must fetch beyond branches: branch prediction
- Must feed the execution unit at high bandwidth: trace cache
- Must utilize instruction/trace cache bandwidth: next-line prediction
- Instruction fetch is decoupled from execution; the issue logic (plus rename) is often included with the fetch unit
- Needs efficient (1-cycle) broadcast + wakeup + schedule logic for dependent-instruction scheduling

Superscalar Out-of-Order Execution
- Same loop as above; the hardware removes dependences at run time
- Branch prediction resolves the control dependence on the bne
- Register renaming (of R1, F0, and the other destination registers) removes name dependences across iterations
- Hardware discovers the ILP: the most flexible approach

VLIW Approach – Static Multiple Issue
- Wide instructions hold multiple independent operations
- Loop unrolling, procedure inlining, trace scheduling, etc. enlarge basic blocks
- The compiler discovers independent operations and packs them into long instructions
- Difficulties:
  - Code size: clever encoding
  - Lock-step execution: hardware can allow unsynchronized execution
  - Binary code compatibility: object-code translation
- Compiler techniques to improve ILP; compiler optimization with hardware support
- Better suited to applications with predictable control flow: media / signal processing

VLIW Approach – Example (the loop above unrolled seven times and scheduled into nine wide instructions)

    Memory Ref 1        Memory Ref 2        FP Operation 1    FP Operation 2    Integer/Branch
    Load  F0,0(R1)      Load  F6,-8(R1)
    Load  F10,-16(R1)   Load  F14,-24(R1)
    Load  F18,-32(R1)   Load  F22,-40(R1)   Add F4,F0,F2      Add F8,F6,F2
    Load  F26,-48(R1)                       Add F12,F10,F2    Add F16,F14,F2
                                            Add F20,F18,F2    Add F24,F22,F2
    Store F4,0(R1)      Store F8,-8(R1)     Add F28,F26,F2
    Store F12,-16(R1)   Store F16,-24(R1)                                       Addui R1,R1,-56
    Store F20,24(R1)    Store F24,16(R1)
    Store F28,8(R1)                                                             Bne R1,R2,Loop

Vector Processor
- Single instruction, multiple data: exploits regular data parallelism; less flexible than VLIW
- Highly pipelined, tolerates memory latency
- Requires high memory bandwidth (cache-less)
- Better suited to large scientific applications with heavy loop structures; also good for media applications
- Dynamic vector chaining, compound instructions
- Example, after vector loop blocking:
      Vload  V1,0(R1)
      Vadd   V2,V1,F2
      Vstore V2,0(R1)
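To make the compiler's role in the VLIW and vector approaches concrete, here is a minimal C sketch of the transformation behind the example above: the original loop and a 4-way unrolled version whose adds are independent and can therefore be packed into wide instructions (a vectorizing compiler performs the analogous strip-mining). The function names and the 0-based indexing are illustrative adaptations of the loop on the ILP slide.

    /* Scalar loop and a 4-way unrolled version exposing independent adds. */
    #define N 1000

    void add_scalar(double *x, double s)
    {
        for (int i = N - 1; i >= 0; i--)      /* one add per iteration */
            x[i] = x[i] + s;
    }

    void add_unrolled(double *x, double s)
    {
        int i;
        for (i = N - 1; i >= 3; i -= 4) {     /* four independent adds per iteration */
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
        for (; i >= 0; i--)                   /* cleanup when N is not a multiple of 4 */
            x[i] = x[i] + s;
    }

The four adds in the unrolled body have no dependences on one another, which is what lets the VLIW scheduler (or a vector unit) execute them in parallel.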
Systolic Array, Reconfigurable Processor
- Systolic array: fixed function, fixed wiring
  [Figure: operands streamed from memory (0(R1), 8(R1), ...) flow with F2 through a pipelined adder and back to memory]
- Avoids register communication; inflexible
- Reconfigurable hardware: MIT Raw, Stanford Smart Memories
  - A general-purpose engine is of limited use for media applications, while fixed-function, fixed-wire hardware is too restrictive
  - Reconfigurable hardware gives the compiler programmable interconnections and system structure to suit the application
  - Exploits thread-level parallelism

Thread-Level Parallelism
- Many applications, such as database transactions, scientific computations, server applications, etc., exhibit high-level (thread-level) parallelism
- Two basic approaches:
  - Execute each thread on a separate processor: the traditional parallel-processing approach
  - Execute multiple threads on a single processor: duplicate per-thread state (PC, registers, etc.) but share the functional units, memory hierarchy, etc., minimizing thread-switching cost compared with a full context switch
- Thread switching: coarse-grained vs. fine-grained
- Simultaneous multithreading (SMT): thread-level and instruction-level parallelism are exploited at the same time, with multiple threads issuing in the same cycle (a small threaded-loop sketch follows the Summary below)

Simultaneous Multithreading
Simultaneous multithreading is a processor design that combines hardware multithreading with superscalar processor technology to allow multiple threads to issue instructions each cycle. Unlike other hardware multithreaded architectures (such as the Tera MTA), in which only a single hardware context (i.e., thread) is active on any given cycle, SMT permits all thread contexts to compete simultaneously for and share processor resources. Unlike conventional superscalar processors, which suffer from a lack of per-thread instruction-level parallelism, simultaneous multithreading uses multiple threads to compensate for low single-thread ILP. The performance consequence is significantly higher instruction throughput and program speedups on a variety of workloads, including commercial databases, web servers, and scientific applications, in both multiprogrammed and parallel environments.

Comparison of Multithreading
  [Figure: issue-slot utilization over time for a superscalar, coarse-grained MT, fine-grained MT, and SMT processor]

Performance of SMT
- SMT shows better performance than the superscalar baseline; however, the threads contend for the caches

Summary
- Application-driven architecture studies
- Media applications: computationally intensive, abundant parallelism, predictable control flow, real-time constraints; memory intensive with streaming data access; 8-, 16-, 24-bit data structures
- Suitable architectures: dynamically scheduled, out-of-order processors are inefficient and overkill; use VLIW, vector, or reconfigurable processors, or exploit subword parallelism on general-purpose processors, with special handling of memory access
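As a companion to the thread-level parallelism discussion above, here is a minimal sketch of the same x[i] = x[i] + s loop split across POSIX threads; an SMT or multiprocessor machine runs the workers concurrently. NTHREADS, the struct, and the function names are illustrative choices, not part of the course material.

    /* Thread-level parallelism sketch: partition the array across worker
     * threads; each worker owns a disjoint index range. Compile with -pthread. */
    #include <pthread.h>
    #include <stddef.h>

    #define NTHREADS 4

    struct chunk { double *x; size_t lo, hi; double s; };

    static void *worker(void *arg)
    {
        struct chunk *c = (struct chunk *)arg;
        for (size_t i = c->lo; i < c->hi; i++)   /* no sharing between workers */
            c->x[i] += c->s;
        return NULL;
    }

    void add_parallel(double *x, size_t n, double s)
    {
        pthread_t tid[NTHREADS];
        struct chunk c[NTHREADS];
        size_t step = (n + NTHREADS - 1) / NTHREADS;

        for (int t = 0; t < NTHREADS; t++) {
            c[t].x = x; c[t].s = s;
            c[t].lo = (size_t)t * step;
            c[t].hi = (c[t].lo + step < n) ? c[t].lo + step : n;
            pthread_create(&tid[t], NULL, worker, &c[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);          /* barrier: wait for all workers */
    }

Because each worker touches a disjoint range, no synchronization is needed beyond the final pthread_join barrier, matching the data-sharing and synchronization concerns listed on the high-performance-processor slide.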