Warp Processors
Frank Vahid (Task Leader)
Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001, July 2005 – June 2008
Ph.D. students: Greg Stitt (Ph.D. expected June 2007), Ann Gordon-Ross (Ph.D. expected June 2007), David Sheldon (Ph.D. expected 2009), Ryan Mannion (Ph.D. expected 2009), Scott Sirowy (Ph.D. expected 2010)
Industrial Liaisons: Brian W. Einloth (Motorola); Serge Rutman, Dave Clark, Darshan Patra (Intel); Jeff Welser, Scott Lekuch (IBM)

Task Description
- Warp processing background: invisibly move binary regions from a microprocessor to an FPGA; 10x speedups or more, energy gains too
- Task – mature warp technology
  - Years 1/2: automatic high-level construct recovery from binaries; in-depth case studies (with Freescale); warp-tailored FPGA prototype (with Intel)
  - Years 2/3: reduce the memory bottleneck by using a smart buffer; investigate domain-specific-FPGA concepts (with Freescale); consider desktop/server domains (with IBM)

Microprocessors plus FPGAs
- Speedups of 10x–1000x in embedded, desktop, and supercomputing
- More and more platforms pair a uP with an FPGA: Xilinx Virtex II Pro (source: Xilinx), Altera Excalibur (source: Altera), Cray XD1 (source: FPGA Journal, Apr. '05), plus Xilinx, Altera, Cray, SGI, Mitrionics, and IBM Cell (research)

"Traditional" Compilation for uP/FPGAs
[Figure: non-standard tool flow — a specialized language or compiler feeds a software path (compiler, linker, binary for the uP) and a hardware path (synthesis, bitstream for the FPGA).]
- Requires a specialized language or compiler: SystemC, NapaC, HandelC, Spark, ROCCC, CatapultC, Streams-C, DEFACTO, …
- Commercial success still limited – software developers are reluctant to change languages/tools – but still very promising

Warp Processing – "Invisible" Synthesis
[Figure: the standard software tool flow (compiler, linker, binary for the uP) is kept; compilation is moved before synthesis, and the binary is decompiled and synthesized to an FPGA bitstream, producing an updated binary.]
- 2002 – Sought to make synthesis more "invisible"; began the "Synthesis from Binaries" project

Warp Processing – Dynamic Synthesis
- Obtained circuits were competitive
- 2003: Runtime?
- Benefits
  - Like binary translation (x86 to VLIW), but more aggressive
  - Language/tool independent; library code OK; portable binaries; dynamic optimizations
  - FPGA becomes transparent performance hardware, like memory
  - A warp processor looks like a standard uP but invisibly synthesizes hardware

Warp Processing Background: Basic Idea
[Architecture: uP with instruction memory (I Mem), data cache (D$), profiler, FPGA, and on-chip CAD (dynamic partitioning module, DPM).]
- Step 1: Initially, the software binary is loaded into instruction memory:
        Mov reg3, 0
        Mov reg4, 0
  loop: Shl reg1, reg3, 1
        Add reg5, reg2, reg1
        Ld  reg6, 0(reg5)
        Add reg4, reg4, reg6
        Add reg3, reg3, 1
        Beq reg3, 10, -5
        Ret reg4
- Step 2: The microprocessor executes the instructions in the software binary.
- Step 3: The profiler monitors the executing instructions and detects critical regions in the binary (critical loop detected).
- Step 4: The on-chip CAD reads in the critical region.
- Step 5: The on-chip CAD converts the critical region into a control/data flow graph (CDFG):
        reg3 := 0
        reg4 := 0
  loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]
        reg3 := reg3 + 1
        if (reg3 < 10) goto loop
        ret reg4
- Step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit, e.g., a tree of adders.
- Step 7: The on-chip CAD maps the circuit onto the FPGA (CLBs and switch matrices).
- Step 8: The on-chip CAD replaces the critical-region instructions in the binary with instructions that interact with the FPGA, causing performance and energy to "warp" by an order of magnitude or more (software-only vs. "warped" time and energy).
- Feasible for repeating or long-running applications.
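To make Steps 5 and 6 concrete, here is a minimal C sketch — an illustration, not the warp tools' actual output — of what the recovered loop computes and of the unrolled, tree-shaped form that the synthesized parallel circuit effectively implements; the array contents are made up for the example.

```c
#include <stdio.h>

/* Step 5 (decompiled view): the critical loop recovered from the binary
   sums ten 2-byte elements starting at the address held in reg2.        */
static int sum_loop(const short a[10]) {
    int sum = 0;
    for (int i = 0; i < 10; i++)   /* reg3 is the induction variable */
        sum += a[i];               /* mem[reg2 + (reg3 << 1)]        */
    return sum;
}

/* Step 6 (synthesized view): the same computation restructured as a
   balanced tree of additions, which maps to parallel adders on the FPGA. */
static int sum_tree(const short a[10]) {
    int s01 = a[0] + a[1], s23 = a[2] + a[3];
    int s45 = a[4] + a[5], s67 = a[6] + a[7];
    int s89 = a[8] + a[9];
    return ((s01 + s23) + (s45 + s67)) + s89;   /* depth ~4 instead of 10 */
}

int main(void) {
    short a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    printf("%d %d\n", sum_loop(a), sum_tree(a));  /* both print 55 */
    return 0;
}
```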
[Task Description roadmap slide repeated]

Synthesis from Binaries can be Surprisingly Competitive
- With aggressive decompilation: previous techniques, plus newly-created ones
[Chart: speedup from C source vs. from the binary for benchmarks including an FIR filter, a beamformer, Viterbi, IDCTRN01, PNTRCH01, and BITMNP01, plus their average — only a small difference in speedup between the two.]

Decompilation is Effective Even with High Compiler-Optimization Levels
- Do compiler optimizations generate binaries that are harder to decompile effectively?
- (Surprisingly) found the opposite – optimized code was even better
[Chart: average speedup over 10 examples for MIPS, ARM, and MicroBlaze binaries compiled at -O1 and -O3.]
- Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
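One hypothetical illustration (not taken from the study) of why optimized binaries can remain decompilable: an optimizing compiler often strength-reduces array indexing into pointer increments, but the access pattern stays linear, so the array-recovery step described later can still rebuild an array-based loop.

```c
/* Roughly the shape an optimized binary might take after strength reduction:
   the array index is gone, replaced by a pointer that advances by the element
   size each iteration -- yet the access pattern is still linear.             */
int sum_optimized(const short *a) {
    int sum = 0;
    for (const short *p = a; p != a + 10; p++)
        sum += *p;
    return sum;
}

/* What the decompiler aims to recover from such a binary: an explicit loop
   over an array of ten 2-byte elements, which synthesizes cleanly.          */
int sum_recovered(const short a[10]) {
    int sum = 0;
    for (int i = 0; i < 10; i++)
        sum += a[i];
    return sum;
}
```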
[Task Description roadmap slide repeated]

Several-Month Study with Freescale
- Optimized H.264 decoder: proprietary code, different from the reference code (roughly 10x faster), about 16,000 lines
- ~90% of execution time spread over 45 distinct functions, rather than concentrated in 2-3

  Function Name                     Instr   %Time    Cumulative Speedup
  MotionComp_00                        33    6.76%    1.1
  InvTransform4x4                      63   12.53%    1.1
  FindHorizontalBS                     47   16.68%    1.2
  GetBits                              51   20.78%    1.3
  FindVerticalBS                       44   24.70%    1.3
  MotionCompChromaFullXFullY           24   28.61%    1.4
  FilterHorizontalLuma                557   32.52%    1.5
  FilterVerticalLuma                  481   35.84%    1.6
  FilterHorizontalChroma              133   38.96%    1.6
  CombineCoefsZerosInvQuantScan        69   42.02%    1.7
  memset                               20   44.87%    1.8
  MotionCompensate                    167   47.66%    1.9
  FilterVerticalChroma                121   50.32%    2.0
  MotionCompChromaFracXFracY           48   52.98%    2.1
  ReadLeadingZerosAndOne               56   55.58%    2.3
  DecodeCoeffTokenNormal               93   57.54%    2.4
  DeblockingFilterLumaRow             272   59.42%    2.5
  DecodeZeros                          79   61.29%    2.6
  MotionComp_23                       279   62.96%    2.7
  DecodeBlockCoefLevels                56   64.57%    2.8
  MotionComp_21                       281   66.17%    3.0
  FindBoundaryStrengthPMB              44   67.66%    3.1

Several-Month Study with Freescale (results)
[Chart: speedup from high-level synthesis vs. speedup from binary synthesis as the number of functions moved to hardware grows from 1 to 51.]
- Binary synthesis is competitive with high-level synthesis
- Publication: Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. CODES/ISSS, Sep. 2005.

However – Ideal Speedup Much Larger
[Chart: ideal speedup (zero-time hardware execution) vs. speedup from high-level synthesis and from binary synthesis, as a function of the number of functions in hardware.]
- Large difference between the ideal speedup and the actual speedup
- How can both approaches be brought closer to the ideal? (An unanticipated sub-task.)

C-Level Coding Guidelines
- Are there simple coding guidelines that improve synthesized hardware? Orthogonal to the high-level vs. binary issue
- Studied dozens of embedded applications and identified bottlenecks: memory bandwidth, use of pointers, software algorithms
- Defined ~10 basic guidelines (e.g., avoid function pointers, use constants, …); a small illustrative rewrite follows below
[Charts: performance and size overhead of software rewritten with the guidelines vs. the original software for crc, fir, brev, g3fax, jpeg, and mpeg2; speedups after rewrite (high-level and binary) move much closer to the ideal zero-time-hardware speedup.]
- Publication: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
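To give the flavor of these guidelines, here is a hypothetical before/after pair (not code from the study): replacing a function-pointer call with a direct call to a statically known kernel, and turning a runtime tap count into a constant, gives the synthesis tool a fixed call target and fixed loop bounds to unroll into parallel hardware.

```c
#define TAPS 4                      /* guideline: use constants */

/* Before: the kernel arrives through a function pointer and the tap count
   is a runtime variable -- hard for a synthesis tool to resolve.          */
int filter_before(int (*kernel)(const int *, int), const int *x, int taps) {
    return kernel(x, taps);
}

/* After: a direct call to a known function with a constant tap count, so
   the tool can inline it and fully unroll the loop into parallel hardware. */
static int fir_kernel(const int *x) {
    static const int c[TAPS] = {1, 2, 3, 4};   /* constant coefficients */
    int acc = 0;
    for (int i = 0; i < TAPS; i++)
        acc += c[i] * x[i];
    return acc;
}

int filter_after(const int *x) {
    return fir_kernel(x);
}
```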
[Task Description roadmap slide repeated]

Warp-Tailored FPGA Prototype
- One-year effort: developed an FPGA fabric tailored to the fast, small-memory on-chip CAD
- Bi-weekly phone meetings for 5 months, plus a several-day visit to Intel
- Created synthesizable VHDL models in Intel's shuttle tool flow, in 0.13-micron technology; simulated and verified at post-layout
- (Unfortunately, Intel cancelled the entire shuttle program just before our tapeout)
[Figure: the warp-tailored configurable logic fabric — CLBs with two LUTs (inputs a–f, outputs o1–o4), dedicated DADG, LCH, and 32-bit MAC blocks, and switch matrices with short (0–3) and long (0L–3L) routing channels.]

[Task Description roadmap slide repeated]

Smart Buffers
- State-of-the-art FPGA compilers use several advanced methods, e.g., ROCCC – the Riverside Optimizing Compiler for Configurable Computing [Guo, Buyukkurt, Najjar, LCTES 2004]
- The compiler analyzes memory access patterns and determines the size of the access window and the stride
- It creates a custom, self-updating smart buffer that "pushes" data into the datapath
- Helps alleviate the memory bottleneck problem
[Figure: block RAM feeds an input address generator and smart buffer, which feed the datapath; results flow through a write buffer and output address generator back to block RAM; a task trigger starts the computation.]

Smart Buffers: FIR example
  void fir() {
      for (int i = 0; i < 50; i++) {
          B[i] = C0*A[i] + C1*A[i+1] + C2*A[i+2] + C3*A[i+3];
      }
  }
- Iteration windows overlap: the 1st iteration uses A[0]..A[3], the 2nd uses A[1]..A[4], the 3rd uses A[2]..A[5], and so on.
- The smart buffer keeps the overlapping elements and reads only the new element from memory each iteration; an element is "killed" (evicted) once no later iteration needs it (e.g., A[0] after the 1st iteration).
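A minimal software model of this windowing behavior — an illustration of the idea, not ROCCC's implementation — in which a 4-element window is shifted each iteration so that, after the initial fill, only one new element is fetched from memory per output:

```c
#include <stdio.h>

#define N     50
#define TAPS   4

/* FIR with an explicit sliding window, mimicking what a smart buffer does in
   hardware: after the first window is filled, each iteration reuses three
   buffered values and fetches only one new element of A.                    */
void fir_smart_buffer(const int A[N + TAPS - 1], int B[N]) {
    const int C[TAPS] = {1, 2, 3, 4};     /* C0..C3 coefficients */
    int win[TAPS];

    for (int k = 0; k < TAPS; k++)        /* initial window fill: A[0..3] */
        win[k] = A[k];

    for (int i = 0; i < N; i++) {
        B[i] = C[0]*win[0] + C[1]*win[1] + C[2]*win[2] + C[3]*win[3];

        /* Slide the window: "kill" win[0], shift, read one new element. */
        for (int k = 0; k < TAPS - 1; k++)
            win[k] = win[k + 1];
        if (i + 1 < N)
            win[TAPS - 1] = A[i + TAPS];  /* the only memory read this iteration */
    }
}

int main(void) {
    int A[N + TAPS - 1], B[N];
    for (int i = 0; i < N + TAPS - 1; i++) A[i] = i;
    fir_smart_buffer(A, B);
    printf("B[0]=%d B[1]=%d\n", B[0], B[1]);   /* 20, then 30 */
    return 0;
}
```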
Recovering Arrays from Binaries
- Smart buffers need arrays and their memory access patterns, so these must be recovered from the binary
- Approach: search loops for memory accesses with linear patterns (other access patterns, e.g., array[i*i], are possible but rare)
- Array bounds are determined from loop bounds and induction variables

Recovery of Arrays
- Determine the induction variable (reg3) and find the array address calculations, e.g., address = reg2 + (reg3 << 1) feeding a memory read
- The element size is given by the shift or multiplication amount; reg2 corresponds to the array base address, found from reg2's definition
- Array bounds are determined from the loop bounds
- Recovered code for the running example:
      long array[10];
      for (reg3 = 0; reg3 < 10; reg3++)
          reg4 += array[reg3];
  (A small illustrative sketch of this pattern matching appears after the results tables below.)

Recovery of Arrays: Multiple Dimensions
- Multidimensional recovery is more difficult: array[i][j] can be implemented many ways, e.g., computing addr = base + i*element_size*width + j*element_size entirely in the inner loop, or hoisting base + i*element_size*width into the outer loop and adding j*element_size in the inner loop
- Use heuristics to find the row-major-ordering (RMO) calculations; compilers can implement RMO in many ways, and it is hard to check every possible way, so check for common possibilities
- Success depends on how much the compiler could optimize the application; so far multidimensional arrays were recovered for all but one example, with success on dozens of benchmarks
- The bounds of each array dimension are determined from the bounds of the inner and outer loops

Experimental Setup
- Two experiments: (1) compare binary synthesis with and without smart buffers; (2) compare synthesis from the binary and from the C-level source, both with smart buffers
- Flow: C code → gcc -O1 → software binary (ARM) → UCR decompilation tool (30,000 lines of C code; outputs decompiled C code) → ROCCC → netlist (smart buffer, datapath, controller)
- Synthesized from C using ROCCC and Xilinx tools, targeting a Xilinx XC2V2000 FPGA

Binary Synthesis with and without Smart Buffers

  Example          Without smart buffers           With smart buffers              Speedup
                   Cycles    Clock    Time         Cycles    Clock    Time
  bit_correlator      258      118       2.2          258      118       2.2        1.0
  fir                 577      125       4.6          129      125       1.0        4.5
  udiv8               281      190       1.5          281      190       1.5        1.0
  prewitt          172086      123    1399.1        64516      123     524.5        2.7
  mf9                8194       57     143.0          258       57       4.5       31.8
  moravec          969264       66   14663.6       195072       66    2951.2        5.0
  Average                                                                            7.6
  (Clock in MHz; Time in microseconds.)

- Used examples from past ROCCC work
- Smart buffers give significant speedups, showing how critical the memory bottleneck problem is

Synthesis from Binary versus from Original C (ROCCC vs. gcc -O1, decompile, ROCCC)

  Example          Synthesis from C code               Synthesis from binary               %Time improv.   %Area overhead
                   Cycles    Clock   Time    Area      Cycles    Clock   Time    Area
  bit_correlator      258      118   2.19      15         258      118   2.19      15       0%              0%
  fir                 129      125   1.03     359         129      125   1.03     371       0%              3%
  udiv8               281      190   1.48     398         281      190   1.48     398       0%              0%
  prewitt           64516      123    525    2690       64516      123    525    4250       0%             58%
  mf9                 258       57    4.5    1048         258       57    4.5    1048       0%              0%
  moravec          195072       66   2951     680      195072       70   2791     676      -6%             -1%
  Average                                                                                   -1%             10%

- Synthesis from C vs. from the binary gives nearly the same results; one example is even better from the binary (due to gcc optimization)
- The area overhead is due to strength-reduced operators and extra registers
- Publication: Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, W. Najjar. ACM/SIGDA Symp. on Field-Programmable Gate Arrays (FPGA), Feb. 2005, pp. 118-124.
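Picking up the forward reference above, here is a tiny sketch of the single-dimension recovery rule — an illustration of the idea, not the decompiler's actual code — in which the element size comes from the shift amount in the address expression base + (i << shift) and the element count from the loop bounds; the base address used below is just a placeholder.

```c
#include <stdio.h>

/* A recovered array candidate: base address, element size, element count. */
struct ArrayCandidate {
    unsigned base;       /* from the definition of the base register (e.g., reg2) */
    unsigned elem_size;  /* from the shift/multiply amount in the address calc    */
    unsigned length;     /* from the loop bounds of the induction variable        */
};

/* Derive a candidate from a loop "for (i = lo; i < hi; i++)" whose body
   reads memory at address  base + (i << shift).                          */
struct ArrayCandidate recover_array(unsigned base, unsigned shift,
                                    int lo, int hi) {
    struct ArrayCandidate a;
    a.base      = base;
    a.elem_size = 1u << shift;
    a.length    = (unsigned)(hi - lo);
    return a;
}

int main(void) {
    /* The running example: loop bounds 0..10, address reg2 + (reg3 << 1). */
    struct ArrayCandidate a = recover_array(0x1000, 1, 0, 10);
    printf("array at 0x%x: %u elements of %u bytes\n",
           a.base, a.length, a.elem_size);   /* 10 elements of 2 bytes */
    return 0;
}
```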
[Task Description roadmap slide repeated]

Domain-Specific FPGA
- Question: to what extent can customizing the FPGA fabric impact delay and area?
- Relevant for FPGA fabrics forming part of an ASIC or SoC, for sub-circuits subject to change
- Used VPR (Versatile Place & Route) for Xilinx Spartan-like fabrics; varied LUT sizes, LUTs per CLB, and switch-matrix parameters
- Pseudo-exhaustive exploration on 9 MCNC circuit benchmarks (a sketch of such an exploration loop follows below)
- Pareto points show interesting delay/area tradeoffs
[Chart: delay vs. area for the explored fabric configurations on the dsip benchmark.]

Domain-Specific FPGA (results)
- Compared a customized fabric to the best "average" fabric; three experiments: delay only, area only, delay*area
- Delay – up to 50% gain, at a cost in area
- Area – up to 60% gain, plus delay benefits
- Benefits are understated – the average is over these 9 benchmarks only, not the larger set for which off-the-shelf FPGA fabrics are designed
[Charts: customized delay vs. the best average-delay fabric, and customized area vs. the best average-area fabric, for C7552, bigkey, clmb, dsip, mm30a, mm4a, s15850, s38417, and s38584.]
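The pseudo-exhaustive exploration can be pictured with a toy sketch like the following, under assumed parameter ranges; in the real study each configuration would be evaluated by running VPR on the benchmark, whereas evaluate_fabric() here is only a stand-in cost model.

```c
#include <stdio.h>

struct Fabric { int lut_size, luts_per_clb, sm_flex; double delay, area; };

/* Stand-in for running place & route (e.g., VPR) on a benchmark with the
   given fabric parameters and reporting critical-path delay and area.    */
static void evaluate_fabric(struct Fabric *f) {
    f->delay = 100.0 / f->lut_size + 20.0 / f->luts_per_clb + 5.0 * f->sm_flex;
    f->area  = 2.0 * f->lut_size * f->luts_per_clb + 3.0 * f->sm_flex;
}

int main(void) {
    struct Fabric pts[64];
    int n = 0;

    /* Pseudo-exhaustive sweep over LUT size, LUTs per CLB, and switch-matrix flexibility. */
    for (int lut = 3; lut <= 6; lut++)
        for (int per_clb = 1; per_clb <= 4; per_clb++)
            for (int flex = 1; flex <= 4; flex++) {
                struct Fabric f = {lut, per_clb, flex, 0.0, 0.0};
                evaluate_fabric(&f);
                pts[n++] = f;
            }

    /* Keep only Pareto-optimal points: no other point is better in both delay and area. */
    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n && !dominated; j++)
            if (pts[j].delay < pts[i].delay && pts[j].area < pts[i].area)
                dominated = 1;
        if (!dominated)
            printf("LUT=%d, LUTs/CLB=%d, flex=%d: delay=%.1f area=%.1f\n",
                   pts[i].lut_size, pts[i].luts_per_clb, pts[i].sm_flex,
                   pts[i].delay, pts[i].area);
    }
    return 0;
}
```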
[Task Description roadmap slide repeated]

Consider Desktop/Server Domains
- Investigated warp processing for SPEC benchmarks, but found little speedup from hw/sw partitioning, due to data structures, file I/O, library functions, …
- Server benchmark: studied the Apache server – too disk intensive, could not attain significant speedups
- Multiprocessing benchmarks: a promising direction for warp processing

Multiprocessing Platforms Running Multiple Threads – Use Warp Processing to Synthesize Thread Accelerators on FPGA
- Example: function a( ) creates ten threads running b( ):
      for (i = 0; i < 10; i++)
          createThread( b );
  (A plain-C threading sketch appears below, after the Cell discussion.)
- With two uPs, the OS can schedule only 2 threads; the remaining 8 threads are placed in a thread queue
- The warp tools create custom accelerators for b( ) on the FPGA, and the OS schedules 4 more threads onto those accelerators
- The profiler then detects the performance-critical loop in b( ), and the warp tools create larger/faster accelerators

Warp Processing to Synthesize Thread Accelerators on FPGA
- Created a simulation framework (>10,000 lines of code, plus SimpleScalar)
- Applications must be long-running (e.g., scientific apps running for days) or repeating for the synthesis times to be acceptable
- Multi-threaded warp is 120x faster than a 4-uP (ARM) system
[Chart: speedup over the 4-uP system for 4-, 8-, 16-, 32-, and 64-uP systems and for warp, on benchmarks including fir, prewitt, moravec, and wavelet, with average and geometric mean.]

Multiprocessor Warp Processing – Additional Benefits due to Custom Communication
- A NoC (network on a chip) provides communication between multiple cores
- Problem: the best topology is application dependent (a bus suits one application, a mesh another)
- Warp processing can dynamically choose the topology
- Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign ("amoebic computing")

Warp Processing Enables Expandable Logic
- Analogy with expandable RAM: the system detects the amount of RAM during start-up and uses it to improve performance invisibly
- Expandable logic: the warp tools detect the amount of FPGA during start-up and invisibly adapt the application to use less or more hardware
- Planning a MICRO submission

Expandable Logic (results)
[Chart: speedup for software-only and for 1, 2, 3, and 4 FPGAs on N-Body, 3DTrans, Prewitt, and Wavelet.]
- Used our simulation framework; large speedups – 14x to 400x (on scientific apps)
- Different apps require different amounts of FPGA; expandable logic allows customization of a single platform
- The user selects the required amount of FPGA; no need to recompile/synthesize

Current/Future: IBM's Cell and FPGAs
- Investigating the use of FPGAs to supplement Cell
- Q: Can Cell-aware code be migrated to an FPGA for further speedups?
- Q: Can multithreaded Cell-unaware code be compiled to a Cell/FPGA hybrid for better speedups than Cell alone?
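Both the multiprocessor warp scenario above and the Cell/FPGA question start from ordinary multithreaded code with no accelerator awareness. A POSIX-threads version of the deck's createThread(b) example might look like the sketch below (hypothetical host code; in a warp system the OS and warp tools, not the programmer, decide which b() threads run on FPGA accelerators).

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 10

/* Worker thread: the performance-critical function b() from the example.
   The programmer writes plain software; a warp system would profile it and
   transparently synthesize FPGA accelerators for its critical loop.        */
static void *b(void *arg) {
    long id = (long)arg;
    volatile long acc = 0;
    for (long i = 0; i < 1000000; i++)   /* stand-in for b()'s critical loop */
        acc += i;
    printf("thread %ld done (%ld)\n", id, (long)acc);
    return NULL;
}

/* Function a(): creates the 10 worker threads, as in the slides. */
int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, b, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```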
Current/Future: Distribution Format for Clever Circuits for FPGAs?
- Code written for a microprocessor doesn't always synthesize into the best circuit
- Designers create clever circuits to implement algorithms (dozens of publications yearly, e.g., at FCCM)
- Can those algorithms be captured in a high-level format suitable for compilation to a variety of platforms – with a big FPGA, a small FPGA, or none at all?
- NSF project; overlaps with the SRC warp processing project

Industrial Interactions – Year 2/3
- Freescale
  - Research visit: F. Vahid to Freescale, Chicago, Spring '06; talk and full-day research discussion with several engineers
  - Internships: Scott Sirowy, summer 2006 in Austin (also 2005)
- Intel
  - Chip prototype: participated in Intel's research shuttle to build a prototype warp FPGA fabric – continued bi-weekly phone meetings with Intel engineers, a visit to Intel by PI Vahid and R. Lysecky (now a professor at UofA), and a several-day visit to Intel by Lysecky to simulate the design, ready for tapeout; June '06 – Intel cancelled the entire shuttle program as part of larger cutbacks
  - Research discussions via email with liaison Darshan Patra (Oregon)
- IBM
  - Internships: Ryan Mannion, summer and fall 2006 in Yorktown Heights; Caleb Leak being considered for summer 2007
  - Platform: IBM's Scott Lekuch and Kai Schleupen made a 2-day visit to UCR to set up a Cell development platform having FPGAs
  - Technical discussion: numerous ongoing email and phone interactions with S. Lekuch regarding our research on the Cell/FPGA platform
- Several interactions with Xilinx as well

Patents
- "Warp Processing" patent: filed with the USPTO in summer 2004; several actions since; still pending
- SRC has a non-exclusive royalty-free license

Year 1/2 Publications
- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored with Freescale.)
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross, F. Vahid. IEEE Trans. on Computers, Special Issue – Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky, F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt, F. Vahid. Design Automation and Test in Europe (DATE), March 2005.

Year 2/3 Publications
- Binary Synthesis. G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2007 (to appear).
- Integrated Coupling and Clock Frequency Assignment. S. Sirowy, F. Vahid. International Embedded Systems Symposium (IESS), 2007.
- Soft-Core Processor Customization Using the Design of Experiments Paradigm. D. Sheldon, F. Vahid, S. Lonardi. Design Automation and Test in Europe (DATE), 2007.
- A One-Shot Configurable-Cache Tuner for Improved Energy and Performance. A. Gordon-Ross, P. Viana, F. Vahid, W. Najjar. Design Automation and Test in Europe (DATE), 2007.
- Two Level Microprocessor-Accelerator Partitioning. S. Sirowy, Y. Wu, S. Lonardi, F. Vahid. Design Automation and Test in Europe (DATE), 2007.
- Clock-Frequency Partitioning for Multiple Clock Domains Systems-on-a-Chip. S. Sirowy, Y. Wu, S. Lonardi, F. Vahid.
- Conjoining Soft-Core FPGA Processors. D. Sheldon, R. Kumar, F. Vahid, D.M. Tullsen, R. Lysecky. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- Application-Specific Customization of Parameterized FPGA Soft-Core Processors. D. Sheldon, R. Kumar, R. Lysecky, F. Vahid, D.M. Tullsen. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2006.
- Warp Processors. R. Lysecky, G. Stitt, F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), July 2006, pp. 659-681.
- Configurable Cache Subsetting for Fast Cache Tuning. P. Viana, A. Gordon-Ross, E. Keogh, E. Barros, F. Vahid. IEEE/ACM Design Automation Conference (DAC), July 2006.
- Techniques for Synthesizing Binaries to an Advanced Register/Memory Structure. G. Stitt, Z. Guo, F. Vahid, W. Najjar. ACM/SIGDA Symp. on Field-Programmable Gate Arrays (FPGA), Feb. 2005, pp. 118-124.