Amalgam: a Reconfigurable Processor for Future Fabrication Processes
Nicholas P. Carter
University of Illinois at Urbana-Champaign

Performance = f(architecture, implementation)
[Figure: LD/MUL/ADD/ST operation schedules and four 1-D IDCT blocks plotted against time]

Efficient Implementation
• Everything you give up in clock rate you have to make back in architectural efficiency
• Wire delay is the big limiting factor in system architectures today
  – Wires get slower relative to transistors as fab. process improves
• Programmable processors moving to deeper pipelines
  – Not good enough to just prevent wires from making reconf. logic slower

Amalgam
[Figure: block diagram with DRAM, multi-banked cache, network, and four PCluster/RCluster pairs]

Reconfigurable Cluster Design
• 4 register banks
  – 8 registers/bank
• 4 reconfigurable logic segments
  – 8 rows x 32 LBs per segment
• Array control unit
• Network interface
• Counter-clockwise flow of computation through cluster
[Figure: cluster layout with network interface, ACU, banks, and segments]

Reconfigurable Clock Rates
[Figure: clock-rate chart]

Unpipelined Critical Path
• Effect on clock rate varies significantly with fabrication process
  – Wires have heavy loads, making them slower than their length would indicate
• Latches in logic blocks only resource for pipelining
• Vertical and horizontal wires carry data between logic blocks
[Figure: critical path through FF, LB, HWIRE, VWIRE, and Bank elements]

Supporting Pipelining
• Goal: make logic block delay the limiting factor on clock rate
• Add configurable latches at each wire intersection
  – Problem: different paths may have different latencies
• Add retiming buffers at logic block inputs/outputs
• Add network queues to reduce synchronization overhead

Pipelined Critical Path
• Delay of individual wires < logic block delay in all processes studied
• Add configurable pipeline latches at junctions between wires
• Pipeline latches also added on carry chains within rows
[Figure: critical path with FF latches placed between LB, HWIRE, VWIRE, and Bank elements]

Retiming Buffers
• 5-deep chain of latches added to each logic block input
  – Similar structure added to LB output
• Can "borrow" up to two cycles of additional delay from adjacent input
• Total pipeline register overhead = 17%
[Figure: chains of FF latches]

Register Queues
[Figure: original architecture (register file) vs. register queue; labels: WRITE R8, Val1; WRITE R8, Val2; Sync. Message; EMPTY R8]

Implementing Pipelined Apps.
• Logical vs. physical pipelining
  – Logical: program-visible, uses array and registers
  – Physical: only visible to ACU, uses pipeline registers on wires, retiming buffers
• Take advantage of decoupling provided by queues
• Applications use same reconfigurable logic configurations in different fab. processes
  – Only FSM in ACU changes
  – Applications to portability, managing intra-die variation

Experimental Methodology
• Programs simulated using Amalsim
  – Set each cluster's clock rate independently
• Benchmarks: IDCT, Rijndael, DNA comparison
  – Fine-grained version of each benchmark does one computation
  – Medium-grained version performs four independent computations
• Programmable cluster clock rates based on ITRS
  – Limit stages to 7 FO4 delay, slightly more aggressive than ITRS
• Logic block latencies, wire lengths taken from circuit-level design of reconf. cluster in 180nm CMOS
  – Convert logic block delay to FO4, scale by FO4 delay of each fabrication process
  – Scale wire length based on fabrication process, simulate wire delay in SPICE
  – Pipeline such that reconf. cluster cycle time is determined by logic block delay

Pipelined Clock Rates
[Figure: clock-rate chart]

Fine-Grained Benchmark Perf.
• Reconfigurable version maintains about 20% perf. improvement over programmable in all fab. processes
• Pipelining only small benefit
• Majority of speedup comes from reduction in memory references

Medium-Grain Benchmark Perf.
• Pipelined architecture sees 2.6x perf. improvement over programmable
• Unpipelined architecture only minor improvement over programmable
  – Greater parallelism means more ability to tolerate memory delays

Limit Studies
• Believe that memory operations are much of the benefit for small tasks
  – Study limit where memory latency = 1
  – Also test theory that streaming benchmarks have enough parallelism to cover latency
• Understand how much clock rate of reconfigurable unit affects performance
  – Model reconfigurable unit at same clock rate as programmable clusters
  – Completely unreasonable for unpipelined
  – Might be indicator of what industry could do with pipelined

Unpipelined Fine-Grained
• Removing memory latencies makes programmable performance similar to reconfigurable
• Latency of reconf. clusters has large impact on performance: no parallelism to cover latency

Pipelined Fine-Grained
• Results similar to unpipelined
  – Benefit still mostly from memory reduction

Unpipelined Medium-Grain
• Eliminating memory latencies really helps programmable
• Latency of reconf. logic an even bigger problem
  – Programmable clusters can exploit parallelism through pipelines

Pipelined Medium-Grain
• Impact of memory system on reconfigurable performance very small
• Less benefit from increasing reconfigurable cluster clock rate
  – With even small amounts of parallelism, throughput becomes more important than latency.
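
The latency-versus-throughput point on the Pipelined Medium-Grain slide can be illustrated with a small timing model. This is only a sketch: the function name `batch_time_ns` and all the nanosecond figures are hypothetical, chosen for illustration, and are not measurements from the talk. The task counts of one and four mirror the fine-grained and medium-grained benchmark versions.

```python
def batch_time_ns(num_tasks, latency_ns, issue_interval_ns):
    """Time to finish num_tasks independent tasks on one unit.

    The first result appears after latency_ns; each later task
    completes issue_interval_ns after the one before it.
    """
    if num_tasks <= 0:
        return 0.0
    return latency_ns + (num_tasks - 1) * issue_interval_ns


# Hypothetical numbers (assumptions, not from the talk):
# an unpipelined unit cannot start a task until the previous one ends,
# a pipelined unit pays extra latch latency but accepts tasks more often.
UNPIPELINED = {"latency_ns": 10.0, "issue_interval_ns": 10.0}
PIPELINED = {"latency_ns": 12.0, "issue_interval_ns": 2.0}

for n in (1, 4):
    slow = batch_time_ns(n, **UNPIPELINED)
    fast = batch_time_ns(n, **PIPELINED)
    print(f"{n} task(s): unpipelined {slow} ns, pipelined {fast} ns")
```

In this model a single task is dominated by latency, so pipelining does not help it, while four independent tasks are dominated by the issue interval, which is where the pipelined unit pulls ahead.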

Future Directions
• ASIC-like performance with programmable systems
  – ASICs typically get 100x better performance per unit area than microprocessors
• Application-specific memory systems in a programmable chip
  – Transform memory references into communication
  – Create natural division of programs into regular and irregular blocks

Conclusion
• Reconfigurable computing must provide both speedup from custom logic and high clock rates to succeed
• Amalgam does this by limiting and tolerating wire delay at multiple levels
  – Clustered architecture
  – Segmented reconfigurable unit
  – Pipeline wire delays
• Result: 2.6x speedup over 8-way CMP in current and future fabrication processes
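
As a supplement to the Experimental Methodology slide, the FO4-based scaling step (convert a measured delay to FO4 units, then rescale by each process's FO4 delay) might be sketched as follows. Everything numeric here is an assumption: the 360 ps per micron FO4 constant is a commonly quoted rule of thumb rather than a value from the talk, the 1000 ps logic block delay is made up, and the function names are hypothetical.

```python
# Assumed rule of thumb: FO4 delay is roughly proportional to drawn
# gate length. The constant is NOT taken from the talk.
FO4_PS_PER_UM = 360.0


def fo4_delay_ps(gate_length_um, ps_per_um=FO4_PS_PER_UM):
    """Approximate FO4 inverter delay (ps) for a given gate length."""
    return ps_per_um * gate_length_um


def delay_in_fo4(delay_ps, ref_gate_length_um):
    """Express a delay measured in the reference process in FO4 units."""
    return delay_ps / fo4_delay_ps(ref_gate_length_um)


def scaled_delay_ps(delay_fo4, target_gate_length_um):
    """Rescale a delay given in FO4 units to a target process."""
    return delay_fo4 * fo4_delay_ps(target_gate_length_um)


def stage_limit_ps(fo4_per_stage, gate_length_um):
    """Cycle time (ps) when each pipeline stage is limited to N FO4."""
    return fo4_per_stage * fo4_delay_ps(gate_length_um)


def clock_rate_ghz(cycle_ps):
    """Clock rate in GHz for a cycle time in picoseconds."""
    return 1000.0 / cycle_ps


# Hypothetical logic block delay measured in the 180nm design.
lb_fo4 = delay_in_fo4(1000.0, ref_gate_length_um=0.18)
lb_90nm_ps = scaled_delay_ps(lb_fo4, target_gate_length_um=0.09)
print(f"logic block: {lb_fo4:.2f} FO4, {lb_90nm_ps:.1f} ps at 90nm")

# Programmable cluster limited to 7 FO4 per stage, as on the slide.
prog_ps = stage_limit_ps(7, gate_length_um=0.18)
print(f"7 FO4 stage at 180nm: {clock_rate_ghz(prog_ps):.2f} GHz")
```

Expressing delays in FO4 units keeps the logic block latency fixed across processes, so only the per-process FO4 delay has to change when moving to a new fabrication node.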