Using System-on-a-Chip as a Vehicle for VLSI Design Education
Andrew Laffely and Wayne Burleson
Electrical and Computer Engineering
University of Massachusetts Amherst
{alaffely,burleson}@ecs.umass.edu
This material is based upon work supported by the National Science Foundation under Grant No. 9988238 and SRC Tasks 766 and 1075.
Burleson, UMASS

Challenges in VLSI Education
• Advancing Processing Technology
• Higher level design tools
• Realistic yet tractable design projects
• Preparation for jobs in semiconductor and other sectors
• Making best use of faculty/student time and university resources

ECE 559/659: VLSI Design Project (10 grads, 20 seniors)
Course Objectives:
• Learn design process for a complex VLSI in deep sub-micron CMOS
• Learn VLSI design skills and tools, including working in teams
• Learn about a particular application component and its VLSI implementation
• Learn to present formal design reviews using oral, written, graphical and web-based techniques

Key Aspects of the Course
• aSoC (home-grown SoC platform)
  • Provides a unifying framework to class
  • Allows for subdivision but inter-relation of projects
  • Interesting cutting edge architecture based on NSF- and SRC-funded research at UMASS and elsewhere
  • Covers many aspects of VLSI Design
  • Realistic constraints on area, timing, power and I/O
• Graduate and undergraduate teamwork
  • Graduate students provide leadership, motivation and experience
• Commercial tools and design flow
• Review-based evaluation
  • Oral and web-based reports for 4 different reviews: proposal, feasibility, implementation, integration

Adaptive System-on-a-Chip (aSoC)
• Tiled architecture with mesh interconnect
• Allows for heterogeneous cores
  • Differing sizes, clock rates, voltages
• Low-overhead core interface for
  • Point-to-point communication pipeline
  • On-chip bus substitute for streaming applications
• Based on static scheduling
  • Fast and predictable
[Figure: aSoC tile diagram. Labels: Tile, µProc, Multiplier, FPGA, Core, Communication Interface, ctrl, North, South, East, West]

Communication Interface
• Custom design to maximize speed and reduce power
• Core-ports
• Crossbar
• Controller
• Instruction memory
• Local frequency and voltage supply
[Figure: communication interface block diagram. Labels: Core, Core-ports, Inputs and Outputs (North, South, East, West), Crossbar, Decoder, Controller, Local Config., PC, Instruction Memory, Local Frequency & Voltage; example instruction "North to South & East"]

Class Projects
• SoC Infrastructure [1,3]
  • Communication Interface
  • Interconnect [3]
  • Power Distribution
  • Clock System
  • Power Management
• Cores
  • Motion estimation for video encoding [2,3]
  • AES Cryptography [3]
  • Cache [2,3]
  • Huffman Coding
  • 3D Graphics [1,2,3]
  • Discrete Cosine Transform [2,3]
  • Smart Card [2,3]
[1] Used in PhD Dissertation
[2] Used in Masters Thesis
[3] Used in Publications

Design Flow
http://vsp2.ecs.umass.edu/vspg/658/TA_Tools/design_flow.html
• Architecture to Layout
  • Architecture: Block diagram of system and behavioral description
  • Logic: Gate level or schematic description
  • Circuit: Transistor sizing
  • Layout: Floorplanning, clock and power distribution
• Tools
  • VerilogXL: behavioral representation
  • VTVT: standard cell library
  • Synopsys: standard cell gate level netlist generation
  • Silicon Ensemble: standard cell netlist to layout
  • Cadence LayoutPlus: schematic and layout design
  • NCSU CDK: design and extraction rules
  • Cadence Layout vs. Schematic: layout verification
  • HSPICE: circuit simulator

aSoC Implementation and Integration
• 0.18 µm TSMC technology
• Full custom
[Figure: chip layout with dimensions labeled 2500 λ and 3000 λ]

Advanced Signaling Techniques (building on SRC-funded work)
• Differential current sensing
• Booster Insertion
• Multi-level current signaling
• Phase coding

Circuit Level Simulation (HSPICE)
Evaluating Subsystems with realistic models
• Capacitance, resistance and inductance
• Process variations
• Process generations

Interconnect Characterization
Comparing delay and power of signaling techniques for different tile sizes at 250nm, 180nm, 130nm, 100nm

Voltage Scaling Approach
• Core-ports
  • Single buffer for each stream to cross clock/voltage barrier between core and interface
  • Reading/Writing success rates indicate core utilization
    • Input blocked: Core too slow
    • Output blocked: Core too fast
• Controller
  • Interprets core-port success rates to adjust local clock and voltage
[Figure: voltage scaling block diagram. Labels: Core, Buffer, Processing Pipeline, Local Vdd, Local Clock, Input Core-port, Output Core-port, Blocked (input and output), Clock and Supply Controller, Interconnect]

Vdd Selection Criteria
• As Vdd decreases delay increases exponentially
• Use curve to match available clock frequencies to voltages
• The voltage and frequency change reduces power by 79%, 96%, and 98.7%
  • $P = \alpha C V_{dd}^2 f$
[Figure: Normalized Core Critical Path Delay vs. Vdd. Axes: Normalized Delay (0 to 12) vs. Voltage (0.4 to 2); marked points: Max Speed, 1/2 Speed, 1/4 Speed, 1/8 Speed; marked voltages: 0.73 and 1.16]

Clock Distribution
• Tiled architecture extends life of globally synchronous systems
• Precise H-tree implementation
• Load is small and equal at each branch
• Skew can be reduced by 70% with advanced deskew circuits [1]

64-tile aSoC    70 nm         100 nm        130 nm        180 nm
Chip Area       (9.24 mm)^2   (13.3 mm)^2   (17.2 mm)^2   (23.8 mm)^2
Frequency       5 GHz         2 GHz         1 GHz         0.5 GHz
Power           126 mW        240 mW        445 mW        784 mW
Mean Skew       41 ps         50 ps         92 ps         70.6 ps
Percent Skew    21%           10%           9%            4%

[1] S. Tan et al., "Clock Generation and Distribution for the First IA-64 Microprocessor," IEEE JSSC, Nov. 2000
[Figure: H-tree clock distribution over the tile array. Label: Tile]

Power Distribution
• Heterogeneous cores may require multiple power supply voltages
• Tile structure enables uniform interwoven grid
• Larger grid for higher current demands
  • Reduced resistance
  • Higher capacitance
[Figure: interwoven power grid. Labels: Gnd, Vl, Vml, Vmh, Vh]

64-tile aSoC        Vh        Vmh       Vml       Vl
Voltage             1.8 V     1.16 V    0.73 V    0.6 V
Current per Core    110 mA    25 mA     13 mA     7 mA
Total Power         12.1 W    1.86 W    607 mW    269 mW

Architecture Evaluation (Motion Estimation)
• Array-based architecture
  • Pipelined ME
• Parameterized search window size
  • Full search
  • Choose 16x16 or 8x8 windows
  • Reduce power
[Figure: motion estimation architecture. Labels: Memory, FIFOs, Address Generation Unit, Processing Element Array]

Modify Existing Designs
• Take existing Verilog code or hardware and improve or change functionality (e.g. add motion estimation algorithms, provide AES key-length flexibility)
• Evaluate changes in performance and overhead
[Figures: Old PE Layout; New PE Layout]

Conclusions
• Advancing Process Technology
  • Target 0.18 µm for affordable fab but also do scaling studies
• Higher level design tools
  • Combine synthesis and custom techniques
• Realistic yet tractable design projects
  • Re-use existing projects and provide unifying themes
• Preparation for jobs in semiconductor and other sectors
  • Focus on system design and appropriate levels of abstraction
  • Teach how to learn new tools
• Making best use of faculty/student time and university resources
  • Leverage research
  • Combine grad and undergrad
  • Re-use materials, tools
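
The power reductions quoted on the Vdd Selection Criteria slide can be sanity-checked against the dynamic power relation given there (power proportional to activity, capacitance, Vdd squared, and frequency). The sketch below is illustrative only. It assumes the activity factor and capacitance stay constant, and it assumes that the four supply voltages from the Power Distribution slide (1.8 V, 1.16 V, 0.73 V, 0.6 V) pair with max, 1/2, 1/4, and 1/8 speed in that order; that pairing is inferred from the slides rather than stated in them. The names `OPERATING_POINTS`, `relative_power`, and `power_savings_percent` are hypothetical.

```python
# Relative dynamic power under combined voltage and frequency scaling,
# using P = a * C * Vdd^2 * f with activity a and capacitance C held constant.

# ASSUMPTION: the pairing of supply voltage with clock speed is inferred from
# the slides (1.8 V at max speed, then 1.16 V, 0.73 V, 0.6 V); the deck does
# not state it explicitly.
OPERATING_POINTS = [
    ("max speed", 1.80, 1.0),
    ("1/2 speed", 1.16, 1 / 2),
    ("1/4 speed", 0.73, 1 / 4),
    ("1/8 speed", 0.60, 1 / 8),
]


def relative_power(vdd, freq_fraction, vdd_max=1.80):
    """Return P / P_max for a given supply voltage and clock fraction."""
    return (vdd / vdd_max) ** 2 * freq_fraction


def power_savings_percent(vdd, freq_fraction, vdd_max=1.80):
    """Return the percentage reduction in dynamic power versus max speed."""
    return 100.0 * (1.0 - relative_power(vdd, freq_fraction, vdd_max))


if __name__ == "__main__":
    for label, vdd, fraction in OPERATING_POINTS:
        savings = power_savings_percent(vdd, fraction)
        print(f"{label:>9}: Vdd = {vdd:.2f} V, savings = {savings:.1f} %")
```

Under these assumptions the computed savings are about 79.2%, 95.9%, and 98.6%, which is close to the 79%, 96%, and 98.7% on the slide. The small gap in the last value may come from rounding in the quoted voltages.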