3D Interconnect: Architectural Challenges and Opportunities
Tim Sherwood
UC SANTA BARBARA

The Role of Architecture
[Figure: the system stack from SW to HW (Applications, Runtime System, Architecture, Circuit, Device, Package), annotated with Demands (Battery Life, Performance, Programmability), Constraints (Noise, Thermal, Yield), and 3D Integration.]

Lab Overview
[Figure: overview of lab projects; state-machine diagram not recoverable.]
• Adaptive Hardware Profiling Engines integrated On-Chip
• Prototype Processor Acceleration Core Primitives
• Intrusion Detection System
• Intrusion Detection and Prevention
• Software Defined Wireless Access Point
• High Speed Programmable Routers
• High Throughput MEMS controllers
• Reconfigurable Security on FPGAs
• Memory Hierarchy
• Caches, etc.
• Server Farm

Potential for Impact from 3D
[Figure: the Lab Overview projects, annotated with where 3D could help.]
• 3D Bandwidth
• 3D Specialization
• 3D Integration for Latency
• 3D Integration for Mixed Signal
• 3D Integration for Mixed Technology
Presented Works
• Shashidhar Mysore, Banit Agrawal, Sheng-Chih Lin, Navin Srivastava, Kaustav Banerjee and Timothy Sherwood. Introspective 3D Chips, Proceedings of the Twelfth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2006. San Jose, CA
• Gian Luca Loi, Banit Agrawal, Navin Srivastava, Sheng-Chih Lin, Timothy Sherwood, Kaustav Banerjee. A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy, Proceedings of the 43rd Design Automation Conference (DAC), June 2006. San Francisco, CA

Two Specific Opportunities
1) 3D Integration for Performance
  Bring Memory Closer to those that use it
  More Bandwidth and Lower Latency
  Tricky System Level Tradeoffs
2) 3D Integration for Specialization
  Integration offers unique specialization opportunity
  Decouple commodity from niche
The ramifications of any radical change require a careful evaluation that considers all the parameters

A Simple Performance “Ecosystem”
[Figure: feedback graph linking package, temp, total power, dynamic power, leakage, V, freq, communication, utilized area, parallelism, performance, and feedback through the OS or runtime.]
No multicore, no spatial variance, no temporal app variance, no metrics of cost or error or yield

Basic Savings in 3D
Area: 4   Dist: √8 ≈ 2.8        BW: √8 ≈ 2.8
Area: 2   Dist: √4 = 2 + 1L     BW: 2√4 = 4
Area: 1   Dist: √2 ≈ 1.4 + 3L   BW: 4√2 ≈ 5.6
On-chip latency improved; bandwidth could improve more
What about real wires? What about apps? What about temp?
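The three rows of the Basic Savings slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions the slide does not state: that the three rows correspond to 1, 2, and 4 stacked layers of a fixed total area of 4, that "Dist" is the corner-to-corner diagonal of a square die of that footprint, that "L" is one inter-layer hop, and that the BW column is the layer count times that distance. The function name `fold` is hypothetical.

```python
import math

TOTAL_AREA = 4.0  # total silicon area used on the slide (arbitrary units)


def fold(layers: int, total_area: float = TOTAL_AREA):
    """Return (footprint, planar_distance, vertical_hops, relative_bandwidth).

    Assumptions (not stated on the slide): the die is square, distance is
    its diagonal, and bandwidth scales as layers * diagonal.
    """
    footprint = total_area / layers          # area of each layer
    planar = math.sqrt(2.0 * footprint)      # diagonal of a square of that area
    hops = layers - 1                        # inter-layer hops, the slide's "L" term
    bandwidth = layers * planar              # matches the slide's BW column
    return footprint, planar, hops, bandwidth


for n in (1, 2, 4):
    area, dist, hops, bw = fold(n)
    print(f"layers={n}: area={area:g}, dist={dist:.2f} + {hops}L, bw={bw:.2f}")
```

Under these assumptions the planar distance halves from √8 to √2 as the design is folded onto four layers, while the bandwidth figure doubles from √8 to 4√2 (about 5.66, which the slide shows as 5.6). The cost is the added vertical term, which is why the slide goes on to ask about real wires.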
Example Technology Node
[Figure: cross-section of a two-layer stack (Banerjee et al. IEEE 2001). Layer 1 and Layer 2 each show silicon substrate, CMOS, and metal layers, separated by dioxide and joined by a vertical interconnect. Dimensions shown: 50um, 30-40um, 2-5um.]

3D Wire Delay
[Figure: distributed RC delay (sec, x 10^-11, 0 to 1.4) versus wire length L (um, 160 to 800) for the vertical via model (vertical wire length) and the horizontal line model (horizontal wire length).]

A “Typical” 2D System Design
[Figure: CPU core with L1 I-Cache and L1 D-Cache, L2 Unified Cache, and Memory Controller on one chip; the L2 to Main Memory External Bus connects to DRAM chips on the board. Marked: Memory Bottleneck.]

A 3D Memory System
[Figure: Layer 1 holds the CPU core, L1 I-Cache, and L1 D-Cache; Layer 2 holds the L2 Unified Cache; Layers 3 to 18 hold stacked three-dimensional main memory. An L1 to L2 vertical interlayer bus and an L2 to Main Memory vertical interlayer bus connect the layers. Bus parameters: 8 bytes to 128 bytes, 200 MHz to 2 GHz.]

System-Level Simulation
Simulator: Sim-Alpha simulator
Processor: Alpha-21264 processor
Benchmarks: mcf, parser, twolf with MinneSPEC reduced inputs
% main memory access per instruction
  mcf     1.7%
  parser  0.258472%
  twolf   0.00062%

Effect of Bus Width and Frequency
[Figure: mcf. Execution time (sec) versus L2 cache size in KBytes (10 to 10000), for 8 bytes bus width (2-D) and 8, 16, 32, 64, and 128 bytes bus width (3-D). Annotation: Only a few vias required.]

Effect of Clock Frequency: mcf
[Figure: execution time per instruction (ns) versus clock frequency (MHz, 600 to 3000), mcf (2-D) and mcf (3-D).]

Effect of Clock Frequency: parser
[Figure: execution time per instruction (ns) versus clock frequency (MHz, 600 to 3000), parser (2-D) and parser (3-D).]

Effect of Clock Frequency: twolf
[Figure: execution time per instruction (ns) versus clock frequency (MHz, 600 to 3000), twolf (2-D) and twolf (3-D).]

An Example Memory System
[Figure: stack of DRAM layers, L2 Cache, and CPU & L1 Cache with a Heat Sink; a thermal gradient runs through the stack.]

Self-consistent Thermal Modeling
[Flowchart]
1) Insert the initial values of leakage and dynamic power for each layer
2) Calculate the first thermal profile
3) Based on the previous thermal profile, calculate the new power dissipation, considering:
  Ion decrease with temperature
  Ileakage increase with temperature
4) Calculate the new temperature profile
5) Is it convergent? If yes, finish. If no, go back to step 3.

3D Thermally-aware Performance Analysis
[Figure: mcf. Execution time per instruction and max chip temperature (K, 330 to 400) versus frequency in MHz (600 to 3000). Curves: 2-D max chip temperature, 3-D max chip temperature. Marked: temperature constraint, min execution time in 2-D, min execution time in 3-D.]

3D Thermally-aware Performance Analysis
[Figure: twolf. Execution time per instruction and max chip temperature (K, 330 to 390) versus frequency in MHz (600 to 3000). Curves: 2-D max chip temperature, 3-D max chip temperature. Marked: temperature constraint, maximum frequency allowed due to temperature constraint, min execution time in 2-D, min execution time in 3-D.]

3D Memory Integration
• Many Unaccounted For Effects
  Effect of Multiple Cores and Memory Banks
  Spatial Variation
  Temporal Variation (thermal load balancing)
  All of these are intimately tied to the integration method and packaging
• How to Manage
  Architecture and Software will be increasingly involved
  Exposing Variation to higher levels
  Huge demand for “models”, “sensors”, and “knobs”
Thermal, Packaging, Application, Architecture all tangled
  Need to build models that capture all of these aspects
  Models need to be “self consistent”

3D Integration for Introspection
• Complex
interactions across levels of abstraction make debugging, optimizing, securing, and analysis in general difficult
• The first requirement: visibility
  Not just data capture; we need the ability to put together a cohesive picture of system interactions and correlate between them in a sound and non-intrusive manner
• The hardware/software boundary is uniquely situated
  Piece together from low level events
• What would the programmer wish list look like?

What programmers want
Everything.
Sent to Integrated Monitoring Hardware (each item marked 4x and 3x in the figure):
  32 bit Memory Address
  32 bit Memory Value
  10 bit Opcodes
  2, 5 bit Register Names
  2, 32 bit Register Values
  10 bits of “status”
1892 bits per cycle = 1 terabyte/sec @ 4 GHz
[Figure: processor floorplan. Blocks: Decode, L2_BPU, L1_BPU, Trace Cache Top, Trace Cache Bottom, MOB, DTLB, ITLB, Bus Control, L1 Cache Top, L1 Cache Bottom, FP Exec, FP Reg, UROM, Int Exec, Int Reg, Mem Ctl, Alloc, Rename, Retire, Instr Q1, Instr Q2, Sched, L2 Cache.]

Why programmers can't have it
• Interconnect is not free
• Analysis is not free
  Significant processing required
• Extra cost of added heat
  $15 budget for cooling
• Used by developers
[Figure: the same floorplan with busses routed to Integrated Monitoring Hardware. Annotations: Huge cross chip busses, OptBuf 285um, 20,000 buffers.]

Cake + Eating It Too
• Need a way to provide cheap (or high margin) HW to the masses
  No paying for developer functionality
• Get developers the powerful analysis they crave
  See everything at executable rate
• Provide “snap-on” functionality for developers
  Separate chip for analysis engine
  Only hook it onto “developer” systems
• Idea is not limited to development systems
  Security, Error Correction, Confidentiality, Accelerators, …
• 3D Integration offers the potential

Thermal Impact

Conclusion: Opportunities+Challenges
3D Integration for Performance
  Bring Memory Closer to those that use it
  More Bandwidth and Lower Latency
  Requires few vias for big impact
  Tricky System Level Tradeoffs
3D Integration for Specialization
  Integration offers unique specialization opportunity
  Requires rethinking of integration process
  Decouple commodity from niche
Challenges
  Cross layer models: from app to package
  Cross layer optimization: both static and dynamic
  Thermal Management is everybody's problem
http://www.cs.ucsb.edu/~arch/
NSF CNS 0524771, NSF CCF 0702798, NSF CCF 0448654