TFUG-07 - UCSB Computer Science

3D Interconnect: Architectural
Challenges and Opportunities
Tim Sherwood
UC SANTA BARBARA
The Role of Architecture
[Figure: layered system stack, from SW to HW: Applications, Runtime System, Architecture, Circuit, Device, Package. Annotations: Demands (Battery Life, Performance, Programmability); Constraints (Noise, Thermal, Yield); 3D Integration.]
Lab Overview
[Figure: overview of lab projects: Adaptive Hardware Profiling Engines Integrated On-Chip; Intrusion Detection and Prevention; Software Defined Wireless Access Point; High Speed Programmable Routers; High Throughput MEMS Controllers; Reconfigurable Security on FPGAs. Diagram labels: Prototype Processor, Acceleration Core, Primitives, Intrusion Detection System, Caches etc., Server Farm, Memory Hierarchy.]
Potential for Impact from 3D
• 3D Bandwidth
• 3D Specialization
• 3D Integration for Latency
• 3D Integration for Mixed Signal
• 3D Integration for Mixed Technology
[Figure: the lab overview diagram, annotated with these 3D opportunities.]
Presented Works
• Shashidhar Mysore, Banit Agrawal, Sheng-Chih Lin, Navin
Srivastava, Kaustav Banerjee and Timothy Sherwood.
Introspective 3D Chips, Proceedings of the Twelfth
International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS),
October 2006. San Jose, CA
• Gian Luca Loi, Banit Agrawal, Navin Srivastava, Sheng-Chih
Lin, Timothy Sherwood, Kaustav Banerjee. A Thermally-Aware
Performance Analysis of Vertically Integrated (3-D)
Processor-Memory Hierarchy, Proceedings of the 43rd
Design Automation Conference (DAC), June 2006. San
Francisco, CA
Two Specific Opportunities
1) 3D Integration for Performance
  - Bring Memory Closer to those that use it
  - More Bandwidth and Lower Latency
  - Tricky System Level Tradeoffs
2) 3D Integration for Specialization
  - Integration offers unique specialization opportunity
  - Decouple commodity from niche
The ramifications of any radical change require a careful
evaluation that considers all the parameters
A Simple Performance “Ecosystem”
[Figure: feedback diagram linking package, temp, total power, dynamic power, leakage, V, freq, communication, utilized area, parallelism, performance, app, and OS or runtime.]
No multicore, no spatial variance, no temporal variance, no metrics of cost or error or yield
Basic Savings in 3D
Area: 4, Dist: √8 ≈ 2.8, BW: √8 ≈ 2.8
Area: 2, Dist: √4 = 2 + 1L, BW: 2√4 = 4
Area: 1, Dist: √2 ≈ 1.4 + 3L, BW: 4√2 ≈ 5.7
On-chip latency improved; bandwidth could improve more.
What about real wires? What about apps? What about temp?
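The numbers above follow from splitting a fixed total area across stacked layers. Here is a minimal sketch of that arithmetic. It assumes the three cases correspond to 1, 2, and 4 layers of a square footprint, that distance is the footprint diagonal plus one inter-layer hop of length L per layer crossed, and that bandwidth scales with the diagonal summed over layers. The function name and these interpretations are assumptions, not taken from the slides.

```python
import math

def basic_3d_savings(total_area, layers):
    """Return (footprint area, planar distance, inter-layer hops, relative bandwidth)."""
    footprint = total_area / layers        # area of each stacked layer
    diagonal = math.sqrt(2 * footprint)    # corner-to-corner distance of a square footprint
    hops = layers - 1                      # vertical hops, each of length L
    bandwidth = layers * diagonal          # assumed to scale with the diagonal of every layer
    return footprint, diagonal, hops, bandwidth

for n in (1, 2, 4):
    area, dist, hops, bw = basic_3d_savings(4, n)
    print(f"layers={n}: area={area:g}, dist={dist:.1f} + {hops}L, bw={bw:.1f}")
```

Under these assumptions the planar distance shrinks as 1/√n while the relative bandwidth grows as √n, which matches the slide's remark that bandwidth could improve more than latency.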
Example Technology Node
[Figure: cross-section of a two-layer 3D stack (Banerjee et al. IEEE 2001). Labels: Layer 1, Layer 2, silicon substrate, CMOS, metal layers, dioxide, vertical interconnect. Dimensions marked: 50um, 30-40um, 2-5um.]
3D Wire Delay
[Figure: distributed RC delay (sec, ×10^-11, 0 to 1.4) versus wire length L (um, 160 to 800), comparing the vertical via model (vertical wire length) with the horizontal line model (horizontal wire length L).]
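The plot compares distributed RC delay for horizontal wires and vertical vias. As a rough sketch, the 50% delay of a distributed RC line is commonly approximated as 0.38 times the total resistance times the total capacitance, so it grows with the square of the length. The per-micron resistance and capacitance below are placeholder values chosen only for illustration; they are not data from the slides, and the function name is hypothetical.

```python
def distributed_rc_delay(r_per_um, c_per_um, length_um):
    """Approximate 50% delay (seconds) of a distributed RC line: 0.38 * R_total * C_total."""
    r_total = r_per_um * length_um   # total wire resistance in ohms
    c_total = c_per_um * length_um   # total wire capacitance in farads
    return 0.38 * r_total * c_total

# Hypothetical per-micron parasitics (ohm/um and F/um); not values from the slides.
R_PER_UM = 0.1
C_PER_UM = 0.2e-15

for length in (160, 480, 800):
    delay = distributed_rc_delay(R_PER_UM, C_PER_UM, length)
    print(f"L = {length} um: delay = {delay:.2e} s")
```

Because the delay is quadratic in length, replacing one long horizontal run with a short vertical hop can pay off even if the via has worse parasitics per micron.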
A “Typical” 2D System Design
[Figure: board-level 2D system. Labels: CPU core, L1 I-Cache, L1 D-Cache, L2 Unified Cache, Memory Controller, L2 to Main Memory External Bus, DRAM chips, Board. Annotation: Memory Bottleneck.]
A 3D Memory System
Layer 1: CPU core, L1 I-Cache, L1 D-Cache
L1 to L2 vertical interlayer bus
Layer 2: L2 Unified Cache
L2 to Main Memory vertical interlayer bus
Layers 3 to 18: stacked three-dimensional main memory
Bus parameters explored: 8 bytes to 128 bytes, 200 MHz to 2 GHz
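To get a feel for the range those bus parameters span, peak bandwidth can be estimated as width times frequency. This sketch assumes one transfer per clock cycle, which the slides do not state; the function name is illustrative.

```python
def peak_bus_bandwidth(width_bytes, freq_hz, transfers_per_cycle=1):
    """Peak bandwidth in bytes/second, assuming a fixed number of transfers per clock."""
    return width_bytes * freq_hz * transfers_per_cycle

narrow_slow = peak_bus_bandwidth(8, 200e6)   # 8-byte bus at 200 MHz
wide_fast = peak_bus_bandwidth(128, 2e9)     # 128-byte bus at 2 GHz
print(f"{narrow_slow / 1e9:.1f} GB/s to {wide_fast / 1e9:.1f} GB/s")
```

Under that assumption the two ends of the range differ by a factor of 160 in peak bandwidth.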
System-Level Simulation
Simulator: Sim-Alpha
Processor: Alpha 21264
Benchmarks: mcf, parser, twolf with MinneSPEC reduced inputs

Benchmark   % main memory accesses per instruction
mcf         1.7%
parser      0.258472%
twolf       0.00062%
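The access rates above indicate how sensitive each benchmark is to main-memory latency. A minimal sketch: multiply accesses per instruction by a memory latency to estimate average stall cycles per instruction. It assumes every access stalls for the full latency with no overlap, and the 200-cycle latency is a hypothetical value, not one reported in the slides.

```python
# Main-memory accesses per instruction, converted from the percentages listed above.
ACCESS_RATE = {
    "mcf": 1.7 / 100,
    "parser": 0.258472 / 100,
    "twolf": 0.00062 / 100,
}

def memory_stall_cpi(benchmark, latency_cycles):
    """Average stall cycles per instruction, assuming no overlap between accesses."""
    return ACCESS_RATE[benchmark] * latency_cycles

HYPOTHETICAL_LATENCY = 200  # cycles; illustrative only, not from the slides
for name in ACCESS_RATE:
    stall = memory_stall_cpi(name, HYPOTHETICAL_LATENCY)
    print(f"{name}: {stall:.4f} stall cycles per instruction")
```

With these inputs mcf is dominated by memory stalls while twolf barely notices them, which is the contrast the three benchmarks were chosen to show.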
Effect of Bus Width and Frequency
[Figure: mcf execution time (sec) versus L2 cache size (KBytes, 10 to 10000) for an 8-byte bus (2-D) and 8-, 16-, 32-, 64-, and 128-byte buses (3-D). Annotation: Only a few vias required.]
Effect of Clock Frequency : mcf
[Figure: mcf execution time per instruction (ns) versus clock frequency (600 to 3000 MHz), 2-D versus 3-D.]
Effect of Clock Frequency : parser
[Figure: parser execution time per instruction (ns) versus clock frequency (600 to 3000 MHz), 2-D versus 3-D.]
Effect of Clock Frequency : twolf
[Figure: twolf execution time per instruction (ns) versus clock frequency (600 to 3000 MHz), 2-D versus 3-D.]
An Example Memory System
[Figure: stacked memory system. Labels: Heat Sink, CPU & L1 Cache, L2 Cache, multiple DRAM layers, Thermal Gradient.]
Self-consistent Thermal Modeling
1. Insert the initial values of leakage and dynamic power for each layer.
2. Calculate the first thermal profile.
3. Based on the previous thermal profile, calculate the new power dissipation, considering:
   - Ion decrease with temperature
   - Ileakage increase with temperature
4. Calculate the new temperature profile.
5. Is it convergent? If yes, finish. If no, return to step 3.
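The loop above is a fixed-point iteration between power and temperature. The sketch below shows its shape for a single lumped layer. The thermal model (temperature = ambient + thermal resistance × power), the exponential leakage growth, the linear drop in dynamic power, and every parameter value are simplifying assumptions for illustration; they are not the models used in the presented work.

```python
import math

def self_consistent_temperature(p_dynamic, p_leak_ref, t_ambient=318.0, r_thermal=0.5,
                                t_ref=318.0, leak_growth=0.02, ion_drop=0.001,
                                tol=1e-6, max_iter=200):
    """Iterate power <-> temperature until the temperature stops changing.

    Returns (temperature in K, total power in W, iterations used).
    """
    # Steps 1-2: initial power values give the first thermal profile.
    temp = t_ambient + r_thermal * (p_dynamic + p_leak_ref)
    for iteration in range(1, max_iter + 1):
        # Step 3: recompute power at the previous temperature.
        p_leak = p_leak_ref * math.exp(leak_growth * (temp - t_ref))     # leakage rises
        p_dyn = p_dynamic * max(0.0, 1.0 - ion_drop * (temp - t_ref))    # drive current falls
        # Step 4: new temperature from the new power.
        new_temp = t_ambient + r_thermal * (p_dyn + p_leak)
        # Step 5: convergence check.
        if abs(new_temp - temp) < tol:
            return new_temp, p_dyn + p_leak, iteration
        temp = new_temp
    raise RuntimeError("thermal iteration did not converge")

temp_k, power_w, iterations = self_consistent_temperature(p_dynamic=40.0, p_leak_ref=10.0)
print(f"T = {temp_k:.2f} K, P = {power_w:.2f} W after {iterations} iterations")
```

A real 3D stack would solve for a temperature per layer (or per grid cell) with coupling between layers; the single-node version only illustrates why the initial power estimate alone understates the final temperature.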
3D Thermally-aware Performance Analysis
[Figure: mcf. Execution time per instruction and maximum chip temperature (K, 330 to 400) versus frequency (600 to 3000 MHz) for 2-D and 3-D. Marked: temperature constraint, min execution time in 2-D, min execution time in 3-D.]
3D Thermally-aware Performance Analysis
[Figure: twolf. Execution time per instruction and maximum chip temperature (K, 330 to 390) versus frequency (600 to 3000 MHz) for 2-D and 3-D. Marked: temperature constraint, maximum frequency allowed due to temperature constraint, min execution time in 2-D, min execution time in 3-D.]
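The analysis in these plots amounts to picking the fastest operating point whose peak temperature stays under the constraint. A minimal sketch of that selection follows; the sweep values are invented for illustration and are not read from the plots, and the function name is hypothetical.

```python
def best_operating_point(points, temp_limit_k):
    """Return the (freq_mhz, time_per_instr, max_temp_k) tuple with the lowest
    execution time among points at or below the temperature limit, or None."""
    feasible = [p for p in points if p[2] <= temp_limit_k]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p[1])

# Hypothetical frequency sweep for a 3-D design; NOT data from the slides.
sweep_3d = [
    (600, 1.00, 335.0),
    (1400, 0.55, 352.0),
    (2200, 0.42, 368.0),
    (3000, 0.36, 388.0),
]
print(best_operating_point(sweep_3d, temp_limit_k=370.0))
```

In this made-up sweep the 3000 MHz point is fastest but violates the limit, so the 2200 MHz point is chosen. That mirrors the slide's point that the temperature constraint, not the clock, can set the minimum execution time.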
3D Memory Integration
• Many Unaccounted For Effects
  - Effect of Multiple Cores and Memory Banks
  - Spatial Variation
  - Temporal Variation (thermal load balancing)
  - All of these are intimately tied to the integration method and packaging
• How to Manage
  - Architecture and Software will be increasingly involved
  - Exposing Variation to higher levels
  - Huge demand for “models”, “sensors”, and “knobs”
  - Thermal, Packaging, Application, Architecture all tangled
    - Need to build models that capture all of these aspects
    - Models need to be “self consistent”
3D Integration for Introspection
• Complex interactions across levels of abstraction make
debugging, optimizing, securing, and analysis in general
difficult
• The first requirement: visibility
  - Not just data capture; we need the ability to put together
    a cohesive picture of system interactions and correlate
    between them in a sound and non-intrusive manner
• The hardware/software boundary is uniquely situated
  - Piece together from low level events
• What would the programmer wish list look like?
What programmers want
Everything.
[Figure: processor floorplan (Decode, L1_BPU, L2_BPU, Trace Cache Top and Bottom, MOB, ITLB, DTLB, Bus Control, L1 Cache Top and Bottom, FP Exec, FP Reg, UROM, Int Exec, Int Reg, Mem Ctl, Alloc, Rename, Retire, Sched, Instr Q1, Instr Q2, L2 Cache) with signals routed to integrated monitoring hardware; paths are marked 4x and 3x.]
To Integrated Monitoring Hardware:
• 32 bit Memory Address
• 32 bit Memory Value
• 10 bit Opcodes
• 2, 5 bit Register Names
• 2, 32 bit Register Values
• 10 bits of “status”
1892 bits per cycle ≈ 1 terabyte/sec @ 4 GHz
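The bandwidth figure is a unit conversion from bits per cycle to bytes per second. A short sketch of that arithmetic, using only the 1892 bits per cycle and 4 GHz quoted on the slide (the function name is illustrative):

```python
def trace_bandwidth_bytes_per_sec(bits_per_cycle, clock_hz):
    """Convert a per-cycle trace width into bytes per second."""
    return bits_per_cycle * clock_hz / 8

bw = trace_bandwidth_bytes_per_sec(1892, 4e9)
print(f"{bw / 1e12:.3f} TB/s")
```

The result is a little under one terabyte per second, which is why the slide rounds it to 1 terabyte/sec.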
Why programmers can't have it
• Interconnect is not free
  - Huge cross chip buses
  - OptBuf 285um
  - 20,000 buffers
• Analysis is not free
  - Significant processing required
• Extra cost of added heat
  - $15 budget for cooling
• Used by developers
[Figure: processor floorplan with cross-chip buses routed from each unit to the integrated monitoring hardware.]
Cake + Eating It Too
• Need a way to provide cheap (or high margin) HW to the
masses
  - No paying for developer functionality
• Get developers the powerful analysis they crave
  - See everything at executable rate
• Provide “snap-on” functionality for developers
  - Separate chip for analysis engine
  - Only hook it onto “developer” systems
• Idea is not limited to development systems
  - Security, Error Correction, Confidentiality, Accelerators, …
• 3D Integration offers the potential
Thermal Impact
Conclusion: Opportunities+Challenges
3D Integration for Performance
  - Bring Memory Closer to those that use it
  - More Bandwidth and Lower Latency
  - Requires few vias for big impact
  - Tricky System Level Tradeoffs
3D Integration for Specialization
  - Integration offers unique specialization opportunity
  - Requires rethinking of integration process
  - Decouple commodity from niche
Challenges
  - Cross layer models: from app to package
  - Cross layer optimization: both static and dynamic
  - Thermal Management is everybody's problem
http://www.cs.ucsb.edu/~arch/
NSF CNS 0524771, NSF CCF 0702798, NSF CCF 0448654