Dynamically Parameterized Architectures for Power Aware Video

Using System-on-a-Chip as a
Vehicle for VLSI Design Education
Andrew Laffely and Wayne Burleson
Electrical and Computer Engineering
University of Massachusetts Amherst
{alaffely,burleson}@ecs.umass.edu
This material is based upon work supported by the National Science Foundation under Grant No. 9988238
and SRC Tasks 766 and 1075
Burleson, UMASS
1
Challenges in VLSI Education
• Advancing Processing Technology
• Higher level design tools
• Realistic yet tractable design projects
• Preparation for jobs in semiconductor and other sectors
• Making best use of faculty/student time and university resources
ECE 559/659: VLSI Design Project
(10 grads, 20 seniors)
Course Objectives:
• Learn design process for a complex VLSI in
deep sub-micron CMOS
• Learn VLSI design skills and tools, including
working in teams
• Learn about a particular application
component and its VLSI implementation
• Learn to present formal design reviews using
oral, written, graphical and web-based
techniques
Key Aspects of the Course
• aSoC (home-grown SoC platform)
  • Provides a unifying framework to class
  • Allows for subdivision but inter-relation of projects
  • Interesting cutting edge architecture based on NSF- and SRC-funded research at UMASS and elsewhere
  • Covers many aspects of VLSI Design
  • Realistic constraints on area, timing, power and I/O
• Graduate and undergraduate teamwork
  • Graduate students provide leadership, motivation and experience
• Commercial tools and design flow
• Review-based evaluation
  • Oral and web-based reports for 4 different reviews: proposal, feasibility, implementation, integration
Adaptive System-on-a-Chip (aSoC)
[Figure: aSoC tile array with mProc, Multiplier, and FPGA cores; each tile contains a core and a communication interface (ctrl) with North, South, East, and West links.]
• Tiled architecture with mesh interconnect
• Allows for heterogeneous cores
  • Differing sizes, clock rates, voltages
• Low-overhead core interface for
  • Point to point communication pipeline
  • On-chip bus substitute for streaming applications
• Based on static scheduling
  • Fast and predictable
Communication Interface
[Figure: communication interface block diagram. North, South, East, and West inputs and outputs and the core-ports connect through a crossbar; the controller contains a PC, instruction memory, and decoder (example instruction: North to South & East); local config.; local frequency & voltage.]
• Custom design to maximize speed and reduce power
• Core-ports
• Crossbar
• Controller
• Instruction memory
• Local frequency and voltage supply
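The statically scheduled routing described above can be illustrated with a small software model. This is a minimal sketch rather than the aSoC hardware: the `CommInterface` class, the dictionary encoding of an instruction, and the lowercase port names are all assumptions made for illustration. Each instruction-memory entry lists which source port drives which destination ports in one cycle, and a program counter steps through the entries cyclically. The first entry mirrors the "North to South & East" example from the slide.

```python
from typing import Dict, List, Optional

PORTS = ("north", "south", "east", "west", "core")

# One instruction: source port -> destination ports driven during that cycle.
Instruction = Dict[str, List[str]]


class CommInterface:
    """Toy model of a statically scheduled crossbar (illustrative only)."""

    def __init__(self, instruction_memory: List[Instruction]) -> None:
        for instr in instruction_memory:
            dests = [d for targets in instr.values() for d in targets]
            if any(p not in PORTS for p in list(instr) + dests):
                raise ValueError("unknown port name in schedule")
            if len(dests) != len(set(dests)):
                raise ValueError("an output port is driven by two inputs")
        self.imem = instruction_memory
        self.pc = 0  # program counter into the instruction memory

    def step(self, inputs: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
        """Route one cycle of data according to the current instruction."""
        instr = self.imem[self.pc]
        outputs: Dict[str, Optional[int]] = {p: None for p in PORTS}
        for src, targets in instr.items():
            for dest in targets:
                outputs[dest] = inputs.get(src)
        self.pc = (self.pc + 1) % len(self.imem)  # schedule repeats cyclically
        return outputs


schedule = [
    {"north": ["south", "east"]},  # cycle 0: North to South & East
    {"core": ["west"]},            # cycle 1: core output leaves to the West
]
iface = CommInterface(schedule)
out0 = iface.step({"north": 7})
out1 = iface.step({"core": 3})
print(out0["south"], out0["east"], out1["west"])  # prints "7 7 3"
```

Because the schedule is fixed ahead of time, the model needs no arbitration or routing decisions at run time, which is the property the slides summarize as fast and predictable.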
Class Projects
SoC Infrastructure [1,3]
• Communication Interface
• Interconnect [3]
• Power Distribution
• Clock System
• Power Management

Cores
• Motion estimation for video encoding [2,3]
• AES Cryptography [3]
• Cache [2,3]
• Huffman Coding
• 3D Graphics [1,2,3]
• Discrete Cosine Transform [2,3]
• Smart Card [2,3]

[1] Used in PhD Dissertation
[2] Used in Masters Thesis
[3] Used in Publications
Design Flow
http://vsp2.ecs.umass.edu/vspg/658/TA_Tools/design_flow.html
• Architecture to Layout
  • Architecture: Block diagram of system and behavioral description
  • Logic: Gate level or schematic description
  • Circuit: Transistor sizing
  • Layout: Floorplanning, clock and power distribution
• Tools
  • VerilogXL: behavioral representation
  • VTVT: standard cell library
  • Synopsys: standard cell gate level netlist generation
  • Silicon Ensemble: standard cell netlist to layout
  • Cadence LayoutPlus: schematic and layout design
  • NCSU CDK: design and extraction rules
  • Cadence Layout vs. Schematic: layout verification
  • HSPICE: circuit simulator
aSoC Implementation and Integration
0.18 µm TSMC technology
Full custom
[Figure: tile layout with dimensions 2500 λ and 3000 λ]
Advanced Signaling Techniques
(building on SRC-funded work)
Differential current sensing
Booster Insertion
Multi-level current signaling
Phase coding
Circuit Level Simulation (HSPICE)
Evaluating Subsystems with realistic models
• Capacitance, resistance and inductance
• Process variations
• Process generations
Interconnect Characterization:
Comparing delay and power of signaling techniques for different
tile sizes at 250nm, 180nm, 130nm, 100nm
Voltage Scaling Approach
• Core-ports
  • Single buffer for each stream to cross clock/voltage barrier between core and interface
  • Reading/Writing success rates indicate core utilization
    • Input blocked: Core too slow
    • Output blocked: Core too fast
• Controller
  • Interprets core-port success rates to adjust local clock and voltage
[Figure: core (buffer, processing pipeline, local Vdd, local clock) between an input core-port and an output core-port; the "blocked" signals from both core-ports feed the clock and supply controller; the core-ports connect to the interconnect.]
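The control idea above (blocked inputs mean the core is too slow, blocked outputs mean it is too fast) can be sketched as a simple step-up/step-down policy. This is a hypothetical illustration, not the controller used in aSoC: the `SupplyController` class, the 0.2 blocking-rate threshold, and the pairing of each clock speed with a supply voltage are assumptions, with the speed and voltage values borrowed from other slides in this deck.

```python
from dataclasses import dataclass
from typing import Tuple

# (relative clock speed, supply voltage in volts), slowest first.
# The pairing of speeds to voltages is an assumption for illustration.
LEVELS = [(1 / 8, 0.60), (1 / 4, 0.73), (1 / 2, 1.16), (1.0, 1.80)]


@dataclass
class SupplyController:
    """Toy clock/voltage controller driven by core-port blocking rates."""

    level: int = len(LEVELS) - 1   # start at maximum speed
    threshold: float = 0.2         # hypothetical blocking-rate threshold

    def update(self, input_blocked_rate: float,
               output_blocked_rate: float) -> Tuple[float, float]:
        if input_blocked_rate > self.threshold:
            # Input core-port blocked: core too slow, so speed up.
            self.level = min(self.level + 1, len(LEVELS) - 1)
        elif output_blocked_rate > self.threshold:
            # Output core-port blocked: core too fast, so slow down.
            self.level = max(self.level - 1, 0)
        return LEVELS[self.level]


ctrl = SupplyController()
first = ctrl.update(0.0, 0.5)    # output blocked -> drop to 1/2 speed
second = ctrl.update(0.0, 0.5)   # still blocked  -> drop to 1/4 speed
third = ctrl.update(0.6, 0.0)    # input blocked  -> back up to 1/2 speed
```

A real controller would also need to account for the time a voltage transition takes and avoid oscillating between two levels; the sketch ignores both.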
Vdd Selection Criteria
[Figure: Normalized core critical path delay vs. Vdd. Y axis: normalized delay (0 to 12); X axis: voltage (0.4 V to 2 V). Marked operating points: max speed, 1/2 speed, 1/4 speed, 1/8 speed; 0.73 V and 1.16 V are marked on the voltage axis.]
• As Vdd decreases delay increases exponentially
• Use curve to match available clock frequencies to voltages
• The voltage and frequency change reduces power by 79%, 96%, and 98.7%
  • $P = \alpha C V_{dd}^{2} f$
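The quoted power reductions can be approximately reproduced from the dynamic power relation, in which power scales with the square of the supply voltage times the clock frequency. The sketch below is a worked check under stated assumptions: the activity factor and capacitance are unchanged, the maximum-speed supply is 1.8 V, and the 1/8-speed point runs at 0.6 V (both values are taken from the supply levels listed in the power distribution table of this deck, not from this slide). The function name `relative_dynamic_power` is made up for the example.

```python
def relative_dynamic_power(vdd: float, freq_scale: float,
                           vdd_max: float = 1.8) -> float:
    """Return P / P_max when power scales as Vdd^2 * f (alpha and C fixed)."""
    return (vdd / vdd_max) ** 2 * freq_scale


# (supply voltage in volts, clock frequency relative to maximum speed)
operating_points = [(1.16, 1 / 2), (0.73, 1 / 4), (0.60, 1 / 8)]

savings_percent = [
    100.0 * (1.0 - relative_dynamic_power(vdd, scale))
    for vdd, scale in operating_points
]

for (vdd, scale), saved in zip(operating_points, savings_percent):
    print(f"Vdd={vdd:.2f} V, f={scale:g} x max: power reduced by {saved:.1f}%")
```

Under these assumptions the first two points come out at roughly 79% and 96%. The 0.6 V point gives about 98.6%, slightly below the quoted 98.7%, which suggests the voltage read off the curve for 1/8 speed may be marginally lower than 0.6 V.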
Clock Distribution
[Figure: clock distribution across the tile array]
• Tiled architecture extends life of globally synchronous systems
• Precise H-tree implementation
• Load is small and equal at each branch
• Skew can be reduced by 70% with advanced deskew circuits [1]

64 tile aSoC    70nm          100nm         130nm         180nm
Chip Area       (9.24 mm)²    (13.3 mm)²    (17.2 mm)²    (23.8 mm)²
Frequency       5 GHz         2 GHz         1 GHz         0.5 GHz
Power           126 mW        240 mW        445 mW        784 mW
Mean Skew       41 ps         50 ps         92 ps         70.6 ps
Percent Skew    21%           10%           9%            4%

[1] S. Tam et al., "Clock Generation and Distribution for the First IA-64 Microprocessor," IEEE JSSC, Nov. 2000
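The Percent Skew row appears to be the mean skew expressed as a fraction of the clock period. The sketch below checks that reading against the listed values; the interpretation itself is an assumption, and the function name `percent_skew` is made up for the example.

```python
# Technology node -> (clock frequency in GHz, mean skew in ps), from the table.
clock_table = {
    "70nm": (5.0, 41.0),
    "100nm": (2.0, 50.0),
    "130nm": (1.0, 92.0),
    "180nm": (0.5, 70.6),
}


def percent_skew(freq_ghz: float, skew_ps: float) -> float:
    """Mean skew as a percentage of the clock period."""
    period_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps period
    return 100.0 * skew_ps / period_ps


computed = {node: percent_skew(f, s) for node, (f, s) in clock_table.items()}
for node, value in computed.items():
    print(f"{node}: {value:.1f}% of the clock period")
```

The computed values (about 20.5%, 10.0%, 9.2%, and 3.5%) agree with the listed 21%, 10%, 9%, and 4% to within rounding, so skew consumes a growing share of the cycle as the clock frequency rises.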
Power Distribution
• Heterogeneous cores may require multiple power supply voltages
• Tile structure enables uniform interwoven grid
• Larger grid for higher current demands
  • Reduced resistance
  • Higher capacitance
[Figure: interwoven power grid with rails Gnd, Vl, Vml, Vmh, Vh]

64 tile aSoC        Vh        Vmh       Vml       Vl
Voltage             1.8 V     1.16 V    0.73 V    0.6 V
Current per Core    110 mA    25 mA     13 mA     7 mA
Total Power         12.1 W    1.86 W    607 mW    269 mW
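The Total Power row can be cross-checked by multiplying each supply voltage by the per-core current and by the 64 tiles. This is a sketch under the assumption that total power is simply tiles × voltage × current per core, with every tile drawing from the same rail; the names `N_TILES` and `total_power_w` are made up for the example.

```python
N_TILES = 64

# Rail name -> (supply voltage in volts, current per core in amperes).
rails = {
    "Vh": (1.80, 0.110),
    "Vmh": (1.16, 0.025),
    "Vml": (0.73, 0.013),
    "Vl": (0.60, 0.007),
}


def total_power_w(voltage: float, current_per_core: float,
                  tiles: int = N_TILES) -> float:
    """Total power in watts if every tile draws the given current."""
    return tiles * voltage * current_per_core


totals = {name: total_power_w(v, i) for name, (v, i) in rails.items()}
for name, watts in totals.items():
    print(f"{name}: {watts:.3f} W")
```

The Vmh, Vml, and Vl rows reproduce the listed 1.86 W, 607 mW, and 269 mW. The Vh row comes out near 12.7 W rather than the listed 12.1 W, which may simply reflect rounding of the per-core current.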
Architecture Evaluation
(Motion Estimation)
• Array-based architecture
• Pipelined ME
• Parameterized search window size
• Full search
• Choose 16x16 or 8x8 windows
• Reduce power
[Figure: motion estimation block diagram with Memory, FIFOs, Address Generation Unit, and Processing Element Array]
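Full-search motion estimation can be illustrated in software. The sketch below is a plain reference model, not the array-based hardware from the slide: it assumes the sum of absolute differences (SAD) as the matching cost, and the function names, the `search` range parameter, and the synthetic test frames are all made up for the example. The `block` parameter plays the role of the 16x16 or 8x8 choice.

```python
from typing import List, Tuple

Frame = List[List[int]]


def sad(cur: Frame, ref: Frame, bx: int, by: int,
        dx: int, dy: int, block: int) -> int:
    """Sum of absolute differences between a block and a displaced candidate."""
    total = 0
    for y in range(block):
        for x in range(block):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur: Frame, ref: Frame, bx: int, by: int,
                block: int = 16, search: int = 4) -> Tuple[int, int, int]:
    """Try every displacement within +/- search; return (dx, dy, best SAD)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + block > width or y0 + block > height:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, block)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    if best is None:
        raise ValueError("no valid candidate inside the reference frame")
    return best


def pattern(x: int, y: int) -> int:
    """Synthetic pixel value used to build test frames."""
    return (7 * x + 13 * y) % 256


SIZE = 32
ref_frame = [[pattern(x, y) for x in range(SIZE)] for y in range(SIZE)]
# Current frame is the reference content displaced by (1, 2).
cur_frame = [[pattern(x + 1, y + 2) for x in range(SIZE)] for y in range(SIZE)]

mv8 = full_search(cur_frame, ref_frame, bx=8, by=8, block=8, search=4)
mv16 = full_search(cur_frame, ref_frame, bx=8, by=8, block=16, search=4)
```

An 8x8 block needs a quarter of the difference operations of a 16x16 block per candidate, which is one way a parameterized block or window size trades matching quality against power.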
Modify Existing Designs
• Take existing Verilog code or hardware and improve or change functionality (e.g. add motion estimation algorithms, provide AES key-length flexibility)
• Evaluate changes in performance and overhead
[Figure: old PE layout and new PE layout]
Conclusions
• Advancing Process Technology
  • Target 0.18 µm for affordable fab but also do scaling studies
• Higher level design tools
  • Combine synthesis and custom techniques
• Realistic yet tractable design projects
  • Re-use existing projects and provide unifying themes
• Preparation for jobs in semiconductor and other sectors
  • Focus on system design and appropriate levels of abstraction
  • Teach how to learn new tools
• Making best use of faculty/student time and university resources
  • Leverage research
  • Combine grad and undergrad
  • Re-use materials, tools