Design Methodology for Customizable Programmable Processors

Berkeley – Finland Day, Oct. 18, 2002
Prof. Jarmo Takala
Institute of Digital and Computer Systems
Tampere University of Technology
Tampere, Finland
Tel: +358 – 33115 3879; Email: jarmo.takala@tut.fi
Outline
Motivation
Transport Triggered Architecture (TTA)
Design Methodology for TTAs
Research at TUT
Conclusions
J.Takala/TUT
Berkeley – Finland Day, Oct.18, 2002
Motivation
Programmable processors often used in
products using digital signal processing (DSP)
Flexibility
Ease of verification
Traditionally DSP processor architectures have
been developed based on average performance
in several benchmark tasks (~100)
User applications often contain only a subset of the
benchmark tasks
Efficiency can be improved by customizing
architecture according to given tasks
Motivation
DSP applications often have hard real-time
constraints
execution should be deterministic
dynamic runtime behaviours should be avoided
Static scheduling lends itself to DSP
Current design complexities call for increase in
designer productivity
High level languages should be used
DSP algorithms contain inherent parallelism
Instruction level parallelism (ILP) should be
maximized
What is needed?
Application driven design process with easy
design space exploration
Replace hardware complexity by software
complexity
Compiler driven process
Use templated architecture
Flexible: heterogeneous function units
Modular: scalability
Orthogonal: compiler friendly
Choices for Architecture Template
[Figure: ILP Architectures. Division of work between compilation time (software) and run time (hardware) for sequential (superscalar), dependence (dataflow), independence (EPIC), and independence (VLIW) architectures. Steps from Application onward: Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths & Execute.]
VLIW Gained Popularity in DSP
[Figure: VLIW CPU block diagram. Labeled blocks: Instruction Memory, Instruction Fetch, Instruction Decode, Bypassing Network, Register File, FU-1 to FU-5, Data Memory.]
Transport Triggered Architecture
VLIW drawbacks
Bypass complexity
Register file complexity
Register file design restricts FU flexibility
Operation encoding format restricts FU flexibility
Reverse programming paradigm [H. Corporaal, 94]
data transport → operation
Instruction set contains only a single
instruction: move
From VLIW to TTA
[Figure: side-by-side block diagrams of a VLIW and a TTA. VLIW panel: Instruction Memory, Instruction Fetch, Instruction Decode, Bypassing Network, Register File, FU-1 to FU-5, Data Memory. TTA panel: FU-1 to FU-5 and Register File.]
TTA Datapath
[Figure: TTA datapath. Labeled elements: Data Memory, two Load/Store Units, two Integer ALUs, Float ALU, Socket, Integer RF, Float RF, Boolean RF, Instruction Unit, Immediate Unit, Instruction Memory.]
Function Units
Operands written to operand registers (O)
Operation performed when last operand written to trigger register (T)
Pipeline synchronized with control bits (C)
Standard interface: FU_ready, Result_ready, Global_lock
[Figure: function unit pipeline with operand (O), trigger (T), and result (R) registers, logic stages each with control bits (C), and an optional shadow register.]
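The operand/trigger behaviour described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name FunctionUnit and its methods are hypothetical, the unit has a single operand register, and pipeline latency, control bits, and the FU_ready / Result_ready / Global_lock signals are ignored.

```python
class FunctionUnit:
    """Toy model of a TTA function unit (hypothetical, for illustration only)."""

    def __init__(self, operation):
        self.operation = operation  # two-input function, e.g. lambda o, t: o * t
        self.o = 0                  # operand register (O)
        self.r = None               # result register (R), empty until triggered

    def write_operand(self, value):
        # A move into O only stores the value; nothing executes yet.
        self.o = value

    def write_trigger(self, value):
        # A move into T starts the operation with the stored operand.
        self.r = self.operation(self.o, value)

    def read_result(self):
        return self.r


mul = FunctionUnit(lambda o, t: o * t)
mul.write_operand(6)            # r1 -> mul.o
before = mul.read_result()      # still None: no trigger has been written
mul.write_trigger(7)            # r2 -> mul.t
print(mul.read_result())        # prints 42
```

The point of the sketch is that the operation is a side effect of the transport into the trigger register; there is no separate "multiply" instruction.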
ILP Architectures
[Figure: division of work between compilation time and run time for sequential (superscalar), dependence (dataflow), independence (EPIC), independence (VLIW), and independence (TTA) architectures. Steps from Application onward: Frontend, Determine Dependencies, Determine Independencies, Bind Function Units, Bind Datapaths, Execute.]
TTA Characteristics: HW
Modular
Can be constructed with standard building blocks
Very flexible and scalable
FU functionality can be arbitrary
Supports user defined Special Function Units (SFU)
Lower complexity
Reduction in the number of register ports
Reduced bypass complexity
Reduction in bypass connectivity
Reduced register pressure
Trivial decoding (implies long instructions)
TTA Characteristics: SW
Traditional operation-triggered instruction:
mul r1,r2,r3;
Transport-triggered instruction:
r1mul.o;
r2mul.t;
mul.rr3;
or
r1mul.o, r2mul.t;
mul.rr3;
Resembles dataflow and time-stationary coding
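As a rough illustration of the relation between the two instruction styles, the sketch below expands a three-address operation into data transports. The function name to_moves and the textual instruction format are assumptions made for this example; they are not part of any MOVE tool.

```python
def to_moves(instr):
    """Expand 'op src1,src2,dst;' into three transport moves (illustrative only)."""
    op, args = instr.rstrip(";").split(None, 1)
    src1, src2, dst = [a.strip() for a in args.split(",")]
    return [
        f"{src1} → {op}.o",   # first operand to the operand register
        f"{src2} → {op}.t",   # second operand to the trigger register
        f"{op}.r → {dst}",    # result register to the destination
    ]


moves = to_moves("mul r1,r2,r3;")
print(moves)  # ['r1 → mul.o', 'r2 → mul.t', 'mul.r → r3']
```

A scheduler is then free to pack independent moves onto parallel buses, which is where the second form shown above comes from.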
TTA Design Tools
Design tools based on TTA architecture
template have been developed at Delft
University of Technology (DUT), Delft, the
Netherlands
MOVE project led by Prof. Henk Corporaal
Fully parametric C/C++ Compiler
 buses, connections, function units, register files, etc.
Design space explorer
Processor generator
Code Generation Trajectory
[Figure: code generation trajectory (MOVE Project at DUT). Labeled blocks: Application (C/C++), Architecture Description, Compiler Frontend (GCC or SUIF), Sequential Code, Sequential Simulator, Profiling Data, Compiler Backend, Parallel Code, Parallel Simulator, I/O.]
TTA Specific Optimizations
TTA allows extra scheduling optimizations
E.g., software bypassing
Bypassing can eliminate the need for an RF access
Example:
r1 → add.o, r2 → add.t;
add.r → r3;
r3 → sub.o, r4 → sub.t;
sub.r → r5;
Translates to:
r1 → add.o, r2 → add.t;
add.r → sub.o, r4 → sub.t;
sub.r → r5;
However, this is more difficult to schedule!
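The bypassing transformation in the example can be sketched as a pass over a flat list of moves. This is a minimal sketch, not the MOVE scheduler: the function name software_bypass is hypothetical, the caller must supply the set of registers known to be dead after their use (an assumption standing in for real liveness analysis), and the sketch assumes the producing unit's result port still holds the value when the consumer reads it.

```python
def software_bypass(moves, dead_registers):
    """Forward FU results directly to consumers, dropping dead RF writes.

    moves: list of (source, destination) pairs in program order.
    dead_registers: registers whose value is not needed in the RF (assumed given).
    """
    out = []
    forwarded = {}  # register name -> FU result port that produced its value
    for src, dst in moves:
        if src.endswith(".r") and dst in dead_registers:
            forwarded[dst] = src          # remember the producer, skip the RF write
            continue
        forwarded.pop(dst, None)          # a normal write invalidates old forwarding
        out.append((forwarded.get(src, src), dst))
    return out


program = [
    ("r1", "add.o"), ("r2", "add.t"),
    ("add.r", "r3"),
    ("r3", "sub.o"), ("r4", "sub.t"),
    ("sub.r", "r5"),
]
bypassed = software_bypass(program, dead_registers={"r3"})
for src, dst in bypassed:
    print(f"{src} → {dst}")
```

With r3 marked dead, the write add.r → r3 disappears and the read of r3 becomes add.r → sub.o, matching the translated sequence above. The scheduling difficulty comes from the timing assumption: the consumer move must now be placed while the result is still available at the producer.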
Design Space Exploration
[Figure: design space exploration flow (MOVE Project at DUT). Inputs: Application (C/C++), Resources (Mach), FU models, Cost Functions. Resource Optimization stage: Frontend, Map&Schedule, Select Resources, Simulator, producing Design Points. Connectivity Optimization stage: Map&Schedule, Reduce Connections, Simulator, producing a Design Point.]
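Exploration: Resource Optimization
(MOVE Project at DUT)
Pareto curve represents the lower bound of the architecture configurations found
Selected architecture for further optimization
[Figure: plot of explored architecture configurations with their Pareto curve; the selected configuration is labeled with ALU (x2), LSU (x3), IRU (x2), and IU (x3) units.]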
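The Pareto curve over explored configurations can be computed as follows. This is a sketch under assumptions: each design point is reduced to a pair (cost, cycles) where lower is better for both, and the numbers are made-up illustration data, not results from the MOVE project.

```python
def pareto_front(points):
    """Return the non-dominated (cost, cycles) points; lower is better for both."""
    front = []
    for p in sorted(points):                  # ascending cost, then ascending cycles
        # Keep a point only if it is strictly faster than every cheaper-or-equal one.
        if not front or p[1] < front[-1][1]:
            front.append(p)
    return front


# Hypothetical design points: (area cost, execution cycles)
design_points = [(10, 900), (12, 700), (15, 720), (18, 400), (18, 450), (25, 400)]
print(pareto_front(design_points))  # [(10, 900), (12, 700), (18, 400)]
```

Every point left off the front is matched or beaten on both axes by some point on it, so the designer only needs to choose among the points on the curve.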
Exploration: Connectivity Optimization
(MOVE Project at DUT)
Reduced connections decrease bus delay
Critical connections have been removed
[Figure: selected configuration after connectivity optimization, labeled with ALU (x2), LSU (x3), IRU (x2), and IU (x3) units.]
Topics to be Investigated
Poor code density
good target for code compression techniques
a priori information of application, thus instruction probabilities known
Estimations
Power estimation
Fast estimations with sufficient accuracy
Flexibility, reuse
Applications may change, thus additional resources need to be assigned although not needed by the original application
Tool-assisted special function unit generation
Analysis support
Model creation support
Characterization support
Parameterized processor generator
Interconnections, control, etc. may be realized in several ways depending on the target
Low-power optimizations
Clustered TTAs
Interprocessor communication schemes
These topics are considered in the FlexDSP Project at TUT
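To illustrate why known instruction probabilities make code compression attractive, the sketch below compares a fixed-width encoding with the entropy lower bound for a given instruction stream. The function names and the four-symbol stream are assumptions chosen for the example; real TTA instruction words are far wider and would be compressed field by field.

```python
import math
from collections import Counter


def entropy_bits(program):
    """Shannon entropy in bits per instruction of the observed stream."""
    counts = Counter(program)
    total = len(program)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def fixed_width_bits(program):
    """Bits per instruction when every distinct instruction gets an equal-length code."""
    return max(1, math.ceil(math.log2(len(set(program)))))


# Hypothetical stream: A occurs 4 times, B twice, C and D once each
stream = ["A"] * 4 + ["B"] * 2 + ["C", "D"]
print(fixed_width_bits(stream))  # 2
print(entropy_bits(stream))      # 1.75
```

The gap between the two numbers is the saving an ideal entropy coder could reach for this stream; the more skewed the instruction probabilities, the larger the gap.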
New Design Environment
Target of FlexDSP Project at TUT
[Figure: new design environment. Inputs: Functionality (C/C++), FU models (C, HDL), Cost Functions (area, power, speed). Blocks: Frontend, Operation Analysis, Resource Constraints, Design Space Exploration, SFU Generation, Parametric Compiler, Code Compression, Parametric Processor Generator. Outputs: Parallel Object Code, TTA Processor HDL Code.]
Conclusions
Design methodologies allowing processor customization will improve efficiency in certain application areas, e.g., multimedia, telecom
TTA is a promising candidate for an architectural template for customized processors
In particular, support for custom function units allows powerful tailoring
Results of the MOVE project at DUT have already proven the concept
Parameterized compiler allows tool-assisted design space exploration
Still more research needed on
Hardware implementations
Enhanced compiler strategies