Designing Programmable Platforms: From ASIC to ASIP

advertisement
Designing Programmable Platforms:
From ASIC to ASIP
MPSoC 2005
Heinrich Meyr
CoWare Inc., San Jose
and
Integrated Signal Processing Systems
(ISS),
Aachen University of Technology,
Germany
Agenda
ƒ Facts & Conclusions
ƒ Heterogeneous MPSoC
» Energy Efficiency vs.Flexibility
» How to explore the Design Space?
ƒ ASIP Design
Agenda
ƒ Economics of SoC Development
ƒ Conclusions
Facts & Conclusion
Core Proposition
ASIP based Platforms
(heterogenousMPSoC)
Agenda
ƒ Facts & Conclusions
ƒ Heterogeneous MPSoC
» Energy Efficiency vs.Flexibility
» How to explore the Design Space?
ƒ ASIP Design
Agenda
ƒ Economics of SoC Development
ƒ Conclusions
Heterogeneous MPSoC
Trade-off between Flexibility and
Energy -Efficiency
Architectural Objectives
Need more MOPS/Watt and MOPS/mm² to minimize the
global performance measure for battery driven devices
Energy / decoded Bit = (Joule/Bit)
Computational Effiency vs. Flexibility
Source: T.Noll, RWTH Aachen
Enabling MP-SoC Design
System Level Tools I: Application & IP Creation
algorithm
domain
algorithmic exploration
• Matlab
• SPW
• System Studio
system application design
Architecture
Description
Language
block specification
• LISATek Processor Synthesis
• ConvergenSC Buscompiler
High-level IP block design
micro
architecture
domain
block implementation
• RTL Synthesis
System Level Tools I: Application & IP Creation
algorithm
domain
SystemC
Transaction
Level
Modeling
Architecture
Description
Language
micro
architecture
domain
algorithmic exploration
abstract architecture
• Matlab
System
• SPW
•System Studio application design
• MP-SoC Intermediate Representation
MP-SoC platform design
System
Level Tools II:
MP-SoC Platform Design
virtual prototype
• ConvergenSC Platform Creator
• ConvergenSC Platform Creator
block specification
• LISATek Processor Synthesis
• ConvergenSC Buscompiler
High-level IP block design
block implementation
• RTL Synthesis
Agenda
ƒ Facts & Conclusions
ƒ Heterogeneous MPSoC
» Energy Efficiency vs.Flexibility
» How to explore the Design Space?
ƒ ASIP Design
Agenda
ƒ Economics of SoC Development
ƒ Conclusions
Processor Design Space
Micro Architecture Design
Instruction Set Design
butterfly 0
butterfly 1
load/store
• Exploit- regularity/parallelism
in
Instruction-Set
Design
Instruction-Set
Design
data flow/data storage
- Compiler
Design
Compiler
Design
• VLIW,- SIMD,
?
• Which instructions for compiler support?
• Instruction Encoding?
• How much general purpose registers?
FE
FE
DC
DC
EX
EX
WB
WB
-Micro
-MicroArchitecture
ArchitectureDesign
Design
• Pipeline length ?
MESCAL
2:
• Shared resources ?
• Bypass ?
• Parallel execution
units ?
Inclusively
identify
the
Optimal
Optimaldesign
designrequires
requirespowerful
powerfultools
tools
architectural
space
and
RTL Design
which cache required ?
andautomation
automation!!Soc Integration
• Area constraints met?
• Clock frequency?
-RTL Design
-RTL Design
--RTL
RTLISS
ISSCo-verification
Co-verification
Core
MMU
Cache
-System
-SystemIntegration
Integration
bus fast enough?
--Embedded
EmbeddedSoftware
Software
Simulation
communication?
Simulation
Memory
Peripheral
Architecture Description Language based Processor Design
ƒ
The purpose of an architecture description language (e.g
LISA) is:
» To allow for an iterative design to efficiently explore
architecture alternatives
MESCAL–Compiler”
3:
» To jointly design “Architecture
and on chip
communication Efficiently describe
» To automatically generate hardware (path to
the ASIP
implementation)
» To automatically generate tools
» Assembler ,Linker, Compiler, Simulator, co-simulation
interfaces
ƒ
From a single model at various level of temporal and spatial
abstraction
LISA 2.0 - Abstraction Levels
architecture
very detailed
phase
accurate
model
+ IRQ, etc.
Functional units,
Registers,
Memories
Pseudo
Resources
(e.g. cc-variables)
instruction
accurate
model
ac
cu
ra
cy
cycle
accurate
model
+ Pipelines
no details
high
level
model
Pseudo
Processor
Cycles
Instructions Instructions
time
Phases
Empty Model
Describe/Adopt
Processor Model
Generate
Tools
Application
RISC Sample
VLIW Sample
DSP Sample
Custom
Processor
Model
(LISA 2.0
language)
LISATek
Software
MESCAL 3: Software
Tool
Tool
FFT Processor
Chain
Chain
Processor
Efficiently
describe
Designer
LISATek IP
and evaluate the
Samples
Rapid
-targetable simulation
-generation allows
Rapidmodeling
modelingand
andre
re-targetable
simulation++code
code-generation
allowsfor:
for:
ASIP
joint optimization of application and architecture
joint optimization of application and architecture
Generate...
RTL
RTL
MESCAL
5: RTL
SoC
Executable
Integration Kit Software
Sucessfully
(e.g.:SystemC)
) Platform deploy
e.g.:SystemC
the ASIP
Function and instruction level
profiling reveals hot-spots
-> special purpose instructions
•Instruction Set
Synthesis
•Memory architecture
•Verification
E
X
P
L
O
R
A
T
I
O
N
Current Work
I
M
LISA Description
P
LISA Compiler
C-Compiler
L
HDL Generator
E
Optimization
Assembler
M
Linker
E
SystemC, VHDL, Verilog Output
Simulator
N
T
Model Verification
Gate Level Synthesis
A
& Evaluation
T
Evaluation Results
Evaluation Results
I
Chip Area, Clock Speed,
Profile Information,
O
Power Consumption3:
Application Performance
MESCAL
N
…..evaluate the ASIP
Target Architecture
June 10, 2004
A Novel Approach for Flexible and
Consistent ADL-driven ASIP Design
Gunnar Braun
Achim Nohl
CoWare, Inc
DAC Booth #1844
www.CoWare.com
Weihua Sheng, Jianjiang Ceng, Manuel Hohenauer,
Hanno Scharwächter, Rainer Leupers, Heinrich Meyr
Integrated Signal Processing Systems (ISS)
Aachen University of Technology
Germany
Introduction
Architecture Description Languages (ADL)
•
Automatic generation of Software Toolkit
(Compiler, Assembler, Linker, IS-Simulator)
•
Architecture Exploration
•
SystemC models, RTL code, verification tools, ...
Challenges:
• Different tools need different information
• Unambiguous, redundancy-free architecture model
(rather than tools description)
• Multiple abstraction levels (instruction-accurate
and/or cycle-accurate)
Tool Requirements: Compiler
aa == bb ++ c;
c;
C
mul rd = rs, rt
rs
rt
ld rd = @
@
*
LD
rd
rd
add rd = rs, rt
st @ = rs
C
C Compiler
Compiler
rs
add
add cc == a,
a, bb
Assembly
rt
rs
+
ST
rd
@
Tool Requirements: Simulator
Machine Code add
add r5
r5 == r2,
r2, r1
r1
mul rd = rs, rt
MUL_read (rs, rt);
MUL_add ();
Update_flags ();
writeback (rd);
ld rd = @
LSU_addrgen();
data_bus.req();
data_bus.read();
writeback (rd);
Simulator
Simulator
add rd = rs, rt
ALU_read (rs, rt);
ALU_add ();
Update_flags ();
writeback (rd);
st @ = rs
LSU_addrgen();
LSU_read(rs);
data_bus.req();
data_bus.write(rs);
Simulation Code (C)
ALU_read
ALU_read(r2,
(r2,r1);
r1);
ALU_add
ALU_add();
();
Update_flags
Update_flags();
();
writeback
writeback(r5);
(r5);
ADL Model
aa == bb ++ c;
c;
ADL Model
add
add r5
r5 == r2,
r2, r1
r1
SYNTAX
{{
SYNTAX
“ADD“
“ADD“ dst,
dst, src1,
src1, src2
src2
}}
add rd = rs, rt
CODING
{{
CODING
ALU_read (rs, rt);
0b0010
dst
0b0010
dst src1
src1 src2
src2
ALU_add ();
}} Update_flags ();
writeback (rd);
C
C Compiler
Compiler
BEHAVIOR
BEHAVIOR {{
ALU_read
(src1,
src2);
ALU_read
(src1,
src2);
rs
rt
ALU_add
ALU_add ();
();
Update_flags
Update_flags ();
+ (dst);();
writeback
writeback (dst);
}}
Simulator
Simulator
rd
SEMANTICS
SEMANTICS {{
src1
src1 ++ src2
src2 Æ
Æ dst;
dst;
}}
add
add cc == a,
a, bb
ALU_read
ALU_read(r2,
(r2,r1);
r1);
ALU_add
ALU_add();
();
Update_flags
Update_flags();
();
writeback
writeback(r5);
(r5);
Problem Statement
• Compiler and Simulator need different information:
• Compiler: C operation to instruction(s)
ƒ
WHAT is the instruction good for? Purpose?
• Simulator: instructions to sequence of operations
ƒ
HOW is the instruction executed? What actions to perform?
• Architecture Designer‘s Perspective:
?
?
?
src1
src1 ++ src2
src2 Æ
Æ dst;
dst;
ALU_read
ALU_read (src1,
(src1, src2);
src2);
ALU_add
();
ALU_add ();
Update_flags
Update_flags ();
();
write
back
(dst);
writeback (dst);
Examples
ASDSP FPGA Implementation
Myjung Sunwoo, Ajiou University,
ASDSP Core Design
9SEC 0.18um Synthesis
• Gate : 77,000
• Program Memory : 4 Kbyte, Data Memory : 8 Kbyte
• Frequency : 290MHz
• Power consumption : 0.87W (3mW/MHz)
FPGA Implementation
9 iProve Xilinx xc2v6000
Support the Special Instruction Set for FFT Operation and the BMU Instruction
Improve the Performance for OFDM Communication
The ICORE
A low-power ASIP for Infineon DVB-T 2nd
generation Single-Chip Receiver:
• ASIP for DVB-T acquisition and tracking algorithms
(sampling-clock-synchronization, interpolation / decimation,
carrier frequency offset estimation)
• Harvard Architecture
• 60 mostly RISC-like Instructions &
Special Instructions for CORDIC-Algorithm
• 8x32-Bit General Purpose Registers, 4x9-Bit Address Registers
• 2048x20-Bit Instruction ROM, 512x32-Bit Data Memory
• I2C Registers and dedicated interfaces for external communication
The Motorola M68HC11 Architecture
The Motorola M68HC11
Architecture
Increasing
SW Content- but How?
Architecture Overview
ƒ M68HC11 CPU Architecture : Hot spots
» 8-bit micro-controller.
» Harvard Architecture
»
»
»
»
»
»
»
»
7 CPU Registers.
6 different Addressing Modes.
Shared data and program bus.
: stalled data access
Instruction width : 8,16, 24, 32, 40 : multi-cycle fetch
8-bit opcode : 181 instructions
Clock speed : ~200 MHz
Performance :
:
non-pipelined
Area : 15K to 30K (DesignWare® Library)
Architecture Development with LISA
FE
16
32
32
DC
32
32
EX
++ pipelined
pipelined architecture
architecture
++ separate
program
and
data
bus
separate
program
and
data
bus
16
0x0000
512Bytes int. RAM
ACCU
Accu A
64Bytes Conf. Reg.
Accu B
Index X
Index Y
3.5K ext. RAM
Stack Pointer
61K ext. RAM
0x10000
Condition
Results
•Area
< 23k gates
•Clock speed
~ 200 MHz
•Execution time speed up
62 % for spanning tree application
•Mapped onto Xilinx FPGA
Architecture Development with LISA
•Studying the architecture
4 days
•Basic architecture modifications
2 days
•Grouping and coding of the instructions 1 day
•Writing the LISA model
-basic syntax and coding
4 days
-behavior section
6 days
•Validation
4 days
•HDL Generation
Total
2 days
23 days
Design of Application Specific
Processor Architectures
Rainer Leupers
RWTH Aachen University
Software for Systems on Silicon
leupers@iss.rwth-aachen.de
Institute for Integrated Signal Processing Systems
Overview
1. Introduction
2. ASIP design methodologies
3. Software tools
4. ASIP architecture design
5. Case study
6. Advanced research topics
2005 © R. Leupers
4
1. Introduction
2005 © R. Leupers
5
Embedded system design automation
¾ Embedded systems
ƒ Special-purpose electronic devices
ƒ Very different from desktop computers
¾ Strength of European IT market
ƒ Telecom, consumer, automotive, medical, ...
ƒ Siemens, Nokia, Bosch, Infineon, ...
¾ New design requirements
ƒ Low NRE cost, high efficiency requirements
ƒ Real-time operation, dependability
ƒ Keep pace with Moore´s Law
2005 © R. Leupers
6
What to do with chip area ?
2005 © R. Leupers
7
Example: wireless multimedia terminals
¾ Multistandard radio
ƒ UMTS
ƒ GSM/GPRS/EDGE
ƒ WLAN
ƒ Bluetooth
ƒ UWB
ƒ …
¾ Multimedia standards
ƒ MPEG-4
ƒ MP3
ƒ AAC
ƒ GPS
ƒ DVB-H
ƒ …
2005 © R. Leupers
Keyissues:
issues:
Key
Timeto
tomarket
market(≤
(≤12
12months)
months)
••Time
Flexibility(ongoing
(ongoingstandard
standard
••Flexibility
updates)
updates)
Efficiency(battery
(batteryoperation)
operation)
••Efficiency
8
Application specific processors (ASIPs)
„As the performance of conventional microprocessors improves, they
first meet and then exceed the requirements of most computing
applications. Initially, performance is key. But eventually, other factors,
like customization, become more important to the customer...“
[M.J. Bass, C.M. Christensen: The Future of the Microprocessor Business, IEEE Spectrum 2002]
design budget = (semiconductor revenue) × (% for R&D)
growth ≈ 15%
≈ 10%
# IC designs = (design budget) / (design cost per IC)
growth ≈ 15%
growth ≈ 50-100%
[Keutzer05]
→ Customizable application specific processors as
reusable, programmable platforms
2005 © R. Leupers
9
Digital
Signal
Processors
Application
Specific Instruction
Set Processors
Field
Programmable
Devices
SW
Why
use ASIPs?
Design
• Higher efficiency for given range
of applications
• IP protection
Design
• Cost reduction (no HW
royalties)
• Product differentiation
Application
Specific
ICs
Physically
Optimized
ICs
105 . . . 106
Log F L E X I B I L I T Y
General
Purpose
Processors
Log P O W E R D I S S I P A T I O N
Efficiency and flexibility
Log P E R F O R M A N C E
103 . . . 104
Source: T.Noll, RWTH Aachen
2005 © R. Leupers
10
2. ASIP design
methodologies
2005 © R. Leupers
12
ASIP architecture exploration
Application
initial processor
architecture
optimized
processor
architecture
2005 © R. Leupers
Profiler
Compiler
Simulator
Assembler
Linker
13
Expression (UC Irvine)
2005 © R. Leupers
14
Tensilica Xtensa/XPRES
Source: Tensilica Inc.
2005 © R. Leupers
15
MIPS CorXtend/CoWare CorXpert
H
o
t
s
p
o
t
Profile
and
identify
custom
instructions
Replace
critical code
with special
instruction
2
User Defined Instruction
+
Synthesize
HW and
profile
with
MIPSsim
and
extensions
3
CorExtend Module
1
User Defined Instruction
2005 © R. Leupers
16
CoWare LISATek ASIP architecture exploration
¾ Integrated embedded
processor development
environment
¾ Unified processor
model in LISA 2.0
architecture description
language (ADL)
¾ Automatic generation
of:
ƒ SW tools
ƒ HW models
2005 © R. Leupers
17
LISA operation hierarchy
main
Reflects hierarchical
organization of ISAs
decode
addr
imm
linear
2005 © R. Leupers
cond
opcode
opnds
cycl
control
arithm
move
short
add
sub
mul
and
or
long
18
LISA operations structure
LISA operation
References to other operations
DECLARE
Binary coding
CODING
Assembly syntax
SYNTAX
Resource access, e.g. registers
EXPRESSION
Initiate “downstream” operations in pipe
ACTIVATION
BEHAVIOR
Computation and processor state
update
C compiler generation
SEMANTICS
2005 © R. Leupers
19
LISA operation example
OPERATION ADD
{
DECLARE
{
GROUP src1, src2, dest = { Register }
}
CODING { 0b1011 src1 src2 dest }
SYNTAX { “ADD” dest “,” src1 “,” src2 }
BEHAVIOR { dest = src1 + src2; }
}
OPERATION Register
{
DECLARE
{
LABEL
}
C/C++ Code
ADD
index;
CODING { index }
src1
src2
dest
SYNTAX { “R” index }
EXPRESSION{ R[index] }
Register
}
2005 © R. Leupers
Register
Register
20
Exploration/debugger GUI
••Application
Applicationsimulation
simulation
••Debugging
Debugging
••Profiling
Profiling
••Resource
Resourceutilization
utilizationanalysis
analysis
••Pipeline
Pipelineanalysis
analysis
••Processor
Processormodel
modeldebugging
debugging
••Memory
Memoryhierarchy
hierarchyexploration
exploration
••Code
Codecoverage
coverageanalysis
analysis
••...
...
2005 © R. Leupers
21
Some available LISA 2.0 models
¾ DSP:
• VLIW:
ƒ Texas Instruments
TMS320C54x
ƒ Analog Devices
ADSP21xx
ƒ Motorola 56000
¾ RISC:
ƒ MIPS32 4K
ƒ ESA LEON SPARC 8
ƒ ARM7100
ƒ ARM926
2005 © R. Leupers
– Texas Instruments
TMS320C6x
– STMicroelectronics
ST220
• µC:
– MHS80C51
• ASIP:
– Infineon PP32 NPU
– Infineon ICore
– MorphICs DSP
22
3. Software tools
2005 © R. Leupers
23
Tools generated from processor ADL model
Application
Profiler
Compiler
Simulator
Assembler
Linker
2005 © R. Leupers
24
Instruction set simulation
Interpretive:
• flexible
• slow (~ 100 KIPS)
Compiled:
• fast (> 10 MIPS)
• inflexible
• high memory
consumption
JIT-CCS™:
• „just-in-time“
compiled
• SW simulation cache
• fast and flexible
2005 © R. Leupers
Decode
Instruction
Application
Execute
Memory
Run-Time
Run-Time
Instruction Behavior
Application
Simulation
Compiler
Compile-Time
Compile-Time
Program
Memory
Program
Memory
Run-Time
Run-Time
Instruction
Application
Execute
Compiled
Simulation
Instruction Behavior
Decode
Compiled
Simulation
Cache
Execute
Run-Time
Run-Time
25
JIT-CC simulation performance
9
100
8
90
7
80
70
6
60
5
50
4
40
3
30
C
2005 © R. Leupers
6
25
12
51
2
10
24
20
48
40
96
81
9
16 2
38
32 4
76
8
0
8
0
64
10
32
1
8
16
20
o
In m p
te ile
rp d
re
ti v
e
2
Cache Miss Ratio [%]
Performance [MIPS]
• Dependent on simulation cache size
• 95% of compiled simulation performance @ 4096 cache
blocks (10% memory consumption of compiled sim.)
• Example: ST200 VLIW DSP
Cache size [records]
26
Why care about C compilers?
¾ Embedded SW design becoming predominant manpower
factor in system design
¾ Cannot develop/maintain millions of code lines in assembly
language
¾ Move to high-level programming languages
2005 © R. Leupers
27
Why care about compilers?
¾ Trend towards heterogeneous multiprocessor systems-onchip (MPSoC)
¾ Customized application specific instruction set processors
(ASIPs) are key MPSoC components
¾ How to achieve efficient compiler support for ASIPs?
ASIC
ASIC
ASIC
ASIC
CPU
CPU
ASIP
ASIP
ASIP
ASIP
CPU
CPU
ASIP
ASIP
Memory
Memory
Memory
Memory
Memory
Memory
CPU
CPU
Mem
Mem
2005 © R. Leupers
28
C compiler in the exploration loop
„Compiler/Architecture Co-Design“
¾ Efficient C-compilers cannot be
designed for ARBITRARY architectures!
Application
Software
Compiler
Processor
Results
¾ Compiler and processor form a UNIT that needs to be
optimized!
¾ “Compiler-friendliness“ needs to be taken into account
during the architecture exploration!
2005 © R. Leupers
29
Retargetable compilers
Classical compiler
source
code
Compiler
Retargetable compiler
source
code
processor
model
processor
model
Compiler
Compiler
asm
code
asm
code
2005 © R. Leupers
30
GNU C compiler (gcc)
• Probably the most widespread retargetable compiler
• Mostly used as a native Unix/Linux compiler, but may
operate as a cross-compiler, too
• Support for C/C++, Java, and other languages
• Comes with comprehensive support software, e.g. runtime
and standard libraries, debug support
• Portable to new architectures by means of machine
description file and C support routines
“The
“Themain
maingoal
goalof
ofGCC
GCCwas
wasto
tomake
makeaagood,
good,fast
fastcompiler
compilerfor
for
machines
machinesin
inthe
theclass
classthat
thatthe
theGNU
GNUsystem
systemaims
aimsto
torun
runon:
on:32-bit
32-bit
machines
machinesthat
thataddress
address8-bit
8-bitbytes
bytesand
andhave
haveseveral
severalgeneral
generalregisters.
registers.
Elegance,
Elegance,theoretical
theoreticalpower
powerand
andsimplicity
simplicityare
areonly
onlysecondary.”
secondary.”
2005 © R. Leupers
31
CoSy compiler system (ACE)
• Universal retargetable C/C++
compiler
• Extensible intermediate
representation (IR)
• Modular compiler
organization
• Generator (BEG) for code
selector, register allocator,
scheduler
2005 © R. Leupers
© ACE - Associated
Compiler Experts
34
LISATek C compiler generation
LISA
processor model
SYNTAX
{{
SYNTAX
“ADD“
“ADD“ dst,
dst, src1,
src1, src2
src2
}}
CODING
{{
CODING
0b0010
dst
0b0010 dst src1
src1 src2
src2
}}
BEHAVIOR
BEHAVIOR {{
ALU_read
ALU_read (src1,
(src1, src2);
src2);
ALU_add
ALU_add ();
();
Update_flags
Update_flags ();
();
writeback
writeback (dst);
(dst);
}}
SEMANTICS
SEMANTICS {{
src1
src1 ++ src2
src2 Æ
Æ dst;
dst;
}}
…
…
GUI
Autom. analyses
Manual refinement
CoSysystem
system
CoSy
Compiler
CCCompiler
2005 © R. Leupers
36
LISATek compiler generation
C-Code
ASM-Code
int a,b,c;
a = b+1;
c = a<<3;
…
LD R1, [R2]
ADD R1, #1
SHL R1, #3
…
CodeFrontend RegisterOpt
Selector
Allocator
FE
DE
InstructionFetch
ADD
SUB
JMP
Backend
Scheduler
ADD …
ALU
…#1
SUB
MUL
JMP
2
1
ADD
2
3Mem
Jump
Decoder
Pipeline Control
EX
WB
WriteBack
Decoder
Registers
R[0..31]
…R[i]
…
2005 © R. Leupers
Prog Data
RAM RAM
37
Compiled code quality: MIPS example
¾LISATek
¾LISATekgenerated
generatedC-Compiler
C-Compiler
¾Out-of-the-box
¾Out-of-the-boxC-Compiler
C-Compiler
¾No
¾Nomanual
manualoptimizations
optimizations
¾Development
¾Developmenttime
timeof
ofmodel
model
approx.
approx.22weeks
weeks
gcc
gccC-Compiler
C-Compiler
¾gcc
¾gccwith
withMIPS32
MIPS324kc
4kcbackend
backend
¾Used
¾Usedby
bymost
mostMIPS
MIPSusers
users
¾Large
¾Largegroup
groupof
ofdevelopers,
developers,
several
severalman-years
man-yearsof
ofoptimization
optimization
Size
Cycles
140.000.000
80.000
120.000.000
70.000
60.000
100.000.000
Overhead
Overheadof
of10%
10%in
incycle
cyclecount
countand
and17%
17%in
incode
codedensity
density
50.000
80.000.000
Cycles
Size
40.000
60.000.000
30.000
40.000.000
20.000
20.000.000
10.000
0
0
gcc,-O4
gcc,-O2
2005 © R. Leupers
cosy,-O4
cosy,-O2
gcc,-O4
gcc,-O2
cosy,-O4
cosy,-O2
38
Demands on code quality
¾ Compilers for embedded processors have to generate
extremely efficient code
ƒ
Code size:
» system-on-chip
» on-chip RAM/ROM
ƒ
Performance:
» real-time constraints
ƒ
Power/energy consumption:
» heat dissipation
» battery lifetime
2005 © R. Leupers
39
Compiler flexibility/code quality trade-off
variety of
embedded
processors
unification
retargetable
compilation
dedicated
optimization
techniques
specialization
DSP
2005 © R. Leupers
NPU
VLIW
40
Adding processor-specific code optimizations
¾ High-level (compiler IR)
ƒ Enabled by CoSy´s engine concept
¾ Low-level (ASM):
LISA
LISACC
Compiler
Compiler
.C
.C
Unscheduled
Unscheduled
.asm
.asm
Assembly API
Scheduled
Scheduled&&
Optimized
Optimized
.asm
.asm
Optimization
11
Optimization
Optimization
22
Optimization
Optimization
Optimization33
Binary Code Generation
Assembler
Assembler
2005 © R. Leupers
Linker
Linker
.out
41
4. ASIP architecture
design
2005 © R. Leupers
47
ASIP implementation after exploration
2005 © R. Leupers
48
Unified Description Layer
LISA
HDL Generation
Register-Transfer-Level
Gate–Level Synthesis
(e.g. SYNOPSYS design compiler)
Gate–Level
2005 © R. Leupers
49
Challenges in Automated ASIP Implementation
ADL:
HDL:
Instructions
Arithmetic
Mul
Multiplier
(MUL)
Control
JMP
Mac
1:1
Mapping
BRC
Multiplier
(MAC)
Independent description of
instruction behavior:
Independent mapping to
hardware blocks:
+ Efficient Design Space Exploration
- Insufficient architectural efficiency
by 1:1 mapping
2005 © R. Leupers
50
Unified Description Layer
LISA
Structure & Mapping
(incl. JTAG/DEBUG)
Optimizations
Unified Description Layer
Backend (VHDL,
Verilog, SystemC)
Register-Transfer-Level
Gate–Level Synthesis
(e.g. SYNOPSYS design compiler)
Gate–Level
2005 © R. Leupers
51
Optimization strategies
LISA: separate descriptions
for separate instructions
Goal: share hardware for
separate instructions
Instruction A
Instruction B
LISA
Operation A
LISA
Operation B
Mutual
Exclusiveness
a
Possible Optimizations
• ALU Sharing
b
c
d
+
+
x
y
a c
b d
+
x,y
2005 © R. Leupers
52
Optimization strategies
LISA: separate descriptions
for separate instructions
Goal: same hardware for
separate instructions
Instruction A
Instruction B
LISA
Operation A
LISA
Operation B
Mutual
Exclusiveness
…
• Path Sharing
Resource
Sharing
Register Array
…
• ALU Sharing
DataA
…
Possible Optimizations
Path PA
AddressA
Path PB
DataB
AddressB
• ...
…
Register Array
2005 © R. Leupers
DataA, DataB
AddressA
AddressB
53
5. Case study
2005 © R. Leupers
54
Motorola 6811
Project Goals:
• Performance (MIPS) must be increased
• Compatibility on the assembly level
for reuse of legacy code
(Integration into existing tool flow)
• Royalty free design
Î compatible architecture developed with LISA
using RTL processor synthesis
2005 © R. Leupers
55
Motorola 6811
legacy code
compiler
assembly
assembler
0100101010011010
1110010110101111
0000110110110100
assee
IInnccrreea nccee!!!!
rm
maan
PPeerrffoor IPPSS))
((M
MI
?
6812 6811
2005 © R. Leupers
56
Motorola 6811
Bluetooth app.
6811 compiler
assembly
assembly level
compatible
assembler
0100101010011010
1110010110101111
0000110110110100
LISA
Synthesized
Architecture
2005 © R. Leupers
57
Architecture Development
original 6811 Processor
LISA 6811 Processor
8 bit instructions
16 bit instructions
16 bit instructions
32 bit instructions
24 bit instructions
32 bit instructions
40 bit instructions
Instruction
Instruction is
is fetched
fetched by
by 88 bit
bit blocks:
blocks:
16
16 bit
bit are
are fetched
fetched simultaneously:
simultaneously:
Î
Î up
up to
to 55 cycles
cycles for
for fetching!
fetching!
Î
Î max
max 22 cycles
cycles for
for fetching!
fetching!
++ pipelined
pipelined architecture
architecture
++ possibility
possibility for
for special
special instructions
instructions
2005 © R. Leupers
58
Tools Flow and RTL Processor Synthesis
C-Application
6811 compiler
LISA tools
Assembly
LISA
model
LISA assembler
Executable
2005 © R. Leupers
6811 compatible architecture
generated completely in VHDL
1) VLSI Implementation:
Area: <17kGates
Clock Speed: ~154 MHz
2) Mapped onto XILINX FPGA
59
References
¾ R. Leupers: Code Optimization Techniques for Embedded
Processors - Methods, Algorithms, and Tools, Kluwer, 2000
¾ R. Leupers, P. Marwedel: Retargetable Compiler Technology
for Embedded Systems - Tools and Applications, Kluwer,
2001
¾ A. Hoffmann, H. Meyr, R. Leupers:
Architecture Exploration for Embedded Processors with LISA,
Kluwer, 2002
¾ C. Rowen, S. Leibson: Engineering the Complex SoC: Fast,
Flexible Design with Configurable Processors, Prentice Hall,
2004
¾ M. Gries, K. Keutzer, et al.: Building ASIPs: The Mescal
Methodology, Springer, 2005
¾ P. Ienne, R. Leupers (eds.): Customizable and Configurable
Embedded Processor Cores, Morgan Kaufmann, to appear
2006
2005 © R. Leupers
75
Download