Design Productivity Crisis

xPilot: A Platform-Based System-Level Synthesis for
Reconfigurable SOCs
Prof. Jason Cong
cong@cs.ucla.edu
UCLA Computer Science Department
Motivation
 Design complexity is outgrowing the traditional RTL method even in current CMOS technologies
 Nanotechnology will enable a 10-100x increase in device density and degree of integration
 Need to enable a higher level of design abstraction
 Start from behavior descriptions (e.g. C or SystemC)
 Use and/or re-use more complex functional units (e.g. processor cores instead of standard cells)
ESL Tools – A Lot of Interest …
xPilot: Platform-Based Synthesis System
[Figure: xPilot system flow. Inputs: SystemC/C and Platform Description & Constraints. Blocks inside xPilot: xPilot Front End, Profiling, SSDM (System-Level Synthesis Data Model), Analysis, Mapping, and three synthesis engines: Processor & Architecture Synthesis (output: Processor Cores + Executables), Interface Synthesis (output: Drivers + Glue Logic), and Behavioral Synthesis (output: Custom Logic). Target: FPSoC.]

Uniqueness of xPilot
 Platform-based synthesis and optimization
 Communication-centric synthesis with interconnect optimization
xPilot: Behavioral-to-RTL Synthesis Flow
Inputs: Behavioral spec. in C/SystemC; Platform description
 Frontend compiler, producing SSDM
 Presynthesis optimizations
 Loop unrolling/shifting
 Strength reduction / Tree height reduction
 Bitwidth analysis
 Memory analysis …
 Core synthesis optimizations
 Scheduling
 Resource binding, e.g., functional unit binding, register/port binding
 Arch-generation & RTL/constraints generation
 Verilog/VHDL/SystemC
 FPGAs: Altera, Xilinx
 ASICs: Magma, Synopsys, …
Output: RTL + constraints for FPGAs/ASICs
System-Level Exploration Using xPilot for
Heterogeneous MPSoC Platforms
 Heterogeneous MPSoCs exploration
 Processors
• Heterogeneous vs. homogeneous
• General-purpose vs. application-specific
 On-chip communication architecture (OCA)
• Bus (e.g. AMBA, CoreConnect), packet switching network (e.g. Alpha 21364)
 Memory hierarchy
[Figure: heterogeneous MPSoC platform. Processing nodes (μPs running tasks on an OS with drivers, plus IP, FPGA, and DSP blocks) each attach through Network Interfaces to a shared Communication Network.]
Outline
 xPilot Overview
 Behavior-level synthesis in xPilot
 System-level synthesis in xPilot
 Recent Progress in xPilot
 Interface synthesis
 Resource binding based on distributed register architecture
 Conclusions
Advantage of Behavior Synthesis
 Shorter verification/simulation cycle
 Better complexity management, faster time to market
 Rapid system exploration
 Quick evaluation of different hardware/software boundaries
 Fast exploration of multiple micro-architecture alternatives
 Higher quality of results
 Platform-based synthesis & optimization
 Full consideration of physical reality
Example: Better Complexity Management

 Shorter verification/simulation cycle
 Simulation speed 100X faster than RTL-based method [NEC, ASPDAC04]
 Significant code size reduction
 RTL design ~300KL → Behavioral design 40KL [NEC, ASPDAC04]
 VHDL code generated by UCLA xPilot targeting Altera Stratix platform
 Over 10x code size reduction can be achieved
Unique Features of xPilot (1):
Platform-based Synthesis & Optimization
 Platform-based synthesis & optimization
 The quality of an RTL design is platform-dependent
 Designers often lack complete and detailed knowledge of the target platform
Resource          Area           Delay (ns)
ADDSUB-24b        25 LUTs        2.27
ADDSUB-32b        33 LUTs        2.61
MUX8to1-24b       120 LUTs       2.92
MUX16to1-24b      264 LUTs       4.658
DSPMUL-18bx18b    2 DSP Blocks   3.833
DSPMUL-24bx24b    8 DSP Blocks   7.688
 Platform: Altera Stratix
 RTL synthesis & place-and-route: Altera QuartusII v5.0
3X3 Delay Matrix, spanning chip locations (0,0) to (95,61):
0.58   1.8   2.8
2.0    2.9   3.7
2.8    3.8   4.7
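As a rough illustration of why this characterization matters to a synthesis tool, the sketch below stores the delay figures from the table above in a lookup and checks whether a chain of resources fits in one clock cycle. The names `STRATIX_LIBRARY`, `chain_delay`, and `fits_in_cycle` are hypothetical, the 8 ns clock period is only an example value, and the model ignores interconnect delay; it is not a description of how xPilot represents a platform.

```python
# Characterization data copied from the table above (Altera Stratix).
# Each entry maps a resource name to (area, delay in ns).
STRATIX_LIBRARY = {
    "ADDSUB-24b": ("25 LUTs", 2.27),
    "ADDSUB-32b": ("33 LUTs", 2.61),
    "MUX8to1-24b": ("120 LUTs", 2.92),
    "MUX16to1-24b": ("264 LUTs", 4.658),
    "DSPMUL-18bx18b": ("2 DSP Blocks", 3.833),
    "DSPMUL-24bx24b": ("8 DSP Blocks", 7.688),
}


def chain_delay(resources):
    """Sum the delays of resources chained combinationally.

    Assumption: interconnect delay between resources is ignored.
    """
    return sum(STRATIX_LIBRARY[name][1] for name in resources)


def fits_in_cycle(resources, clock_period_ns):
    """Return True if the chained resources meet the clock period."""
    return chain_delay(resources) <= clock_period_ns


if __name__ == "__main__":
    # Hypothetical 8 ns clock period, used only for illustration.
    print(fits_in_cycle(["ADDSUB-24b", "MUX8to1-24b"], 8.0))
    print(fits_in_cycle(["DSPMUL-24bx24b", "ADDSUB-24b"], 8.0))
```

Under these assumptions an adder feeding an 8-to-1 multiplexer fits in the cycle, while a 24-bit multiplier chained with an adder does not, so a scheduler that knows the platform would place the two in separate cycles.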
Unique Features of xPilot (2):
Communication-Centric Synthesis & Optimization
 System performance & power are dominated by interconnect
 It is difficult for designers to consider physical layout at the RT level
[Figure: a CDFG with a conditional branch (T/F), adders add1 and add2, and multiplications 2* through 6*, with data transfers between units, shown under two multiplier bindings.]
 Layout-aware performance optimization: overlap computation with communication
 Layout-aware power optimization
 Binding solution 1: mul1 (2,4,5), mul2 (3,6). Both multipliers stay active
 Binding solution 2: mul1 (2,5,6), mul2 (3,4). mul2 can be powered off when the false branch is taken
Unique Features of xPilot (3):
Highly Scalable and Optimized Synthesis Algorithms
 Use of highly scalable and optimized synthesis algorithms for best quality of results
 Interface synthesis: Simultaneous data and communication scheduling for latency minimization
 Scheduling: A unified framework for multi-constraint and multi-objective scheduling based on the system of difference constraints (SDC)
 Resource binding: Use of distributed register architectures for interconnect/communication optimization
 Power optimization: Optimal functional module and voltage binding
…
Behavior and Communication Co-Optimization
for Systems with SCM
 SCM: Sequential Communication Media
 FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
 Data must be read and written in the same order
 Order may have dramatic impact on performance
• The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
DCT example: process P1 on PE1 (custom logic 1) sends data[8] over FIFO channel C to process P2 on PE2 (custom logic 2).
P1 (producer):
for (int i = 0; i < 8; i++) {
S1: data[i] = …;
}
P2 (consumer):
int s07 = data[0] + data[7];
int s16 = data[1] + data[6];
…
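The effect of transmission order in the DCT example can be sketched with a toy latency model. The assumptions here are illustrative and not taken from the slides: the FIFO delivers one item per cycle, the consumer has a single non-pipelined unit that spends `op_latency` cycles on each operand pair, and pairs are processed as soon as both operands have arrived. The function name `consumer_latency` is hypothetical.

```python
def consumer_latency(order, pairs, op_latency):
    """Cycle at which the consumer finishes all pairs.

    order: sequence of data indices in the order they cross the FIFO
           (one item per cycle; the k-th item arrives at cycle k + 1).
    pairs: operand pairs the consumer combines, e.g. (0, 7) for s07.
    op_latency: cycles the single consumer unit spends on each pair.
    """
    arrival = {item: t + 1 for t, item in enumerate(order)}
    # A pair becomes ready once its later operand has arrived.
    ready_times = sorted(max(arrival[a], arrival[b]) for a, b in pairs)
    unit_free = 0
    for ready in ready_times:
        unit_free = max(unit_free, ready) + op_latency
    return unit_free


PAIRS = [(0, 7), (1, 6), (2, 5), (3, 4)]
NATURAL = list(range(8))               # data[0], data[1], ..., data[7]
REORDERED = [0, 7, 1, 6, 2, 5, 3, 4]   # operands of each sum sent together

if __name__ == "__main__":
    print(consumer_latency(NATURAL, PAIRS, 3))
    print(consumer_latency(REORDERED, PAIRS, 3))
```

In this model the natural order finishes at cycle 17 and the reordered one at cycle 14, because the first pair is complete after two transfers instead of five and the consumer can overlap its work with the remaining transfers.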
SCM Co-Optimization: Problem Formulation
 Given:
 A set of processes P connected by a set of channels C
 A set of data D = {d1, d2, …, dm} to be transmitted on each channel cj
 Goal:
 Find the optimal transmission order for each process so that the overall latency of the process network is minimized, subject to the given design constraints and platform specifications
 At the same time, generate the drivers and glue logic for each process automatically
Proposed SCM Co-Optimization Design Flow
[Figure: SCM co-optimization flow. Inputs: Platform Description & Constraints and Process Network. Front End builds the System-Level Synthesis Data Model. SCOOP (SCM CO-Optimization) then performs Communication order detection, Indices compression for loop reordering, and Code transformation and interface generation. Outputs: Drivers + Glue Logics and Process Behavior.]
Communication Order Detection

Step 1. Construct a global CDFG by merging the individual CDFGs of each process

Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG
[Figure: the CDFGs of Process 1 and Process 2, built from + and * operations and linked by FIFO transfers T1, T2, and T3 (Ti: FIFO), shown under two transfer orders with Latency = 5 cycles and Latency = 7 cycles.]
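The two steps above can be sketched with a simple list scheduler over a merged graph in which FIFO transfers are ordinary nodes competing for a one-transfer-per-cycle channel. This is a heuristic stand-in, not the scheduling formulation used in xPilot; the node names, the single-cycle latencies, and the function `list_schedule` are all assumptions for illustration.

```python
def list_schedule(nodes, deps, capacity):
    """Resource-constrained list scheduling with unit-latency nodes.

    nodes: {name: resource_class}, e.g. "fifo", "mul", "add".
    deps: list of (pred, succ) edges of the merged (acyclic) graph.
    capacity: {resource_class: nodes allowed per cycle}, each >= 1.
    Returns {name: start_cycle}.
    """
    succs = {n: [] for n in nodes}
    preds = {n: [] for n in nodes}
    for u, v in deps:
        succs[u].append(v)
        preds[v].append(u)

    # Priority = length of the longest path from the node to a sink.
    height = {}

    def longest(n):
        if n not in height:
            height[n] = 1 + max((longest(s) for s in succs[n]), default=0)
        return height[n]

    for n in nodes:
        longest(n)

    start, cycle = {}, 0
    while len(start) < len(nodes):
        ready = [n for n in nodes if n not in start
                 and all(p in start and start[p] < cycle for p in preds[n])]
        used = {}
        for n in sorted(ready, key=lambda name: -height[name]):
            cls = nodes[n]
            if used.get(cls, 0) < capacity[cls]:
                start[n] = cycle
                used[cls] = used.get(cls, 0) + 1
        cycle += 1
    return start


# Hypothetical merged graph: three transfers feed a mul and two adds.
NODES = {"T0": "fifo", "T1": "fifo", "T2": "fifo",
         "mul": "mul", "add1": "add", "add2": "add"}
DEPS = [("T2", "mul"), ("mul", "add1"), ("T0", "add1"),
        ("add1", "add2"), ("T1", "add2")]
CAPACITY = {"fifo": 1, "mul": 1, "add": 1}

if __name__ == "__main__":
    schedule = list_schedule(NODES, DEPS, CAPACITY)
    transfers = sorted((n for n in NODES if NODES[n] == "fifo"),
                       key=schedule.get)
    print(transfers, max(schedule.values()) + 1)
```

The transfer order read off the schedule places the transfer on the critical path first, which is the property the communication order detection step is looking for.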
Loop Indices Compression
Given the optimal order, we try to generate restructured loops for code compression

 i.e., given the original iteration order and the reordered iteration order, find the minimum number of linear intervals that represent the new iteration space
Original order: (0,0), (0,1), (1,0), (1,1)
After reordering: (0,0), (1,0), (0,1), (1,1)
Need to solve the linear system
\[
\begin{pmatrix} i' \\ j' \end{pmatrix}
=
\begin{pmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{pmatrix}
\begin{pmatrix} i \\ j \\ 1 \end{pmatrix}
\]
Solution: i' = j, j' = i
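For the four-point example above, the affine coefficients can be recovered with a small exact solver. The sketch below fits the coefficients on the first three iteration points using Cramer's rule and then checks the remaining points; the function names `det3` and `fit_affine` are hypothetical, and handling a reordering that needs several linear intervals is outside its scope.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_affine(original, reordered):
    """Find [(a1, b1, c1), (a2, b2, c2)] such that
    i' = a1*i + b1*j + c1 and j' = a2*i + b2*j + c2 for every point.

    Returns None if no single affine map covers all points.
    """
    base = [[i, j, 1] for i, j in original[:3]]
    d = det3(base)
    if d == 0:
        raise ValueError("first three points are collinear")
    rows = []
    for axis in (0, 1):
        target = [p[axis] for p in reordered[:3]]
        coeffs = []
        for col in range(3):
            m = [row[:] for row in base]
            for r in range(3):
                m[r][col] = target[r]
            coeffs.append(Fraction(det3(m), d))
        rows.append(tuple(coeffs))
    # Check that the map also reproduces the remaining points.
    for (i, j), new_point in zip(original, reordered):
        for (a, b, c), expected in zip(rows, new_point):
            if a * i + b * j + c != expected:
                return None
    return rows


if __name__ == "__main__":
    original = [(0, 0), (0, 1), (1, 0), (1, 1)]
    reordered = [(0, 0), (1, 0), (0, 1), (1, 1)]
    print(fit_affine(original, reordered))
```

For the example orders the fitted rows are (0, 1, 0) and (1, 0, 0), which corresponds to i' = j and j' = i, that is, a loop interchange.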
Preliminary Experimental Results

Experimental setting
 Target communication model: two-process producer-consumer model
 Behavioral synthesizer: UCLA xPilot
 RTL simulator: Mentor ModelSim

           Total latency (Cycle#)           RAs Compress
Designs    Trad.    SCOOP    Reduction      Before    After
DCT1       325      290      10.77%         0         0
Haar       142      134      5.63%          0         0
DWT        689      617      10.45%         0         0
Mat_mul    408      339      16.91%         96        20
DCT2       483      419      13.25%         80        64
Masking    620      420      32.26%         192       0
Dot        1903     1084     43.04%         300       0
An average of 26% improvement in total latency can be achieved.
Advantage of Register-File Microarchitectures
[Figure, three panels:]
 (a) A scheduled DFG with register binding indicated on each variable
 (b) Binding using discrete registers
 (c) Binding using a register file
Distributed Register-File Microarchitecture
[Figure: an FP-SoC organized as islands (Island A, Island B, Island C). Labels inside an island: Local Register File, Functional Unit Pool (MUL, ALU, ALU’), FUP MUX, Input Buffers, Data-Routing Logic, and On-chip memory blocks.]

On-chip RAM resource (Virtex II and Stratix)
Xilinx XC-2V      2000    3000    4000    6000     8000
#18Kb BRAM        56      96      120     144      168
Dist. RAM (Kb)    336     448     720     1,056    1,456

Altera EP1        S25     S30     S40     S60      S80
#M512 (512b)      224     295     384     574      767
#M4K (4Kb)        138     171     183     292      364
#M-RAM (512Kb)    2       4       4       6        9
Resource Binding for DRFM
 Facts under simplified assumptions
 Operations bound onto an island form a chain in the given scheduled DFG
 Inter-chain data transfers may share a physical inter-island connection
 The number of inter-island connections is crucial to the QoR of a DRFM instance
[Figure: a scheduled DFG with variables v1 to v10 over control steps 1 to 4, bound to islands A, B, C, and D.]
Inter-island connections
 (A,B)=(A,D)=1
 (A,C)=1, two data transfers share one connection
 (C,D)=2
Resource Binding Problem for DRFM
 General DRFM binding problem
 Given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) so that the quality of B is optimized.
• Hard to characterize the quality of binding solution B
• The problem is too ad hoc
 Relaxed problem – DRFM Binding for Minimizing Inter-Island Connections:
 Given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) so that the total number of inter-island connections of B is minimized.
 Solution: control-step-by-control-step binding with min-cost bipartite matching
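One control step of such a binding can be sketched as an assignment problem: each operation in the step is matched to a distinct island, and the cost of a match is the number of inter-island connections it would add. The sketch below uses brute-force enumeration in place of a proper min-cost bipartite matching algorithm, which is only reasonable for a handful of islands. The cost model, the names `connection_cost` and `bind_control_step`, and the example data are assumptions for illustration, not the authors' formulation.

```python
from itertools import permutations


def connection_cost(source_islands, island, existing):
    """New inter-island connections needed if an operation whose operands
    are produced on `source_islands` is bound to `island`.

    existing: set of (from_island, to_island) connections already built.
    """
    return sum(1 for s in source_islands
               if s != island and (s, island) not in existing)


def bind_control_step(ops, islands, existing):
    """Bind each operation of one control step to a distinct island.

    ops: {op_name: set of islands producing its operands}.
    Returns (total_cost, {op_name: island}).
    Simplification: a connection shared by two operations in the same
    step is counted once per operation.
    """
    names = list(ops)
    best_cost, best = None, None
    for chosen in permutations(islands, len(names)):
        cost = sum(connection_cost(ops[n], isl, existing)
                   for n, isl in zip(names, chosen))
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, dict(zip(names, chosen))
    return best_cost, best


if __name__ == "__main__":
    islands = ["A", "B", "C"]
    existing = {("A", "B")}                # one connection already placed
    ops = {"x": {"A"}, "y": {"A", "C"}}    # operand sources per operation
    print(bind_control_step(ops, islands, existing))
```

Repeating this step for each control step in schedule order, and adding the chosen connections to `existing` after each step, gives the step-by-step flavor of the approach named above.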
Three Experimental Flows for Comparison
[Figure: three experimental flows inside the xPilot behavioral synthesis system. Shared path: xPilot Frontend, SSDM/CDFG, Scheduling algorithms, Scheduled CDFG (STG), binding, RTL generation, Xilinx Virtex II. The flows differ in the binding step:]
1) Binding on Discrete-Register Microarchitecture
2) Baseline (Random) DRFM Binding
3) DRFM Binding for Minimizing Inter-Island Connections
Experimental Results

 Xilinx ISE 7.1; Virtex II; Target clock period: 8ns
 The baseline DRFM binding results achieve a 46.70% slice reduction over the discrete-register approach
 Optimized DRFM binding reduces slices by a further 12.21%
 Overall, more than 2X logic slice reduction with a better clock period (7.8%).
[Figure: two bar charts comparing Discrete-Reg, DRF-Random, and DRF-Opt on benchmarks PR, LEE, CHEN, and DIR. Left: Area (Slices; DRF solutions use on-chip RAM blocks). Right: Clock period (ns).]
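As a quick consistency check on the slice figures above, the two reductions can be composed. This assumes the 12.21% is measured relative to the baseline DRFM result rather than to the discrete-register result, which the slide does not state; the symbols S for slice counts are introduced here only for the calculation.

```latex
\[
\frac{S_{\text{DRF-Opt}}}{S_{\text{Discrete-Reg}}}
  = (1 - 0.4670)\,(1 - 0.1221) \approx 0.468,
\qquad
\frac{S_{\text{Discrete-Reg}}}{S_{\text{DRF-Opt}}}
  \approx \frac{1}{0.468} \approx 2.14
\]
```

Under that assumption the combined reduction is about 2.14X, in line with the stated "more than 2X".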
Conclusions
 xPilot can automatically synthesize a behavior-level C or SystemC representation into RTL code with the necessary design constraints
 Platform-based synthesis with physical planning provides
 Shorter verification/simulation cycle
 Better complexity management, faster time to market
 Rapid system exploration
 Higher quality of results
 xPilot can help to explore the efficient use of (multiple) on-chip processors
 xPilot can efficiently optimize the software for reconfigurable processors
 We are interested in engaging with selected industrial partners to further validate and enhance the technology
Acknowledgements
 We would like to acknowledge support from
 National Science Foundation (NSF)
 Gigascale Systems Research Center (GSRC)
 Semiconductor Research Corporation (SRC)
 Industrial sponsors under the California MICRO programs (Altera, Xilinx)
 Team members:
Yiping Fan
Guoling Han
Wei Jiang
Zhiru Zhang