Networking for Embedded Systems

Platform-based Design
Prof. Jun-Dong Cho, Sungkyunkwan University
Presentation Outline
 Why Platform-based Design?
 The need for S/W configurable platforms
 Design Space of Reconfigurable
Architectures
 Reconfigurable Radio and Multimedia
Systems
 Network-centric Design: Clock and
Power
 Reliable Design
SoC and Customizable
Platform Based-Design
DSP
Reconfigurable
Hardware
(Fine Grain)
ASIC 1
ASIC 2
Reconfigurable
Hardware
(Coarse Grain)
Semiconductor
Revolutions
“Mainstream Silicon Application
is switching every 10 Years”
[Figure: timeline of mainstream silicon from 1957 to 2007 in ten-year steps, labelled standard or custom: TTL; LSI, MSI; µproc., memory; ASICs, accel's; reconfigurable FPGAs; coarse grain. Axis labels: software, hardware.]
Definition of Platforms?
•An architecture that
is designed for an
application domain
Platform classification
 Application Platform:
Multimedia platform: Nexperia, TI's OMAP
3G wireless platform: Infineon's M-gold
Bluetooth platform: Parthus
Wireless platform: ARM's PrimeXsys
 Processor-centric platform
Improv System, ARC, Tensilica, Triscend
 Communication-centric platform:
Sonics, Palmchip
SoC Platform Adaptation
The Platform-Based
Design Concept
Cadence
Pre-Qualified/Verified
Foundation-IP*
HW-SW Kernel
+ Reference Design
Scaleable
bus, test, power, IO,
clock, timing architectures
MEM
Hardware IP
SW IP
Application
Space CPU
FPGA
Reconfigurable Hardware Region
(FPGA, LPGA, …)
Programmable
*IP can be hardware (digital
or analogue) or software.
IP can be hard, soft or
‘firm’ (HW), source or
object (SW)
Processor(s), RTOS(es)
and SW architecture
Foundry-Specific
HW Qualification
SW architecture
characterisation
Platform Architecture
Do I need a
dedicated DSP ?
Which microcontroller? ARM?
HC11? ARC?
Which RTOS do I use? Which scheduling
policy do I have to choose ?
How fast will my
user interface
software run? How
much can I fit onto
my microcontroller?
Which Bus? PI? AMBA?
Dedicated Bus for DSP?
Can I buy a QCELP
decoding core?
Do I need a dedicated
HW or can I run this
on the Microcontroller ?
Example of a commercial
SoC
More CPUs?
More
SRAM/Flash?
Add FPGA?
A Legacy SoC Approach
CoreConnect (PPC), AMBA (ARM)…
Networks-on-Silicon,
Philips
MP-SOC Cluster
Definition of MP-SOC?
Usually
heterogeneous
multiprocessor:
CPUs, DSPs, etc.
Hardwired
accelerators.
Mixed-signal
front end.
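Problems with existing MP-SoCs
▷ Under power constraints, a monolithic processor consumes a large amount of power.
▷ Using several identical (homogeneous) processors gives low resource utilization, so power consumption grows linearly.
▷ The on-chip interconnect has been designed independently, separately from the cores and caches.
▷ The interconnect depends not only on wires but also on logic.
▷ Processors are constrained by wire and memory latencies.
▷ Peak performance is reached only for specific application domains.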
4G: Multiple standards
Communications.
Networking.
Multimedia.
Security.
Digital RF supporting
multiband/multimode
The triangle, Chicken and
Egg?
•Hardware and
software architectures
determine capabilities.
•Applications guide
design decisions.
•Methodologies allow
repeatable, predictable
design.
architectures
applications
methodologies
Why Multi-Threaded Cores?
Increasing
gap: memory
& processor
speeds
(2x / 2 years)
More parallel
processing
(lower-power,
higher-perf./mm2)
GPP
D$
I$
$
H/W
Proc. Element
DSP
DSP
H/W-MT
DSP
RISC I$
NoC
In
……
Out
SRAM
Increasing gap:
interconnect &
gate delays
(multi-clock)
MPSoC “Bus” Alternatives
• Fixed Bus [Bergamaschi, DAC, 2000]
– Point to point communication
– Signals between cores transferred over
dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
– Programmable interconnects
– Employ static network
• Arbitrated Bus [IDT Inc., 2000]
– Time-shared multiple core connectivity
– Uses an arbiter
• Hierarchical Bus [AMBA, ARM Inc]
– Combine multiple buses using bus
– Separate buses for cores and I/O
 NoCBus [Dally, DAC, 2000]
– Resources communicate with data packets
– Use switch fabric
Future mobile platform?
Mudge et al:
 Mobile supercomputing
Speech recognition.
Cryptography.
Augmented reality.
Typical applications (email, etc.).
 Requires 16x 2 GHz Pentium 4.
 Peak power must not exceed 75 mW
Future mobile application
platform?
Culture and Education?
Personal Entertainment
Platform?
Road Map to MP-SoC
Trends
 mask NRE: Over 1M$; design NRE: 10M$ to 75M$
ASICs replaced by programmable ASSP, FPGA’s
 number of embedded processors
DVD/STB/HDTV, mobile phones: 5 to 8
 Image proc, networking, basestation: 8 to 100+
 eS/W complexity
Set-top box, audio: >1 million lines of code
eS/W becoming essential part of SoC’s
?’s Law?
Should the SoC designer work
hard?
Compose the system
Verify
Requirements
Simulate
Verify
SoC Composer
Verify
Synthesis + P&R
Verify (timing, area)
Why is verification
important in
mobile SoC?
Why has our verification
become weak?
Simulate (performance)
Verify
Tape Out
More SoC topics …
Platform optimization
Power management
BW allocation
Resource sharing
Task distribution
Efficient communications
Low Power
Verification
•Training talent (System Architects)
Available Mobile and VLIW Processors
 The ARM Family
The ARM7 Generation
The StrongARM
The ARM Thumb Option
The ARM Piccolo Option
The ARM9 and ARM10
 The Motorola M-Core
 The LSI TinyRisc
 The Hitachi SuperH Family
 VLIW Processors
The Motorola-Lucent Star*Core
The Philips TriMedia
The HP/Intel IA-64
Nexperia™ DVP Hardware
architecture (source: Th. Claasen, Philips, DAC
2000)
Exploitable Parallelism
[Figure: exploitable parallelism versus minimum parallel grain size (instructions), covering instruction-level parallelism, thread-level parallelism (MultiFlex and GP O/S), and task parallelism; grain sizes range from 1 to 10 000's of instructions]
NEC MP211: Homogeneous
MP core
 Asymmetric MP with very coarse-grain multitasking
 3 ARM9's utilized as predefined function units
 No complex overhead: e.g., no cache coherency, no dynamic
scheduling/load balancing
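Advantages of MP-SoC
 Easy hardware implementation: using widely available processor cores
shortens hardware development time and lowers cost.
 Lower power consumption: the distributed tasks are handled by multiple
processors running at a lower clock frequency. A lower clock frequency
permits a lower supply voltage, which reduces power consumption.
 Scalable: performance and cost can be adjusted by increasing or
decreasing the number of processor cores.
 Boosting real-time performance: each application can run on a different
processor, which reduces interference between multiple applications.
 Higher system safety: system software and untrusted applications can be
separated by running them on different processors.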
AMP task allocation image
Bus and Memory
Architecture
MP211 block diagram
Power consumption of
H.264+AAC
Holistic design of multi-core
architectures
 Naïve Methodology is inefficient
 Demonstrated inefficiency for cores and proposed
alternatives
 Single-ISA Heterogeneous Multi-core Architectures
for Power[MICRO03]
 Single-ISA Heterogeneous Multi-core Architectures
for Performance[ISCA04]
 Conjoined-core Chip Multiprocessing [MICRO04]
 What about interconnects?
 How much can interconnects impact processor
architecture?
 Need to be co-designed with caches and cores?
Heterogeneous MP Core

▷ Single-ISA heterogeneous multicore architectures outperform approaches that rely on voltage scaling, clock gating,
or speculation control.

▷ Compared with a homogeneous CMP (Chip Multiprocessor), a heterogeneous CMP (or asymmetric
CMP) has many advantages. Many applications want to use small cores as well as large ones.
Requirements also depend on the system context, such as running on battery versus mains power.
It is therefore efficient to use cores of differing complexity.

▷ A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level
parallelism and instruction-level parallelism at the same time. In a single-ISA heterogeneous CMP, however, all cores execute the same ISA,
so an application can be mapped to any core. The number, size, and type of cores, and the caches, must be decided. For an 8-core
processor, the interconnect consumes as much power as one core. Applying dynamic voltage scaling
and gating unused cores improves the energy-delay product by
75%.

▷ Taking a dual processor as an example, heterogeneous processors that exploit both low and high thread levels
improve performance by 63% over homogeneous ones.

With 5-8 threads, the average improvement is 29%. By Amdahl's law,
the speedup of parallel applications is limited by their serial portion.

▷ Serial portions run quickly on a large core, while parallel portions run on small, low-power
cores, maximizing performance per unit of power. [Annavaram, et al]
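Amdahl's law, cited above, bounds the speedup of a program whose parallel fraction is p when run on n cores: S = 1 / ((1 - p) + p / n). A minimal sketch follows; the 90% parallel fraction is an illustrative assumption.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# With 90% parallel code, the serial 10% caps the speedup below 10x.
for n in (2, 4, 8, 16):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

This is why running the serial portion on one large, fast core pays off: it shrinks the term that adding more small cores cannot touch.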
Heterogeneous MP-SoC
Problems
 Processors are bound by wire and memory
latencies
 Peak performance on only a small class of
applications.
 How well they map to a given design
 Diversification of workloads
 Increased hardware complexity
 Poor resource utilization
Alpha cores scaled to 0.10 um.
EV8 is 80 times bigger but provides only two to three times
more single-threaded performance
Heterogeneous MP Core
If two or more cores share L2, the way a lot of present CMPs do, a
crossbar provides a high bandwidth connection.
A multi-ISA multicore architecture consists of processors with different ISAs
and is designed to handle vector/data-level parallelism and
instruction-level parallelism at the same time.
Characteristics of heterogeneous platforms
 For an 8-core processor, the interconnect consumes as much power
as one core. Applying dynamic voltage scaling and
gating unused cores improves the
energy-delay product by 75%.
 Taking a dual processor as an example, heterogeneous
processors that exploit both low and high thread levels
improve performance by 63% over homogeneous ones.
With 5-8 threads, the average improvement is
29%. By Amdahl's law, the speedup of parallel
applications is limited by their serial
portion.
10 Performance of heuristics for equal-area
heterogeneous architectures with multithreaded cores.
Exploring the potential from
heterogeneity
CT 3400 Multi-core DSP
H.264 encoder, decoder and audio codecs and the system control
 Eight 32-bit DSP
cores
 Six 32-bit general-purpose
processor cores
 A 128-pin programmable
I/O subsystem
 C programmable
 Supports H.264 and MPEG4
codecs
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf
H.264 codec onto the cradle
CT3400 MDSP
CT 3400 Multi-core DSP

DSP Engine

Each DSP engine contains

A Single Instruction Multiple Data

Arithmetic Logic Unit (SIMD ALU)

A Packed Integer Multiplier
Accumulator (PIMAC)

A Floating Point Unit (FPU)

Bi-directional FIFO data buffers

DMA channels

A 128 x 32 register and

A 512 x 20 program memory
CT3400 DSP Engine
http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf
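CT3600 product family
CT3600 Multiprocessor DSP Family Members
 At $5.50 per channel (MPEG4 SP L3), the CT3616 offers the industry's best
price/performance encoding solution, more than twice as good as the nearest
competing product
 First in the industry to implement a single-chip, real-time D1 H.264 Main
Profile video encoder based on programmable DSPs
 0.13-micron technology, 16 DSPs, and 8 general-purpose processors
quadruple overall performance
 $40 to $90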
http://www.cradle.com/downloads/CT3600-PB.pdf
CT 3616 Multi-core DSP
http://www.cradle.com/downloads/CT3600-PB.pdf
Homogeneous MP-SoC
Problems
 The hardware must be configurable for efficient
execution across broad class of application.
 Each core consists of an array of homogenous
processing execution nodes, a banked
Instruction Cache, Data Cache, register file and
block control logic.
 Some of the resources (called polymorphous
resources) in the TRIPS architecture can be configured to
operate differently depending on the mode
(instruction, thread or data parallelism).
HiBRID-SoC Architecture

HIBRID-SoC multi-core system-on-chip Architecture

Integrate a powerful on-chip communication structure

A well-balanced memory system to account for the growing amount of data
(e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))

Dedicated chips for the MPEG-4 Simple Profile, consisting of a very general processing
demand


Three programmable cores

Each adapted towards a specific class of algorithms

Combination of the cores and their software development environment

An extension of a programmable core with dedicated modules (e.g., TriMedia)
HIBRID-SoC multi core

Developed at the University of Hannover
Multi-Core SoC Architecture


Multi-Core SoC Architecture

Instruction Level VLIW (Very long instruction word)

Data Level SIMD (Single instruction multiple data)

Task Level (Simultaneous multithreading)
Hi-par DSP

16-datapath SIMD processor core controlled by VLIW,
particularly optimized towards high-throughput two-dimensional DSP-style processing
(FFT-intensive applications or filtering)

Stream Processor (SP)

32-Bit RISC architecture that is more optimized towards control-dominated tasks:
bitstream processing or global system control

Macroblock processor (MP)

Efficient processing of data blocks (heterogeneous data path structure consisting of a scalar
and a vector unit)

Controlled by dual-issue VLIW, offers flexible subword parallelism, and contains
instruction set extensions for typical processing computation steps
HiBRID-SoC multi-core architecture

64-bit AMBA AHB system bus

Connects all cores to SDRAM
memory via a 64-Bit SDRAM
interface

Two versatile 32-Bit host
interfaces for access (e.g., host
PC via PCI and to serial flash
memory)
Figure 1. HiBRID-SoC multi-core architecture
HiPAR-DSP

HiPAR-DSP

Highly parallel DSP core with a VLIW-controlled SIMD architecture

Memory concept provides an easy data
exchange between the data paths, which
is required for many filter and image
processing algorithms

DMA unit serves all cache misses and
performs data prefetch transfers to the
matrix memory

At the targeted clock frequency of 145
MHz, the HiPAR-DSP achieves a
performance of 2.3 GMACs
Figure 2. HiPAR-DSP architecture
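The quoted 2.3 GMACs figure is consistent with 16 data paths each completing one multiply-accumulate per cycle at 145 MHz. A minimal sketch of that arithmetic follows; the one-MAC-per-data-path-per-cycle rate is an assumption, not something the slides state.

```python
def peak_macs_per_second(clock_hz, datapaths, macs_per_datapath_per_cycle=1):
    """Peak MAC throughput of a SIMD array (assumes every data path is busy each cycle)."""
    return clock_hz * datapaths * macs_per_datapath_per_cycle


# 16 data paths at the targeted 145 MHz clock.
gmacs = peak_macs_per_second(145e6, 16) / 1e9
print(f"{gmacs:.2f} GMAC/s")
```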
Stream Processor
 Stream Processor

SP has been optimized for high-level programmability and
efficient processing of control-driven applications

Harvard architecture with a 32-bit data path consisting of 5
pipeline stages and controlled by 32-Bit RISC instructions.

Supports Conditional execution, forwarding interlocks, and
provides full interrupt capability

Convert the 64-Bit AMBA bus width to the 32-Bit internal
Macroblock processor

Macroblock processor

Heterogeneous data path structure
consisting of a scalar and a vector data
path

The scalar data path operates on 32-Bit
data words in a 32-entry register file
and provides control instructions
(jump,branch, and loop)

The vector data path is equipped with a
64 entry register file of 64 bit width

Special function units (SFU) provide
instruction set extensions for common
video and multimedia core algorithms.

MUL/MAC or ALU incorporate
SIMD-style subword parallelism by
processing either two 32-Bit, four 16-Bit, or eight 8-Bit data entities in
parallel within a 64-bit register operand
Figure 3. Macroblock processor data paths.
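Subword parallelism as described above treats one 64-bit operand as several independent lanes. The sketch below emulates a wrap-around packed add in software; the function names are hypothetical, and it illustrates the idea only, not the macroblock processor's actual instruction set.

```python
MASK64 = (1 << 64) - 1


def pack(values, lane_bits):
    """Pack a list of lane values (lane 0 in the least significant bits) into one word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << lane_bits) - 1)) << (i * lane_bits)
    return word


def unpack(word, lane_bits):
    """Split a 64-bit word back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(64 // lane_bits)]


def packed_add(a, b, lane_bits):
    """Add all lanes of two 64-bit words at once, with no carry between lanes."""
    high = 0
    for i in range(64 // lane_bits):
        high |= 1 << (i * lane_bits + lane_bits - 1)   # top bit of every lane
    low = MASK64 & ~high
    partial = (a & low) + (b & low)                    # carries stop at each lane's top bit
    return (partial ^ ((a ^ b) & high)) & MASK64       # fold in the top bits without carry-out


a = pack([250, 1, 2, 3, 4, 5, 6, 7], 8)
b = pack([10] * 8, 8)
print(unpack(packed_add(a, b, 8), 8))
```

The same routine handles four 16-bit or two 32-bit lanes by changing `lane_bits`, mirroring the three operand layouts listed above.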
HiBRID-SoC Implementations
 HiBRID-SoC is fabricated in a 0.18 um,
6LM standard-cell technology,
14 million transistors, 3.5 W,
occupies 82 mm2, and operates at 145 MHz
Table 1. MPEG-4 ASP decoder (full TV
resolution) performance on MP and SP,
720*576@25Hz,1.5-3 Mbits:
Figure 3. Chip layout of the HiBRID-SoC.
Analyzing On-chip Communication in
MPSoC Environment
Proceedings of Design, Automation and Test '04, Mirko Loghi et al.
•Analysis and trade-off exploration of on-chip
communication
architectures.
•Comparison and analysis of two practical
configurations:
AHB-AMBA (ARM) and STBus (ST
Microelectronics).
•Models hardware and software of MPSoC at a high level of
accuracy and sufficient simulation speed.
•Provide realistic performance by stimulating
communication system with functional traffic.
Multiprocessor simulation platform
 Hardware architecture:
• Homogeneous MPSoC platform.
• Configurable number of 32-bit ARM
processors.
• Processor cores : GPL-licensed ARM
Instruction Set Simulator (ISS)
SWARM in C++
• Private memories for each
processor.
• A shared memory
• A hardware interrupt module.
• 32-bit interconnection
• All components are wrapped in
SystemC
Multiprocessor simulation platform
 Benchmarks running with RTEMS-OS :
• Running on top of RTEMS
• Synchronization : Use OS queues to exchange matrices between
processors.
• Benchmark 1: Independent matrix multiplication.
• Benchmark 2: Pipeline of matrix multiplication

Benchmark 1: Independent matrix multiplication:
• Performs independent matrix multiplication on each processor.
• Does not require interprocessor communication.
• Operands are stored in the private memory of each processor.
Multiprocessor simulation platform
 Benchmark 2: Pipeline matrix multiplication:
• Platform receives a continuous flow of input and output
• The operation of every core follows this pattern:
 Copies the input matrix from shared memory to private
space
 Multiplies the input matrix by a matrix already held in
private space
 Copies the resulting matrix back to shared space.
• Interrupt and semaphore slaves are queried to keep
all processes synchronized.
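The copy, multiply, copy-back pattern of Benchmark 2 can be sketched as a sequential model. This is an illustration under simplifying assumptions: the function names are hypothetical, one loop iteration stands in for one core, and the interrupt and semaphore synchronization of the real platform is omitted.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def run_pipeline(input_matrix, private_matrices):
    """Pass a matrix through the stages; each stage owns one private matrix."""
    shared = input_matrix                       # stands in for the shared memory
    for private in private_matrices:            # one iteration per core
        local = [row[:] for row in shared]      # copy from shared to private space
        result = matmul(local, private)         # multiply with the resident matrix
        shared = result                         # copy the result back to shared space
    return shared


identity = [[1, 0], [0, 1]]
doubler = [[2, 0], [0, 2]]
print(run_pipeline([[1, 2], [3, 4]], [identity, doubler]))
```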
Multiprocessor simulation platform
 Code development and
analysis tool :
• Development tool : GNU cross compiler
• Allow flexible profiling by
functions of simulator.
• Output of simulator :
• Statistics about processor
and interconnect
performance.
• VCD waveform of all bus
signal
• Traces of memory accesses
performed by every cores.
Features of communication architecture
 AMBA-AHB Architecture:
• Traditional shared bus with pipelining.
• Distinct data and address/control bus
• Transfer with data phase and control phase.
• Supports bursts as streams of single transactions.
• “Split/retry transfer” and “early burst termination” are used to cope with
high-latency slaves.
 STBus Architecture:
• Protocol type 3: simple load/store operations, pipelining and split
transactions, out-of-order support.
• Flexible topology: from shared bus to full crossbar
• Overlapping transfers: requesting a new burst while previous ones are
still completing, without idle cycles.
• Fast arbitration in two cycles; minimum latency is three
cycles.
Experimental Result
 Comparison of performance interconnection
 Five interconnections :
• AMBA-AHB
• Shared-bus STBus
• Full crossbar STBus
• Partial crossbar STBus : ST-32
• Partial crossbar STBus : ST-54
Experimental Result
 Performance comparison
Experimental Result
 Comparison of performance interconnection
 Four benchmarks :
• Matrix multiplication independent : ASM-IND
• Matrix multiplication pipeline without OS : ASM-PIP
• Matrix multiplication with OS : OS-IND
• Matrix multiplication pipeline with OS : OS-PIP
Experimental Result
 Comparison of performance
interconnection
MPSoC Clock and Power
Olivier Franza, Intel
 Increased uncertainty with process scaling
 Process, voltage, temperature variations, noise,
coupling
 Affects design margins: overdesign, power & performance
loss
 Increased power constraints
 Increasing leakage, power (density, delivery)
limitations
 More transistors mean:
 Larger clock distribution networks
 Higher capacitance (more load and parasitics)
 With each new technology:
 Gate delay decreases ~25%
 Wire delay increases ~100%
 Cross-chip communication increases
 Clock needs multiple cycles to cover die
Interconnect Delays &
Density
Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology
Multiple Clocks due to Interconnect
limitation
At reduced performance,
larger resource size
Noise in Mixed Signal
Systems
Multiple clock domains
 Low skew and jitter ALWAYS a must
 Clock modeling requires more accuracy
 Within-die variations, inductance, crosstalk,
electromigration, self-heat, …
 Floor plan modularity
 Think adding/removing cores seamlessly!
 Hierarchical clock partitioning
 Reduce global clock and possibly relax its requirements
 Generate “locally”-used clock “locally”
 Implement clock domain deskewing techniques
 Bound clock problem into simple, reliable, efficient
domains
DEC/Compaq Alpha
more complex core to improve performance, more
complex clocks (?), Source: DEC/Compaq – Gronoski & al., JSSC 1998 – Xanthopoulos &
al., ISSCC 2001 – Barroso & al., ISCA 2000

Clock and Power Convergence
Intel®
Itanium® Montecito
Each core split into 3 clock domains
on variable power supply
 Each domain controlled by Digital
Frequency Divider (DFD)
generating low-skew variable-frequency clocks; fed by central
PLL and aligned through phase
detectors
 Regional Voltage Detector (RVD):
supply voltage monitor
 Second level clock buffer (SLCB):
digitally controlled delay buffer for
active deskewing
 Regional Active Deskew (RAD):
phase comparators monitoring
and adjusting delay difference
between SLCBs
 Clock Vernier Device (CVD):
digitally controlled delay buffer
Clock generation and distribution are
essential enablers of microprocessor performance
On-Chip Interconnects:
Circuits and Signaling,
Wayne Burleson
• Using Vdd programmability
• High Vdd to devices on critical path
• Low Vdd to devices on non-critical
paths
• VddOff for inactive paths
A – Baseline Fabric
B – Fabric with Vdd Configurable
Interconnect
This work builds on a similar idea for FPGAs described in:
Fei Li, Yan Lin and Lei He. Vdd Programmability to Reduce FPGA Interconnect Power, IEEE/ACM International
Conference on Computer-Aided Design, Nov. 2004
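Why Reconfigurable
System?
 Includes a GPP and
reconfigurable H/W
 Goal: power reduction and
flexibility
1. Provides Quality of Service
according to the dynamic environment
2. A flexible architecture that follows
the evolution of algorithms
3. Fewer platforms to develop
and maintain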
[Figure: Task 1 to Task N mapped onto reconfigurable hardware]
Energy Efficiency
of Reconfigurability
system architecture
communication protocol
O/S and applications
Partitioning of functions between wireless
device and services on the network
The mobiles must be flexible enough to
accommodate a variety of multimedia
services and communication capabilities and
adapt to various operating conditions in an
(energy) efficient way
The need for S/W configurable
platforms
– Doing More by Doing Less: the ability to
handle diverse standards is required (AM,
FM, GSM, UMTS, digital broadcasting
standards, analog and digital television
and other data links).
– A fully software-reconfigurable multichannel broadband sampling receiver for
standards in the 100 MHz band
Granularity of reconfiguration
Sébastien PILLEMENT - ENSSAT/LASTI
 System-level reconfiguration
 Lx, C62 (decomposition into clusters)
 Functional-level reconfiguration
 Pleiades, RaPiD, DART(2001)
 Operator-level reconfiguration
 Chameleon, Piperench, Morphosys(2000)
 Gate-level reconfiguration
 Napa, GARP, FPGA
The grain size of operations
in Reconfigurable System Architectures
Fine-grained operations : Multiply and
addition
Medium-grained operations : reconfigurable
modules
Coarse-grained operations : CPU, host
Design Space of
Reconfigurable Architectures
RECONFIGURABLE ARCHITECTURES
(R-SOC)
Lilian Bossuet
LESTER Lab
Université de Bretagne Sud
Lorient, France
MULTI GRANULARITY
(Heterogeneous)
FINE GRAIN
(FPGA)
Processor +
Coprocessor
Island
Topology
Hierarchical
Topology
Coarse Grain
Coprocessor
Fine Grain
Coprocessor
• Xilinx Virtex
• Xilinx Spartan
• Atmel AT40K
• Lattice ispXPGA
• Altera Stratix
• Altera Apex
• Altera Cyclone
• Chameleon
• REMARC
• Morphosys
• Pleiades
• Garp
• FIPSOC
• Triscend E5
• Triscend A7
• Xilinx Virtex-II Pro
• Altera Excalibur
• Atmel FPSIC
COARSE GRAIN
(Systolic)
Tile-Based
Architecture
Mesh
Topology
• aSoC
• E-FPFA
Linear
Topology
• RAW
• Systolic Ring
• CHESS
• RaPiD
• MATRIX
• PipeRench
• KressArray
• Systolix Pulsedsp
Hierarchical
Topology
• DART
• FPFA
Digital Signal Processing
With FPGAs
Paul Ekas
Jean-Charles Bouzigues
Multiplier Options In
FPGAs
Option                 Resource                       Area Usage
1 Logic Multipliers    Logic Elements (Traditional)   500 LEs per 18x18 Multiplier
2 Hard Multipliers     DSP Blocks                     4 18x18 Multipliers per DSP Block
3 Soft Multipliers     RAM                            1 to 2 Embedded Memory Blocks
Logic Elements
Smallest Unit of Logic
Grouped into Logic Array
Blocks (LABs) of Ten LEs
Features
Four-Input Look-Up Table
(LUT)
Configurable Register
Dynamic Add/Subtract
Control
Carry-Select Chain Logic
[Figure: a Logic Array Block containing logic elements LE1 to LE10, each with 4 inputs, fed by control signals and the local interconnect]
DSP Block: Optimized Hard
MAC
[Figure: DSP block data path with a 144-bit input register unit, 36-bit multiplier outputs, adder/subtractor stages, optional pipelining, an output MUX, and an output register unit]
9 Bit x 9 Bit
18 Bit x 18 Bit
36 Bit x 36 Bit
8 Multiplies
4 Multiplies
1 Multiply
2 Multiplies with Accumulate
2 Multiplies with Accumulate
2 Sum of 2 Multipliers
(Complex Multipliers)
1 Sum of 2 Multipliers
(Complex Multiply)
2 Sum of 4 Multiplies
1 Sum of 4 Multiplies
Soft Multipliers: Lookup Based
Multiplication
 Use Embedded RAM Blocks as Look-Up Tables
(LUTs) for Generating Partial Products
 Coefficient or Sum of Coefficients Values
Stored in RAM Blocks
 MSB Partial Product Shifted & Added
to LSB Partial Product
Example
 Multiplication of 5-Bit
Input with 13-Bit
Coefficient
 All Possible 18-Bit
Results Stored in a
32*18 Look-Up Table (M512)
ADDRESS   MULT_RESULT (18-bit data output)
00000     0
00001     C
00010     2*C
00011     3*C
…         …
11111     31*C
C = Coefficient[12:0]
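The lookup-based multiplication above can be sketched in software: a 32-entry table holds every multiple of the coefficient, and a wider input is handled by looking up its two 5-bit halves and adding the shifted MSB partial product to the LSB one. The 10-bit input width and the coefficient value are illustrative assumptions.

```python
def build_lut(coefficient, address_bits=5):
    """Table of address * coefficient for every possible address (32 entries for 5 bits)."""
    return [addr * coefficient for addr in range(1 << address_bits)]


def soft_multiply(x, lut, address_bits=5):
    """Multiply an unsigned input of up to 2 * address_bits bits using two table lookups."""
    mask = (1 << address_bits) - 1
    lsb_partial = lut[x & mask]                       # product for the low 5 bits
    msb_partial = lut[(x >> address_bits) & mask]     # product for the high 5 bits
    return (msb_partial << address_bits) + lsb_partial


C = 0x1ABC                 # hypothetical 13-bit coefficient
lut = build_lut(C)
print(soft_multiply(1000, lut), 1000 * C)
```

Each table entry fits in 18 bits because 31 times the largest 13-bit coefficient is below 2^18, which matches the 32*18 memory organisation.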
Altera FPGA Memory
Architectures
 Today’s applications need more high performance memory
 One size does not fit all
 Wide choice of modes and widths
M512 Blocks
 Rate Changing
 Embedded Shift Register Mode
 Operates Up to 312 MHz
 Mixed Clock Mode
M4K Blocks
 True Dual Port RAM
 Embedded Shift Register Mode
 Operates Up to 312 MHz
 Mixed Clock Mode
M-RAM
 True Dual Port RAM
 Embedded Shift Register Mode
 512K bits
 Operates Up to 300 MHz
 Mixed Clock Mode
External Memory Devices
 DDR SDRAM & SRAM
 SDR SDRAM
 QDR & QDRII SRAM
 ZBT SRAM
 DDR FCRAM
More Bits For Larger Memory Buffering
More Data Ports for Greater Memory Bandwidth
Soft Multiplier: Sum of
Multiplications
[Figure: input samples (16-bit samples, 16-bit coefficients) enter 16-bit serial shift registers; 4-bit addresses drive two M512 32*18 sum-of-multiplications tables whose 18-bit outputs are added and accumulated into the output]
Example: FIR Filter
Memory: 2 M512
Sum of Multiplications Table
ADDRESS   MULT_RESULT
0000      0
0001      C0
0010      C1
0011      C0+C1
…         …
1111      C0+C1+C2+C3
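The sum-of-multiplications table corresponds to what is commonly called distributed arithmetic: the samples are processed one bit position at a time, the bits at that position form the table address, and the looked-up coefficient sums are shifted and accumulated. A minimal sketch follows, assuming unsigned samples and four taps; the names are hypothetical.

```python
def build_sum_lut(coeffs):
    """Entry at each address is the sum of the coefficients whose address bit is set."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_dot(samples, lut, sample_bits=16):
    """Compute sum(coeff[k] * samples[k]) bit-serially from the precomputed table."""
    acc = 0
    for bit in range(sample_bits):
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> bit) & 1) << k      # bit of sample k selects coefficient k
        acc += lut[addr] << bit                # weight by the bit position
    return acc


coeffs = [3, 5, 7, 11]                 # C0..C3, illustrative values
table = build_sum_lut(coeffs)
samples = [100, 200, 65535, 1]
print(da_dot(samples, table))
```

No multiplier is used at all: one table read and one shifted add per sample bit replace the four multiplications.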
Example Direct
Sequence Spread
Spectrum (DSSS)
Modem
DSSS Modem
 Five Independent Data Channels Spread to 3.84 Mcps
 Three-Stage FIR Interpolation-by-32
 Root-Raised Cosine Pulse Shaping with 22% Excess Bandwidth
 112 dB SFDR 15.36 MHz Quadrature Carriers
 122.88 MSPS Transmitter Output with 5 MHz Bandwidth & Over 78-dB Out-of-Band Rejection
 Automatic Gain Control (AGC) Compensating for Channel Attenuation of up
to 30 dB
 Costas Loop Carrier Recovery
 4x Oversampling Code Synchronization
[Figure: data channels DCH0 to DCH4 pass through the DSSS modulator, a channel model, and the DSSS demodulator, which outputs DCH0 to DCH4]
DSSS Modulator
[Figure: DSSS modulator block diagram. Data channels DCH0 to DCH4 and the pilot channel PCH are multiplied by channelization codes Cch,16,n, summed into Re[] and Im[] branches with gains gi and gq, and spread by a length-256 Gold code spreader (SCH). Filter chain: FIR3 RRC, 25-tap FIR filter, interpolation x4, excess BW 22%; FIR1 LPF, 2-channel 87-tap FIR filter, interpolation x2; FIR2 LPF, 2-channel 47-tap FIR filter, interpolation x4. An NCO (frequency resolution 0.03 Hz, SFDR 112 dB) generates Sin(wn) and Cos(wn) from the carrier phase increment.]
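The rates quoted for the modem are mutually consistent, which a short calculation makes visible. The sketch below assumes a 32-bit NCO phase accumulator; the slides give only the 0.03 Hz resolution, not the accumulator width.

```python
import math

CHIP_RATE_HZ = 3.84e6
INTERPOLATION_FACTORS = (4, 2, 4)      # the x4, x2, x4 stages listed for the FIR filters
ACCUMULATOR_BITS = 32                  # assumed, not stated in the slides

total_interpolation = math.prod(INTERPOLATION_FACTORS)
output_rate_hz = CHIP_RATE_HZ * total_interpolation

# An NCO clocked at the output rate resolves frequency in steps of f_clk / 2^N.
nco_resolution_hz = output_rate_hz / 2 ** ACCUMULATOR_BITS

# Phase increment that produces the 15.36 MHz quadrature carrier.
carrier_hz = 15.36e6
phase_increment = round(carrier_hz / output_rate_hz * 2 ** ACCUMULATOR_BITS)

print(total_interpolation, output_rate_hz, round(nco_resolution_hz, 4), phase_increment)
```

The carrier sits at exactly one eighth of the output sample rate, so the phase increment is a power of two under the assumed accumulator width.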
DSSS Demodulator
[Figure: DSSS demodulator block diagram. Blocks shown: fixed-rate Altera RRC 31-tap FIR filters (excess BW 22%), AGC, NCO (frequency resolution 0.03 Hz, SFDR 112 dB) with free-running phase increment, carrier recovery loop, I-Q derotate, buffer, Gold code correlator with 4x oversampling, peak detector (max_index, pn_lock), Hadamard despreader, data channels output 1…5, pilot output, and pilot monitor.]
DSSS Modem Resources
Resource Usage Summary
Design Entity   Logic Elements   M512 RAM   M4K RAM   Mega RAM   DSP Block Elements
Modulator       9943             1          8         0          12
Demodulator     12196            60         8         1          60
Power Usage Estimates
Power                                 mW
Total Standby Internal Power          75
Total Logic Element Internal Power    283
Total Clocktree Internal Power        175
Total DSP Internal Power              23
Other Internal Power                  92
Total Power                           505
FIR Filter Example* – 16X
Cost/Performance Improvement
Device          Solution                  FIR Performance (MHz)   Device Cost****   Cost per FIR MHz
TI C6713-200    64 cycles** @ 200MHz      3.125                   $24.59            $7.87
TI C6416-600    32 cycles** @ 600MHz      18.75                   $160              $8.53
Altera 1C3-8    8 cycles*** @ 230MHz      28.75                   $14               $0.49
Altera 1C12-8   1 cycle*** @ 170MHz       170                     $84               $0.49
* FIR 128 Tap, 16 bit data, 14 bit coefficients
** DSPLib Optimized Assembly Libraries from Texas Instruments
*** MegaCore Optimized FIR Compiler from Altera
**** Pricing in quantity of 100 at Arrow 6/25/03
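The table's derived columns follow from two divisions: FIR performance is the clock rate divided by the cycles needed per output sample, and cost per FIR MHz is the device cost divided by that performance. A minimal sketch using the quoted clock rates, cycle counts, and prices:

```python
# (device, cycles per FIR output, clock in MHz, device cost in dollars)
DEVICES = [
    ("TI C6713-200", 64, 200.0, 24.59),
    ("TI C6416-600", 32, 600.0, 160.00),
    ("Altera 1C3-8", 8, 230.0, 14.00),
    ("Altera 1C12-8", 1, 170.0, 84.00),
]


def fir_performance_mhz(clock_mhz, cycles):
    """FIR output rate in MHz: one output every `cycles` clock cycles."""
    return clock_mhz / cycles


def cost_per_fir_mhz(cost, clock_mhz, cycles):
    """Device cost divided by its FIR output rate."""
    return cost / fir_performance_mhz(clock_mhz, cycles)


for name, cycles, clock, cost in DEVICES:
    rate = fir_performance_mhz(clock, cycles)
    print(f"{name}: {rate:g} MHz, ${cost_per_fir_mhz(cost, clock, cycles):.2f} per FIR MHz")
```

Dividing the cheapest DSP figure by the FPGA figure gives roughly the 16X improvement named in the slide title.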
Reconfigurable video processor
for SDRAM access optimization
(Henriss, Ernst et al.)
Reconfigurable video
platform
· SDRAM memory centered design
· FPGA based scheduler merges different streams
and random accesses exploitation of SDRAM
bank structure
· supports 2 HDTV streams at 1.48 Gbit/s each plus
DSP and filter unit access
· reaches 700MByte/s in practical application for 4
Byte SDRAM memory word
· extremely cost efficient design
· used in professional video product line
Fine-Grained RSOCs:
Triscend A7 CSOC
 A7 Family
 32-bit ARM 7 with 8kB Cache
 3200 logic cells max. (40K gates)
 Up to 3800 FF’s
 Up to 300 Prog. I/O pins
 www.triscend.com
Coarse-Grained RSOCs
Chameleon Structure (2000)
Design a battery powered personal mobile computing device that has
multimedia functionality and can operate in a dynamic environment.
- Do just enough and not too much for a given task (QoS)
 32-bit ARC control processor
 Up to 84 32-bit Datapath Units
 DPU = a 32-bit ALU + a 32-bit barrel shifter
 Up to 24 16x24-bit multipliers
 Up to 48 128x32-bit local memory modules
 Up to 160 Prog. I/O pins
 Targeted at 3rd gen. wireless basestation, wireless local loop, SW radio, etc.
Paul J.M. Havinga, Lodewijk T.smit, Gerard J.M. Smit, Martinus Bos, Paul M.
Heysters, www.chameleonsystems.com
Field Programmable
Function Array
The FPFA concept has a number of
advantages
The FPFA has a highly regular organisation
We use a general-purpose processor core
Its scalability stands in contrast to the
dedicated chips designed nowadays
The FPFA can do media processing tasks
such as compression/decompression
efficiently
Field Programmable
Function Array
 Processor tiles
Consists of five identical blocks, which share a control unit and a
communication unit
An individual block contains an ALU, two memories and four register
banks of four 20-bit wide registers
A crossbar switch provides flexible routing between the ALUs,
registers and memories
This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the finite impulse response filter
[Figure: FPFA processor tile with ten memories (M), a crossbar, registers, and five ALUs]
DSP System Architecture Options
[Figure: performance (MMACs/sec) across DSP system architecture options: stand-alone processor, processor array, processor + co-processor, dedicated hardware architecture]
Optional Coprocessor
Mappings
Processor On FPGA
•Nios
•ARM (AHB)
Processor External to FPGA
•TI c6x (EMIF)
•Mot PPC (MPX)
•Mot Starcore (MPX, AHB)
•Intel 2850 (PCI Express)
•ARM (AHB)
•…..
[Figure: an FPGA with an on-chip processor and memory, compared with an FPGA attached to an external processor]
Mapping of DSP Algorithms
on the FPFA
Fast Fourier Transform
FFT recursively divides a DFT into smaller
DFTs
[Figure: recursion of a radix 2 FFT with 8 inputs, in which the N=8 transform is divided into smaller DFTs down to N=2]
[Figure: the radix 2 FFT butterfly, with inputs a and b, twiddle factor W, and sum (+) and difference (-) outputs]
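The recursive splitting and the butterfly described above can be written down directly. The following is a minimal radix-2 decimation-in-time sketch in plain Python; it illustrates the algorithm only and says nothing about how the computation is scheduled onto the FPFA's ALUs.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])                  # DFT of the even-indexed samples
    odd = fft(x[1::2])                   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W
        a, b = even[k], w * odd[k]
        out[k] = a + b                   # butterfly: sum output
        out[k + n // 2] = a - b          # butterfly: difference output
    return out


impulse = fft([1, 0, 0, 0, 0, 0, 0, 0])
constant = fft([1] * 8)
print([round(abs(v), 6) for v in constant])
```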
Mapping of DSP Algorithms
on the FPFA
Five-tap finite-impulse response filter
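[Figure: five-tap FIR filter mapped onto the FPFA: coefficients h4 to h0 are routed through the crossbar to ALUs 1 to 5 (level 2), producing output O]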
MorphoSys (1999)
Reconfigurable cell
RC Array
•Array of reconfigurable
cells
•64 cells in a 2-D matrix
•SIMD model
•Same row(column)
share configuration
• Each RC operates on
different data
TinyRISC (Cont’d)
Implementation &
Performance
•0.35 micron technology
•4 metal layers
•Operation at 100MHz
•170 mm2
Motion Estimation
Block size : 16x16 pixel,
Image size : 352x288 pixel
Lx from STMicroelectronics
DART,
Raphael David, IRISA/ENSSAT
With STMicroelectronics, UBO univ.
 Multigrain reconfigurable
= DPR+FPGA
 Dynamic
reconfiguration
 Low power consumption
 Hierarchical distribution
of resources
 SCMD (Single
Configuration Multiple
Data)
11 GOPS/cluster
1.6 GMACS/cluster
0.64 W @ 11GOPS
16 MIPS/mW @ 11GOPS
0.18u CMOS
DART
[Figure: DART cluster architecture: control, configuration memory, data memory, DMA controller, an FPGA, and DPR1 to DPR6 connected by a segmented network]
[Figure: DPR architecture: loop management, global bus, address generators AG1 to AG4 with data memories 1 to 4, a multibus network, registers reg1 and reg2, and functional units MUL1, ALU1, MUL2, ALU2]