Shapes_hw_DATE07

Introduction to the Tiled HW Architecture of SHAPES
Pier Stanislao Paolucci 1,2,*, Francesca Lo Cicero 1, Alessandro Lonardo 1, Mersia Perra 1,
Davide Rossetti 1, Carlo Sidore 1, Piero Vicini 1, Marcello Coppola 3, Luigi Raffo 4, Gianni
Mereu 4, Francesca Palumbo 4, Luca Fanucci 5, Sergio Saponara 5, Francesco Vitullo 5
1 INFN Roma, 2 Atmel Roma, 3 ST Microelectronics - Grenoble,
4 DIEE, Università di Cagliari, 5 Università di Pisa
Abstract
Nanoscale systems on chip dedicated to embedded
systems and numerical computations will integrate a
few hundreds of million gates. The challenge is to find
a scalable HW/SW design style for future CMOS
technologies.
The main HW problem is wiring, which threatens
Moore’s law. Tiled architectures suggest a possible
HW path: “small” processing tiles connected by
“short wires”.
A second HW problem is the management of the
design complexity. A tiled design style reuses stable
Intellectual Properties requiring a few million gates: a
manageable complexity.
A typical SHAPES tile always contains one
Distributed Network Processor (DNP) for inter-tile
communications, plus one VLIW DSP processor for
computation and/or one RISC processor for control.
The DNP is supported by a NoC (for inter-tile, on-chip
communications) and by an N-dimensional toroidal
network for off-chip communications. The SW
challenge is to provide a simple and efficient
programming environment for a (massive) tiled
parallel architecture. This paper introduces the HW
architecture.
1. Introduction
There is no ceiling to the processing power demanded
by low-consumption, low-cost, dense Numerical
Embedded Scalable Systems dedicated to future
human-centric applications, which will manage
multi-channel audio, video and multi-sensorial
inputs/outputs.
Nanoscale systems on chip dedicated to embedded
systems and numerical computations will integrate a
few hundreds of million gates. A serious challenge is to
identify a scalable HW/SW design style for future
CMOS technologies enabling high gate counts [1-3].
Figure 1. The Tiled HW Architecture of SHAPES
* Corresponding author e-mail: pier.paolucci@roma1.infn.it. SHAPES (scalable Software Hardware Computing Architecture for Embedded
Systems) is a European Project (FET-FP6-2004-IST-4.2.3.4(viii) - Advanced Comp. Arch.) started Jan 2006. See www.shapes-p.org for
complete documentation.
The main HW problem is wiring [4,5], which
threatens Moore’s law.
A second HW problem is the management of the
design complexity of high gate count designs.
Tiled architectures [6-9] suggest a possible HW
path: “small” processing tiles connected by “short
wires”.
A tiled design style extensively reuses processing
tiles, each tile composed of stable Intellectual
Properties requiring only a few million gates: a
manageable complexity.
The SHAPES project targets three main objectives:
investigate the tiled HW paradigm, experiment with a
real-time, communication-aware system SW, and validate
the HW and System SW Platform through a set of
benchmarking applications.
This paper introduces the HW architecture. For an
introduction to the SHAPES System SW, see [10].
2. Tiled HW paradigm
The SHAPES project (scalable Software Hardware
Computing Architecture for Embedded Systems)
investigates a specific Tiled HW paradigm (see Figure
1).
Each Tile includes a few million gates, for optimal
balance among parallelism, local memory, and IP reuse
on future technologies. The SHAPES inter-tile routing
fabric connects on-chip and off-chip tiles, weaving a
distributed packet switching network.
3D next-neighbours engineering methodologies will be
studied for off-chip networking and maximum system
density, leveraging the know-how accumulated by INFN
during the design and development of several
generations of massively parallel processors [11-14]
dedicated to numerical computations.
Figure 2 describes a typical Tile of SHAPES.
Each tile of SHAPES always contains one Distributed
Network Processor (DNP) for inter-tile
communications, plus one VLIW floating-point DSP
(Digital Signal Processor) for numerical computations,
and/or a RISC processor for control-intensive codes.
Figure 2. A typical Tile of SHAPES: RISC + VLIW DSP + DNP
Intra-tile communications are sustained by a
Multi-layer Bus Matrix, while inter-tile communications
are supported by the NoC (on-chip) and by the 3DT
(off-chip 3-dimensional toroidal next-neighbours
interconnection network).
The DNP acts as a generalized DMA controller,
off-loading the RISC and DSP processors from the task
of managing the packets flowing through the inter-tile
network.
SHAPES is a Distributed Memory Architecture.
Each Tile is equipped with distributed on-chip
memories and can be associated with an external
distributed memory (DXM).
Each tile may also contain a POT (a set of
Peripherals On Tile).
In its first implementation, the SHAPES tile will be
developed as the combination of a new generation of
the DIOPSIS (RISC + DSP) MPSOC [8,15], designed
by ATMEL Roma, with the DNP, designed by INFN.
3. Distributed Network Processor (DNP)
DNP stands for Distributed Network Processor. The
main task of the DNP (designed by INFN) is to provide
inter-tile communication services. A secondary service
is to act as a DMA controller for intra-tile
communications.
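As an illustration of the off-chip 3DT network served by the DNP, the following minimal Python sketch computes the next neighbours of a tile on a 3-dimensional torus and the minimal hop count between two tiles. The coordinate scheme, the function names and the 4 x 4 x 4 system size are assumptions made for this example; they are not taken from the SHAPES specification.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]  # hypothetical (x, y, z) tile coordinate


def torus_neighbours(tile: Coord, dims: Coord) -> List[Coord]:
    """Return the six next neighbours of `tile` on a 3D torus of size `dims`."""
    neighbours = []
    for axis in range(3):
        for step in (+1, -1):
            c = list(tile)
            # The modulo implements the wrap-around link of the torus.
            c[axis] = (c[axis] + step) % dims[axis]
            neighbours.append(tuple(c))
    return neighbours


def torus_hops(a: Coord, b: Coord, dims: Coord) -> int:
    """Minimal number of next-neighbour hops between tiles `a` and `b`."""
    return sum(min((x - y) % n, (y - x) % n) for x, y, n in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)  # illustrative system size only
    print(torus_neighbours((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 2, 1), dims))
```

Thanks to the wrap-around links, the distance along each axis never exceeds half the size of that axis.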
The DNP offers its services to other masters in the
tile (typically a RISC or DSP processor); Figure 3
describes the interfaces offered by the DNP. The DNP
receives commands issued by the initiating processors
on a slave port.
The DNP is also equipped with two master ports to
sustain the data traffic. In a typical situation the master
ports act simultaneously to sustain the flow between the
source and destination targets (memory buffers
exposed to the DNP by the tile).
For intra-tile communications the set of master and
slave interface ports would suffice.
For inter-tile communications, which require the
cooperation of at least two DNPs located on different
tiles, the DNP is equipped with a set of inter-tile
interfaces.
A first set of those interfaces is used for off-chip
communications on the 3DT (3D toroidal
next-neighbours topology) and on a collective
communication tree (CTV).
A second set of interfaces will be used for on-chip
communication through a dedicated NoC architecture.
The DNP may host a small processor, called in the
following DNP-P, to implement some advanced
features in an easy way, without resorting to complex
VHDL coding while allowing for easy upgrade and bug
fixing.
Figure 3. The interfaces of the Distributed Network
Processor
4. mAgicV VLIW Floating-Point DSP Core
mAgicV VLIW DSP (designed by Atmel Roma) is a
fully C programmable, high performance Digital Signal
Processor delivering 10 floating-point operations per
cycle and 16 ops per cycle.
It is a new member of the mAgic [16] processor
family used in the Atmel Diopsis product line
(multiprocessor systems on chip combining a RISC and
a DSP).
Figure 4. mAgicV Floating-point VLIW DSP
In a conventional numerical processor, which
detects parallelism at execution time and pushes the
clock speed to the limit, the die area required for
control logic overhead frequently dwarfs that which is
required by the functional units.
For a discussion of the efficiency of DSPs relying
on parallelism detected before execution and exploited
by VLIW and appropriate software scheduling see
[18]. In our opinion and experience, moderate clock
speeds are an ideal complement to VLIW architectures
because they reduce pipeline depth, bypass logic, and
speculation correction logic. This choice simplifies the
task of high-level language compilers. A moderate
clock speed also allows simpler clock tree management
and lower supply voltages.
The classical drawback of VLIW architectures is
that the longer instruction words require more memory.
While conventional code compression/decompression
schemes mitigate this problem [19], they also increase
control logic overhead.
We obtained an efficient usage of the internal program
memory by designing a small-footprint (18 Kgate) VLIW
Dynamic Program memory Decompression system
(DyProDe) [20].
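To make the memory cost of long instruction words concrete, the sketch below illustrates the general idea behind mask-word compression schemes of the kind cited in [19]: a bit mask records which slots of a VLIW word hold a real operation, and only those operations are stored. It is an illustrative toy, not a description of DyProDe, whose internals are not detailed here; the slot count, the NOP encoding and the function names are assumptions.

```python
from typing import List, Tuple

NOP = 0  # assumed encoding of an empty slot


def compress(word: List[int]) -> Tuple[int, List[int]]:
    """Return (mask, payload): bit i of mask is set when slot i is not a NOP."""
    mask = 0
    payload = []
    for i, op in enumerate(word):
        if op != NOP:
            mask |= 1 << i
            payload.append(op)
    return mask, payload


def decompress(mask: int, payload: List[int], n_slots: int) -> List[int]:
    """Rebuild the full VLIW word, re-inserting NOPs where the mask bit is 0."""
    ops = iter(payload)
    return [next(ops) if (mask >> i) & 1 else NOP for i in range(n_slots)]


if __name__ == "__main__":
    word = [NOP, 7, NOP, NOP, 3, NOP, NOP, NOP]  # sparse 8-slot word
    mask, payload = compress(word)
    assert decompress(mask, payload, len(word)) == word
    print(f"mask={mask:#010b}, stored slots={len(payload)} of {len(word)}")
```

The saving grows with the fraction of empty slots, while the decompression step is the extra control logic that such schemes add to the fetch path.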
The architecture of mAgicV is co-designed with its
C-Compiler, using the Retargetable Compiler
generation techniques of Target Compiler
Technologies [17]. The C-oriented Architecture
completely frees the user from the burden of dealing
with the parallelism of the DSP processor resources
(when the program is written for a single processor). A
rich library of C-written DSP routines is available. The
DSP library reaches a level of performance comparable
to or higher than that obtained by optimized assembler
in previous generations of the mAgic Processor family.
As regards multi-processor architectures, the problem
is addressed by the System SW developed by the
SHAPES project.
The mAgicV floating-point VLIW DSP is equipped
with one AHB master port and one AHB slave port for
system-on-chip integration. It includes 256 data registers, 64 address registers, 10 independent arithmetic
operating units, 2 independent address generation units
and a DMA engine driving the AHB Master Port.
To sustain the internal parallelism, the data
bandwidth among the Register File, the Operators and
the Data Memory System amounts to 80 bytes/cycle.
The Data Memory System is designed to transfer 28
bytes/cycle. For instance, activating all the computing
units, mAgicV can produce one complete FFT butterfly
per cycle.
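As a worked example, a radix-2 decimation-in-time FFT butterfly written out in real arithmetic costs 4 multiplications and 6 additions/subtractions, that is 10 real floating-point operations, which is consistent with the 10 floating-point operations per cycle quoted for mAgicV. The Python sketch below only counts and performs the operations; how they are scheduled on the actual mAgicV units is not described here, and the function name is an assumption.

```python
from typing import Tuple


def butterfly(a: complex, b: complex, w: complex) -> Tuple[complex, complex]:
    """Radix-2 DIT butterfly: returns (a + w*b, a - w*b) using real arithmetic."""
    # Complex product t = w * b: 4 real multiplications, 2 real add/sub.
    t_re = w.real * b.real - w.imag * b.imag
    t_im = w.real * b.imag + w.imag * b.real
    # Sum and difference: 4 real add/sub.
    upper = complex(a.real + t_re, a.imag + t_im)
    lower = complex(a.real - t_re, a.imag - t_im)
    return upper, lower  # 4 mul + 6 add/sub = 10 real operations in total


if __name__ == "__main__":
    print(butterfly(1 + 2j, 3 + 4j, 1j))
```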
mAgicV operates on IEEE 754 40-bit extended
precision floating-point and 32-bit integer numeric
format for numerical computations, while internal
memory accesses are supported by a powerful 16-bit
MAGU (Multiple Address Generation Unit).
It includes an on-chip 16K x 40-bit, 6-access/cycle data
memory system and an 8K x 128-bit dual-port program
memory.
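For reference, the on-chip memory sizes quoted above can be converted to bytes as follows. The sketch assumes that K denotes 1024 words and that words are stored without padding; both are assumptions of this example.

```python
K = 1024  # assumed meaning of "K" in the memory sizes


def capacity_bytes(words: int, bits_per_word: int) -> int:
    """Storage capacity in bytes of a memory of `words` x `bits_per_word`."""
    return words * bits_per_word // 8


if __name__ == "__main__":
    data_mem = capacity_bytes(16 * K, 40)      # 16K x 40-bit data memory
    program_mem = capacity_bytes(8 * K, 128)   # 8K x 128-bit program memory
    print(data_mem // K, "KiB data,", program_mem // K, "KiB program")
```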
5. The Spidergon NoC
Spidergon-STNoC (S-STNoC) is the Network on
Chip (NoC) technology currently developed by
STMicroelectronics, with the support of the Universities
of Cagliari and Pisa. It is made of a network of
micro-routers interconnected in a Spidergon [21] topology.
The main task of S-STNoC in SHAPES is to
provide the inter-tile on-chip communication services.
Different tiles can be connected on the same silicon
chip by the DNP to S-STNoC interconnection.
Figure 5. Spidergon NoC Topology
In fact the DNP has a port connected to the S-STNoC
Network Interface (S-STNoC NI), a block which is
responsible for mapping incoming DNP packets into
the S-STNoC packet format and, vice versa, outgoing
S-STNoC packets into the DNP packet format.
Traditional SoC interconnection architectures have
generally been based upon buses and point-to-point
links. Bus architectures are not ultimately scalable,
since as more units are added to a bus, the power
dissipation for bus access grows as a consequence of
the increased capacitive load. Dedicated point-to-point
links are optimal in terms of performance, but the
number of links required grows quadratically with
the number of cores, leading to a potential area and
routing problem.
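The scaling argument can be checked with a few lines of arithmetic. The sketch below compares the number of bidirectional links needed to connect every pair of N cores directly, N(N-1)/2, with the number of router-to-router links of a Spidergon network, assumed here to be a ring of N links plus N/2 "across" links for an even N. The function names are illustrative.

```python
def point_to_point_links(n: int) -> int:
    """Dedicated links needed to connect every pair of n cores directly."""
    return n * (n - 1) // 2


def spidergon_links(n: int) -> int:
    """Router-to-router links of a Spidergon with n routers (n even, assumed)."""
    if n % 2:
        raise ValueError("this sketch assumes an even number of routers")
    return n + n // 2  # n ring links plus n/2 across links


if __name__ == "__main__":
    for n in (8, 16, 64):
        print(n, point_to_point_links(n), spidergon_links(n))
```

For 64 cores the dedicated scheme already needs 2016 links, against 96 for the segmented network, at the price of multi-hop routes.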
For maximum flexibility and scalability it is
generally accepted that a move towards a shared
segmented global communication structure is needed
[22]. This definition leads in turn to a data-routing
network consisting of communication links and routing
nodes that are implemented on the chip.
Networks-on-Chip are mainly based around three
components: 1) Routers, 2) Network Interfaces and 3)
Physical Links. Each Router has 4 connections: three to
other Routers and one to a Network Interface. NoCs hide
the interconnect-specific implementation details from the
IP resources interfaced. All that is needed for an
external IP to transmit data through the network is a
specifically designed Network Interface.
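The three router-to-router connections can be illustrated with a small model of the Spidergon topology as it is commonly described: the routers form a bidirectional ring and each router has one additional "across" link to the diametrically opposite router. The Python sketch below lists the neighbours of a router and counts hops with a breadth-first search; the node numbering and function names are assumptions, and the actual S-STNoC routing algorithm is not modelled.

```python
from collections import deque
from typing import List


def spidergon_neighbours(node: int, n: int) -> List[int]:
    """Clockwise, counter-clockwise and across neighbours of `node` (n even, n >= 4)."""
    if n < 4 or n % 2:
        raise ValueError("this sketch assumes an even number of routers >= 4")
    return [(node + 1) % n, (node - 1) % n, (node + n // 2) % n]


def hops(src: int, dst: int, n: int) -> int:
    """Minimal number of router-to-router hops, found by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            return dist[cur]
        for nxt in spidergon_neighbours(cur, n):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("destination not reachable")


if __name__ == "__main__":
    print(spidergon_neighbours(0, 8))
    print([hops(0, d, 16) for d in range(16)])
```

The across link roughly halves the worst-case distance of a plain ring while keeping a fixed, small router degree.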
The architecture of S-STNoC Network Interface has
been designed with modularity in mind. It has two main
components, shell and kernel.
The shell handles the processing activity related to
the signals coming from the IP resource side, while the
kernel takes care of the activity related to the NoC side
of the Network Interface. The shell architecture
depends on the specific IP protocol while the kernel
architecture does not. S-STNoC Network Interfaces
support data size conversion and, thanks to FIFO
synchronizers, kernel and shell subsystems running at
totally unrelated frequencies. The S-STNoC Physical
Link handles Router-to-Router and Router-to-NI
connections [23].
It is made up of two interfaces: 1) Upstream (US)
interface, associated to the transmitter module and 2)
Downstream (DS) interface, associated to the receiver
module.
Links are half-duplex and support both synchronous
and mesochronous clocking strategies.
The latter enables full bandwidth communication (1
GHz in our experiments in 65 nm CMOS) between IP
macrocells clocked by signals with the same frequency
but an arbitrary amount of skew. The verification of
S-STNoC building blocks is based on Specman [24].
6. Tile aware System SW and benchmarks
Although this paper is an introduction to the SHAPES
HW architecture, we find it useful to summarize a few
key ideas about the System SW, introduced in [10].
The SHAPES project will adopt innovative layered
system software, which does not destroy the
information about algorithmic parallelism, data and
workload distribution and real-time requirements
provided by the programmer.
The System SW will be fully aware of the Tiled HW
paradigm. For efficiency and predictability, the system
SW manages intra-tile and inter-tile latencies,
bandwidths, computing resources, using static and
dynamic profiling.
The application is described using a model-based
approach in terms of a network of actors with explicit
real-time constraint annotations, and it is mapped (i.e.
bound to the processing and communication resources
and instrumented with scheduling strategies) by an
iterative automated (or semi-automated)
multi-objective optimization.
The mapping procedure uses performance
estimates obtained at different levels of abstraction:
from analytic prediction down to queries answered by
the simulation environment.
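The flavour of such a mapping step can be conveyed by a deliberately tiny example. The sketch below enumerates every assignment of a few actors to tiles and scores each one with a weighted sum of two objectives, the load of the busiest tile and the traffic crossing tile boundaries. It is a toy under stated assumptions and not the SHAPES mapping tool: the actor names, the cost figures, the weight `alpha` and the exhaustive search are all hypothetical, and a weighted sum is only one simple way to scalarize a multi-objective problem.

```python
from itertools import product
from typing import Dict, Tuple


def map_actors(work: Dict[str, int],
               traffic: Dict[Tuple[str, str], int],
               n_tiles: int,
               alpha: float = 1.0) -> Tuple[Dict[str, int], float]:
    """Exhaustively search the actor-to-tile mapping with the lowest cost.

    cost = (load of the busiest tile) + alpha * (traffic between different tiles)
    """
    actors = sorted(work)
    best, best_cost = None, float("inf")
    for assignment in product(range(n_tiles), repeat=len(actors)):
        tile_of = dict(zip(actors, assignment))
        load = [0] * n_tiles
        for actor, tile in tile_of.items():
            load[tile] += work[actor]
        comm = sum(v for (a, b), v in traffic.items() if tile_of[a] != tile_of[b])
        cost = max(load) + alpha * comm
        if cost < best_cost:
            best, best_cost = tile_of, cost
    return best, best_cost


if __name__ == "__main__":
    # Hypothetical estimates, e.g. as produced by profiling or simulation.
    work = {"src": 2, "fir": 10, "fft": 10, "sink": 2}
    traffic = {("src", "fir"): 1, ("fir", "fft"): 8, ("fft", "sink"): 1}
    print(map_actors(work, traffic, n_tiles=2))
```

Exhaustive search is only viable for a handful of actors; it is used here to keep the example short and verifiable.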
The layered structure of the software separates the
application code from the Hardware dependent
Software. Therefore it is possible to debug the
application and System SW layers independently and
obtain higher simulation speeds.
The application can be debugged using a Virtual
Architecture, which is agnostic of the HdS detail, and
of the particular Hardware abstraction Layer offered by
the real HW.
The HdS generator refines the System software,
first through a Transaction Accurate step, and then
down to the Virtual Prototype level, where the
generated System Software contains all the details
needed to drive the actual HW platform.
The SW accesses the on-chip and off-chip networks
through a homogeneous interface. The same HW and
SW interface can be adopted for integration with signal
acquisition and reconfigurable logic tiles.
Generation after generation, the number of tiles on a
single-chip will grow, but the application will be
portable.
The SHAPES HW and SW platform will be
benchmarked through a set of applications
characterized by a large inherent parallelism and,
typically, by real-time constraints: wave field synthesis
for arrays of sound sources reproduced by large arrays
of loudspeakers, treatment of signals acquired by
arrays of microphones, Ultrasound Scanners, and
Theoretical Physics (Lattice Quantum
Chromodynamics).
7. Acknowledgements
SHAPES (scalable Software Hardware Computing
Architecture for Embedded Systems) is a project
partially funded by the European Commission
(FET-FP6-2004-IST-4.2.3.4(viii) - Advanced Comp. Arch.).
The SHAPES HW Team acknowledges the fruitful
contribution of the project partners working on System
SW and Benchmarking Applications. In particular we
would like to thank Lothar Thiele – ETH Zurich,
Rainer Leupers – RWTH Aachen, Ahmed A. Jerraya –
TIMA Grenoble, Gert Goossens – TARGET Compiler
Technologies, Thomas Sporer – Fraunhofer IDMT for
discussions and feedback.
8. References
[1] D. Sylvester and K. Keutzer, “Impact of Small Process
Geometries on Microarchitectures in Systems on a Chip”,
Proc. IEEE, 89-4(2001)467-489.
[2] W.J. Dally and S. Lacy, “VLSI Architectures: Past,
Present and Future”, Proc. Advanced Research in VLSI
Conf., IEEE Press (1999)232-241.
[3] A. Allan et al., “2001 Technology Roadmap for
Semiconductors”, IEEE Computer 35-1(2002)42-53.
[4] R. Ho, K. Mai and M. Horowitz, “The Future of Wires”,
Proc. IEEE, 89-4 (2001)490-504.
[5] J. Rabaey, A. Chandrakasan, B. Nikolic, Digital
Integrated Circuits, 2-nd Edition, Prentice-Hall (2003)
Chapter 4 and 9.
[6] M.B. Taylor et al., “The Raw Microprocessor: A
Computational Fabric for Software Circuits and
General-Purpose Programs”, IEEE Micro 22-2(2002)25-35.
[7] L.P. Carloni, A.L. Sangiovanni-Vincentelli, "Coping with
latency in SOC Design", IEEE Micro 22-5 (2002) 24-35.
[8] P.S. Paolucci et al. “Janus: A gigaflop VLIW+RISC Soc
Tile”, Hot Chips 15 IEEE Stanford Conference (2003).
http://www.hotchips.org (note: Janus was the development
name of the first generation of DIOPSIS).
[9] Paolucci, P. S., “The Diopsis Multiprocessor Tile of
SHAPES” 6th International Forum on Application-Specific
Multi-Processor SoC MPSOC’06 (Colorado, August 2006)
[10] Paolucci, P. S., Jerraya, A. A., Leupers, R., Thiele, L.,
and Vicini, P. 2006. “SHAPES: a tiled scalable software
hardware architecture platform for embedded systems.” In
Proceedings of the 4th international Conference on
Hardware/Software Codesign and System Synthesis (Seoul,
Korea, 2006). CODES+ISSS '06. ACM Press, 167-172.
DOI= http://doi.acm.org/10.1145/1176254.1176297
[11] A. Bartoloni, P.S. Paolucci et al., “A Hardware
Implementation of the APE100 Architecture”, Int. Journ.
Mod. Phys. C 4(1993)969.
[12] N. Cabibbo and P.S. Paolucci, “SIMD algorithm for
Matrix Transposition”, Int. Journ. Mod. Phys. C 6(1995)183.
[13] F. Aglietti, P. S. Paolucci, et al. , “The teraflop
supercomputer APEmille: architecture, software and project
status report” Computer Physics Communications, 110,1-3
(May 1998) 216-219
[14] F. Belletti et al., “Computing for LQCD: apeNEXT”,
Computing in Science and Engineering, 8-1, pp. 18-29,
Jan/Feb, 2006.
[15] ATMEL Roma, “DIOPSIS: Dual Inter Operable
Processor in A Single Silicon”, www.atmelroma.it
[16] P.S. Paolucci, P. Kajfasz et al., “mAgic-FPU and
MADE: A customizable VLIW core and the modular VLIW
processor architecture description environment”, Computer
Physics Communication 139(2001)132-143.
[17] “Chess/Checkers, a retargetable tool-suite for embedded
processors”, Target Compiler Technologies,
http://www.retarget.com/doc/target-whitepaper.pdf.
[18] P. Faraboschi, G. Desoli, J.A. Fisher, “The Latest Word
in Digital and Media Processing”, IEEE Signal Processing
Mag. 15-2(1998)59-85.
[19] R.P. Colwell, J. O’Donnell, D.P. Papworth, P.K.
Rodman, “Instruction Storage Method with a Compressed
Format Using a Mask Word”, U.S. Patent 5,057,837, (Oct
1991).
[20] P. S. Paolucci, “Apparatus and Method for Dynamic
Program Decompression”, U.S. Patent 6,766,439, (Jul 2004).
[21] R. Locatelli, G. Maruccia, L. Pieralisi, A. Scandurra,
M. Coppola, “Spidergon: a novel on-chip communication
network”, International Symposium on System-on-Chip,
2004, pp. 15-26
[22] L. Benini, G. De Micheli, “Powering network-on-chips”,
The 14th International Symposium on System
Synthesis (ISSS), pp. 33-38, 2001.
[23] D. Mangano, R. Locatelli, A. Scandurra, C. Pistritto, M.
Coppola, L. Fanucci, F. Vitullo, D. Zandri, “Skew
Insensitive Physical Links for Network on Chip”, Nano-Net
2006, September 14-16, 2006.
[24] The e functional verification language,
http://www.ieee1647.org/