On-Chip Communication (Architecture and Design)
Sungjoo Yoo
ISRC, SNU

Contents
- Part 1
  - Introduction to on-chip communication
  - On-chip communication architecture
    - Software architecture
    - Hardware architecture
  - On-chip communication networks
- Part 2
  - Analysis and optimization of on-chip communication network
  - On-chip communication design on unreliable interconnect
  - Open issues and summary

Part 1

Introduction
- On-chip communication design
[Figure: a high-level functional specification (modules M1, M2, M3) refined into an SoC implementation of the on-chip communication architecture: µP and IP with SW and HW wrappers on a physical communication network]

Designer’s Objectives and Problems
- High performance
  - What is the maximum bandwidth of a wire?
  - What is the best-suited on-chip communication architecture (OCA)?
- Low power consumption
  - What is the minimum energy required to send a given amount of data?
  - How can the minimum energy be achieved?
- Small HW/SW overhead
  - Interconnection and transceiver
- Conflicting objectives
  - Trade-offs

Incremental Refinement of On-Chip Communication

Specification of On-Chip Communication
- Abstraction levels of on-chip communication
  - Client/server level
  - Message level
  - Transaction level
  - Implementation level

Client/Server Level
- Concept
  - Service request/provide relation
    - A client component demands a service from server(s).
  - The service provider component may not be fixed and can be determined dynamically.
    - An object request broker (ORB) is needed.
- Real example
  - Modem service
    - PDA device: baseband modem and vocoder
    - The modem service can be Bluetooth, IEEE802.11, CDMA2000, GPS, etc., depending on the location of the PDA device.
      - Indoor: Bluetooth or IEEE802.11
      - Outdoor: IEEE802.11 (short range) or CDMA2000

Message Level
- Concept
  - Components communicate with each other via messages.
  - Message sender and receiver are fixed.
  - A message can have any type of data.
- Real example
  - PDA: in the CDMA2000 mode, the vocoder sends messages to the CDMA2000 modem.
  - A message has a frame of voice data and control info.

Transaction Level
- Concept
  - Components are mapped on real processors.
  - Communication is mapped on abstract communication networks.
  - Communication protocols are fixed.
  - A transaction can be read, write, burst_read, burst_write, etc.
  - For each candidate real communication network, the transaction performance can be analyzed.
- Real example
  - PDA: vocoder on a DSP, modem on an IP, candidate communication networks (AMBA, Sonics, IBM, ...)
  - Determine bus priorities, packet priorities, TDMA slot assignment, etc.

Implementation Level
- On-chip communication architecture is implemented.
  - Software and hardware architecture
[Figure: SW architecture (application SW, middleware, OS, device drivers) on a µP/DSP with local memory w/ I/D caches and DMA on the processor local bus; HW architecture with adapters connecting the processor, HW IP and memory to the communication network (OCBs w/ bridges, Sonics, packet/circuit switch, etc.)]

On-Chip Communication Architecture
- Software
  - Middleware, OS, device driver and ISR, memory instructions
- Hardware
  - DMA, (bus) adapter, communication network (OCBs and bridges, packet network, etc.), memory

Software On-Chip Communication Architecture
- Middleware: CORBA, COM+, JAVA, BREW
  - Service resolution
    - ORB implementation
  - Dynamic reconfiguration of services needs to be supported.
    - 802.11 baseband modem in HW --> Bluetooth in SW
- Operating system
  - Communication services
    - Pipe, shared memory, semaphore, mutex, etc.
    - Supported as OS system calls

Software On-Chip Communication Architecture
- Device driver and ISR
  - The device driver depends on the OS and the processor.
    - OS: preemptive or not, interrupt or not, synchronization services (semaphore, lock var, ...)
    - Processor: bus width, register set, exception behavior, etc.
- Memory instructions
  - Load/store, load multiple/store multiple instructions
  - Cache/virtual memory instructions in the ARM v6 architecture

Hardware On-Chip Communication Architecture
- DMA (Direct Memory Access)
  - Block size
- Adapter
  - Basic functionality: protocol conversion
    - E.g. VCI -- AMBA
  - Local communication architecture
  - Distributed bus arbitration/network routing: e.g. Sonics, packet switch network
[Figure: µPs and IPs (modules M1, M3, M4) with OS and adapters on AMBA and CoreConnect buses, with channel adapters (Ch. adp)]

Hardware On-Chip Communication Architecture
- Communication network
  - On-chip bus
    - AMBA, CoreConnect, PI, etc.
    - Sonics µNetwork
  - On-chip communication network
    - Circuit switch: Philips
    - Packet switch: W. Dally (DAC01), Guerrier (DATE00)

Hardware On-Chip Communication Architecture
- On-chip memory
  - Shared memory
    - E.g. external SDRAM in multimedia chips
  - Distributed memory w/ caches: e.g. Daytona architecture
    - Four 64-bit processing elements (PEs)
    - Each PE
      - 32-bit RISC with DSP enhancements
      - 64-bit vector co-processor (four MACs)
    - Split-transaction bus
      - Shared memory based on L1 cache snooping
      - Caches reduce bus traffic.
    - Embedded RTOS dynamically schedules tasks.
    - 120 mm², 0.35 µm, 100 MHz

Hardware On-Chip Communication Architecture
- On-chip memory (cont’d)
  - On-chip implementation of linked list
    - Philips, DATE01
  - Data transfer and storage exploration (DTSE)
    - IMEC
    - Focus on low power consumption and area of memory

On-Chip Communication Networks
- Routing
  - Sonics µNetwork SiliconBackplane
  - Philips, Circuit Switch Network
  - Packet Switch Networks, Guerrier, DATE00
- Network topologies
  - Mesh, W. Dally, DAC2001
  - Octagon, ST Microelectronics, DAC2001

Sonics µNetwork SiliconBackplane
- On-chip bus
- Time-division multiple access (TDMA)
- Pre-characterized on-chip bus agent

Two-step Arbitration
- Step 1, TDMA: each slot goes to its originally assigned module.
- Step 2, priority-based: used if the assigned module makes no bus access.

Pipelined TDMA Bus Arbitration
- Pipeline depth
  - Based on memory target latency at the desired clock frequency

Design Example: CarrierClass VOIP Processing Card
- DSP + CPU banks + IO + DRAM
- DSP: ~16 processors
  - Voice and modem protocols
  - LEC
- CPU: ~4 processors
  - Packet protocols
  - Control (call setup)
- Hi BW SDRAM

Communication Bandwidth Requirements: Basic I/O
- IO traffic is low BW.
- Data IO rates = 1000 ch x 64 kb/s x 3 full duplex = 48 MB/s (worst case)
- Data are buffered to SDRAM.

Communication Bandwidth Requirements: Cache Updates
- CPU cache swap
- Assuming 1.6 MIPS/channel
- Total BW requirements: 48 + 600 + 320 = 968 (MB/s)

µNetwork Implementation

Derivative Design Example
- Full G.168 LEC uses a specialized core
- LEC has local 4 MB memory
- # of channels: 1000 --> 2000
- Increased traffic
- Bus width: 64 --> 128 (bits)

Circuit Switch Network: Philips PROPHID Architecture
- Focus on high-throughput signal processing for multimedia applications
- Requirements
  - High computation capacity and high communication bandwidth
  - Performance and programmability
- PROPHID
  - Heterogeneous multi-processor architecture consisting of general and application specific processors
    - General purpose processor: control and low-medium signal processing
    - Application specific processors: high performance signal processing

Philips Multi-window TV application

PNX8500

PROPHID architecture

PROPHID: An Architecture Template
- For high throughput: ~10 Gbits/s and reconfigurable connection (switch matrix, 20 proc’s, 64 MHz)
- Programmability and control app’s
- ~10 GOPS
- Control-oriented bus
- Autonomous tasks based on data-driven execution

PROPHID: Autonomous Execution of ADS Processors
- Autonomous task execution on Application Domain Specific (ADS) processors
- Stream-based execution
- Data availability determines the execution of tasks.
- Master (CPU)-slave synchronization can be avoided.

Kahn Process Network Model of Multi-window Application

Communication Infrastructure

Processor Model and Surrounding Shell

Circuit Switch Network
- Guaranteeing the throughput of streams with hard-real-time constraints in the PROPHID architecture
- Requirements of task execution on ADS processors
  - Time-interleaved task execution
  - Each task requires input/output FIFOs.

Circuit Switch Network
- Network topology

Time-Space-Time Routing

High-Performance Communication Network in PROPHID Architecture
[Figure: time, space, time switching stages]

Chip Photo and Metrics

A Generic Architecture for On-Chip Packet-Switched Interconnections, DATE 2000
- A scalable system-level interconnection template is presented.

A Generic Architecture for On-Chip Packet-Switched Interconnections
- A bus-based architecture will not meet the bandwidth requirements, since it is inherently non-scalable in terms of bandwidth.
  - Bandwidth is shared by the connected components.
- Multiple on-chip bus approaches like VSIA
  - Case-specific grouping of IPs
  - Not a truly scalable and reusable interconnection
- In this paper, a generic interconnection template is presented.

A Generic Architecture for On-Chip Packet-Switched Interconnections
- Switching networks
  - Circuit switching, like the PROPHID communication network
    - High performance
    - Drawbacks
      - Lack of reactivity against rapidly changing communication
        - E.g. data bursts in MPEG (the worst case should be assumed), random traffic between CPU master and slaves
  - Packet switching
    - Packets are transferred by routers, as in the Internet.
    - Because routing decisions are distributed over the routers, the network can remain very reactive.

Packet Routing
- Wormhole routing

Network Topology: Fat-tree Network
- Ex. 16 terminals: 8 --> 8 communication
- The terminals can be processors, DSPs, memory, etc.
- Routers are free to use any of the available paths.
- Packet: a sequence of 32-bit words
- Packet payload may be of any size.

Scalability of Fat-Tree Network

Scaling and Protocol Stack

Real Implementation

Network Costs and Latency
- One drawback of packet-switched network --> inherently arbitrary delay

Pros and Cons: Bus versus Network

Structured On-Chip Communication Network, DAC2001
- Why structured network?
  - Global routing on SoC is hard to characterize and design.
  - It would be better to have electrically well-characterized wiring.
- Top 2 metal layers are used
- 2D folded torus topology
- Each tile can have processor, DSP, memory, I/O, etc.
- 256-bit data line
- Virtual channel support

Router Architecture

Real Implementation
- 0.1 µm CMOS
- Router overhead
  - Eight virtual channels at each edge of tile
    - 4 flits x 300 b/flit = 1200 b per virtual channel
  - Each tile has ~5 kB (= 4 edges x 8 virtual channels x 1200 b) buffer storage
  - Metal routing: 50 µm x 3 mm
  - Total router overhead: 6.6% (0.59 mm²)

Network Processor Design: ST Microelectronics Octagon
- OC-768
  - 40 Gbps
  - 114 x 10^6 packets/s, 44 B/packet
- Processing requirement
  - 1 / (114 x 10^6) ≈ 9 ns/packet
  - 1 packet needs the execution of 500 instructions
  - 57 GIPS
    - No single processor!
    - Multiprocessors w/ high communication BW
- Communication network for multiprocessor SoC of OC-768
  - Octagon

ST Microelectronics Octagon
[Figure: Octagon and cross bar]

Node Model

Scaling and Comparison with Cross Bar

Summary
- Introduction to on-chip communication
- On-chip communication architecture
  - Software architecture
  - Hardware architecture
- On-chip communication networks
  - Routing
  - Topology
- Part 2 will treat
  - Analysis and optimization of on-chip communication network
  - On-chip communication design on unreliable interconnect
  - Open issues and summary

Part 2

Analysis of On-Chip Communication
- Analysis
  - Quality of service, runtime, power consumption, etc.
- Modeling of architecture components
  - OS modeling
  - Communication network modeling
    - On-chip bus
    - Packet switch network

Analysis of On-Chip Communication
- Given communication network and mapping
  - Trace-based: S. Dey
  - Worst-case: R. Ernst (SW + HW)
  - Statistical analysis
    - Queueing theory in packet switch network
  - Other modeling methods

Performance Analysis of On-Chip Communication
- Analysis with synthetic statistical testbenches
  - Hierarchical bus, TDMA, Ring: ICVD'00
- Trace-based analysis
  - Hierarchical bus: ICCAD99, SiPS
- Queueing theory
  - Circuit, packet switch: DAC01

Optimization of On-Chip Communication
- HW architecture
  - Communication resource management
    - Mapping, (reconfigurable) interconnection (topology), scheduling and routing
    - Performance and power
  - Modulation/demodulation
    - Power
    - Average performance
- SW architecture

Optimization of On-Chip Communication Network
- On-chip bus design
  - Gajski
  - Daveau
  - Glesner
- Mapping and interconnection topology
  - S. Dey, ICCAD00
  - Potkonjak, ICCAD00
  - Pedram, DATE00
  - Others for low power, DAC2001

Optimization of On-Chip Communication Network
- Scheduling and routing
  - S. Dey: ICCAD, DAC (CAT, reconfigurable)
  - W/ mapping and interconnection topology
    - Circuit switch: Comm. arch. Book
    - Packet switch: Comm. arch. Book
    - Octagon
- For better optimization: not on a physical module basis, but on a virtual channel or message basis!

Optimization in SW On-Chip Comm. Architecture Design
- Middleware, OS, device driver
  - Minimum service implementation
    - Component-based middleware/OS design
      - Pebble, GO!, ...
      - TIMA
  - JAVA-based implementation
    - JavaOS and JVM
  - Application-specific implementation
    - BREW

On-chip communication on unreliable interconnect
- Encoding/decoding
  - Low-power bus encoding: DAC, Benini
- Communication on unreliable communication media
  - CDMA style
  - To maintain average/statistical performance
  - Find the paper

“Designing Systems-on-Chip Using Cores”, DAC 2000
R. A. Bergamaschi and W. R. Lee
- The problem of assembling SoCs using IP blocks
  - Error-prone, labor-intensive and time-consuming, since the designer must understand the functionality, interfaces and electrical characteristics of cores such as processors, mem. controllers, bus arbiters, etc.
  - Moreover, cores are parameterized and need to be configured according to their use in the SoC.

Designing Systems-on-Chip Using Cores
- With the VSIA’s Virtual Component Interfaces, the designer still has to do
  - wrapper design
  - architecture design
  - assembling the SoC using VCIs and wrappers
- A digression: two key points in our design flow
  - Application-specific wrapper (comm. co-processor) design
  - Application-specific architecture design flow

Designing Systems-on-Chip Using Cores

Designing Systems-on-Chip Using Cores
- Designers’ tasks
  - Configure the bus architecture
  - Define the cores to be used
    - 32, 64, 128 bit bus, proc. characteristics, HW/SW
  - Understand the functionality of all pins on all cores and determine their connections
  - Define request priorities, e.g. interrupt priorities
  - Define the usage of DMA
  - Define address maps
  - Define clock domains
  - Insert glue logic
  - Insert/configure test logic
- There has been no tool to automate those tasks.

Designing Systems-on-Chip Using Cores
- Automating SoC integration: 6 steps
  1. Virtual design
     - A virtual component (VC) is a representation of a class of real components.
       - E.g. the PowerPC VC represents all real PowerPC cores (e.g. 401, 405, etc.).
     - A virtual interface is used instead of the real interface.
       - Smaller number of interface pins
  2. Glueless interface
     - Automatic generation of glue logic
       - First, include the necessary glue logic in the core.
       - The remaining minor glue logic is automatically generated.

Designing Systems-on-Chip Using Cores
- Automating SoC integration: 6 steps
  3. Core and pin properties
     - Encode the structural and functional characteristics of a component and its pins
     - Properties are attached to all components and pins.
     - An automatic pin connection algorithm is used.
     - Properties
       - BUS_TYPE: ASB, APB, etc.
       - INTERFACE_TYPE: MASTER, SLAVE
       - FUNCTION_TYPE: READ, WRITE, INTERRUPT
       - OPERATION_TYPE: REQUEST, ACKNOWLEDGE
       - DATA_TYPE, RESOURCE_TYPE

Designing Systems-on-Chip Using Cores
- Automating SoC integration: 6 steps
  4. Interconnection engine
  5. Virtual to real synthesis

Designing Systems-on-Chip Using Cores
- Automating SoC integration: 6 steps
  6. Configuration engine
     - Clocking, address map, interrupt map, DMA channel assignment, etc.
- Comments
  - To free the designer from pin interconnection and glue logic design
- Limitation
  - Automation applies to HW module interfaces only at pin level (with a fixed target architecture).
  - SW module interfacing (i.e. targeting and processor interfacing) is not considered.

Open Issues
- Architectural trade-off
  - HW/SW trade-off in middleware and OS service implementation
- Communication network design
  - Prioritized packet network design
  - Interconnection topology design with physical DSM effects

Open Issues
- Reconfigurable on-chip communication
  - In connection with component-based SoC design
- On-chip communication design w/ unreliable media
  - Unreliable physical wiring and environment

Summary
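
Worked check: OC-768 processing budget

The processing requirement quoted for the ST Microelectronics Octagon network processor in Part 1 (40 Gbps, 44 B/packet, 500 instructions per packet) can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-envelope check under the slide's own assumptions; the function name `packet_budget` and the constant names are hypothetical and chosen for illustration.

```python
# Back-of-envelope check of the OC-768 figures quoted in the slides.
# All three inputs come from the slides; the names are illustrative only.
LINE_RATE_BPS = 40e9      # OC-768 line rate, rounded to 40 Gbps as in the slides
PACKET_BYTES = 44         # packet size assumed in the slides
INSTR_PER_PACKET = 500    # instructions executed per packet, from the slides


def packet_budget(line_rate_bps, packet_bytes, instr_per_packet):
    """Return (packets per second, ns available per packet, required GIPS)."""
    packets_per_s = line_rate_bps / (packet_bytes * 8)   # 8 bits per byte
    ns_per_packet = 1e9 / packets_per_s                  # time budget per packet
    gips = packets_per_s * instr_per_packet / 1e9        # giga-instructions/s
    return packets_per_s, ns_per_packet, gips


pps, ns, gips = packet_budget(LINE_RATE_BPS, PACKET_BYTES, INSTR_PER_PACKET)
print(f"{pps / 1e6:.0f} Mpackets/s, {ns:.1f} ns/packet, {gips:.0f} GIPS")
# prints: 114 Mpackets/s, 8.8 ns/packet, 57 GIPS
```

The result matches the slide's rounded values (114 x 10^6 packets/s, about 9 ns per packet, 57 GIPS), which is the basis for the conclusion that no single processor suffices and that a multiprocessor with high communication bandwidth is needed.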