Platform-based Design
Prof. Jun-Dong Cho (조준동), Sungkyunkwan University

Agenda
• Why Platform-based Design?
• The need for S/W-configurable platforms
• Design Space of Reconfigurable Architectures
• Reconfigurable Radio and Multimedia Systems
• Network-centric Design: Clock and Power
• Reliable Design

SoC and Customizable Platform-Based Design
[Figure: SoC containing a DSP, fine-grain reconfigurable hardware, ASIC 1, ASIC 2, and coarse-grain reconfigurable hardware]

Semiconductor Revolutions
"Mainstream Silicon Application is switching every 10 Years"
[Figure: timeline 1957, 1967, 1977, 1987, 1997, 2007 alternating between standard and custom, hardware and software; labels: TTL; custom LSI, MSI; standard µproc., memory; ASICs, accel's; reconfigurable FPGAs; coarse grain]

Definition of Platforms?
• An architecture that is designed for an application domain

Platform Classification
• Application platforms:
  – Multimedia platforms: Nexperia, TI's OMAP
  – 3G wireless platform: Infineon's M-gold
  – Bluetooth platform: Parthus
  – Wireless platform: ARM's PrimeXsys
• Processor-centric platforms: Improv Systems, ARC, Tensilica, Triscend
• Communication-centric platforms: Sonics, Palmchip

SoC Platform Adaptation
The Platform-Based Design Concept (Cadence)
• Pre-qualified/verified foundation IP*
• HW-SW kernel + reference design
• Scalable bus, test, power, I/O, clock, and timing architectures
• Processor(s), RTOS(es), and SW architecture
• Foundry-specific HW qualification
• SW architecture characterisation
• Application space: hardware IP, SW IP, MEM, CPU, FPGA
• Programmable reconfigurable hardware region (FPGA, LPGA, …)
* IP can be hardware (digital or analogue) or software. IP can be hard, soft, or 'firm' (HW), source or object (SW).

Platform Architecture
• Do I need a dedicated DSP?
• Which microcontroller? ARM? HC11? ARC?
• Which RTOS do I use? Which scheduling policy do I have to choose?
• How fast will my user interface software run?
• How much can I fit onto my microcontroller?
• Which bus? PI? AMBA? A dedicated bus for the DSP?
• Can I buy a QCELP decoding core?
• Do I need dedicated HW, or can I run this on the microcontroller?

Example of a commercial SoC
• More CPUs? More SRAM/Flash? Add FPGA?
A Legacy SoC Approach
• CoreConnect (PPC), AMBA (ARM), …
• Networks-on-Silicon, Philips MP-SoC cluster

Definition of MP-SoC?
• Usually a heterogeneous multiprocessor: CPUs, DSPs, etc.
• Hardwired accelerators.
• Mixed-signal front end.

Problems with existing MP-SoCs
▷ Under a power budget, a monolithic processor consumes a large amount of power.
▷ Using several identical (homogeneous) processors gives low resource utilization, so power grows linearly.
▷ The on-chip interconnect has been designed independently, separately from the cores and caches.
▷ The interconnect depends on logic as well as on wires.
▷ Processors are constrained by wire and memory latencies.
▷ Peak performance is reached only for specific application domains.

4G: Multiple standards
• Communications. Networking. Multimedia. Security.
• Digital RF that supports multiband/multimode operation

The triangle: Chicken and Egg?
• Hardware and software architectures determine capabilities.
• Applications guide design decisions.
• Methodologies allow repeatable, predictable design.
[Figure: triangle linking architectures, applications, and methodologies]

Why Multi-Threaded Cores?
• Increasing gap between memory and processor speeds (2x / 2 years)
• More parallel processing (lower power, higher performance per mm2)
• Increasing gap between interconnect and gate delays (multi-clock)
[Figure: NoC with In/Out ports connecting a GPP with I$/D$, H/W processing elements, DSPs, a H/W multithreaded DSP, a RISC with I$, and SRAM]

MPSoC "Bus" Alternatives
• Fixed Bus [Bergamaschi, DAC, 2000]
  – Point-to-point communication
  – Signals between cores are transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
  – Programmable interconnects
  – Employs a static network
• Arbitrated Bus [IDT Inc., 2000]
  – Time-shared multiple-core connectivity
  – Uses an arbiter
• Hierarchical Bus [AMBA, ARM Inc]
  – Combines multiple buses using bus
  – Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
  – Resources communicate with data packets
  – Uses a switch fabric

Future mobile platform?
Mudge et al.: Mobile supercomputing
• Speech recognition. Cryptography. Augmented reality. Typical applications (email, etc.).
• Requires 16x 2 GHz Pentium 4.
• Peak power must not exceed 75 mW

Future mobile application platform?
• Culture and Education?
• Personal Entertainment Platform?
Road Map to MP-SoC

Trends
• Mask NRE: over $1M; design NRE: $10M to $75M
• ASICs replaced by programmable ASSPs and FPGAs
• Number of embedded processors
  – DVD/STB/HDTV, mobile phones: 5 to 8
  – Image processing, networking, base station: 8 to 100+
• eS/W complexity
  – Set-top box, audio: >1 million lines of code
  – eS/W is becoming an essential part of SoCs
• ?'s Law?

Should the SoC designer work hard?
[Figure: design flow with stages Requirements, Compose the system (SoC Composer), Simulate, Synthesis + P&R, Simulate (performance), and Tape Out, each followed by a Verify step (timing and area after Synthesis + P&R)]
• Why is verification so important for mobile SoCs?
• Why has our verification become so weak?

More SoC topics …
• Platform optimization
• Power management
• BW allocation
• Resource sharing
• Task distribution
• Efficient communications
• Low Power
• Verification
• Training talent (System Architects)

Available Mobile and VLIW Processors
• The ARM Family
  – The ARM7 Generation
  – The StrongARM
  – The ARM Thumb Option
  – The ARM Piccolo Option
  – The ARM9 and ARM10
• The Motorola M-Core
• The LSI TinyRisc
• The Hitachi SuperH Family
• VLIW Processors
  – The Motorola-Lucent Star*Core
  – The Philips TriMedia
  – The HP/Intel IA-64

Nexperia™ DVP hardware architecture (source: Th. Claasen, Philips, DAC 2000)

Exploitable Parallelism
[Table: minimum parallel grain size (instructions) versus exploitable task parallelism for instruction-level parallelism, MultiFlex thread-level parallelism, and GP O/S thread-level parallelism; grain sizes listed: 10 000's of instructions, 100's, 1; parallelism listed: 1~100, 1~8, 2~6]

NEC MP211: Homogeneous MP core
• Asymmetric MP with very coarse-grain multitasking
• 3 ARM9s used as predefined function units
• No complex overhead: e.g., no cache coherency, no dynamic scheduling/load balancing

Advantages of MP-SoC
• Easy hardware implementation: using processor cores that are already in wide use shortens hardware development time and lowers cost.
• Lower power consumption: the distributed tasks are handled by multiple processors running at a lower clock frequency. A lower clock frequency permits a lower supply voltage, which reduces power consumption.
• Scalable: performance and cost can be tuned by increasing or decreasing the number of processor cores.
• Boosted real-time performance: each application can run on a different processor, which reduces interference between multiple applications.
• Higher system safety: system software and untrusted applications can be kept apart by running them on different processors.
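
The power argument in the MP-SoC advantages above (lower clock, hence lower supply voltage, hence lower power) can be illustrated with the usual first-order dynamic power model P = a * C * Vdd^2 * f. The sketch below is only illustrative: the capacitance, voltages, and frequencies are assumed numbers, not values from the slides, and leakage and communication overhead are ignored.

```python
# First-order dynamic power model: P = activity * C_eff * Vdd^2 * f.
# All numeric values below are assumptions chosen for illustration.

def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """Return dynamic power in watts for one core."""
    return activity * c_eff * vdd ** 2 * freq

# One core doing all the work at full clock and full supply voltage.
single = dynamic_power(c_eff=1e-9, vdd=1.2, freq=400e6)

# Two cores sharing the work at half the clock; the lower clock is
# assumed to allow the supply to drop from 1.2 V to 0.9 V.
dual = 2 * dynamic_power(c_eff=1e-9, vdd=0.9, freq=200e6)

print(f"single core: {single:.3f} W")
print(f"dual core:   {dual:.3f} W")
print(f"dual/single: {dual / single:.4f}")
```

With these assumed numbers the same aggregate clock throughput costs roughly 56% of the single-core dynamic power; the saving comes entirely from the quadratic dependence on Vdd.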
AMP task allocation [image]
Bus and Memory Architecture
MP211 block diagram
Power consumption of H.264+AAC

Holistic design of multi-core architectures
• A naïve methodology is inefficient
• Demonstrated inefficiency for cores and proposed alternatives
  – Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
  – Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
  – Conjoined-core Chip Multiprocessing [MICRO04]
• What about interconnects?
  – How much can interconnects impact processor architecture?
  – Do they need to be co-designed with caches and cores?

Heterogeneous MP Core
▷ A single-ISA heterogeneous multicore architecture outperforms approaches based on voltage scaling, clock gating, speculation control, and the like.
▷ Compared with a homogeneous CMP (Chip Multiprocessor), a heterogeneous CMP (or asymmetric CMP) has many advantages. Many applications want to use small cores as well as large cores. The best choice also depends on the system context, such as running on battery versus mains power. It is therefore efficient to use cores of differing complexity.
▷ A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time. In a single-ISA heterogeneous CMP, however, every core executes the same ISA, so an application can be mapped to any core. The number, size, and type of cores, and the caches, must be decided. In an 8-core processor, the interconnect consumes as much power as one core. Using dynamic voltage scaling and gating the unused cores improves the energy-delay product by 75%.
▷ Taking a dual processor as an example, heterogeneous processors that exploit low and high thread levels improve performance by 63% over homogeneous ones. With 5-8 threads the average improvement is 29%. According to Amdahl's law, the speedup of parallel applications is limited by their serial portion.
▷ The serial portion runs quickly on a large core, while the parallel portion runs on small, low-power cores, which maximizes performance per unit of power. [Annavaram, et al]

Problems with Heterogeneous MP-SoC
• Processors are bound by wire and memory latencies
• Peak performance on only a small class of applications
• How well they map to a given design
• Diversification of workloads
• Increased hardware complexity
• Poor resource utilization

Alpha cores scaled to 0.10 um.
EV8 is 80 times bigger but provides only two to three times more single-threaded performance.

Heterogeneous MP Core
• If two or more cores share L2, as many present CMPs do, a crossbar provides a high-bandwidth connection.
• A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time.

Characteristics of heterogeneous platforms
• In an 8-core processor, the interconnect consumes as much power as one core. Using dynamic voltage scaling and gating the unused cores improves the energy-delay product by 75%.
• Taking a dual processor as an example, heterogeneous processors that exploit low and high thread levels improve performance by 63% over homogeneous ones. With 5-8 threads the average improvement is 29%.
• According to Amdahl's law, the speedup of parallel applications is limited by their serial portion.
[Figure: Performance of heuristics for equal-area heterogeneous architectures with multithreaded cores.]

Exploring the potential from heterogeneity

CT3400 Multi-core DSP
• H.264 encoder, decoder, audio codecs, and system control
• Consists of eight 32-bit DSP cores, six 32-bit general-purpose processor cores, and a 128-pin programmable I/O subsystem
• C-programmable; supports H.264 and MPEG4 code
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

H.264 codec onto the Cradle CT3400 MDSP

CT3400 Multi-core DSP: DSP Engine
Each DSP engine contains:
• A Single Instruction Multiple Data Arithmetic Logic Unit (SIMD ALU)
• A Packed Integer Multiplier Accumulator (PIMAC)
• A Floating Point Unit (FPU)
• Bi-directional FIFO data buffers
• DMA channels
• A 128 x 32 register file and a 512 x 20 program memory

CT3400 DSP Engine
http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

CT3600 product family
CT3600 Multiprocessor DSP Family Members
• The CT3616 offers the industry's best price/performance encoding solution at $5.50 per channel (MPEG4 SP L3), more than twice as good as the nearest competing product.
• First in the industry to implement a single-chip, real-time D1 H.264 Main Profile video encoder based on programmable DSPs.
• 0.13-micron technology, 16 DSPs, and 8 general-purpose processors quadruple the overall performance.
• Priced from $40 to $90.
http://www.cradle.com/downloads/CT3600-PB.pdf

CT3616 Multi-core DSP
http://www.cradle.com/downloads/CT3600-PB.pdf

Problems with Homogeneous MP-SoC
• The hardware must be configurable for efficient execution across a broad class of applications.
• Each core consists of an array of homogeneous execution nodes, a banked instruction cache, data cache, register file, and block control logic.
• Some of the resources (called polymorphous resources) in the TRIPS architecture can be configured to operate differently depending on the mode (instruction, thread, or data parallelism).

HiBRID-SoC Architecture
HiBRID-SoC multi-core system-on-chip architecture
• Integrates a powerful on-chip communication structure
• A well-balanced memory system to account for the growing amount of data (e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))
• Dedicated chips for the MPEG-4 Simple Profile; very general processing demand
• Three programmable cores, each adapted towards a specific class of algorithms
• Combination of the cores and their software development environment
• An extension of a programmable core with dedicated modules (e.g., TriMedia)

HiBRID-SoC multi-core
• Developed at the University of Hannover

Multi-Core SoC Architecture
• Instruction level: VLIW (very long instruction word)
• Data level: SIMD (single instruction, multiple data)
• Task level (simultaneous multithreading)

HiPAR-DSP
• 16-datapath SIMD processor core controlled by VLIW
• Particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)

Stream Processor (SP)
• 32-bit RISC architecture that is optimized towards control-dominated tasks
• Bitstream processing or global system control

Macroblock Processor (MP)
• Efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit)
• Controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing steps

HiBRID-SoC multi-core architecture
• 64-bit AMBA AHB system bus connects all cores
• SDRAM memory via a 64-bit SDRAM interface
• Two versatile 32-bit host interfaces for access (e.g., to a host PC via PCI and to serial flash memory)
Figure 1. HiBRID-SoC multi-core architecture

HiPAR-DSP
• Highly parallel DSP core with a VLIW-controlled SIMD architecture
• The memory concept provides easy data exchange between the data paths, which is required for many filter and image processing algorithms
• The DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory
• At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3 GMACs
Figure 2. HiPAR-DSP architecture

Stream Processor
• The SP has been optimized for high-level programmability and efficient processing of control-driven applications
• Harvard architecture with a 32-bit data path consisting of 5 pipeline stages, controlled by 32-bit RISC instructions
• Supports conditional execution and forwarding interlocks, and provides full interrupt capability
• Converts the 64-bit AMBA bus width to the 32-bit internal

Macroblock Processor
• Heterogeneous data path structure consisting of a scalar and a vector data path
• The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)
• The vector data path is equipped with a 64-entry register file of 64-bit width
• Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms
• The MUL/MAC and ALU units incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand
Figure 3. Macroblock processor data paths.

HiBRID-SoC Implementations
• The HiBRID-SoC is fabricated in a 0.18 um, 6LM standard-cell technology
• 14 million transistors, 3.5 W
• Occupies 82 mm2 and operates at 145 MHz
Table 1. MPEG-4 ASP decoder (full TV resolution) performance on MP and SP, 720*576@25Hz, 1.5-3 Mbits:
Figure 3.
Chip layout of the HiBRID-SoC.

Analyzing On-chip Communication in MPSoC Environment
Proceedings of Design, Automation and Test'04, Mirko Loghi et al.
• Analysis and trade-off exploration of on-chip communication architectures.
• Compares and analyzes two practical configurations: AMBA AHB (ARM) and STBus (STMicroelectronics).
• Models the hardware and software of the MPSoC at a high level of accuracy and with sufficient simulation speed.
• Provides realistic performance figures by stimulating the communication system with functional traffic.

Multiprocessor simulation platform
Hardware architecture:
• Homogeneous MPSoC platform.
• Configurable number of 32-bit ARM processors.
• Processor cores: SWARM, a GPL-licensed ARM Instruction Set Simulator (ISS) written in C++
• Private memories for each processor.
• A shared memory.
• A hardware interrupt module.
• 32-bit interconnect
• All components are wrapped in SystemC

Multiprocessor simulation platform
Benchmarks running with RTEMS-OS:
• Running on top of RTEMS
• Synchronization: OS queues are used to exchange matrices between processors.
• Benchmark 1: Independent matrix multiplication.
• Benchmark 2: Pipeline of matrix multiplications.
Benchmark 1: Independent matrix multiplication:
• Each processor performs an independent matrix multiplication.
• No interprocessor communication is required.
• Operands are stored in the private memory of each processor.

Multiprocessor simulation platform
Benchmark 2: Pipelined matrix multiplication:
• The platform receives a continuous flow of input and output
• Every core follows this pattern:
  – Copies the input matrix from shared memory to private space
  – Multiplies the input matrix by a matrix already held in private space
  – Copies the resulting matrix back to shared space.
• Interrupt and semaphore slaves are queried to keep all processes synchronized.

Multiprocessor simulation platform
Code development and analysis tools:
• Development tool: GNU cross compiler
• Allows flexible profiling through functions of the simulator.
• Output of the simulator:
  – Statistics about processor and interconnect performance.
  – VCD waveforms of all bus signals
  – Traces of the memory accesses performed by every core.

Features of communication architecture
AMBA AHB architecture:
• Traditional shared bus with pipelining.
• Distinct data and address/control buses
• Transfers have a data phase and a control phase.
• Supports bursts as streams of single transactions.
• "Split/retry transfer" and "early burst termination" are used to cope with high-latency slaves.
STBus architecture:
• Protocol type 3: simple load/store operations, pipelining and split transactions, out-of-order support.
• Flexible topology: from shared bus to full crossbar
• Overlapping transfers: a new burst is requested while previous ones are still completing, without idle cycles.
• Fast arbitration in two cycles; the minimum latency is three cycles.

Experimental Result: Comparison of interconnect performance
Five interconnects:
• AMBA-AHB
• Shared-bus STBus
• Full crossbar STBus
• Partial crossbar STBus: ST-32
• Partial crossbar STBus: ST-54
Four benchmarks:
• Independent matrix multiplication: ASM-IND
• Pipelined matrix multiplication without OS: ASM-PIP
• Matrix multiplication with OS: OS-IND
• Pipelined matrix multiplication with OS: OS-PIP
[Figures: performance comparison]

MPSoC Clock and Power
Olivier Franza, Intel
• Increased uncertainty with process scaling
  – Process, voltage, temperature variations, noise, coupling
  – Affects design margin: over-design, power and performance loss
• Increased power constraints
  – Increasing leakage, power (density, delivery) limitations
• More transistors mean:
  – Larger clock distribution networks
  – Higher capacitance (more load and parasitics)
• With each new technology:
  – Gate delay decreases ~25%
  – Wire delay increases ~100%
  – Cross-chip communication increases
  – Clock needs multiple cycles to cover die
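
To see how quickly the gate/wire gap quoted above compounds, the following sketch applies the two per-generation factors (gate delay x0.75, wire delay x2.0) over several technology generations. The starting delays are normalized to 1.0, which is an assumption made only so that the ratio is easy to read; the slides give no absolute delays.

```python
# Compound the per-generation trends quoted on the slide:
# gate delay shrinks by ~25%, wire delay grows by ~100%.
# Starting values of 1.0 are normalized placeholders (assumption).

def delays_after(generations, gate_factor=0.75, wire_factor=2.0):
    """Return (gate_delay, wire_delay) relative to generation 0."""
    gate = gate_factor ** generations
    wire = wire_factor ** generations
    return gate, wire

for n in range(5):
    gate, wire = delays_after(n)
    print(f"gen {n}: gate={gate:.3f} wire={wire:.1f} wire/gate={wire / gate:.1f}")
```

Under these factors the wire-to-gate delay ratio grows by about 2.67x per generation, which is the reason the clock eventually needs several cycles to cross the die.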
Interconnect Delays & Density
Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology
• Multiple clocks due to interconnect limitations
• At reduced performance, larger resource size
• Noise in mixed-signal systems

• Multiple clock domains
  – Low skew and jitter are ALWAYS a must
  – Clock modeling requires more accuracy
  – Within-die variations, inductance, crosstalk, electromigration, self-heating, …
• Floor plan modularity
  – Think of adding/removing cores seamlessly!
• Hierarchical clock partitioning
  – Reduce the global clock and possibly relax its requirements
  – Generate "locally" used clocks "locally"
  – Implement clock domain deskewing techniques
  – Bound the clock problem into simple, reliable, efficient domains

DEC/Compaq Alpha
• More complex cores to improve performance, more complex clocks (?)
Source: DEC/Compaq – Gronowski et al., JSSC 1998 – Xanthopoulos et al., ISSCC 2001 – Barroso et al., ISCA 2000

Clock and Power Convergence
Intel® Itanium® Montecito
• Each core is split into 3 clock domains on a variable power supply
• Each domain is controlled by a Digital Frequency Divider (DFD) that generates low-skew, variable-frequency clocks; the DFDs are fed by a central PLL and aligned through phase detectors
• Regional Voltage Detector (RVD): supply voltage monitor
• Second-level clock buffer (SLCB): digitally controlled delay buffer for active deskewing
• Regional Active Deskew (RAD): phase comparators that monitor and adjust the delay difference between SLCBs
• Clock Vernier Device (CVD): digitally controlled delay buffer
• Clock generation and distribution are essential enablers of microprocessor performance

On-Chip Interconnects: Circuits and Signaling, Wayne Burleson
• Using Vdd programmability
  – High Vdd to devices on the critical path
  – Low Vdd to devices on non-critical paths
  – VddOff for inactive paths
• A: baseline fabric; B: fabric with Vdd-configurable interconnect
This work builds on a similar idea for FPGAs described in: Fei Li, Yan Lin and Lei He.
Vdd Programmability to Reduce FPGA Interconnect Power, IEEE/ACM International Conference on Computer-Aided Design, Nov. 2004

Why Reconfigurable System?
• Includes a GPP and reconfigurable h/w
• Goals: power reduction and flexibility
  1. Provide Quality of Service that follows a dynamic environment
  2. A flexible architecture that follows the evolution of algorithms
  3. Fewer platforms to develop and maintain
[Figure: tasks 1 to N, built from blocks A-E, H-J, and W-Z, mapped onto reconfigurable hardware]

Energy Efficiency of Reconfigurability
• System architecture
• Communication protocol
• O/S and applications
• Partitioning of functions between the wireless device and services on the network
• The mobiles must be flexible enough to accommodate a variety of multimedia services and communication capabilities and adapt to various operating conditions in an (energy-)efficient way

The need for S/W-configurable platforms
– Doing More by Doing Less: the ability to handle diverse standards is required (AM, FM, GSM, UMTS, digital broadcasting standards, analog and digital television, and other data links).
– A fully software-reconfigurable multichannel broadband sampling receiver for standards in the 100 MHz band

Granularity of reconfiguration
Sébastien PILLEMENT - ENSSAT/LASTI
• System-level reconfiguration: Lx, C62 (decomposition into clusters)
• Functional-level reconfiguration: Pleiades, RaPiD, DART (2001)
• Operator-level reconfiguration: Chameleon, PipeRench, MorphoSys (2000)
• Gate-level reconfiguration: Napa, GARP, FPGA

The grain size of operations in Reconfigurable System Architectures
• Fine-grained operations: multiply and addition
• Medium-grained operations: reconfigurable modules
• Coarse-grained operations: CPU, host

Design Space of Reconfigurable Architectures
RECONFIGURABLE ARCHITECTURES (R-SOC)
Lilian Bossuet, LESTER Lab, Université de Bretagne Sud, Lorient, France

FINE GRAIN (FPGA)
Island topology:
• Xilinx Virtex
• Xilinx Spartan
• Atmel AT40K
• Lattice ispXPGA
Hierarchical topology:
• Altera Stratix
• Altera Apex
• Altera Cyclone
MULTI GRANULARITY (Heterogeneous): Processor + Coprocessor
Coarse-grain coprocessor:
• Chameleon
• REMARC
• MorphoSys
• Pleiades
Fine-grain coprocessor:
• Garp
• FIPSOC
• Triscend E5
• Triscend A7
• Xilinx Virtex-II Pro
• Altera Excalibur
• Atmel FPSLIC
Tile-based architecture: aSoC, E-FPFA
COARSE GRAIN (Systolic)
Mesh topology: RAW, CHESS, MATRIX, KressArray, Systolix PulseDSP
Linear topology: Systolic Ring, RaPiD, PipeRench
Hierarchical topology: DART, FPFA

Digital Signal Processing With FPGAs
Paul Ekas, Jean-Charles Bouzigues

Multiplier Options In FPGAs
Option                 Resource                       Area Usage
1 Logic Multipliers    Logic Elements (Traditional)   500 LEs per 18x18 Multiplier
2 Hard Multipliers     DSP Blocks                     4 18x18 Multipliers per DSP Block
3 Soft Multipliers     RAM                            1 to 2 Embedded Memory Blocks

Logic Elements
• Smallest unit of logic
• Grouped into Logic Array Blocks (LABs) of ten LEs
• Features:
  – Four-input look-up table (LUT)
  – Configurable register
  – Dynamic add/subtract control
  – Carry-select chain
[Figure: Logic Array Block with control signals, local interconnect, and logic elements LE1 to LE10]

DSP Block: Optimized Hard MAC
[Figure: DSP block data path with input register unit, multipliers, optional pipelining, add/subtract/accumulate stages, output MUX, and output register unit]
9 Bit x 9 Bit                                   18 Bit x 18 Bit                              36 Bit x 36 Bit
8 Multiplies                                    4 Multiplies                                 1 Multiply
2 Multiplies with Accumulate                    2 Multiplies with Accumulate
2 Sum of 2 Multipliers (Complex Multipliers)    1 Sum of 2 Multipliers (Complex Multiply)
2 Sum of 4 Multiplies                           1 Sum of 4 Multiplies

Soft Multipliers: Lookup-Based Multiplication
• Use embedded RAM blocks as look-up tables (LUTs) for generating partial products
• Coefficient or sum-of-coefficients values are stored in RAM blocks
• The input addresses the multiplier table; partial products are shifted and added, MSB to LSB
Example: multiplication of a 5-bit input with a 13-bit coefficient
• All possible 18-bit results are stored in a 32*18 look-up table (one M512 block, 18-bit data output)
Multiplier table, with C = Coefficient[12:0]:
ADDRESS   MULT_RESULT
00000     0
00001     C
00010     2*C
00011     3*C
…         …
11111     31*C

Altera FPGA Memory Architectures
• Today's applications need more high-performance memory
• One size does not fit all
• Wide choice of modes and widths
• M512 blocks: rate changing, embedded shift register mode, operates up to 312 MHz, mixed clock mode
• M4K blocks: true dual-port RAM, embedded shift register mode, operates up to 312 MHz, mixed clock mode
• M-RAM: true dual-port RAM, embedded shift register mode, 512K bits, operates up to 300 MHz, mixed clock mode
• External memory devices: DDR SDRAM & SRAM, SDR SDRAM, QDR & QDRII SRAM, ZBT SRAM, DDR FCRAM
[Figure axes: more bits for larger memory buffering; more data ports for greater memory bandwidth]

Soft Multiplier: Sum of Multiplications
• Inputs arrive through 16-bit serial shift registers (sample 16-bit, coefficient 16-bit)
• The sum-of-multiplications table is held in M512 32*18 blocks whose outputs are added
• Example: FIR filter; memory: 2 M512
ADDRESS   MULT_RESULT
0000      0
0001      C0
0010      C1
0011      C0+C1
…         …
1111      C0+C1+C2+C3

Example: Direct Sequence Spread Spectrum (DSSS) Modem
DSSS Modem
• Five independent data channels spread to 3.84 Mcps
• Three-stage FIR interpolation-by-32
• Root-raised-cosine pulse shaping with 22% excess bandwidth
• 112 dB SFDR 15.36 MHz quadrature carriers
• 122.88 MSPS transmitter output with 5 MHz bandwidth and over 78 dB out-of-band rejection
• Automatic Gain Control (AGC) compensating for channel attenuation of up to 30 dB
• Costas loop carrier recovery
• 4x oversampling code synchronization
[Figure: data channels DCH0 to DCH4 pass through the DSSS modulator, the channel model, and the DSSS demodulator]

DSSS Modulator
Block diagram labels:
• Inputs DCH0 to DCH4 and PCH with channelization codes Cch,16,0; Cch,16,1; Cch,16,2; Cch,16,8; Cch,16,9; Cch,16,10
• SCH; length-256 Gold code spreader; gains gi, gq, K; Re[] and Im[] paths
• FIR1 (LPF): 2-channel 87-tap FIR filter, interpolation x2
• FIR2 (LPF): 2-channel 47-tap FIR filter, interpolation x4
• NCO: frequency resolution 0.03 Hz, SFDR 112 dB, Sin(wn)/Cos(wn) outputs, carrier phase increment
• FIR3 (RRC): 25-tap FIR filter,
  interpolation x4, excess BW 22%

DSSS Demodulator
Block diagram labels:
• FIR (Altera): RRC 31-tap FIR filter, excess BW 22%, fixed rate (two instances)
• AGC
• NCO: frequency resolution 0.03 Hz, SFDR 112 dB
• Gold code correlator with 4x oversampling; peak detector (pn_lock, max_index)
• Carrier recovery loop; free-running phase increment
• Buffer; I-Q derotate; Hadamard despreader
• Outputs: data channels 1…5, pilot output, pilot monitor

DSSS Modem Resources
Resource Usage Summary
Design Entity   Logic Elements   M512 RAM   M4K RAM   Mega RAM   DSP Block Elements
Modulator       9943             1          8         0          12
Demodulator     12196            60         8         1          60

Power Usage Estimates
Power                                 mW
Total Standby Internal Power          75
Total Logic Element Internal Power    283
Total Clocktree Internal Power        175
Total DSP Internal Power              23
Other Internal Power                  92
Total Power                           505

FIR Filter Example*: 16X Cost/Performance Improvement
Device           Solution                 FIR Performance (MHz)   Device Cost****   Cost per FIR MHz
TI C6713-200     64 cycles** @ 200MHz     3.125                   $24.59            $7.87
TI C6416-600     32 cycles** @ 600MHz     18.75                   $160              $8.53
Altera 1C3-8     8 cycles*** @ 230MHz     28.75                   $14               $0.49
Altera 1C12-8    1 cycle*** @ 170MHz      170                     $84               $0.49
* FIR 128 tap, 16-bit data, 14-bit coefficients
** DSPLib Optimized Assembly Libraries from Texas Instruments
*** MegaCore Optimized FIR Compiler from Altera
**** Pricing in quantity of 100 at Arrow 6/25/03

Reconfigurable video processor for SDRAM access optimization (Henriss, Ernst et al.)
Reconfigurable video platform
· SDRAM memory-centered design
· FPGA-based scheduler merges different streams and random accesses; exploits the SDRAM bank structure
· Supports 2 HDTV streams at 1.48 Gbit/s each plus DSP and filter unit access
· Reaches 700 MByte/s in a practical application for a 4-byte SDRAM memory word
· Extremely cost-efficient design
· Used in a professional video product line

Fine-Grained RSOCs: Triscend A7 CSOC
A7 Family
• 32-bit ARM7 with 8 kB cache
• 3200 logic cells max. (40K gates)
• Up to 3800 FFs
• Up to 300 programmable I/O pins
www.triscend.com

Coarse-Grained RSOCs
Chameleon Structure (2000)
• Design a battery-powered personal mobile computing device that has multimedia functionality and can operate in a dynamic environment.
• Do just enough and not too much for a given task (QoS)
• 32-bit ARC control processor
• Up to 84 32-bit Datapath Units (DPU = a 32-bit ALU + a 32-bit barrel shifter)
• Up to 24 16x24-bit multipliers
• Up to 48 128x32-bit local memory modules
• Up to 160 programmable I/O pins
• Targeted at 3rd-generation wireless base stations, wireless local loop, SW radio, etc.
Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters
www.chameleonsystems.com

Field Programmable Function Array
The FPFA concept has a number of advantages:
• The FPFA has a highly regular organisation
• We use a general-purpose processor core
• Its scalability stands in contrast to the dedicated chips designed nowadays
• The FPFA can do media processing tasks such as compression/decompression efficiently

Field Programmable Function Array: Processor tiles
• A tile consists of five identical blocks, which share a control unit and a communication unit
• An individual block contains an ALU, two memories, and four register banks of four 20-bit-wide registers
• A crossbar switch provides flexible routing between the ALUs, registers, and memories
• This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the finite impulse response filter
[Figure: processor tile with ten memories (M), crossbar, registers, and five ALUs]

DSP System Architecture Options
[Figure: performance (MMACs/sec) for a stand-alone processor, a processor array, a processor + co-processor, and a dedicated hardware architecture]

Optional Coprocessor Mappings
• Processor on FPGA (with memory): Nios, ARM (AHB)
• Processor external to FPGA: TI c6x (EMIF), Mot PPC (MPX), Mot Starcore (MPX, AHB), Intel 2850 (PCI Express), ARM (AHB), …
Mapping of DSP Algorithms on the FPFA: Fast Fourier Transform
• The FFT recursively divides a DFT into smaller DFTs
[Figure: recursion of a radix-2 FFT with 8 inputs, from the N=8 DFT down to N=2 DFTs]
[Figure: the radix-2 FFT butterfly with inputs a and b, twiddle factor W, and sum and difference outputs]

Mapping of DSP Algorithms on the FPFA: Five-tap finite impulse response filter
[Figure: FIR mapping through the crossbar with coefficients h4 to h0, level 2, blocks 1 to 5, and output O]

MorphoSys (1999)
Reconfigurable cell (RC) array
• Array of reconfigurable cells
• 64 cells in a 2-D matrix
• SIMD model
• Cells in the same row (column) share a configuration
• Each RC operates on different data
TinyRISC (Cont'd)
Implementation & Performance
• 0.35 micron technology
• 4 metal layers
• Operation at 100 MHz
• 170 mm2
Motion Estimation
• Block size: 16x16 pixels; image size: 352x288 pixels

Lx by STMicroelectronics

DART, Raphael David, IRISA/ENSSAT
With STMicroelectronics, UBO univ.
• Reconfigurable multigrain = DPR + FPGA
• Dynamic reconfiguration
• Low power consumption
• Hierarchical distribution of resources
• SCMD (Single Configuration Multiple Data)
• 11 GOPS/cluster
• 1.6 GMACS/cluster
• 0.64 W @ 11 GOPS
• 16 MIPS/mW @ 11 GOPS
• 0.18u CMOS

DART Cluster
[Figure: cluster architecture with control, DPR1 to DPR6, DMA ctrl, config mem., FPGA, data mem, and segmented network]
[Figure: DPR architecture with loop management, global bus, address generators AG1 to AG4, data mem1 to mem4, multibus network, reg1, reg2, MUL1, ALU1, MUL2, ALU2]
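
The radix-2 recursion and butterfly from the FPFA mapping slides above can be sketched in software as follows. This is a plain decimation-in-time formulation chosen for illustration; the slides do not say which variant the FPFA mapping uses, and the function names `fft_radix2` and `dft` are hypothetical. The direct O(N^2) `dft` is included only as a reference for checking the result.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # DFT of even-indexed samples (size N/2)
    odd = fft_radix2(x[1::2])    # DFT of odd-indexed samples (size N/2)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W
        a, b = even[k], w * odd[k]
        out[k] = a + b                # butterfly sum output
        out[k + n // 2] = a - b       # butterfly difference output
    return out

samples = [1, 2, 3, 4, 0, 0, 0, 0]
spectrum = fft_radix2(samples)
```

For the 8-input case on the slide, the recursion splits the N=8 DFT into two N=4 DFTs and then four N=2 DFTs, and each loop iteration is one butterfly: one complex multiply by W followed by an add and a subtract.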