Reconfigurable Computation and Communication Architectures
조준동

Outline
• Why Reconfigurable System?
• The Need for a S/W-Configurable Platform
• Design Space of Reconfigurable Architectures
• New Taxonomy/Metric for RA
• Reconfigurable Radio and Multimedia Systems
• Network-centric Design: Clock and Power
• Reliable Design
• Technology Evolution

Why Reconfigurable System?
• Includes a GPP and reconfigurable H/W
• Goals: power reduction and flexibility
  1. Provide Quality of Service that follows a dynamic environment
  2. A flexible structure that follows the evolution of algorithms
  3. Fewer platforms to develop and maintain
[Figure: Task 1 … Task N mapped onto reconfigurable hardware]

Energy Efficiency of Reconfigurability
• System architecture
• Communication protocol
• O/S and applications
• Partitioning of functions between the wireless device and services on the network
Mobiles must be flexible enough to accommodate a variety of multimedia services and communication capabilities, and to adapt to various operating conditions in an (energy-)efficient way.

The Spectrum of Solutions
[Figure: flexibility versus efficiency for an application: general-purpose processor (e.g. Pentium), reconfigurable architecture, ASIC]

The Need for a S/W-Configurable Platform
– Doing More by Doing Less: the ability to handle a variety of standards is required (AM, FM, GSM, UMTS, digital broadcasting standards, analog and digital television, and other data links).
– A fully software-reconfigurable multichannel broadband sampling receiver for standards in the 100 MHz band

Semiconductor Revolutions
"Mainstream Silicon Application is switching every 10 Years"
[Figure: timeline 1957, 1967, 1977, 1987, 1997, 2007 with labels: TTL; custom LSI, MSI; standard µproc., memory; ASICs, accel's; reconfigurable FPGAs; coarse grain; hardware versus software]

SoC Platform Adaptation
SoC Design Process: Combined & Incremental Synthesis

Granularity of Reconfiguration (Sébastien PILLEMENT, ENSSAT/LASTI)
• System-level reconfiguration: Lx, C62 (decomposition into clusters)
• Functional-level reconfiguration: Pleiades, RaPiD, DART (2001)
• Operator-level reconfiguration: Chameleon, PipeRench, MorphoSys (2000)
• Gate-level reconfiguration: Napa, GARP, FPGA

The Grain Size of Operations in Reconfigurable System Architectures
• Fine-grained operations: multiply and addition
• Medium-grained operations: reconfigurable modules
• Coarse-grained operations: CPU, host

Design Space of Reconfigurable Architectures
RECONFIGURABLE ARCHITECTURES (R-SOC)
Lilian Bossuet, LESTER Lab, Université de Bretagne Sud, Lorient, France

FINE GRAIN (FPGA)
  Island Topology
  • Xilinx Virtex
  • Xilinx Spartan
  • Atmel AT40K
  • Lattice ispXPGA
  Hierarchical Topology
  • Altera Stratix
  • Altera Apex
  • Altera Cyclone

MULTI GRANULARITY (Heterogeneous)
  Processor + Coprocessor: Coarse Grain Coprocessor
  • Chameleon
  • REMARC
  • Morphosys
  • Pleiades
  Processor + Coprocessor: Fine Grain Coprocessor
  • Garp
  • FIPSOC
  • Triscend E5
  • Triscend A7
  • Xilinx Virtex-II Pro
  • Altera Excalibur
  • Atmel FPSLIC
  Tile-Based Architecture
  • aSoC
  • E-FPFA

COARSE GRAIN (Systolic)
  Mesh Topology
  • RAW
  • CHESS
  • MATRIX
  • KressArray
  • Systolix PulseDSP
  Linear Topology
  • Systolic Ring
  • RaPiD
  • PipeRench
  Hierarchical Topology
  • DART
  • FPFA

Fine-Grained RSOCs: Xilinx Virtex II-Pro
Xilinx, Inc., San Jose, CA
• Up to 4 PowerPC 405 Processor Cores
• Up to 160k Reconfigurable Logic Cells (4-i/p 1-o/p Lookup Table)
• Up to 216 18-bit x 18-bit Dedicated Multipliers
• Up to 216 18-kbit On-Chip Distributed Memory
  Blocks
• Up to 852 I/O Pins
www.xilinx.com

Fine-Grained RSOCs: Altera Stratix
• 1.5-V, 0.13-µm all-layer-copper SRAM process, with densities ranging from 10,570 to 114,140 LEs
• 28 digital signal processing (DSP) blocks with up to 224 embedded multipliers

Digital Signal Processing With FPGAs (Paul Ekas, Jean-Charles Bouzigues)

Multiplier Options in FPGAs for DSP Processing
Option  Type               Resource                       Area Usage
1       Logic Multipliers  Logic Elements (Traditional)   500 LEs per 18x18 Multiplier
2       Hard Multipliers   DSP Blocks                     4 18x18 Multipliers per DSP Block
3       Soft Multipliers   RAM                            1 to 2 Embedded Memory Blocks

Logic Elements
• Smallest Unit of Logic
• Grouped into Logic Array Blocks (LABs) of Ten LEs
• Features:
  – Four-Input Look-Up Table (LUT)
  – Configurable Register
  – Dynamic Add/Subtract Control
  – Carry-Select Chain Logic
[Figure: Logic Array Block: LE1 to LE10 on the local interconnect, with control signals]

DSP Block: Optimized Hard MAC
[Figure: DSP block datapath: input register unit, multipliers, add/subtract/accumulate stage, optional pipelining, output MUX, output register unit]
Mode                          9 Bit x 9 Bit   18 Bit x 18 Bit   36 Bit x 36 Bit
Multiplies                    8               4                 1
Multiplies with Accumulate    2               2
Sum of 2 Multiplies           2               1 (Complex Multiply)
Sum of 4 Multiplies           2               1

Soft Multipliers: Lookup-Based Multiplication
• Use Embedded RAM Blocks as Look-Up Tables (LUTs) for Generating Partial Products
• Coefficient or Sum of Coefficients Values Stored in RAM Blocks
• MSB Partial Product Shifted & Added to LSB Partial Product

Example: Multiplication of 5-Bit Input with 13-Bit Coefficient
• Multiplier table: all possible 18-bit results stored in a 32*18 look-up table (one M512 block, 5-bit address)
ADDRESS    MULT_RESULT
00000      0
00001      C
00010      2*C
00011      3*C
…          …
11111      31*C
(Data output; C = Coefficient[12:0])

Altera FPGA Memory Architectures
• Today's applications need more high-performance memory
• One size does not fit all
• Wide choice of modes and widths
M512 Blocks
  – Rate Changing
  – Embedded Shift Register Mode
  – Operates Up to 312 MHz
  – Mixed Clock Mode
M4K Blocks
  – True Dual Port RAM
  – Embedded Shift Register Mode
  – Operates Up to 312 MHz
  – Mixed Clock Mode
M-RAM (512K bits)
  – True Dual Port RAM
  – Embedded Shift Register Mode
  – Operates Up to 300 MHz
  – Mixed Clock Mode
External Memory Devices
  – DDR SDRAM & SRAM
  – SDR SDRAM
  – QDR & QDRII SRAM
  – ZBT SRAM
  – DDR FCRAM
[Figure axes: More Bits for Larger Memory Buffering; More Data Ports for Greater Memory Bandwidth]

Soft Multiplier: Sum of Multiplications
Example: FIR Filter (Sample 16-Bit, Coefficient 16-Bit); Memory: 2 M512
[Figure: 16-bit serial shift registers address two M512 (32*18) sum-of-multiplications tables; the table outputs are added to form the output]
ADDRESS    MULT_RESULT
0000       0
0001       C0
0010       C1
0011       C0+C1
…          …
1111       C0+C1+C2+C3

Example: Direct Sequence Spread Spectrum (DSSS) Modem
• Five Independent Data Channels Spread to 3.84 Mcps
• Three-Stage FIR Interpolation-by-32
• Root-Raised Cosine Pulse Shaping with 22% Excess Bandwidth
• 112 dB SFDR 15.36 MHz Quadrature Carriers
• 122.88 MSPS Transmitter Output with 5 MHz Bandwidth & Over 78 dB Out-of-Band Rejection
• Automatic Gain Control (AGC) Compensating for Channel Attenuation of up to 30 dB
• Costas Loop Carrier Recovery
• 4x Oversampling Code Synchronization
[Figure: DCH0 to DCH4 into the DSSS Modulator, then Channel Model, then DSSS Demodulator, out to DCH0 to DCH4]

DSSS Modulator (block diagram labels)
• Channelization codes: DCH0 Cch,16,0; DCH1 Cch,16,1; DCH2 Cch,16,2; DCH3 Cch,16,8; DCH4 Cch,16,9; PCH Cch,16,10
• SCH; Length 256 Gold Code Spreader; gains gi, gq; Re[] and Im[] branches
• FIR1 LPF: 2-Channel 87-Tap FIR Filter, Interpolation x2
• FIR2 LPF: 2-Channel 47-Tap FIR Filter, Interpolation x4
• NCO: Frequency Resolution 0.03 Hz, SFDR 112 dB; Sin(wn) and Cos(wn) from the Carrier Phase Increment
• FIR3 RRC
  25-Tap FIR Filter, Interpolation x4, Excess BW 22%

DSSS Demodulator (block diagram labels)
• Altera FIR: RRC 31-Tap FIR Filter, Excess BW 22%, Fixed Rate (two instances)
• AGC
• NCO: Frequency Resolution 0.03 Hz, SFDR 112 dB
• Gold Code Correlator, 4x Oversampling; Peak Detector (pn_lock, max_index)
• Carrier Recovery Loop with Free-Running Phase Increment; I-Q Derotate
• Buffer; Hadamard Despreader; Data Channels Output 1…5
• Pilot Output; Pilot Monitor

DSSS Modem Resources
Resource Usage Summary
Design Entity   Logic Elements   M512 RAM   M4K RAM   Mega RAM   DSP Block Elements
Modulator       9943             1          8         0          12
Demodulator     12196            60         8         1          60

Power Usage Estimates
Power                                 mW
Total Standby Internal Power          75
Total Logic Element Internal Power    283
Total Clocktree Internal Power        175
Total DSP Internal Power              23
Other Internal Power                  92
Total Power                           505

FIR Filter Example*: 16X Cost/Performance Improvement
Device           Solution                  FIR Performance (MHz)   Device Cost****   Cost per FIR MHz
TI C6713-200     64 cycles** @ 200 MHz     3.125                   $24.59            $7.87
TI C6416-600     32 cycles** @ 600 MHz     18.75                   $160              $8.53
Altera 1C3-8     8 cycles*** @ 230 MHz     28.75                   $14               $0.49
Altera 1C12-8    1 cycle*** @ 170 MHz      170                     $84               $0.49
*    FIR 128 Tap, 16 bit data, 14 bit coefficients
**   DSPLib Optimized Assembly Libraries from Texas Instruments
***  MegaCore Optimized FIR Compiler from Altera
**** Pricing in quantity of 100 at Arrow 6/25/03

DSP System Architecture Options
[Figure: performance (MMACs/sec) across four options: Stand-Alone Processor, Processor Array, Processor + Co-Processor, Dedicated Hardware Architecture]

Optional Coprocessor Mappings
Processor on FPGA
• Nios
• ARM (AHB)
Processor External to FPGA (FPGA, processor and memory as separate devices)
• TI c6x (EMIF)
• Mot PPC (MPX)
• Mot Starcore (MPX, AHB)
• Intel 2850 (PCI Express)
• ARM (AHB)
• …

Fine-Grained RSOCs: Triscend A7 CSOC
• A7 Family
• 32-bit ARM 7 with 8 kB Cache
• 3200 logic cells max. (40K gates)
• Up to 3800 FFs
• Up to 300 Prog.
  I/O pins
www.triscend.com

Coarse-Grained RSOCs: Chameleon Structure (2000)
Design a battery-powered personal mobile computing device that has multimedia functionality and can operate in a dynamic environment.
– Do just enough and not too much for a given task (QoS)
• 32-bit ARC control processor
• Up to 84 32-bit Datapath Units (DPU = a 32-bit ALU + a 32-bit barrel shifter)
• Up to 24 16x24-bit multipliers
• Up to 48 128x32-bit local memory modules
• Up to 160 Prog. I/O pins
• Targeted at 3rd-gen. wireless base stations, wireless local loop, SW radio, etc.
Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters; www.chameleonsystems.com

Field Programmable Function Array of Chameleon Structure
Paul M. Heysters, Jaap Smit, Gerard J.M. Smit, Paul J.M. Havinga
• An FPFA consists of interconnected processor tiles
• Multiple processes can coexist in parallel on different tiles
• Within a tile, multiple data streams can be processed in parallel
• Each processor tile contains multiple reconfigurable ALUs, local memories, a control unit and a communication unit

Field Programmable Function Array
The FPFA concept has a number of advantages:
• The FPFA has a highly regular organisation
• It uses a general-purpose processor core
• Its scalability stands in contrast to the dedicated chips designed nowadays
• The FPFA can do media processing tasks such as compression/decompression efficiently

Field Programmable Function Array: Processor Tiles
• A tile consists of five identical blocks, which share a control unit and a communication unit
• An individual block contains an ALU, two memories and four register banks of four 20-bit-wide registers
• A crossbar switch provides flexible routing between the ALUs, registers and memories
• This structure is convenient for the Fast Fourier Transform (6-input, 4-output) and the finite impulse response (FIR) filter
[Figure: processor tile: ten memories (M), crossbar, registers, five ALUs]

Mapping of DSP Algorithms on the FPFA: Fast Fourier Transform
• FFT recursively divides a DFT into smaller
  DFTs
[Figure: recursion of a radix-2 FFT with 8 inputs (N=8 down to DFTs with N=2)]
[Figure: the radix-2 FFT butterfly, with inputs a and b, twiddle factor W, and sum and difference outputs]

Mapping of DSP Algorithms on the FPFA: Fast Fourier Transform
• Butterfly inputs: a_re, a_im, b_re, b_im, W_re, W_im
• Butterfly outputs: A_re, A_im, B_re, B_im (intermediate Z_re, Z_im)
• Crossbar switch: at least 6 buses
[Figure: butterfly mapped onto ALUs 1 to 4 over levels 2 and 3]

Mapping of DSP Algorithms on the FPFA: Five-Tap Finite Impulse Response Filter
[Figure: coefficients h4, h3, h2, h1, h0 routed through the crossbar to ALUs 1 to 5 (level 2), producing output O]

MorphoSys (1999)
Reconfigurable Cell (RC) Array
• Array of reconfigurable cells
• 64 cells in a 2-D matrix
• SIMD model
• Cells in the same row (column) share a configuration
• Each RC operates on different data
TinyRISC (Cont'd)
Implementation & Performance
• 0.35 micron technology
• 4 metal layers
• Operation at 100 MHz
• 170 mm2
Motion Estimation
• Block size: 16x16 pixels; image size: 352x288 pixels

Lx from STMicroelectronics

DART (Raphael David, IRISA/ENSSAT, with STMicroelectronics and UBO univ.)
• Reconfigurable multigrain = DPR + FPGA
• Dynamic reconfiguration
• Low power consumption
• Hierarchical distribution of resources
• SCMD (Single Configuration Multiple Data)
• 11 GOPS/cluster
• 1.6 GMACS/cluster
• 0.64 W @ 11 GOPS
• 16 MIPS/mW @ 11 GOPS
• 0.18 µm CMOS

DART Cluster
[Figure: cluster architecture: Control, DPR1 to DPR6, DMA ctrl, Config mem., FPGA, Data mem, segmented network]
[Figure: DPR architecture: loop management, global bus, AG1 to AG4, Data mem1 to Data mem4, multibus network, reg1, reg2, MUL1, ALU1, MUL2, ALU2]

Reconfigurable architectures (Rabaey et al.)
· reconfiguration: change of hardware structure in the field
· approaches at the logic level (FPGA) or at the function block level
· dynamic change of specialization
· example: PLEIADES template for low power systems

The Re-configurable Terminal
Satellite Processors
Elements of Energy-Efficiency
Communication Network
Distributed Data-Driven Control
Execution of a hardware module is triggered by the arrival of tokens.
When there are no tokens to be processed at a given module, no switching activity occurs in that module.

Design Methodology

Multi-DSP Tree Structure (A. K. Salkintzis, N. Hong and P. T. Mathiopoulos)
Multi-DSP Network Structure
• Data traffic is reduced with each connection
[Figure labels: Multiplexing & Burst Construction, Modulation, Encryption, Interleaving, Channel Coding, CRC insertion, Data Processing, Sequencer, Spreading, Rate matching, Channelization, Radio Resource, Equalization, Segmentation]

Reconfigurable Radios

SDR Configuration (Soft Radio Digital Signal Processing Engine)
• Digital Down/Up Conversion (DDC)
  – Channel Center
  – Decimation/Interpolation rates
  – Compensation Filters
  – Matched Filter α = {0.25, 0.35, ...}
• FEC
  – Convolutional
  – Reed-Solomon
  – Concatenated Coding
  – Turbo CC/PC
  – (De-)Interleave
• Beam Forming
• Security
• Modulation Format
  – QPSK
  – DQPSK
  – π/4 DQPSK
  – {16,64,256,1024} QAM
  – OFDM
  – OFDM CDMA
• Channel Access
  – CDMA
  – TDMA
• DSSS
  – Rake, track, acquire
  – Multi User Detect. (MUD)
  – ICU
• Network Interface Definition

Key Software Radio Components
[Figure labels, signal chain: Multibeam Antenna Array; Multiband RF Conversion; Spectral Purity; Wideband A/D & D/A Conversion; IF Processing; Modulator; Demodulator; Bitstream Processing; Transmit; Receive; Larger Network]
[Figure labels, design and control: Environment Characterization; WB Digital DSP & Software Design; Advanced Control; Service Quality; Time to Market; Isochronism; Throughput; Response Time]
[Figure labels, adaptation: On Line Adaptation (SNR/BER optimization, data rate adaptation, interference suppression, band/mode selection); Off Line Support (development, optimization, over-the-air delivery)]

SDR Architecture
Isao TESHIMA, Hitachi Kokusai Electric Inc., teshima.isao@h-kokusai.com
[Figure: SDR block diagram. RF unit: Rx SYN, Tx SYN, LNA, PA, EX., RX, TX, Receive/Transmit. Signal processing/control unit and Input/Output on the C-PCI bus. Further labels: Control, EX.]
[Figure labels, continued: Interface, PA, Baseband MODEM, Quadrature MODEM, Data converter, Receive/Transmit, HMI Terminal]

Specification of Prototype
Signal processing     FPGA: Quadrature MODEM; DSP: Baseband MODEM
FPGA                  XCV2000E x 3
DSP                   TMS320C6701 x 4
CPU                   Control module: Celeron; Peripheral module
System bus            cPCI
Operating system      Linux
HMI                   Operates from web browser
Interface             Audio I/O, Serial I/O, Ethernet (100BASE-TX)

Specification of Prototype
RF range                    2~500 MHz
Waveform                    SSB, AM, FM, BPSK, QPSK, 8PSK, 16QAM
Number of channels          Four full-duplex
Radio relay                 Repeat/Bridge
Frequency accuracy          <0.1 ppm
Rx IF frequency             70 MHz
Tx IF frequency             25 MHz
Dynamic range               14 bits
Rx IF sampling frequency    40 MHz
Tx IF sampling frequency    100 MHz

PACT's SDR XPP (Martin Vorbach, PACT XPP Technologies, Germany)
• U-P vs XPP
• A SDR/Multimedia Solution

Reconfigurable video processor for SDRAM access optimization (Henriss, Ernst et al.)
Reconfigurable video platform
· SDRAM memory centered design
· FPGA-based scheduler merges different streams and random accesses, exploiting the SDRAM bank structure
· supports 2 HDTV streams at 1.48 Gbit/s each plus DSP and filter unit access
· reaches 700 MByte/s in practical application for a 4-byte SDRAM memory word
· extremely cost-efficient design
· used in a professional video product line

Nexperia™ DVP hardware architecture (source: Th.
Claasen, Philips, DAC 2000)

New Taxonomy/Metric
Flynn: triple (d, i, c)
  d: # of data streams
  i: # of instruction streams
  c: # of configuration states
  SISD, SIMD, MIMD, MISD
RA: (c, g, a)
  c: configurability to various environments
  g: size of granularity
  a: adaptability to various components
  SCSG, SCMG, SCLG, MCSG, MCMG, MCLG

Systolic Ring
• Based on a coarse-grained configurable PE (Dnode)
• Circular datapaths
• C: # of layers (C = 4); N: # of Dnodes per layer (N = 2); S: # of rings (S = 1)
• Control units (sequencers)
  – Local Dnode unit
  – Local Ring unit
  – Global unit
[Figure: ring of layers 1 to 4, two Dnodes per layer, around the local ring sequencer]

Remanence
$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$
  $N_{PE}$: # of processing elements (PE)
  $N_c$: # of PEs configurable per cycle
  $F_e$: operating frequency
  $F_c$: configuration frequency
• Characterizes the dynamism
• # of cycles to (re)configure the whole architecture
• Amount of data to compute between 2 configurations
[Figure: architectural model: configuration memory (inst0 … instn) and sequencing unit at $F_c$; processing elements at $F_e$; interconnection and routing]

Operative Density
$OD(N_{PE}) = \frac{N_{PE}}{A(N_{PE})}$
  $N_{PE}$: # of PEs
  $A$: core area (relative unit λ²)
• Area can be expressed as a function of $N_{PE}$
[Figure: same architectural model as above]

Remanence Formalisation
• # of layers: C = 8; # of Dnodes per layer: N = 2; 1 Systolic Ring: S = 1
• $R(N_{PE}) = k \cdot N_{PE}$, $k = C/N$
[Figure: remanence (0 to 40) versus # Dnodes (0 to 180) for a ring of layers 1 to 8]

Architectural Model Characterization
• # of layers: 4 (C = 4); # of Dnodes per layer: 2 (N = 2); 4 Systolic Rings (S = 4)
• Control units
  – Local Dnode unit
  – Local Ring unit
  – Global unit
• www.qstech.com
[Figure: global sequencer driving four local ring sequencers]

Design Space
• Best OD and remanence; worst interconnect resources and processing
  power
[Figure: operative density (0 to 0,040) and remanence (0 to 20) versus # Dnodes (0 to 140), for S = 1, 2, 4, 8]

Design Space
• Worst OD and remanence; best interconnect resources and processing power
[Figure: operative density (0 to 0,040) and remanence (0 to 20) versus # Dnodes (0 to 140), for S = 1, 2, 4, 8]

Comparisons of RA (Pascal BENOIT)
$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$
Name            Type              N_PE   N_c    F (MHz)   R
ARDOISE         Fine Grain RA     2304   0.14   33        16457
MorphoSys       Coarse Grain RA   128    16     100       8
Systolic Ring   Coarse Grain RA   24     4      200       6
DART            Coarse Grain RA   24     4      130       6
TMS320C62       DSP VLIW          8      8      300       1
1. Only 1 cycle to (re)configure the DSP
2. Few cycles to (re)configure coarse grain RA (8)
3. Many cycles to (re)configure fine grain RA

MPSoC Clock and Power (Olivier Franza, Intel)
• Increased uncertainty with process scaling
  – Process, voltage, temperature variations, noise, coupling
  – Affects design margin: over-design, power & performance loss
• Increased power constraints
  – Increasing leakage, power (density, delivery) limitations
• More transistors mean:
  – Larger clock distribution networks
  – Higher capacitance (more load and parasitics)
• With each new technology:
  – Gate delay decreases ~25%
  – Wire delay increases ~100%
  – Cross-chip communication increases
  – Clock needs multiple cycles to cover die

Interconnect Delays & Density (Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology)

Multiple Clocks due to Interconnect Limitation
• At reduced performance, larger resource size

Noise in Mixed Signal Systems

• Multiple clock domains
• Low skew and jitter ALWAYS a must
• Clock modeling requires more accuracy
  – Within-die variations, inductance, crosstalk, electromigration, self-heat, …
• Floor plan modularity
  – Think adding/removing cores seamlessly!
• Hierarchical clock partitioning
  – Reduce global clock and possibly relax its requirements
  – Generate “locally”-used clock “locally”
  – Implement clock domain deskewing techniques
  – Bound clock problem into simple, reliable, efficient domains

DEC/Compaq Alpha
• More complex core to improve performance, more complex clocks (?)
Source: DEC/Compaq; Gronowski et al., JSSC 1998; Xanthopoulos et al., ISSCC 2001; Barroso et al., ISCA 2000

Clock and Power Convergence: Intel® Itanium® Montecito
• Each core split into 3 clock domains on variable power supply
• Each domain controlled by a Digital Frequency Divider (DFD) generating low-skew variable-frequency clocks; fed by a central PLL and aligned through phase detectors
• Regional Voltage Detector (RVD): supply voltage monitor
• Second level clock buffer (SLCB): digitally controlled delay buffer for active deskewing
• Regional Active Deskew (RAD): phase comparators monitoring and adjusting delay difference between SLCBs
• Clock Vernier Device (CVD): digitally controlled delay buffer
Clock generation and distribution are essential enablers of microprocessor performance.

On-Chip Interconnects: Circuits and Signaling (Wayne Burleson)
• Using Vdd programmability
  – High Vdd to devices on critical path
  – Low Vdd to devices on non-critical paths
  – VddOff for inactive paths
A: Baseline Fabric
B: Fabric with Vdd Configurable Interconnect
This work builds on a similar idea for FPGAs described in: Fei Li, Yan Lin and Lei He. Vdd Programmability to Reduce FPGA Interconnect Power, IEEE/ACM International Conference on Computer-Aided Design, Nov. 2004

From Spaghetti Wires to NoC (Marcello Coppola, STMicroelectronics)
Benchmarks, EE Times, 7/2005
• Xpipes, Bologna and Stanford: compared w/ AMBA AHB multilayer bus, 21% faster, but worse latency
• Wehn, Univ. of Kaiserslautern: LDPC decoder: 500 MHz vs 64 MHz (fixed bus), but 30 W vs. 700 mW, twice the die size
• Arteris: better die size, comparable power consumption, 740 MHz (250 MHz)
• SonicsMX: power-efficient mobile handset w/ power management
• STNoC, Spidergon: topology w/ degree 2-3

NoC Applications (http://www.eit.uni-kl.de/wehn)
• Turbo decoder, UMTS compliant, 100 Mbit: large flexibility w/ 14 parallel units, area = 16.84 mm2 (14 mm2 PUs, 2.8 mm2 NoC)
• LDPC decoding: T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin, Int. Conference on VLSI Design 2005
  – 1024-bit block size, 1.2 Gb/s, R = 0.75
  – NoC: 5x5 2D mesh, dimension-order routing, large flexibility
  – 160 nm CMOS technology, 1.8 V, 500 MHz, 110 mm2, ~30 W

Reliable Design (G. De Micheli)
1. Manufacturing imperfections: more likely to happen as lithography scales down
2. Approximations during design: uncertainty about details of design
3. Aging: oxide breakdown, electromigration
4. Environment-induced: soft errors (data corruption due to external radiation exposure), electromagnetic interference
5. Operating-mode induced: extremely low voltage supply

Dealing with Variability
• Most variability problems that induce timing errors:
  1. Power supply variation
  2. Wire length estimation
  3. Crosstalk
  4. Soft errors

Adaptive Low-Power Transmission Scheme
Frédéric Worm, Patrick Thiran, Giovanni De Micheli, and Paolo Ienne. Self-calibrating Networks-on-Chip. In Proceedings of the IEEE International Symposium on Circuits and Systems, Kobe, Japan, May 2005.
• Reduced energy consumption
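
Worked example: dimension-order routing on a 5x5 mesh
The LDPC decoder listed under NoC Applications uses a 5x5 2D mesh with dimension-order routing. As a rough illustration of that routing style, and not a description of the cited design, the Python sketch below moves a packet fully along X first and then along Y. The function name `xy_route`, the (x, y) coordinate convention and the X-before-Y order are assumptions made for this example.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) position of a router in the mesh (assumed convention)


def xy_route(src: Node, dst: Node, width: int = 5, height: int = 5) -> List[Node]:
    """Return the routers visited from src to dst under XY dimension-order routing."""
    for x, y in (src, dst):
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"node {(x, y)} lies outside the {width}x{height} mesh")

    x, y = src
    path = [src]
    # Phase 1: move along the X dimension until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: move along the Y dimension until the destination row is reached.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (3, 2))
    print(route)                    # routers visited, source and destination included
    print("hops:", len(route) - 1)  # hop count equals the Manhattan distance
```

Because every packet finishes its X moves before turning into Y, the path between two routers is fixed and its hop count equals their Manhattan distance; corner to corner in a 5x5 mesh is 8 hops.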