A Software Radio Architecture for Linear Multiuser Detection
I. Seskar and N. B. Mandayam
Wireless Information Network Lab., Rutgers Univ.
IT-SOC 2002 © Smart Mobile Computing Lab

What is SDR (Software Defined Radio)?
- A collection of hardware and software technologies that enable reconfigurable system architectures for wireless networks and user terminals
- An efficient and comparatively inexpensive solution to the problem of building multi-mode, multi-band, multi-functional wireless devices that can be enhanced through software upgrades
- An enabling technology that is applicable across a wide range of areas within the wireless industry

SDR benefits
- Standard architecture for a wide range of communications products
- Non-restrictive wireless roaming for consumers, by extending the capabilities of current and emerging commercial air-interface standards
- Uniform communication across commercial, civil, federal and military organizations
- Potential for significant life-cycle cost reductions
- Over-the-air downloads of new features and services as well as software patches
- Advanced networking capabilities to allow truly "portable" networks

The Software Radio Node Architecture
- Defining the Software Radio Architecture: Functional Architecture; Physical Architecture Components; Resource Estimation and Management; Software Architecture; Software Tools
- Architecture Migration: Embedded DSP; Multimode Radios; Multiband Multimode Conventional Radios; Speakeasy I & II, The Military Software Radio; Integrated Architectures; Case Studies in the Evolution of the Software Radio

Software Radio "Phase Space" (figure)

Hardware Software Mix (figure)

Functional architecture of a software radio for linear multiuser detection (figure)

Logical partitioning of functionality in a software radio receiver for linear multiuser detection (figure)
Block diagram of software radio implementation (TMS320C40 DSP) (figure)

BER range vs SNR (figure)

BER range vs number of users (figure)

Conclusion
- The reconfigurability of linear multiuser receivers allows for the integration of multimedia services over wireless channels with variable quality of service (QoS) requirements.
- The reconfigurable radio architectures also provide diverse QoS guarantees, ranging over several orders of magnitude in terms of BER requirements.

Design and Implementation of a Completely Reconfigurable Radio
Srikathyayani Srikanteswara, Michael Hoffmann, Jeffrey H. Reed, Peter M. Athanas

Introduction
- Soft radio using stream-based computing and run-time reconfigurable hardware
- CCM called Stallion for the processing layer
- Layered radio architecture
- Implementation of a rake receiver for WCDMA

Traditional choices
- ASIC
  - most efficient implementation of a given circuit
  - little flexibility, high initial cost, long design cycle
- FPGA
  - lacks run-time and partial reconfigurability
  - not well matched to wireless communication or signal processing applications
- DSP
  - maximum flexibility, short design cycle
  - not efficient in power consumption or silicon area

CCM (custom computing machine)
- Achieves flexibility in hardware without sacrificing power or silicon efficiency
- Tries to retain the characteristics of both FPGA and ASIC
- Static hardware for frequently used cores such as multiplication => efficient radio design
- FPGA-based customization, such that the flexibility of FPGAs is retained only where necessary

Stallion Architecture
- CCM developed by Virginia Tech
- Flexible, high-data-throughput, low-power computation
- Based on wormhole reconfigurable computing
- Supports fast run-time reconfiguration
- Configuration time is on the order of microseconds (for most FPGAs it is on the order of milliseconds)
Stallion Architecture (figure)

Stallion Architecture (functional unit) (figure)

Reference material: wormhole structure (floating-point multiplier) (figure)

Layered Radio Architecture
- Hardware paging
  - reconfigurable computing makes hardware paging possible
  - hardware modules are paged in and out of the system in a manner similar to software paging performed with virtual memory
  - allows for the optimal use of a system's processing elements
- Stream-based processing
  - stream: a sequence of words containing both configuration information and computational data
  - simplifies the interfaces between processing modules
  - makes it easy to replace modules or add additional modules

Reference material: wormhole structure (basic principle), which supports stream-based processing (figure)

Reference material: virtual memory using hashing (inverted page table structure) (figure)

The Layered Radio Architecture
- Soft radio interface (SRI) layer
- Configuration layer
- Processing layer
- Streams carry the data to be processed and the programming information
- Each layer's functionality is isolated from the other layers
- Information is passed between the layers using stream-based processing
- The processing layer requires run-time reconfigurable hardware, while the higher layers do not have this constraint

Reference material: The Layered Radio Architecture (figure)

Design issues for a reconfigurable platform
- A complex entity that needs to handle very high data rates efficiently
- Smooth reconfiguration of the radio
- Ability to reconfigure at run time
- Over-the-air updates
- Low power consumption

Overview of the Layered Radio Architecture (figure)

Overview of Layered Architecture
• The layered architecture leverages stream-based processing, where a common bus is used for data as well as programming information.
• The architecture can handle complex data processing with efficient resource allocation, while maintaining hardware reusability, flexibility and scalability.

Application-layer software (1) (figure)

Application-layer software (2)
- User interface
- Receives data from the A/D converter
- Sends control packets to the SRI layer
- Sends reply signals to the host PC

Soft radio interface layer (1) (figure)

Soft radio interface layer (2)
- Contains system-level description code
- Sends algorithm code to the configuration layer
- Does not contain hardware configuration binary code
- References its local memory

Configuration layer (1) (figure)

Configuration layer (2)
- Contains the actual configuration bits needed by the processing layer
- Receives the programming packet from the SRI layer
- Sends the configuration packet to the processing layer
- References binary code in its local memory

Stream-based processing
- Diagram of a packet (figure)
- Why do we use stream-based packets?
  • they allow pipelining
  • they maintain some degree of flexibility

Stream-based processing (1)
• A stream is a packet of known length containing either programming (configuration) information or the data to be processed.
• Each processing module performs a unique subset of the overall processing on the data and then passes the data and control information to the next module.
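The two stream-based processing bullets above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the Stallion packet format: the names `Packet`, `ProcessingModule`, `run_chain`, the 4-word packet length and the three-entry `FUNCTIONS` table are all hypothetical, and the model ignores clock-level timing.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACKET_WORDS = 4  # assumed fixed packet length, in words (hypothetical value)

# Hypothetical function library standing in for hardware configurations.
FUNCTIONS = {
    0: lambda w: w,        # pass-through
    1: lambda w: 2 * w,    # scale by two
    2: lambda w: w + 1,    # add an offset
}


@dataclass(frozen=True)
class Packet:
    kind: str                 # "config" (programming) or "data"
    address: int              # module that a config packet is meant for
    payload: Tuple[int, ...]  # exactly PACKET_WORDS words

    def __post_init__(self):
        if self.kind not in ("config", "data"):
            raise ValueError("unknown packet kind")
        if len(self.payload) != PACKET_WORDS:
            raise ValueError("packets have a fixed, known length")


class ProcessingModule:
    """One module in a linearly connected processing layer."""

    def __init__(self, address: int):
        self.address = address
        self.function_id = 0  # start as pass-through

    def step(self, packet: Packet) -> Packet:
        if packet.kind == "config":
            # Reconfigure only if the packet is addressed to this module,
            # then forward the packet unchanged to the next module.
            if packet.address == self.address:
                self.function_id = packet.payload[0]
            return packet
        # Data packet: apply this module's share of the processing.
        op = FUNCTIONS[self.function_id]
        return Packet("data", packet.address, tuple(op(w) for w in packet.payload))


def run_chain(modules: List[ProcessingModule], stream: List[Packet]) -> List[Tuple[int, ...]]:
    """Push each packet through every module in order; collect data outputs."""
    results = []
    for packet in stream:
        for module in modules:
            packet = module.step(packet)
        if packet.kind == "data":
            results.append(packet.payload)
    return results
```

Because configuration and data travel on the same path, swapping a module's function is just another packet in the stream: a `config` packet addressed to module 1 changes what the next `data` packet experiences, with no separate control interface.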
Stream-based processing (2)
- Each module receives a packet from the previous processing element
- Once a packet has been received, it is interpreted and the processed data is attached to it

More detailed soft radio interface layer (1) (figure)

More detailed soft radio interface layer (2)
- Where do input signals come from?
  - A/D converter
  - Host PC
  - Buffer
  - Configuration layer
• Packets received from the configuration layer have status and error messages appended

More detailed soft radio interface layer (3) (figure)

More detailed configuration layer (1) (figure)

More detailed configuration layer (2)
- Receives algorithm code from the SRI layer
- Has local memory and references it
- The local memory holds the actual binary code, the processing module addresses and a status list

More detailed processing layer (1)
- Features
  - Modules are linearly connected
  - Static modules plus reconfigurable modules
  - Separate operation (modules do not disturb one another)
  - The main flow is pipelined

More detailed processing layer (2) (figure)

More detailed processing layer (3) (figure)

More detailed processing layer (4)
- Designed for pipelined operation
- Each module can feed back its results
- Also supports concurrent input and output

More detailed processing layer (5) (figure)

More detailed processing layer (6) (figure)

More detailed processing layer (7)
- When a stream packet enters a processing module, the module:
  - interprets the packet and performs the necessary action
  - checks that the packet is valid and keeps the modules synchronized
- Every clock cycle, each processing module accepts a packet and sends out a packet

Section matching (figure)

Overview Design of Stallion (Virginia Tech) (1) (figure)
Overview Design of Stallion (Virginia Tech) (2) (figure)

Overview Design of Stallion (Virginia Tech) (3) (figure)

Overview Design of Stallion (Virginia Tech) (4)
- Why does the main clock have to be divided into slots?
  - The guard slot is used to prevent data collisions
  - The forward slot is used to transmit data and processing packets
  - The backward slot is used for feedback

Overview Design of Stallion (Virginia Tech) (5)
- What does the power-up sequence look like?
  - At power-on, the first P-module assigns address 0 to itself
  - It sends an invalid stream on the bus to ensure that the other P-modules do not take the same address
  - The first P-module does not act until a stream addressed to 0 arrives

Advantages of the Layered Architecture
- Defines the methodology for designing multimode radios using hardware paging
- Provides the framework for building a flexible soft radio, at the expense of the overhead of packetizing data
- Excellent hardware reusability
- Libraries of hardware functions can be built, much like software libraries
- Has good data-flow properties and a simple interface between the processing-layer modules

Insight into Stallion (Virginia Tech) (1)
- What is Stallion, and what are its features?
  - Based on wormhole reconfigurable computing
  - High reconfiguration speed
  - Specifically suited to flexible, high-throughput, low-power computation

Insight into Stallion (Virginia Tech) (2) (figure)

Insight into Stallion (Virginia Tech)
(3) Operating description of Stallion
- The functional units are programmed to process data, while the crossbar aids in routing data
- Input and output are performed through 6 data ports
- Processed data can loop back through the smart crossbar
- The essence of wormhole run-time reconfiguration is that independent, self-steering streams of programming information and operand data interact within the architecture to perform the computational problem at hand

Rake Receiver
- The layered architecture, with the use of CCMs, can support existing and future robust high-data-rate systems
- The implementation of the rake receiver demonstrates that the architecture can support very high data rates while retaining flexibility
- Implemented using a single Stallion for the processing layer

Rake Receiver
- Figure: CCDMA (x = time, y = frequency), showing the allocation of user1, user2 and user3

Rake Receiver
- A 3-finger rake takes 5976 cycles per slot
- Each slot contains 10240 samples
- Total processing rate of 0.5836 cycles/sample (5976 / 10240)
- Operating speed of 4.48 MHz, which is 8.96% of the typical speed of 50 MHz
- Shows the capacity to support high data rates

Rake Receiver (implementation statistics) (figure)

Conclusions
- The layered architecture needs a formal, unified structure to serve as a standard
- The layered architecture is suited to today's FPGAs that support partial reconfiguration and to tomorrow's configurable computing platforms
- Current research at Virginia Tech focuses on building a library and soft radio modules

References
- J. Mitola III, "The Software Radio Architecture," IEEE Commun. Mag., May 1995, pp. 26-38.
- J. Mitola and Zvonar, "The Soft Radio Architecture for Reconfigurable Platforms," IEEE Commun. Mag., pp. 140-147.
- S. Srikanteswara et al., "Design and Implementation of a Completely Reconfigurable Soft Radio," IEEE 0-7803-6267-5.
- S. Srikanteswara et al., "Configurable Computing for Communication Systems," Proc.
Wireless Commun. Conf., IMAPS, 1998, pp. 180-185.
• "FPGA in the Software Radio," IEEE Communications Magazine, Feb. 1999.

DSP-Based Architectures for Mobile Communications: Past, Present and Future
This material is based on the paper by Gatherer, Stetzler, McMahan and Auslander

Introduction
- Goal: summarize trends in power consumption and MIPS, and describe the use of coprocessors to improve implementation efficiency
- Properties:
  - GSM/UMTS
  - Power
  - MIPS
  - Coprocessor
  - Complementary technology
- Approaches to the implementation:
  - Programmable DSPs for flexibility
  - Hard-wired ASIC
- Right answer: some combination of both approaches
- Flexibility is becoming more of an issue, so the programmability offered by DSPs is even more desirable

A Historical Perspective on Wireless Handset Architectures for GSM - I
- Needs for a DSP in early GSM:
  - Low-power requirement: most of the phone would be implemented in ASIC; a DSP was included mainly to do the vocoding
  - Flexibility: the standard was evolving, with each generation having a slightly different physical layer from the previous one; upgrades to ASIC-based solutions became costly and difficult
- Use of the DSP:
  - A single DSP became powerful enough to do all the DSP functions ("mission creep")
  - To improve system power consumption and board space, a RISC microcontroller was integrated
  - As GSM phones have gradually moved beyond the simple phone function, the fraction of the DSP MIPS used by something other than physical layer 1 has increased
A Historical Perspective on Wireless Handset Architectures for GSM - II
- Reduced use of ASICs in GSM:
  - Making an ASIC vocoder was like replicating an available commercial DSP architecture
  - The product life cycle shortened from 2.5 years to 1 year
  - Different worldwide standards related to GSM
  - Platform-based architecture
  - A DSP-based baseband approach can cope better with different RF and mixed-signal offerings
- Spare DSP MIPS:
  - Echo/noise cancellation, speech recognition, equalizer

Trends in Low-Power DSPs - I
- Trends:
  - DSP power dissipation is halving every 18 months
  - The percentage of the physical-layer MIPS that reside in the DSP has changed from 100% in GSM to 10% in WCDMA
  - But architectures have become more efficient and instruction sets have been enhanced
- Examples of evolving DSPs optimized for wireless applications: TI C54x, Lucent 1600 series and ADI12xx series
- C54x:
  - Several power-saving features are built into the architecture
  - Instruction set designed to reduce the code size and the processor cycles required
  - Modified Harvard architecture: one program memory bus coupled with two data address generators
  - High memory bandwidth and multiple-operand operations: fewer cycles to complete the same function
- Added instructions:
  - Allow efficient implementation of algorithms important to wireless applications
  - C54x examples: 40-bit barrel shifter, compare-select-store, single and block repeat, block memory move, FIR, LMS, ...
  - In the near future: bit manipulation, ...

Trends in Low-Power DSPs - II
- Another trend: VLIW processors
  - Support a compiler-based, programmer-friendly environment
  - Examples: TI TMS320C6x, ADI TigerSHARC, Lucent and Motorola Star*Core
  - Instruction-level parallelism: statically scheduled, multiple-issue implementation
  - Very efficient compilation of higher-level code, reducing the need for DSP-specific assembly-level coding of algorithms
  - Open, application-driven systems
- Power management: the C54x uses a hybrid power-management strategy
  - Automatic local clock gating and 3 user-controlled idle modes
  - Flexible D-PLL-based clock generator and multiplier

Coprocessor - I
- The problem of implementing a new standard with today's DSPs:
  - Standards are driven by what is possible for ASIC implementation at a given power and cost point
  - A newly defined standard cannot be implemented in a DSP alone
  - For a WCDMA voice-rate terminal, only 10% of the operations are suitable for implementation on a current DSP: the functions operating on data at the symbol rate as opposed to the chip rate
- Appealing solution: a coprocessor-based architecture with a single programmable device at its core
- Example: the Pleiades project
  - RISC engine with an attached configurable array of multipliers
  - 16-point complex radix-2 FFT: delay-energy product of 0.02% and 4% of that of the StrongARM and the C54x

Coprocessor - II
- Division of coprocessors: loosely coupled and tightly coupled
  - Defined relative to the average time to complete an instruction on the DSP
- TCC (tightly coupled coprocessor):
  - The DSP initiates a task on the coprocessor that completes in a few instruction cycles
  - Has a specific interface to the DSP core and access to some registers within that core
  - Few cycles: involves a small amount of data
  - Parallel scheduling of tasks on the DSP and the coprocessor is difficult
  - Acts as a user-definable instruction-set enhancement
  - Provides power and speed improvement for small tasks where there is
no data bottleneck through the DSP
  - Performs a specific task and is relatively small compared to the DSP
  - Can be absorbed by replacing it with code

Coprocessor - III
- LCC (loosely coupled coprocessor):
  - More analogous to a subroutine call than to an instruction
  - Operates on large data sets
  - Runs in parallel with the DSP
  - Requires more care in scheduling LCC instructions
- Main advantage:
  - Solves the bus-bandwidth problem when the raw input data rate or the data reuse in a calculation is very high
  - The computational units are local to the data and arranged for the data access required by a class of computation
- Application at the chip-rate to symbol-rate boundary
  - Simple but high-MIPS tasks
  - Instruction and output buffers are memory mapped to allow flexible access
- With careful coprocessor design, no significant power penalty is paid for the flexibility
- DSP/coprocessor partitioning: e.g. decoder

Applications and Architectures for Future Wireless Devices
- Examples of services:
  - Imaging services
  - Location-based services
  - Audio and visual environment
- A need for more powerful DSPs:
  - Open operating system
  - Dual-core RISC + DSP
  - Multiple DSPs
- Extent to which the application and communication functions can be effectively combined in one programming environment

FPGA/PLD Comparison
- FPGA (figure: Xilinx FPGA structure)
  - Sea of configurable logic blocks (CLBs)
  - Fine granularity
  - LUT-based
  - Complex interconnect
  - Unpredictable performance
- PLD (figure: Altera PLD structure)
  - Fewer, larger logic array blocks (LABs)
  - Coarse granularity
  - Sum-of-products structure
  - Fast; predictable timing

Comparison of Technology Solutions

Technology             | Power consumption | Size     | Cost          | Field upgradable | Silicon evolution | Tools
High-speed DSPs        | Very high         | Modest   | Moderate/High | High             | Easy              | Some
Multiple ASICs         | Moderate          | Large    | High          | None             | Difficult         | Available
Parameterized hardware | Moderate          | Moderate | Moderate      | Some             | Moderate          | Some
Reconfigurable logic   | Low               | Low      | Moderate/Low  | High             | Easy              | Unavailable

FPGA vs DSP
- Programming language
  - FPGA chip: VHDL, Verilog
  - DSP chip: C, assembly language
- Ease of software programming
  - FPGA chip: Fairly easy; however, a programmer needs to understand the hardware architecture before programming
  - DSP chip: Easy
- Performance
  - FPGA chip: Can be very fast if an appropriate architecture is designed
  - DSP chip: Speed is limited by the clock speed of the DSP chip
- Reconfigurability
  - FPGA chip: SRAM-type FPGAs can be reconfigured an unlimited number of times
  - DSP chip: Can be reconfigured by changing the program memory content
- Reconfiguration method
  - FPGA chip: Reconfiguration is done by downloading configuration data to the chip electronically
  - DSP chip: Reconfiguration is done by simply reading a program at a different memory address
- Area where each can outperform the other
  - FPGA chip: FIR filter, IIR filter, correlator, convolver, FFT, etc.
  - DSP chip: A signal processing program of a sequential nature
- Power consumption
  - FPGA chip: Can be minimized if the circuit is designed to save power, or if the power is dynamically controlled
  - DSP chip: Even if program A is larger than program B, power consumption does not change as long as the number of memory chips is the same
- Implementation method of MAC
  - FPGA chip: Parallel multiplier/adder or distributed arithmetic
  - DSP chip: Repeated operation of the MAC function
- Speed of MAC
  - FPGA chip: Can be fast if a parallel algorithm is used. If a filter is implemented using distributed arithmetic, the speed does not depend on the number of taps
  - DSP chip: Limited by the speed of the MAC operation of the DSP chip. If a filter is implemented, the speed becomes slower as the number of taps increases
- Parallelism
  - FPGA chip: Can be parallelized to achieve high performance
  - DSP chip: DSP chip programming is usually sequential and cannot be parallelized

A typical FPGA architecture (figure)

A typical FPGA architecture (figure)

Xilinx_X3032_CLB (figure)

Distributed arithmetic MAC (figure: inputs a1, a2, b1, b2; shift registers, register, shifter)

Distributed arithmetic MAC using an LUT (figure: inputs b1, b2; shift registers, lookup table, register, shifter)

A distributed arithmetic FIR filter using an LUT (figure: input shift register, lookup table, shift register, register, shifter)

Utilization Comparison
- Exercise: map wireless protocol blocks (TCI) to FPGA and PLD
- FPGA excels for datapath blocks
- PLD excels for control blocks
- Figure: Logic utilization (normalized), Xilinx FPGA vs. Altera PLD, for TCI FSM blocks (CRC, CRC FSM, PhysSend, Remote)
- Figure: Logic utilization (normalized), Xilinx FPGA vs. Altera PLD, for TCI datapath blocks (GenSync, MergeInteger, Select, Serial, Switch)

Utilization Analysis
- When does FPGA outperform PLD?
  - Large number of outputs or registers
  - Multi-level logic (>2)
  - PLD requires one macrocell per output
  - PLD is restricted by its sum-of-products (SOP) structure
- When does PLD outperform FPGA?
  - Regular, two-level AND-OR structure
  - PLD performs very well for state machines
  - Very large number of inputs
  - Each CLB typically has < 7 inputs

Power Comparison
- Figure: Power consumption (mW) of TCI blocks, Xilinx FPGA vs. Altera PLD (TCI CRC, TCI CRC+FSM, PhysSend FSM)
- FPGA typically consumes less energy than PLD
  - PLD: pseudo-NMOS AND-plane, sense amp at the AND-plane output
  - A low-energy FPGA implementation currently exists (V. George, 1999)
- Low-energy PLD: implement PLDs with three AND-plane structures
  1) Remove the sense amp
  2) Static CMOS: no static power
  3) Dynamic logic: fewer transistors
- Figure: Power dissipation (mW) of TCI blocks for Standard PLD, Dynamic PLD, CMOS PLD, Xilinx FPGA and LP-PGA FPGA; data labels in the chart: 2.03, 1.1, 0.174, 0.091, 0.065

Architecture Exploration
- For low-energy protocol implementation, FPGA and PLD deliver different benefits
  - Power: low-power FPGA research is more mature
  - Utilization: FPGA and PLD behave differently and hence should be utilized differently
- Protocols = extended FSMs = FSM + datapath units
- First-order solution: a hybrid approach
  - Use PLD for FSMs and FPGA for datapath units
  - Make it parameterizable to allow application-specific tradeoffs
- Figure: hybrid architecture (inputs, next-state decoder, state register and control/data output decoders; PLD for control, FPGA for data; output register; control output and data output)

Design flow for wireless protocol processor: Platform Instantiation (Phase II)
- Perform design exploration to find a suitable platform instance for a given set of target applications and constraints.
- The Y-chart approach (Kienhuis) involves an iterative process of:
  - mapping functions to parameterized architectural modules
  - performance evaluation of the resulting platform under the given set of functional constraints
- The principle of orthogonalization of concerns (functional vs. implementation) is applied to fully explore the design space.
- Figure: Phase II Y-chart (configurable platform, functional specification)
Design flow for wireless protocol processor (Phase III)
- Perform hardware and software synthesis to implement a specific application onto the platform instance.
- Through the software API, the hardware platform can be programmed or configured to perform the desired functionality.
- Remaining issues include generation and compilation of application code, the real-time operating system (RTOS), and any necessary design synthesis.
- Figure: Phase III implementation, fed from Phase II

Interconnect is a Major Power Component
- Figure: "Chip Interconnect Trends", power [%] vs. year ('95 to '10) [T. Sakurai]
- Figure: "Power Breakdown for Reconfigurable Logic" (Xilinx): interconnect 65%, clock 21%, I/O 9%, CLB 5% [E. Kusse & J. Rabaey]

SoC using cores
- Assembling an SoC using cores has not yet become reality
  - It is a complex task: manual and error-prone, with difficult full timing closure, physical problems and system verification
  - Lack of established standards and interface synthesis tools
  - H/W and S/W integration

SOC Target Architecture (1)
- Standard on-chip bus structures
  - CoreConnect from IBM (PowerPC)
  - AMBA from ARM (ARM)
- IBM's SOC framework
  - IBM Blue Logic Core Library
  - A fixed bus architecture
  - Processor Local Bus (PLB), On-Chip Peripheral Bus (OPB), Device Control Register Bus (DCR)

SOC Target Architecture (2)
- CoreConnect-based SOC (figure)

SOC Target Architecture (3)
- Define all the cores needed to implement the desired functionality
- Understand the functionality of all pins on all cores
- Define the request priorities and interconnect the pins according to them
- Define which cores access memory, the address map, the clock domains, etc.
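Two of the manual integration steps listed above, defining the address map and defining request priorities, are easy to get wrong by hand. The following is a minimal sketch of how such definitions might be sanity-checked in software. It is not part of CoreConnect, AMBA or any IBM tool: `Window`, `find_overlaps`, `arbitration_order` and the example core names are hypothetical, and the rule that a lower number means higher priority is an assumption.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Window:
    """Address range claimed by one core (hypothetical description format)."""
    core: str
    base: int
    size: int

    @property
    def end(self) -> int:
        return self.base + self.size  # exclusive upper bound


def find_overlaps(windows: List[Window]) -> List[Tuple[str, str]]:
    """Return every pair of cores whose address windows intersect."""
    ordered = sorted(windows, key=lambda w: w.base)
    return [
        (a.core, b.core)
        for a, b in combinations(ordered, 2)
        if a.base < b.end and b.base < a.end
    ]


def arbitration_order(priorities: Dict[str, int]) -> List[str]:
    """Order bus masters for arbitration; lower number = higher priority (assumed)."""
    return [core for core, _ in sorted(priorities.items(), key=lambda kv: (kv[1], kv[0]))]
```

A check of this kind only covers the bookkeeping side of integration; pin-level connectivity and timing closure, which the slides identify as the harder problems, are outside its scope.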
Automating SOC Integration
- "Coral": raising the level of abstraction
- Elements
  - Virtual design
  - Interface encapsulation and glueless interfaces
  - Core and pin properties
  - Interconnection engine
  - Virtual-to-real synthesis engine
  - Configuration engines

Virtual Design
- Virtual component, virtual interface, virtual net
- Real: 160 pins
- Virtual: 10 pins

From Interface Encapsulation to Glueless Interface
- Two levels of glue logic encapsulation
  - all the static and parameterizable protocol/interface logic
  - glue logic between cores
- For third-party cores
  - create core wrappers in VHDL

Core and Pin Properties (1)
- Pin and component information
  - BUS_TYPE: PLB, OPB, ASB, ...
  - INTERFACE_TYPE: MASTER, ...
  - FUNCTION_TYPE: READ, INTERRUPT
  - OPERATION_TYPE: REQUEST, ...
  - DATA_TYPE: ADDRESS, DATA, ...
  - RESOURCE_TYPE: BUS, PERIPHERAL
  - PIN_GROUP: DCU, ...

Core and Pin Properties (2)
- Example: DCU_plbRequest on PowerPC401 (figure)

Virtual to Real Synthesis (1)
- Three steps with the VRSE
  1) instantiate a real component in the real design
  2) traverse every virtual net and its virtual pins
  3) compare the properties on the real pins and determine which real pins should be connected together

Virtual to Real Synthesis (2) (figure)

Virtual to Real Synthesis (3)
- Virtual to real synthesis (figure)

Configuration Engine
- Various system configuration menus
  - Clocking
  - Address map definition
  - Interrupt map definition
  - DMA channel assignment
  - I/O specification and generation

Bus Power
• Buses are a significant source of power dissipation and delay due to large capacitive loading
  - 15% of total power in the Alpha 21064
  - 30% of total power in the Intel 80386

Power Reduction in Buses

Technique               | Drawbacks
Voltage swing reduction | increased probability of error; additional power supply
Charge recovery         | additional latency; additional circuitry
Coding                  | additional bus lines; additional circuitry

Transition Activity
• Self transition activity
  - Signal probability
  - Conditional probability
• Coupled transition activity
  - Transition types: Type I, Type II, Type III, Type IV

Low Power Encoding Schemes
- Generic encoder/decoder (codec) architecture
  - Encoder: predictor, encoding logic block, decorrelator
  - Decoder: correlator, decoding logic block, register

Coupling-Driven Bus-Invert Scheme (CBI)
- Coupling-driven bus-invert
  - Flip the data signal when the coupling effect of the inverted signal is less than that of the original data
- Problems
  - How to accurately account for the coupling effect
  - How to effectively implement the scheme with small hardware
- Basic idea
  - Enumeration method to measure the coupling effect in a cycle time

Wires-First Design
- Exploits logic structure to reduce wire loads
  - almost all logic has considerable structure
- Early visibility of timing and power dissipation
- Enables use of advanced circuits
- Gives a stable design
  - wire properties and crosstalk are known early and well characterized
  - key wire loads don't change with small logic changes
- Gives the designer control

Wires-First Design
- Figure: wires-first design flow (structured RTL, floorplan, wire plan with key wires, placement and loads, short wire models, structure library, synthesis, local netlists, place & route, layout regions, R&C extractor, timing analysis, slow paths, manual design)

On-Chip Interconnection Networks
- Replace dedicated global wiring with a shared network
- Figure: chip with dedicated wiring vs. chip with a network (local logic, router, network wires)

Most Wires are Idle Most of the Time
- Don't dedicate wires to signals; share wires across multiple signals
- Route packets, not wires
- Organize global wiring as an on-chip interconnection network
  - allows the wiring resource to be shared, keeping wires busy most of the time
  - allows a single global interconnect to be re-used on multiple designs
  - makes global wiring regular and highly optimized

Dedicated wires vs. Network

Dedicated wiring                                                           | On-chip network
Spaghetti wiring                                                           | Ordered wiring
Variation makes it hard to model crosstalk, returns, length, R and C       | No variation, so easy to exactly model crosstalk, returns, R and C
Drivers sized for the 'wire model': 99% too large, 1% too small            | Driver sized exactly for the wire
Hard to use advanced signaling                                             | Easy to use advanced signaling
Low duty factor                                                            | High duty factor
No protocol overhead                                                       | Small protocol overhead

Circuits for On-Chip Networks
- Uniform, well-characterized lines enable custom circuits
  - 0.1x power, 3x velocity
- Long, lossy RC lines
- H-bridge driver
- 100 mV swing
- Regenerative repeaters
- Figure: driver and repeater circuit (signals ph1N, ph2N, inP, inN, sig1P, sig1N, sig2P, sig2N, pre)

Architecture for On-Chip Networks
- Topology
  - different constraints than off-chip networks
  - buffering is expensive, bandwidth is cheap
  - more wires between 'tiles' than needed for one channel
- Flow control
  - multiple networks, higher dimensions, express channels
  - run static, statically scheduled, and dynamic networks on one set of wires
  - combine buffers with repeaters (ISSCC 2001)
  - use methods that make efficient use of scarce resources (Flit Res.)
- Interface design
  - standard interface from modules to network
  - pinout and protocol independent of network implementation

References
[1] S. Srikanteswara, R. Boyle, et al., "A Soft Radio Architecture for Reconfigurable Platforms," IEEE Communications Magazine, Feb. 2000.
[2] http://www.oren.com OR51210 Digital TV VSB Demodulator Product Datasheet
[3] http://www.ti.com TMS320C55x and TMS320C64x Technical Overview/Datasheet
[4] http://www.xilinx.com Virtex Platform FPGA
[5] SDR Trends Report, TTA SDR Ad Hoc Group, Dec. 2001.
[6] John C. Davies IV, "Design and Implementation of an FPGA-based Soft-Radio Receiver Utilizing Adaptive Tracking," Thesis for the degree of Master of Science in Electrical Engineering, Virginia Polytechnic Institute and State University.
[7] http://www.ist-pastoral.com
[8] http://www.ist-trust.com
[9] http://www.sdrf.org Technical Report, Released 1999.
[10] Steven Winegarden, "Bus Architecture of a System on a Chip with User-Configurable System Logic," IEEE Journal of Solid-State Circuits, vol. 35, no. 3, Mar. 2000.

Conclusions
- Outlook and the way forward
  - Soft radio architectures will be used more and more widely
  - Hardware and software design will become more tightly integrated
  - Soft radio architecture deserves continued interest and research
  - Investment is needed in developing the libraries and modules used in the layered architecture
  - Take part in international standardization through participation in research