Lower Power Algorithm for Multimedia Systems
1999. 8
Sungkyunkwan University, Jun-Dong Cho
http://vada.skku.ac.kr
SungKyunKwan Univ. VADA Lab.

Contents
• Algorithmic Effects on Low Power
• Low Power Management
• Low Power Applications
– Low Power Video Processor
– Single Chip Video Camera
– Vector Quantization
– Data Encoding
– CDMA Searcher
– Viterbi Decoder

Low Power Algorithm

Algorithm Selection
• Example: 8x8 matrix DCT

Low Power DCT - ByungWook Kim, VADA
• For the DCT, current work focuses less on the encoder-side DCT used for compression and more on low-power implementations of the IDCT on the decoder side, which is what mobile devices need.
• The first approach works at the hardware device level with a variable threshold-voltage scheme; it is reported to consume the least power so far [1].
• The second approach works at the algorithm and architecture level: it computes the 2-D DCT directly in one pass, which reduces hardware area and computational complexity and thereby power [2][3].
• The third approach is digit-serial: the computation is split into small groups of bits, which raises data throughput, reduces area, and also lowers power [4].
• The fourth approach analyzes the input data statistically and feeds the data into the IDCT selectively [5].
• All approaches except the second use the row-column DCT built on a 1-D DCT (based on Chen's algorithm).

Row-Col. DCT
• Most hardware 2-D DCT implementations use the RCA (Row Column Algorithm), built from two 1-D DCTs and a transposition.
• The DCT is an orthogonal, separable transform. For an 8×8 coefficient block of the two-dimensional (2-D) DCT, a 1-D DCT is therefore computed along each row, the resulting coefficients are transposed so that they can be processed by column, and a 1-D DCT is computed again.
• Chen's algorithm and Lee's algorithm are both used for the row-column DCT when a 2-D DCT is computed. In most cases (99%), however, Chen's algorithm is used.

Chen's Alg [6]
\[ X(u,v) = \frac{2}{N} C(u) C(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i,j) \cos\frac{(2i+1)u\pi}{2N} \cos\frac{(2j+1)v\pi}{2N} \]
\[ x(i,j) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} C(u) C(v) X(u,v) \cos\frac{(2i+1)u\pi}{2N} \cos\frac{(2j+1)v\pi}{2N} \]
where C(0) = 1/\sqrt{2} and C(k) = 1 for k > 0.

[Figure: the flowgraph of Chen's 1-D DCT, inputs x0 to x7, outputs X0 to X7]
\[ \begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} C_4 & C_4 & C_4 & C_4 \\ C_2 & C_6 & -C_6 & -C_2 \\ C_4 & -C_4 & -C_4 & C_4 \\ C_6 & -C_2 & C_2 & -C_6 \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix} \]
\[ \begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} C_1 & C_3 & C_5 & C_7 \\ C_3 & -C_7 & -C_1 & -C_5 \\ C_5 & -C_1 & C_7 & C_3 \\ C_7 & -C_5 & C_3 & -C_1 \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix} \]
where C_k = \cos(k\pi/16).

Lee's & Feig's Algs [7,8]
• Lee's algorithm [7] reduces complexity through matrix decomposition and a systolic-array approach, but a hardware implementation is somewhat complex and irregular. Like Chen's algorithm, it is a 1-D based DCT method.
• Feig's algorithm [8] uses a matrix decomposition representation. It may be easy to implement in software, but a hardware design runs into problems: memory is needed to handle the matrices, and the structure is not regular. No hardware implementation based on Feig's algorithm has been found so far.

References
[1] T. Kuroda, T. Fujita, et al., "A 0.9V, 150-MHz, 10-mW, 4 mm^2, 2-D discrete cosine transform core processor with variable-threshold-voltage (VT) scheme," IEEE J. Solid-State Circuits, vol. 31, pp. 1770-1777, Nov. 1996.
[2] Y.P. Lee, T.H. Chen, L.G. Chen, M.J. Chen, and C.W. Ku, "A Cost-Effective Architecture for 8x8 2-D DCT/IDCT Using Direct Method," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, pp. 459-467, June 1997.
[3] Liang-Gee Chen, Juing-Ying Jiu, Hao-Chieh Chang, Yung-Pin Lee, and Chung-Wei Ku, "Low Power 2D DCT Chip Design for Wireless Multimedia Terminals," IEEE Trans. Solid-State Circuits, 1998.
[4] Kyeounsoo Kim and Jong-Seog Koh, "An area efficient DCT architecture for MPEG-2 Video Encoder," IEEE Transactions on Consumer Electronics, vol. 45, no. 1, Feb. 1999.
[5] Thucydides Xanthopoulos and Anantha P. Chandrakasan, "A Low-Power IDCT Macrocell for MPEG-2 MP@ML Exploiting Data Distribution Properties for Minimal Activity," IEEE Journal of Solid-State Circuits, vol. 34, no. 5, May 1999.
[6] Chen, Smith, and Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform," IEEE Trans. on Communications, vol. COM-25, pp. 1004-1009, 1977.
[7] Lee, "A New Algorithm to Compute the Discrete Cosine Transform," IEEE Trans. on Acoust., Speech and Signal Processing, vol.
ASSP-32, no. 6, pp. 1243-1245, 1984.
[8] Ephraim Feig and Shmuel Winograd, "Fast Algorithms for the Discrete Cosine Transform," IEEE Trans. on Signal Processing, vol. 40, no. 9, Sep. 1992.

Strength Reduction: DIGLOG multiplier
\[ C_{mult}(n) = 253 n^2, \quad C_{add}(n) = 214 n, \quad \text{where } n = \text{word length in bits} \]
\[ A = 2^j + A_R, \quad B = 2^k + B_R \]
\[ A \cdot B = (2^j + A_R)(2^k + B_R) = 2^{j+k} + 2^j B_R + 2^k A_R + A_R B_R \]

                       1st Iter   2nd Iter   3rd Iter
Worst-case error       -25%       -6%        -1.6%
Prob. of Error < 1%    10%        70%        99.8%

With an 8 by 8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).

Logarithmic Number System
\[ L_x = \log_2 |x| \]
\[ L_{A \cdot B} = L_A + L_B, \quad L_{A/B} = L_A - L_B, \quad L_{A^2} = L_A \ll 1, \quad L_{\sqrt{A}} = L_A \gg 1 \]
--> Significant Strength Reduction

Switching Activity Reduction
[Figure: (a) average activity in a multiplier as a function of the constant value; (b) parallel and serial implementations of an adder tree]

System-Level Solutions
• System management, system partitioning, algorithm selection
• Precompute the physical capacitance of the interconnect and the switching activity (number of bus accesses)
• Regularity: to minimize the power in the control hardware and the interconnection network.
• Modularity: to exploit data locality through distributed processing units, memories and control.
– Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity
– Temporal locality: average lifetimes of variables (less temporary storage; future accesses are likely to reference recently used data).
• Few memory references: references to memories are expensive in terms of power.

System-Level Solutions - cont.
• Simulator: Instruction-level Energy Estimation
• Software: Energy Efficient Algorithms
• OS: Voltage Scheduling Algorithms
• OS: Multiprocessing for Energy
• Microprocessor: Dynamic Caches
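The DIGLOG expansion on the Strength Reduction slide above can be tried out numerically. The following is a minimal sketch, not the multiplier from the slides: the function name `diglog_multiply` and the unsigned-integer interface are assumptions made for illustration. Each pass adds the three shift-and-add terms 2^(j+k) + 2^j·B_R + 2^k·A_R and then recurses on the dropped product A_R·B_R.

```python
def diglog_multiply(a, b, iterations):
    """Approximate a * b for non-negative integers using only shifts and adds.

    Each pass handles the leading 1 bits of the current operands
    (A = 2^j + A_R, B = 2^k + B_R) and leaves A_R * B_R for the next pass.
    """
    result = 0
    for _ in range(iterations):
        if a == 0 or b == 0:
            break  # nothing left to add: the result is already exact
        j = a.bit_length() - 1          # position of the leading 1 of A
        k = b.bit_length() - 1          # position of the leading 1 of B
        a_r = a - (1 << j)              # residue A_R
        b_r = b - (1 << k)              # residue B_R
        result += (1 << (j + k)) + (b_r << j) + (a_r << k)
        a, b = a_r, b_r                 # next pass approximates A_R * B_R
    return result


if __name__ == "__main__":
    for n in (1, 2, 3):
        approx = diglog_multiply(255, 255, n)
        print(n, approx, (approx - 255 * 255) / (255 * 255))
```

In this unsigned sketch the loop stops as soon as either residue is zero, so the result becomes exact after at most as many passes as there are 1 bits in the sparser operand. The error after one pass is always negative and stays above -25%, which is consistent with the worst-case figure in the table.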
Processor Systems: High Power
• Thinkpad (Pentium): 0.3 Hours/AA
• InfoPad (ARM): 0.8 Hours/AA
• Toshiba Portable (486): 0.9 Hours/AA
• Newton (ARM): 2.0 Hours/AA
• Operations per battery life: minimize energy consumed per operation
• Operations per second: maximize throughput (operations/second)

DPM vs SPM
Identify power-hungry modules and look for opportunities to reduce power.
• DPM (Dynamic Power Management): stops the clock switching of a specific unit, as generated by the clock generators.
• SPM (Static Power Management): when the system remains idle for a significant period of time, it is shut down.

Vdd vs Delay
• Use variable voltage scaling or scheduling for real-time processing.
• Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining, which increase concurrency and shorten the critical path.
• Scale down device sizes to compensate for delay (interconnects do not scale proportionately and can become dominant).

Power PC 603 Strategy
• Baseline: use the right supply and the right frequency for each part of the system. If the system has to wait for some input to occur, only a small circuit needs to wait; it wakes up the main circuit when the input occurs.
• The PowerPC 603 is a 2-issue processor (2 instructions read at a time) with 5 parallel execution units.
• 4 modes:
– Full on mode for full speed
– Doze mode, in which the execution units are not running
– Nap mode, which also stops the bus clocking
– Sleep mode, which stops the clock generator with or without the PLL (20-100mW).

Power PC 603 Power Management
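To make the Vdd vs Delay trade-off above concrete, here is a minimal sketch rather than a device model taken from the slides. It assumes the common first-order delay model delay ∝ Vdd / (Vdd − Vt)^2, a hypothetical threshold voltage of 0.7 V and a 5 V reference supply, and it ignores the area, routing and multiplexing overhead of the extra hardware. The function names are illustrative only.

```python
def gate_delay(vdd, vt=0.7):
    """First-order CMOS delay model (assumption): delay ~ Vdd / (Vdd - Vt)^2."""
    return vdd / (vdd - vt) ** 2


def min_vdd_for_slowdown(slowdown, vdd_ref=5.0, vt=0.7):
    """Lowest supply whose delay is at most `slowdown` times the delay at vdd_ref.

    The delay model decreases monotonically with Vdd above Vt, so bisection works.
    """
    target = slowdown * gate_delay(vdd_ref, vt)
    lo, hi = vt + 1e-6, vdd_ref
    for _ in range(60):
        mid = (lo + hi) / 2
        if gate_delay(mid, vt) > target:
            lo = mid      # too slow at this supply, raise the voltage
        else:
            hi = mid      # fast enough, try a lower voltage
    return hi


def relative_power(n_parallel, vdd_ref=5.0, vt=0.7):
    """Power of an n-way parallel datapath at equal throughput, relative to 1 unit.

    n units each run at f/n, so switched capacitance per second is unchanged
    and power scales with Vdd^2 only (overheads ignored by assumption).
    """
    vdd = min_vdd_for_slowdown(n_parallel, vdd_ref, vt)
    return (vdd / vdd_ref) ** 2


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, round(min_vdd_for_slowdown(n), 2), round(relative_power(n), 2))
```

Under these assumptions, doubling the hardware lets the supply drop to roughly 3 V for the same throughput, which is the kind of compensation the slide refers to.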
TI Structures
• Two DSPs, TMS320C541 and TMS320C542, reduce power, chip count and system cost for wireless communication applications
• C54X DSPs: 2.7V, 5V
• Low-Power Enhanced Architecture DSP (LEAD) family: with three different power-down modes, these devices are well suited for wireless communications products such as digital cellular phones, personal digital assistants and wireless modems, and for low-power voice coding and decoding
• The TMS320LC548 features:
– 15-ns (66 MIPS) or 20-ns (50 MIPS) instruction cycle times
– 3.0- and 3.3-V operation
– 32K 16-bit words of RAM and 2K 16-bit words of boot ROM on-chip
– Integrated Viterbi accelerator that performs the Viterbi butterfly update in four instruction cycles for GSM channel decoding
– Powerful single-cycle instructions (dual operand, parallel instructions, conditional instructions)

InfoPad Architecture, UC-Berkeley
[Figure: Internet, wireless basestation, "PadServer", speech recognizer and web browser on the network side; InfoPad terminal on the wireless link]
• Transmit audio and raw bitmaps across the wireless link
• Maintain state in the network, not on the Pad
• Example: hand-held speech-enabled web browser
• Perform all computation in the network to minimize client energy dissipation

InfoPad Hardware Flexibility
• Main data flow handled by custom low-power ASICs
• Embedded software responsible for high-level functions
• Only the header is sent to the microprocessor; the entire packet is routed to dedicated hardware
[Figure: radio RX packet; packet header to a 10 MIPS μProcessor (control, statistics, reliability, debugging); framebuffer update to the frame buffer]
• Use hardware/software integration to provide energy-efficient high-level functionality

Multimedia I/O Terminal

Multimedia I/O Terminal

InfoPad Evolution
Total power: ~7 W. Where did the power go?
Figure labels:
– Inefficient implementation
– InfoPad
– Commercial DC/DC
– Energy-efficient processors
– Intercom
– No local computation?
– Commercial radios
• High-level system design optimizes the complete solution and drives new research

Power-Down Techniques

Low Power Memory

Power Reduction in InfoPad
Approach               Power Reduction   Comments
Voltage Scaling        x21               1.1V vs 5V
Optimized Cell Lib.    x3-4              TR sizing, reduced swing and self-timed FIFO…
Gated Clocks           x2-3              error checking for address only
Block decoding         x8                enabling only one block in the SRAM
Algorithm Selection    x5-10             VQ vs DCT
Bit swing reduction    x3.7              1.1V vs 300mV in memory

Low Power Video Processor
Uzi Zangi, Technion - VLSI Systems Research Center, ISPED, 1998
• Asynchronous logic to save power. Didn't work because: slow design (13.5MHz) and small circuit (<100K gates), so the clock load is small. Adding asynchronous control costs more than clocking.
• Gated clock. Didn't work because: frequency is very low (13.5MHz) and register activity is very high (90%).

Power Management by Gated Clock
• Power management scheme by enabling the clock
• Power management scheme by adding a clock generation block
[Figure: blocks with enable 1 to enable 3 inputs, and a clock management block driving the gated clk of each block]

Minimizing bus switching
• Transfer the value or its bitwise complement on the bus, whichever causes the smaller number of toggling bits.
• Add one bit that indicates the polarity of the bus.
• Good for buses with:
– a large number of bits (more than 10)
– high capacitance (more than 2pF)
– high toggle activity (more than 1/2)
• Overheads:
– routing of one more bit
– extra logic for the decision (timing, area)

Minimizing bus switching (Cont.)
• Didn't work because: the largest bus is 8 bits, capacitance is less than 1pF, and toggle activity is not very high.
[Figure: Block A, n-bit decision unit and E line, bus (Ct, Cx), n-bit slice, Block B]
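The scheme on the Minimizing bus switching slides above (send the value or its complement, plus one polarity bit) can be sketched in a few lines. This is an illustrative sketch only: the function names and the choice to start from an all-zero bus are assumptions, and the decision threshold used here is "invert when more than half of the data lines would toggle".

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def bus_invert_encode(words, width):
    """Return a list of (data_lines, polarity_bit) pairs for the given words."""
    mask = (1 << width) - 1
    prev = 0                      # value currently driven on the data lines
    encoded = []
    for w in words:
        w &= mask
        if 2 * popcount(w ^ prev) > width:
            sent, inv = (~w) & mask, 1   # fewer toggles if we send the complement
        else:
            sent, inv = w, 0
        encoded.append((sent, inv))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (data_lines, polarity_bit) pairs."""
    mask = (1 << width) - 1
    return [(~sent) & mask if inv else sent for sent, inv in encoded]


def count_toggles(states):
    """Total bit transitions between consecutive bus states, starting from 0."""
    total, prev = 0, 0
    for s in states:
        total += popcount(s ^ prev)
        prev = s
    return total


if __name__ == "__main__":
    words = [0x00, 0xFF, 0x00, 0xFF]
    coded = bus_invert_encode(words, 8)
    print(count_toggles(words))                                  # plain bus
    print(count_toggles([(inv << 8) | d for d, inv in coded]))   # data + polarity line
```

The toggle count of the coded bus includes the extra polarity line, which is the routing overhead the slide mentions; the benefit shrinks on narrow, lightly loaded buses such as the 8-bit bus in the video processor.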
Method That Works: Pixel Differentials
• Pixel values show area locality: they are spatially correlated.
• This is exploited most heavily in compression (savings on storage and transmission).
• Most of the functions are linear and can therefore work on differences.
• The entire algorithm was rewritten (interpolations, filters, matrices, etc.).
• The new algorithm differs from the original by no more than 1 LSB per pixel.

Methodology
[Figure: design flow. Algorithm, C++ simulator and RTL Verilog simulator each produce an image for comparison; Synopsys netlist with the 0.35 library; Compass P&R; Cadence Opus; SPICE netlist; Epic PowerMill reports currents and power]

Pixel Difference
[Figure: register current, logic current and total current in mA (0 to 12) versus pixel differential (0% to 100%)]

Pixel Differentials Algorithm Results
Number of   Differential   Differential   Register       Logic          Total          Current   Power
Pixels      Pixels         Ratio          Current (mA)   Current (mA)   Current (mA)   Saving    Saving
3600        0              0.00%          3.8            6.8            10.6
3646        424            11.63%         1.5            3.6            5.1            52%       77%
3646        616            16.90%         1.4            3.22           4.62           56%       81%
3190        1536           48.15%         1.02           2.2            3.22           70%       91%
3494        2730           78.13%         0.82           1.21           2.03           81%       96%
3190        3116           97.68%         0.8            1.16           1.96           82%       97%

Summary
• Attempted to save power on a battery-operated chip with application-specific algorithmic and architectural techniques: asynchronous logic, gated clock, minimizing bus switching. All attempts failed.
• These methods may still apply to very large, very fast chips, and to variable-load applications.
• Successfully applied an algorithmic change inspired by image compression. It may not work on non-compressible data but works exceptionally well on images.
• Easily saved 80% power; potentially it can save more than 90%.

A SINGLE-CHIP DIGITAL CAMERA
Teresa H.
Meng, "Low-Power Wireless Video System," IEEE Communication Magazine, June 1998
◈ Given the recent development in CMOS RF transceiver design, wireless transmission at a bandwidth in excess of 10Mb/s will soon become possible using next-generation CMOS technology.
◈ The design of a low-power, large-scale parallel MPEG2 encoder architecture to be used in a single-chip digital CMOS video camera.
◈ The single-chip digital camera architecture includes a 640 x 480 array of CMOS photo diodes, embedded DRAM for storing four frames of color data, and a parallel array processor for video signal processing.
◈ The parallel processor architecture is designed to implement highly computation-intensive image and video processing tasks such as color conversion, discrete cosine transform (DCT), and motion estimation for MPEG2.

A SINGLE-CHIP DIGITAL CAMERA
[Figure: side view of the silicon surface with CMOS photo sensors, embedded DRAM (pixel memory) and parallel video processors; top view of the 640 x 480 pixel array served by column processors 1 to 40, each covering 16 columns x 480 pixels]

A SINGLE-CHIP DIGITAL CAMERA
Module/operation             Word size   Energy/op (pJ)   Normalized to adder
Carry-select adder           16 bits     18               1
Multiplier                   16 bits     64               3.6
Latch                        16 bits     4                0.22
8 x 128 x 16 SRAM (read)     16 bits     80               4.4
8 x 128 x 16 SRAM (write)    16 bits     160              9
External I/O access          16 bits     180              10
Energy per operation at a 1.5V supply in 0.8 µm CMOS technology

A SINGLE-CHIP DIGITAL CAMERA
◈ Design Consideration
• The proposed architecture considers three algorithms commonly used in video coding standards: RGB (red-green-blue) to YUV (luminance/chrominance) conversion, discrete cosine transform (DCT), and motion estimation.
• To reduce power consumption, as many parallel processors as practically feasible should be used to reduce the clock frequency, because a reduced clock frequency allows a lower supply voltage.
• External buffers are removed and replaced by on-chip memory.
• For MPEG-2 encoding, the computational demand of motion estimation (1.6 BOPS for 30 frames/s, based on the algorithm proposed by Chalidabhongse and Kuo) limits the number of columns in each processor domain to 16; otherwise the required clock speed of each processor would be too high for a low-power design (most processor operations would be spent on interprocessor communication).

A SINGLE-CHIP DIGITAL CAMERA
◈ PERFORMANCE
• To sustain this computational demand, each processor must run at a clock frequency of 40 MHz or higher. When implemented in a 0.2 µm CMOS technology, a 1V supply voltage should be more than enough to support 40MHz operation.
• Three goals: realize the image/video processing algorithms, minimize DMA accesses to the pixel DRAM, and maximize computational throughput while keeping power consumption at a minimal level.
• Under these conditions, this parallel processor architecture delivers a processing rate of 1.6 BOPS at a power consumption of 40mW.
• System performance need not be sacrificed for low power consumption if the design of algorithms and hardware can be considered concurrently. --> hardware-driven architecture design

Vector Quantization
• Lossy compression technique which exploits the correlation that exists between neighboring samples and quantizes samples together

Complexity of VQ Encoding
• The distortion metric between an input vector X and a codebook vector Ci is computed as follows:
\[ D_i = \sum_{j=0}^{15} (x_j - c_{ij})^2 \]
• Three VQ encoding algorithms will be evaluated: full search, tree search and differential codebook tree search.

Full Search
• Brute-force VQ: the distortion between the input vector and every entry in the codebook is computed, and the code index that corresponds to the minimum distortion is determined and sent over to the decoder.
• For each distortion computation, there are 16 8-bit memory accesses (to fetch the entries in the codeword), 16 subtractions, 16 multiplications and 15 additions. In addition, the minimum of 256 distortion values must be determined, which involves 255 comparison operations.

Tree-structured Vector Quantization
• If, for example, at level 1 the input vector is closer to the left entry, then the right portion of the tree is never compared below level 2 and an index bit 0 is transmitted.
• Here only 2 x log2 256 = 16 distortion calculations with 8 comparisons are needed.

Pyramid Vector Quantization
• Groups data into vectors of length L, scales them onto an L-dimensional pyramid surface, and finds the nearest lattice point on the pyramid.
• Both the scaling factor and an index are transmitted.
• Unlike standard VQ schemes, which require codebook storage, PVQ relies on intensive arithmetic computation.
• Integrates all functionality on a single die, requiring no external hardware support or memory

Differential Codebook Tree-structure Vector Quantization
• Only the distortion difference between the left and right node needs to be computed. This equation can be manipulated to reduce the number of operations.

Comparisons
• The number of memory access operations can be reduced; that is, by changing the contents of the codebook through computational transformations, the number of switching events (multiplications, additions/subtractions and memory accesses) can be reduced.

Multiplication with Constants
• Techniques and tools have been developed to scale coefficients so as to minimize the number of 1's in the coefficients, and thereby the number of shift-add operations.

Gated clocks to shut down modules when not used.
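The manipulation mentioned on the Differential Codebook Tree-structure Vector Quantization slide above can be illustrated as follows. This is a sketch under an assumption: it takes the distortion to be the squared Euclidean distance (which matches the 16 subtractions, 16 multiplications and 15 additions counted for full search) and expands D_left − D_right = Σ(l_i² − r_i²) − 2·Σ x_i·(l_i − r_i). The function names and the stored `diff`/`bias` layout are illustrative, not taken from the slides.

```python
def direct_decision(x, left, right):
    """Tree-node decision by computing both distortions: 0 = left, 1 = right."""
    d_left = sum((xi - li) ** 2 for xi, li in zip(x, left))
    d_right = sum((xi - ri) ** 2 for xi, ri in zip(x, right))
    return 0 if d_left <= d_right else 1


def precompute_node(left, right):
    """Transform the codebook once, offline: store l - r and sum(l^2 - r^2)."""
    diff = [li - ri for li, ri in zip(left, right)]
    bias = sum(li * li - ri * ri for li, ri in zip(left, right))
    return diff, bias


def differential_decision(x, diff, bias):
    """Same decision from one dot product: sign of D_left - D_right."""
    delta = bias - 2 * sum(xi * di for xi, di in zip(x, diff))
    return 0 if delta <= 0 else 1


if __name__ == "__main__":
    left = [(3 * i) % 17 for i in range(16)]
    right = [(5 * i + 2) % 19 for i in range(16)]
    diff, bias = precompute_node(left, right)
    x = [(7 * i + 1) % 23 for i in range(16)]
    print(direct_decision(x, left, right), differential_decision(x, diff, bias))
```

Per node, the transformed form needs 16 codebook reads and 16 multiplications instead of 32 reads, 32 subtractions and 32 multiplications, which is the kind of reduction in memory accesses and arithmetic the Comparisons slide describes.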
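Lower Power Data Encoding
• S.S.Chun and J.D.Cho, Journal of Korean Information Science, Vol. 26, No. 6, 1999.
• A method that reduces the number of switching events by restructuring the Huffman code while keeping the compression ratio produced by the Huffman coding algorithm.
• Sub-streams that share many common sub-sequences are given an encoding with few transitions, such as Gray code.
• For RISC instruction addressing, Gray-code addressing reduces power by up to 50% compared with binary-code addressing.

Gray Code
• Define the Hamming distance between two n-dimensional (n-bit) vectors U = (u_1, u_2, …, u_n) and V = (v_1, v_2, …, v_n) as
\[ h(U,V) = \sum_{i=1}^{n} (u_i \oplus v_i) \]
where (u_i \oplus v_i) is 1 if the bits of U and V differ and 0 otherwise. This can also be expressed as the distance travelled along the edges of the n-dimensional hypercube G. Gray code = shortest path in G.
• Because Huffman codewords may differ in length and must remain prefix-free, an exact conversion to Gray code is impossible; a compression encoding that minimizes the number of bit transitions is needed instead.

2-D Traveling Salesman Problem
• The proposed problem assigns code pairs with a small Hamming distance to symbol pairs that are frequently adjacent, so it is expressed as a new problem that handles two or more TSPs simultaneously.
• Using a heuristic: 10% reduction in switching activity for random uncorrelated data

Lower Power CDMA Searcher
1999. 8
S. Kim and J.D.Cho
Sungkyunkwan University
http://vada.skku.ac.kr

Searcher (Using a Common Double Dwell Method)
◈ The initial acquisition process that establishes accurate PN code synchronization between the transmitter and receiver of a CDMA system.
[Figure: double-dwell searcher block diagram. RX_I and RX_Q are despread with the local PN_I (a_I) and PN_Q (a_Q) into O_I and O_Q, accumulated over N_C into Y_I and Y_Q, combined as Y_I^2 + Y_Q^2, compared with θ1, accumulated over N_N into Z, and compared with θ2. "No" at either comparison triggers Search_Slew; "Yes" at θ2 ends the search.]

Operation Flow
1 The pilot channel transmitted by the base station is despread with the PN code sequence generated in the mobile station.
2 The despread result is accumulated over the coherent accumulation count Nc and then passed through the energy computation (squaring).
3 The energy value is compared with the first threshold (θ1); if it exceeds the threshold, non-coherent accumulation (Nn) is performed in the following stage.
4 Otherwise, the PN code sequence is advanced by one chip and the steps above are repeated on the incoming signal.
5 The result of the non-coherent accumulation is compared with the second threshold (θ2).
6 If it exceeds θ2, the search ends; otherwise the PN code sequence is advanced by one chip and the steps above are repeated.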
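The six steps of the Operation Flow above can be sketched in software. This is a simplified illustration, not the searcher hardware: samples are modelled as Python complex numbers (I + jQ), despreading is a multiplication by the conjugate PN chip, and the test signal is noise-free. The length-31 PN sequence, the thresholds and all function names are assumptions chosen only to keep the example small.

```python
def m_sequence():
    """Length-31 PN sequence of +1/-1 chips from s[n] = s[n-3] XOR s[n-5]."""
    s = [1, 0, 0, 0, 0]
    for n in range(5, 31):
        s.append(s[n - 3] ^ s[n - 5])
    return [1 - 2 * bit for bit in s]


def coherent_energy(rx, pn, offset, start, nc):
    """Steps 1-2: despread nc chips, accumulate coherently, then square."""
    acc = 0j
    for n in range(start, start + nc):
        acc += rx[n % len(rx)] * pn[(n + offset) % len(pn)].conjugate()
    return acc.real ** 2 + acc.imag ** 2


def double_dwell_search(rx, pn, nc, nn, theta1, theta2):
    """Return the first PN offset that passes both dwells, or None."""
    for offset in range(len(pn)):
        # Step 3: first dwell against threshold theta1.
        if coherent_energy(rx, pn, offset, 0, nc) <= theta1:
            continue  # Step 4: slew the local PN by one chip and retry.
        # Step 5: non-coherent accumulation of nn energies against theta2.
        z = sum(coherent_energy(rx, pn, offset, i * nc, nc) for i in range(nn))
        if z > theta2:
            return offset  # Step 6: search done.
    return None


if __name__ == "__main__":
    chips = m_sequence()
    pn = [complex(c, c) for c in chips]              # same chips on I and Q
    rx = [pn[(n + 11) % 31] for n in range(31)]      # pilot delayed by 11 chips
    print(double_dwell_search(rx, pn, nc=31, nn=2, theta1=1000, theta2=4000))
```

The cheap first dwell rejects most wrong offsets early, so the longer second dwell runs only on promising candidates; that is the reason for using two thresholds.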
Data Flow Graph of Searcher Operation
[Figure: data flow graph. RX_I and RX_Q are XORed with TX_I, TX_Q and -TX_I; coherent accumulation stage (+); energy computation stage (squaring); selection of the max value; comparison with θ1; non-coherent accumulation stage (+); comparison with θ2]
– 4 additions
– 2 multiplications

Rescheduled Data Flow Graph
[Figure: rescheduled data flow graph. XOR despreading; coherent accumulation stage with CSAs; absolute values; selection of the max value; comparison with θ1; energy computation stage (squaring); non-coherent accumulation stage (+); comparison with θ2]
– Uses a Carry Save Adder (or 3-input ALU)
– Applies pre-computation to the threshold comparison
– Changes the order of the data flow to reduce the number of multiplications

Pre-computation
• Power saving
– Reduces power dissipation of combinational logic
– Reduces internal power to precomputed registers
• Cost
– Increased area
– Impact on circuit timing
– Increased design complexity: number of bits to precompute
– Testability: may generate redundant logic

Pre-computation
◈ A comparator example: Srinivas Devadas, 1994
◈ Precomputation for external idleness: M. Alidina, 1994

Low Power Comparator
• The MSBs of Y_I and Y_Q are the sign bits of absolute values, so both are '0'.
• Pre-computation uses the upper 2 bits below the MSB.
• The larger of |Y_I| and |Y_Q| is selected according to the result of the pre-computation.
• For the comparison with threshold θ1, a multiplexer is used instead of a comparator.

Three Input ALU (Ovadia Bat-Sheva, 1998)
[Figure: two-ALU structure (MUL0, MUL1, P0, P1, ALU, ALU/ASU, acc0, acc1) and three-input ALU structure (MUL0, MUL1, P0, P1, 3IALU, acc1)]
• The three-input ALU consumes much less power than an ALU plus an ASU.
• A drawback of using a 3IALU is the added complexity in calculating the carry and overflow.

Experimental Results and Conclusion
• RTL-level low-power design of the searcher engine of an MSM (Mobile Station Modem) chip for mobile stations of an IS-95 based DS/CDMA system.
– Operating frequency: 12.5MHz
• Using the data flow graph, rescheduling, pre-computation and strength reduction were applied, reducing area and power by up to 67.68% and 41.35%, respectively.

Lower Power Viterbi Decoder
1999. 8
J.H. Ryu and J.D.Cho
Sungkyunkwan University
http://vada.skku.ac.kr
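As a small concrete reference for the Viterbi decoder section, the K = 3, R = 1/2 convolutional encoder used in these slides (a_j = u_j + u_{j-1} + u_{j-2}, b_j = u_j + u_{j-2}, additions modulo 2) can be sketched as below. The function name and the all-zero initial register state are assumptions for illustration.

```python
def conv_encode(bits):
    """Encode a bit list with the K = 3, rate-1/2 code; returns (a_j, b_j) pairs."""
    u1 = u2 = 0                      # u_{j-1} and u_{j-2}, register starts at zero
    codeword = []
    for u in bits:
        a = u ^ u1 ^ u2              # a_j = u_j + u_{j-1} + u_{j-2} (mod 2)
        b = u ^ u2                   # b_j = u_j + u_{j-2} (mod 2)
        codeword.append((a, b))
        u1, u2 = u, u1               # shift the register
    return codeword


if __name__ == "__main__":
    print(conv_encode([0, 0, 1, 0, 1, 0]))
```

For the information sequence U = (0,0,1,0,1,0) this reproduces the codeword V = (00,00,11,10,00,10) given with the trellis diagram.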
Viterbi Decoder
◈ Convolutional Encoder
• K = 3 (constraint length), R = 1/2 (rate)
\[ a_j = u_j + u_{j-1} + u_{j-2}, \quad b_j = u_j + u_{j-2} \]
[Figure: a (3,1/2) convolutional encoder mapping the information sequence U (u_j) to the codeword V (a_j, b_j) through registers A1 and A0]

Viterbi Decoder
[Figure: Fig. 2. Trellis diagram for a (2,1/2) convolutional code, time 0 to 6, states 00, 01, 10, 11]
Information sequence: U = (0,0,1,0,1,0,...)
Output codeword: V = (00,00,11,10,00,10,...)

Viterbi Decoder
◈ Viterbi Decoder
[Figure: Viterbi decoder structure. Received signal, BMU (BM), ACSU with PMM (SP), SMU, decoded data]

Viterbi Decoder
• Branch Metric Unit (BMU): the branch metrics measure the difference between the received symbol and the symbol that causes the transition between states in the trellis.
• Add-Compare-Select Unit (ACSU): to find the survivor path entering each state, the branch metric of a given transition is added to its corresponding partial path metric (PM) stored in the path metric memory (PMM). This new partial path metric is compared with the new partial path metrics of all the other transitions entering that state. The transition with the minimum partial path metric is chosen as the survivor path of the state. The path metric of the survivor path of each state is updated and stored back into the PMM.
• Survivor Memory Unit (SMU): the survivor paths are stored in the SMU.

Viterbi Decoder
⑴ Low power ACSU VLSI architecture
▶ Conventional ACSU VLSI architecture
[Figure: butterfly structure connecting states s_a and s_b to states S_0 and S_1]

Viterbi Decoder
[Figure: architecture of the conventional ACSU (add-compare). Adders form PM_{i-1}(s_a) + BM_i(s_a,S_0), PM_{i-1}(s_b) + BM_i(s_b,S_0), PM_{i-1}(s_a) + BM_i(s_a,S_1) and PM_{i-1}(s_b) + BM_i(s_b,S_1); comparators select M_i(S_0) and M_i(S_1)]

Viterbi Decoder [SKKU.
Solution] Algorithm
\[ PM_{i-1}(s_a) + BM_i(s_a, S_0) > PM_{i-1}(s_b) + BM_i(s_b, S_0) \]
\[ \Leftrightarrow \quad PM_{i-1}(s_a) - PM_{i-1}(s_b) > BM_i(s_b, S_0) - BM_i(s_a, S_0) \]
☞ The area and power of the low-power ACSU design are reduced by 20% and 30%, respectively, compared with the conventional ACSU design.

Number of arithmetic operations for the conventional ACSU

Number of arithmetic operations for the proposed ACSU

Viterbi Decoder [SKKU. Solution]
▶ Low power ACSU VLSI architecture [C-Y Tsui, ISLPED'99]

Viterbi Decoder [SKKU. Solution]
※ Glitch minimization [Raghunathan, DAC'96]
[Figure: (a) proposed ACSU architecture (compare-add); (b) conventional ACSU architecture (add-compare); inputs A, B, C, D, multiplexers, adders, comparator, outputs X and Y]
☞ The power consumption of architecture (a) is larger than that of architecture (b) by more than 17% because of glitch power dissipation.

Viterbi Decoder [SKKU. Solution]
※ Glitches in control logic
[Figure: multiplexers on A/B and C/D feeding an adder and comparator, with select signal S, F_s and CLK]

Viterbi Decoder
⑵ Low power traceback VLSI architecture
▶ Systolic Viterbi, traceback decoder [J. Sparso'91]
[Figure: the structure of the systolic Viterbi decoder. ACSU followed by Trace-Back Units 1 to 10]

Viterbi Decoder
[Figure: sequence of states of the trace-back method, time 0 to 6, states 00, 01, 10, 11, with path metric and decision vector per time step]
Received codeword: V = (00,00,11,10,00,10,...)

Viterbi Decoder
[Figure: decision vectors and the state with the smallest path metric passed from the ACSU into the trace-back registers at time units 1 to 4]

Viterbi Decoder
[Figure: trace-back register contents T1 to T11 at time units 10 to 12, survivor depth = 5K, with decoded outputs "0" and "1"]

Viterbi Decoder
[Figure: trace-back register contents at time units 19, 20 and 24]

Viterbi Decoder
※ Problem of the systolic array decoder
The systolic array Viterbi decoder takes the decision vector and the smallest path metric from the ACSU as input and produces the decoded bit by shifting every register on every cycle. Because all data in the TBU shift every cycle, the switching activity of the registers causes a large dynamic power consumption, almost 80% of the total power consumption.

Viterbi Decoder [SKKU. Solution]
▶ Our low power trace-back unit
[Figure: ACSU, control block and register contents at time units 1 to 3]

Viterbi Decoder [SKKU. Solution]
[Figure: ACSU, control block and trace-back registers T1 to T9 at time units 9 to 11]

Viterbi Decoder [SKKU. Solution]
[Figure: ACSU, control block and register contents at time units 19 to 21]

Viterbi Decoder [SKKU. Solution]
After the decision vector and the smallest path metric generated by the ACSU are transferred to the Control Block (CB), the CB outputs them in the right cycle using a counter and a multiplexer. The register array, which stores the trace-back values from the CB, produces the final decoded bit not by shifting the whole upper 4-bit decision vector as in the classical TBU, but by shifting only the lower 2 bits, which hold the smallest path metric, to the left.

Viterbi Decoder [SKKU. Solution]
◈ Experimental Result (area 11%, power 40%)
[Figure: bar charts of area (gates) and power dissipation (uW) versus K = 2, 3, 4 for the trace-back unit and the low-power trace-back unit]
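The add-compare-select and trace-back steps described in the Viterbi slides above can be put together in a compact software sketch. It is a plain hard-decision decoder for the K = 3, R = 1/2 code of this section, not the low-power architecture: the state packing, the Hamming branch metric, the assumption that encoding starts in state 00, and the function names are all illustrative choices.

```python
INF = float("inf")


def branch_output(state, u):
    """Encoder output (a_j, b_j) for input u; state packs (u_{j-1}, u_{j-2})."""
    u1, u2 = (state >> 1) & 1, state & 1
    return (u ^ u1 ^ u2, u ^ u2)


def next_state(state, u):
    """Shift u into the register: new (u_{j-1}, u_{j-2}) = (u, old u_{j-1})."""
    return (u << 1) | (state >> 1)


def viterbi_decode(received):
    """Decode a list of received (a, b) bit pairs into information bits."""
    pm = [0, INF, INF, INF]                  # path metrics, start in state 00
    decisions = []                           # per step: survivor (prev_state, u)
    for r in received:
        new_pm = [INF] * 4
        dec = [None] * 4
        for s in range(4):
            if pm[s] == INF:
                continue
            for u in (0, 1):
                a, b = branch_output(s, u)
                bm = (a != r[0]) + (b != r[1])   # BMU: Hamming distance
                ns = next_state(s, u)
                cand = pm[s] + bm                # ACSU: add
                if cand < new_pm[ns]:            # ACSU: compare and select
                    new_pm[ns] = cand
                    dec[ns] = (s, u)
        pm = new_pm
        decisions.append(dec)
    # Trace-back from the state with the smallest path metric.
    state = min(range(4), key=lambda s: pm[s])
    bits = []
    for dec in reversed(decisions):
        state, u = dec[state]
        bits.append(u)
    bits.reverse()
    return bits


if __name__ == "__main__":
    v = [(0, 0), (0, 0), (1, 1), (1, 0), (0, 0), (1, 0)]
    print(viterbi_decode(v))
```

This sketch stores every decision and traces back once at the end; the systolic and low-power trace-back units discussed above instead trace back continuously over a fixed survivor depth, which is where the register-shifting power is spent.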
Viterbi Decoder [Stanford Solution]
⑶ Low Power Asynchronous Viterbi Decoder [Y.h.Lee, Stanford]
▶ Algorithm
[Figure: trace-back processing between time n and time n+1, with the converge point marked]

Viterbi Decoder [Stanford Solution]
① Initialization: trace back through a trellis of five times the constraint length and store that path.
② Loop
A. Trace and compare: select an arbitrary initial state and start the trace-back. While following the route, compare it at each node with the stored route.
B. If the compared values are equal, stop tracing and discard the stored route. If they are not equal, repeat step A.
③ Repeat step ② for each input signal.

Viterbi Decoder [Stanford Solution]
▶ Implementation
[Figure: self-timed TBU block diagram. Input port, previous path and surviving path memories (address, RD/WR control), MUX, shift register, trace-back unit with oscillator ring and comparison logic, memory management unit; request from ACS, self-precharge and self-requesting if the path is not found, acknowledge to ACS if the path is found]

Viterbi Decoder
① While the self-timed TBU waits for a request signal, it consumes no power.
② The ACS issues a request signal in order to dump its state decision data.
③ The TBU reads the surviving path memory and the previous path memory and compares them.
④ If they are not equal, the TBU updates the previous path memory, performs self-precharging and self-requesting, and then repeats step ③. If they are equal, it goes to step ⑤.
⑤ The TBU sends an acknowledgement signal to the ACS, and the next ACS request …

References
• David Johnson, Venkatesh Akella, and Brett Stott, "Micropipelined Asynchronous Discrete Cosine Transform (DCT/IDCT) Processor," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 4, December 1998
• T.K. Truong, Ming-Tang Shih, Irving S. Reed, E.H. Satorius, "A VLSI Design for a Trace-Back Viterbi Decoder," IEEE Trans. Commun., vol. 40, Mar. 1992
• G. Fettweis, H. Meyr, "High-Speed Parallel Viterbi Decoding: Algorithm and VLSI-Architecture," IEEE Communications, May 1991
• G. Feygin, P.
Glenn Gulak, "Survivor Sequence Memory Management in Viterbi Decoders," IEEE, 1991