Lower Power Algorithm
for Multimedia Systems
1999. 8
SungKyunKwan University, Jun-Dong Cho
http://vada.skku.ac.kr
Contents
• Algorithmic Effects on Low Power
• Low Power Management
• Low Power Applications
– Low Power Video Processor
– Single Chip Video Camera
– Vector Quantization
– Data Encoding
– CDMA Searcher
– Viterbi Decoder
Low Power Algorithm
Algorithm Selection
• Example: 8x8 matrix DCT
Low Power DCT - ByungWook Kim, VADA
• For the DCT, current low-power efforts focus less on the DCT in the encoder (the compression side) and more on implementing the IDCT in the decoder, the part used on the mobile side. The first approach tried was a device-level technique using variable threshold voltages, and it is still reported to consume the least power [1]. The second approach works at the algorithm/architecture level: the 2-D DCT is computed directly in one step, which reduces hardware area and computational complexity and thereby lowers power [2][3]. The third is a digit-serial approach, i.e., the computation is split into small bit-slices, which raises data throughput and reduces area, with a power benefit as well [4]. The fourth analyzes the statistics of the input data and performs the IDCT only on selectively accepted inputs [5]. Except for the second approach, all of these use the row-column DCT method built on a 1-D DCT (based on Chen's algorithm).
Row-Col. DCT
• Most hardware implementations of the 2-D DCT use the row-column algorithm (RCA): two 1-D DCTs with a transposition in between. Because the DCT is an orthogonal, separable transform, the 8×8 coefficient block of the two-dimensional (2-D) DCT is first transformed row-wise with a 1-D DCT; the resulting coefficients are then transposed and transformed again with a 1-D DCT to cover the column direction. Both Chen's and Lee's algorithms can serve as the 1-D DCT inside the row-column method, but in practice Chen's algorithm is used almost exclusively (roughly 99% of designs). A small sketch of the row-column flow is given below.
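To make the row-column flow concrete, below is a minimal Python sketch of the two 1-D passes with a transpose in between; it uses the direct DCT-II summation rather than Chen's factored flowgraph, and the 8x8 test block is purely illustrative.

```python
# Row-column 2-D DCT sketch: 1-D DCT along rows, transpose, 1-D DCT along columns.
import math

N = 8

def dct_1d(x):
    """Direct N-point DCT-II: X[u] = sqrt(2/N)*C(u)*sum_i x[i]*cos((2i+1)u*pi/(2N))."""
    out = []
    for u in range(N):
        c = math.sqrt(0.5) if u == 0 else 1.0
        s = sum(x[i] * math.cos((2 * i + 1) * u * math.pi / (2 * N)) for i in range(N))
        out.append(math.sqrt(2.0 / N) * c * s)
    return out

def dct_2d_row_col(block):
    """8x8 2-D DCT realized as two 1-D passes with a transposition in between."""
    rows = [dct_1d(row) for row in block]              # 1-D DCT on each row
    cols = [dct_1d(list(col)) for col in zip(*rows)]   # transpose, then 1-D DCT again
    return [list(r) for r in zip(*cols)]               # transpose back

if __name__ == "__main__":
    block = [[(i + j) % 256 for j in range(N)] for i in range(N)]
    print(round(dct_2d_row_col(block)[0][0], 2))       # DC coefficient of the block
```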
Chen's Alg[6]
N 1 N 1
2
(2i  1)u
(2 j  1)v
X (u, v)  C (u)C (v) x(i, j ) cos
cos
N
2N
2N
i 0 j 0
2 N 1 N 1
(2i  1)u
(2 j  1)v
x(i, j )   C (u )C (v) X (u, v) cos
cos
N u 0 v 0
2N
2N
[Figure: the flowgraph of Chen's 1-D DCT, mapping inputs x0-x7 to outputs X0, X4, X2, X6, X1, X5, X3, X7 through butterfly stages weighted by the cosine coefficients C1-C7 (with -C4 and -1 factors).]
$$\begin{bmatrix} X_0 \\ X_2 \\ X_4 \\ X_6 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} C_4 & C_4 & C_4 & C_4 \\ C_2 & C_6 & -C_6 & -C_2 \\ C_4 & -C_4 & -C_4 & C_4 \\ C_6 & -C_2 & C_2 & -C_6 \end{bmatrix} \begin{bmatrix} x_0+x_7 \\ x_1+x_6 \\ x_2+x_5 \\ x_3+x_4 \end{bmatrix}$$

$$\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} C_1 & C_3 & C_5 & C_7 \\ C_3 & -C_7 & -C_1 & -C_5 \\ C_5 & -C_1 & C_7 & C_3 \\ C_7 & -C_5 & C_3 & -C_1 \end{bmatrix} \begin{bmatrix} x_0-x_7 \\ x_1-x_6 \\ x_2-x_5 \\ x_3-x_4 \end{bmatrix}$$

where $C_k = \cos\dfrac{k\pi}{16}$.
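As an illustration of the matrix form above, here is a small Python sketch of Chen's 8-point 1-D DCT; the coefficient signs follow the standard Chen factorization, and the input vector is only an example.

```python
# Chen's 8-point 1-D DCT: 4x4 matrix products on the sum and difference halves.
import math

C = [math.cos(k * math.pi / 16) for k in range(8)]

EVEN = [[C[4],  C[4],  C[4],  C[4]],
        [C[2],  C[6], -C[6], -C[2]],
        [C[4], -C[4], -C[4],  C[4]],
        [C[6], -C[2],  C[2], -C[6]]]

ODD  = [[C[1],  C[3],  C[5],  C[7]],
        [C[3], -C[7], -C[1], -C[5]],
        [C[5], -C[1],  C[7],  C[3]],
        [C[7], -C[5],  C[3], -C[1]]]

def chen_dct_1d(x):
    s = [x[i] + x[7 - i] for i in range(4)]     # x0+x7, x1+x6, x2+x5, x3+x4
    d = [x[i] - x[7 - i] for i in range(4)]     # x0-x7, x1-x6, x2-x5, x3-x4
    even = [0.5 * sum(EVEN[r][c] * s[c] for c in range(4)) for r in range(4)]
    odd  = [0.5 * sum(ODD[r][c]  * d[c] for c in range(4)) for r in range(4)]
    X = [0.0] * 8
    X[0], X[2], X[4], X[6] = even
    X[1], X[3], X[5], X[7] = odd
    return X

print([round(v, 3) for v in chen_dct_1d([1, 2, 3, 4, 4, 3, 2, 1])])
```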
Lee's & Feig’s Algs[7,8]
• Lee's algorithm [7] uses matrix decomposition and a systolic-array approach. It reduces complexity, but a hardware implementation is rather complicated and irregular. Like Chen's algorithm, it is a 1-D-based DCT method.
• Feig's algorithm [8] uses a matrix-decomposition representation. It may be easy to implement in software, but it is problematic for hardware design (handling the matrices requires memory, and the structure is not regular); we have not found any reported hardware implementation based on Feig's algorithm.
References
[1] T. Kuroda, T. Fujita, et al., "A 0.9-V, 150-MHz, 10-mW, 4-mm², 2-D discrete cosine transform core processor with variable-threshold-voltage (VT) scheme," IEEE J. Solid-State Circuits, vol. 31, pp. 1770-1777, Nov. 1996.
[2] Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, "A Cost-Effective Architecture for 8x8 2-D DCT/IDCT Using Direct Method," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, pp. 459-467, June 1997.
[3] Liang-Gee Chen, Juing-Ying Jiu, Hao-Chieh Chang, Yung-Pin Lee, and Chung-Wei Ku, "Low Power 2D DCT Chip Design for Wireless Multimedia Terminals," IEEE Trans. Solid-State Circuits, 1998.
[4] Kyeounsoo Kim and Jong-Seog Koh, "An Area Efficient DCT Architecture for MPEG-2 Video Encoder," IEEE Transactions on Consumer Electronics, vol. 45, no. 1, February 1999.
[5] Thucydides Xanthopoulos and Anantha P. Chandrakasan, "A Low-Power IDCT Macrocell for MPEG-2 MP@ML Exploiting Data Distribution Properties for Minimal Activity," IEEE Journal of Solid-State Circuits, vol. 34, no. 5, May 1999.
[6] Chen, Smith, and Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform," IEEE Trans. on Communications, vol. COM-25, pp. 1004-1009, 1977.
[7] Lee, "A New Algorithm to Compute the Discrete Cosine Transform," IEEE Trans. on Acoust., Speech and Signal Processing, vol. ASSP-32, no. 6, pp. 1243-1245, 1984.
[8] Ephraim Feig and Shmuel Winograd, "Fast Algorithms for the Discrete Cosine Transform," IEEE Trans. on Signal Processing, vol. 40, no. 9, Sep. 1992.
Strength Reduction: DIGLOG multiplier
$C_{mult}(n) = 253\,n^2$, $C_{add}(n) = 214\,n$, where $n$ = word length in bits.

$A = 2^j + A_R$, $B = 2^k + B_R$
$A \cdot B = (2^j + A_R)(2^k + B_R) = 2^{j+k} + 2^j B_R + 2^k A_R + A_R B_R$

                      1st iter   2nd iter   3rd iter
Worst-case error        -25%       -6%       -1.6%
Prob. of error < 1%      10%        70%       99.8%

With an 8-by-8 multiplier, the exact result can be obtained in at most seven iteration steps (worst case).
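A behavioral Python sketch of the iterative decomposition is shown below; the operand pair and the printed errors are illustrative, not the worst case from the table.

```python
# DIGLOG-style iterative multiplier: write A = 2^j + A_R, B = 2^k + B_R, approximate
# A*B by 2^(j+k) + 2^j*B_R + 2^k*A_R, and feed the dropped A_R*B_R term into the next
# iteration; enough iterations give the exact product.
def leading_split(x):
    """Return (2^j, remainder) where 2^j is the leading-one weight of x > 0."""
    p = 1 << (x.bit_length() - 1)
    return p, x - p

def diglog_mult(a, b, iters):
    """Approximate a*b with the given number of correction iterations."""
    total = 0
    while a and b and iters > 0:
        pa, ra = leading_split(a)
        pb, rb = leading_split(b)
        total += pa * pb + pa * rb + pb * ra   # first-order approximation of a*b
        a, b = ra, rb                          # the dropped ra*rb term is refined next
        iters -= 1
    return total

if __name__ == "__main__":
    exact = 200 * 155
    for n in (1, 2, 3):
        approx = diglog_mult(200, 155, n)
        print(n, approx, exact, f"{(approx - exact) / exact:+.1%}")
```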
Logarithmic Number System
$L_x = \log_2 |x|$
$L_{A \cdot B} = L_A + L_B$,  $L_{A/B} = L_A - L_B$
$L_{A^2} = L_A \ll 1$,  $L_{\sqrt{A}} = L_A \gg 1$
--> Significant Strength Reduction
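A minimal Python sketch of the strength reduction (floating-point logs stand in for a real LNS representation, and sign handling is omitted):

```python
# Logarithmic number system sketch: multiply/divide/square of magnitudes become
# additions, subtractions and shifts of the log-domain values.
import math

def to_lns(x):
    return math.log2(abs(x))          # Lx = log2|x|

def from_lns(lx):
    return 2.0 ** lx

a, b = 12.5, 3.2
la, lb = to_lns(a), to_lns(b)
print(from_lns(la + lb), a * b)       # multiply -> add
print(from_lns(la - lb), a / b)       # divide   -> subtract
print(from_lns(la * 2), a * a)        # square   -> shift left by 1 in hardware
```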
Switching Activity Reduction
[Figure: (a) average activity in a multiplier as a function of the constant value; (b) parallel and serial implementations of an adder tree.]
System-Level Solutions
• System management, system partitioning, algorithm selection
• Precompute physical capacitance of interconnect and switching activity (number of bus accesses)
• Regularity: to minimize the power in the control hardware and the interconnection network
• Modularity: to exploit data locality through distributed processing units, memories, and control
  – Spatial locality: an algorithm can be partitioned into natural clusters based on connectivity
  – Temporal locality: average lifetimes of variables (less temporary storage; future accesses are likely to reference data used in the recent past)
• Few memory references: references to memories are expensive in terms of power
System-Level Solutions - cont.
• Simulator: Instruction-level Energy
Estimation
• Software: Energy Efficient Algorithms
• OS: Voltage Scheduling Algorithms
• OS: Multiprocessing for Energy
• Microprocessor: Dynamic Caches
Processor Systems: High Power
• Thinkpad (Pentium) 0.3 Hours/AA
• InfoPad (ARM) 0.8 Hours/AA
• Toshiba Portable (486) 0.9
Hours/AA
• Newton (ARM) 2.0 Hours/AA
Operations per battery life: minimize the energy consumed per operation.
Operations per second: maximize the throughput (operations per second).
DPM vs SPM
Identify power-hungry modules and look for opportunities to reduce power.
• DPM (Dynamic Power Management): stops the clock switching of a specific unit generated by the clock generators.
• SPM (Static Power Management): when the system remains idle for a significant period of time, it is shut down.
Vdd vs Delay
• Use variable voltage scaling or scheduling for real-time processing.
• Use architecture optimization to compensate for the slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.
• Scale down device sizes to compensate for delay (interconnects do not scale proportionately and can become dominant).
Power PC 603 Strategy
• Baseline: use the right supply and the right frequency for each part of the system. If one has to wait on the occurrence of some input, only a small circuit should wait and wake up the main circuit when the input occurs.
• The PowerPC 603 is a 2-issue processor (2 instructions read at a time) with 5 parallel execution units. It has 4 modes:
  – Full-on mode, for full speed
  – Doze mode, in which the execution units are not running
  – Nap mode, which also stops the bus clocking
  – Sleep mode, which stops the clock generator, with or without the PLL (20-100 mW)
Power PC 603 Power Management
TI Structures
• Two DSPs, the TMS320C541 and TMS320C542, reduce power, chip count, and system cost for wireless communication applications.
• C54x DSPs (2.7 V and 5 V), the Low-Power Enhanced Architecture DSP (LEAD) family: with three different power-down modes, these devices are well suited for wireless communications products such as digital cellular phones, personal digital assistants, and wireless modems, with low power for voice coding and decoding.
• The TMS320LC548 features:
  – 15-ns (66 MIPS) or 20-ns (50 MIPS) instruction cycle times
  – 3.0- and 3.3-V operation
  – 32K 16-bit words of RAM and 2K 16-bit words of boot ROM on-chip
  – An integrated Viterbi accelerator that reduces the Viterbi butterfly update to four instruction cycles for GSM channel decoding
  – Powerful single-cycle instructions (dual-operand, parallel instructions, conditional instructions)
InfoPad Architecture,
UC-Berkeley
[Figure: InfoPad system. The hand-held InfoPad (example: a speech-enabled web browser) talks over a wireless basestation to the Internet, where a "PadServer", speech recognizer, and web browser run. Audio and raw bitmaps are transmitted across the wireless link; state is maintained in the network, not on the Pad.]
• Perform all computation in the network to minimize client energy dissipation.
InfoPad Hardware Flexibility
[Figure: InfoPad hardware. The radio receives a packet; only the packet header goes to a 10-MIPS microprocessor (control, statistics, reliability, debugging), while the entire packet is routed to dedicated hardware that updates the frame buffer. The main data flow is handled by custom low-power ASICs; embedded software is responsible for high-level functions.]
• Use hardware/software integration to provide energy-efficient high-level functionality.
Multimedia I/O Terminal.
InfoPad Evolution
Total power: ~7 W. Where did the power go?
[Figure: InfoPad evolution; labels include inefficient implementation, commercial DC/DC, energy-efficient processors, Intercom (no local computation?), and commercial radios.]
• High-level system design optimizes the complete solution and drives new research.
Power-Down Techniques
Low Power Memory
Power Reduction in InfoPad
Approach               Power reduction   Comments
Voltage scaling        x21               1.1 V vs 5 V
Optimized cell lib.    x3-4              TR sizing, reduced swing and self-timed FIFO, ...
Gated clocks           x2-3              error checking for address only
Block decoding         x8                enabling only one block in the SRAM
Algorithm selection    x5-10             VQ vs DCT
Bit swing reduction    x3.7              1.1 V vs 300 mV in memory
Low Power Video Processor
Uzi Zangi, Technion - VLSI Systems Research Center, ISPED,1998
Asynchronous logic to save power
  – Didn't work because: the design is slow (13.5 MHz) and the circuit is small (<100K gates), so the clock load is small; adding asynchronous control costs more than clocking.
Gated clock
  – Didn't work because:
    • the frequency is very low (13.5 MHz),
    • register activity is very high (90%).
Power Management
by Gated Clock
• Power management scheme by enabling the clock
• Power management scheme by adding a clock-generation block
[Figure: two gated-clock schemes. Per-block enable signals (enable 1-3) gate the clock (clk) to each block, either directly or through a clock-management block.]
Minimizing bus switching
• Transfer the value or its negated form on the bus, according to which causes the minimum number of toggled bits.
• Add one bit that indicates the polarity of the bus.
• Good for buses with:
  – a large number of bits (more than 10),
  – high capacitance (more than 2 pF),
  – high toggle activity (more than 1/2).
• Overheads:
  – routing of one more bit,
  – extra logic for the decision (timing, area).
A small sketch of this polarity encoding follows.
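Below is a minimal Python sketch of the scheme, interpreting the "negated" value as a bitwise complement (bus-invert style); the 16-bit bus width is an assumption.

```python
# Polarity encoding sketch: send either the word or its bitwise complement, whichever
# toggles fewer bus lines, plus one extra polarity bit.
BUS_WIDTH = 16
MASK = (1 << BUS_WIDTH) - 1

def toggles(a, b):
    return bin(a ^ b).count("1")

def encode(prev_bus, value):
    """Choose the encoding with fewer transitions relative to what is on the bus now."""
    inverted = value ^ MASK
    if toggles(prev_bus, inverted) < toggles(prev_bus, value):
        return inverted, 1                    # polarity bit = 1: receiver must re-invert
    return value, 0

def decode(bus_word, polarity_bit):
    return bus_word ^ MASK if polarity_bit else bus_word

prev = 0x0000
for v in [0xFFF0, 0x0003, 0xFFFF]:
    word, pol = encode(prev, v)
    assert decode(word, pol) == v
    print(f"send {word:04X} pol={pol} toggles={toggles(prev, word)}")
    prev = word
```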
Minimizing bus switching (Cont.)
Didn't work because:
  – the largest bus is 8 bits,
  – capacitance is less than 1 pF,
  – toggle activity is not very high.
[Figure: polarity-encoded bus. Block A drives the n-bit bus (capacitance Ct) through a decision unit, with an extra E line carrying the polarity to Block B (n-bit slice, capacitance Cx).]
Method That Works: Pixel Differentials
• Pixel values have area locality; they are spatially correlated.
• This is exploited most heavily in compression (to save on storage and transmission).
• Most of the functions are linear, so they are able to work on differences.
• The entire algorithm was rewritten (interpolations, filters, matrices, etc.).
• The new algorithm differs from the original by no more than 1 LSB per pixel.
A sketch of the idea is shown below.
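A toy Python sketch of the idea, with an assumed 3/4-gain linear function standing in for the real interpolations and filters:

```python
# Exploit spatial correlation by operating on pixel differences. Small differences keep
# the high-order datapath bits quiet, cutting switching activity.
def to_differentials(row):
    """First pixel kept as-is, remaining pixels as differences from their neighbor."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def reconstruct(diffs):
    out, acc = [], 0
    for d in diffs:
        acc += d
        out.append(acc)
    return out

def scale_diffs(diffs, num=3, den=4):
    """An example linear function (3/4 gain) applied directly to the differentials."""
    return [d * num // den for d in diffs]

row = [120, 122, 121, 125, 126, 126, 124, 123]     # smooth image data: tiny differences
d = to_differentials(row)
print(d)                                           # [120, 2, -1, 4, 1, 0, -2, -1]
print(reconstruct(scale_diffs(d)))                 # roughly the 3/4-scaled row
print([p * 3 // 4 for p in row])                   # compare with scaling the pixels directly
```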
Methodology
[Figure: design methodology. The algorithm is modeled in a C++ simulator and in RTL Verilog, and the output images of the two are compared. Synthesis with Synopsys against a 0.35-µm Compass library produces a netlist, place-and-route is done in Cadence Opus, and currents/power are measured with EPIC PowerMill on the SPICE netlist.]
Pixel Difference
[Figure: register, logic, and total current (mA) plotted against the pixel differential ratio (0-100%).]
Pixel Differentials Algorithm Results
Number of   Differential   Differential   Register       Logic          Total          Current   Power
pixels      pixels         ratio          current (mA)   current (mA)   current (mA)   saving    saving
3600        0              0.00%          3.8            6.8            10.6           -         -
3646        424            11.63%         1.5            3.6            5.1            52%       77%
3646        616            16.90%         1.4            3.22           4.62           56%       81%
3190        1536           48.15%         1.02           2.2            3.22           70%       91%
3494        2730           78.13%         0.82           1.21           2.03           81%       96%
3190        3116           97.68%         0.8            1.16           1.96           82%       97%
Summary
• Attempted to save power on a battery-operated chip with application-specific algorithmic/architectural techniques: asynchronous logic, gated clocks, minimizing bus switching.
• All of these attempts failed. The methods may still apply to very large, very fast chips and to variable-load applications.
• Successfully applied an algorithmic change inspired by image compression. It may not work on non-compressible data, but it works exceptionally well on images.
• Easily saved 80% of the power; potentially more than 90% can be saved.
A SINGLE-CHIP DIGITAL CAMERA
H. Teresa H. Meng, “Low-Power Wireless Video System” , IEEE Communication Magazine, June, 1998
◈ Given the recent developments in CMOS RF transceiver design, wireless transmission at a bandwidth in excess of 10 Mb/s will soon become possible using next-generation CMOS technology.
◈ The design of a low-power, large-scale parallel MPEG-2 encoder architecture to be used in a single-chip digital CMOS video camera.
◈ The single-chip digital camera architecture includes a 640 x 480 array of CMOS photodiodes, embedded DRAM for storing four frames of color data, and a parallel array processor for video signal processing.
◈ The parallel processor architecture is designed to implement highly computationally intensive image and video processing tasks such as color conversion, the discrete cosine transform (DCT), and motion estimation for MPEG-2.
A SINGLE-CHIP DIGITAL CAMERA
[Figure: single-chip camera structure. Side view: CMOS photo sensors on the silicon surface, with embedded DRAM (pixel memory) and parallel video processors underneath. Top view: the 640 x 480 pixel array is divided among 40 column processors, each handling 16 columns x 480 pixels.]
A SINGLE-CHIP DIGITAL CAMERA
Module/operation            Word size   Energy/op (pJ)   Normalized to adder
Carry-select adder          16 bits     18               1
Multiplier                  16 bits     64               3.6
Latch                       16 bits     4                0.22
8 x 128 x 16 SRAM (read)    16 bits     80               4.4
8 x 128 x 16 SRAM (write)   16 bits     160              9
External I/O access         16 bits     180              10

Energy per operation at a 1.5-V supply in 0.8-µm CMOS technology.
A SINGLE-CHIP DIGITAL CAMERA
◈ Design Considerations
  – The proposed architecture considers three algorithms commonly used in video coding standards: red-green-blue (RGB) to YUV conversion, the discrete cosine transform (DCT), and motion estimation.
  – To reduce power consumption, as many parallel processors as practically feasible should be used to reduce the clock frequency, because a reduced clock frequency allows a lower supply voltage.
  – External buffers are removed and replaced by on-chip memory.
  – For MPEG-2 encoding, the computational demand of motion estimation (1.6 BOPS at 30 frames/s, based on the algorithm proposed by Chalidabhongse and Kuo) limits the number of columns in each processor domain to 16; otherwise the required clock speed for each processor would be too high for a low-power design (most of the processor operations would be used for interprocessor communication).
A SINGLE-CHIP DIGITAL CAMERA
◈ Performance
  – To sustain this computational demand, each processor is required to run at a clock frequency equal to or higher than 40 MHz.
  – When implemented in a 0.2-µm CMOS technology, a 1-V supply voltage should be more than enough to support 40-MHz operation.
  – Three goals: realize the image/video processing algorithms, minimize DMA accesses to the pixel DRAM, and maximize computational throughput while keeping power consumption at a minimal level.
  – Under these conditions, this parallel processor architecture delivers a processing throughput of 1.6 BOPS with a power consumption of 40 mW.
  – System performance need not be sacrificed for low power consumption if the design of algorithms and hardware can be considered concurrently --> hardware-driven architecture design.
Vector Quantization
• Lossy compression technique which exploits the
correlation that exists between neighboring samples
and quantizes samples together
Complexity of VQ Encoding
The distortion metric between an input vector X and a codebook vector Ci is computed as the squared Euclidean distance (consistent with the operation counts on the next slide, for 16-element vectors):

$D_i = \sum_{j=1}^{16} (X_j - C_{ij})^2$

Three VQ encoding algorithms will be evaluated: full search, tree search, and differential codebook tree search.
Full Search
• Brute-force VQ: the distortion between the input
vector and every entry in the code-book is computed,
and the code index that corresponds to the minimum
distortion is determined and sent over to the decoder.
• For each distortion computation, there are 16 8-bit
memory accesses (to fetch the entries in the
codeword), 16 subtractions, 16 multiplications, 15
additions. In addition, the minimum of 256 distortion
values, which involves 255 comparison operations,
must be determined.
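A minimal full-search sketch in Python (the random codebook and input block are placeholders):

```python
# Full-search VQ encoder: 256-entry codebook of 16-dimensional (4x4) vectors,
# squared-error distortion, brute-force minimum search.
import random

DIM, CODEBOOK_SIZE = 16, 256
random.seed(0)
codebook = [[random.randint(0, 255) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]

def distortion(x, c):
    """16 subtractions, 16 multiplications, 15 additions per codeword."""
    return sum((xj - cj) * (xj - cj) for xj, cj in zip(x, c))

def full_search(x):
    """255 comparisons to pick the minimum-distortion index sent to the decoder."""
    best_i, best_d = 0, distortion(x, codebook[0])
    for i in range(1, CODEBOOK_SIZE):
        d = distortion(x, codebook[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i

block = [random.randint(0, 255) for _ in range(DIM)]
print(full_search(block))
```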
Tree-structured Vector
Quantization
• Here only 2 x log2(256) = 16 distortion calculations and 8 comparisons are needed.
• If, for example, at level 1 the input vector is closer to the left entry, then the right portion of the tree is never examined below level 2 and an index bit 0 is transmitted.
A search sketch follows.
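A sketch of the binary tree search in Python; the tree contents are random placeholders, and only the search structure (2 distortion calculations and 1 comparison per level, over 8 levels) is the point.

```python
# Tree-structured VQ search: one index bit is emitted per level; the unchosen subtree
# is never visited.
import random

DIM, LEVELS = 16, 8            # 2^8 = 256 leaf codewords
random.seed(1)

def distortion(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

# tree[level][path] = (left_codeword, right_codeword); contents are purely illustrative
tree = [{p: ([random.randint(0, 255)] * DIM, [random.randint(0, 255)] * DIM)
         for p in range(1 << lvl)} for lvl in range(LEVELS)]

def tsvq_encode(x):
    path, bits = 0, []
    for lvl in range(LEVELS):
        left, right = tree[lvl][path]
        bit = 0 if distortion(x, left) <= distortion(x, right) else 1
        bits.append(bit)               # 2 distortion calcs + 1 comparison per level
        path = (path << 1) | bit       # descend into the chosen subtree only
    return bits

print(tsvq_encode([random.randint(0, 255) for _ in range(DIM)]))
```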
Pyramid Vector Quantization
• Groups data into length-L vectors, scales them onto an L-dimensional pyramid surface, and finds the nearest lattice point on the pyramid.
• Both the scaling factor and an index are transmitted.
• Unlike standard VQ schemes, which require codebook storage, PVQ relies on intensive arithmetic computation.
• Integrates all functionality on a single die, requiring no external hardware support or memory.
Differential Codebook Tree-structure
Vector Quantization
• The distortion difference between the left and right nodes needs to be computed. This equation can be manipulated to reduce the number of operations.
Comparisons
• The number of memory access operations can be reduced; that is, by changing the contents of the codebook through computational transformations, the number of switching events (multiplications, additions/subtractions, and memory accesses) can be reduced.
Multiplication with Constants
• Techniques and tools have been developed to scale coefficients so as to minimize the number of 1's in the coefficients, and hence the number of shift-add operations (see the sketch below).
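As one concrete instance of this idea, the sketch below recodes a constant into canonic signed digits (CSD) and realizes the multiplication with shifts, adds, and subtracts; the constant 231 is only an example, and CSD recoding is a stand-in for whatever scaling the actual tools apply.

```python
# Constant multiplication as shift-and-add: fewer nonzero digits in the coefficient
# means fewer adders and less switching.
def csd_digits(c):
    """Canonic signed-digit recoding of a positive constant (digits in {-1, 0, +1})."""
    digits, pos = [], 0
    while c:
        if c & 1:
            d = 2 - (c & 3)          # +1 if ...01, -1 if ...11 (carry propagates)
            c -= d
        else:
            d = 0
        digits.append((d, pos))
        c >>= 1
        pos += 1
    return [(d, p) for d, p in digits if d]

def const_mult(x, c):
    """x * c realized purely with shifts, adds and subtracts."""
    return sum(d * (x << p) for d, p in csd_digits(c))

C = 231                               # 231 = 256 - 32 + 8 - 1: 4 CSD terms vs 6 ones
print(csd_digits(C))
print(const_mult(37, C), 37 * C)
```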
Gated clocks to shut down modules when not
used.
Lower Power Data
Encoding
• S.S.Chun and J.D.Cho,
Journal of Korean Information Science, Vol. 26, No 6, 1999
• A method that reduces the number of switching transitions by restructuring the Huffman code while preserving the compression ratio produced by the Huffman coding algorithm.
• Sub-streams that share many common subsequences are assigned a low-switching encoding such as a Gray code.
• Among RISC instruction addressing schemes, Gray-code addressing shows up to a 50% power reduction compared with binary-code addressing.
Gray Code
• For two n-dimensional (n-bit) vectors U = u_1, u_2, ..., u_n and V = v_1, v_2, ..., v_n, define the Hamming distance as $h(U,V) = \sum_{i=1}^{n} (u_i \oplus v_i)$, where $(u_i \oplus v_i)$ is 1 if the bit values of u and v differ and 0 otherwise. This can also be expressed as the distance traveled along the edges of the n-dimensional hypercube G. Gray code = shortest path in G (see the sketch below).
• Because Huffman codewords may have different lengths and the code must remain prefix-free, an exact conversion to a Gray code is impossible; a compression-preserving encoding that minimizes the number of bit transitions is needed instead.
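A small Python sketch comparing binary and Gray-code address sequences (the 16-address run is arbitrary):

```python
# Binary vs. Gray-code addressing: consecutive Gray-coded addresses differ in exactly
# one bit, so sequential accesses toggle far fewer bus lines.
def hamming(u, v):
    return bin(u ^ v).count("1")       # h(U,V) = number of differing bit positions

def gray(n):
    return n ^ (n >> 1)                # binary-reflected Gray code

N = 16
binary_toggles = sum(hamming(i, i + 1) for i in range(N - 1))
gray_toggles = sum(hamming(gray(i), gray(i + 1)) for i in range(N - 1))
print(binary_toggles, gray_toggles)    # 26 vs 15 transitions for 16 sequential addresses
```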
2-D Traveling Salesman Problem
• The proposed problem assigns code pairs with small Hamming distance to character pairs that frequently occur adjacent to each other, so it becomes a new formulation in which two or more TSPs are handled simultaneously.
• Using heuristic: 10% reduction in switching activity for
random un-correlated data
Lower Power CDMA Searcher
1999. 8
S. Kim and J.D.Cho
SungKyunKwan University
http://vada.skku.ac.kr
Searcher (Using a Common Double Dwell Method)
◈ The initial acquisition process that synchronizes the PN code exactly between the transmitter and the receiver of a CDMA system.
[Figure: double-dwell searcher. The received I/Q samples are despread with the local PN codes (PN_I = a_I, PN_Q = a_Q), coherently accumulated over Nc chips with gain G, squared and summed, and compared against the first threshold θ1; if θ1 is exceeded, Nn non-coherent accumulations follow and the result is compared against the second threshold θ2. Exceeding θ2 ends the search ("Search Done"); otherwise the PN code is slewed ("Search_Slew").]

$O_I = (RX_I \times a_I) + (RX_Q \times a_Q)$,  $O_Q = (RX_I \times a_Q) + (RX_Q \times (-a_I))$
$Y_I = \sum^{N_C} G \cdot O_I$,  $Y_Q = \sum^{N_C} G \cdot O_Q$,  $Z = Y_I^{\,2} + Y_Q^{\,2}$
Operation Flow
1. The pilot channel transmitted by the base station is despread with the PN sequence generated in the handset.
2. The despread result is accumulated over the coherent accumulation length Nc and then passed through an energy computation (squaring).
3. The energy value is compared with the first threshold (θ1); if the threshold is exceeded, non-coherent accumulation (Nn) is performed in the following stage.
4. Otherwise the PN sequence is advanced by one chip and the preceding steps are repeated on the incoming signal.
5. The non-coherently accumulated result is compared with the second threshold (θ2).
6. If θ2 is exceeded the search ends; otherwise the PN sequence is advanced by one chip and the process repeats (see the sketch below).
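A behavioral Python sketch of this double-dwell flow is given below; the thresholds, accumulation lengths, and signal model are illustrative assumptions, not the parameters of the actual searcher.

```python
# Double-dwell PN search sketch: first dwell = coherent accumulation + energy vs. theta1,
# second dwell = Nn non-coherent accumulations vs. theta2, slew one chip on failure.
import random

NC, NN = 64, 4                      # coherent / non-coherent accumulation lengths (assumed)
THETA1, THETA2 = 500.0, 3000.0      # first / second dwell thresholds (assumed)
random.seed(0)

def despread_energy(offset, true_offset):
    """Coherent accumulation over Nc chips followed by squaring (toy signal model)."""
    signal = 4.0 if offset == true_offset else 0.0
    yi = sum(signal + random.gauss(0, 1) for _ in range(NC))
    yq = sum(random.gauss(0, 1) for _ in range(NC))
    return yi * yi + yq * yq        # Z = YI^2 + YQ^2

def double_dwell_search(true_offset, pn_length=256):
    offset = 0
    while True:
        if despread_energy(offset, true_offset) > THETA1:          # first dwell
            z = sum(despread_energy(offset, true_offset) for _ in range(NN))
            if z > THETA2:                                          # second dwell
                return offset                                       # search done
        offset = (offset + 1) % pn_length                           # slew by one chip

print(double_dwell_search(true_offset=37))
```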
Data Flow Graph of Searcher Operation
[Figure: data-flow graph of the searcher operation. Four XOR (despreading) operations combine RXI/RXQ with TXI/TXQ and feed the coherent accumulation stage (4 additions); the energy computation stage squares YI and YQ (2 multiplications); the maximum is selected and compared with θ1; the non-coherent accumulation stage adds the results and compares with θ2.]
Rescheduled Data Flow Graph
[Figure: rescheduled data-flow graph. The coherent accumulation stage uses a carry-save adder (or a 3-input ALU); the energy computation stage reduces the number of multiplications by reordering the data flow (the absolute values |YI| and |YQ| are compared and the maximum selected before squaring); the threshold comparisons against θ1 and θ2 apply pre-computation.]
Pre-computation
 Power saving
– Reduces power dissipation of combinational logic
– Reduces internal power to precomputed registers
 Cost
– Increase area
– Impact circuit timing
– Increase design complexity
• number of bits to precompute
– Testability
• may generate redundant logic
Pre-computation
◈ A comparator example: Srinivas Devadas, 1994
◈ Precomputation for external idleness: M. Alidina, 1994
Low Power Comparator
• The MSBs of YI and YQ are the sign bits of the absolute values, so both are '0'.
• The upper 2 bits below the MSB are used for pre-computation.
• Based on the pre-computation result, the larger of |YI| and |YQ| is selected.
• When comparing against the threshold θ1, a multiplexer is used instead of a comparator.
A small sketch of this selection follows.
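A minimal Python sketch of the pre-computation-based selection (the 16-bit width and the tie-handling policy are assumptions):

```python
# Pre-computation sketch for selecting max(|YI|, |YQ|): compare only the two bits below
# the (zero) sign MSB first; the full comparator fires only when those bits tie.
import random

WIDTH = 16                                  # assumed word width

def top2(v):
    return (v >> (WIDTH - 3)) & 0b11        # two bits below the MSB

def select_max(yi, yq, stats):
    a, b = top2(yi), top2(yq)
    if a != b:                              # decided by the cheap pre-computed bits
        stats["precomputed"] += 1
        return yi if a > b else yq
    stats["full_compare"] += 1              # full comparison only on ties
    return yi if yi >= yq else yq

random.seed(2)
stats = {"precomputed": 0, "full_compare": 0}
for _ in range(1000):
    yi, yq = random.randrange(1 << (WIDTH - 1)), random.randrange(1 << (WIDTH - 1))
    assert select_max(yi, yq, stats) == max(yi, yq)
print(stats)                                 # most selections avoid the full comparison
```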
Three Input ALU ( Ovadia Bat-Sheva, 1998 )
[Figure: two accumulation structures. Left (two-ALU structure): the multiplier outputs P0 and P1 are accumulated by an ALU and an ALU/ASU into acc0 and acc1. Right (three-input-ALU structure): a single 3IALU accumulates both products into one accumulator.]
• The three-input ALU consumes much less power than an ALU plus an ASU.
• A drawback of using a 3IALU is the added complexity in calculating the carry and overflow.
Experimental Results and Conclusions
• An RTL-level low-power implementation of the searcher engine of an MSM (Mobile Station Modem) chip for handsets in an IS-95-based DS/CDMA system.
  – Operating frequency: 12.5 MHz
• Applying rescheduling, pre-computation, and strength reduction on the data-flow graph reduced area and power by up to 67.68% and 41.35%, respectively.
Lower Power Viterbi Decoder
1999. 8
J.H. Ryu and J.D.Cho
SungKyunKwan University
http://vada.skku.ac.kr
Viterbi Decoder
◈ Convolutional Encoder
 K = 3 (Constraint Length)
 R = 1/2 (Rate)
[Figure: a (3, 1/2) convolutional encoder. The information sequence U (bits u_j) is shifted through two delay elements; the codeword V is formed by a_j = u_j + u_{j-1} + u_{j-2} and b_j = u_j + u_{j-2} (modulo-2 additions).]
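A Python sketch of this encoder; feeding it the information sequence from the trellis example reproduces the codeword listed there.

```python
# K=3, rate-1/2 convolutional encoder: a_j = u_j ^ u_(j-1) ^ u_(j-2), b_j = u_j ^ u_(j-2)
# (modulo-2 additions realized as XOR).
def conv_encode(bits):
    u1 = u2 = 0                       # shift-register state (u_(j-1), u_(j-2))
    out = []
    for u in bits:
        a = u ^ u1 ^ u2               # a_j = u_j + u_(j-1) + u_(j-2)
        b = u ^ u2                    # b_j = u_j + u_(j-2)
        out.append((a, b))
        u1, u2 = u, u1
    return out

# Information sequence U = (0,0,1,0,1,0,...) from the trellis example:
print(conv_encode([0, 0, 1, 0, 1, 0]))   # -> [(0,0), (0,0), (1,1), (1,0), (0,0), (1,0)]
```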
Viterbi Decoder
[Figure (Fig. 2): trellis diagram for the (2, 1/2) convolutional code over time steps 0-6, with states 00, 01, 10, 11.]
 Information sequence : U = (0,0,1,0,1,0,...)
 Output codeword : V = (00,00,11,10,00,10,...)
Viterbi Decoder
◈ Viterbi Decoder
[Figure: Viterbi decoder structure. The received signal enters the branch metric unit (BMU); the branch metrics (BM) feed the add-compare-select unit (ACSU), which exchanges path metrics with the path metric memory (PMM); the survivor-path decisions (SP) go to the survivor memory unit (SMU), which outputs the decoded data.]
Viterbi Decoder
 Branch Metric Unit (BMU): the branch metrics measure the difference between the received symbol and the symbol that causes the transitions between states in the trellis.
 Add-Compare-Select Unit (ACSU): to find the survivor path entering each state, the branch metric of a given transition is added to its corresponding partial path metric (PM) stored in the path metric memory (PMM). This new partial path metric is compared with all the other new partial path metrics corresponding to all the other transitions entering that state. The transition that has the minimum partial path metric is chosen to be the survivor path of the state. The path metric of the survivor path of each state is updated and stored back into the PMM.
 Survivor Memory Unit (SMU): the survivor paths of the states are stored in the SMU.
Viterbi Decoder
⑴ Low power ACSU VLSI architecture
▶ Conventional ACSU VLSI architecture
[Figure: butterfly structure. States sa and sb both feed the new states S0 and S1.]
Viterbi Decoder
[Figure: architecture of the conventional ACSU (add-compare). For each new state S0 and S1, two adders form PM_{i-1}(sa) + BM_i(sa, S) and PM_{i-1}(sb) + BM_i(sb, S), and a comparator selects the survivor metric M_i(S).]
Viterbi Decoder [SKKU Solution]
― Algorithm: perform the comparison on differences instead of on the two sums, i.e.

$PM_{i-1}(s_a) + BM_i(s_a,S_0) > PM_{i-1}(s_b) + BM_i(s_b,S_0) \;\Longleftrightarrow\; PM_{i-1}(s_a) - PM_{i-1}(s_b) > BM_i(s_b,S_0) - BM_i(s_a,S_0)$

☞ The area and power of the low-power ACSU design are reduced by 20% and 30%, respectively, compared with the conventional ACSU design.
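To see that the rearranged comparison selects the same survivor, here is a small Python equivalence check (metric widths and values are arbitrary):

```python
# One ACS step for new state S0 fed by states sa and sb, in the conventional add-compare
# form and in the rearranged difference form above.
import random

def acs_add_compare(pm_sa, pm_sb, bm_sa_s0, bm_sb_s0):
    cand_a = pm_sa + bm_sa_s0            # add
    cand_b = pm_sb + bm_sb_s0            # add
    return (cand_a, 'sa') if cand_a <= cand_b else (cand_b, 'sb')   # compare-select

def acs_difference_form(pm_sa, pm_sb, bm_sa_s0, bm_sb_s0):
    # survivor decision from PM(sa)-PM(sb) vs BM(sb,S0)-BM(sa,S0); same result
    pick_b = (pm_sa - pm_sb) > (bm_sb_s0 - bm_sa_s0)
    return (pm_sb + bm_sb_s0, 'sb') if pick_b else (pm_sa + bm_sa_s0, 'sa')

random.seed(3)
for _ in range(1000):
    vals = [random.randrange(64) for _ in range(4)]
    assert acs_add_compare(*vals) == acs_difference_form(*vals)
print("both forms select the same survivor")
```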
Number of arithmetic operations
for the conventional ACSU
Number of arithmetic operations
for the proposed ACSU
Viterbi Decoder [SKKU. Solution]
▶ Low power ACSU VLSI architecture [C-Y Tsui, ISLPED’99]
Viterbi Decoder [SKKU. Solution]
※ Glitch minimization [Raghunathan, DAC’96]
[Figure: (a) compare-add (the proposed ACSU architecture) and (b) add-compare (the conventional ACSU architecture), each built from multiplexers selecting among A/B and C/D, an adder, and a comparator producing X and Y.]
☞ The power consumption of architecture (a) is larger than that of architecture (b) by more than 17% because of glitch power dissipation.
Viterbi Decoder [SKKU. Solution]
※ Glitches in control logic
[Figure: glitches in the control logic. Multiplexers select between A/B and C/D under the select signal S, feeding the adder and comparator; the select is gated with CLK through an AND gate under the condition F_{s=0} · F_{s=1} = A · B.]
Viterbi Decoder
⑵ Low power traceback VLSI architecture
▶ Systolic Viterbi, traceback decoder[J. Sparso’91]
[Figure: the structure of the systolic Viterbi decoder. The ACSU feeds a chain of trace-back units (Trace-Back Unit 1 through Trace-Back Unit 10).]
Viterbi Decoder
[Figure: sequence of states of the trace-back method over time steps 0-6, with the path metrics and decision vectors annotated on the trellis.]
 Received codeword: V = (00,00,11,10,00,10,...)
Viterbi Decoder
[Figure: trace-back register contents at time units 1-4. Each time unit, the ACSU produces a decision vector and the state with the smallest path metric, which are shifted into the trace-back units.]
Viterbi Decoder
[Figure: classical trace-back unit contents at time units 10-12 (survivor depth = 5K). Registers T1-T10 hold the decision vectors and the smallest-path-metric state; every register shifts each cycle, and the decoded bit ("0" or "1") is produced from the oldest entry.]
Viterbi Decoder
[Figure: classical trace-back unit contents at later time units (19, 20, and 24), continuing the shift-every-cycle operation.]
Viterbi Decoder
※ Problems of the systolic array decoder
 The systolic-array Viterbi decoder is organized to take the decision vector and the smallest path metric from the ACSU and to output the decoded bit by shifting every register on every cycle.
 This system has a large dynamic power consumption: the switching activity of the registers accounts for almost 80% of the total power, because all the data in the TBU shift on every cycle.
Viterbi Decoder [SKKU. Solution]
▶ Our low power trace-back unit
[Figure: the proposed trace-back unit at time units 1-3. The ACSU output passes through a control block before being written into the register array.]
Viterbi Decoder [SKKU. Solution]
[Figure: the proposed trace-back unit at time units 9-11 (trace-back in progress), showing the ACSU, the control block, and the register-array contents.]
Viterbi Decoder [SKKU. Solution]
[Figure: the proposed trace-back unit at time units 19-21, showing the register-array contents.]
Viterbi Decoder [SKKU. Solution]
 After the decision vector and the smallest path metric generated by the ACSU are transferred to the Control Block (CB), the CB outputs them in the right cycle using a counter and a multiplexer.
 The register array, which stores the trace-back values coming from the CB, produces the final decoded bit not by shifting the entire upper 4-bit decision vector every cycle as in the classical TBU, but by shifting only the lower 2 bits, which hold the smallest-path-metric state, to the left.
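For reference, a behavioral Python sketch of trace-back decoding is shown below; the state convention (state = two most recent input bits, decision bit = bit shifted out) is an assumption for illustration and does not model the CB/register-array implementation.

```python
# Trace-back sketch for the K=3 code: decisions[t][s] is the predecessor-select bit of
# state s at time t; following them backwards from the best end state recovers the input.
def viterbi_traceback(decisions, best_end_state):
    state, bits = best_end_state, []
    for t in range(len(decisions) - 1, -1, -1):
        bits.append(state >> 1)                            # decoded bit = newest state bit
        state = ((state & 1) << 1) | decisions[t][state]   # step to the chosen predecessor
    return bits[::-1]

# Decisions along the true path for U = (0,0,1,0,1,0); unused entries are don't-cares (0).
decisions = [
    {0: 0, 1: 0, 2: 0, 3: 0},   # time 1
    {0: 0, 1: 0, 2: 0, 3: 0},   # time 2
    {0: 0, 1: 0, 2: 0, 3: 0},   # time 3
    {0: 0, 1: 0, 2: 0, 3: 0},   # time 4
    {0: 0, 1: 0, 2: 1, 3: 0},   # time 5 (state 10 remembers predecessor bit 1)
    {0: 0, 1: 0, 2: 0, 3: 0},   # time 6
]
print(viterbi_traceback(decisions, best_end_state=0b01))   # -> [0, 0, 1, 0, 1, 0]
```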
Viterbi Decoder [SKKU. Solution]
◈ Experimental result (area reduced by 11%, power by 40%)
[Figure: area (gates) and power dissipation (µW) of the conventional trace-back unit vs. the low-power trace-back unit for constraint lengths K = 2, 3, 4.]
Viterbi Decoder [Stanford Solution]
⑶ Low Power Asynchronous Viterbi Decoder [Y. H. Lee, Stanford]
▶ Algorithm
[Figure: trace-back processing. The trace-backs started at time n and time n+1 merge at a converge point.]
Viterbi Decoder [Stanford Solution]
① Initialization: trace back a trellis depth equal to five times the constraint length and store that path.
② Loop
  A. Trace and compare: choose an arbitrary initial state and start the trace-back. While following the route, compare it at each node with the stored route.
  B. If the comparison matches, stop the trace and discard the stored route. If it does not match, repeat step A.
③ Repeat step ② for each input signal.
Viterbi Decoder [Stanford Solution]
▶ Implementation
[Figure: self-timed TBU block diagram. A ring oscillator clocks the trace-back unit, which reads the surviving-path memory and the previous path through a memory management unit (address and RD/WR control, MUX, shift register); the comparison logic self-precharges and self-requests if the path is not found, and sends an acknowledge to the ACS once the path is found.]
Viterbi Decoder
① The self-timed TBU consumes no power while it waits for a request signal.
② The ACS issues a request signal in order to flush the state-decision data.
③ The TBU reads the previous surviving-path memory and the previous-path memory and compares them.
④ If they are not equal, the TBU updates the previous-path memory, performs self-precharging and self-requesting, and repeats step ③. If they are equal, it goes to step ⑤.
⑤ The TBU sends an acknowledgement signal to the ACS and waits for the next ACS request.
References
• David Johnson, Venkatesh Akella, and Brett Stott, "Micropipelined Asynchronous Discrete Cosine Transform (DCT/IDCT) Processor," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 4, December 1998.
• T. K. Truong, Ming-Tang Shih, Irving S. Reed, and E. H. Satorius, "A VLSI Design for a Trace-Back Viterbi Decoder," IEEE Trans. Commun., vol. 40, Mar. 1992.
• G. Fettweis and H. Meyr, "High-Speed Parallel Viterbi Decoding: Algorithm and VLSI-Architecture," IEEE Communications Magazine, May 1991.
• G. Feygin and P. Glenn Gulak, "Survivor Sequence Memory Management in Viterbi Decoders," IEEE, 1991.